https://github.com/daitangio/find

Python + SQLite search engine

crawler indexer python search-engine

Last synced: 5 months ago

Host: GitHub
URL: https://github.com/daitangio/find
Owner: daitangio
License: mit
Created: 2025-12-31T19:59:04.000Z (6 months ago)
Default Branch: master
Last Pushed: 2026-01-02T11:38:20.000Z (6 months ago)
Last Synced: 2026-01-05T07:58:41.376Z (6 months ago)
Topics: crawler, indexer, python, search-engine
Language: Python
Homepage:
Size: 28.3 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# So What?

Stop searching, start finding stuff

[![Pylint](https://github.com/daitangio/find/actions/workflows/pylint.yml/badge.svg)](https://github.com/daitangio/find/actions/workflows/pylint.yml)

Find is a super-minimal search engine based on SQLite Full Text Search capabilities and Python.
It is composed of two commands:

- [A Simple web crawler](./src/find/crawl.py) which uses asyncio to maximize index ingestion speed.
- [A Flask app to enable end-users to find](./src/find/app.py) things.

## Features
- Find caches web pages (a lost feature of Google) and de-duplicates pages whose content is identical.
- Back-link ranking tuning is in progress.
- Respects robots.txt
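
The de-duplication and robots.txt features listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code: the function names `content_fingerprint` and `allowed_by_robots`, the whitespace normalisation, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from urllib.robotparser import RobotFileParser


def content_fingerprint(html: str) -> str:
    """Hash the page after collapsing whitespace (hypothetical helper)."""
    normalised = " ".join(html.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against an already downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example robots.txt content (made up for this sketch).
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

# Two URLs serving the same content end up under a single fingerprint.
pages = [
    ("https://example.com/a", "<p>Hello</p>"),
    ("https://example.com/b", "<p>Hello</p>\n"),
]
first_url_by_fingerprint = {}
for url, body in pages:
    first_url_by_fingerprint.setdefault(content_fingerprint(body), url)

blog_allowed = allowed_by_robots(ROBOTS_TXT, "find-bot", "https://example.com/blog/")
private_allowed = allowed_by_robots(ROBOTS_TXT, "find-bot", "https://example.com/private/x")
```

Storing the fingerprint next to each page would let a crawler skip indexing a body it has already seen under another URL.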

# How to start

Create a virtualenv and install the project:

```sh
python3 -m venv .venv
. .venv/bin/activate
pip install -e .
```

Run your first crawl:

    crawl --seed https://myhost.com --same-host

Run the web interface with:

    findgui

# Why

I needed a small search engine for my static website. I asked ChatGPT 5.2 to design it, then refined the code.
The initial prompt was:

    Design a small python web application to implement a search engine.
    The search must be performed on a SQLite database using
    the SQLite Full Text Search (FTS5) extension.
    Design the database model to be able to store simple html web pages.

# Design principles

Find is a compact, zero-conf, tiny solution for adding a search engine to an existing blog site.
It just works out of the box.

As a basic rule, I will try to keep it below 2000 lines of code.

The project accepts pull requests: please open one and add a comment describing the change. Ensure the change passes the pylint checks.

# How

[SQLite has a full-text search capability called FTS5](https://sqlite.org/fts5.html), which also offers English stemming out of the box.
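
As a rough illustration of FTS5 with stemming, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `pages_fts`, its columns, and the `search` helper are assumptions made for this example and are not taken from the project's schema. It also assumes the SQLite library bundled with your Python was compiled with FTS5, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical FTS5 table; the porter tokenizer adds English stemming.
conn.execute(
    "CREATE VIRTUAL TABLE pages_fts USING fts5("
    "url UNINDEXED, title, body, tokenize = 'porter unicode61')"
)
conn.executemany(
    "INSERT INTO pages_fts (url, title, body) VALUES (?, ?, ?)",
    [
        ("https://example.com/1", "Crawling basics", "Crawling a site politely."),
        ("https://example.com/2", "Robots", "The bot crawls only allowed paths."),
        ("https://example.com/3", "Cooking", "A recipe for bread."),
    ],
)


def search(db: sqlite3.Connection, query: str, limit: int = 10) -> list[str]:
    """Return matching URLs, best match first according to FTS5's rank."""
    rows = db.execute(
        "SELECT url FROM pages_fts WHERE pages_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]


# Thanks to stemming, "crawl" matches both "Crawling" and "crawls".
hits = search(conn, "crawl")
```

Passing the user's text as a bound parameter avoids SQL injection, although the text is still parsed as an FTS5 query expression, so special characters may need quoting.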

For the crawler, ChatGPT proposed asynchronous I/O with asyncio (the aiohttp and aiosqlite libraries). This is a very good way to scale the crawler: downloading web pages is heavily I/O-bound and benefits from a non-blocking library.

The initial implementation had a locking problem: we solved it with a single-writer database task.
SQLite is so fast that the writer queue is hard to tune: it is very difficult to saturate it.
To avoid data loss, I opted for a queue sized at 4x the concurrency level.
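
The single-writer idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's code: it uses the synchronous standard-library `sqlite3` module instead of aiosqlite, replaces the HTTP download with a stub called `fetch`, and invents a tiny `pages` table. Only the queue sized at 4x the concurrency level comes from the text above.

```python
import asyncio
import sqlite3

CONCURRENCY = 4
QUEUE_SIZE = 4 * CONCURRENCY  # queue sized at 4x the concurrency level


async def fetch(url: str) -> str:
    """Stand-in for a real HTTP download (hypothetical stub)."""
    await asyncio.sleep(0)
    return f"<html>content of {url}</html>"


async def worker(urls: asyncio.Queue, write_queue: asyncio.Queue) -> None:
    """Download pages and hand them to the writer; never touches the database."""
    while True:
        try:
            url = urls.get_nowait()
        except asyncio.QueueEmpty:
            return
        html = await fetch(url)
        # put() waits when the queue is full, so results are never dropped.
        await write_queue.put((url, html))


async def writer(conn: sqlite3.Connection, write_queue: asyncio.Queue) -> None:
    """The only task that writes to SQLite, which avoids lock contention."""
    while True:
        item = await write_queue.get()
        if item is None:  # sentinel: all workers are done
            break
        conn.execute("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", item)
    conn.commit()


async def crawl(url_list: list[str], conn: sqlite3.Connection) -> None:
    urls: asyncio.Queue = asyncio.Queue()
    for url in url_list:
        urls.put_nowait(url)
    write_queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_SIZE)
    writer_task = asyncio.create_task(writer(conn, write_queue))
    await asyncio.gather(*(worker(urls, write_queue) for _ in range(CONCURRENCY)))
    await write_queue.put(None)
    await writer_task


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT)")
asyncio.run(crawl([f"https://example.com/{i}" for i in range(20)], conn))
stored = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

Because workers wait on a full queue instead of discarding results, the bounded queue acts as back-pressure rather than as a place where pages can be lost.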

The crawler has a default delay to avoid overloading the target site. For this reason, it is pointless to have too much concurrency if your default delay is high.
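
A back-of-the-envelope calculation shows why. The sketch below assumes the delay is enforced per host, so a new request to the same host starts at most once every `delay_s` seconds; how the crawler actually applies its delay may differ. Both helper functions are hypothetical and exist only for this illustration.

```python
import math


def useful_concurrency(delay_s: float, avg_fetch_s: float) -> int:
    """Workers needed to keep one host busy at one request per delay_s."""
    if delay_s <= 0:
        raise ValueError("delay_s must be positive")
    return max(1, math.ceil(avg_fetch_s / delay_s))


def max_pages_per_minute(delay_s: float) -> float:
    """Upper bound on pages fetched from a single host per minute."""
    if delay_s <= 0:
        raise ValueError("delay_s must be positive")
    return 60.0 / delay_s


# With a 2 s delay and 0.5 s downloads, one worker is already enough.
slow_site_workers = useful_concurrency(delay_s=2.0, avg_fetch_s=0.5)
# With a 0.5 s delay and 2 s downloads, about four workers can be kept busy.
fast_site_workers = useful_concurrency(delay_s=0.5, avg_fetch_s=2.0)
```

Under this assumption, extra workers beyond that number would simply wait for the delay to expire when crawling a single host.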

The overall project aims to be very compact (*less is more* mantra).

## Utility commands

### reindex
The reindex command can be used to re-index the database.

# Next Steps and Roadmap

1) The links table is populated but not yet used by the search. The idea is to use it to refine a PageRank-style ranking. To get an idea of what it contains, try:

```sql
SELECT p.url, COUNT(*) AS out_links
FROM links l JOIN pages p ON p.id = l.from_page_id
GROUP BY p.id
ORDER BY out_links DESC
LIMIT 20;
```
2) A Dockerfile plus a compose file is needed to provide easy installation.
3) Ability to reindex partially.
4) Ability to classify categories and tags in the full-text search, which can be useful for faceting and classification.
   "Auto discovery" of the taxonomies could be a further idea.

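As a possible first step towards the link-based ranking in item 1, the sketch below counts inbound links per page, a crude stand-in for a real PageRank computation. The `pages.id`, `pages.url` and `links.from_page_id` columns appear in the query above; the `links.to_page_id` column, the sample data, and the `inbound_link_counts` helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT);
    -- to_page_id is an assumed column name, not confirmed by the project.
    CREATE TABLE links (from_page_id INTEGER, to_page_id INTEGER);
    INSERT INTO pages (id, url) VALUES
        (1, 'https://example.com/'),
        (2, 'https://example.com/about'),
        (3, 'https://example.com/blog');
    INSERT INTO links (from_page_id, to_page_id) VALUES (2, 1), (3, 1), (1, 3);
    """
)


def inbound_link_counts(db: sqlite3.Connection, limit: int = 20) -> list[tuple[str, int]]:
    """Return (url, number of inbound links), most linked pages first."""
    return db.execute(
        """
        SELECT p.url, COUNT(l.from_page_id) AS in_links
        FROM pages p LEFT JOIN links l ON l.to_page_id = p.id
        GROUP BY p.id
        ORDER BY in_links DESC, p.url
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


ranking = inbound_link_counts(conn)
```

A count like this could later be blended with the full-text relevance score, but how to weight the two is left open here.
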
## Docker compose and auto-index mode

Be happy!
