{"id":39962804,"url":"https://github.com/daitangio/find","last_synced_at":"2026-01-18T21:31:55.118Z","repository":{"id":331272007,"uuid":"1125987011","full_name":"daitangio/find","owner":"daitangio","description":"Python + SQLite search engine","archived":false,"fork":false,"pushed_at":"2026-01-02T11:38:20.000Z","size":29,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-01-05T07:58:41.376Z","etag":null,"topics":["crawler","indexer","python","search-engine"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/daitangio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-31T19:59:04.000Z","updated_at":"2026-01-02T11:38:24.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/daitangio/find","commit_stats":null,"previous_names":["daitangio/find"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/daitangio/find","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daitangio%2Ffind","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daitangio%2Ffind/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daitangio%2Ffind/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daitangio%2Ffind/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/daitangio","download_url":"https://codeload.github.com/daitangio/find/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daitangio%2Ffind/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28551202,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-18T20:59:07.572Z","status":"ssl_error","status_checked_at":"2026-01-18T20:59:02.799Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","indexer","python","search-engine"],"created_at":"2026-01-18T21:31:55.063Z","updated_at":"2026-01-18T21:31:55.113Z","avatar_url":"https://github.com/daitangio.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# So What?\n\n\u003cp align=\"right\"\u003e\u003ci\u003eStop searching, start finding stuff\u003c/i\u003e\u003c/p\u003e\n\n[![Pylint](https://github.com/daitangio/find/actions/workflows/pylint.yml/badge.svg)](https://github.com/daitangio/find/actions/workflows/pylint.yml)\n\nFind is a super-minimal search engine based on SQLite Full Text Search capabilities and Python.\nIt is composed of two commands:\n\n- [A Simple web crawler](./src/find/crawl.py) which uses asyncio to maximize index ingestion speed.\n- [A Flask app to enable end-users to 
find](./src/find/app.py) things.\n\n## Features\n- Find supports caching of web pages (a lost feature of Google) and de-duplication of pages with identical content. Backlink ranking tuning is in progress.\n- Respects robots.txt\n\n# How to start\n\nCreate a virtualenv and install the project:\n\n```sh\n    python3 -m venv .venv\n    . .venv/bin/activate\n    pip install -e .\n```\n\nRun your first crawl:\n\n    crawl --seed https://myhost.com --same-host \n\nRun the web interface with:\n\n    findgui\n\n\n# Why\n\nI needed a small search engine for my static web site. I asked ChatGPT 5.2 to design it, then I refined the code.\nThe initial prompt was:\n\n    Design a small python web application to implement a search engine. \n    The search must be performed on a SQLite database using \n    the SQLite Full Text Search (FTS5) extension. \n    Design the database model to be able to store simple html web pages.\n\n# Design principles\n\nFind is a compact, zero-conf \u0026 tiny solution for adding a search engine to an existing blog site.\nIt just works out of the box.\n\nAs a basic rule I will try to keep it below 2000 lines of code.\n\nThe project accepts pull requests: please open one with a comment. Ensure the change passes the pylint checks.\n\n# How\n\n[SQLite has a full text search capability called FTS5](https://sqlite.org/fts5.html) which also offers stemming for the English language out of the box.\n\nFor the crawler, ChatGPT proposed asyncio I/O (the aiohttp \u0026 aiosqlite libraries), which is a very good approach to scale the crawler: downloading web pages is a heavily I/O-bound activity and it benefits from a non-blocking library.\n\nThe initial implementation had a locking problem: we solved it with a single-writer database task. 
\nSQLite is so fast that the writer queue is hard to tune: it is very difficult to saturate it.\nTo avoid data loss, I opted for a queue 4x the concurrency level.\n\nThe crawler has a default delay to avoid overloading the target site. For this reason, it is pointless to have too much concurrency if your default delay is high.\n\nThe overall project aims to be very compact (the *less is more* mantra).\n\n## Utility commands\n\n### reindex\nThe reindex command can be used to re-index the database.\n\n# Next Steps and Roadmap\n\n1) The links table is populated but not yet used by the search. The idea is to use it to refine the PageRank. To get an idea, try:\n\n    ```sql\n    SELECT p.url, COUNT(*) AS out_links\n    FROM links l JOIN pages p ON p.id = l.from_page_id\n    GROUP BY p.id\n    ORDER BY out_links DESC\n    LIMIT 20;\n    ```\n2) A Dockerfile plus a compose file is needed for easy installation.\n3) Ability to partially reindex.\n4) The ability to classify categories and tags in the full text search can be useful for faceting and classification.\n\"Auto discovery\" of the taxonomies is a possible further idea.\n\n## Docker compose and auto-index mode\n\nBe happy!\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaitangio%2Ffind","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdaitangio%2Ffind","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaitangio%2Ffind/lists"}