{"id":34498320,"url":"https://github.com/aaronlifton/fastcrawl","last_synced_at":"2025-12-24T01:53:30.365Z","repository":{"id":323950634,"uuid":"1092657036","full_name":"aaronlifton/fastcrawl","owner":"aaronlifton","description":"an agentic atomics-based low-heap allocation web crawler written in Rust, that can crawl Wikipedia at a rate of ~75 pages/sec.","archived":false,"fork":false,"pushed_at":"2025-11-20T03:30:19.000Z","size":447,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-11-20T05:25:50.653Z","etag":null,"topics":["agentic-ai","agentic-framework","agentic-web-crawler","ai-agents","atomics","rust","stack-allocation","streaming-parser","web-crawler","wikipedia"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aaronlifton.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-09T03:26:12.000Z","updated_at":"2025-11-20T03:30:23.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/aaronlifton/fastcrawl","commit_stats":null,"previous_names":["aaronlifton/fastcrawl"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aaronlifton/fastcrawl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aaronlifton%2Ffastcrawl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aaronlifton%2Ffastcrawl/tag
s","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aaronlifton%2Ffastcrawl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aaronlifton%2Ffastcrawl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aaronlifton","download_url":"https://codeload.github.com/aaronlifton/fastcrawl/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aaronlifton%2Ffastcrawl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27992829,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-23T02:00:07.087Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","agentic-framework","agentic-web-crawler","ai-agents","atomics","rust","stack-allocation","streaming-parser","web-crawler","wikipedia"],"created_at":"2025-12-24T01:53:29.253Z","updated_at":"2025-12-24T01:53:30.353Z","avatar_url":"https://github.com/aaronlifton.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fastcrawl\n\nFastcrawl is a polite, configurable web-crawler focused on continuous streaming extraction. 
It ships with a minimal\nWikipedia example (`examples/wiki.rs`) that demonstrates how to plug custom link filters and crawl controls into the\ncore runtime.\n\nCurrent fastest speed, with default controls of `max-depth` 4, `max-links-per-page` 16, `politeness-ms` 250,\n`partition-strategy` 'wiki-prefix' (instead of 'hash'), `partition-buckets` 26, `remote-batch-size` 32, and\n`duration-secs` 4 (it crawls for 4 seconds, but any enqueued link is still awaited, so it ran for 26.61s) is **75.12\npages/sec**.\n\n## Metrics\n\nWhen running\n\n```\n  cargo run --example wiki --features multi_thread -- \\\n    --duration-secs 4 \\\n    --partition wiki-prefix \\\n    --partition-namespace \\\n    --partition-buckets 26 \\\n    --remote-batch-size 32\n```\n\nThe crawl metrics were:\n\n```\n--- crawl metrics (26.61s) ---\npages fetched: 1999\nurls fetched/sec: 75.12\nurls discovered: 4517\nurls enqueued: 1995\nduplicate skips: 2522\nfrontier rejects: 0\nhttp errors: 0\nurl parse errors: 0\nlocal shard enqueues: 7889\nremote shard links: 2739 (batches 344)\n```\n\n## Highlights\n\n- **Streaming-first parsing.** The default build runs entirely on a single Tokio thread with `lol_html`, harvesting\n  links as bytes arrive so memory stays bounded to the current response.\n- **Sharded multi-thread mode.** Enable the `multi_thread` feature to spin up several single-thread runtimes in\n  parallel. 
Each shard owns its own frontier and exchanges cross-shard links over Tokio channels, which keeps contention\n  low while scaling to multiple cores.\n- **Deterministic politeness.** `CrawlControls` exposes depth limits, per-domain allow lists, politeness delays, and\n  other knobs so you never need to edit the example binary to tweak behavior.\n- **Actionable metrics.** Every run prints pages fetched, URLs/sec, dedupe counts, and error totals so you can tune the\n  pipeline quickly.\n\n## Getting Started\n\n```sh\ngit clone https://github.com/aaronlifton/fastcrawl.git\ncd fastcrawl\ncargo run --example wiki\n```\n\nThat command launches the streaming single-thread runtime, seeded with a handful of Wikipedia URLs.\n\n## Runtime Modes\n\n### Single-thread (default)\n\n```\ncargo run --example wiki\n```\n\n- Builds without extra features.\n- Uses `tokio::runtime::Builder::new_current_thread()` plus a `LocalSet`, meaning every worker can hold `lol_html`\n  rewriters (which rely on `Rc\u003cRefCell\u003c_\u003e\u003e`).\n- Starts extracting links while the response body is still in flight—ideal for tight politeness windows or\n  memory-constrained environments.\n\n### Multi-thread (sharded streaming)\n\n```\ncargo run --example wiki --features multi_thread -- --duration-secs 60\n```\n\n- Spawns one OS thread per shard (stack size bumped to 8 MB to support deep streaming stacks).\n- Each shard runs the same streaming workers as single-thread mode but owns a unique frontier and Bloom filter.\n- Cross-shard discoveries are routed through bounded mpsc channels, so enqueue contention happens on a single consumer\n  instead of every worker.\n- Pass `--partition wiki-prefix` (default: `hash`) to keep Wikipedia articles with similar prefixes on the same shard.\n- Use `--partition-buckets \u003cn\u003e` (default `0`, meaning shard count) to control how many alphabetical buckets feed into\n  the shards, and `--partition-namespace` to keep namespaces like `Talk:` or `Help:` on 
stable shards.\n- Tune `--remote-batch-size \u003cn\u003e` (default 8) to control how many cross-shard links get buffered before the router sends\n  them; higher values reduce channel wakeups at the cost of slightly delayed enqueues on the destination shard.\n- Enable `--remote-channel-logs` only when debugging channel shutdowns; it reintroduces the verbose “remote shard …\n  closed” logs.\n\n## Customizing Crawls\n\n- `CrawlControls` (exposed via CLI/env vars) manage maximum depth, per-domain filters, link-per-page caps, politeness\n  delays, run duration, and more. See `src/controls.rs` for every option.\n- `UrlFilter` lets you inject arbitrary site-specific logic—`examples/wiki.rs` filters out non-article namespaces and\n  query patterns.\n- Metrics live in `src/runtime.rs` and can be extended if you need additional counters or telemetry sinks. Multi-thread\n  runs also report `local shard enqueues` vs `remote shard links (batches)` so you can gauge partition efficiency.\n\n## Corpus Normalization\n\nPass `--normalize` to stream every fetched page through the new `Normalizer` service. The pipeline writes newline-\ndelimited JSON (metadata + cleaned text blocks + embedding-ready chunks) to `--normalize-jsonl` (default:\n`normalized_pages.jsonl`) and respects additional knobs:\n\n```\ncargo run --example wiki \\\n  --features multi_thread -- \\\n  --duration-secs 30 \\\n  --partition wiki-prefix \\\n  --partition-namespace \\\n  --partition-buckets 26 \\\n  --remote-batch-size 32 \\\n  --normalize \\\n  --normalize-jsonl data/wiki.jsonl \\\n  --normalize-manifest-jsonl data/wiki_manifest.jsonl \\\n  --normalize-chunk-tokens 384 \\\n  --normalize-overlap-tokens 64\n```\n\nChunk and block bounds can be tuned via `--normalize-chunk-tokens`, `--normalize-overlap-tokens`, and\n`--normalize-max-blocks`. 
The JSON payload includes per-block heading context, content hashes, token estimates, and\nmetadata such as HTTP status, language hints, and shard ownership so downstream embedding/indexing jobs can ingest it\ndirectly. When `--normalize-manifest-jsonl` is set, the runtime loads any existing manifest at that path before\noverwriting it, then appends digest records (`url`, `checksum`, `last_seen_epoch_ms`, `changed`). Keeping that JSONL\nfile between runs unlocks true incremental diffs instead of just reporting changes that happened within a single\nprocess.\n\n## Embedding Pipeline\n\n`fastcrawl-embedder` replaces the toy bag-of-words demo with real embedding providers. Point it at the normalized JSONL\nstream and it batches chunk text into the configured backend (OpenAI by default).\n\nFirst, run the crawler with normalization enabled:\n\n```sh\ncargo run --example wiki -- \\\n  --normalize \\\n  --normalize-jsonl data/wiki.jsonl \\\n  --normalize-manifest-jsonl data/wiki_manifest.jsonl \\\n  --normalize-shards 4 \\\n  --duration-secs 4\n```\n\nTip: when `--normalize-shards N` is greater than 1, the crawler stripes output into `data/wiki.jsonl.part{0..N-1}` for faster disk writes; concatenate them afterwards with `cat data/wiki.jsonl.part* \u003e data/wiki.jsonl` (the manifest stays a single file and is already deduplicated).\n\nOnce that finishes, run the embedder command:\n\n```sh\nOPENAI_API_KEY=sk-yourkey \\\ncargo run --bin embedder -- \\\n  --input data/wiki.jsonl \\\n  --manifest data/wiki_manifest.jsonl \\\n  --output data/wiki_embeddings.jsonl \\\n  --batch-size 64 \\\n  --only-changed\n```\n\nTo use Qdrant Cloud Inference instead of OpenAI:\n\n```sh\nQDRANT_API_KEY=secret \\\ncargo run --bin embedder -- \\\n  --provider qdrant \\\n  --qdrant-endpoint https://YOUR-CLUSTER.cloud.qdrant.io/inference/text \\\n  --qdrant-model qdrant/all-MiniLM-L6-v2 \\\n  --input data/wiki.jsonl \\\n  --manifest data/wiki_manifest.jsonl \\\n  --output data/wiki_embeddings.jsonl\n```\n\nImportant flags/env vars:\n\n- 
`--provider` selects `openai` (default) or `qdrant`.\n- `OPENAI_API_KEY` must be set (or `--openai-api-key` passed) when `--provider openai`.\n- `--openai-model` chooses any embedding-capable model (e.g. `text-embedding-3-large`).\n- `--openai-dimensions` optionally asks OpenAI to project to a smaller dimension.\n- `QDRANT_API_KEY`, `--qdrant-endpoint`, and `--qdrant-model` configure the Qdrant backend.\n- `--batch-size` (env `FASTCRAWL_EMBED_BATCH`) controls request fan-out (default 32, retries/backoff handled\n  automatically).\n- `--openai-threads` (alias `--worker-threads`, or `FASTCRAWL_OPENAI_THREADS`) fans batches out to multiple worker\n  threads so you can overlap network latency when OpenAI throttles.\n\nThe embedder still emits newline-delimited `EmbeddedChunkRecord`s compatible with downstream tooling. Set\n`--only-changed` alongside the manifest produced by normalization to skip chunks whose manifest `changed` flag stayed\nfalse, so re-embedding only happens when the crawler observed fresh content.\n\n### Automated refresh pipeline\n\nOnce you have normalized pages + a manifest, you can re-run freshness → embedding → pgvector load → FTS indexing in one\nshot via:\n\n```\nexport OPENAI_API_KEY=sk-yourkey\nexport DATABASE_URL=postgres://postgres:postgres@localhost:5432/fastcrawl\n./bin/refresh.sh\n```\n\nOverride defaults with `FASTCRAWL_REFRESH_*` env vars (see the script). The script assumes the wiki crawl +\nnormalization have already produced `data/wiki.jsonl` and `data/wiki_manifest.jsonl`.\n\n## pgvector Store\n\nShip the embeddings into Postgres with the bundled `fastcrawl-pgvector` binary. It ingests the JSONL produced above and\nupserts into a `vector` table (creating the `vector` extension/table automatically unless disabled). 
The repo now ships\na `docker-compose.yml` that launches a local Postgres instance with the `pgvector` extension preinstalled:\n\n```sh\ndocker compose up -d pgvector\n```\n\nOnce the container is healthy, point `DATABASE_URL` at it (fish syntax first, then POSIX shells) and run the loader:\n\n```fish\nset -gx DATABASE_URL postgres://postgres:postgres@localhost:5432/fastcrawl\n```\n\n```sh\nexport DATABASE_URL=postgres://postgres:postgres@localhost:5432/fastcrawl\n```\n\n```sh\ncargo run --bin pgvector_store -- \\\n  --input data/wiki_embeddings.jsonl \\\n  --schema public \\\n  --table wiki_chunks \\\n  --batch-size 256 \\\n  --upsert \\\n  --database-url=postgresql://postgres:postgres@localhost:5432/fastcrawl\n```\n\nStop the container with `docker compose down` (pass `-v` to remove the persisted volume if you want a clean slate).\n\n`fastcrawl-pgvector` now provisions a generated `text_tsv` column plus a GIN index so Postgres full-text queries stay\nfast. If you already had a table before this change, run the helper to retrofit the column/index (and optionally\nANALYZE) without reloading vectors:\n\n```sh\ncargo run --bin fts_indexer -- \\\n  --database-url $DATABASE_URL \\\n  --schema public \\\n  --table wiki_chunks\n```\n\nColumns created by default:\n\n- `url TEXT`, `chunk_id BIGINT` primary key for provenance.\n- `text`, `section_path JSONB`, `token_estimate`, `checksum`, `last_seen_epoch_ms` for metadata.\n- `embedding VECTOR(\u003cdims\u003e)` where `\u003cdims\u003e` matches the first record’s vector length.\n- `text_tsv TSVECTOR` generated from the chunk text, indexed for lexical search.\n\nWith vectors in pgvector you can run similarity search straight from SQL, plug it into RAG services, or join additional\nmetadata tables to restrict retrieval.\n\n## Retrieval Evaluation Harness\n\nQuantify how well pgvector surfaces the right chunks before plugging them into an LLM:\n\n```\ncargo run --bin vector_eval -- \\\n  --cases data/wiki_eval.jsonl \\\n  --database-url 
postgres://postgres:postgres@localhost:5432/fastcrawl \\\n  --schema public \\\n  --table wiki_chunks \\\n  --top-k 5 \\\n  --dense-candidates 32 \\\n  --lexical-candidates 64 \\\n  --rrf-k 60 \\\n  --report-json data/wiki_eval_report.json \\\n  --openai-api-key $OPENAI_API_KEY\n```\n\nThe CLI embeds each query in `data/wiki_eval.jsonl`, performs the same dense+lexical fusion used by the HTTP retriever,\nand prints hit-rate / MRR / recall metrics. Tweak fusion behaviour with the candidate and `--rrf-k` knobs. The optional\n`--report-json` writes a structured summary for dashboards (including fused/dense/lexical scores per chunk).\n\n## Embedding Freshness Planner\n\nDetect drift between two manifest snapshots and emit a plan describing which URLs need re-embedding:\n\n```\ncargo run --bin freshness -- \\\n  --current-manifest data/wiki_manifest.jsonl \\\n  --previous-manifest data/wiki_manifest_prev.jsonl \\\n  --plan-output data/refresh_plan.jsonl \\\n  --ledger-output data/embedding_ledger.jsonl \\\n  --exec-plan 'cargo run --bin embedder -- --input data/wiki.jsonl --manifest data/wiki_manifest.jsonl --output data/wiki_embeddings.jsonl --only-changed --openai-api-key $OPENAI_API_KEY'\n```\n\n`fastcrawl-freshness` diffs the manifests, prints counts for new/changed/deleted URLs, writes JSONL plan entries,\nappends an audit ledger, and (optionally) runs a shell hook once the plan exists. 
Pass `--dry-run` to see stats without\ntouching files or invoking the hook.\n\n## Retrieval API\n\nExpose pgvector retrieval over HTTP for downstream services:\n\n```\ncargo run --bin retriever_api -- \\\n  --database-url postgres://postgres:postgres@localhost:5432/fastcrawl \\\n  --schema public \\\n  --table wiki_chunks \\\n  --bind 0.0.0.0:8080 \\\n  --openai-api-key $OPENAI_API_KEY \\\n  --dense-candidates 32 \\\n  --lexical-candidates 64\n```\n\nEndpoints:\n\n- `GET /healthz` – liveness probe.\n- `POST /v1/query` – `{ \"query\": \"When did Apollo 11 land?\", \"top_k\": 6 }` returns scored chunks filtered by the token\n  budget. Override the budget per request via `max_tokens`.\n\nSet one or more `--api-key` values (or `FASTCRAWL_API_KEY`) to require `X-API-Key` headers on every request. Combine\nthat with the built-in rate limiter (`--max-requests-per-minute` / `--rate-limit-burst`) before exposing the service\npublicly.\n\nThe retriever now performs hybrid search: dense candidates from pgvector plus lexical matches taken from Postgres full-\ntext search, fused via Reciprocal Rank Fusion, then trimmed by an optional token budget. Each chunk reports\n`dense_distance`, optional `lexical_score`, and the final `fused_score`/rank so clients can understand why it surfaced.\nTune the hybrid behaviour with `--dense-candidates`, `--lexical-candidates`, and `--rrf-k`. 
By default the server also\ncaches 1,024 query embeddings and enforces 60 requests/minute with a burst of 12; adjust via `--embedding-cache-size`,\n`--max-requests-per-minute`, and `--rate-limit-burst`.\n\nPair this API with the prompt templates in `personal_docs/prompt_templates/wiki_rag.md` to keep LLM formatting\nconsistent.\n\n## Ad-hoc QA CLI\n\nTo ask questions against the local retriever + OpenAI:\n\n```sh\nFASTCRAWL_RETRIEVER_URL=http://127.0.0.1:8080/v1/query \\\nOPENAI_API_KEY=sk-yourkey \\\ncargo run --bin rag_cli -- \\\n  --query \"What annual music festival takes place in Cambridge's Cherry Hinton Hall?\" \\\n  --top-k 5\n```\n\n**Answer**\n\n```\n--- Answer ---\nThe annual music festival that takes place in Cambridge's Cherry Hinton Hall is the Cambridge Folk Festival, which has been organized by the city council since its inception in 1964[^chunk_id:21].\n\n- The Cambridge Summer Music Festival is another annual event, focusing on classical music held in the university's colleges and chapels[^chunk_id:21].\n- The Cambridge Shakespeare Festival features open-air performances of Shakespeare's works in the gardens of various colleges[^chunk_id:21].\n- The Cambridge Science Festival is the UK's largest free science festival, typically held annually in March[^chunk_id:21].\n\nConfidence level: High.\n```\n\n`fastcrawl-rag` streams the raw chunks (with fused/lexical scores) then prompts the OpenAI chat model (default\n`gpt-4o-mini`) to synthesize an answer with citations. Pass `--dry-run` to inspect context only, `--max-words` to\nenforce brevity, or tweak `--max-tokens` to bound the retriever token budget. The CLI expects `fastcrawl-retriever` to\nbe running against the indexed Postgres instance so the hybrid path matches production behavior.\n\nSwitch to Claude by adding\n`--llm-provider anthropic --anthropic-api-key ... 
--anthropic-model claude-3-sonnet-20240229`, or adjust\n`--max-completion-tokens` / `--temperature` to steer the answer style.\n\n### Dockerized retriever\n\nTo run the HTTP retriever in Docker (next to Postgres):\n\n```sh\nexport OPENAI_API_KEY=sk-yourkey\ndocker compose up -d retriever\n```\n\nThis builds `Dockerfile.retriever`, runs `fastcrawl-retriever` on port 8080, and points it at the `pgvector` service via\n`DATABASE_URL=postgres://postgres:postgres@pgvector:5432/fastcrawl`. Ensure `public.wiki_chunks` is populated (via\n`fastcrawl-pgvector`) before starting the retriever so queries return results immediately.\n\n## LLM-Oriented Next Steps\n\nFastcrawl is already a solid content harvester for downstream ML pipelines. Future work aimed at LLM/RAG workflows\nincludes:\n\n- [x] **Corpus normalization** – strip boilerplate, capture metadata, and chunk pages into consistent token windows.\n- [x] **Embedding pipeline** – push cleaned chunks through an embedding model and store vectors (pgvector/Qdrant/Milvus)\n      with provenance.\n- [x] **Incremental refresh** – schedule revisits, diff pages, and update embeddings so the knowledge base stays\n      current.\n- [x] **Training data generation** – turn chunks into instruction/QA pairs or causal LM samples; track licensing for\n      Wikipedia’s CC BY-SA requirements.\n- [x] **Retrieval-augmented answering** – wire the crawler to trigger re-indexing as new pages stream in, then expose a\n      lightweight API for LLMs to fetch relevant context on demand.\n- [ ] **Policy-aware agent** – use crawl metrics (latency, politeness) to drive an autonomous agent that decides which\n      sections of the web to expand next based on embedding coverage gaps.\n\n## License\n\nCopyright © 2025 Aaron Lifton\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faaronlifton%2Ffastcrawl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faaronlifton%2Ffastcrawl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faaronlifton%2Ffastcrawl/lists"}