{"id":50439903,"url":"https://github.com/cafitac/ai-crawler","last_synced_at":"2026-05-31T18:31:02.505Z","repository":{"id":354392105,"uuid":"1223418071","full_name":"cafitac/ai-crawler","owner":"cafitac","description":"AI-driven network-first crawler compiler for authorized workflows","archived":false,"fork":false,"pushed_at":"2026-04-28T11:05:22.000Z","size":188,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-28T12:21:37.811Z","etag":null,"topics":["agents","ai","crawler","http","mcp","python","scraping"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cafitac.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-28T09:56:46.000Z","updated_at":"2026-04-28T11:05:28.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/cafitac/ai-crawler","commit_stats":null,"previous_names":["cafitac/ai-crawler"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/cafitac/ai-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cafitac%2Fai-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cafitac%2Fai-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cafitac%2Fai-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/cafitac%2Fai-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cafitac","download_url":"https://codeload.github.com/cafitac/ai-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cafitac%2Fai-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33744444,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","ai","crawler","http","mcp","python","scraping"],"created_at":"2026-05-31T18:31:01.577Z","updated_at":"2026-05-31T18:31:02.499Z","avatar_url":"https://github.com/cafitac.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ai-crawler\n\n[![CI](https://github.com/cafitac/ai-crawler/actions/workflows/ci.yml/badge.svg)](https://github.com/cafitac/ai-crawler/actions/workflows/ci.yml)\n\nAI-driven network-first crawler compiler for authorized workflows.\n\n`ai-crawler` turns captured network evidence into reusable crawler recipes. The browser is used as a short-lived probe for API discovery, not as the crawling engine. Bulk collection runs through deterministic HTTP replay with `curl-cffi`.\n\n```text\nBrowser is not the crawler. Browser is the probe.\nAI is not the request loop. 
AI is the planner/debugger/recipe author.\n```\n\n## What it is\n\n`ai-crawler` is an early-stage Python OSS library and CLI for building crawler recipes from network evidence.\n\nIt focuses on:\n\n- Network-first API discovery and replay\n- Recipe generation, testing, repair, and deterministic execution\n- Simple CLI defaults for humans and AI harnesses\n- Python SDK facade for application integrations\n- stdio MCP server for Hermes, Claude Code, Codex, and other agents\n- Local-first tests with fake transports and fixture sites\n- Security boundaries: redaction, challenge detection, and no CAPTCHA/MFA/bot-challenge bypass logic\n\n## Install for local development\n\n```bash\ngit clone https://github.com/cafitac/ai-crawler.git\ncd ai-crawler\nuv sync --extra dev --extra http --extra mcp\n```\n\nIf you are already inside a local checkout:\n\n```bash\nuv sync --extra dev --extra http --extra mcp\n```\n\n## npm wrapper\n\nFor npm-first onboarding, the repo also ships a thin Node wrapper that delegates to the Python core:\n\n```bash\nnpx @cafitac/ai-crawler --help\nnpx @cafitac/ai-crawler auto evidence.json --json\nnpx @cafitac/ai-crawler mcp\n```\n\nWrapper behavior:\n\n- inside the repo checkout: runs the local Python core with `uv run --project \u003crepo\u003e ai-crawler ...`\n- outside the repo checkout: runs the published Python core via a git-pinned uvx spec when the wrapper package includes `gitHead`, otherwise falls back to `uvx --from \"git+https://github.com/cafitac/ai-crawler.git[all]\" ai-crawler ...`\n- override the published Python package spec with `AI_CRAWLER_PYTHON_SPEC`\n- override the uvx Python version with `AI_CRAWLER_UVX_PYTHON`\n\n## Quick start\n\nThe one-command path from URL to crawler artifacts is:\n\n```bash\nuv sync --extra browser --extra http\nuv run --extra browser --extra http ai-crawler compile https://example.com/products --goal \"collect products\" --json\n```\n\n`compile` opens the page briefly, records normalized network response 
events into `evidence.json`, generates a recipe, tests it, repairs extraction when possible, retests, and writes final JSONL output. The browser is only used for discovery; the generated recipe and final crawl use deterministic HTTP replay. By default, probe evidence keeps replay-friendly `fetch`/`xhr` 2xx/3xx responses and drops static assets, failed responses, and other browser noise.\n\nIf you want to inspect or edit evidence before compiling, split the flow:\n\n```bash\nuv run --extra browser ai-crawler probe https://example.com/products --goal \"collect products\"\nuv run --extra browser ai-crawler probe https://example.com/products --goal \"collect products\" --wait-ms 2500 --max-events 50 --include-resource-type fetch,xhr,document\nuv run --extra http ai-crawler auto evidence.json --json\n```\n\nIf you already have an evidence file, the main AI-harness command is:\n\n```bash\nai-crawler auto evidence.json --json\n```\n\nWith a local checkout:\n\n```bash\nuv run --extra http ai-crawler auto evidence.json --json\n```\n\nThis writes default artifacts:\n\n```text\nevidence.json            # browser probe evidence, if generated by probe\nrecipe.yaml              # initial generated recipe\nrepaired.recipe.yaml     # repaired/final recipe\ntest.jsonl               # initial diagnostic crawl output\ncrawl.jsonl              # final crawl output\nauto.report.json         # stable machine-readable report\n```\n\nThe JSON report includes:\n\n- final success/failure status\n- `command_type` (`compile` or `auto`)\n- `failure_phase` for quick triage (`probe`, `generate`, `final_test`, or empty on success)\n- ordered `phase_diagnostics` for `probe -\u003e generate -\u003e initial_test -\u003e repair -\u003e final_test`\n- recipe/output paths\n- initial and final crawl results\n- bounded/redacted diagnostic samples\n- failure classifications such as `success`, `extraction_failed`, `http_error`, `no_response`, `challenge_detected`, `probe_failed`, and 
`no_endpoint_candidates`\n\nIn `--json` mode, stdout is reserved for one machine-readable JSON object. Human-readable failures are written to stderr. Exit code `2` still writes `auto.report.json` so agents can inspect the failure.\n\n## Evidence format\n\nCreate evidence with a short browser probe:\n\n```bash\nuv run --extra browser ai-crawler probe https://example.com/products --goal \"collect products\" --output evidence.json\n```\n\nThe probe tuning options are available on both `probe` and `compile`:\n\n- `--wait-ms`: browser settle time after network idle (default: `1000`)\n- `--max-events`: maximum replay candidates retained after filtering (default: `200`)\n- `--include-resource-type`: comma-separated Playwright resource types to retain (default: `fetch,xhr`)\n\nMinimal evidence JSON:\n\n```json\n{\n  \"target_url\": \"https://example.com/products\",\n  \"goal\": \"collect products\",\n  \"events\": [\n    {\n      \"method\": \"GET\",\n      \"url\": \"https://example.com/api/products?page=1\",\n      \"status_code\": 200,\n      \"resource_type\": \"fetch\"\n    }\n  ]\n}\n```\n\nGenerate and run manually:\n\n```bash\nuv run --extra browser --extra http ai-crawler compile https://example.com/products --goal \"collect products\" --json\n```\n\nOr run each artifact step yourself:\n\n```bash\nuv run --extra http ai-crawler generate-recipe evidence.json\nuv run --extra http ai-crawler test-recipe recipe.yaml\nuv run --extra http ai-crawler repair-recipe recipe.yaml\nuv run --extra http ai-crawler test-recipe repaired.recipe.yaml --output crawl.jsonl\n```\n\n## MCP usage\n\nGenerate client config snippets for local uv-project usage. 
For copy-paste examples across CLI/MCP/SDK flows, also see `docs/harness-examples.md`.\n\n```bash\nuv run ai-crawler mcp-config --client hermes --project /path/to/ai-crawler\nuv run ai-crawler mcp-config --client claude-code --project /path/to/ai-crawler\nuv run ai-crawler mcp-config --client codex --project /path/to/ai-crawler\n```\n\nGenerate npm-first snippets for the published wrapper:\n\n```bash\nuv run ai-crawler mcp-config --client hermes --launcher npm\n```\n\nRun as a stdio MCP server:\n\n```bash\nuv run --extra mcp --extra http ai-crawler mcp\n```\n\nExposed tools:\n\n- `compile_url`\n- `auto_compile`\n- `generate_recipe`\n- `test_recipe`\n- `repair_recipe`\n\nIf you prefer npm-first installation for agent tooling, the wrapper can also launch the MCP server:\n\n```bash\nnpx @cafitac/ai-crawler mcp\n```\n\nHermes development snippet shape:\n\n```yaml\nmcp_servers:\n  ai-crawler:\n    command: \"uv\"\n    args: [\"run\", \"--project\", \"/path/to/ai-crawler\", \"--extra\", \"mcp\", \"--extra\", \"http\", \"ai-crawler\", \"mcp\"]\n    timeout: 300\n    connect_timeout: 60\n```\n\nHermes npm-first snippet shape:\n\n```yaml\nmcp_servers:\n  ai-crawler:\n    command: \"npx\"\n    args: [\"-y\", \"@cafitac/ai-crawler\", \"mcp\"]\n    timeout: 300\n    connect_timeout: 60\n```\n\n## Python SDK\n\nThe Python SDK remains the stable embedded/programmatic surface. The npm package is only a launcher wrapper around this Python core. 
See `docs/harness-examples.md` for copy-paste SDK, MCP, and published-wrapper examples.\n\n```python\nfrom ai_crawler import AICrawler\n\ncrawler = AICrawler()\nresult = crawler.auto(\"evidence.json\")\nprint(result.ok)\nprint(result.exit_code)\nprint(result.report)\n\ncompile_result = crawler.compile_url(\"https://example.com/products\", goal=\"collect products\")\nprint(compile_result.report[\"command_type\"])\n```\n\nFor tests or embedded usage, inject a fake fetcher:\n\n```python\ncrawler = AICrawler(fetcher=my_fake_fetcher)\n```\n\n## npm publishing\n\nnpm publishing is automated with `.github/workflows/npm-publish.yml`.\n\n- push a tag matching the package version, for example `npm-v0.1.2`\n- or run the workflow manually with `workflow_dispatch`\n- the workflow validates that `package.json`, `pyproject.toml`, and `src/ai_crawler/__init__.py` agree on the release version before publishing\n- tag-triggered publishes also validate that the pushed tag matches `npm-v\u003cpackage.json version\u003e`\n- use `docs/release-runbook.md` for the full version bump, tagging, and post-publish smoke checklist\n\nExample tag flow:\n\n```bash\ngit tag npm-v0.1.2\ngit push origin npm-v0.1.2\n```\n\n## Verification\n\nFast local lint/type checks while iterating:\n\n```bash\nbash scripts/check-python.sh\n```\n\nFull project verification:\n\n```bash\nbash scripts/verify-ai-harness.sh\n```\n\nMCP `auto_compile` fixture smoke test:\n\n```bash\nuv run --extra http python scripts/smoke-mcp-auto-compile.py\n```\n\nThis starts a local fixture HTTP site and verifies `generate -\u003e test -\u003e repair -\u003e retest` without external internet, a real browser, or a real LLM.\n\n## Security and compliance boundary\n\n`ai-crawler` is intended for authorized crawling, internal QA/testing, research, owned or allowed web property monitoring, and data portability workflows.\n\nIt does not implement:\n\n- CAPTCHA solving\n- MFA bypass\n- Cloudflare/bot-challenge bypass\n- stealth fingerprint manipulation\n- 
evasion proxy rotation\n\nChallenge-like responses are classified and surfaced as requiring human/manual handoff where appropriate.\n\nSensitive values in diagnostic reports are redacted, including common bearer tokens, cookies, session IDs, API keys, and JSON-embedded token fields.\n\n## Documentation\n\nDevelopment docs live under `.dev/`:\n\n- `.dev/README.md`\n- `.dev/03-ai/auto-harness-contract.md`\n- `.dev/04-mcp/server.md`\n- `.dev/08-operations/security-and-compliance.md`\n- `.dev/08-operations/challenge-handling-policy.md`\n\n## Status\n\nAlpha. The deterministic recipe compiler, one-command `compile` flow, browser probe CLI, CLI, SDK facade, MCP server, redaction, failure classification, and fixture smoke tests are implemented. Real LLM provider integrations are intentionally optional/future layers behind adapter boundaries.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcafitac%2Fai-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcafitac%2Fai-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcafitac%2Fai-crawler/lists"}