https://github.com/cafitac/ai-crawler
AI-driven network-first crawler compiler for authorized workflows
agents ai crawler http mcp python scraping
- Host: GitHub
- URL: https://github.com/cafitac/ai-crawler
- Owner: cafitac
- License: mit
- Created: 2026-04-28T09:56:46.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-28T11:05:22.000Z (about 2 months ago)
- Last Synced: 2026-04-28T12:21:37.811Z (about 2 months ago)
- Topics: agents, ai, crawler, http, mcp, python, scraping
- Language: Python
- Size: 184 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ai-crawler
[CI](https://github.com/cafitac/ai-crawler/actions/workflows/ci.yml)
AI-driven network-first crawler compiler for authorized workflows.
`ai-crawler` turns captured network evidence into reusable crawler recipes. The browser is used as a short-lived probe for API discovery, not as the crawling engine. Bulk collection runs through deterministic HTTP replay with `curl-cffi`.
```text
Browser is not the crawler. Browser is the probe.
AI is not the request loop. AI is the planner/debugger/recipe author.
```
## What it is
`ai-crawler` is an early-stage Python OSS library and CLI for building crawler recipes from network evidence.
It focuses on:
- Network-first API discovery and replay
- Recipe generation, testing, repair, and deterministic execution
- Simple CLI defaults for humans and AI harnesses
- Python SDK facade for application integrations
- stdio MCP server for Hermes, Claude Code, Codex, and other agents
- Local-first tests with fake transports and fixture sites
- Security boundaries: redaction, challenge detection, and no CAPTCHA/MFA/bot-challenge bypass logic
## Install for local development
```bash
git clone https://github.com/cafitac/ai-crawler.git
cd ai-crawler
uv sync --extra dev --extra http --extra mcp
```
If you are already inside a local checkout:
```bash
uv sync --extra dev --extra http --extra mcp
```
## npm wrapper
For npm-first onboarding, the repo also ships a thin Node wrapper that delegates to the Python core:
```bash
npx @cafitac/ai-crawler --help
npx @cafitac/ai-crawler auto evidence.json --json
npx @cafitac/ai-crawler mcp
```
Wrapper behavior:
- inside the repo checkout: runs the local Python core with `uv run --project ai-crawler ...`
- outside the repo checkout: runs the published Python core via a git-pinned uvx spec when the wrapper package includes `gitHead`, otherwise falls back to `uvx --from "git+https://github.com/cafitac/ai-crawler.git[all]" ai-crawler ...`
- override the published Python package spec with `AI_CRAWLER_PYTHON_SPEC`
- override the uvx Python version with `AI_CRAWLER_UVX_PYTHON`
## Quick start
The one-command path from URL to crawler artifacts is:
```bash
uv sync --extra browser --extra http
uv run --extra browser --extra http ai-crawler compile https://example.com/products --goal "collect products" --json
```
`compile` opens the page briefly, records normalized network response events into `evidence.json`, generates a recipe, tests it, repairs extraction when possible, retests, and writes final JSONL output. The browser is only used for discovery; the generated recipe and final crawl use deterministic HTTP replay. By default, probe evidence keeps replay-friendly `fetch`/`xhr` 2xx/3xx responses and drops static assets, failed responses, and other browser noise.
If you want to inspect or edit evidence before compiling, split the flow:
```bash
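# Probe with default settings
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products"
# Or probe with a longer settle time, a lower event cap, and document responses included
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products" --wait-ms 2500 --max-events 50 --include-resource-type fetch,xhr,document
# Compile the (optionally edited) evidence without reopening the browser
uv run --extra http ai-crawler auto evidence.json --json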
```
If you already have an evidence file, the main AI-harness command is:
```bash
ai-crawler auto evidence.json --json
```
With a local checkout:
```bash
uv run --extra http ai-crawler auto evidence.json --json
```
This writes default artifacts:
```text
evidence.json # browser probe evidence, if generated by probe
recipe.yaml # initial generated recipe
repaired.recipe.yaml # repaired/final recipe
test.jsonl # initial diagnostic crawl output
crawl.jsonl # final crawl output
auto.report.json # stable machine-readable report
```
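After a run, an agent may want to confirm which of these artifacts were actually produced. The sketch below is a small helper for that purpose; it assumes the artifacts land in the working directory under the default names listed above, and `artifact_status` is a hypothetical helper name, not part of `ai-crawler`.

```python
from pathlib import Path

# Default artifact names as listed in this README.
DEFAULT_ARTIFACTS = [
    "evidence.json",
    "recipe.yaml",
    "repaired.recipe.yaml",
    "test.jsonl",
    "crawl.jsonl",
    "auto.report.json",
]


def artifact_status(workdir: str = ".") -> dict[str, bool]:
    """Map each default artifact name to whether it exists in workdir.

    Assumption: artifacts are written to the working directory.
    """
    root = Path(workdir)
    return {name: (root / name).is_file() for name in DEFAULT_ARTIFACTS}


if __name__ == "__main__":
    for name, present in artifact_status().items():
        print(f"{'found' if present else 'missing':8}{name}")
```

A missing `evidence.json` is expected when the evidence file was supplied by hand rather than generated by `probe`.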
The JSON report includes:
- final success/failure status
- `command_type` (`compile` or `auto`)
- `failure_phase` for quick triage (`probe`, `generate`, `final_test`, or empty on success)
- ordered `phase_diagnostics` for `probe -> generate -> initial_test -> repair -> final_test`
- recipe/output paths
- initial and final crawl results
- bounded/redacted diagnostic samples
- failure classifications such as `success`, `extraction_failed`, `http_error`, `no_response`, `challenge_detected`, `probe_failed`, and `no_endpoint_candidates`
In `--json` mode, stdout is reserved for one machine-readable JSON object. Human-readable failures are written to stderr. When the command exits with code `2`, `auto.report.json` is still written so agents can inspect the failure.
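A harness can rely on this contract by parsing stdout and falling back to the report file. The following is a minimal sketch, assuming the stdout object carries the same keys as `auto.report.json` and that `ai-crawler` is on `PATH`; `load_report`, `triage`, and `run_auto` are hypothetical helper names.

```python
import json
import subprocess
from pathlib import Path


def load_report(stdout: str, report_path: str = "auto.report.json") -> dict:
    """Parse the single JSON object from stdout, else read the report file."""
    try:
        return json.loads(stdout)
    except json.JSONDecodeError:
        return json.loads(Path(report_path).read_text(encoding="utf-8"))


def triage(report: dict) -> str:
    """Return the failing phase, or 'ok' when failure_phase is empty or absent."""
    return report.get("failure_phase") or "ok"


def run_auto(evidence: str = "evidence.json") -> tuple[int, dict]:
    """Run `ai-crawler auto <evidence> --json` and return (exit_code, report)."""
    proc = subprocess.run(
        ["ai-crawler", "auto", evidence, "--json"],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is an inspectable outcome, not an exception
    )
    return proc.returncode, load_report(proc.stdout)


if __name__ == "__main__":
    sample = '{"command_type": "auto", "failure_phase": "generate"}'
    print(triage(load_report(sample)))
```

Keeping `check=False` lets the caller read the report on failure instead of handling a `CalledProcessError`.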
## Evidence format
Create evidence with a short browser probe:
```bash
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products" --output evidence.json
```
The probe tuning options are available on both `probe` and `compile`:
- `--wait-ms`: browser settle time after network idle (default: `1000`)
- `--max-events`: maximum replay candidates retained after filtering (default: `200`)
- `--include-resource-type`: comma-separated Playwright resource types to retain (default: `fetch,xhr`)
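To make the effect of these options concrete, here is a rough sketch of the retention rule described in this README: keep 2xx/3xx responses of the included resource types, up to the event cap. It is not the project's actual filter; in particular, keeping the first matching events in capture order is an assumption, and `filter_events` is a hypothetical name.

```python
from typing import Iterable

# Defaults taken from the option list above.
DEFAULT_RESOURCE_TYPES = frozenset({"fetch", "xhr"})
DEFAULT_MAX_EVENTS = 200


def filter_events(
    events: Iterable[dict],
    include_resource_types: Iterable[str] = DEFAULT_RESOURCE_TYPES,
    max_events: int = DEFAULT_MAX_EVENTS,
) -> list[dict]:
    """Keep replay-friendly events: allowed resource type and 2xx/3xx status."""
    allowed = set(include_resource_types)
    kept: list[dict] = []
    for event in events:
        if len(kept) >= max_events:
            break  # assumption: the cap keeps the earliest candidates
        if event.get("resource_type") not in allowed:
            continue  # static assets and other browser noise
        status = event.get("status_code")
        if not isinstance(status, int) or not 200 <= status < 400:
            continue  # failed or missing responses
        kept.append(event)
    return kept


if __name__ == "__main__":
    captured = [
        {"url": "https://example.com/api/products?page=1", "status_code": 200, "resource_type": "fetch"},
        {"url": "https://example.com/logo.png", "status_code": 200, "resource_type": "image"},
        {"url": "https://example.com/api/cart", "status_code": 500, "resource_type": "xhr"},
    ]
    for event in filter_events(captured):
        print(event["url"])
```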
Minimal evidence JSON:
```json
{
  "target_url": "https://example.com/products",
  "goal": "collect products",
  "events": [
    {
      "method": "GET",
      "url": "https://example.com/api/products?page=1",
      "status_code": 200,
      "resource_type": "fetch"
    }
  ]
}
```
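Evidence can also be written by hand or by another tool. The sketch below builds and checks a file with the same shape as the minimal example; treating every field in that example as required is an assumption, and `validate_evidence` and `write_evidence` are hypothetical helpers rather than `ai-crawler` APIs.

```python
import json
from pathlib import Path

# Assumption: the fields in the minimal example are the required ones.
REQUIRED_TOP_LEVEL = ("target_url", "goal", "events")
REQUIRED_EVENT_KEYS = ("method", "url", "status_code", "resource_type")


def validate_evidence(evidence: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means it looks valid."""
    problems = [f"missing top-level key: {key}" for key in REQUIRED_TOP_LEVEL if key not in evidence]
    for index, event in enumerate(evidence.get("events", [])):
        problems += [
            f"events[{index}] missing key: {key}"
            for key in REQUIRED_EVENT_KEYS
            if key not in event
        ]
    return problems


def write_evidence(evidence: dict, path: str = "evidence.json") -> Path:
    """Validate the evidence and write it as indented JSON."""
    problems = validate_evidence(evidence)
    if problems:
        raise ValueError("; ".join(problems))
    target = Path(path)
    target.write_text(json.dumps(evidence, indent=2) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    draft = {"target_url": "https://example.com/products", "goal": "collect products", "events": [{"method": "GET"}]}
    print(validate_evidence(draft))
```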
Generate and run everything in one step:
```bash
uv run --extra browser --extra http ai-crawler compile https://example.com/products --goal "collect products" --json
```
Or run each artifact step yourself:
```bash
uv run --extra http ai-crawler generate-recipe evidence.json
uv run --extra http ai-crawler test-recipe recipe.yaml
uv run --extra http ai-crawler repair-recipe recipe.yaml
uv run --extra http ai-crawler test-recipe repaired.recipe.yaml --output crawl.jsonl
```
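The same four steps can be scripted. This sketch assumes each subcommand exits non-zero on failure, which this README does not state for the individual steps, and it treats the first `test-recipe` run as diagnostic so that a failing initial test still reaches the repair step. `run_pipeline` and the injectable `runner` parameter are illustrative names.

```python
import subprocess

PREFIX = ["uv", "run", "--extra", "http", "ai-crawler"]

# (arguments, fatal) pairs mirroring the manual steps above.
STEPS = [
    (["generate-recipe", "evidence.json"], True),
    (["test-recipe", "recipe.yaml"], False),  # diagnostic: a failure here is what repair is for
    (["repair-recipe", "recipe.yaml"], True),
    (["test-recipe", "repaired.recipe.yaml", "--output", "crawl.jsonl"], True),
]


def run_pipeline(runner=subprocess.run) -> int:
    """Run each step in order; return the first fatal non-zero exit code, else 0."""
    for args, fatal in STEPS:
        completed = runner([*PREFIX, *args], check=False)
        if fatal and completed.returncode != 0:
            return completed.returncode
    return 0


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for args, _fatal in STEPS:
        print(" ".join([*PREFIX, *args]))
```

Passing a fake `runner` makes the sequencing testable without invoking `uv`.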
## MCP usage
Generate client config snippets for local uv-project usage. For copy-paste examples across CLI/MCP/SDK flows, also see `docs/harness-examples.md`.
```bash
uv run ai-crawler mcp-config --client hermes --project /path/to/ai-crawler
uv run ai-crawler mcp-config --client claude-code --project /path/to/ai-crawler
uv run ai-crawler mcp-config --client codex --project /path/to/ai-crawler
```
Generate npm-first snippets for the published wrapper:
```bash
uv run ai-crawler mcp-config --client hermes --launcher npm
```
Run as a stdio MCP server:
```bash
uv run --extra mcp --extra http ai-crawler mcp
```
Exposed tools:
- `compile_url`
- `auto_compile`
- `generate_recipe`
- `test_recipe`
- `repair_recipe`
If you prefer npm-first installation for agent tooling, the wrapper can also launch the MCP server:
```bash
npx @cafitac/ai-crawler mcp
```
Hermes development snippet shape:
```yaml
mcp_servers:
  ai-crawler:
    command: "uv"
    args: ["run", "--project", "/path/to/ai-crawler", "--extra", "mcp", "--extra", "http", "ai-crawler", "mcp"]
    timeout: 300
    connect_timeout: 60
```
Hermes npm-first snippet shape:
```yaml
mcp_servers:
  ai-crawler:
    command: "npx"
    args: ["-y", "@cafitac/ai-crawler", "mcp"]
    timeout: 300
    connect_timeout: 60
```
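When a setup script needs to produce either snippet, the two shapes differ only in `command` and `args`. The sketch below mirrors the Hermes snippets above as a plain dictionary; `hermes_entry` is a hypothetical helper, and `ai-crawler mcp-config` remains the supported way to generate these snippets.

```python
import json


def hermes_entry(launcher: str = "uv", project: str = "/path/to/ai-crawler") -> dict:
    """Build the mcp_servers mapping for the uv or npm launcher."""
    if launcher == "uv":
        command = "uv"
        args = ["run", "--project", project, "--extra", "mcp", "--extra", "http", "ai-crawler", "mcp"]
    elif launcher == "npm":
        command = "npx"
        args = ["-y", "@cafitac/ai-crawler", "mcp"]
    else:
        raise ValueError(f"unknown launcher: {launcher!r}")
    return {
        "mcp_servers": {
            "ai-crawler": {
                "command": command,
                "args": args,
                "timeout": 300,
                "connect_timeout": 60,
            }
        }
    }


if __name__ == "__main__":
    # Printed as JSON for inspection; the structure matches the YAML snippets.
    print(json.dumps(hermes_entry("npm"), indent=2))
```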
## npm publishing
npm publishing is automated with `.github/workflows/npm-publish.yml`.
- push a tag matching the package version, for example `npm-v0.1.2`
- or run the workflow manually with `workflow_dispatch`
- the workflow validates that `package.json`, `pyproject.toml`, and `src/ai_crawler/__init__.py` agree on the release version before publish
- tag-triggered publishes also validate that the pushed tag matches `npm-v`
- use `docs/release-runbook.md` for the full version bump, tagging, and post-publish smoke checklist
Example tag flow:
```bash
git tag npm-v0.1.2
git push origin npm-v0.1.2
```
## Python SDK
The Python SDK remains the stable embedded/programmatic surface. The npm package is only a launcher wrapper around this Python core. See `docs/harness-examples.md` for copy-paste SDK, MCP, and published-wrapper examples.
```python
from ai_crawler import AICrawler

crawler = AICrawler()

# Run the generate -> test -> repair -> retest flow on an existing evidence file.
result = crawler.auto("evidence.json")
print(result.ok)
print(result.exit_code)
print(result.report)

# Probe a URL first, then run the same flow in one call.
compile_result = crawler.compile_url("https://example.com/products", goal="collect products")
print(compile_result.report["command_type"])
```
For tests or embedded usage, inject a fake fetcher:
```python
# my_fake_fetcher is a test double you supply in place of the real HTTP transport.
crawler = AICrawler(fetcher=my_fake_fetcher)
```
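This README does not document the fetcher interface, so the following is only a sketch of what such a test double might look like. It assumes the fetcher is a callable that takes a URL and returns an object with `status_code` and `text`; `FakeResponse` and `FakeFetcher` are hypothetical names, and the real protocol should be checked in the source or in `docs/harness-examples.md`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResponse:
    """Canned response; the attribute names are an assumption."""
    status_code: int
    text: str


@dataclass
class FakeFetcher:
    """Hypothetical fetcher: serves canned responses by URL and records calls."""
    routes: dict[str, FakeResponse]
    calls: list[str] = field(default_factory=list)

    def __call__(self, url: str, **kwargs) -> FakeResponse:
        self.calls.append(url)
        # Unknown URLs behave like a 404 so tests can exercise failure paths.
        return self.routes.get(url, FakeResponse(404, ""))


if __name__ == "__main__":
    fetcher = FakeFetcher(
        routes={"https://example.com/api/products?page=1": FakeResponse(200, '{"items": []}')}
    )
    print(fetcher("https://example.com/api/products?page=1").status_code)
    print(fetcher.calls)
    # Assumed wiring, if the interface matches: AICrawler(fetcher=fetcher)
```

Recording calls lets a test assert that the crawl replayed the expected URLs without touching the network.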
## Verification
Fast local lint/type checks while iterating:
```bash
bash scripts/check-python.sh
```
Full project verification:
```bash
bash scripts/verify-ai-harness.sh
```
MCP `auto_compile` fixture smoke test:
```bash
uv run --extra http python scripts/smoke-mcp-auto-compile.py
```
This starts a local fixture HTTP site and verifies `generate -> test -> repair -> retest` without external internet, a real browser, or a real LLM.
## Security and compliance boundary
`ai-crawler` is intended for authorized crawling, internal QA/testing, research, owned or allowed web property monitoring, and data portability workflows.
It does not implement:
- CAPTCHA solving
- MFA bypass
- Cloudflare/bot-challenge bypass
- stealth fingerprint manipulation
- evasion proxy rotation
Challenge-like responses are classified and surfaced as requiring human/manual handoff where appropriate.
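As an illustration of classify-and-surface rather than bypass, the sketch below maps a response to some of the failure classification labels used in the JSON report. The status codes and body markers are illustrative assumptions and not the project's detection rules; `classify` is a hypothetical function.

```python
from typing import Optional

# Illustrative markers only; real challenge pages vary widely.
CHALLENGE_MARKERS = ("captcha", "verify you are human")
CHALLENGE_STATUS_CODES = (403, 429, 503)


def classify(status_code: Optional[int], body: str, extracted_items: int) -> str:
    """Return a report-style classification label for one replayed response."""
    if status_code is None:
        return "no_response"
    lowered = body.lower()
    if status_code in CHALLENGE_STATUS_CODES and any(m in lowered for m in CHALLENGE_MARKERS):
        # Surface for manual handoff; do not attempt to solve or evade.
        return "challenge_detected"
    if status_code >= 400:
        return "http_error"
    if extracted_items == 0:
        return "extraction_failed"
    return "success"


if __name__ == "__main__":
    print(classify(403, "<title>Captcha required</title>", 0))
    print(classify(200, '{"items": [1, 2]}', 2))
```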
Sensitive values in diagnostic reports are redacted, including common bearer tokens, cookies, session IDs, API keys, and JSON-embedded token fields.
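For readers who want a feel for this kind of redaction, here is a minimal regex-based sketch. It is not the project's redactor: the patterns, the field names they match, and the `[REDACTED]` placeholder are all assumptions chosen for illustration.

```python
import re

_PATTERNS = [
    # Bearer tokens, e.g. "Authorization: Bearer abc.def"
    (re.compile(r"(?i)(bearer\s+)[A-Za-z0-9._~+/=-]+"), r"\1[REDACTED]"),
    # Whole Cookie / Set-Cookie header values
    (re.compile(r"(?im)^((?:set-)?cookie:\s*).+$"), r"\1[REDACTED]"),
    # Token-like fields in query strings or JSON, e.g. api_key=... or "token": "..."
    (
        re.compile(
            r'(?i)("?(?:api[_-]?key|access[_-]?token|session[_-]?id|token)"?\s*[:=]\s*"?)[^"&\s,}]+'
        ),
        r"\1[REDACTED]",
    ),
]


def redact(text: str) -> str:
    """Replace sensitive values with a placeholder, keeping the field names."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(redact("Authorization: Bearer abc.def-123"))
    print(redact('{"access_token": "s3cr3t", "page": 1}'))
    print(redact("GET /items?api_key=XYZ123&page=2"))
```

Keeping the field name while dropping the value preserves enough context for debugging a failed replay.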
## Documentation
Development docs live under `.dev/`:
- `.dev/README.md`
- `.dev/03-ai/auto-harness-contract.md`
- `.dev/04-mcp/server.md`
- `.dev/08-operations/security-and-compliance.md`
- `.dev/08-operations/challenge-handling-policy.md`
## Status
Alpha. The deterministic recipe compiler, one-command `compile` flow, browser probe, CLI, SDK facade, MCP server, redaction, failure classification, and fixture smoke tests are implemented. Real LLM provider integrations are intentionally optional/future layers behind adapter boundaries.
## License
MIT