An open API service indexing awesome lists of open source software.

https://github.com/cafitac/ai-crawler

AI-driven network-first crawler compiler for authorized workflows
https://github.com/cafitac/ai-crawler

agents ai crawler http mcp python scraping

Last synced: 24 days ago
JSON representation

AI-driven network-first crawler compiler for authorized workflows

Awesome Lists containing this project

README

          

# ai-crawler

[![CI](https://github.com/cafitac/ai-crawler/actions/workflows/ci.yml/badge.svg)](https://github.com/cafitac/ai-crawler/actions/workflows/ci.yml)

AI-driven network-first crawler compiler for authorized workflows.

`ai-crawler` turns captured network evidence into reusable crawler recipes. The browser is used as a short-lived probe for API discovery, not as the crawling engine. Bulk collection runs through deterministic HTTP replay with `curl-cffi`.

```text
Browser is not the crawler. Browser is the probe.
AI is not the request loop. AI is the planner/debugger/recipe author.
```

## What it is

`ai-crawler` is an early-stage Python OSS library and CLI for building crawler recipes from network evidence.

It focuses on:

- Network-first API discovery and replay
- Recipe generation, testing, repair, and deterministic execution
- Simple CLI defaults for humans and AI harnesses
- Python SDK facade for application integrations
- stdio MCP server for Hermes, Claude Code, Codex, and other agents
- Local-first tests with fake transports and fixture sites
- Security boundaries: redaction, challenge detection, and no CAPTCHA/MFA/bot-challenge bypass logic

## Install for local development

```bash
git clone https://github.com/cafitac/ai-crawler.git
cd ai-crawler
uv sync --extra dev --extra http --extra mcp
```

If you are already inside a local checkout:

```bash
uv sync --extra dev --extra http --extra mcp
```

## npm wrapper

For npm-first onboarding, the repo also ships a thin Node wrapper that delegates to the Python core:

```bash
npx @cafitac/ai-crawler --help
npx @cafitac/ai-crawler auto evidence.json --json
npx @cafitac/ai-crawler mcp
```

Wrapper behavior:

- inside the repo checkout: runs the local Python core with `uv run --project ai-crawler ...`
- outside the repo checkout: runs the published Python core via a git-pinned uvx spec when the wrapper package includes `gitHead`, otherwise falls back to `uvx --from "git+https://github.com/cafitac/ai-crawler.git[all]" ai-crawler ...`
- override the published Python package spec with `AI_CRAWLER_PYTHON_SPEC`
- override the uvx Python version with `AI_CRAWLER_UVX_PYTHON`

## Quick start

The one-command path from URL to crawler artifacts is:

```bash
uv sync --extra browser --extra http
uv run --extra browser --extra http ai-crawler compile https://example.com/products --goal "collect products" --json
```

`compile` opens the page briefly, records normalized network response events into `evidence.json`, generates a recipe, tests it, repairs extraction when possible, retests, and writes final JSONL output. The browser is only used for discovery; the generated recipe and final crawl use deterministic HTTP replay. By default, probe evidence keeps replay-friendly `fetch`/`xhr` 2xx/3xx responses and drops static assets, failed responses, and other browser noise.

If you want to inspect or edit evidence before compiling, split the flow:

```bash
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products"
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products" --wait-ms 2500 --max-events 50 --include-resource-type fetch,xhr,document
uv run --extra http ai-crawler auto evidence.json --json
```

If you already have an evidence file, the main AI-harness command is:

```bash
ai-crawler auto evidence.json --json
```

With a local checkout:

```bash
uv run --extra http ai-crawler auto evidence.json --json
```

This writes default artifacts:

```text
evidence.json # browser probe evidence, if generated by probe
recipe.yaml # initial generated recipe
repaired.recipe.yaml # repaired/final recipe
test.jsonl # initial diagnostic crawl output
crawl.jsonl # final crawl output
auto.report.json # stable machine-readable report
```

The JSON report includes:

- final success/failure status
- `command_type` (`compile` or `auto`)
- `failure_phase` for quick triage (`probe`, `generate`, `final_test`, or empty on success)
- ordered `phase_diagnostics` for `probe -> generate -> initial_test -> repair -> final_test`
- recipe/output paths
- initial and final crawl results
- bounded/redacted diagnostic samples
- failure classifications such as `success`, `extraction_failed`, `http_error`, `no_response`, `challenge_detected`, `probe_failed`, and `no_endpoint_candidates`

In `--json` mode, stdout is reserved for one machine-readable JSON object. Human-readable failures are written to stderr. Exit code `2` still writes `auto.report.json` so agents can inspect the failure.

## Evidence format

Create evidence with a short browser probe:

```bash
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products" --output evidence.json
```

The probe tuning options are available on both `probe` and `compile`:

- `--wait-ms`: browser settle time after network idle (default: `1000`)
- `--max-events`: maximum replay candidates retained after filtering (default: `200`)
- `--include-resource-type`: comma-separated Playwright resource types to retain (default: `fetch,xhr`)

Minimal evidence JSON:

```json
{
"target_url": "https://example.com/products",
"goal": "collect products",
"events": [
{
"method": "GET",
"url": "https://example.com/api/products?page=1",
"status_code": 200,
"resource_type": "fetch"
}
]
}
```

Generate and run manually:

```bash
uv run --extra browser --extra http ai-crawler compile https://example.com/products --goal "collect products" --json
```

Or run each artifact step yourself:

```bash
uv run --extra http ai-crawler generate-recipe evidence.json
uv run --extra http ai-crawler test-recipe recipe.yaml
uv run --extra http ai-crawler repair-recipe recipe.yaml
uv run --extra http ai-crawler test-recipe repaired.recipe.yaml --output crawl.jsonl
```

## MCP usage

Generate client config snippets for local uv-project usage. For copy-paste examples across CLI/MCP/SDK flows, also see `docs/harness-examples.md`.

```bash
uv run ai-crawler mcp-config --client hermes --project /path/to/ai-crawler
uv run ai-crawler mcp-config --client claude-code --project /path/to/ai-crawler
uv run ai-crawler mcp-config --client codex --project /path/to/ai-crawler
```

Generate npm-first snippets for the published wrapper:

```bash
uv run ai-crawler mcp-config --client hermes --launcher npm
```

Run as a stdio MCP server:

```bash
uv run --extra mcp --extra http ai-crawler mcp
```

Exposed tools:

- `compile_url`
- `auto_compile`
- `generate_recipe`
- `test_recipe`
- `repair_recipe`

If you prefer npm-first installation for agent tooling, the wrapper can also launch the MCP server:

```bash
npx @cafitac/ai-crawler mcp
```

Hermes development snippet shape:

```yaml
mcp_servers:
ai-crawler:
command: "uv"
args: ["run", "--project", "/path/to/ai-crawler", "--extra", "mcp", "--extra", "http", "ai-crawler", "mcp"]
timeout: 300
connect_timeout: 60
```

Hermes npm-first snippet shape:

```yaml
mcp_servers:
ai-crawler:
command: "npx"
args: ["-y", "@cafitac/ai-crawler", "mcp"]
timeout: 300
connect_timeout: 60
```

## Python SDK

The Python SDK remains the stable embedded/programmatic surface. The npm package is only a launcher wrapper around this Python core. See `docs/harness-examples.md` for copy-paste SDK, MCP, and published-wrapper examples.

npm publishing is automated with `.github/workflows/npm-publish.yml`.

- push a tag matching the package version, for example `npm-v0.1.2`
- or run the workflow manually with `workflow_dispatch`
- the workflow validates that `package.json`, `pyproject.toml`, and `src/ai_crawler/__init__.py` agree on the release version before publish
- tag-triggered publishes also validate that the pushed tag matches `npm-v`
- use `docs/release-runbook.md` for the full version bump, tagging, and post-publish smoke checklist

Example tag flow:

```bash
git tag npm-v0.1.2
git push origin npm-v0.1.2
```

```python
from ai_crawler import AICrawler

crawler = AICrawler()
result = crawler.auto("evidence.json")
print(result.ok)
print(result.exit_code)
print(result.report)

compile_result = crawler.compile_url("https://example.com/products", goal="collect products")
print(compile_result.report["command_type"])
```

For tests or embedded usage, inject a fake fetcher:

```python
crawler = AICrawler(fetcher=my_fake_fetcher)
```

## Verification

Fast local lint/type checks while iterating:

```bash
bash scripts/check-python.sh
```

Full project verification:

```bash
bash scripts/verify-ai-harness.sh
```

MCP `auto_compile` fixture smoke test:

```bash
uv run --extra http python scripts/smoke-mcp-auto-compile.py
```

This starts a local fixture HTTP site and verifies `generate -> test -> repair -> retest` without external internet, a real browser, or a real LLM.

## Security and compliance boundary

`ai-crawler` is intended for authorized crawling, internal QA/testing, research, owned or allowed web property monitoring, and data portability workflows.

It does not implement:

- CAPTCHA solving
- MFA bypass
- Cloudflare/bot-challenge bypass
- stealth fingerprint manipulation
- evasion proxy rotation

Challenge-like responses are classified and surfaced as requiring human/manual handoff where appropriate.

Sensitive values in diagnostic reports are redacted, including common bearer tokens, cookies, session IDs, API keys, and JSON-embedded token fields.

## Documentation

Development docs live under `.dev/`:

- `.dev/README.md`
- `.dev/03-ai/auto-harness-contract.md`
- `.dev/04-mcp/server.md`
- `.dev/08-operations/security-and-compliance.md`
- `.dev/08-operations/challenge-handling-policy.md`

## Status

Alpha. The deterministic recipe compiler, one-command `compile` flow, browser probe CLI, CLI, SDK facade, MCP server, redaction, failure classification, and fixture smoke tests are implemented. Real LLM provider integrations are intentionally optional/future layers behind adapter boundaries.

## License

MIT