{"id":47553820,"url":"https://github.com/0xMassi/webclaw","last_synced_at":"2026-04-04T03:01:05.759Z","repository":{"id":346536294,"uuid":"1178117518","full_name":"0xMassi/webclaw","owner":"0xMassi","description":"Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.","archived":false,"fork":false,"pushed_at":"2026-04-02T17:17:06.000Z","size":1137,"stargazers_count":421,"open_issues_count":0,"forks_count":55,"subscribers_count":7,"default_branch":"main","last_synced_at":"2026-04-03T06:49:16.970Z","etag":null,"topics":["ai","ai-agents","ai-scraping","cli","crawler","data-extraction","html-to-markdown","llm","markdown","mcp","mcp-server","rust","scraper","self-hosted","tls-fingerprinting","web-crawler","web-extraction","web-scraper","web-scraping","webscraping"],"latest_commit_sha":null,"homepage":"https://webclaw.io","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/0xMassi.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-10T17:50:15.000Z","updated_at":"2026-04-03T05:28:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"e39eb632-7e90-4216-befc-6edad84904dd","html_url":"https://github.com/0xMassi/webclaw","commit_stats":null,"previous_names":["0xmassi/webclaw"],"tags_count":21,"template":false,"template_full_name":null,"purl":"pkg:github/0xMassi/webclaw","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xMassi%2Fwebclaw","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xMassi%2Fwebclaw/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xMassi%2Fwebclaw/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xMassi%2Fwebclaw/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/0xMassi","download_url":"https://codeload.github.com/0xMassi/webclaw/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xMassi%2Fwebclaw/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31385935,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T01:22:39.193Z","status":"online","status_checked_at":"2026-04-04T02:00:07.569Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-agents","ai-scraping","cli","crawler","data-extraction","html-to-markdown","llm","markdown","mcp","mcp-server","rust","scraper","self-hosted","tls-fingerprinting","web-crawler","web-extraction","web-scraper","web-scraping","webscraping"],"created_at":"2026-03-29T08:00:46.478Z","updated_at":"2026-04-04T03:01:05.745Z","avatar_url":"https://github.com/0xMassi.png","language":"Rust","readme":"\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://webclaw.io\"\u003e\n    \u003cimg src=\".github/banner.png\" alt=\"webclaw\" width=\"700\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003e\n  The fastest web scraper for AI agents.\u003cbr/\u003e\n  \u003csub\u003e67% fewer tokens. Sub-millisecond extraction. Zero browser overhead.\u003c/sub\u003e\n\u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/0xMassi/webclaw/stargazers\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/0xMassi/webclaw?style=for-the-badge\u0026logo=github\u0026logoColor=white\u0026label=Stars\u0026color=181717\" alt=\"Stars\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/0xMassi/webclaw/releases\"\u003e\u003cimg src=\"https://img.shields.io/github/v/release/0xMassi/webclaw?style=for-the-badge\u0026logo=rust\u0026logoColor=white\u0026label=Version\u0026color=B7410E\" alt=\"Version\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/0xMassi/webclaw/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-AGPL--3.0-10B981?style=for-the-badge\" alt=\"License\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.npmjs.com/package/create-webclaw\"\u003e\u003cimg src=\"https://img.shields.io/npm/dt/create-webclaw?style=for-the-badge\u0026logo=npm\u0026logoColor=white\u0026label=Installs\u0026color=CB3837\" alt=\"npm installs\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://discord.gg/KDfd48EpnW\"\u003e\u003cimg src=\"https://img.shields.io/badge/Discord-Join-5865F2?style=for-the-badge\u0026logo=discord\u0026logoColor=white\" alt=\"Discord\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://x.com/webclaw_io\"\u003e\u003cimg src=\"https://img.shields.io/badge/Follow-@webclaw__io-000000?style=for-the-badge\u0026logo=x\u0026logoColor=white\" alt=\"X / Twitter\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://webclaw.io\"\u003e\u003cimg src=\"https://img.shields.io/badge/Website-webclaw.io-0A0A0A?style=for-the-badge\u0026logo=safari\u0026logoColor=white\" alt=\"Website\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://webclaw.io/docs\"\u003e\u003cimg src=\"https://img.shields.io/badge/Docs-Read-3B82F6?style=for-the-badge\u0026logo=readthedocs\u0026logoColor=white\" alt=\"Docs\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/demo.gif\" alt=\"Claude Code: web_fetch gets 403, webclaw extracts successfully\" width=\"700\" /\u003e\n  \u003cbr/\u003e\n  \u003csub\u003eClaude Code's built-in web_fetch → 403 Forbidden. webclaw → clean markdown.\u003c/sub\u003e\n\u003c/p\u003e\n\n---\n\nYour AI agent calls `fetch()` and gets a 403. Or 142KB of raw HTML that burns through your token budget. **webclaw fixes both.**\n\nIt extracts clean, structured content from any URL using Chrome-level TLS fingerprinting — no headless browser, no Selenium, no Puppeteer. Output is optimized for LLMs: **67% fewer tokens** than raw HTML, with metadata, links, and images preserved.\n\n```\n                     Raw HTML                          webclaw\n┌──────────────────────────────────┐    ┌──────────────────────────────────┐\n│ \u003cdiv class=\"ad-wrapper\"\u003e         │    │ # Breaking: AI Breakthrough      │\n│ \u003cnav class=\"global-nav\"\u003e         │    │                                  │\n│ \u003cscript\u003ewindow.__NEXT_DATA__     │    │ Researchers achieved 94%         │\n│ ={...8KB of JSON...}\u003c/script\u003e    │    │ accuracy on cross-domain         │\n│ \u003cdiv class=\"social-share\"\u003e       │    │ reasoning benchmarks.            │\n│ \u003cbutton\u003eTweet\u003c/button\u003e           │    │                                  │\n│ \u003cfooter class=\"site-footer\"\u003e     │    │ ## Key Findings                  │\n│ \u003c!-- 142,847 characters --\u003e      │    │ - 3x faster inference            │\n│                                  │    │ - Open-source weights            │\n│         4,820 tokens             │    │         1,590 tokens             │\n└──────────────────────────────────┘    └──────────────────────────────────┘\n```\n\n---\n\n## Get Started (30 seconds)\n\n### For AI agents (Claude, Cursor, Windsurf, VS Code)\n\n```bash\nnpx create-webclaw\n```\n\nAuto-detects your AI tools, downloads the MCP server, and configures everything. One command.\n\n### Homebrew (macOS/Linux)\n\n```bash\nbrew tap 0xMassi/webclaw\nbrew install webclaw\n```\n\n### Prebuilt binaries\n\nDownload from [GitHub Releases](https://github.com/0xMassi/webclaw/releases) for macOS (arm64, x86_64) and Linux (x86_64, aarch64).\n\n### Cargo (from source)\n\n```bash\ncargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli\ncargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp\n```\n\n### Docker\n\n```bash\ndocker run --rm ghcr.io/0xmassi/webclaw https://example.com\n```\n\n### Docker Compose (with Ollama for LLM features)\n\n```bash\ncp env.example .env\ndocker compose up -d\n```\n\n---\n\n## Why webclaw?\n\n| | webclaw | Firecrawl | Trafilatura | Readability |\n|---|:---:|:---:|:---:|:---:|\n| **Extraction accuracy** | **95.1%** | — | 80.6% | 83.5% |\n| **Token efficiency** | **-67%** | — | -55% | -51% |\n| **Speed (100KB page)** | **3.2ms** | ~500ms | 18.4ms | 8.7ms |\n| **TLS fingerprinting** | Yes | No | No | No |\n| **Self-hosted** | Yes | No | Yes | Yes |\n| **MCP (Claude/Cursor)** | Yes | No | No | No |\n| **No browser required** | Yes | No | Yes | Yes |\n| **Cost** | Free | $$$$ | Free | Free |\n\n**Choose webclaw if** you want fast local extraction, LLM-optimized output, and native AI agent integration.\n\n---\n\n## What it looks like\n\n```bash\n$ webclaw https://stripe.com -f llm\n\n\u003e URL: https://stripe.com\n\u003e Title: Stripe | Financial Infrastructure for the Internet\n\u003e Language: en\n\u003e Word count: 847\n\n# Stripe | Financial Infrastructure for the Internet\n\nStripe is a suite of APIs powering online payment processing\nand commerce solutions for internet businesses of all sizes.\n\n## Products\n- Payments — Accept payments online and in person\n- Billing — Manage subscriptions and invoicing\n- Connect — Build a marketplace or platform\n...\n```\n\n```bash\n$ webclaw https://github.com --brand\n\n{\n  \"name\": \"GitHub\",\n  \"colors\": [{\"hex\": \"#59636E\", \"usage\": \"Primary\"}, ...],\n  \"fonts\": [\"Mona Sans\", \"ui-monospace\"],\n  \"logos\": [{\"url\": \"https://github.githubassets.com/...\", \"kind\": \"svg\"}]\n}\n```\n\n```bash\n$ webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50\n\nCrawling... 50/50 pages extracted\n---\n# Page 1: https://docs.rust-lang.org/\n...\n# Page 2: https://docs.rust-lang.org/book/\n...\n```\n\n---\n\n## MCP Server — 10 tools for AI agents\n\n\u003ca href=\"https://glama.ai/mcp/servers/0xMassi/webclaw\"\u003e\u003cimg src=\"https://glama.ai/mcp/servers/0xMassi/webclaw/badge\" alt=\"webclaw MCP server\" /\u003e\u003c/a\u003e\n\nwebclaw ships as an MCP server that plugs into Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Antigravity, Codex CLI, and any MCP-compatible client.\n\n```bash\nnpx create-webclaw    # auto-detects and configures everything\n```\n\nOr manual setup — add to your Claude Desktop config:\n\n```json\n{\n  \"mcpServers\": {\n    \"webclaw\": {\n      \"command\": \"~/.webclaw/webclaw-mcp\"\n    }\n  }\n}\n```\n\nThen in Claude: *\"Scrape the top 5 results for 'web scraping tools' and compare their pricing\"* — it just works.\n\n### Available tools\n\n| Tool | Description | Requires API key? |\n|------|-------------|:-:|\n| `scrape` | Extract content from any URL | No |\n| `crawl` | Recursive site crawl | No |\n| `map` | Discover URLs from sitemaps | No |\n| `batch` | Parallel multi-URL extraction | No |\n| `extract` | LLM-powered structured extraction | No (needs Ollama) |\n| `summarize` | Page summarization | No (needs Ollama) |\n| `diff` | Content change detection | No |\n| `brand` | Brand identity extraction | No |\n| `search` | Web search + scrape results | Yes |\n| `research` | Deep multi-source research | Yes |\n\n8 of 10 tools work locally — no account, no API key, fully private.\n\n---\n\n## Features\n\n### Extraction\n\n- **Readability scoring** — multi-signal content detection (text density, semantic tags, link ratio)\n- **Noise filtering** — strips nav, footer, ads, modals, cookie banners (Tailwind-safe)\n- **Data island extraction** — catches React/Next.js JSON payloads, JSON-LD, hydration data\n- **YouTube metadata** — structured data from any YouTube video\n- **PDF extraction** — auto-detected via Content-Type\n- **5 output formats** — markdown, text, JSON, LLM-optimized, HTML\n\n### Content control\n\n```bash\nwebclaw URL --include \"article, .content\"       # CSS selector include\nwebclaw URL --exclude \"nav, footer, .sidebar\"    # CSS selector exclude\nwebclaw URL --only-main-content                  # Auto-detect main content\n```\n\n### Crawling\n\n```bash\nwebclaw URL --crawl --depth 3 --max-pages 100   # BFS same-origin crawl\nwebclaw URL --crawl --sitemap                    # Seed from sitemap\nwebclaw URL --map                                # Discover URLs only\n```\n\n### LLM features (Ollama / OpenAI / Anthropic)\n\n```bash\nwebclaw URL --summarize                          # Page summary\nwebclaw URL --extract-prompt \"Get all prices\"    # Natural language extraction\nwebclaw URL --extract-json '{\"type\":\"object\"}'   # Schema-enforced extraction\n```\n\n### Change tracking\n\n```bash\nwebclaw URL -f json \u003e snap.json                  # Take snapshot\nwebclaw URL --diff-with snap.json                # Compare later\n```\n\n### Brand extraction\n\n```bash\nwebclaw URL --brand                              # Colors, fonts, logos, OG image\n```\n\n### Proxy rotation\n\n```bash\nwebclaw URL --proxy http://user:pass@host:port   # Single proxy\nwebclaw URLs --proxy-file proxies.txt            # Pool rotation\n```\n\n---\n\n## Benchmarks\n\nAll numbers from real tests on 50 diverse pages. See [benchmarks/](benchmarks/) for methodology and reproduction instructions.\n\n### Extraction quality\n\n```\nAccuracy      webclaw     ███████████████████ 95.1%\n              readability ████████████████▋   83.5%\n              trafilatura ████████████████    80.6%\n              newspaper3k █████████████▎      66.4%\n\nNoise removal webclaw     ███████████████████ 96.1%\n              readability █████████████████▊  89.4%\n              trafilatura ██████████████████▏ 91.2%\n              newspaper3k ███████████████▎    76.8%\n```\n\n### Speed (pure extraction, no network)\n\n```\n10KB page     webclaw     ██                   0.8ms\n              readability █████                2.1ms\n              trafilatura ██████████           4.3ms\n\n100KB page    webclaw     ██                   3.2ms\n              readability █████                8.7ms\n              trafilatura ██████████           18.4ms\n```\n\n### Token efficiency (feeding to Claude/GPT)\n\n| Format | Tokens | vs Raw HTML |\n|--------|:------:|:-----------:|\n| Raw HTML | 4,820 | baseline |\n| readability | 2,340 | -51% |\n| trafilatura | 2,180 | -55% |\n| **webclaw llm** | **1,590** | **-67%** |\n\n### Crawl speed\n\n| Concurrency | webclaw | Crawl4AI | Scrapy |\n|:-----------:|:-------:|:--------:|:------:|\n| 5 | **9.8 pg/s** | 5.2 pg/s | 7.1 pg/s |\n| 10 | **18.4 pg/s** | 8.7 pg/s | 12.3 pg/s |\n| 20 | **32.1 pg/s** | 14.2 pg/s | 21.8 pg/s |\n\n---\n\n## Architecture\n\n```\nwebclaw/\n  crates/\n    webclaw-core     Pure extraction engine. Zero network deps. WASM-safe.\n    webclaw-fetch    HTTP client + TLS fingerprinting (wreq/BoringSSL). Crawler. Batch ops.\n    webclaw-llm      LLM provider chain (Ollama -\u003e OpenAI -\u003e Anthropic)\n    webclaw-pdf      PDF text extraction\n    webclaw-mcp      MCP server (10 tools for AI agents)\n    webclaw-cli      CLI binary\n```\n\n`webclaw-core` takes raw HTML as a `\u0026str` and returns structured output. No I/O, no network, no allocator tricks. Can compile to WASM.\n\n---\n\n## Configuration\n\n| Variable | Description |\n|----------|-------------|\n| `WEBCLAW_API_KEY` | Cloud API key (enables bot bypass, JS rendering, search, research) |\n| `OLLAMA_HOST` | Ollama URL for local LLM features (default: `http://localhost:11434`) |\n| `OPENAI_API_KEY` | OpenAI API key for LLM features |\n| `ANTHROPIC_API_KEY` | Anthropic API key for LLM features |\n| `WEBCLAW_PROXY` | Single proxy URL |\n| `WEBCLAW_PROXY_FILE` | Path to proxy pool file |\n\n---\n\n## Cloud API (optional)\n\nFor bot-protected sites, JS rendering, and advanced features, webclaw offers a hosted API at [webclaw.io](https://webclaw.io).\n\nThe CLI and MCP server work locally first. Cloud is used as a fallback when:\n- A site has bot protection (Cloudflare, DataDome, WAF)\n- A page requires JavaScript rendering\n- You use search or research tools\n\n```bash\nexport WEBCLAW_API_KEY=wc_your_key\n\n# Automatic: tries local first, cloud on bot detection\nwebclaw https://protected-site.com\n\n# Force cloud\nwebclaw --cloud https://spa-site.com\n```\n\n### SDKs\n\n```bash\nnpm install @webclaw/sdk                  # TypeScript/JavaScript\npip install webclaw                        # Python\ngo get github.com/0xMassi/webclaw-go      # Go\n```\n\n---\n\n## Use cases\n\n- **AI agents** — Give Claude/Cursor/GPT real-time web access via MCP\n- **Research** — Crawl documentation, competitor sites, news archives\n- **Price monitoring** — Track changes with `--diff-with` snapshots\n- **Training data** — Prepare web content for fine-tuning with token-optimized output\n- **Content pipelines** — Batch extract + summarize in CI/CD\n- **Brand intelligence** — Extract visual identity from any website\n\n---\n\n## Community\n\n- [Discord](https://discord.gg/KDfd48EpnW) — questions, feedback, show what you built\n- [GitHub Issues](https://github.com/0xMassi/webclaw/issues) — bug reports and feature requests\n\n## Contributing\n\nWe welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n- [Good first issues](https://github.com/0xMassi/webclaw/issues?q=label%3A%22good+first+issue%22)\n- [Architecture docs](CONTRIBUTING.md#architecture)\n\n## Acknowledgments\n\nTLS and HTTP/2 browser fingerprinting is powered by [wreq](https://github.com/0x676e67/wreq) and [http2](https://github.com/0x676e67/http2) by [@0x676e67](https://github.com/0x676e67), who pioneered browser-grade HTTP/2 fingerprinting in Rust.\n\n## License\n\n[AGPL-3.0](LICENSE)\n","funding_links":[],"categories":["Applications"],"sub_categories":["Web"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0xMassi%2Fwebclaw","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F0xMassi%2Fwebclaw","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0xMassi%2Fwebclaw/lists"}