{"id":37228879,"url":"https://github.com/poodle64/supacrawl","last_synced_at":"2026-03-04T11:00:42.087Z","repository":{"id":331969713,"uuid":"1116468318","full_name":"poodle64/supacrawl","owner":"poodle64","description":"Zero-infrastructure web scraping for the terminal","archived":false,"fork":false,"pushed_at":"2026-02-22T06:19:26.000Z","size":307,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-22T09:01:37.749Z","etag":null,"topics":["cli","crawler","llm","markdown","playwright","python","scraper","terminal","web-scraping"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/poodle64.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-14T22:57:02.000Z","updated_at":"2026-02-22T06:19:31.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/poodle64/supacrawl","commit_stats":null,"previous_names":["poodle64/supacrawl"],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/poodle64/supacrawl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poodle64%2Fsupacrawl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poodle64%2Fsupacrawl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poodle64%2Fsupacrawl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poodle64%2Fsupacrawl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/poodle64","download_url":"https://codeload.github.com/poodle64/supacrawl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poodle64%2Fsupacrawl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30078400,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T08:01:56.766Z","status":"ssl_error","status_checked_at":"2026-03-04T08:00:42.919Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","crawler","llm","markdown","playwright","python","scraper","terminal","web-scraping"],"created_at":"2026-01-15T03:30:01.785Z","updated_at":"2026-03-04T11:00:42.078Z","avatar_url":"https://github.com/poodle64.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n[![Python](https://img.shields.io/badge/Python-3.12+-3776ab?logo=python\u0026logoColor=white)](https://python.org)\n[![Playwright](https://img.shields.io/badge/Playwright-2ea44f?logo=playwright\u0026logoColor=white)](https://playwright.dev)\n[![MCP](https://img.shields.io/badge/MCP-Compatible-8A2BE2)](https://modelcontextprotocol.io)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n\n\u003ch1\u003eSupacrawl\u003c/h1\u003e\n\n\u003cem\u003eZero-infrastructure web scraping for the terminal and AI assistants.\u003c/em\u003e\n\n\u003c/div\u003e\n\n## Why Supacrawl?\n\nThere are excellent web scraping tools available. Supacrawl takes a different approach: a CLI tool designed for individual developers who want to scrape from the terminal.\n\n- **Zero infrastructure**: `pip install` and go, no Docker/databases/Redis\n- **Terminal-first**: Designed for shell workflows and pipelines\n- **MCP server**: Give AI assistants direct access to web scraping\n- **Clean markdown**: Playwright renders JS, outputs readable markdown\n- **LLM-ready**: Built-in extraction with Ollama, OpenAI, or Anthropic\n- **Anti-bot protection**: Three-tier engine system (Playwright, Patchright, Camoufox) with automatic HTTP/2 fallback\n- **PDF parsing**: Auto-detect PDF URLs, extract text with optional OCR\n- **Mobile emulation**: Scrape as any mobile device using Playwright device descriptors\n\n```bash\npip install supacrawl\nplaywright install chromium\n```\n\n## Quick Start\n\n```bash\n# Scrape a page to markdown\nsupacrawl scrape https://example.com\n\n# Crawl a website\nsupacrawl crawl https://docs.python.org/3/ -o ./python-docs --limit 50\n\n# Discover URLs without fetching\nsupacrawl map https://example.com\n\n# Web search\nsupacrawl search \"python web scraping 2024\"\n\n# LLM extraction (requires LLM config)\nsupacrawl llm-extract https://example.com/pricing -p \"Extract pricing tiers\"\n\n# Autonomous agent for complex tasks\nsupacrawl agent \"Find the pricing for all plans on example.com\"\n```\n\n## MCP Server\n\nSupacrawl includes an embedded [Model Context Protocol](https://modelcontextprotocol.io) (MCP) server, giving AI assistants like Claude, Cursor, and VS Code Copilot direct access to web scraping.\n\n### Install\n\n```bash\npip install supacrawl[mcp]\nplaywright install chromium\n```\n\n### Add to your MCP client\n\n**Claude Desktop**: edit `~/Library/Application Support/Claude/claude_desktop_config.json`:\n\n```json\n{\n  \"mcpServers\": {\n    \"supacrawl\": {\n      \"command\": \"supacrawl-mcp\",\n      \"args\": [\"--transport\", \"stdio\"]\n    }\n  }\n}\n```\n\n**Claude Code**: add to `.mcp.json` in your project root:\n\n```json\n{\n  \"mcpServers\": {\n    \"supacrawl\": {\n      \"command\": \"supacrawl-mcp\",\n      \"args\": [\"--transport\", \"stdio\"]\n    }\n  }\n}\n```\n\n**Cursor / VS Code**: add to your editor's MCP settings with the same config.\n\n### Available Tools\n\n| Tool                 | Description                                         |\n| -------------------- | --------------------------------------------------- |\n| `supacrawl_scrape`   | Scrape a URL to markdown, HTML, screenshot, or PDF  |\n| `supacrawl_crawl`    | Crawl multiple pages from a site                    |\n| `supacrawl_map`      | Discover URLs on a website without fetching content |\n| `supacrawl_search`   | Web search with multi-provider fallback             |\n| `supacrawl_extract`  | Scrape pages for LLM-powered structured extraction  |\n| `supacrawl_summary`  | Scrape a page for LLM-powered summarisation         |\n| `supacrawl_diagnose` | Diagnose scraping issues (CDN, bot detection, etc.) |\n| `supacrawl_health`   | Server health check and capability report           |\n\nThe CLI's `agent` command is intentionally omitted. When used via MCP, your LLM orchestrates the primitives directly; it *is* the agent. For standalone agentic workflows, use `supacrawl agent` from the CLI.\n\nThe server also exposes MCP **resources** (format references, search providers, capabilities) and **prompts** (workflow guides for scraping, extraction, research, and error handling).\n\n### Environment Variables\n\nPass environment variables via your MCP client config to customise behaviour:\n\n```json\n{\n  \"mcpServers\": {\n    \"supacrawl\": {\n      \"command\": \"supacrawl-mcp\",\n      \"args\": [\"--transport\", \"stdio\"],\n      \"env\": {\n        \"SUPACRAWL_ENGINE\": \"camoufox\",\n        \"SUPACRAWL_STEALTH\": \"true\",\n        \"SUPACRAWL_LOCALE\": \"en-AU\",\n        \"SUPACRAWL_TIMEZONE\": \"Australia/Sydney\",\n        \"BRAVE_API_KEY\": \"your-key-here\",\n        \"TAVILY_API_KEY\": \"your-key-here\",\n        \"SUPACRAWL_SEARCH_PROVIDERS\": \"brave,tavily\"\n      }\n    }\n  }\n}\n```\n\nAll [configuration](#configuration) environment variables apply. The MCP server also supports `SUPACRAWL_LOG_LEVEL` (default: `INFO`). Search providers fall back automatically when one hits a rate limit or quota.\n\n### Troubleshooting\n\nIf scrapes return empty or minimal content, use `supacrawl_diagnose` to identify the cause (CDN protection, JS framework, bot detection). Common fixes: set `wait_for=3000` for JS-heavy sites (enables SPA stability polling), use `wait_until=\"load\"` or `\"networkidle\"` if resources must fully load, enable `SUPACRAWL_STEALTH=true` for bot-protected sites, or try `only_main_content=false` if the wrong content is extracted.\n\n### Optional Extras\n\n```bash\npip install supacrawl[mcp,stealth]    # Patchright anti-bot evasion (Tier 2)\npip install supacrawl[mcp,camoufox]   # Camoufox for Akamai/Cloudflare (Tier 3)\npip install supacrawl[mcp,captcha]    # 2Captcha CAPTCHA solving\n```\n\n## Commands\n\n| Command             | Description                                     |\n| ------------------- | ----------------------------------------------- |\n| `scrape \u003curl\u003e`      | Scrape single page to markdown                  |\n| `crawl \u003curl\u003e`       | Crawl website, save to directory                |\n| `map \u003curl\u003e`         | Discover URLs from sitemap/links                |\n| `search \u003cquery\u003e`    | Web search with multi-provider fallback         |\n| `llm-extract \u003curl\u003e` | Extract structured data with LLM                |\n| `agent \u003cprompt\u003e`    | Autonomous agent for complex tasks              |\n| `cache`             | Cache management (clear, stats, prune)          |\n\nRun `supacrawl \u003ccommand\u003e --help` for options.\n\n## Output\n\nCrawl produces a flat directory of markdown files:\n\n```\noutput/\n├── manifest.json          # URLs crawled (for resume)\n├── index.md\n├── about.md\n└── docs_getting-started.md\n```\n\nEach markdown file includes YAML frontmatter with source URL and metadata.\n\n## Configuration\n\n### Core Settings\n\n| Variable             | Default      | Description                                            |\n| -------------------- | ------------ | ------------------------------------------------------ |\n| `SUPACRAWL_HEADLESS` | `true`       | Set `false` to see browser                             |\n| `SUPACRAWL_TIMEOUT`  | `30000`      | Page load timeout (ms)                                 |\n| `SUPACRAWL_ENGINE`   | `playwright` | Browser engine: `playwright`, `patchright`, `camoufox` |\n| `SUPACRAWL_PROXY`    | -            | Proxy URL (http/socks5)                                |\n\n### LLM Features\n\nRequired for `llm-extract`, `agent`, and `--summarize`:\n\n| Variable                 | Description                             |\n| ------------------------ | --------------------------------------- |\n| `SUPACRAWL_LLM_PROVIDER` | `ollama`, `openai`, or `anthropic`      |\n| `SUPACRAWL_LLM_MODEL`    | Model name (e.g., `qwen3:8b`)           |\n| `OPENAI_API_KEY`         | For OpenAI provider                     |\n| `ANTHROPIC_API_KEY`      | For Anthropic provider                  |\n| `OLLAMA_HOST`            | Ollama URL (default: `localhost:11434`) |\n\n### Search\n\nSupacrawl supports multiple search providers with automatic fallback. If the primary provider hits a rate limit or quota, the next provider in the chain is tried automatically.\n\n| Variable                       | Default | Description                                                                                      |\n| ------------------------------ | ------- | ------------------------------------------------------------------------------------------------ |\n| `BRAVE_API_KEY`                | -       | Brave Search API key (recommended). Free tier: ~1,000 searches/month. Get one at [brave.com/search/api](https://brave.com/search/api/) |\n| `TAVILY_API_KEY`               | -       | [Tavily](https://tavily.com/) API key. Supports web and news search                             |\n| `SERPER_API_KEY`               | -       | [Serper.dev](https://serper.dev/) API key. Google Search results                                 |\n| `SERPAPI_API_KEY`              | -       | [SerpAPI](https://serpapi.com/) API key. Google Search results                                   |\n| `EXA_API_KEY`                  | -       | [Exa.ai](https://exa.ai/) API key. Neural search for web and news                               |\n| `SUPACRAWL_SEARCH_PROVIDERS`   | `brave` | Comma-separated provider chain with fallback order (e.g., `brave,tavily,serper`)                 |\n| `SUPACRAWL_SEARCH_RATE_LIMIT`  | -       | Override default rate limit (requests/second). Provider defaults: Brave 1/s, DuckDuckGo 0.5/s   |\n\nProviders are tried in order. Set API keys for each provider you want to use; providers without keys are skipped. If no keys are configured, DuckDuckGo is used as a last-resort fallback.\n\n\u003e **Note**: DuckDuckGo is a deprecated fallback. It has no official API and actively blocks automated access with CAPTCHA challenges. Configure at least one API-keyed provider for reliable search.\n\n### Caching\n\nSupacrawl caches scraped content locally for faster repeated requests. Enable with `--max-age`:\n\n```bash\n# Cache for 1 hour\nsupacrawl scrape https://example.com --max-age 3600\n```\n\n| Variable              | Default              | Description     |\n| --------------------- | -------------------- | --------------- |\n| `SUPACRAWL_CACHE_DIR` | `~/.supacrawl/cache` | Cache directory |\n\n**Cache Management:**\n\n```bash\nsupacrawl cache stats   # View cache size and entry count\nsupacrawl cache prune   # Remove expired entries\nsupacrawl cache clear   # Clear all cache (with confirmation)\n```\n\n**Cache Behaviour:**\n- No automatic eviction; run `cache prune` periodically to clean expired entries\n- No size limits; cache grows unbounded, use `cache clear` if disk space is a concern\n- Files stored as `\u003chash\u003e.json` where hash is SHA256 of normalised URL\n\n### Optional Extras\n\n```bash\npip install supacrawl[stealth]    # Patchright for anti-bot evasion (Tier 2)\npip install supacrawl[camoufox]   # Camoufox for Akamai/Cloudflare bypass (Tier 3)\npip install supacrawl[captcha]    # 2Captcha for CAPTCHA solving\npip install supacrawl[pdf-ocr]    # OCR support for scanned PDFs\n```\n\nSelect the browser engine with `--engine` (playwright, patchright, camoufox) or set `SUPACRAWL_ENGINE` as a default. Use `--stealth` for Tier 2, `--engine camoufox` for Tier 3, and `--solve-captcha` for CAPTCHA-protected sites. CAPTCHA solving requires `CAPTCHA_API_KEY` environment variable.\n\nCopy `.env.example` to `.env` to configure.\n\n### System-Managed Playwright Browsers\n\nDistributions like NixOS and Guix provide pre-built Playwright browser binaries. To use them, pin the Python package to match your system's browser version and set `PLAYWRIGHT_BROWSERS_PATH`:\n\n```bash\npip install 'supacrawl' 'playwright==1.52.0'  # match your distro's version\nexport PLAYWRIGHT_BROWSERS_PATH=/nix/store/...-playwright-driver-browsers\n```\n\nSkip `playwright install`; your system already provides the binaries.\n\n## Development\n\n```bash\n# From source\nconda env create -f environment.yaml \u0026\u0026 conda activate supacrawl\npip install -e .[dev]\nplaywright install chromium\n\n# Quality checks\nruff check src/ \u0026\u0026 mypy src/\npytest -q -m \"not e2e\"\n```\n\n## Architecture\n\n```\n┌─────────────────────────────────────────────────────────────────────────────┐\n│                         Where Supacrawl Fits                                │\n├─────────────────┬─────────────────┬─────────────────┬───────────────────────┤\n│   Collection    │   Processing    │    Storage      │        Query          │\n├─────────────────┼─────────────────┼─────────────────┼───────────────────────┤\n│                 │                 │                 │                       │\n│   supacrawl ────┼──► ragify ──────┼──► Qdrant ──────┼──► Claude Code        │\n│                 │    LangChain    │    Chroma       │    Custom Agents      │\n│   • scrape      │    LlamaIndex   │    Pinecone     │    RAG Apps           │\n│   • crawl       │                 │    Weaviate     │                       │\n│   • search      │   • chunk       │                 │                       │\n│   • extract     │   • embed       │   • store       │   • retrieve          │\n│                 │                 │   • index       │   • generate          │\n│                 │                 │                 │                       │\n└─────────────────┴─────────────────┴─────────────────┴───────────────────────┘\n\nSupacrawl does one thing well: get clean markdown from the web.\n```\n\n## Comparison\n\n|                          | Supacrawl                           | crawl4ai                 | Firecrawl (self-hosted)        | Firecrawl (cloud) |\n| ------------------------ | ----------------------------------- | ------------------------ | ------------------------------ | ----------------- |\n| **Infrastructure**       | `pip install`                       | `pip install`            | Docker + PostgreSQL + Redis    | Hosted API        |\n| **MCP Server**           | Built-in (`[mcp]` extra)            | Not included             | Not included                   | Yes               |\n| **Web Search**           | Built-in (6 providers with fallback)| Not included             | Via SearXNG                    | Yes               |\n| **LLM Providers**        | Ollama, OpenAI, Anthropic           | Any via LiteLLM          | OpenAI (Ollama experimental)   | OpenAI            |\n| **Intelligent Crawling** | Yes (agent command)                 | Yes (adaptive crawling)  | No                             | Yes (/agent)      |\n| **Stealth/Anti-bot**     | Yes (3-tier: Patchright + Camoufox) | Yes (undetected browser) | No (Fire-engine is cloud-only) | Yes (Fire-engine) |\n| **PDF Parsing**          | Yes (text + OCR)                    | No                       | No                             | No                |\n| **CAPTCHA Solving**      | Yes (2Captcha)                      | Optional (CapSolver)     | No                             | No                |\n| **Caching**              | Local files                         | Built-in                 | PostgreSQL                     | Managed           |\n| **Licence**              | MIT                                 | Apache-2.0               | AGPL-3.0                       | AGPL-3.0          |\n| **Cost**                 | Free                                | Free                     | Free                           | Pay-per-use       |\n\n**Supacrawl** is minimal and focused. **crawl4ai** is a feature-rich framework with adaptive crawling and chunking. **Firecrawl** is an API server for applications needing a scraping backend.\n\n## Licence\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpoodle64%2Fsupacrawl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpoodle64%2Fsupacrawl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpoodle64%2Fsupacrawl/lists"}