{"id":46192627,"url":"https://github.com/ceoimperiumprojects/imperium-crawl","last_synced_at":"2026-04-29T23:13:10.879Z","repository":{"id":341490243,"uuid":"1170296343","full_name":"ceoimperiumprojects/imperium-crawl","owner":"ceoimperiumprojects","description":"The most powerful open-source CLI toolkit for web scraping. 28 tools — stealth, ARIA snapshots, AI extraction, API discovery, YouTube, Reddit, Instagram, RSS, media download, session encryption. Zero API keys.","archived":false,"fork":false,"pushed_at":"2026-03-06T23:04:45.000Z","size":625,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-07T12:47:10.637Z","etag":null,"topics":["anti-bot","api-discovery","brave-search","captcha-solver","crawling","download","instagram","media-downloader","open-source","reddit","rss","scraper","social-media","stealth","typescript","web-scraping","websocket","youtube"],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ceoimperiumprojects.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-02T00:51:38.000Z","updated_at":"2026-03-06T23:06:24.000Z","dependencies_parsed_at":"2026-03-08T07:00:34.611Z","dependency_job_id":null,"html_url":"https://github.com/ceoimperiumprojects/imperium-crawl","commit_stats":null,"previous_names":["ceoimperiumprojects/imperium-crawl"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/ceoimperiumprojects/imperium-crawl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceoimperiumprojects%2Fimperium-crawl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceoimperiumprojects%2Fimperium-crawl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceoimperiumprojects%2Fimperium-crawl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceoimperiumprojects%2Fimperium-crawl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ceoimperiumprojects","download_url":"https://codeload.github.com/ceoimperiumprojects/imperium-crawl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceoimperiumprojects%2Fimperium-crawl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30248480,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-08T05:41:50.788Z","status":"ssl_error","status_checked_at":"2026-03-08T05:41:39.075Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anti-bot","api-discovery","brave-search","captcha-solver","crawling","download","instagram","media-downloader","open-source","reddit","rss","scraper","social-media","stealth","typescript","web-scraping","websocket","youtube"],"created_at":"2026-03-03T01:08:13.534Z","updated_at":"2026-04-29T23:13:10.866Z","avatar_url":"https://github.com/ceoimperiumprojects.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"assets/hero-banner.png\" alt=\"imperium-crawl — 3-level auto-escalating stealth engine\" width=\"800\" /\u003e\n\n# imperium-crawl\n\n**The most powerful open-source CLI tool for web scraping, crawling, and data extraction.**\n\n39 tools. Zero API keys required. One `npx` command.\n\n[![npm version](https://img.shields.io/npm/v/imperium-crawl.svg)](https://www.npmjs.com/package/imperium-crawl)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)\n[![Tests](https://img.shields.io/badge/tests-580%20passing-brightgreen.svg)]()\n[![npm downloads](https://img.shields.io/npm/dm/imperium-crawl.svg)](https://www.npmjs.com/package/imperium-crawl)\n\n\u003c/div\u003e\n\n---\n\n## What's new in 2.5.1\n\n**Browser-based image extraction overhaul** — 100% coverage on any website:\n\n- **Full browser rendering (L3)** for image discovery — JavaScript, lazy-load, shadow DOM, same-origin iframes\n- **7 image sources**: `\u003cimg\u003e`, `\u003cpicture\u003e`, CSS `background-image`, shadow DOM, JSON-LD, inline scripts, iframes\n- **Precise targeting**: `--selector`, `--index`, `--alt-match`, `--min-width`, `--max-width`\n- **Auto-click** \"Load more\" / \"Gallery\" buttons with multilingual keyword matching\n- **Referer injection** fixes 403 errors on image CDN anti-hotlink protection\n- **New `auto_click` action** in `interact` tool for standalone browser automation\n\n```bash\n# Download ALL images from any page (100% coverage)\nimperium-crawl download \u003curl\u003e --images --output ./slike\n\n# Target exactly the 3rd image\nimperium-crawl download \u003curl\u003e --images --index 3\n\n# Auto-click \"Prikaži više\" + scan iframes\nimperium-crawl download \u003curl\u003e --images --auto-click --iframe-scan\n```\n\nSee [CHANGELOG.md](./CHANGELOG.md) for the full release notes.\n\n---\n\n## Quick Start\n\nGet running in 30 seconds.\n\n**CLI** (zero install):\n\n```bash\nnpx -y imperium-crawl scrape --url https://example.com\n```\n\n**Global install:**\n\n```bash\nnpm install -g imperium-crawl\n```\n\n**Install from a local tarball** (e.g. pre-release testing):\n\n```bash\nnpm install -g ./imperium-crawl-2.5.2.tgz\n```\n\n\u003e That's it. 33 of 39 tools work with zero API keys. Add optional keys later to unlock search, AI extraction, and CAPTCHA solving.\n\n---\n\n## Power Examples\n\nReal results. Copy-paste and try.\n\n### Scrape through Cloudflare\n\n```bash\nimperium-crawl scrape --url https://blog.cloudflare.com\n```\n\n```\nLevel 1 (headers) → blocked\nLevel 2 (TLS fingerprint) → blocked\nLevel 3 (browser + stealth) → success ✅\n→ Full markdown content extracted, 213K characters\n→ Next visit: skips straight to Level 3 (learned)\n```\n\n### Discover hidden APIs on any website\n\n```bash\nimperium-crawl discover-apis --url https://weather.com\n```\n\n```\nFound 11 hidden API endpoints:\n  • api.weather.com — main weather API (exposed API key!)\n  • mParticle analytics endpoints\n  • Taboola content recommendation API\n  • OneTrust consent management API\n  • DAA/AdChoices opt-out endpoints\n→ Call any endpoint directly with query_api — 10x faster than DOM scraping\n```\n\n### AI extraction in plain English\n\n```bash\nimperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \\\n  --schema \"extract product name, price, rating, and review count\"\n```\n\n```json\n{\n  \"product_name\": \"Apple AirPods Pro 2\",\n  \"price\": \"$189.99\",\n  \"rating\": \"4.7 out of 5\",\n  \"review_count\": \"45,297\"\n}\n```\n\n### Extract ALL images from any page (100% coverage)\n\n```bash\nimperium-crawl download https://www.njuskalo.hr/nekretnine/stan-Zagreb --images --output ./slike\n```\n\n```\nDiscovered 23 unique images\n  ✅ njuskalo.hr-001.jpg — 142KB\n  ✅ njuskalo.hr-002.jpg — 89KB\n  ✅ njuskalo.hr-003.jpg — 256KB\n→ 23/23 downloaded. Total: 4.2MB\n```\n\n**Target a specific image:**\n\n```bash\nimperium-crawl download https://olx.ba/artikal/12345 \\\n  --images --selector \"img.gallery-main\" --output ./oglas.jpg\n```\n\n**Auto-click \"Load more\" + iframe scan:**\n\n```bash\nimperium-crawl download https://www.leboncoin.fr/ad/12345 \\\n  --images --auto-click --iframe-scan --limit 50\n```\n\n### Batch scrape with resume\n\n```bash\nimperium-crawl batch-scrape \\\n  --urls '[\"https://bbc.com\",\"https://cnn.com\",\"https://reuters.com\",\"https://techcrunch.com\"]' \\\n  --concurrency 3\n```\n\n```\nScraping 4 URLs (concurrency: 3)...\n  ✅ bbc.com — 47K chars\n  ✅ cnn.com — 52K chars\n  ✅ reuters.com — 38K chars\n  ✅ techcrunch.com — 61K chars\n→ 4/4 succeeded. Job ID: abc123 (resume with --job-id if interrupted)\n```\n\n---\n\n## Why imperium-crawl?\n\n🔓 **Zero API Keys Required**\n33 of 39 tools work out of the box. No accounts, no tokens, no credit cards. Just `npx` and go.\n\n🛡️ **3-Level Auto-Escalating Stealth**\nHeaders → TLS fingerprinting → headless browser + CAPTCHA solving. Automatically escalates until it gets through.\n\n🧠 **Self-Improving**\nAdaptive learning engine remembers what works per domain. Second visit is 3x faster. The more you use it, the smarter it gets.\n\n🧰 **39 Tools, 2 Modes**\nCLI tool or interactive TUI. Scraping, crawling, search, extraction, API discovery, WebSocket monitoring, browser automation, batch processing.\n\n📜 **14 Built-in Recipes**\nPre-built workflows for common tasks — news extraction, e-commerce scraping, API reverse engineering, and more.\n\n⚡ **Skills System**\nTeach it once, run forever. Auto-detect patterns on any page, save as reusable skills, get fresh data on demand.\n\n---\n\n## vs. The Competition\n\n| Feature | **imperium-crawl** | Firecrawl | Crawl4AI | Browserbase | Puppeteer |\n|---------|:------------------:|:---------:|:--------:|:-----------:|:---------:|\n| Price | **Free forever** | $19+/month | Free | $0.01/min | Free |\n| Total tools | **39** | 5 | 2 | 4 | N/A |\n| Stealth levels | **3 (auto-escalate)** | Cloud-based | 1 | Cloud-based | None |\n| Anti-bot detection | **7 systems** | Partial | Partial | Partial | None |\n| TLS fingerprinting | **JA3/JA4** | No | No | No | No |\n| CAPTCHA auto-solving | **Yes** | No | No | No | No |\n| API discovery | **Yes** | No | No | No | No |\n| WebSocket monitoring | **Yes** | No | No | No | No |\n| AI-powered extraction | **Yes** | No | No | No | No |\n| Adaptive learning | **Yes** | No | No | No | No |\n| Batch processing | **Yes** | No | No | No | No |\n| ARIA Snapshots | **Yes** | No | No | No | No |\n| Session Encryption | **Yes** | No | No | No | No |\n| Self-hosted | **Yes** | No | Yes | No | Yes |\n| Requires external service | **No** | Yes | No | Yes | No |\n\n---\n\n## Stealth Engine\n\n```\nRequest → [L1: Headers + UA rotation]\n              │\n              ├─ success → Done\n              ↓ fail\n          [L2: TLS Fingerprint (JA3/JA4)]\n              │\n              ├─ success → Done\n              ↓ fail\n          [L3: Browser + Fingerprint Injection + CAPTCHA]\n              │\n              ├─ success → Done\n              ↓\n          [Learning Engine records optimal level for next time]\n```\n\n### Stealth Levels\n\n| Level | Method | What It Defeats |\n|-------|--------|-----------------|\n| **1** | `header-generator` — Bayesian realistic headers + UA rotation | Basic bot detection, simple WAFs |\n| **2** | `impit` — browser-identical TLS fingerprints (JA3/JA4) | Cloudflare, Akamai, TLS fingerprinting WAFs |\n| **3** | `rebrowser-playwright` + `fingerprint-injector` + auto CAPTCHA | JavaScript challenges, SPAs, advanced anti-bot, CAPTCHAs |\n\n### Anti-Bot System Detection\n\nAutomatically identifies which anti-bot system a site uses and chooses the optimal strategy:\n\n| System | Detection Method |\n|--------|-----------------|\n| **Cloudflare** | `cf_clearance` cookies, `cf-mitigated` header, challenge page title |\n| **Akamai** | `_abck`, `bm_sz` cookies |\n| **PerimeterX / HUMAN** | `_px` cookies, `_pxhd` headers |\n| **DataDome** | `datadome` cookies, `datadome` response header |\n| **Kasada** | `x-kpsdk-*` headers |\n| **AWS WAF** | `aws-waf-token` cookie |\n| **F5 / Shape Security** | `TS` prefix cookies |\n\n### Smart Rendering Cache\n\nOnce imperium-crawl determines a domain needs Level 3 (browser), it caches that decision for 1 hour. Subsequent requests to the same domain skip straight to browser rendering — no wasted time on failed lower levels.\n\n---\n\n## Adaptive Learning Engine\n\nimperium-crawl **learns from every request** and gets smarter over time. No configuration needed — fully automatic.\n\nEvery time you scrape a website, the engine records which stealth level worked, which anti-bot system was detected, whether a proxy was needed, response timing, and success/failure. Next time you hit the same domain, it **predicts the optimal configuration** — skipping failed levels and going straight to what works.\n\n```\nFirst visit to cloudflare.com:\n  Level 1 → blocked ❌\n  Level 2 → blocked ❌\n  Level 3 → success ✅ (Cloudflare detected)\n  → Engine records: cloudflare.com needs Level 3\n\nSecond visit to cloudflare.com:\n  → Engine predicts: Level 3, confidence 85%, Cloudflare\n  → Skips Level 1 and 2 entirely — goes straight to browser\n  → 3x faster than first visit\n```\n\n### Smart Features\n\n- **Time decay** — Knowledge older than 7 days loses weight, adapts when sites change defenses\n- **Confidence scoring** — Low data = start from level 1. High confidence = skip to optimal level\n- **Auto-prune** — Domains unused for 30 days are cleaned up. Max 2,000 domains stored\n- **Atomic persistence** — Knowledge saved via atomic write (tmp → rename). Never corrupts\n\n\u003e **The more you use it, the faster it gets.**\n\n---\n\n## All 39 Tools\n\n### 📄 Scraping (no API key needed)\n\n| Tool | What It Does |\n|------|-------------|\n| **scrape** | URL to clean Markdown/HTML with 3-level auto-escalating stealth. Structured data (JSON-LD, OpenGraph, Microdata), metadata, and links. |\n| **crawl** | Priority-based crawling with depth control, concurrency limiting, and smart URL scoring. |\n| **map** | Discover all URLs on a domain via sitemap.xml + page link extraction. |\n| **extract** | CSS selectors to structured JSON. Point at any repeating pattern and get clean data. |\n| **readability** | Mozilla Readability article extraction — title, author, content, publish date. |\n| **screenshot** | Full-page or viewport PNG screenshots via headless Chromium. |\n\n### 🔍 Search (requires free Brave API key)\n\n| Tool | What It Does |\n|------|-------------|\n| **search** | Web search via Brave Search API. |\n| **news_search** | News-specific search with freshness ranking. |\n| **image_search** | Image search with thumbnails and source URLs. |\n| **video_search** | Video search across platforms. |\n\n### ⚡ Skills (no API key needed)\n\n| Tool | What It Does |\n|------|-------------|\n| **create_skill** | Analyze any page, auto-detect repeating patterns, generate CSS selectors, save as reusable skill. |\n| **run_skill** | Run a saved skill for fresh structured data. Supports pagination. |\n| **list_skills** | List all saved skills with configurations. |\n\n### 🔓 API Discovery \u0026 Real-Time (no API key needed, requires Playwright)\n\n| Tool | What It Does |\n|------|-------------|\n| **discover_apis** | Navigate to any page, intercept XHR/fetch calls, map hidden REST/GraphQL endpoints. Auto-detects GraphQL, filters noise, returns response previews. |\n| **query_api** | Call any API endpoint directly with stealth headers. Bypass DOM rendering for 10x faster data access. |\n| **monitor_websocket** | Capture real-time WebSocket messages — financial tickers, chat feeds, live dashboards. |\n\n### 🧠 AI Extraction (requires LLM API key)\n\n| Tool | What It Does |\n|------|-------------|\n| **ai_extract** | Describe what you want in natural language or JSON schema. 3 providers (Anthropic, OpenAI, MiniMax). The `extract` tool also supports `llm_fallback: true` for hybrid CSS→AI extraction. |\n\n### 🖱️ Interaction (no API key needed, requires Playwright)\n\n| Tool | What It Does |\n|------|-------------|\n| **interact** | Browser automation with 20 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate, drag, upload, storage, cookies, pdf, auth_login, refresh, **auto_click**). Ref targeting via ARIA snapshot, session encryption, action policy, domain filter, network interception, device emulation. **auto_click** finds and clicks \"load more\" / \"gallery\" buttons with multilingual keyword matching. |\n| **snapshot** | ARIA-based page snapshot with interactive element refs. Use refs in interact for precise targeting. Annotated screenshots. |\n\n### 📱 Social Media (no API key needed)\n\n| Tool | What It Does |\n|------|-------------|\n| **youtube** | Search videos, get video details, comments, transcripts, chapters, and channel info. Parses `ytInitialData` — no API key needed. Add `OPENAI_API_KEY` to unlock Whisper AI transcription for videos without captions. |\n| **reddit** | Search Reddit, browse subreddits, get posts and comments via Reddit's public JSON API. |\n| **instagram** | Search profiles, get detailed profile info with engagement metrics, and discover influencers by niche/location. Search/discover require `BRAVE_API_KEY`. |\n\n### 📥 Media \u0026 Feeds (no API key needed)\n\n| Tool | What It Does |\n|------|-------------|\n| **download** | Download media files from any URL — images, video, YouTube, TikTok, bulk. **v2.5.1**: Browser-based image extraction with 100% coverage (lazy-load, shadow DOM, iframes, JSON-LD, CSS backgrounds). Target specific images via `--selector`, `--index`, `--alt-match`. Auto-click \"load more\" buttons. Referer injection fixes 403 on CDNs. |\n| **batch_download** | Download multiple files (PDFs, images, documents) in parallel with session cookie support. Uses L1 HTTP fetch — 10x faster than browser-based downloads. Ideal for bulk file retrieval from authenticated sessions. |\n| **rss** | Fetch and parse RSS/Atom feeds. Filter by date, output as JSON or Markdown. |\n\n### 📦 Batch Processing (no API key needed)\n\n| Tool | What It Does |\n|------|-------------|\n| **batch_scrape** | Parallel URL scraping with configurable concurrency, soft failure, and resume via job_id. Optional AI extraction per URL. |\n| **list_jobs** | List all batch jobs with status and progress. |\n| **job_status** | Full results for a specific batch job including per-URL outcomes. |\n| **delete_job** | Clean up completed or failed batch jobs. |\n\n### 🧠 Knowledge Engine (no API key needed)\n\n| Tool | What It Does |\n|------|-------------|\n| **knowledge** | Dump adaptive knowledge engine stats — per-domain success rates, optimal stealth levels, anti-bot detection history, rate limits. Use to debug scraping issues and understand problematic domains. |\n\n### 📄 Documents (no API key needed)\n\n| Tool | What It Does |\n|------|-------------|\n| **pdf_extract** | Extract text, pages, tables, and metadata from a local or remote PDF. Native text-layer strategy via `pdfjs-dist`. OCR + Claude Vision fallbacks deferred to v2.6.0. Use for sustainability reports, invoices, regulatory PDFs. |\n\n```bash\nimperium-crawl pdf-extract --input ./report.pdf --output ./extracted.json\nimperium-crawl pdf-extract --input https://example.com/report.pdf --max-pages 20\n```\n\n### 👀 Change Tracking (no API key needed)\n\n| Tool | What It Does |\n|------|-------------|\n| **watch** | One-shot change detector: scrape a URL, hash its content (readability / markdown / full), compare against the last snapshot, fire a webhook on change. Pair with cron for periodic monitoring. |\n| **monitor** | Portfolio-level change tracker across many URLs grouped by topic. Reads a JSON config, runs `watch` on each URL, emits a markdown digest filtered by minimum change percentage. |\n\n```bash\n# Watch a single URL — run periodically via cron\nimperium-crawl watch --url https://carbonchain.com/pricing \\\n  --output-dir ./data/watch \\\n  --webhook https://hooks.example.com/on-change\n\n# Monitor many URLs grouped by topic, emit a daily digest\nimperium-crawl monitor --config ./monitor.json --output-dir ./data/monitor\n```\n\n`monitor.json`:\n```json\n{\n  \"topics\": [\n    {\n      \"name\": \"Competitor pricing\",\n      \"urls\": [\"https://carbonchain.com/pricing\", \"https://spherasolutions.com/cbam\"]\n    }\n  ]\n}\n```\n\n### 🔁 Imperium Flows (no API key needed; browser workflows may require Playwright)\n\n| Tool | What It Does |\n|------|-------------|\n| **record_flow** | Record a headed browser workflow as a generic flow family/variant. Stores smart selector metadata and reusable input placeholders. |\n| **run_flow** | Run a saved flow with runtime JSON input, CAPTCHA policy, browser mode, and evidence collection. |\n| **serve_flow** | Expose saved flows through a local HTTP API. Requires bearer auth when bound publicly. |\n| **list_flows** | List project-local and global flow definitions. |\n| **inspect_flow** | Inspect a saved flow JSON definition. |\n| **validate_flow** | Validate a flow schema and report inputs, steps, and storage path. |\n\n```bash\nimperium-crawl record-flow --family generic-search --variant site-a --url https://example.com\nimperium-crawl run-flow generic-search/site-a --input '{\"query\":\"example\"}'\nimperium-crawl serve-flow generic-search --port 8787\n```\n\n---\n\n## Setup\n\n### API Keys\n\n| Key | What It Unlocks | Where to Get It |\n|-----|----------------|-----------------|\n| `BRAVE_API_KEY` | 4 search tools (web, news, image, video) | [brave.com/search/api](https://brave.com/search/api/) (free tier available) |\n| `TWOCAPTCHA_API_KEY` | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | [2captcha.com](https://2captcha.com/) |\n| `LLM_API_KEY` | AI-powered data extraction (`ai_extract` tool) | Anthropic, OpenAI, or MiniMax API key |\n| `OPENAI_API_KEY` | Whisper AI transcription — transcribe any YouTube video, even without captions | [platform.openai.com](https://platform.openai.com/) |\n| `CHROME_PROFILE_PATH` | Authenticated browser sessions (use your Chrome cookies) | Path to Chrome user data dir |\n| `PROXY_URL` | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |\n\n### Enable Full Stealth (Level 3)\n\n```bash\nnpm i rebrowser-playwright\nnpx playwright install chromium\n```\n\n---\n\n## CLI Usage\n\n**With subcommand** = runs that tool. **No args in TTY** = interactive TUI. **No args in pipe** = shows help.\n\n```bash\n# Scrape a website to markdown\nimperium-crawl scrape --url https://bbc.com/news\n\n# Crawl with depth control\nimperium-crawl crawl --url https://blog.cloudflare.com --max-depth 2 --max-pages 5\n\n# AI-powered extraction — plain English\nimperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \\\n  --schema \"extract product name, price, rating, and review count\"\n\n# Discover hidden APIs\nimperium-crawl discover-apis --url https://weather.com\n\n# Batch scrape in parallel\nimperium-crawl batch-scrape --urls '[\"https://site1.com\",\"https://site2.com\"]' --concurrency 3\n\n# Interactive setup wizard\nimperium-crawl setup\n```\n\n### Output Formats\n\n```bash\nimperium-crawl scrape --url https://example.com                          # JSON (default)\nimperium-crawl scrape --url https://example.com --output-format markdown  # Markdown\nimperium-crawl scrape --url https://example.com --output-format csv       # CSV\nimperium-crawl scrape --url https://example.com --pretty                  # Pretty JSON\nimperium-crawl scrape --url https://example.com --output result.json      # Write to file\n```\n\n### TUI Mode\n\n```bash\nimperium-crawl tui\n```\n\nInteractive slash-command terminal with parameter prompts, table rendering, markdown display, and session state. Use `/save` to export results and `/again` to re-run the last command.\n\n### Explore REPL\n\nInteractively explore a site in a headed browser, then save the session as a reusable skill:\n\n```bash\nimperium-crawl explore https://example.com\n```\n\n```\n\u003e navigate https://example.com/login\n\u003e type \"#email\" \"user@example.com\"\n\u003e type \"#password\" \"{{env:MY_PASSWORD}}\"\n\u003e click \"#submit\"\n\u003e snapshot\n\u003e save-skill my-login\n✅ Saved skill: my-login (4 actions, 1 parameter detected)\n```\n\nCommands: `navigate`, `click`, `type`, `select`, `hover`, `press`, `scroll`, `wait`, `screenshot`, `snapshot`, `evaluate`, `save-skill`, `history`, `undo`, `status`, `help`, `exit`\n\n---\n\n## Skills \u0026 Recipes\n\nSkills let you teach imperium-crawl how to extract data from any website, then re-run for fresh content whenever you want.\n\n**Create a skill:**\n```\ncreate_skill({\n  url: \"https://techcrunch.com/category/artificial-intelligence\",\n  name: \"tc-ai-news\",\n  description: \"Latest AI news from TechCrunch\"\n})\n```\n\n**Run a skill:**\n```\nrun_skill({ name: \"tc-ai-news\" })\n→ Returns fresh structured data with all detected fields\n```\n\nSkills are saved in `~/.imperium-crawl/skills/` as JSON files — human-readable, editable, portable.\n\n### Skill Parameters\n\nUse template variables in skills — resolved at run time:\n\n```bash\n# In skill JSON actions:\n{ \"value\": \"{{input:query}}\" }           # passed via --params or prompted\n{ \"value\": \"{{env:SITE_PASSWORD}}\" }     # from environment variable\n{ \"value\": \"{{computed:date_today}}\" }   # auto-computed (date_today, timestamp, random_string, year, month, day)\n\n# Run with params:\nimperium-crawl run-skill my-search --params '{\"query\": \"machine learning\"}'\n```\n\n### Skill Chains\n\nChain skills together — output of one step becomes input to the next:\n\n```json\n{\n  \"type\": \"chain\",\n  \"name\": \"search-and-extract\",\n  \"steps\": [\n    { \"skill\": \"search-results\", \"output\": \"search\" },\n    { \"skill\": \"extract-details\", \"input\": { \"url\": \"$search.results[0].url\" }, \"output\": \"details\" }\n  ]\n}\n```\n\nVariable syntax: `$step_name.field.nested[0]` — simple dot-path access, no eval.\n\n### Built-in Recipes\n\n| Recipe | What It Does |\n|--------|-------------|\n| `hn-top-stories` | Hacker News front page — titles, scores, comment counts |\n| `github-trending` | GitHub trending repos — stars, language, description |\n| `job-listings-greenhouse` | Greenhouse job boards — title, team, location |\n| `ecommerce-product` | Product name, price, rating, reviews, images |\n| `product-reviews` | Review text, ratings, author, date from product pages |\n| `crypto-websocket` | Live crypto prices via WebSocket monitoring |\n| `news-article-reader` | Article title, author, date, content from news sites |\n| `reddit-posts` | Subreddit posts — title, score, comments, flair |\n| `seo-page-audit` | SEO signals — meta tags, headings, structured data |\n| `social-media-mentions` | Brand mentions across social platforms |\n| `influencer-niche-discovery` | Find influencers by niche + location via Instagram |\n| `influencer-hashtag-scout` | Discover influencers through hashtag analysis |\n| `influencer-competitor-spy` | Find influencers from competitor brand mentions |\n| `influencer-content-scout` | Analyze content patterns of niche influencers |\n\nSee [`SKILL/`](./SKILL/) for detailed workflow guides and agent integration.\n\n---\n\n## API Discovery Workflow\n\nTurn any website into an API. No documentation needed.\n\n```\n1. discover_apis({ url: \"https://weather.com\" })\n   → Found 11 hidden API endpoints:\n     • Main weather API (api.weather.com) with exposed API key\n     • mParticle analytics endpoints\n     • Taboola content recommendation API\n     • OneTrust consent management API\n\n2. query_api({ url: \"https://api.weather.com/v3/...\", method: \"GET\" })\n   → Direct API call, bypasses DOM entirely — 10x faster, structured JSON\n\n3. monitor_websocket({ url: \"https://binance.com/en/trade/BTC_USDT\", duration_seconds: 10 })\n   → Captures real-time WebSocket messages — live BTC price feed\n```\n\n---\n\n## AI Agent Guide\n\nimperium-crawl ships with [`SKILL/`](./SKILL/) — a structured guide that teaches AI agents how to use all 39 tools effectively. Includes proven workflows, decision trees, error recovery, and advanced patterns.\n\n### Two Ways to Connect\n\n| Method | Setup | Works With |\n|--------|-------|-----------|\n| **CLI + SKILL/** | `npm i -g imperium-crawl` + SKILL.md in agent context | **Any agent with bash access** — Claude Code, Cursor, OpenClaw, ChatGPT, custom agents |\n| **TUI** | `imperium-crawl tui` — interactive terminal | Direct human use, demos, debugging |\n\n### Per-Agent Setup\n\n| AI Agent | How to Add SKILL/ |\n|----------|-------------------|\n| **Claude Code** | Copy `SKILL.md` to project root — auto-detected |\n| **Cursor / Windsurf** | Add `SKILL.md` to project rules or system prompt |\n| **OpenClaw / custom agents** | Include SKILL.md in system prompt or context window |\n| **ChatGPT / GPT agents** | Paste SKILL.md content into custom instructions |\n\n---\n\n## Resilience\n\n- **Exponential backoff with full jitter** — AWS-recommended retry pattern, no thundering herd\n- **Per-domain circuit breaker** — 5 failures opens circuit for 60s, then half-open probing with auto recovery\n- **URL normalization** — 11-step pipeline removes tracking params (utm_*, fbclid, gclid), sorts query params\n- **Proxy support** — single proxy or rotating pool with http/https/socks4/socks5\n- **Browser pool** — keyed by proxy URL, auto-eviction, configurable pool size\n- **robots.txt** — respected by default (configurable)\n- **Graceful shutdown** — 10s timeout on browser cleanup to prevent hung processes\n\n---\n\n## Real-World Test Results\n\nEvery tool tested against production websites with real anti-bot defenses:\n\n| Tool | Target | Result |\n|------|--------|--------|\n| 📄 **scrape** | BBC News | Full markdown, stealth level 3 auto-escalation |\n| 🕸️ **crawl** | Cloudflare Blog | 213K characters crawled with depth control |\n| 🗺️ **map** | BBC | Full URL discovery via sitemap + link extraction |\n| 🕷️ **extract** | Amazon (AirPods Pro 2) | Product title, 45,297 reviews, brand extracted |\n| 📖 **readability** | Medium article | Clean — title, author, content, publish date |\n| 📸 **screenshot** | ProductHunt | Captured Cloudflare Turnstile challenge page |\n| 🔍 **search** | Brave Web | Web results with snippets and URLs |\n| 📰 **news_search** | Brave News | News results with freshness ranking |\n| 🖼️ **image_search** | Brave Image | Images with thumbnails and source URLs |\n| 🎬 **video_search** | Brave Video | Video results across platforms |\n| 🛠️ **create_skill** | Hacker News | Auto-detected 30 stories with CSS selectors |\n| ▶️ **run_skill** | Saved skill | Fresh structured data from saved config |\n| 📋 **list_skills** | — | Lists all skills with configurations |\n| 🔓 **discover_apis** | Airbnb Paris | **34 hidden APIs** — DataDome, Google Maps key, internal APIs |\n| ⚡ **query_api** | jsonplaceholder | Direct JSON API call with stealth headers |\n| 📡 **monitor_websocket** | Binance BTC/USDT | 3 WebSocket connections, 23 live messages — BTC price live |\n| 🧠 **ai_extract** | Amazon product | AI extracted name, price, rating, review count |\n| 🎯 **snapshot** | GitHub, Wikipedia | ARIA tree with 107/113 refs, annotated screenshots |\n| 🖱️ **interact** | Login flow | Click → type → submit — ref targeting, session encryption, 18 action types |\n| 📦 **batch_scrape** | 10 news sites | Parallel, concurrency 3, soft failure, 9/10 succeeded |\n| 📋 **list_jobs** | — | Batch jobs with status and progress |\n| 📊 **job_status** | Batch job | Full per-URL results with timing |\n| 🗑️ **delete_job** | Completed job | Cleaned up job data from disk |\n| 🧠 **knowledge** | Local knowledge file | Per-domain stats: stealth levels, success rates, anti-bot systems detected |\n| 🎬 **youtube** | \"web scraping tutorial\" | Search results, video details, comments, transcripts — no API key |\n| 💬 **reddit** | r/webscraping | Subreddit posts, comments, search — public JSON API |\n| 📸 **instagram** | @nike profile | Profile details, engagement rate, recent posts — internal API |\n| 📥 **download** | YouTube video, web page images | Auto-detect URL type, download media files — images, video, og:image |\n| 📡 **rss** | Hacker News RSS | Parsed feed items with title, link, date, author, categories |\n\n\u003e **39 tools. 34 hidden APIs on Airbnb. Live BTC feed. Reusable browser flows. Zero API keys for scraping.**\n\n---\n\n## Environment Variables\n\n| Variable | Required | Description |\n|----------|----------|-------------|\n| `BRAVE_API_KEY` | No | Brave Search API key (enables 4 search tools) |\n| `TWOCAPTCHA_API_KEY` | No | 2Captcha API key (enables auto CAPTCHA solving) |\n| `LLM_API_KEY` | No | Anthropic, OpenAI, or MiniMax API key (enables `ai_extract`) |\n| `LLM_PROVIDER` | No | `anthropic`, `openai`, or `minimax` (default: `anthropic`). **Recommended: `minimax` with MiniMax-M1** — best price/performance for extraction |\n| `LLM_MODEL` | No | Override default LLM model |\n| `OPENAI_API_KEY` | No | OpenAI API key for Whisper transcription (transcribe any YouTube video without captions) |\n| `SESSION_ENCRYPTION_KEY` | No | 32-byte hex key for encrypting session files at rest |\n| `PROXY_URL` | No | Single proxy URL (http/https/socks4/socks5) |\n| `PROXY_URLS` | No | Comma-separated proxy URLs for rotation |\n| `BROWSER_POOL_SIZE` | No | Max pooled browser instances (default: 3) |\n| `RESPECT_ROBOTS` | No | Respect robots.txt (default: `true`) |\n| `CHROME_PROFILE_PATH` | No | Chrome user data dir for authenticated sessions |\n| `NO_COLOR` | No | Disable colored output |\n| `CI` | No | Auto-detected; disables TTY features |\n\n---\n\n## Development\n\n```bash\ngit clone https://github.com/ceoimperiumprojects/imperium-crawl\ncd imperium-crawl\nnpm install\nnpm run build\nnpm run dev         # Watch mode (rebuild on changes)\nnpm test            # 546 tests\nnpm start           # Start CLI (shows help or TUI)\n```\n\n---\n\n## Contributing\n\nContributions welcome! Whether it's a bug fix, new tool, or documentation improvement — open an issue or PR.\n\n```bash\n# Fork the repo, then:\ngit clone https://github.com/YOUR_USERNAME/imperium-crawl\ncd imperium-crawl\nnpm install\ngit checkout -b my-feature\n# Make changes...\nnpm test\ngit push origin my-feature\n# Open a PR\n```\n\n---\n\n## License\n\nMIT — use it however you want. Free forever.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceoimperiumprojects%2Fimperium-crawl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fceoimperiumprojects%2Fimperium-crawl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceoimperiumprojects%2Fimperium-crawl/lists"}