{"id":50714615,"url":"https://github.com/alex-on-ai/WebReaper","last_synced_at":"2026-06-14T22:01:12.945Z","repository":{"id":61167329,"uuid":"480985506","full_name":"alex-on-ai/WebReaper","owner":"alex-on-ai","description":"AI-native web scraper. Single binary with a bundled Claude Code skill. MIT-licensed alternative to Firecrawl.","archived":false,"fork":false,"pushed_at":"2026-06-06T12:26:35.000Z","size":42262,"stargazers_count":135,"open_issues_count":1,"forks_count":32,"subscribers_count":3,"default_branch":"master","last_synced_at":"2026-06-06T14:25:03.390Z","etag":null,"topics":["ai-agents-automation","claude-code","crawler","dotnet","firecrawl-alternative","llm","markdown","mcp","parser","parsing","scraper","scraping","scraping-api","scraping-web","scraping-websites","webcrawler","webscraping"],"latest_commit_sha":null,"homepage":"https://webreaper.ai","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alex-on-ai.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-04-12T21:59:25.000Z","updated_at":"2026-06-06T12:26:39.000Z","dependencies_parsed_at":"2024-10-29T18:50:01.287Z","dependency_job_id":null,"html_url":"https://github.com/alex-on-ai/WebReaper","commit_stats":{"total_commits":525,"total_committers":4,"mean_commits":131.25,"dds":"0.013333333333333308","last_synced_commit":"a2efb8f67a94b5931f4e9fa8b7978d3f0ed2e656"},"previous_names":["pavlovtech/exoscraper","pavlovtech/exoscan","alex-on-ai/webreaper"],"tags_count":16,"template":false,"template_full_name":null,"purl":"pkg:github/alex-on-ai/WebReaper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alex-on-ai%2FWebReaper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alex-on-ai%2FWebReaper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alex-on-ai%2FWebReaper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alex-on-ai%2FWebReaper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alex-on-ai","download_url":"https://codeload.github.com/alex-on-ai/WebReaper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alex-on-ai%2FWebReaper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34339195,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-14T02:00:07.365Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents-automation","claude-code","crawler","dotnet","firecrawl-alternative","llm","markdown","mcp","parser","parsing","scraper","scraping","scraping-api","scraping-web","scraping-websites","webcrawler","webscraping"],"created_at":"2026-06-09T18:00:32.662Z","updated_at":"2026-06-14T22:01:12.932Z","avatar_url":"https://github.com/alex-on-ai.png","language":"C#","funding_links":[],"categories":["C\\#","C#"],"sub_categories":[],"readme":"![logo](https://user-images.githubusercontent.com/6662454/221978697-3f35564a-f442-46e6-9182-f2604a17e1f6.png)\n\n# WebReaper\n\n[![NuGet](https://img.shields.io/nuget/v/WebReaper)](https://www.nuget.org/packages/WebReaper)\n[![CI](https://github.com/alex-on-ai/WebReaper/actions/workflows/CI.yml/badge.svg)](https://github.com/alex-on-ai/WebReaper/actions/workflows/CI.yml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE.txt)\n\nAI-native web scraper. Single binary with a bundled Claude Code skill.\n\n## Install\n\n**macOS / Linux (Homebrew):**\n\n```bash\nbrew install alex-on-ai/webreaper/webreaper\n```\n\n**Any POSIX shell (install.sh):**\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/alex-on-ai/WebReaper/master/scripts/install.sh | sh\n```\n\n**.NET library:**\n\n```bash\ndotnet add package WebReaper\n```\n\nWindows binaries are on the [GitHub Releases page](https://github.com/alex-on-ai/WebReaper/releases/latest); `winget` and `Scoop` are on the v10.1 roadmap.\n\n**Updating:** `brew upgrade webreaper` (Homebrew), or re-run the install.sh line with `--upgrade` appended (`| sh -s -- --upgrade`), or `dotnet add package WebReaper` (library). When a newer release exists, `scrape` / `crawl` / `map` print a one-line upgrade hint on stderr (interactive terminals only, never in a pipe or CI); disable that check with `WEBREAPER_NO_UPDATE_CHECK=1`.\n\n## 30-second demo\n\n```bash\n$ webreaper scrape https://news.ycombinator.com\n# Hacker News\n\n- [Show HN: ...](https://news.ycombinator.com/item?id=...)\n- [Ask HN: ...](https://news.ycombinator.com/item?id=...)\n...\n\n$ webreaper init\nWrote WebReaper Agent Skill to .claude/skills/webreaper/SKILL.md\n\nTry it out:\n  webreaper scrape https://example.com\n  webreaper map https://example.com\n```\n\nAfter `webreaper init`, the next Claude Code session picks up the skill and routes scraping intents (*\"give me the markdown of X\"*, *\"what blog posts are on Y\"*, *\"scrape the top 5 articles\"*) to `webreaper` automatically.\n\n## Table of contents\n\n- [Why WebReaper](#why-webreaper)\n- [Quick start](#quick-start)\n- [Bot protection that just works](#bot-protection-that-just-works)\n- [Power your agent](#power-your-agent)\n- [AI features](#ai-features)\n- [Use cases](#use-cases)\n- [Packages](#packages)\n- [Compared to Firecrawl, Crawl4AI, and Crawlee](#compared-to-firecrawl-crawl4ai-and-crawlee)\n- [API overview](#api-overview)\n- [Architecture and interfaces](docs/architecture.md)\n- [Repository structure](#repository-structure)\n- [License](#license)\n\n## Why WebReaper\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd width=\"33%\" valign=\"top\"\u003e\u003cstrong\u003e🪶 Drop on PATH, run.\u003c/strong\u003e\u003cbr\u003e\u003cbr\u003eNo Docker, no Postgres, no signup. ~12 MB binary.\u003c/td\u003e\n\u003ctd width=\"33%\" valign=\"top\"\u003e\u003cstrong\u003e🤖 AI-native by composition.\u003c/strong\u003e\u003cbr\u003e\u003cbr\u003eMarkdown by default. Schema extraction, LLM fallback, self-healing selectors, autonomous agent. Stack with \u003ccode\u003e.With…()\u003c/code\u003e.\u003c/td\u003e\n\u003ctd width=\"33%\" valign=\"top\"\u003e\u003cstrong\u003e🔌 Bring any LLM.\u003c/strong\u003e\u003cbr\u003e\u003cbr\u003eOpenAI, Anthropic, Ollama, Azure OpenAI, llamafile, via \u003ccode\u003eMicrosoft.Extensions.AI\u003c/code\u003e.\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd width=\"33%\" valign=\"top\"\u003e\u003cstrong\u003e🛡 Bot-checks handled automatically.\u003c/strong\u003e\u003cbr\u003e\u003cbr\u003eA blocked page climbs HTTP → browser → stealth on its own, per page and host-sticky. Challenge pages are dropped, never returned as data. No flag needed for the browser fallback.\u003c/td\u003e\n\u003ctd width=\"33%\" valign=\"top\"\u003e\u003cstrong\u003e📡 Distributed when needed.\u003c/strong\u003e\u003cbr\u003e\u003cbr\u003eSwap scheduler, tracker, sink to Redis, MongoDB, SQLite, Azure Service Bus, Cosmos. Same code.\u003c/td\u003e\n\u003ctd width=\"33%\" valign=\"top\"\u003e\u003cstrong\u003e📜 MIT, not AGPL.\u003c/strong\u003e\u003cbr\u003e\u003cbr\u003eEmbed in commercial software, fork, modify, redistribute. Firecrawl's AGPL requires open-sourcing your service or paying for a commercial license.\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n## Quick start\n\n### CLI\n\n```bash\n# One page as Markdown\nwebreaper scrape https://example.com\n\n# Save Markdown to a file\nwebreaper scrape https://example.com --output page.md\n\n# Discover URLs on a site\nwebreaper map https://example.com --search /blog/ --max-urls 50\n\n# Crawl a whole site recursively (every on-domain page) to JSON Lines\nwebreaper crawl https://example.com \u003e pages.jsonl\n\n# Structured fields with a JSON schema (output: JSON; multi-page: JSON Lines)\nwebreaper scrape https://example.com --schema schema.json\n\n# Schema-free extraction with an LLM (bring your own OpenAI-compatible endpoint)\nwebreaper scrape https://example.com --prompt \"title and author\" \\\n  --model gpt-4o-mini --llm-url https://api.openai.com/v1\n\n# Whole site, fields, cheaply: infer a schema once, then extract the rest\nwebreaper crawl https://example.com --infer \"product name and price\" \\\n  --model gpt-4o-mini --llm-url https://api.openai.com/v1 --output-dir ./out\n\n# JS-rendered single-page app\nwebreaper scrape https://example.com --browser\n\n# Bot-protected site: a plain scrape already auto-climbs HTTP -\u003e browser on a\n# block; --stealth starts at a stealth backend (--auto-stealth = no prompt, for CI)\nwebreaper scrape https://example.com --stealth\n\n# Install the Claude Code skill\nwebreaper init\n```\n\nThe CLI is built Native-AOT (ADR-0043), ships as a single binary on every tagged GitHub release across six RIDs (`linux-x64`, `linux-arm64`, `osx-x64`, `osx-arm64`, `win-x64`, `win-arm64`), and is block-aware with automatic browser/stealth escalation (ADR-0083). The macOS binaries are Apple codesigned and notarized (ADR-0071); Homebrew installs run without Gatekeeper warnings on a clean machine.\n\n### Library\n\n```csharp\nusing WebReaper.Builders;\n\nvar engine = await ScraperEngineBuilder\n    .Crawl(\"https://news.ycombinator.com\")\n    .AsMarkdown()\n    .WriteToConsole()\n    .BuildAsync();\n\nawait engine.RunAsync();\n```\n\nThat is HTTP-only, no extra packages, no schema. For structured fields, swap `AsMarkdown()` for `Extract(schema)`; for JS-rendered pages, `Crawl` for `CrawlWithBrowser` plus a transport satellite ([`WebReaper.Playwright`](#packages) or [`WebReaper.Cdp`](#packages)). The full surface is in the [API overview](#api-overview).\n\n## Bot protection that just works\n\nMost scrapers make you opt into a browser up front and guess when a site is blocking you. WebReaper detects the block and escalates on its own, one page at a time:\n\n```\nHTTP  -\u003e  browser (Chromium)  -\u003e  stealth (CloakBrowser)\n          (climbs only when a page actually looks blocked)\n```\n\n- **Automatic.** A plain `scrape` or `crawl` starts on a fast HTTP fetch and climbs to a real browser only when a page looks blocked (a challenge status, response header, or body marker). No flag needed for the browser fallback.\n- **Per page, host-sticky.** The first confirmed block on a host lifts that host's floor, so the rest of a whole-site crawl starts at the working tier instead of re-paying the failed one. You pay for the climb once, not per page.\n- **No garbage in your data.** A page still blocked at the top tier is dropped, never written to your output, and the run exits non-zero so an unattended job knows. Clean data or a clear signal, never a challenge page masquerading as content.\n- **You decide on stealth.** `--stealth` starts at the stealth backend; `--auto-stealth` (or `WEBREAPER_AUTO_STEALTH=1`) enables it unattended; `--no-auto-stealth` caps the climb at a vanilla browser. The ~220 MB stealth backend downloads only if you opt in.\n\nSelf-hosted, single binary, no cloud round-trip ([ADR-0083](docs/adr/0083-escalating-page-loader.md)).\n\n## Power your agent\n\nOnboarding an agent from zero? Hand it [docs/AI-ONBOARDING.md](docs/AI-ONBOARDING.md) — install → verify → choose-your-path on one page, no credentials anywhere.\n\n### Claude Code skill\n\n```bash\nwebreaper init\n```\n\nWrites a polished `SKILL.md` to `.claude/skills/webreaper/`. Claude Code loads it on the next session and routes scraping intents to the CLI: *\"scrape the top 5 stories on HN and summarize each\"*, *\"give me the markdown of this article\"*, *\"this Cloudflare-protected site is blocking me\"*. The skill describes when to prefer `webreaper` over the built-in `WebFetch` (artifacts and structured data go through `webreaper`; conversational answers stay on `WebFetch`).\n\n### MCP server\n\nTwo MCP servers expose the same tools (`scrape`, `map`, `extract`, `extract_with_prompt`, `extract_inferred`, `crawl`) for clients that speak the Model Context Protocol:\n\n- **[`WebReaper.Mcp`](https://www.nuget.org/packages/WebReaper.Mcp)** (ADR-0049) speaks **stdio**, for local clients that spawn a process: Cursor, Claude Desktop, Copilot Studio.\n- **[`WebReaper.Mcp.AspNetCore`](https://www.nuget.org/packages/WebReaper.Mcp.AspNetCore)** (ADR-0086) speaks **Streamable HTTP**, for clients that connect to a URL. The headline case is **n8n**, whose MCP Client node is URL-only and cannot reach a stdio server. Run the Chromium-baked image, point n8n's MCP Client node at the URL with a bearer token, and call the tools from a workflow:\n\n  ```bash\n  docker run -p 8080:8080 -e WEBREAPER_MCP_TOKEN=your-secret \\\n    ghcr.io/alex-on-ai/webreaper-mcp-http:latest\n  ```\n\n  See the [n8n quickstart](docs/mcp-http-quickstart.md).\n\nBoth are thin facades over the library API; the primary agent surface remains the CLI.\n\n### CLI inside any agent harness\n\nThe single binary works inside any shell-spawning agent: LangChain `ShellTool`, OpenAI Assistants code-interpreter, GitHub Actions, internal scripts. Zero runtime to install; one syscall to invoke.\n\n## AI features\n\n### 1. LLM-ready Markdown, no schema\n\n```csharp\nawait ScraperEngineBuilder\n    .Crawl(\"https://example.com\")\n    .AsMarkdown()                   // ADR-0040 + ADR-0063\n    .WriteToConsole()\n    .BuildAsync()\n    .Result.RunAsync();\n```\n\nThe Markdown extractor (`HtmlToMarkdown` primitive, ADR-0063; `MarkdownContentExtractor` adapter, ADR-0040) emits `{title, markdown}` per page. Pass the output straight into a follow-up LLM prompt.\n\n### 2. Source-gen schemas with compile-time guards\n\n```csharp\nusing WebReaper.Extraction.Attributes;\n\n[ScrapeSchema]\npublic partial class Article\n{\n    [ScrapeField(\"h1\")]                                              public string? Title { get; set; }\n    [ScrapeField(\".views\", Type = SchemaFieldType.Integer)]          public int Views { get; set; }\n    [ScrapeField(\".tag\", IsList = true)]                             public List\u003cstring\u003e Tags { get; set; } = new();\n}\n\n// Emitted at compile time, reflection-free, AOT-clean:\n//   public static Schema Schema { get; }\n//   public static Article Materialize(JsonObject json)\n\nvar engine = await ScraperEngineBuilder\n    .Crawl(\"https://example.com/post\")\n    .Extract(Article.Schema)\n    .Subscribe(p =\u003e HandleArticle(Article.Materialize(p.Data)))\n    .BuildAsync();\n```\n\nThe [`WebReaper.Extraction.Generators`](https://www.nuget.org/packages/WebReaper.Extraction.Generators) Roslyn analyzer (ADR-0045) emits the schema and a `Materialize` function. Schema typos are compile errors; the generated path uses no reflection so it's AOT-clean.\n\n### 3. LLM safety net for deterministic extraction\n\nThree composable patterns where the LLM only fires when the deterministic path can't deliver. Same proposer-validator shape across all three.\n\n```csharp\nusing WebReaper.AI;\n\n// (a) Fire LLM only when a field returns empty (ADR-0046)\n.WithLlmFallback(chatClient)\n\n// (b) Repair a broken selector once per (Schema, field), cache forever (ADR-0047)\n.WithLlmSelfHealing(chatClient)\n\n// (c) No schema at all: infer it from the URL, re-infer on validator failure (ADR-0067 + ADR-0069)\n.UseAi(chatClient, AiPolicyMode.Inferred)\n```\n\nStable pages cost zero LLM calls; broken pages cost one call per (page, field) and cache. Schema inference is one call per site (cached for the run). The `WebReaper.AI` satellite is built on `Microsoft.Extensions.AI` so any `IChatClient` works.\n\n### 4. Autonomous agent: page selection by goal\n\n```csharp\nusing WebReaper.AI;\n\nvar result = await LlmAgent.RunAsync(\n    \"https://example.com\",\n    goal: \"Find the contact email and phone number for the support team.\",\n    chatClient);\n```\n\n`AgentEngine` (ADR-0051) runs a sequential `decide → persist → execute` loop over a closed-sum `AgentDecision` (`Extract | Follow | Act | Stop`). The brain picks each step from the bounded `AgentState` view; the engine validates (visited-link enforcement, MaxSteps cap), persists (`IAgentRunStore`), and dispatches. Durable resume across process restarts.\n\n### 5. Semantic page actions\n\n```csharp\nScraperEngineBuilder\n    .CrawlWithBrowser(url, actions =\u003e actions\n        .Do(PageAction.SemanticAct(\"click 'sign in'\"))   // ADR-0050\n        .Do(PageAction.WaitForNetworkIdle())\n        .Build())\n    .WithLlmActionResolver(chatClient)\n    // ...\n```\n\nA natural-language `PageAction.SemanticAct(intent)` is one of the ten closed-sum arms (ADR-0050 added it; ADR-0074 added `Fill` / `Press` / `ScrollIntoView` for form interactions). The transport resolves it once via the registered `IActionResolver` (LLM-backed by default), dispatches the concrete arm, and caches the resolution per crawl by intent string. First page pays the LLM, subsequent same-intent pages dispatch the cached arm with no LLM call.\n\n### Runnable end-to-end demo\n\n[`Examples/WebReaper.AiNativeShowcase`](Examples/WebReaper.AiNativeShowcase/Program.cs) wires every feature in this section:\n\n```bash\ndotnet run --project Examples/WebReaper.AiNativeShowcase -- markdown\ndotnet run --project Examples/WebReaper.AiNativeShowcase -- sourcegen\ndotnet run --project Examples/WebReaper.AiNativeShowcase -- llm\ndotnet run --project Examples/WebReaper.AiNativeShowcase -- router\ndotnet run --project Examples/WebReaper.AiNativeShowcase -- changetrack\n```\n\n## Use cases\n\n- **Build LLM context from blog or docs sites.** `webreaper map` plus `webreaper scrape` per URL, piped into a prompt or a vector DB.\n- **Monitor competitor pricing or status pages for changes.** Schedule the CLI with cron or a worker, store records in MongoDB or SQLite, plug in `.WithChangeTracking()` (ADR-0048) so the sink fires only on diff. Hash-based dedup; cron-friendly.\n- **Run an autonomous research agent.** `LlmAgent.RunAsync(url, goal, chatClient)` decides which links to follow until the goal is met. Durable resume across restarts.\n- **Scrape Cloudflare-protected catalogs.** A plain `scrape` auto-climbs HTTP to a browser on a block; add `--stealth` (or `--auto-stealth` for unattended runs) to escalate to a stealth backend. Blocked pages are dropped, never emitted as challenge-page garbage.\n- **Generate clean datasets from semi-structured pages.** `[ScrapeSchema]` POCO plus the source generator; reflection-free, AOT-compiles into a native binary.\n- **Embed a scraping primitive in your own app.** `dotnet add package WebReaper`; the public registration seam lets you plug Redis, Cosmos DB, your own sink.\n\n## Packages\n\nThe release ships fifteen packages (one core, fourteen satellites), all versioned in lockstep. The core stays dependency-light and Native-AOT-publishable with zero warnings; satellites bring their own SDK dependencies and quarantine them off the core graph (ADR-0009).\n\n| Package | Add it for | Key builder calls |\n|---|---|---|\n| **WebReaper** | Core. HTTP crawl and parse, in-memory and file scheduler / visited-link tracker / cookie and config storage, Console / CSV / JSON-Lines sinks, Markdown extractor, schema fold. Dependency-light, Native-AOT-ready, Newtonsoft-free. | `Crawl` `Extract` `AsMarkdown` `Follow` `Paginate` `Sweep` `WriteToJsonFile` `WriteToCsvFile` `WriteToConsole` `Subscribe` |\n| **WebReaper.Cdp** | Raw CDP `IPageLoadTransport` (ADR-0052). AOT-clean (no PuppeteerSharp / Playwright dependency); System.Net.WebSockets plus System.Text.Json source-gen. Bedrock for the stealth pattern. | `.WithCdpPageLoader(cdpUrl)` (BYO) or `.WithCdpPageLoader(CdpLaunchOptions)` (launch managed Chromium) |\n| **WebReaper.Playwright** | Microsoft.Playwright-backed transport (ADR-0053). Multi-browser (Chromium default; Firefox / WebKit opt-in). All ten `PageAction` arms supported. Use for modern multi-browser needs; pair with `WebReaper.Cdp` for AOT or stealth. | `.WithPlaywrightPageLoader()` |\n| **WebReaper.Stealth.CloakBrowser** | First stealth-backend satellite (ADR-0054). Auto-downloads CloakBrowser on first use; composes on `WebReaper.Cdp`. Disposable via the ADR-0058 engine teardown chain. | `.WithCloakBrowser()` |\n| **WebReaper.AI** | LLM extraction, LLM action resolver, LLM brain, LLM self-healing, LLM schema inferrer (ADR-0044 / 0050 / 0051 / 0067). Built on `Microsoft.Extensions.AI`; bring your own `IChatClient`. | `.WithLlmFallback` `.WithLlmSelfHealing` `.WithLlmExtractor` `.WithLlmAgentBrain` `.WithLlmActionResolver` `.WithLlmSchemaInferrer` `.UseAi(client)` |\n| **WebReaper.AI.Http** | AOT-safe OpenAI-compatible `IChatClient` (ADR-0084): raw `HttpClient` plus System.Text.Json source-gen, no provider SDK. The bring-your-own client the AOT CLI and the MCP servers use for prompt extraction. | `new OpenAiCompatibleChatClient(baseUrl, model, apiKey)` |\n| **WebReaper.Extraction.Attributes** | The `[ScrapeSchema]` / `[ScrapeField]` marker types. Standalone, no runtime cost. | `[ScrapeSchema]` `[ScrapeField(\"selector\")]` |\n| **WebReaper.Extraction.Generators** | Roslyn source generator that emits `static Schema` plus reflection-free `static Materialize(JsonObject)` (ADR-0045). `DevelopmentDependency=true`; does not propagate at runtime. | compile-time only |\n| **WebReaper.Mcp** | MCP server `Exe` exposing scrape / map / extract / extract_with_prompt / extract_inferred / crawl as MCP tools over **stdio** (ADR-0049). Interop adapter for local MCP clients (Cursor, Claude Desktop). | the package _is_ the executable |\n| **WebReaper.Mcp.AspNetCore** | MCP server over **Streamable HTTP** (ADR-0086) for URL-based clients like n8n; same tools as `WebReaper.Mcp`, plus bearer-token auth and a `WEBREAPER_CDP_URL` browser sidecar. Also ships as a Chromium-baked container image. | the package _is_ the executable; or `docker run ghcr.io/alex-on-ai/webreaper-mcp-http` |\n| **WebReaper.Mongo** | MongoDB result sink and MongoDB-backed config / cookie storage. | `.WriteToMongoDb(...)` `.WithMongoDbConfigStorage(...)` `.WithMongoDbCookieStorage(...)` |\n| **WebReaper.Redis** | Redis scheduler, visited-link tracker, result sink, config / cookie storage. | `.WithRedisScheduler(...)` `.TrackVisitedLinksInRedis(...)` `.WriteToRedis(...)` `.WithRedisConfigStorage(...)` `.WithRedisCookieStorage(...)` |\n| **WebReaper.AzureServiceBus** | Distributed scheduler over an Azure Service Bus queue. | `.WithAzureServiceBusScheduler(...)` |\n| **WebReaper.Cosmos** | Azure Cosmos DB result sink. | `.WriteToCosmosDb(...)` |\n| **WebReaper.Sqlite** | Local **durable** scheduler and visited-link tracker on an embedded SQLite store; resume is a query, no position file. Opt-in robust-local tier (no server, unlike Redis). | `.WithSqliteScheduler(...)` `.TrackVisitedLinksInSqlite(...)` |\n\n`WebReaper.Cli` (the AOT single-binary; ADR-0043) is not a NuGet package; it ships as platform binaries on every GitHub release (Native-AOT plus `dotnet tool install` are mutually incompatible on one target). Install via Homebrew or `install.sh`, or build from source.\n\n## Compared to Firecrawl, Crawl4AI, and Crawlee\n\n|  | WebReaper | Firecrawl | Crawl4AI | WebFetch (Claude) |\n|---|---|---|---|---|\n| **License** | MIT | AGPL-3.0 (plus commercial) | Apache 2.0 | bundled with Claude |\n| **Install** | one binary, ~12 MB | Docker + Postgres + Redis (self-host) or hosted | Docker + Python + Playwright | nothing to install |\n| **Cost** | free | metered API plus free tier | free | included with Claude |\n| **BYO LLM** | any `IChatClient` | no (their model) | yes (LiteLLM) | Claude only |\n| **Autonomous agent** | `Agent.RunAsync()` durable, in-process | `/agent` endpoint (cloud only) | code it yourself | not available |\n| **Whole-site crawl** | `webreaper crawl` / `.Sweep()`: recursive, on-domain, sitemap-seeded, streams JSON Lines | `crawl` (cloud or self-host) | deep-crawl strategies (code it yourself) | no (single fetch) |\n| **Page actions** | 10 declarative arms: `Click`, `Wait`, `Fill`, `Press`, `ScrollToEnd`, `ScrollIntoView`, `WaitForSelector`, `WaitForNetworkIdle`, `EvaluateExpression`, `SemanticAct` (natural-language) | 9 actions: `wait`, `click`, `write`, `press`, `scroll`, `executeJavascript`, plus 3 observation (`screenshot`, `pdf`, `scrape`) | JS hooks; no closed-sum vocabulary | none (single-fetch only) |\n| **Bot-protected** | automatic HTTP → browser → stealth climb, per page, host-sticky, self-hosted | cloud yes; self-host degraded (no Fire-engine) | BYO | no |\n| **Claude Code skill** | `webreaper init` bundled | community `firecrawl-claude-code-skill` wraps the cloud API | none official | not applicable |\n\n[Crawlee](https://github.com/apify/crawlee) (Apify's Node/Python library) is also worth knowing; it covers similar ground to the WebReaper library API but doesn't ship a binary, a Claude Code skill, or a built-in LLM safety net. Use it if you're already in the Apify ecosystem.\n\nThe closest reference is Firecrawl: same AI-native positioning, opposite distribution shape. Firecrawl optimises for the hosted-API flow; WebReaper optimises for the local-binary flow. If you want a managed cloud with someone else's proxies and infra, Firecrawl is the buy. If you want a binary that runs locally with your own LLM key and no metering, WebReaper is the build.\n\n## API overview\n\nThe library is a fluent builder over a small set of seams. For the deep seam-by-seam reference (interfaces, main entities, custom sinks), see [`docs/architecture.md`](docs/architecture.md).\n\n### Schema extraction\n\n```csharp\nusing WebReaper.Builders;\n\nvar engine = await ScraperEngineBuilder\n    .Crawl(\"https://www.alexpavlov.dev/blog\")\n    .Extract(new()\n    {\n        new(\"title\", \".text-3xl.font-bold\"),\n        new(\"text\", \".max-w-max.prose.prose-dark\")\n    })\n    .Follow(\"a.text-gray-900.transition\")\n    .WriteToJsonFile(\"output.json\")\n    .PageCrawlLimit(10)\n    .WithParallelismDegree(30)\n    .LogToConsole()\n    .BuildAsync();\n\nawait engine.RunAsync();\n```\n\nEach `new(\"field\", \"css-selector\")` is a leaf; nest schemas for objects, set `IsList = true` for arrays, set `Attr = \"href\"` to read an HTML attribute instead of inner text.\n\n### Collect records in-process\n\nGet the scraped records back in your own process, no file and no custom sink. Pass `Subscribe` (ADR-0038) the `Add` of a thread-safe collection, then read them after the run:\n\n```csharp\nusing System.Collections.Concurrent;\nusing WebReaper.Builders;\nusing WebReaper.Sinks.Models;\n\nvar records = new ConcurrentBag\u003cParsedData\u003e();\n\nvar engine = await ScraperEngineBuilder\n    .Crawl(\"https://news.ycombinator.com\")\n    .AsMarkdown()\n    .Subscribe(records.Add)            // called concurrently; collect into a thread-safe type\n    .BuildAsync();\n\nawait engine.RunAsync();\n\nforeach (var record in records)\n    Console.WriteLine($\"{record.Url}: {record.Data[\"markdown\"]?.GetValue\u003cstring\u003e()?.Length ?? 0} chars\");\n```\n\nThe Crawl driver fans out to sinks concurrently, so a `ConcurrentBag` is the right collector, not a bare `List`. For structured fields, read your schema keys out of `record.Data` instead of `\"markdown\"`.\n\n### Parsing dynamic pages (SPA)\n\nFor JS-rendered pages, swap `Crawl` for `CrawlWithBrowser` and register a browser transport. Two satellites are available.\n\n**[`WebReaper.Playwright`](https://www.nuget.org/packages/WebReaper.Playwright)** is the modern default (ADR-0053): multi-browser, all ten `PageAction` arms.\n\n```csharp\nusing WebReaper.Builders;\nusing WebReaper.Playwright;\n\nawait ScraperEngineBuilder\n    .CrawlWithBrowser(\"https://example.com\")\n    .Extract(new() { new(\"title\", \"h1\") })\n    .WithPlaywrightPageLoader()\n    .BuildAsync();\n```\n\n**[`WebReaper.Cdp`](https://www.nuget.org/packages/WebReaper.Cdp)** is the AOT-clean alternative (ADR-0052): raw CDP over System.Net.WebSockets, AOT-publishable. Use for AOT consumers or as the base for stealth backends.\n\n```csharp\nusing WebReaper.Cdp;\n\n.WithCdpPageLoader(new CdpLaunchOptions { Headless = true })\n// or .WithCdpPageLoader(cdpUrl: \"http://localhost:9222\") for BYO browser\n```\n\nFor bot-protected sites, layer **[`WebReaper.Stealth.CloakBrowser`](https://www.nuget.org/packages/WebReaper.Stealth.CloakBrowser)** (ADR-0054) on top of `WebReaper.Cdp`:\n\n```csharp\nusing WebReaper.Stealth.CloakBrowser;\n\n.WithCloakBrowser()    // auto-downloads CloakBrowser on first use, ~220 MB\n```\n\nFor visible-browser debugging, add `.HeadlessMode(false)`.\n\n### Running JavaScript and page actions\n\nDrive the page as it loads. Pass an actions lambda.\n\n```csharp\nusing WebReaper.Builders;\nusing WebReaper.Playwright;\nusing WebReaper.Domain.PageActions;\n\nawait ScraperEngineBuilder\n    .CrawlWithBrowser(\"https://www.reddit.com/r/dotnet/\", actions =\u003e actions\n        .ScrollToEnd()\n        .Build())\n    .Extract(new() { new(\"title\", \"h1\") })\n    .WithPlaywrightPageLoader()\n    .BuildAsync();\n```\n\n`PageActionBuilder` exposes `Click`, `Wait`, `ScrollToEnd`, `ScrollIntoView`, `WaitForSelector`, `WaitForNetworkIdle`, `EvaluateExpression`, `Fill`, `Press`, `SemanticAct`, `Repeat` / `RepeatWithDelay`, and `Build()`. `Fill(selector, value)` / `Press(key)` / `ScrollIntoView(selector)` (ADR-0074) carry an implicit 30 s auto-wait and use the React-friendly native-setter trick on the CDP transport, so controlled components in React / Vue / Svelte observe the change. `SemanticAct` (ADR-0050) accepts a natural-language intent and resolves it via the registered `IActionResolver` (see [AI features §5](#5-semantic-page-actions)).\n\n### Persist progress locally\n\nSurvive `kill -9` and resume across restarts. Two adapters: file-backed (zero deps, in core) and SQLite-backed (durable, opt-in via `WebReaper.Sqlite`).\n\n```csharp\nusing WebReaper.Builders;\nusing WebReaper.Sqlite;\n\nawait ScraperEngineBuilder\n    .Crawl(\"https://example.com\")\n    .Extract(new() { new(\"name\", \"h1\") })\n    .Follow(\".forumlink\u003ea\")\n    .Paginate(\"a.torTopic\", \".pg\")\n    .WriteToJsonFile(\"result.json\")\n    .WithSqliteScheduler(\"crawl/state.db\")        // resume is a query, not a position file\n    .TrackVisitedLinksInSqlite(\"crawl/state.db\")  // the table is the set\n    .BuildAsync();\n```\n\nPass `dataCleanupOnStart: true` to any sink, tracker, or scheduler method to wipe its store at start (note: `WriteToJsonFile` defaults this to `true`; the others default to `false`).\n\n### Authorization\n\nIf the site needs cookies, call `SetCookies` and fill the container. You perform the login yourself.\n\n```csharp\nusing System.Net;\nusing WebReaper.Builders;\n\nawait ScraperEngineBuilder\n    .Crawl(\"https://example.com/protected\")\n    .Extract(new() { new(\"name\", \"h1\") })\n    .SetCookies(cookies =\u003e\n    {\n        cookies.Add(new Cookie(\"AuthToken\", \"123\"));\n    })\n    .BuildAsync();\n```\n\n### Distributed and serverless\n\nSwap the scheduler, config storage, and link tracker to Redis or Azure Service Bus; multiple workers or serverless functions share one crawl. [`Examples/WebReaper.AzureFuncs`](Examples/WebReaper.AzureFuncs) shows the serverless shape (two functions: `StartScraping` seeds the work, `WebReaperSpider` is the distributed Crawl driver). [`Examples/WebReaper.DistributedScraperWorkerService`](Examples/WebReaper.DistributedScraperWorkerService) shows the worker-service shape.\n\n`DistributedSpiderBuilder.BuildSpider()` returns a bare `ISpider` without a Crawl seed (ADR-0009 / ADR-0025: \"two seams, not one bug\" split). The worker's config is persisted separately by the start endpoint.\n\n### Storage and scheduler backends\n\nEvery backend is a swappable seam. In-memory is the default; file-backed lives in core; the rest come from satellites.\n\n| Seam | Core (in-memory default + file) | Satellite options |\n|---|---|---|\n| Scheduler | in-memory, `WithTextFileScheduler` | `WithSqliteScheduler`, `WithRedisScheduler`, `WithAzureServiceBusScheduler` |\n| Visited-link tracker | in-memory, `TrackVisitedLinksInFile` | `TrackVisitedLinksInSqlite`, `TrackVisitedLinksInRedis` |\n| Config storage | in-memory, `WithFileConfigStorage` | `WithMongoDbConfigStorage`, `WithRedisConfigStorage` |\n| Cookie storage | in-memory, `WithFileCookieStorage` | `WithMongoDbCookieStorage`, `WithRedisCookieStorage` |\n| Agent run store | in-memory, file (ADR-0051) | `WithSqliteAgentRunStore`, `WithRedisAgentRunStore`, `WithMongoAgentRunStore`, `WithCosmosAgentRunStore` |\n| Result sink | `WriteToConsole`, `WriteToCsvFile`, `WriteToJsonFile` | `WriteToMongoDb`, `WriteToRedis`, `WriteToCosmosDb` |\n| Page loader transport | HTTP (default) | `WithPlaywrightPageLoader`, `WithCdpPageLoader`, `WithCloakBrowser` |\n\nCustom sinks, the full interface index, and the main domain entities live in [`docs/architecture.md`](docs/architecture.md).\n\n## Repository structure\n\n| Project | Description |\n|---|---|\n| `WebReaper` | The core library (the `WebReaper` NuGet package). |\n| `WebReaper.Cdp` | Raw CDP transport satellite (ADR-0052). |\n| `WebReaper.Playwright` | Microsoft.Playwright transport satellite (ADR-0053). |\n| `WebReaper.Stealth.CloakBrowser` | First stealth-backend satellite (ADR-0054). |\n| `WebReaper.AI` | LLM extraction, action resolver, agent brain, self-healing, schema inferrer. |\n| `WebReaper.Extraction.Attributes` | `[ScrapeSchema]` / `[ScrapeField]` marker types (ADR-0045). |\n| `WebReaper.Extraction.Generators` | Roslyn source generator (ADR-0045). |\n| `WebReaper.Mcp` | MCP server satellite (ADR-0049). |\n| `WebReaper.Mongo` | MongoDB sink plus config / cookie storage. |\n| `WebReaper.Redis` | Redis scheduler, tracker, sink, config / cookie storage. |\n| `WebReaper.AzureServiceBus` | Azure Service Bus distributed scheduler. |\n| `WebReaper.Cosmos` | Azure Cosmos DB sink. |\n| `WebReaper.Sqlite` | Local durable scheduler and visited-link tracker over embedded SQLite. |\n| `WebReaper.Cli` | AOT single-binary CLI (ADR-0043). |\n| `Examples/WebReaper.ConsoleApplication` | Using WebReaper in a console application. |\n| `Examples/WebReaper.AiNativeShowcase` | Runnable demos for every AI feature in this README. |\n| `Examples/WebReaper.SchemaInferenceShowcase` | Demos for ADR-0067 / 0068 / 0069 schema inference. |\n| `Examples/WebReaper.ScraperWorkerService` | Using WebReaper in a .NET Worker Service. |\n| `Examples/WebReaper.DistributedScraperWorkerService` | Distributed crawl across workers sharing crawl state. |\n| `Examples/WebReaper.AzureFuncs` | Serverless crawl with Azure Functions plus Azure Service Bus. |\n| `Examples/BrownsfashionScraper` | A real-world e-commerce scraper example. |\n| `Misc/WebReaper.ProxyProviders` | Example proxy-provider implementations. |\n\n## License\n\nWebReaper is MIT-licensed (ADR-0017). All NuGet packages plus the `WebReaper.Cli` binary ship under the same terms. Use it commercially, embed it in proprietary software, fork it, modify it, redistribute it; the only ask is that you keep the copyright notice.\n\nPrior to the 10.0.0 wave, WebReaper was GPL-3.0-or-later. The relicense is strictly more permissive: every existing user is unaffected; new users who couldn't embed under GPL now can. Historical contributors are credited in [`CONTRIBUTORS.md`](CONTRIBUTORS.md). See [`docs/adr/0017-relicense-gpl-mit.md`](docs/adr/0017-relicense-gpl-mit.md) for the analysis and contributor consent path.\n\nContributions are welcome under the same MIT terms; sign-off via DCO ([`CONTRIBUTING.md`](CONTRIBUTING.md)).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falex-on-ai%2FWebReaper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falex-on-ai%2FWebReaper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falex-on-ai%2FWebReaper/lists"}