https://github.com/webpeel/webpeel

The web data platform for AI agents. Fetch, search, crawl, extract, monitor, screenshot — one API. 29 domain extractors, 65-98% token savings, MCP server included.
https://github.com/webpeel/webpeel

ai-agent claude crawl cursor extract firecrawl-alternative llm markdown mcp-server monitor playwright rag screenshot search token-efficiency typescript web-data web-fetcher web-scraper

Last synced: 4 months ago
JSON representation

The web data platform for AI agents. Fetch, search, crawl, extract, monitor, screenshot — one API. 29 domain extractors, 65-98% token savings, MCP server included.

Host: GitHub
URL: https://github.com/webpeel/webpeel
Owner: webpeel
License: other
Created: 2026-02-12T18:18:25.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-03-23T07:48:19.000Z (4 months ago)
Last Synced: 2026-03-23T10:57:35.218Z (4 months ago)
Topics: ai-agent, claude, crawl, cursor, extract, firecrawl-alternative, llm, markdown, mcp-server, monitor, playwright, rag, screenshot, search, token-efficiency, typescript, web-data, web-fetcher, web-scraper
Language: TypeScript
Homepage: https://webpeel.dev
Size: 58.3 MB
Stars: 3
Watchers: 0
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md

Awesome Lists containing this project

README

          


  

    

  





  

  

  

  

  



The web data layer for AI agents.
Fetch, search, crawl, extract, screenshot — one call, zero boilerplate.




  Quick Start ·

  Agent Integrations ·

  Docs ·

  Playground ·

  Get API Key





  



---

## The Problem

Every AI agent that touches the web rebuilds the same brittle stack: HTTP fetch → headless browser → anti-bot bypass → HTML cleanup → markdown conversion → token budgeting. Each layer fails differently. Sites change. Cloudflare rotates challenges. Your agent gets empty strings at 2 AM and your pipeline breaks.

**WebPeel replaces that entire stack with one function call.** It handles engine selection, anti-bot escalation, domain-specific extraction, and token optimization so your agent gets clean, structured data every time — without managing browsers, proxies, or parsing logic.

---

## Quick Start

```bash

# Zero-install — just run it

npx webpeel "https://example.com"

# Search the web

npx webpeel search "latest AI agent frameworks"

# Crawl an entire site

npx webpeel crawl docs.example.com --max-pages 50

# Screenshot any page

npx webpeel screenshot "https://stripe.com/pricing" --full-page

# Ask a question about any page

npx webpeel ask "https://arxiv.org/abs/2401.00001" "What is the main contribution?"

```

Or install globally:

```bash

npm install -g webpeel

```

**Use as a library:**

```typescript

import { peel } from 'webpeel';

const result = await peel('https://news.ycombinator.com');

console.log(result.markdown);   // Clean markdown, ready for your LLM

console.log(result.metadata);   // Title, tokens saved, timing, etc.

```

**Use via API:**

```bash

curl "https://api.webpeel.dev/v1/fetch?url=https://stripe.com/pricing" \

  -H "Authorization: Bearer $WEBPEEL_API_KEY"

```

```json

{

  "url": "https://stripe.com/pricing",

  "markdown": "# Stripe Pricing\n\n**Integrated per-transaction fees**...",

  "metadata": {

    "title": "Pricing & Fees | Stripe",

    "tokens": 420,

    "tokensOriginal": 8200,

    "savingsPct": 94.9

  }

}

```

[Get your free API key →](https://app.webpeel.dev/signup) · No credit card required · 500 requests/week free

---

## Why WebPeel

### 🧠 55+ Domain Extractors — Not Just HTML-to-Markdown

Generic scrapers convert raw HTML to markdown and call it a day. WebPeel has **purpose-built extractors** for 55+ domains — Reddit, GitHub, YouTube, Amazon, ArXiv, Hacker News, Wikipedia, StackOverflow, Zillow, Polymarket, ESPN, and more. Each extractor understands the site's structure and returns clean, structured data without browser rendering.

### ⚡ 65–98% Token Savings

Domain extractors strip navigation, ads, sidebars, and boilerplate *before* content reaches your agent. Less context consumed = lower costs, faster inference, and longer agent chains.

| Site | Raw HTML tokens | WebPeel tokens | Savings |

|------|:--------------:|:--------------:|:-------:|

| News article | 18,000 | 640 | **96%** |

| Reddit thread | 24,000 | 890 | **96%** |

| Wikipedia page | 31,000 | 2,100 | **93%** |

| GitHub README | 5,200 | 1,800 | **65%** |

| E-commerce product | 14,000 | 310 | **98%** |

### 🔄 6-Layer Engine Escalation

WebPeel doesn't just try one method — it automatically escalates through 6 engines until it gets a good result:

```

Simple HTTP → Domain API → Browser render → Stealth browser → Cloaked browser → Search cache fallback

```

No manual `--render` flags for most sites. WebPeel knows which sites need JavaScript, which need stealth, and which have anti-bot protection — and picks the right engine automatically.

### 🔌 Firecrawl-Compatible Migration Path

Already using Firecrawl-style workflows? WebPeel supports compatible `/v1/scrape`, `/v2/scrape`, `/v1/crawl`, `/v1/search`, and `/v1/map` endpoints, which makes migration dramatically easier than rebuilding your pipeline from scratch.

---

## Agent-Native Integrations

### MCP Server (Claude, Cursor, Windsurf, VS Code)

Give any MCP-compatible AI the ability to browse, search, and extract from the web.

```json

{

  "mcpServers": {

    "webpeel": {

      "command": "npx",

      "args": ["-y", "webpeel", "mcp"],

      "env": { "WEBPEEL_API_KEY": "wp_your_key_here" }

    }

  }

}

```

**7 MCP tools exposed:** `webpeel_read` · `webpeel_find` · `webpeel_see` · `webpeel_extract` · `webpeel_monitor` · `webpeel_act` · `webpeel_crawl`

[Full MCP setup guide →](https://webpeel.dev/docs/mcp)

### LangChain

```typescript

import { WebPeelLoader } from 'webpeel/integrations/langchain';

const loader = new WebPeelLoader({ url: 'https://example.com', render: true });

const docs = await loader.load();

```

### LlamaIndex

```typescript

import { WebPeelReader } from 'webpeel/integrations/llamaindex';

const reader = new WebPeelReader();

const docs = await reader.loadData('https://example.com');

```

### Python SDK

```bash

pip install webpeel

```

```python

from webpeel import WebPeel

wp = WebPeel(api_key="wp_...")

result = wp.fetch("https://example.com")

print(result.markdown)

```

---

## Full Feature Set

| Capability | CLI | API | Details |

|-----------|:---:|:---:|---------|

| **Fetch & extract** | `webpeel "url"` | `GET /v1/fetch` | Clean markdown from any URL |

| **Web search** | `webpeel search "query"` | `GET /v1/search` | DuckDuckGo (free) or Brave (BYOK) |

| **Smart search** | — | `POST /v1/search/smart` | AI-powered structured results |

| **Crawl sites** | `webpeel crawl "url"` | `POST /v1/crawl` | Depth/page limits, rate control |

| **Screenshots** | `webpeel screenshot "url"` | `POST /v1/screenshot` | Full-page, multi-viewport, visual diff, filmstrip |

| **Structured extraction** | `--extract-schema` | `POST /v1/extract` | JSON schema → structured data |

| **Q&A** | `webpeel ask "url" "q"` | `POST /v1/answer` | Answer questions about any page |

| **Deep research** | — | `POST /v1/deep-research` | Multi-query autonomous research |

| **Content monitoring** | `webpeel monitor "url"` | `POST /v1/watch` | Change detection with webhooks |

| **Browser sessions** | — | `POST /v1/session` | Persistent sessions for login flows |

| **Browser actions** | `--action 'click:.btn'` | actions field | Click, type, scroll, wait |

| **Batch scrape** | `webpeel batch file` | `POST /v1/batch/scrape` | Parallel multi-URL processing |

| **URL discovery** | `webpeel map "url"` | `POST /v1/map` | Sitemap and link discovery |

| **YouTube transcripts** | auto-detected | auto-detected | Multiple export formats |

| **PDF extraction** | auto-detected | auto-detected | Text, tables, structure |

| **Research agent** | — | `POST /v1/agent` | Autonomous multi-step research |

---

## Use Cases for Agent Builders

**RAG pipelines** — Fetch docs, articles, or entire sites as clean markdown ready for chunking, embedding, and retrieval.

**Price monitoring** — Track product pages across major commerce sites with structured extraction and change detection.

**Competitive intel** — Monitor competitor pages, pricing tables, and job boards. Visual diff screenshots catch layout changes CSS selectors would miss.

**Research agents** — Give Claude, Codex, Cursor, or your own agent grounded web access through the API or MCP server.

**Lead enrichment** — Pull company details, public links, and page structure from business sites without writing per-site parsers.

**Content aggregation** — Crawl and extract from communities, docs sites, and publications with domain-native extractors that understand each site's structure.

---

## Architecture

```

Your Agent

    ↓

 WebPeel (npm / API / MCP)

    ↓

┌─────────────────────────────────┐

│  Engine Ranker                  │

│  HTTP → Domain API → Browser   │

│  → Stealth → Cloaked → Cache   │

├─────────────────────────────────┤

│  55+ Domain Extractors          │

│  reddit · github · youtube      │

│  amazon · arxiv · zillow · ...  │

├─────────────────────────────────┤

│  Content Pipeline               │

│  Readability → Turndown →       │

│  Token budgeting → Chunking     │

└─────────────────────────────────┘

    ↓

 Clean markdown / structured JSON

```

---

## Reliability

WebPeel is built for production agent workflows, not just one-off demos.

- **Automated evals in-repo** — smart search and fetch eval suites ship with the codebase

- **Post-deploy gate** — critical checks run before calling a deploy healthy

- **Engine fallback chain** — when one fetch method fails, WebPeel escalates instead of giving up

- **Multiple surfaces, one core** — CLI, API, SDK, and MCP all ride the same extraction pipeline

---

## Security

- **SSRF protection** — blocks localhost, private IPs, metadata endpoints, `file://` schemes

- **Helmet.js** — HSTS, X-Frame-Options, nosniff, XSS protection on all responses

- **Webhook signing** — HMAC-SHA256 on all outbound webhooks

- **API key hashing** — SHA-256 with granular scopes

- **Rate limiting** — sliding window, per-tier

- **Audit logging** — every API call logged with IP, key, and action

- **GDPR compliant** — `DELETE /v1/account` for full data erasure

[Security policy →](https://webpeel.dev/security) · [SLA (99.9% uptime) →](https://webpeel.dev/sla)

---

## Why teams choose WebPeel instead of stitching a stack together

| Approach | What it gives you | Where it breaks down |

|---|---|---|

| Raw HTTP + HTML parsing | Cheap, simple fetches | Falls apart on JS-heavy sites, anti-bot pages, and noisy HTML |

| Pure browser automation | Maximum control | Expensive, slow, fragile, and high-maintenance for large-scale use |

| Search-only APIs | Great discovery | Weak page extraction, limited structured output, limited downstream actions |

| Single-purpose scrapers | Fast on one job | You end up composing 4–6 tools for real agent workflows |

| **WebPeel** | Fetch + search + crawl + extraction + screenshots + monitoring in one layer | Opinionated toward agent workflows rather than generic scraping |

---

## Links

📖 [Documentation](https://webpeel.dev/docs) · 💰 [Pricing](https://webpeel.dev/pricing) · 🎮 [Playground](https://webpeel.dev/playground) · 📝 [Blog](https://webpeel.dev/blog) · 💬 [Discussions](https://github.com/webpeel/webpeel/discussions) · 🚀 [Releases](https://github.com/webpeel/webpeel/releases) · 📊 [Status](https://webpeel.dev/status) · 🔒 [Security](https://webpeel.dev/security) · 📋 [Changelog](https://webpeel.dev/changelog)

---

## Contributing

Pull requests welcome. Please open an issue first to discuss major changes.

```bash

git clone https://github.com/webpeel/webpeel.git

cd webpeel && npm install

npm run build && npm test

```

---

## License

[WebPeel SDK License](LICENSE) — free for personal and commercial use with attribution.



  Get started free →