https://github.com/raintree-technology/docpull
Crawl any website and convert it to clean, AI-ready Markdown — async Python CLI with MCP support, crawl profiles, caching, and RAG-optimized output
- Host: GitHub
- URL: https://github.com/raintree-technology/docpull
- Owner: raintree-technology
- License: mit
- Created: 2025-11-07T22:03:11.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-04-15T21:38:53.000Z (2 months ago)
- Last Synced: 2026-04-15T23:27:16.251Z (2 months ago)
- Topics: ai-training-data, cli, crawler, developer-tools, documentation, llm, markdown, mcp, pypi, python, rag, web-scraping
- Language: Python
- Homepage: https://docpull.raintree.technology/
- Size: 1.69 MB
- Stars: 20
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
- Security: .github/SECURITY.md
# docpull
**Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.**
docpull uses async HTTP (not Playwright) to fetch server-rendered pages,
extracts main content, and writes clean Markdown with source-URL frontmatter —
in seconds, with a small install footprint. It won't render JavaScript, but for
the large class of docs that don't need it (API references, Python/Go stdlib,
most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a
fast, auditable, sandbox-friendly way to pipe documentation into an LLM context,
a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and
CRLF-injection protections are on by default — a necessity when an AI agent
is choosing the URLs.
## Install
```bash
pip install docpull
# Optional extras
pip install 'docpull[llm]' # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]' # alternative extractor for noisy pages
pip install 'docpull[mcp]' # run as an MCP server for AI agents
pip install 'docpull[all]' # everything above
```
## Quick start
```bash
# Crawl and save Markdown
docpull https://docs.example.com
# One page, no crawl — the fast path for agents
docpull https://docs.example.com/guide --single
# LLM-ready NDJSON with 4k-token chunks streamed to stdout
docpull https://docs.example.com --profile llm --stream | jq .
# Mirror a site for offline use
docpull https://docs.example.com --profile mirror --cache
```
## Framework-aware extraction
docpull inspects each page before running the generic extractor and can pull
content directly from framework data feeds:
| Framework | Strategy |
|-----------|----------|
| Next.js | Parses `__NEXT_DATA__` JSON |
| Mintlify | `__NEXT_DATA__` with Mintlify tagging |
| OpenAPI | Renders `openapi.json` / `swagger.json` into Markdown |
| Docusaurus| Detected and tagged; generic extractor produces Markdown |
| Sphinx | Detected and tagged; generic extractor produces Markdown |
JS-only SPAs with no server-rendered content are detected and skipped with a
clear reason (or, with `--strict-js-required`, reported as an error so agents
can route elsewhere).
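As an illustration of the Next.js row in the table above, here is a minimal stdlib sketch of pulling the `__NEXT_DATA__` payload out of a server-rendered page. It is not docpull's implementation; `NextDataParser` and `extract_next_data` are hypothetical names, and the sample HTML is made up.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">, if present."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    @property
    def payload(self):
        return "".join(self._parts)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ dict, or None when the page has none."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.payload.strip():
        return None
    try:
        return json.loads(parser.payload)
    except json.JSONDecodeError:
        return None  # malformed payload: let a generic extractor handle it


# Made-up sample page for illustration only.
sample = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Getting Started"}}}'
    "</script></body></html>"
)
data = extract_next_data(sample)
print(data["props"]["pageProps"]["title"])
```

Returning `None` on a missing or malformed payload lets the caller fall back to generic HTML extraction instead of failing the page.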
## Agent-friendly features
- **`--single`** — fetch a single URL without discovery. Designed for tool loops.
- **`--stream`** — NDJSON one-record-per-line, flushed on every page, pipeable.
- **`--max-tokens-per-file N`** — split each page into token-bounded chunks on
  heading boundaries (exact counts with tiktoken, estimate without).
- **`--emit-chunks`** — write one file or record per chunk instead of per page.
- **`--strict-js-required`** — hard-fail on JS-only pages instead of silently
  skipping.
- **`--extractor trafilatura`** — swap in [trafilatura](https://trafilatura.readthedocs.io/)
  for sites where the default heuristics struggle.
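To make "token-bounded chunks on heading boundaries" concrete, here is a rough sketch of the idea using only the standard library. It is an assumption-laden illustration, not docpull's algorithm: the four-characters-per-token estimate and the function names are hypothetical, and the regex does not skip `#` lines inside fenced code blocks.

```python
import re

# Matches ATX headings ("# ", "## ", ...) at the start of a line.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def estimate_tokens(text):
    """Rough estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def split_on_headings(markdown):
    """Split Markdown into sections, each starting at a heading line."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b] for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s.strip()]


def chunk_markdown(markdown, max_tokens):
    """Greedily pack whole sections into chunks of at most max_tokens.

    A single section larger than the budget is emitted as its own chunk
    rather than being cut mid-section.
    """
    chunks, current = [], ""
    for section in split_on_headings(markdown):
        if current and estimate_tokens(current + section) > max_tokens:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks


doc = "# A\n" + "a" * 40 + "\n## B\n" + "b" * 40 + "\n## C\n" + "c" * 40 + "\n"
for i, chunk in enumerate(chunk_markdown(doc, max_tokens=25)):
    print(i, chunk.splitlines()[0], estimate_tokens(chunk))
```

Packing whole sections keeps each chunk readable on its own, at the cost of occasionally exceeding the budget when one section is too large.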
## Python API
```python
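from docpull import fetch_one

# Synchronous one-shot fetch: no crawl and no event loop to manage
ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title, ctx.source_type)  # page title and detected source type
print(ctx.markdown[:500])          # first 500 characters of the Markdown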
```
Async streaming:
```python
import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main():
    cfg = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.LLM,  # chunked NDJSON output
    )
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
        print(f"Done: {fetcher.stats.pages_fetched} pages")


asyncio.run(main())
```
Single-page from an agent tool:
```python
from docpull import Fetcher, DocpullConfig


async def tool_call(url: str) -> str:
    async with Fetcher(DocpullConfig(url=url)) as f:
        ctx = await f.fetch_one(url, save=False)
        return ctx.markdown or ctx.error or ""
```
## Profiles
```bash
docpull https://site.com --profile rag # Default. Dedup, rich metadata.
docpull https://site.com --profile llm # NDJSON + chunks + metadata.
docpull https://site.com --profile mirror # Full archive, polite, cached.
docpull https://site.com --profile quick # Sampling: 50 pages, depth 2.
```
## MCP server
docpull ships an MCP (Model Context Protocol) server so AI agents can call it
directly over stdio:
```bash
pip install 'docpull[mcp]'
docpull mcp # starts the stdio server
```
Add to Claude Desktop or Claude Code:
```json
{
  "mcpServers": {
    "docpull": {
      "command": "docpull",
      "args": ["mcp"]
    }
  }
}
```
Tools exposed:
- `fetch_url(url, max_tokens?)` — one-shot fetch, no crawl
- `ensure_docs(source, force?)` — fetch a named library (cached 7 days)
- `list_sources(category?)` — show available aliases (react, nextjs, fastapi, …)
- `list_indexed()` — what has been fetched locally
- `grep_docs(pattern, library?)` — regex search across fetched Markdown
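For a sense of what a regex search across fetched Markdown involves, here is a small stdlib sketch. It is not the code behind `grep_docs`; the function name `grep_markdown`, the directory layout, and the return shape are all assumptions made for illustration.

```python
import re
from pathlib import Path


def grep_markdown(root, pattern):
    """Return (path, line_number, line) for each matching line under root."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        # errors="replace" keeps the scan going past badly encoded files.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path), number, line))
    return hits
```

Calling `grep_markdown("docs", r"asyncio\.\w+")` would list every line under a hypothetical `docs/` directory that mentions an `asyncio` attribute.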
User-defined sources live in `~/.config/docpull-mcp/sources.yaml`:
```yaml
sources:
  mydocs:
    url: https://docs.example.com
    description: My internal docs
    category: internal
    maxPages: 200
```
## Output
Markdown files with YAML frontmatter:
```markdown
---
title: "Getting Started"
source: https://docs.example.com/guide
source_type: "nextjs"
---
# Getting Started
…
```
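A downstream script can separate the frontmatter from the body before indexing. The sketch below handles only flat `key: value` lines like the example above; `split_frontmatter` is a hypothetical helper, and anything more complex should go through a real YAML parser.

```python
def split_frontmatter(text):
    """Return (metadata, body) for a file that starts with a '---' block.

    Only flat `key: value` lines are handled; nested YAML needs a real parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # no closing delimiter: treat the file as plain Markdown


page = (
    "---\n"
    'title: "Getting Started"\n'
    "source: https://docs.example.com/guide\n"
    'source_type: "nextjs"\n'
    "---\n"
    "# Getting Started\n"
)
meta, body = split_frontmatter(page)
print(meta["source"], "|", body)
```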
NDJSON (one record per page or chunk):
```json
{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
```
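Because each record is a single JSON object per line, the stream can be consumed with the standard library alone. The sketch below sums `token_count` per `url` using the field names from the record above; the sample values and helper names are made up, and `io.StringIO` stands in for `sys.stdin`.

```python
import io
import json

# Two chunk records shaped like the NDJSON example (values are made up).
SAMPLE = (
    '{"url": "https://docs.example.com/guide", "title": "Guide", '
    '"content": "Intro", "hash": "a1", "token_count": 842, "chunk_index": 0}\n'
    '{"url": "https://docs.example.com/guide", "title": "Guide", '
    '"content": "More", "hash": "b2", "token_count": 158, "chunk_index": 1}\n'
)


def read_ndjson(stream):
    """Yield one dict per non-empty line of an NDJSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def total_tokens_by_url(records):
    """Sum token_count per url across page or chunk records."""
    totals = {}
    for rec in records:
        totals[rec["url"]] = totals.get(rec["url"], 0) + rec.get("token_count", 0)
    return totals


# In a real pipeline, pass sys.stdin instead of the StringIO stand-in.
print(total_tokens_by_url(read_ndjson(io.StringIO(SAMPLE))))
```

Reading line by line means the consumer can start work as soon as the first page is flushed, without waiting for the crawl to finish.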
## Security
- HTTPS-only, mandatory robots.txt compliance
- SSRF protection: blocks private/internal network IPs and guards against DNS rebinding
- XXE protection via `defusedxml` on sitemaps
- Path traversal and CRLF header injection guards
- Auth headers stripped on cross-origin redirects
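To illustrate the kind of check that SSRF protection relies on, here is a minimal sketch built on Python's `ipaddress` module. It is not docpull's code; `is_blocked_address` is a hypothetical helper that classifies an already-resolved IP address.

```python
import ipaddress


def is_blocked_address(ip_text):
    """Return True for addresses a crawler should refuse to connect to."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return True  # not a valid IP literal: refuse rather than guess
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


for candidate in ["127.0.0.1", "10.0.0.5", "169.254.169.254", "::1", "8.8.8.8"]:
    print(candidate, is_blocked_address(candidate))
```

A check like this only helps against DNS rebinding if it runs on the address the connection actually uses, not on an earlier lookup of the same hostname.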
## Options
Run `docpull --help` for the full list. Highlights:
```
Core:
  --profile {rag,mirror,quick,llm,custom}
  --single                 Fetch one URL (no crawl)
  --format {markdown,json,ndjson,sqlite}
  --stream                 Stream NDJSON to stdout

LLM / chunking:
  --max-tokens-per-file N
  --tokenizer NAME         tiktoken encoding (default cl100k_base)
  --emit-chunks            One file/record per chunk

Content extraction:
  --extractor {default,trafilatura}
  --no-special-cases       Disable framework extractors
  --strict-js-required     Error on JS-only pages

Cache:
  --cache                  Enable incremental updates
  --cache-dir DIR
  --cache-ttl DAYS
```
## Troubleshooting
```bash
docpull --doctor # Check installation
docpull URL --verbose # Verbose output
docpull URL --dry-run # Test without downloading
docpull URL --preview-urls # List URLs without fetching
```
## Links
- [Website](https://docpull.raintree.technology)
- [PyPI](https://pypi.org/project/docpull/)
- [GitHub](https://github.com/raintree-technology/docpull)
- [Changelog](https://github.com/raintree-technology/docpull/blob/main/docs/CHANGELOG.md)
## License
MIT