https://github.com/raintree-technology/docpull

Crawl any website and convert it to clean, AI-ready Markdown — async Python CLI with MCP support, crawl profiles, caching, and RAG-optimized output

ai-training-data cli crawler developer-tools documentation llm markdown mcp pypi python rag web-scraping

Host: GitHub
URL: https://github.com/raintree-technology/docpull
Owner: raintree-technology
License: mit
Created: 2025-11-07T22:03:11.000Z (8 months ago)
Default Branch: main
Last Pushed: 2026-04-15T21:38:53.000Z (3 months ago)
Last Synced: 2026-04-15T23:27:16.251Z (3 months ago)
Topics: ai-training-data, cli, crawler, developer-tools, documentation, llm, markdown, mcp, pypi, python, rag, web-scraping
Language: Python
Homepage: https://docpull.raintree.technology/
Size: 1.69 MB
Stars: 20
Watchers: 1
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
- Security: .github/SECURITY.md

README

# docpull

**Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.**

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)

[![Downloads](https://pepy.tech/badge/docpull)](https://pepy.tech/project/docpull)

[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)



  

    

  



docpull uses async HTTP (not Playwright) to fetch server-rendered pages,
extracts the main content, and writes clean Markdown with source-URL frontmatter
in seconds, with a small install footprint. It won't render JavaScript, but for
the large class of docs that don't need it (API references, Python/Go stdlib,
most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a
fast, auditable, sandbox-friendly way to pipe documentation into an LLM context,
a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and
CRLF-injection protections are on by default, a necessity when an AI agent
is choosing the URLs.

## Install

```bash
pip install docpull

# Optional extras
pip install 'docpull[llm]'           # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]'   # alternative extractor for noisy pages
pip install 'docpull[mcp]'           # run as an MCP server for AI agents
pip install 'docpull[all]'           # everything above
```

## Quick start

```bash
# Crawl and save Markdown
docpull https://docs.example.com

# One page, no crawl: the fast path for agents
docpull https://docs.example.com/guide --single

# LLM-ready NDJSON with 4k-token chunks streamed to stdout
docpull https://docs.example.com --profile llm --stream | jq .

# Mirror a site for offline use
docpull https://docs.example.com --profile mirror --cache
```

## Framework-aware extraction

docpull inspects each page before running the generic extractor and can pull
content directly from framework data feeds:

| Framework  | Strategy |
|------------|----------|
| Next.js    | Parses `__NEXT_DATA__` JSON |
| Mintlify   | `__NEXT_DATA__` with Mintlify tagging |
| OpenAPI    | Renders `openapi.json` / `swagger.json` into Markdown |
| Docusaurus | Detected and tagged; generic extractor produces Markdown |
| Sphinx     | Detected and tagged; generic extractor produces Markdown |

JS-only SPAs with no server-rendered content are detected and skipped with a
clear reason (or, with `--strict-js-required`, reported as an error so agents
can route elsewhere).

## Agent-friendly features

- **`--single`**: fetch a single URL without discovery. Designed for tool loops.
- **`--stream`**: NDJSON one-record-per-line, flushed on every page, pipeable.
- **`--max-tokens-per-file N`**: split each page into token-bounded chunks on
  heading boundaries (exact counts with tiktoken, estimate without).
- **`--emit-chunks`**: write one file or record per chunk instead of per page.
- **`--strict-js-required`**: hard-fail on JS-only pages instead of silently
  skipping.
- **`--extractor trafilatura`**: swap in [trafilatura](https://trafilatura.readthedocs.io/)
  for sites where the default heuristics struggle.
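The idea behind `--max-tokens-per-file` (token-bounded chunks cut on heading boundaries) can be pictured with the sketch below. It is an illustration under stated assumptions, not docpull's algorithm: the four-characters-per-token estimate and the function names `estimate_tokens`, `split_on_headings`, and `chunk_markdown` are all hypothetical.

```python
import re

# An ATX heading: one to six '#' characters followed by a space, at line start.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about four characters per token.
    return max(1, len(text) // 4)


def split_on_headings(markdown: str) -> list[str]:
    # Cut the document immediately before every heading line.
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b] for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s.strip()]


def chunk_markdown(markdown: str, max_tokens: int) -> list[str]:
    # Greedily pack whole sections into a chunk until the budget would be exceeded.
    chunks: list[str] = []
    current = ""
    for section in split_on_headings(markdown):
        if current and estimate_tokens(current + section) > max_tokens:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks
```

Two limitations of this sketch: a single section larger than the budget is emitted whole rather than split further, and a `# ` comment line inside a fenced code block would be mistaken for a heading.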

## Python API

```python
from docpull import fetch_one

ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title, ctx.source_type)
print(ctx.markdown[:500])
```

Async streaming:

```python
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main():
    cfg = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.LLM,  # chunked NDJSON output
    )
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
        print(f"Done: {fetcher.stats.pages_fetched} pages")


asyncio.run(main())
```

Single-page from an agent tool:

```python
from docpull import Fetcher, DocpullConfig


async def tool_call(url: str) -> str:
    async with Fetcher(DocpullConfig(url=url)) as f:
        ctx = await f.fetch_one(url, save=False)
        return ctx.markdown or ctx.error or ""
```

## Profiles

```bash
docpull https://site.com --profile rag      # Default. Dedup, rich metadata.
docpull https://site.com --profile llm      # NDJSON + chunks + metadata.
docpull https://site.com --profile mirror   # Full archive, polite, cached.
docpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.
```

## MCP server

docpull ships an MCP (Model Context Protocol) server so AI agents can call it
directly over stdio:

```bash
pip install 'docpull[mcp]'
docpull mcp  # starts the stdio server
```

Add to Claude Desktop or Claude Code:

```json
{
  "mcpServers": {
    "docpull": {
      "command": "docpull",
      "args": ["mcp"]
    }
  }
}
```

Tools exposed:

- `fetch_url(url, max_tokens?)`: one-shot fetch, no crawl
- `ensure_docs(source, force?)`: fetch a named library (cached 7 days)
- `list_sources(category?)`: show available aliases (react, nextjs, fastapi, …)
- `list_indexed()`: what has been fetched locally
- `grep_docs(pattern, library?)`: regex search across fetched Markdown
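As a rough picture of what a regex search across fetched Markdown involves, here is a small standard-library sketch. It is not the MCP server's implementation of `grep_docs`; the function name `grep_markdown`, its arguments, and the shape of its return value are assumptions made for illustration.

```python
import re
from pathlib import Path


def grep_markdown(root: Path, pattern: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for every matching line under root."""
    regex = re.compile(pattern)
    hits: list[tuple[str, int, str]] = []
    # Walk every .md file below root in a stable order.
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

Returning the relative path and line number with each hit lets an agent cite or reopen the exact location instead of re-reading whole files.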

User-defined sources live in `~/.config/docpull-mcp/sources.yaml`:

```yaml
sources:
  mydocs:
    url: https://docs.example.com
    description: My internal docs
    category: internal
    maxPages: 200
```

## Output

Markdown files with YAML frontmatter:

```markdown
---
title: "Getting Started"
source: https://docs.example.com/guide
source_type: "nextjs"
---

# Getting Started
…
```
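A downstream consumer usually needs to separate that frontmatter from the body. The sketch below handles only the flat `key: value` shape shown in the example above. It is a simplification rather than a YAML parser, and the function name `split_frontmatter` is hypothetical; a real pipeline would more likely hand the header to a YAML library.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a document into (frontmatter fields, Markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return meta, "\n".join(lines[i + 1:])
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # opening delimiter without a closing one
```

Splitting on the first colon only is what keeps `source: https://docs.example.com/guide` in one piece.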

NDJSON (one record per page or chunk):

```json
{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
```
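Because each NDJSON line is an independent JSON object, a consumer can process the stream one line at a time. The sketch below regroups chunk records into one string per page, using only the `url`, `content`, and `chunk_index` fields from the record shown above. The function name `group_chunks` is hypothetical, and joining chunks with a newline is an assumption about how they should be reassembled.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_chunks(lines: Iterable[str]) -> dict[str, str]:
    """Reassemble per-page text from NDJSON chunk records."""
    by_url: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        record = json.loads(line)
        # Records without a chunk_index are treated as a single chunk 0.
        by_url[record["url"]].append((record.get("chunk_index", 0), record["content"]))
    return {
        url: "\n".join(content for _, content in sorted(parts, key=lambda p: p[0]))
        for url, parts in by_url.items()
    }
```

Passing `sys.stdin` as `lines` would let a script built on this function sit at the end of a `docpull ... --profile llm --stream` pipe.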

## Security

- HTTPS-only, mandatory robots.txt compliance
- SSRF protection: blocks private/internal network IPs, DNS rebinding
- XXE protection via `defusedxml` on sitemaps
- Path traversal and CRLF header injection guards
- Auth headers stripped on cross-origin redirects
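To make the SSRF item above concrete, the sketch below shows the general technique of resolving a hostname and refusing any address that is not publicly routable. It is a generic illustration using Python's `ipaddress` and `socket` modules, not docpull's implementation, and the names `is_public_ip` and `resolve_and_check` are hypothetical.

```python
import ipaddress
import socket


def is_public_ip(address: str) -> bool:
    """True only for addresses outside private, loopback, and other special ranges."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_and_check(host: str, port: int = 443) -> list[str]:
    """Resolve host and raise ValueError if any resolved address is non-public."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    blocked = [a for a in addresses if not is_public_ip(a)]
    if blocked:
        raise ValueError(f"{host} resolves to non-public address(es): {blocked}")
    return addresses
```

Checking the address alone does not stop DNS rebinding: a client also has to connect to the address it validated rather than resolving the hostname a second time.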

## Options

Run `docpull --help` for the full list. Highlights:

```
Core:
  --profile {rag,mirror,quick,llm,custom}
  --single                Fetch one URL (no crawl)
  --format {markdown,json,ndjson,sqlite}
  --stream                Stream NDJSON to stdout

LLM / chunking:
  --max-tokens-per-file N
  --tokenizer NAME        tiktoken encoding (default cl100k_base)
  --emit-chunks           One file/record per chunk

Content extraction:
  --extractor {default,trafilatura}
  --no-special-cases      Disable framework extractors
  --strict-js-required    Error on JS-only pages

Cache:
  --cache                 Enable incremental updates
  --cache-dir DIR
  --cache-ttl DAYS
```

## Troubleshooting

```bash
docpull --doctor              # Check installation
docpull URL --verbose         # Verbose output
docpull URL --dry-run         # Test without downloading
docpull URL --preview-urls    # List URLs without fetching
```

## Links

- [Website](https://docpull.raintree.technology)
- [PyPI](https://pypi.org/project/docpull/)
- [GitHub](https://github.com/raintree-technology/docpull)
- [Changelog](https://github.com/raintree-technology/docpull/blob/main/docs/CHANGELOG.md)

## License

MIT
