{"id":33333883,"url":"https://github.com/raintree-technology/docpull","last_synced_at":"2026-04-24T23:04:27.317Z","repository":{"id":323071754,"uuid":"1092028876","full_name":"raintree-technology/docpull","owner":"raintree-technology","description":"Crawl any website and convert it to clean, AI-ready Markdown — async Python CLI with MCP support, crawl profiles, caching, and RAG-optimized output","archived":false,"fork":false,"pushed_at":"2026-04-15T21:38:53.000Z","size":1772,"stargazers_count":20,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-15T23:27:16.251Z","etag":null,"topics":["ai-training-data","cli","crawler","developer-tools","documentation","llm","markdown","mcp","pypi","python","rag","web-scraping"],"latest_commit_sha":null,"homepage":"https://docpull.raintree.technology/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raintree-technology.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":".github/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-07T22:03:11.000Z","updated_at":"2026-04-15T21:38:47.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/raintree-technology/docpull","commit_stats":null,"previous_names":["raintree-technology/docpull"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/raintree-technology/docpull","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raintree-technology%2Fdocpull","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raintree-technology%2Fdocpull/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raintree-technology%2Fdocpull/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raintree-technology%2Fdocpull/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raintree-technology","download_url":"https://codeload.github.com/raintree-technology/docpull/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raintree-technology%2Fdocpull/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32243803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-24T13:21:15.438Z","status":"ssl_error","status_checked_at":"2026-04-24T13:21:15.005Z","response_time":64,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-training-data","cli","crawler","developer-tools","documentation","llm","markdown","mcp","pypi","python","rag","web-scraping"],"created_at":"2025-11-21T00:06:00.071Z","updated_at":"2026-04-24T23:04:27.311Z","avatar_url":"https://github.com/raintree-technology.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# docpull\n\n**Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.**\n\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)\n[![Downloads](https://pepy.tech/badge/docpull)](https://pepy.tech/project/docpull)\n[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://docpull.raintree.technology\"\u003e\n    \u003cimg src=\"https://pub-e85a1abca36f4fd8b4300a6ec2d6f45f.r2.dev/marketing/docpull/1768954147343-iaiziy-docpull-terminal-hero.gif\" alt=\"docpull demo\" width=\"600\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\ndocpull uses async HTTP (not Playwright) to fetch server-rendered pages,\nextracts main content, and writes clean Markdown with source-URL frontmatter —\nin seconds, with a small install footprint. It won't render JavaScript, but for\nthe large class of docs that don't need it (API references, Python/Go stdlib,\nmost dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a\nfast, auditable, sandbox-friendly way to pipe documentation into an LLM context,\na RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and\nCRLF-injection protections are on by default — a necessity when an AI agent\nis choosing the URLs.\n\n## Install\n\n```bash\npip install docpull\n\n# Optional extras\npip install 'docpull[llm]'           # tiktoken for token-accurate chunking\npip install 'docpull[trafilatura]'   # alternative extractor for noisy pages\npip install 'docpull[mcp]'           # run as an MCP server for AI agents\npip install 'docpull[all]'           # everything above\n```\n\n## Quick start\n\n```bash\n# Crawl and save Markdown\ndocpull https://docs.example.com\n\n# One page, no crawl — the fast path for agents\ndocpull https://docs.example.com/guide --single\n\n# LLM-ready NDJSON with 4k-token chunks streamed to stdout\ndocpull https://docs.example.com --profile llm --stream | jq .\n\n# Mirror a site for offline use\ndocpull https://docs.example.com --profile mirror --cache\n```\n\n## Framework-aware extraction\n\ndocpull inspects each page before running the generic extractor and can pull\ncontent directly from framework data feeds:\n\n| Framework | Strategy |\n|-----------|----------|\n| Next.js   | Parses `__NEXT_DATA__` JSON |\n| Mintlify  | `__NEXT_DATA__` with Mintlify tagging |\n| OpenAPI   | Renders `openapi.json` / `swagger.json` into Markdown |\n| Docusaurus| Detected and tagged; generic extractor produces Markdown |\n| Sphinx    | Detected and tagged; generic extractor produces Markdown |\n\nJS-only SPAs with no server-rendered content are detected and skipped with a\nclear reason (or, with `--strict-js-required`, reported as an error so agents\ncan route elsewhere).\n\n## Agent-friendly features\n\n- **`--single`** — fetch a single URL without discovery. Designed for tool loops.\n- **`--stream`** — NDJSON one-record-per-line, flushed on every page, pipeable.\n- **`--max-tokens-per-file N`** — split each page into token-bounded chunks on\n  heading boundaries (exact counts with tiktoken, estimate without).\n- **`--emit-chunks`** — write one file or record per chunk instead of per page.\n- **`--strict-js-required`** — hard-fail on JS-only pages instead of silently\n  skipping.\n- **`--extractor trafilatura`** — swap in [trafilatura](https://trafilatura.readthedocs.io/)\n  for sites where the default heuristics struggle.\n\n## Python API\n\n```python\nfrom docpull import fetch_one\n\nctx = fetch_one(\"https://docs.python.org/3/library/asyncio.html\")\nprint(ctx.title, ctx.source_type)\nprint(ctx.markdown[:500])\n```\n\nAsync streaming:\n\n```python\nimport asyncio\nfrom docpull import Fetcher, DocpullConfig, ProfileName, EventType\n\nasync def main():\n    cfg = DocpullConfig(\n        url=\"https://docs.example.com\",\n        profile=ProfileName.LLM,  # chunked NDJSON output\n    )\n    async with Fetcher(cfg) as fetcher:\n        async for event in fetcher.run():\n            if event.type == EventType.FETCH_PROGRESS:\n                print(f\"{event.current}/{event.total}: {event.url}\")\n        print(f\"Done: {fetcher.stats.pages_fetched} pages\")\n\nasyncio.run(main())\n```\n\nSingle-page from an agent tool:\n\n```python\nfrom docpull import Fetcher, DocpullConfig\n\nasync def tool_call(url: str) -\u003e str:\n    async with Fetcher(DocpullConfig(url=url)) as f:\n        ctx = await f.fetch_one(url, save=False)\n        return ctx.markdown or ctx.error or \"\"\n```\n\n## Profiles\n\n```bash\ndocpull https://site.com --profile rag      # Default. Dedup, rich metadata.\ndocpull https://site.com --profile llm      # NDJSON + chunks + metadata.\ndocpull https://site.com --profile mirror   # Full archive, polite, cached.\ndocpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.\n```\n\n## MCP server\n\ndocpull ships an MCP (Model Context Protocol) server so AI agents can call it\ndirectly over stdio:\n\n```bash\npip install 'docpull[mcp]'\ndocpull mcp  # starts the stdio server\n```\n\nAdd to Claude Desktop or Claude Code:\n\n```json\n{\n  \"mcpServers\": {\n    \"docpull\": {\n      \"command\": \"docpull\",\n      \"args\": [\"mcp\"]\n    }\n  }\n}\n```\n\nTools exposed:\n\n- `fetch_url(url, max_tokens?)` — one-shot fetch, no crawl\n- `ensure_docs(source, force?)` — fetch a named library (cached 7 days)\n- `list_sources(category?)` — show available aliases (react, nextjs, fastapi, …)\n- `list_indexed()` — what has been fetched locally\n- `grep_docs(pattern, library?)` — regex search across fetched Markdown\n\nUser-defined sources live in `~/.config/docpull-mcp/sources.yaml`:\n\n```yaml\nsources:\n  mydocs:\n    url: https://docs.example.com\n    description: My internal docs\n    category: internal\n    maxPages: 200\n```\n\n## Output\n\nMarkdown files with YAML frontmatter:\n\n```markdown\n---\ntitle: \"Getting Started\"\nsource: https://docs.example.com/guide\nsource_type: \"nextjs\"\n---\n\n# Getting Started\n…\n```\n\nNDJSON (one record per page or chunk):\n\n```json\n{\"url\": \"...\", \"title\": \"...\", \"content\": \"...\", \"hash\": \"...\", \"token_count\": 842, \"chunk_index\": 0}\n```\n\n## Security\n\n- HTTPS-only, mandatory robots.txt compliance\n- SSRF protection: blocks private/internal network IPs, DNS rebinding\n- XXE protection via `defusedxml` on sitemaps\n- Path traversal and CRLF header injection guards\n- Auth headers stripped on cross-origin redirects\n\n## Options\n\nRun `docpull --help` for the full list. Highlights:\n\n```\nCore:\n  --profile {rag,mirror,quick,llm,custom}\n  --single                Fetch one URL (no crawl)\n  --format {markdown,json,ndjson,sqlite}\n  --stream                Stream NDJSON to stdout\n\nLLM / chunking:\n  --max-tokens-per-file N\n  --tokenizer NAME        tiktoken encoding (default cl100k_base)\n  --emit-chunks           One file/record per chunk\n\nContent extraction:\n  --extractor {default,trafilatura}\n  --no-special-cases      Disable framework extractors\n  --strict-js-required    Error on JS-only pages\n\nCache:\n  --cache                 Enable incremental updates\n  --cache-dir DIR\n  --cache-ttl DAYS\n```\n\n## Troubleshooting\n\n```bash\ndocpull --doctor              # Check installation\ndocpull URL --verbose         # Verbose output\ndocpull URL --dry-run         # Test without downloading\ndocpull URL --preview-urls    # List URLs without fetching\n```\n\n## Links\n\n- [Website](https://docpull.raintree.technology)\n- [PyPI](https://pypi.org/project/docpull/)\n- [GitHub](https://github.com/raintree-technology/docpull)\n- [Changelog](https://github.com/raintree-technology/docpull/blob/main/docs/CHANGELOG.md)\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraintree-technology%2Fdocpull","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraintree-technology%2Fdocpull","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraintree-technology%2Fdocpull/lists"}