{"id":49210495,"url":"https://github.com/aimlpm/markcrawl","last_synced_at":"2026-04-23T21:01:18.806Z","repository":{"id":336989973,"uuid":"1052198056","full_name":"AIMLPM/markcrawl","owner":"AIMLPM","description":"Fast Python web crawler for RAG and AI ingestion. Extracts clean Markdown from any site for LLMs and vector stores.","archived":false,"fork":false,"pushed_at":"2026-04-19T13:57:28.000Z","size":14229,"stargazers_count":3,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-19T15:37:30.529Z","etag":null,"topics":["ai-agents","anthropic-claude","data-extraction","gemini","ingestion-pipeline","llm","markdown-extraction","openai","pgvector","python","rag","sitemap-crawler","structured-data","supabase","vector-database","webcrawler"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AIMLPM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-07T15:55:43.000Z","updated_at":"2026-04-19T13:57:32.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/AIMLPM/markcrawl","commit_stats":null,"previous_names":["aimlpm/webcrawler","aimlpm/markcrawl"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/AIMLPM/markcrawl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIMLPM%2Fmarkcrawl","tags_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIMLPM%2Fmarkcrawl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIMLPM%2Fmarkcrawl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIMLPM%2Fmarkcrawl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AIMLPM","download_url":"https://codeload.github.com/AIMLPM/markcrawl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIMLPM%2Fmarkcrawl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32198232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-23T20:19:26.138Z","status":"ssl_error","status_checked_at":"2026-04-23T20:19:23.520Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","anthropic-claude","data-extraction","gemini","ingestion-pipeline","llm","markdown-extraction","openai","pgvector","python","rag","sitemap-crawler","structured-data","supabase","vector-database","webcrawler"],"created_at":"2026-04-23T21:01:17.854Z","updated_at":"2026-04-23T21:01:18.793Z","avatar_url":"https://github.com/AIMLPM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MarkCrawl by iD8 🕷️📝\n### Turn any webpage or website into 
clean Markdown for LLM pipelines — in one command.\n\n[![CI](https://github.com/AIMLPM/markcrawl/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/AIMLPM/markcrawl/actions/workflows/ci.yml)\n![PyPI Version](https://img.shields.io/pypi/v/markcrawl)\n![License](https://img.shields.io/github/license/AIMLPM/markcrawl)\n[![MCP Server](https://img.shields.io/badge/MCP-AA-green?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZD0iTTEyIDJMMiA3bDEwIDUgMTAtNS0xMC01ek0yIDE3bDEwIDUgMTAtNS0xMC01LTEwIDV6TTIgMTJsMTAgNSAxMC01LTEwLTUtMTAgNXoiIGZpbGw9IndoaXRlIi8+PC9zdmc+)](https://glama.ai/mcp/servers/AIMLPM/markcrawl)\n\n```bash\npip install markcrawl\nmarkcrawl --base https://docs.example.com --out ./output --show-progress\n```\n\nMarkCrawl is a crawl-and-structure engine. It fetches one page or crawls an entire website, strips navigation/scripts/boilerplate, and writes clean Markdown files with a structured JSONL index. Every page includes a citation with the access date. 
No API keys needed.\n\nEverything else — LLM extraction, Supabase upload, MCP server, LangChain tools — is optional and installed separately.\n\n\u003e **Want a hosted API instead of running locally?** [Join the waitlist](https://github.com/AIMLPM/markcrawl/issues/13) — we're gauging interest.\n\n**LLM agents:** Load [docs/LLM_PROMPT.md](docs/LLM_PROMPT.md) as a system prompt to generate correct MarkCrawl commands automatically.\n\n## Quickstart (2 minutes)\n\n```bash\npip install markcrawl\nmarkcrawl --base https://quotes.toscrape.com --out ./demo --max-pages 5 --show-progress\n```\n\nYour `./demo` folder now contains:\n\n```text\ndemo/\n├── index__a4f3b2c1d0.md    ← clean Markdown of the page\n├── page-2__b7e2d1f0a3.md\n├── ...\n└── pages.jsonl              ← structured index (one JSON line per page)\n```\n\nEach line in `pages.jsonl`:\n\n```json\n{\n  \"url\": \"https://quotes.toscrape.com/\",\n  \"title\": \"Quotes to Scrape\",\n  \"crawled_at\": \"2026-04-04T12:30:00Z\",\n  \"citation\": \"Quotes to Scrape. quotes.toscrape.com. 
Available at: https://quotes.toscrape.com/ [Accessed April 04, 2026].\",\n  \"tool\": \"markcrawl\",\n  \"text\": \"# Quotes to Scrape\\n\\n\u003e \\\"The world as we have created it is a process of our thinking...\\\" — Albert Einstein\\n\\nTags: change, deep-thoughts, thinking, world...\"\n}\n```\n\n## Common Recipes\n\n**Scrape a single page:**\n\n```bash\nmarkcrawl --base https://example.com/pricing --no-sitemap --max-pages 1\n```\n\n**Scrape a single JS-rendered page** (React, Vue, YouTube, etc.):\n\n```bash\nmarkcrawl --base \"https://www.youtube.com/@channel/videos\" \\\n  --no-sitemap --max-pages 1 --render-js\n# → outputs one .md file with video titles, view counts, and dates\n```\n\nFor infinite-scroll pages like YouTube, this captures the first ~28 videos from the initial render.\n\n**Crawl a docs site:**\n\n```bash\nmarkcrawl --base https://docs.example.com --max-pages 500 --concurrency 5 --show-progress\n```\n\n**Crawl a subsection without sitemap wandering:**\n\nLarge sites (YouTube, GitHub, etc.) 
have sitemaps with thousands of unrelated pages.\nUse `--no-sitemap` to crawl only from your target URL:\n\n```bash\nmarkcrawl --base https://docs.example.com/guides \\\n  --no-sitemap --max-pages 50 --show-progress\n```\n\n**Competitive analysis** (crawl 3 competitors, extract pricing):\n\n```bash\nmarkcrawl --base https://competitor-one.com/pricing --no-sitemap --max-pages 1 --out ./comp1\nmarkcrawl --base https://competitor-two.com/pricing --no-sitemap --max-pages 1 --out ./comp2\nmarkcrawl --base https://competitor-three.com/pricing --no-sitemap --max-pages 1 --out ./comp3\nmarkcrawl-extract \\\n  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \\\n  --fields pricing_tiers features free_trial --show-progress\n# → extracted.jsonl with structured pricing data across all three\n```\n\n**Docs site → RAG chatbot** (full pipeline: crawl, embed, query):\n\n```bash\nmarkcrawl --base https://docs.example.com --out ./docs --max-pages 500 --concurrency 5 --show-progress\nmarkcrawl-upload --jsonl ./docs/pages.jsonl --show-progress\n# → pages are chunked, embedded, and uploaded to Supabase/pgvector\n# Wire your chatbot to query the vector table — see docs/SUPABASE.md\n```\n\n**API docs → code generation prompt:**\n\n```bash\nmarkcrawl --base https://api.example.com/docs --out ./api-docs --max-pages 200 --show-progress\n# Feed the output to an LLM:\n# \"Using the API documentation in ./api-docs/pages.jsonl, generate a\n#  typed Python client with methods for each endpoint.\"\n```\n\n**Back up a blog before it shuts down:**\n\n```bash\nmarkcrawl --base https://engineering.example.com/blog \\\n  --no-sitemap --max-pages 1000 --concurrency 5 --out ./blog-archive --show-progress\n# → every post saved as clean Markdown with citations and access dates\n```\n\n**Skip junk pages** (job listings, login walls, SEO spam):\n\n```bash\nmarkcrawl --base https://example.com \\\n  --exclude-path \"/job/*\" --exclude-path \"/careers/*\" --exclude-path \"/login\" \\\n  
--max-pages 500 --out ./output --show-progress\n```\n\n**Preview URLs before committing to a long crawl:**\n\n```bash\nmarkcrawl --base https://example.com --dry-run\n# → prints every URL that would be crawled (from sitemap), then exits\n# Pipe to wc -l to get a count, or grep to check for junk patterns\nmarkcrawl --base https://example.com --dry-run | wc -l\nmarkcrawl --base https://example.com --dry-run | grep \"/job/\"\n```\n\n**Only crawl specific sections** (blog + pricing, ignore everything else):\n\n```bash\nmarkcrawl --base https://example.com \\\n  --include-path \"/blog/*\" --include-path \"/pricing\" \\\n  --max-pages 200 --out ./output --show-progress\n```\n\n**Safe crawl of a job board** (dry-run + exclude):\n\n```bash\n# Step 1: see what you'd get\nmarkcrawl --base https://tealhq.com --dry-run | head -50\n# Step 2: exclude the job listings, crawl just the content pages\nmarkcrawl --base https://tealhq.com \\\n  --exclude-path \"/job/*\" --exclude-path \"/resume-examples/*\" \\\n  --max-pages 200 --out ./tealhq --show-progress\n```\n\n**Choose an extraction backend:**\n\n```bash\n# Default (BS4 + markdownify) — fastest, good for most sites\nmarkcrawl --base https://docs.example.com --out ./output --show-progress\n\n# Ensemble — runs default + trafilatura, picks best per page\nmarkcrawl --base https://docs.example.com --out ./output --extractor ensemble --show-progress\n\n# ReaderLM-v2 — ML-based extraction (requires: pip install markcrawl[ml])\nmarkcrawl --base https://docs.example.com --out ./output --extractor readerlm --show-progress\n```\n\n**Skip pages you've already crawled** (cross-crawl dedup):\n\n```bash\n# First crawl\nmarkcrawl --base https://docs.example.com --out ./docs --show-progress\n# Later — only fetches new/changed pages\nmarkcrawl --base https://docs.example.com --out ./docs --cross-dedup --show-progress\n```\n\n**Crawl high-value pages first** (link prioritization):\n\n```bash\nmarkcrawl --base https://docs.example.com --out ./docs 
\\\n  --prioritize-links --max-pages 100 --show-progress\n# Prioritizes content-rich pages (guides, docs) over low-value ones (legal, login)\n```\n\n**Smart-sample a large site** (e-commerce, job boards, real estate):\n\n```bash\n# Preview the pattern clusters first\nmarkcrawl --base https://bigsite.com --dry-run --smart-sample --show-progress\n# Crawl with sampling — 5 pages per templated cluster instead of thousands\nmarkcrawl --base https://bigsite.com --out ./bigsite \\\n  --smart-sample --sample-size 5 --sample-threshold 20 --show-progress\n```\n\n**Download images alongside content** (photography blogs, product pages):\n\n```bash\n# Crawl a photography blog and save images from the content area\nmarkcrawl --base https://photography-blog.example.com --out ./photos \\\n  --download-images --max-pages 50 --show-progress\n# Output:\n#   ./photos/assets/mountain-abc123.jpg\n#   ./photos/assets/sunset-def456.png\n#   ./photos/post-1__a1b2c3.md  ← Markdown with ![alt](assets/filename.ext) refs\n#   ./photos/pages.jsonl         ← index includes \"images\" array per page\n\n# Adjust minimum image size to skip thumbnails (default: 5000 bytes)\nmarkcrawl --base https://example.com/gallery --out ./gallery \\\n  --download-images --min-image-size 20000 --show-progress\n```\n\n**Capture page screenshots** (dashboards, data visualisations, JS-rendered charts):\n\n```bash\n# Full-page screenshot of every crawled page (auto-enables --render-js)\nmarkcrawl --base https://steamcharts.com/top --out ./dash \\\n  --screenshot --max-pages 5 --show-progress\n# Output:\n#   ./dash/screenshots/top-abc123def456.png   ← 1920-wide full-page PNG\n#   ./dash/pages.jsonl                        ← each row gets \"screenshot\": \"screenshots/...\"\n\n# Crop to just the dashboard region, JPEG for smaller files, longer wait for slow charts\nmarkcrawl --base https://example.com/dashboards --out ./dash \\\n  --screenshot --screenshot-selector \".dashboard-main\" \\\n  --screenshot-format jpeg 
--screenshot-wait-ms 3000 --show-progress\n```\n\nThe screenshot path loads with ``wait_until=\"load\"`` and then pauses\n``--screenshot-wait-ms`` (default 1500ms) before capturing, so canvas/SVG\ncharts have time to render.  (``networkidle`` is deliberately avoided —\nmany real sites never idle due to analytics pings.)  Failures are\nrecorded in the JSONL row as ``screenshot_error`` rather than aborting\nthe crawl.\n\n**Multi-site: discover seed URLs and fan out across sites**:\n\n```bash\n# Use a bundled curated seed pack, then crawl every site with screenshots\nmarkcrawl discover --pack game-dashboards | \\\n  markcrawl --seed-file - --out ./dashboards \\\n    --screenshot --max-pages-per-site 5 --show-progress\n\n# List available packs\nmarkcrawl discover --list-packs\n```\n\nOutput is organised per-site: ``./dashboards/\u003cnetloc\u003e/pages.jsonl`` plus\n``screenshots/`` under each.  See the full recipe (including a YouTube\nframe-extraction path using `yt-dlp` + `ffmpeg`) at\n[docs/recipes/game-dashboards.md](docs/recipes/game-dashboards.md).\n\n**Resume an interrupted crawl:**\n\n```bash\nmarkcrawl --base https://docs.example.com --out ./docs --resume --show-progress\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eHow it compares to other crawlers\u003c/summary\u003e\n\nDifferent tools make different tradeoffs. 
This table summarizes the main differences:\n\n| | MarkCrawl | FireCrawl | Crawl4AI | Scrapy |\n|---|---|---|---|---|\n| License | MIT | AGPL-3.0 | Apache-2.0 | BSD-3 |\n| Install | `pip install markcrawl` | SaaS or self-host | pip + Playwright | pip + framework |\n| Output | Markdown + JSONL | Markdown + JSON | Markdown | Custom pipelines |\n| JS rendering | Optional (`--render-js`) | Built-in | Built-in | Plugin |\n| LLM extraction | Optional add-on | Via API | Built-in | None |\n| Best for | Single-site crawl → Markdown | Hosted scraping API | AI-native crawling | Large-scale distributed |\n\nEach tool has strengths: FireCrawl excels as a hosted API, Crawl4AI has deep browser automation, and Scrapy handles massive distributed workloads. MarkCrawl focuses on simple local crawls that produce LLM-ready Markdown.\n\n### Benchmark results (7 tools, April 2026 — v2 methodology)\n\n**Speed:** markcrawl is fastest (12.1 pages/sec), scrapy+md second (9.5). Playwright-based tools (crawlee, playwright, crawl4ai) average 1.5–2.2 pages/sec.\n\n**Content signal:** markcrawl leads at 99% (ratio of answer-bearing tokens to total output) — almost no navigation, footer, or boilerplate makes it into your embeddings.\n\n**RAG quality:** markcrawl scores 4.52/5 on LLM-judged answer quality (tied #2, leader at 4.53 within noise) and 0.698 MRR (3rd, leader crawlee at 0.733) — with 2.1x fewer chunks than crawlee, keeping embedding costs low.\n\n| Tool | Speed (p/s) | Content Signal | MRR | Answer (/5) | Annual cost (100K pages) |\n|---|---|---|---|---|---|\n| **markcrawl** | **12.1** | **99%** | 0.698 | 4.52 | **$4,505** |\n| scrapy+md | 9.5 | 93% | 0.459 | 4.03 | $5,464 |\n| colly+md | 4.2 | 67% | 0.677 | **4.53** | $7,213 |\n| playwright | 2.2 | 64% | 0.727 | 4.42 | $7,320 |\n| crawlee | 1.7 | 63% | **0.733** | 4.52 | $7,467 |\n| crawl4ai | 1.5 | 83% | 0.694 | 4.43 | $6,960 |\n\nFull benchmark data: [docs/BENCHMARKS.md](docs/BENCHMARKS.md) | Methodology: 
[llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks)\n\n**RAG-optimized recipe (v0.6.0):** With `--i18n-filter --title-at-top` and the opt-in chunker flags (`auto_extract_title=True`, `prepend_first_paragraph=True`, `strip_markdown_links=True` on `chunk_markdown`), markcrawl reaches **0.8148 MRR** on the same 57-query benchmark — a +0.12 jump over the default config (0.698) and +0.08 over the next best tool (crawlee at 0.733).\n\u003c/details\u003e\n\n## Installation\n\n**The core crawler is the only thing you need.** Everything else is optional.\n\n```bash\npip install markcrawl                # Core crawler (free, no API keys)\n```\n\nOptional add-ons:\n\n```bash\npip install markcrawl[extract]       # + LLM extraction (OpenAI, Claude, Gemini, Grok)\npip install markcrawl[js]            # + JavaScript rendering (Playwright)\npip install markcrawl[upload]        # + Supabase upload with embeddings\npip install markcrawl[ml]            # + ReaderLM-v2 extraction backend\npip install markcrawl[mcp]           # + MCP server for AI agents\npip install markcrawl[langchain]     # + LangChain tool wrappers\npip install markcrawl[all]           # Everything\n```\n\nFor Playwright, also run `playwright install chromium` after installing.\n\n\u003cdetails\u003e\n\u003csummary\u003eInstall from source (for development)\u003c/summary\u003e\n\n```bash\ngit clone https://github.com/AIMLPM/markcrawl.git\ncd markcrawl\npython -m venv .venv\nsource .venv/bin/activate\npip install -e \".[all]\"\n```\n\u003c/details\u003e\n\n## Crawling\n\n```bash\nmarkcrawl --base https://www.example.com --out ./output --show-progress\n```\n\nAdd flags as needed:\n\n```bash\n# --include-subdomains  crawl sub.example.com too\n# --render-js           render JavaScript (React, Vue, etc.)\n# --concurrency 5       fetch 5 pages in parallel\n# --proxy               route 
through a proxy\n# --max-pages 200       stop after 200 pages\n# --format markdown     or \"text\" for plain text\nmarkcrawl \\\n  --base https://www.example.com \\\n  --out ./output \\\n  --include-subdomains \\\n  --render-js \\\n  --concurrency 5 \\\n  --proxy http://proxy:8080 \\\n  --max-pages 200 \\\n  --format markdown \\\n  --show-progress\n```\n\nResume an interrupted crawl:\n\n```bash\nmarkcrawl --base https://www.example.com --out ./output --resume --show-progress\n```\n\n### Output\n\nEach page becomes a `.md` file with a citation header:\n\n```markdown\n# Getting Started\n\n\u003e URL: https://docs.example.com/getting-started\n\u003e Crawled: April 04, 2026\n\u003e Citation: Getting Started. docs.example.com. Available at: https://docs.example.com/getting-started [Accessed April 04, 2026].\n\nWelcome to the platform. This guide walks you through installation...\n```\n\nNavigation, footer, cookie banners, and scripts are stripped. Only the main content remains.\n\n\u003cdetails\u003e\n\u003csummary\u003eAll crawler CLI arguments\u003c/summary\u003e\n\n| Argument | Description |\n|---|---|\n| `--base` | Base site URL to crawl |\n| `--out` | Output directory |\n| `--format` | `markdown` or `text` (default: `markdown`) |\n| `--show-progress` | Print progress and crawl events |\n| `--render-js` | Render JavaScript with Playwright before extracting |\n| `--concurrency` | Pages to fetch in parallel (default: `1`) |\n| `--proxy` | HTTP/HTTPS proxy URL |\n| `--resume` | Resume from saved state |\n| `--include-subdomains` | Include subdomains under the base domain |\n| `--max-pages` | Max pages to save; `0` = unlimited (default: `500`) |\n| `--delay` | Minimum delay between requests in seconds (default: `0`, adaptive throttle adjusts automatically) |\n| `--timeout` | Per-request timeout in seconds (default: `15`) |\n| `--min-words` | Skip pages with fewer words (default: `20`) |\n| `--user-agent` | Override the default user agent |\n| `--use-sitemap` / `--no-sitemap` | Enable/disable sitemap discovery. 
Use `--no-sitemap` when you want to scrape a specific page or subsection — without it, large sites (YouTube, GitHub) may discover thousands of unrelated pages via their sitemap |\n| `--exclude-path` | Glob pattern to exclude URL paths (e.g. `'/job/*'`). Can be repeated |\n| `--include-path` | Glob pattern to include URL paths (e.g. `'/blog/*'`). Only matching paths are crawled. Can be repeated |\n| `--dry-run` | Discover URLs (via sitemap/links) and print them without fetching content |\n| `--smart-sample` | Auto-detect templated URL patterns and sample from large clusters instead of crawling every page |\n| `--sample-size` | Pages to sample per templated cluster (default: `5`, used with `--smart-sample`) |\n| `--sample-threshold` | Clusters larger than this are sampled (default: `20`, used with `--smart-sample`) |\n| `--auto-resume` | Automatically resume if saved state exists, otherwise start fresh |\n| `--cross-dedup` | Skip pages already seen in previous crawls to the same output directory |\n| `--prioritize-links` | Score discovered links by predicted content yield — crawl high-value pages first |\n| `--extractor` | Content extraction backend: `default`, `trafilatura`, `ensemble`, or `readerlm` |\n| `--download-images` | Download images from the content area to `assets/` and use local paths in Markdown |\n| `--min-image-size` | Minimum image file size in bytes to keep (default: `5000`). Smaller images are skipped |\n| `--i18n-filter` | Skip URLs under locale path segments (`/fr/`, `/de-DE/`, `/zh-Hans/`, ...) 
— generic, no per-domain config |\n| `--title-at-top` | Prepend `# {title}` to the `text` field of every JSONL row when not already present — top-MRR RAG recipe |\n\u003c/details\u003e\n\n## Optional: structured extraction\n\nIf you need structured data (not just text), the extraction add-on uses an LLM to pull specific fields from each page.\n\n```bash\npip install markcrawl[extract]\n\nmarkcrawl-extract \\\n  --jsonl ./output/pages.jsonl \\\n  --fields company_name pricing features \\\n  --show-progress\n```\n\nAuto-discover fields across multiple crawled sites:\n\n```bash\nmarkcrawl-extract \\\n  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \\\n  --auto-fields \\\n  --context \"competitor pricing analysis\" \\\n  --show-progress\n```\n\nSupports OpenAI, Anthropic (Claude), Google Gemini, and xAI (Grok) via `--provider`.\n\n\u003cdetails\u003e\n\u003csummary\u003eExtraction details\u003c/summary\u003e\n\n### Provider and model selection\n\n```bash\nmarkcrawl-extract --jsonl ... --fields pricing --provider openai         # default\nmarkcrawl-extract --jsonl ... --fields pricing --provider anthropic      # Claude\nmarkcrawl-extract --jsonl ... --fields pricing --provider gemini         # Gemini\nmarkcrawl-extract --jsonl ... --fields pricing --provider grok           # Grok\nmarkcrawl-extract --jsonl ... 
--fields pricing --model gpt-4o           # override model\n```\n\n| Provider | API key env var | Default model |\n|---|---|---|\n| OpenAI | `OPENAI_API_KEY` | `gpt-4o-mini` |\n| Anthropic | `ANTHROPIC_API_KEY` | `claude-sonnet-4-20250514` |\n| Google Gemini | `GEMINI_API_KEY` | `gemini-2.0-flash` |\n| xAI (Grok) | `XAI_API_KEY` | `grok-3-mini-fast` |\n\n### All extraction CLI arguments\n\n| Argument | Description |\n|---|---|\n| `--jsonl` | Path(s) to `pages.jsonl` — pass multiple for cross-site analysis |\n| `--fields` | Field names to extract (space-separated) |\n| `--auto-fields` | Auto-discover fields by sampling pages |\n| `--context` | Describe your goal for auto-discovery |\n| `--sample-size` | Pages to sample for auto-discovery (default: `3`) |\n| `--provider` | `openai`, `anthropic`, `gemini`, or `grok` |\n| `--model` | Override the default model |\n| `--output` | Output path (default: `extracted.jsonl`) |\n| `--delay` | Delay between LLM calls in seconds (default: `0.25`) |\n| `--show-progress` | Print progress |\n\n### Output format\n\nExtracted rows include LLM attribution:\n\n```json\n{\n  \"url\": \"https://competitor.com/pricing\",\n  \"citation\": \"Pricing. competitor.com. Available at: ... [Accessed April 04, 2026].\",\n  \"pricing_tiers\": \"Starter ($29/mo), Pro ($99/mo), Enterprise (contact sales)\",\n  \"extracted_by\": \"gpt-4o-mini (openai)\",\n  \"extraction_note\": \"Field values were extracted by an LLM and may be interpreted, not verbatim.\"\n}\n```\n\u003c/details\u003e\n\n## Optional: Supabase vector search (RAG)\n\nChunk pages, generate embeddings, and upload to Supabase with pgvector:\n\n```bash\npip install markcrawl[upload]\n\nmarkcrawl --base https://docs.example.com --out ./output --show-progress\nmarkcrawl-upload --jsonl ./output/pages.jsonl --show-progress\n```\n\nRequires `SUPABASE_URL`, `SUPABASE_KEY`, and `OPENAI_API_KEY`. 
See **[docs/SUPABASE.md](docs/SUPABASE.md)** for table setup, query examples, and recommendations.\n\n## Optional: agent integrations\n\nMarkCrawl includes integrations for AI agents. Each is an optional add-on.\n\n\u003cdetails\u003e\n\u003csummary\u003eMCP Server (Claude Desktop, Cursor, Windsurf)\u003c/summary\u003e\n\n```bash\npip install markcrawl[mcp]\n```\n\n```json\n{\n  \"mcpServers\": {\n    \"markcrawl\": {\n      \"command\": \"python\",\n      \"args\": [\"-m\", \"markcrawl.mcp_server\"]\n    }\n  }\n}\n```\n\nTools: `crawl_site`, `list_pages`, `read_page`, `search_pages`, `extract_data`\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eLangChain Tool\u003c/summary\u003e\n\n```bash\npip install markcrawl[langchain]\n```\n\n```python\nfrom markcrawl.langchain import all_tools\nfrom langchain_openai import ChatOpenAI\nfrom langchain.agents import initialize_agent, AgentType\n\nagent = initialize_agent(tools=all_tools, llm=ChatOpenAI(model=\"gpt-4o-mini\"),\n                         agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION)\nagent.run(\"Crawl docs.example.com and summarize their auth guide\")\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eOpenClaw Skill (WhatsApp, Telegram, Slack)\u003c/summary\u003e\n\n```bash\nnpx clawhub install markcrawl-skill\n```\n\nSee [AIMLPM/markcrawl-clawhub-skill](https://github.com/AIMLPM/markcrawl-clawhub-skill).\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eLLM assistant prompt\u003c/summary\u003e\n\nCopy the system prompt from **[docs/LLM_PROMPT.md](docs/LLM_PROMPT.md)** into any LLM to get an assistant that generates correct MarkCrawl commands.\n\u003c/details\u003e\n\n## When NOT to use MarkCrawl\n\n- **Sites behind login/auth** — no cookie or session support\n- **Aggressive bot protection** (Cloudflare, Akamai) — no anti-bot evasion\n- **Millions of pages** — designed for hundreds to low thousands; use Scrapy for scale\n- **PDF content** — HTML 
only (PDF support is on the roadmap)\n- **JavaScript SPAs** — add `markcrawl[js]` and use `--render-js` for React/Vue/Angular\n- **Infinite-scroll pages** — `--render-js` renders the initial page load but does not scroll; you'll get the first screenful of content (e.g., ~28 of 82 YouTube videos). For complete listings, combine with the platform's API or RSS feed (e.g., YouTube's `/feeds/videos.xml?channel_id=...`)\n\n## Architecture\n\nMarkCrawl is a web crawler. The optional layers (extraction, upload, agents) are separate add-ons that work with the crawler's output.\n\n```text\nCORE (free, no API keys)              OPTIONAL ADD-ONS\n┌──────────────────────────┐\n│ 1. Discover URLs         │          markcrawl[extract]  — LLM field extraction\n│    (sitemap or links)    │          markcrawl[upload]   — Supabase/pgvector RAG\n│ 2. Fetch \u0026 clean HTML    │          markcrawl[js]       — Playwright JS rendering\n│ 3. Write Markdown + JSONL│          markcrawl[mcp]      — MCP server for agents\n│    + auto-citation       │          markcrawl[langchain] — LangChain tools\n└──────────────────────────┘\n```\n\nFor internals, see **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)**.\n\n## Extending MarkCrawl\n\n```python\nfrom markcrawl import crawl\n\nresult = crawl(\"https://example.com\", out_dir=\"./output\")\nprint(f\"Saved {result.pages_saved} pages\")\n```\n\n```python\n# Process output in your own pipeline\nimport json\nwith open(result.index_file) as f:\n    for line in f:\n        page = json.loads(line)\n        your_db.insert(page)  # Pinecone, Weaviate, Elasticsearch, etc.\n```\n\n```python\n# Use individual components\nfrom markcrawl import chunk_text\nfrom markcrawl.extract import LLMClient, extract_fields\n```\n\nSee **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** for the full module map and extensibility guide.\n\n## Cost\n\nThe core crawler is free. 
Two optional features have API costs:\n\n| Feature | Cost | When |\n|---|---|---|\n| Structured extraction | ~$0.01-0.03 per page | `markcrawl-extract` |\n| Supabase upload | ~$0.0001 per page | `markcrawl-upload` |\n\n## Setting up API keys\n\nOnly needed for extraction and upload. The core crawler requires no keys.\n\n```bash\n# .env — in your working directory\nOPENAI_API_KEY=\"sk-...\"           # extraction (--provider openai) + upload\nANTHROPIC_API_KEY=\"sk-ant-...\"    # extraction (--provider anthropic)\nGEMINI_API_KEY=\"AI...\"            # extraction (--provider gemini)\nXAI_API_KEY=\"xai-...\"             # extraction (--provider grok)\nSUPABASE_URL=\"https://...\"        # upload\nSUPABASE_KEY=\"eyJ...\"             # upload (service-role key)\n```\n\n```bash\n# set -a exports each variable so child processes (markcrawl-extract, markcrawl-upload) can read it\nset -a; source .env; set +a\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eProject structure\u003c/summary\u003e\n\n```text\n.\n├── README.md\n├── LICENSE\n├── PRIVACY.md\n├── SECURITY.md\n├── CONTRIBUTING.md\n├── CODE_OF_CONDUCT.md\n├── Dockerfile\n├── Makefile\n├── glama.json\n├── pyproject.toml\n├── requirements.txt\n├── .github/\n│   ├── pull_request_template.md\n│   └── workflows/\n│       ├── ci.yml\n│       └── publish.yml\n├── docs/\n│   ├── ARCHITECTURE.md\n│   ├── LLM_PROMPT.md\n│   ├── MCP_SUBMISSION.md\n│   ├── RAG_RETRIEVAL_RESEARCH.md\n│   └── SUPABASE.md\n├── tests/\n│   ├── __init__.py\n│   ├── test_chunker.py\n│   ├── test_core.py\n│   ├── test_extract.py\n│   └── test_upload.py\n└── markcrawl/\n    ├── __init__.py\n    ├── cli.py\n    ├── core.py               # orchestrator\n    ├── fetch.py              # HTTP/Playwright fetching\n    ├── robots.py             # robots.txt parsing\n    ├── throttle.py           # adaptive rate limiting\n    ├── state.py              # crawl state \u0026 resume\n    ├── urls.py               # URL normalization \u0026 filtering\n    ├── extract_content.py    # HTML → Markdown conversion\n    ├── dedup.py              # cross-crawl deduplication\n    ├── 
link_scorer.py        # link prioritization\n    ├── chunker.py\n    ├── exceptions.py\n    ├── utils.py\n    ├── extract.py            # LLM field extraction\n    ├── extract_cli.py\n    ├── upload.py\n    ├── upload_cli.py\n    ├── langchain.py\n    └── mcp_server.py\n```\n\u003c/details\u003e\n\n## Roadmap\n\n- [ ] Canonical URL support\n- [ ] PDF support\n- [ ] Authenticated crawling\n- [ ] Multi-provider embeddings\n\n\u003cdetails\u003e\n\u003csummary\u003eShipped features\u003c/summary\u003e\n\n- `pip install markcrawl` on PyPI\n- 200 automated tests + GitHub Actions CI (Python 3.10-3.13) + ruff linting\n- Markdown and plain text output with auto-citation\n- Sitemap-first crawling with robots.txt compliance\n- Text chunking with configurable overlap + semantic chunking\n- Supabase/pgvector upload for RAG\n- JavaScript rendering via Playwright\n- Concurrent fetching and proxy support\n- Resume interrupted crawls + auto-resume\n- LLM extraction (OpenAI, Claude, Gemini, Grok) with auto-field discovery\n- MCP server, LangChain tools, OpenClaw skill\n- Image alt text preservation\n- Python API (`result.pages`)\n- Page-type extraction and content-region heuristics\n- Multiple extraction backends (default, trafilatura, ensemble, ReaderLM-v2)\n- Cross-crawl deduplication (`--cross-dedup`)\n- Link prioritization by predicted content yield (`--prioritize-links`)\n- Smart sampling of templated URL clusters (`--smart-sample`)\n- URL path filtering (`--include-path`, `--exclude-path`) and dry-run preview\n\u003c/details\u003e\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md). If you used an LLM to generate code, include the prompt in your PR.\n\n## Security\n\nSee [SECURITY.md](SECURITY.md).\n\n## Privacy\n\nMarkCrawl runs locally. No telemetry, no analytics, no data sent anywhere. See [PRIVACY.md](PRIVACY.md).\n\n## License\n\nMIT. 
See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimlpm%2Fmarkcrawl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faimlpm%2Fmarkcrawl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimlpm%2Fmarkcrawl/lists"}