{"id":27404832,"url":"https://github.com/jamesn-dev/scroll-scribe","last_synced_at":"2025-07-28T18:03:38.821Z","repository":{"id":287188632,"uuid":"963672612","full_name":"JamesN-dev/Scroll-Scribe","owner":"JamesN-dev","description":"ScrollScribe is a set of Python tools that grab docs or index website pages and converts them into clean local Markdown using browser automation and LLM filtering, perfect for building RAG datasets.","archived":false,"fork":false,"pushed_at":"2025-06-21T03:46:36.000Z","size":1435,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-21T04:31:36.388Z","etag":null,"topics":["crawl4ai","llm","markdowngenerator","python","rag","webscraper","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JamesN-dev.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-10T03:26:04.000Z","updated_at":"2025-05-02T02:49:12.000Z","dependencies_parsed_at":"2025-06-21T04:26:36.152Z","dependency_job_id":"c8fb926c-edad-49af-95e1-6e09884f2299","html_url":"https://github.com/JamesN-dev/Scroll-Scribe","commit_stats":null,"previous_names":["jamesn-dev/scroll-scribe"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/JamesN-dev/Scroll-Scribe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JamesN-dev%2FScroll-Scribe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JamesN-dev%2FScroll-Scribe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JamesN-dev%2FScroll-Scribe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JamesN-dev%2FScroll-Scribe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JamesN-dev","download_url":"https://codeload.github.com/JamesN-dev/Scroll-Scribe/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JamesN-dev%2FScroll-Scribe/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267560430,"owners_count":24107498,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-28T02:00:09.689Z","response_time":68,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawl4ai","llm","markdowngenerator","python","rag","webscraper","webscraping"],"created_at":"2025-04-14T05:47:51.155Z","updated_at":"2025-07-28T18:03:38.810Z","avatar_url":"https://github.com/JamesN-dev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://40gwsazwi1.ufs.sh/f/GNQKLEu6hrNnLVZuuzM7lFAhGirR1v0IKaEQxCWZNeDoBMOj\" width=\"120\" height=\"120\" alt=\"ScrollScribe Logo\"\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eScrollScribe\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eCLI toolkit for ML engineers, developers, data scientists, and researchers.\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  Extract docs to Markdown • Generate rich CSV/JSON metadata • Prepare data for vector databases\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/unclecode/crawl4ai\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=for-the-badge\u0026logo=python\u0026logoColor=white\" alt=\"Powered by Crawl4AI\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/tiangolo/typer\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/CLI-Typer-orange?style=for-the-badge\u0026logo=python\u0026logoColor=white\" alt=\"Built with Typer\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://www.python.org/\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Python-3.10%2B-blue?style=for-the-badge\u0026logo=python\u0026logoColor=white\" alt=\"Python 3.10+\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/JamesN-dev/Scroll-Scribe/blob/main/LICENSE\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/License-MIT-green?style=for-the-badge\" alt=\"MIT License\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#the-toolkit\"\u003eToolkit\u003c/a\u003e |\n  \u003ca href=\"#what-scrollscribe-does\"\u003eFeatures\u003c/a\u003e |\n  \u003ca href=\"#installation\"\u003eInstallation\u003c/a\u003e |\n  \u003ca href=\"#quick-start\"\u003eQuick Start\u003c/a\u003e |\n  \u003ca href=\"#processing-modes\"\u003eProcessing Modes\u003c/a\u003e |\n  \u003ca href=\"#commands\"\u003eCommands\u003c/a\u003e |\n  \u003ca href=\"#troubleshooting\"\u003eFAQ\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\nWith ScrollScribe, you can build your own docs library in minutes. Automatically discover all pages on a documentation site and convert them to clean Markdown files—perfect for agentic workflows, custom search systems, or offline documentation.\n\n---\n\n## The Toolkit\n\n### `discover` - URL Extraction + Metadata\n\nExtract URLs with rich metadata (keywords, depth, timestamps) exported as TXT, CSV, or JSON.\n\n### `scrape` - Page Processing\n\nProcess single pages or URL lists. Choose fast mode (500+ pages/min) or LLM mode (publication-ready Markdown).\n\n### `process` - Unified Pipeline\n\nPoint and go: discover + scrape in one command. Fast for bulk extraction, LLM for high-quality output.\n\n---\n\n## ⚡ Processing Modes\n\n### Fast Mode (`--fast`)\n\n- Quickly converts large documentation sites—great for bulk extraction, drafts, or when you don’t need perfect formatting. No API key required.\n\n### AI Mode (default, or `--no-fast`)\n\n- Uses LLMs for the highest quality Markdown output—ideal for publishing, feeding into websites, or when you want perfectly structured docs. Requires an API key and takes longer per page.\n\n---\n\n## What ScrollScribe Does\n\n1. **Discovers** URLs from documentation sites with rich metadata (keywords, depth, timestamps) - export as TXT, CSV, or JSON\n2. **Processes** single pages or entire URL lists - choose fast mode (500+ pages/min) or LLM mode for publication-quality output\n3. **Converts** HTML to clean Markdown with preserved formatting, code blocks, and working links\n4. **Outputs** structured data perfect for AI agents, vector databases, or offline documentation\n\n**Examples:**\n\n- `scribe discover docs.fastapi.com -o urls.json` → Get 200+ URLs with metadata for analysis\n- `scribe process docs.fastapi.com -o fastapi-docs/` → Get 200+ clean Markdown files\n- `scribe scrape single-page.html -o output/` → Process just one page\n\n---\n\n## Installation\n\n```bash\ngit clone https://github.com/your-username/scrollscribe\ncd scrollscribe\nuv sync  # or pip install -r requirements.txt\n```\n\n---\n\n## Quick Start\n\n### Basic Usage\n\n```bash\n# Convert entire documentation site to Markdown\nscribe process https://docs.fastapi.com/ -o fastapi-docs/\n\n# That's it! All pages are now in the fastapi-docs/ folder\n```\n\n### Set up API Key (Recommended)\n\nFor highest quality output, add your API key:\n\n```bash\n# Create .env file with your API key\necho \"OPENROUTER_API_KEY=your-key-here\" \u003e .env\n\n# Now uses best model by default (Codestral 2501)\nscribe process https://docs.fastapi.com/ -o fastapi-docs/\n```\n\n## Processing Modes\n\nScrollScribe offers two processing modes depending on your needs:\n\n| Feature           | **Fast Mode**                 | **AI Mode**                     |\n| ----------------- | ----------------------------- | ------------------------------- |\n| **Speed**         | 50-200 pages/minute           | 10-15 pages/minute              |\n| **Cost**          | Free                          | ~$0.005 per page (Codestral)    |\n| **Quality**       | Good - removes navigation/ads | Excellent - AI-filtered content |\n| **API Key**       | Not required                  | Required                        |\n| **Best For**      | Large sites, quick extraction | High-quality documentation      |\n| **Default Model** | N/A                           | Codestral 2501                  |\n\n### Fast Mode (No API Key Needed)\n\n```bash\n# Fast processing - no API key required\nscribe process https://docs.fastapi.com/ -o fastapi-docs/ --fast\n```\n\n**Good for:**\n\n- Large documentation sites (1000+ pages)\n- Quick content extraction\n- When you don't want to pay for API calls\n\n### AI Mode (Default with API Key)\n\n```bash\n# Uses Codestral 2501 by default - best quality\nscribe process https://docs.fastapi.com/ -o fastapi-docs/\n```\n\n**Good for:**\n\n- High-quality documentation extraction (default mode)\n- When clean formatting is important\n- Feeding into other AI tools\n\n## Commands\n\nScrollScribe has three main commands:\n\n### `process` - Complete Pipeline (Most Common)\n\nConvert an entire documentation site in one command:\n\n```bash\n# Discover all pages and convert them to Markdown\nscribe process https://docs.fastapi.com/ -o fastapi-docs/\n```\n\n### `discover` - Find All Documentation Pages\n\nExtract URLs from a site with optional metadata (useful for manual curation):\n\n```bash\n# Get simple list of URLs\nscribe discover https://docs.fastapi.com/ -o urls.txt\n\n# Get rich metadata with depth, keywords, and timestamps\nscribe discover https://docs.fastapi.com/ -o urls.json\n\n# Get CSV format for spreadsheet analysis\nscribe discover https://docs.fastapi.com/ -o urls.csv\n```\n\n**Output Formats:**\n\n- **`.txt`** - Simple URL list (default)\n- **`.csv`** - Rich metadata in spreadsheet format with columns for depth, keywords, timestamps, and filenames\n- **`.json`** - Same rich metadata as structured objects for programming\n\n**JSON metadata example:**\n\n```json\n{\n  \"url\": \"https://docs.fastapi.com/tutorial/first-steps/\",\n  \"path\": \"/tutorial/first-steps/\",\n  \"depth\": 2,\n  \"keywords\": [\"tutorial\", \"first\", \"steps\"],\n  \"filename_part\": \"tutorial/first-steps\",\n  \"discovered_at\": \"2025-06-24T19:43:27.627987\"\n}\n```\n\n**Why use discover separately?**\n\n- **Manual curation**: Edit output files to remove pages you don't want\n- **Planning**: See how many pages and site structure before processing\n- **Analysis**: Use JSON metadata to understand site hierarchy and content types\n- **Selective processing**: Only download the pages you actually need\n\n### `scrape` - Convert to Markdown\n\nProcess URLs or a single page:\n\n```bash\n# Process a curated list of URLs\nscribe scrape urls.txt -o fastapi-docs/\n\n# Process a single page\nscribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/\n```\n\n**Smart input detection**: `scrape` automatically detects if you're giving it:\n\n- A `.txt` file with URLs (one per line)\n- A single webpage URL (`http://` or `https://`)\n\n## API Keys \u0026 Models\n\n**Default Model**: `openrouter/mistralai/codestral-2501` ⭐ (Best quality)\n\n### Alternative Models\n\n- `openrouter/google/gemini-2.0-flash-exp:free` (Free tier)\n- `openrouter/anthropic/claude-3-haiku` (Fast premium)\n\n### Setting API Key\n\n```bash\n# Setup: Add API keys to .env file\necho \"OPENROUTER_API_KEY=your-openrouter-key\" \u003e\u003e .env\necho \"ANTHROPIC_API_KEY=your-anthropic-key\" \u003e\u003e .env\necho \"MISTRAL_API_KEY=your-mistral-key\" \u003e\u003e .env\n\n# Use default API key (OPENROUTER_API_KEY)\nscribe process https://docs.example.com/ -o output/\n\n# Use a different API key variable\nscribe process https://docs.example.com/ -o output/ --api-key-env ANTHROPIC_API_KEY\n```\n\n### Changing Models\n\n```bash\n# Use a different model with its corresponding API key\nscribe process https://docs.example.com/ -o output/ \\\n  --model openrouter/anthropic/claude-3-haiku \\\n  --api-key-env ANTHROPIC_API_KEY\n\n# Use free model (still needs OpenRouter key)\nscribe process https://docs.example.com/ -o output/ \\\n  --model openrouter/google/gemini-2.0-flash-exp:free\n```\n\nGet a free API key at [OpenRouter](https://openrouter.ai/).\n\n## Workflow Examples\n\n### Complete Workflow (Most Common)\n\n```bash\n# One command to rule them all\nscribe process https://docs.fastapi.com/ -o fastapi-docs/\n```\n\n### Curated Workflow (Manual Selection)\n\n```bash\n# Step 1: Discover all pages\nscribe discover https://docs.fastapi.com/ -o urls.txt\n\n# Step 2: Edit urls.txt - remove pages you don't want\n# Step 3: Process only the pages you kept\nscribe scrape urls.txt -o fastapi-docs/\n```\n\n### Single Page\n\n```bash\n# Process just one specific page\nscribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/\n```\n\n### For Developers\n\n- **Offline Documentation**: Work with docs without internet\n- **AI Tools**: Feed clean docs into Claude, ChatGPT, or local AI\n- **Documentation Search**: Build custom search for your team\n- **Backup**: Archive documentation that might change or disappear\n\n### For Teams\n\n- **Internal Knowledge Base**: Convert internal wikis to searchable Markdown\n- **Compliance**: Archive API documentation for regulatory requirements\n- **Training Data**: Clean documentation for training custom models\n\n### For Researchers\n\n- **Literature Review**: Convert technical documentation for analysis\n- **Comparative Studies**: Analyze documentation across different tools\n- **Academic Research**: Study how projects document their APIs\n\n## Advanced Usage\n\n### Separate Discovery and Processing\n\n```bash\n# Step 1: Discover all URLs (fast)\nscribe discover https://docs.fastapi.com/ -o urls.txt\n\n# Step 2: Process URLs to Markdown\nscribe scrape urls.txt -o fastapi-docs/\n```\n\n### Resume Processing\n\n```bash\n# Resume from the 50th page or URL #50 if processing was interrupted\nscribe scrape urls.txt -o output/ --start-at 50\n```\n\n### Custom Settings\n\n```bash\n# Use different model with custom timeout\nscribe process https://docs.example.com/ -o output/ \\\n  --model openrouter/anthropic/claude-3-haiku \\\n  --timeout 120000 \\\n  --verbose\n\n# Use different API key variable\nscribe process https://docs.example.com/ -o output/ \\\n  --api-key-env ANTHROPIC_API_KEY\n\n# Combine custom model and API key variable\nscribe process https://docs.example.com/ -o output/ \\\n  --model openrouter/mistralai/codestral-2501 \\\n  --api-key-env OPENROUTER_API_KEY \\\n  --verbose\n```\n\n## Output Structure\n\nScrollScribe saves one Markdown file per documentation page in the output folder you specify.\nYou choose the folder name—organize by language, project, or however you like.\n\n```bash\nscribe process https://docs.python.org/3/ -o python-docs/\nscribe process https://developer.mozilla.org/en-US/docs/Web/JavaScript -o javascript-docs/\n```\n\n```\noutput/\n├── python-docs/\n│   ├── index.md                # Homepage\n│   ├── getting-started.md      # Getting started guide\n│   ├── ...                     # Other pages\n├── javascript-docs/\n    ├── index.md\n    └── ...\n\n```\n\nEach file contains:\n\n- Clean Markdown formatting\n- Preserved code blocks and syntax highlighting\n- Working internal links (converted to relative paths)\n- Original page title as the filename\n\nThis flexible structure makes it easy to build your own docs library, organize by project or language, and prepare for **future features like serving docs with an MCP server**.\n\n### Large Sites (Use Fast Mode)\n\n```bash\n# Large documentation sites - use fast mode for speed\nscribe process https://docs.microsoft.com/en-us/azure/ -o azure-docs/ --fast\nscribe process https://developer.mozilla.org/en-US/docs/ -o mdn-docs/ --fast\n```\n\n## Troubleshooting\n\n### \"API key not found\"\n\nCreate a `.env` file with your OpenRouter API key:\n\n```bash\necho \"OPENROUTER_API_KEY=your-key-here\" \u003e .env\n```\n\n### \"Rate limit error\"\n\nScrollScribe automatically retries with backoff. For persistent issues:\n\n- Try the free models first\n- Use `--fast` mode to avoid API calls entirely\n\n### \"Some pages failed\"\n\nSome sites block automated access. ScrollScribe will:\n\n- Show which URLs failed\n- Continue processing other pages\n- Let you retry failed URLs later\n\n### Site-specific issues\n\n```bash\n# Increase timeout for slow sites\nscribe process https://slow-site.com/ -o output/ --timeout 120000\n\n# Use verbose mode to see what's happening\nscribe process https://site.com/ -o output/ --verbose\n```\n\n## What's Different About ScrollScribe\n\nUnlike simple web scrapers, ScrollScribe:\n\n- **Understands documentation structure** - follows internal links intelligently\n- **Cleans content** - removes navigation, ads, and irrelevant elements\n- **Preserves formatting** - maintains code blocks, headers, and structure\n- **Handles modern sites** - works with JavaScript-heavy documentation\n- **Scales efficiently** - processes hundreds of pages reliably\n\n## Contributing\n\nFound a bug or want to add a feature?\n\n1. Open an issue describing the problem\n2. Fork the repository\n3. Make your changes\n4. Submit a pull request\n\n### Building \u0026 Publishing\n\nThis project uses [Hatch](https://hatch.pypa.io/) for building and publishing. Contributors should have it installed.\n\n## License\n\nMIT License - use ScrollScribe for any purpose, commercial or personal.\n\n---\n\n**ScrollScribe** - Turn any documentation site into clean Markdown files or structured metadata for AI processing.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjamesn-dev%2Fscroll-scribe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjamesn-dev%2Fscroll-scribe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjamesn-dev%2Fscroll-scribe/lists"}