https://github.com/jamesn-dev/scroll-scribe

ScrollScribe is a set of Python tools that grab docs or index website pages and convert them into clean local Markdown using browser automation and LLM filtering, perfect for building RAG datasets.

crawl4ai llm markdowngenerator python rag webscraper webscraping

Last synced: 11 months ago


Host: GitHub
URL: https://github.com/jamesn-dev/scroll-scribe
Owner: JamesN-dev
License: apache-2.0
Created: 2025-04-10T03:26:04.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-21T03:46:36.000Z (about 1 year ago)
Last Synced: 2025-06-21T04:31:36.388Z (about 1 year ago)
Topics: crawl4ai, llm, markdowngenerator, python, rag, webscraper, webscraping
Language: Python
Homepage:
Size: 1.37 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE


README


# ScrollScribe

CLI toolkit for ML engineers, developers, data scientists, and researchers.

Extract docs to Markdown • Generate rich CSV/JSON metadata • Prepare data for vector databases

---

With ScrollScribe, you can build your own docs library in minutes. Automatically discover all pages on a documentation site and convert them to clean Markdown files—perfect for agentic workflows, custom search systems, or offline documentation.

---

## The Toolkit

### `discover` - URL Extraction + Metadata

Extract URLs with rich metadata (keywords, depth, timestamps) exported as TXT, CSV, or JSON.

### `scrape` - Page Processing

Process single pages or URL lists. Choose fast mode (500+ pages/min) or LLM mode (publication-ready Markdown).

### `process` - Unified Pipeline

Point and go: discover + scrape in one command. Fast for bulk extraction, LLM for high-quality output.

---

## ⚡ Processing Modes

### Fast Mode (`--fast`)

- Quickly converts large documentation sites—great for bulk extraction, drafts, or when you don’t need perfect formatting. No API key required.

### AI Mode (default, or `--no-fast`)

- Uses LLMs for the highest quality Markdown output—ideal for publishing, feeding into websites, or when you want perfectly structured docs. Requires an API key and takes longer per page.

---

## What ScrollScribe Does

1. **Discovers** URLs from documentation sites with rich metadata (keywords, depth, timestamps) - export as TXT, CSV, or JSON
2. **Processes** single pages or entire URL lists - choose fast mode (500+ pages/min) or LLM mode for publication-quality output
3. **Converts** HTML to clean Markdown with preserved formatting, code blocks, and working links
4. **Outputs** structured data perfect for AI agents, vector databases, or offline documentation

**Examples:**

- `scribe discover docs.fastapi.com -o urls.json` → Get 200+ URLs with metadata for analysis
- `scribe process docs.fastapi.com -o fastapi-docs/` → Get 200+ clean Markdown files
- `scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/` → Process just one page

---

## Installation

```bash
git clone https://github.com/jamesn-dev/scroll-scribe
cd scroll-scribe
uv sync # or pip install -r requirements.txt
```

---

## Quick Start

### Basic Usage

```bash
# Convert entire documentation site to Markdown
scribe process https://docs.fastapi.com/ -o fastapi-docs/

# That's it! All pages are now in the fastapi-docs/ folder
```

### Set up API Key (Recommended)

For highest quality output, add your API key:

```bash
# Create .env file with your API key
echo "OPENROUTER_API_KEY=your-key-here" > .env

# Now uses best model by default (Codestral 2501)
scribe process https://docs.fastapi.com/ -o fastapi-docs/
```

## Processing Modes

ScrollScribe offers two processing modes depending on your needs:

| Feature | **Fast Mode** | **AI Mode** |
| ----------------- | ----------------------------- | ------------------------------- |
| **Speed** | 50-200 pages/minute | 10-15 pages/minute |
| **Cost** | Free | ~$0.005 per page (Codestral) |
| **Quality** | Good - removes navigation/ads | Excellent - AI-filtered content |
| **API Key** | Not required | Required |
| **Best For** | Large sites, quick extraction | High-quality documentation |
| **Default Model** | N/A | Codestral 2501 |
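
As a rough planning aid, the figures in the table can be turned into a time and cost estimate. The sketch below is illustrative only: `MODE_FIGURES` and `estimate_run` are hypothetical names, not part of ScrollScribe, and the numbers are copied from the table rather than measured.

```python
# Throughput and cost figures copied from the comparison table.
# MODE_FIGURES and estimate_run are illustrative names, not ScrollScribe APIs.
MODE_FIGURES = {
    "fast": {"pages_per_minute": (50, 200), "cost_per_page_usd": 0.0},
    "ai": {"pages_per_minute": (10, 15), "cost_per_page_usd": 0.005},
}


def estimate_run(pages: int, mode: str) -> dict:
    """Return best-case and worst-case duration in minutes plus approximate cost."""
    slowest, fastest = MODE_FIGURES[mode]["pages_per_minute"]
    return {
        "best_case_minutes": pages / fastest,
        "worst_case_minutes": pages / slowest,
        "approx_cost_usd": round(pages * MODE_FIGURES[mode]["cost_per_page_usd"], 2),
    }


for mode in ("fast", "ai"):
    print(mode, estimate_run(200, mode))
```

For a 200-page site, these figures suggest roughly 1 to 4 minutes at no cost in fast mode, versus about 13 to 20 minutes and around $1.00 in AI mode.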

### Fast Mode (No API Key Needed)

```bash
# Fast processing - no API key required
scribe process https://docs.fastapi.com/ -o fastapi-docs/ --fast
```

**Good for:**

- Large documentation sites (1000+ pages)
- Quick content extraction
- When you don't want to pay for API calls

### AI Mode (Default with API Key)

```bash
# Uses Codestral 2501 by default - best quality
scribe process https://docs.fastapi.com/ -o fastapi-docs/
```

**Good for:**

- High-quality documentation extraction (default mode)
- When clean formatting is important
- Feeding into other AI tools

## Commands

ScrollScribe has three main commands:

### `process` - Complete Pipeline (Most Common)

Convert an entire documentation site in one command:

```bash
# Discover all pages and convert them to Markdown
scribe process https://docs.fastapi.com/ -o fastapi-docs/
```

### `discover` - Find All Documentation Pages

Extract URLs from a site with optional metadata (useful for manual curation):

```bash
# Get simple list of URLs
scribe discover https://docs.fastapi.com/ -o urls.txt

# Get rich metadata with depth, keywords, and timestamps
scribe discover https://docs.fastapi.com/ -o urls.json

# Get CSV format for spreadsheet analysis
scribe discover https://docs.fastapi.com/ -o urls.csv
```

**Output Formats:**

- **`.txt`** - Simple URL list (default)
- **`.csv`** - Rich metadata in spreadsheet format with columns for depth, keywords, timestamps, and filenames
- **`.json`** - Same rich metadata as structured objects for programming

**JSON metadata example:**

```json
{
  "url": "https://docs.fastapi.com/tutorial/first-steps/",
  "path": "/tutorial/first-steps/",
  "depth": 2,
  "keywords": ["tutorial", "first", "steps"],
  "filename_part": "tutorial/first-steps",
  "discovered_at": "2025-06-24T19:43:27.627987"
}
```
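
The fields in this record look derivable from the URL itself. Here is a minimal sketch of one way to compute them, assuming `depth` counts non-empty path segments and `keywords` are those segments split on hyphens and underscores. `describe_url` is a hypothetical helper, and ScrollScribe's own logic may differ.

```python
import re
from datetime import datetime
from urllib.parse import urlparse


def describe_url(url: str) -> dict:
    """Build a metadata record shaped like the JSON example from a single URL."""
    path = urlparse(url).path
    # Non-empty path segments, e.g. ["tutorial", "first-steps"].
    segments = [segment for segment in path.split("/") if segment]
    # Assumption: keywords are segments split on hyphens and underscores.
    keywords = [word for segment in segments for word in re.split(r"[-_]+", segment) if word]
    return {
        "url": url,
        "path": path,
        "depth": len(segments),
        "keywords": keywords,
        "filename_part": "/".join(segments),
        "discovered_at": datetime.now().isoformat(),
    }


print(describe_url("https://docs.fastapi.com/tutorial/first-steps/"))
```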

**Why use discover separately?**

- **Manual curation**: Edit output files to remove pages you don't want
- **Planning**: See the page count and site structure before processing
- **Analysis**: Use JSON metadata to understand site hierarchy and content types
- **Selective processing**: Only download the pages you actually need
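
Curation can also be scripted rather than done by hand. The sketch below filters records by depth and keyword and formats the result as a one-URL-per-line list. It assumes `urls.json` holds a JSON array of objects shaped like the example record; `select_urls` and `to_url_list` are hypothetical helpers, not ScrollScribe functions.

```python
import json

# Sample records in the shape of the JSON metadata example.
# Assumption: urls.json contains a JSON array of such objects.
SAMPLE = json.loads("""
[
  {"url": "https://docs.fastapi.com/", "depth": 0, "keywords": []},
  {"url": "https://docs.fastapi.com/tutorial/first-steps/", "depth": 2,
   "keywords": ["tutorial", "first", "steps"]},
  {"url": "https://docs.fastapi.com/release-notes/", "depth": 1,
   "keywords": ["release", "notes"]},
  {"url": "https://docs.fastapi.com/advanced/security/oauth2-scopes/", "depth": 3,
   "keywords": ["advanced", "security", "oauth2", "scopes"]}
]
""")


def select_urls(records, max_depth=2, skip_keywords=("release",)):
    """Keep URLs no deeper than max_depth that share no keyword with skip_keywords."""
    skipped = set(skip_keywords)
    return [
        record["url"]
        for record in records
        if record["depth"] <= max_depth and skipped.isdisjoint(record["keywords"])
    ]


def to_url_list(urls):
    """Format URLs one per line, the layout of a .txt URL list."""
    return "\n".join(urls) + "\n"


print(to_url_list(select_urls(SAMPLE)), end="")
```

Saving that output as `urls.txt` and passing it to `scribe scrape urls.txt -o fastapi-docs/` would then process only the selected pages.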

### `scrape` - Convert to Markdown

Process URLs or a single page:

```bash
# Process a curated list of URLs
scribe scrape urls.txt -o fastapi-docs/

# Process a single page
scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/
```

**Smart input detection**: `scrape` automatically detects if you're giving it:

- A `.txt` file with URLs (one per line)
- A single webpage URL (`http://` or `https://`)
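
The detection rule described above can be sketched in a few lines. This is an illustration of the idea, not ScrollScribe's actual code: `classify_input` and `parse_url_file` are hypothetical names, and skipping blank lines in the URL file is an assumption.

```python
from pathlib import Path


def classify_input(target: str) -> str:
    """Return "url" for a web address or "url_file" for a .txt list of URLs."""
    if target.lower().startswith(("http://", "https://")):
        return "url"
    if Path(target).suffix.lower() == ".txt":
        return "url_file"
    raise ValueError(f"Unsupported input: {target!r}")


def parse_url_file(text: str) -> list[str]:
    """Read one URL per line, ignoring blank lines and surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


print(classify_input("urls.txt"))
print(classify_input("https://docs.fastapi.com/tutorial/first-steps/"))
```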

## API Keys & Models

**Default Model**: `openrouter/mistralai/codestral-2501` ⭐ (Best quality)

### Alternative Models

- `openrouter/google/gemini-2.0-flash-exp:free` (Free tier)
- `openrouter/anthropic/claude-3-haiku` (Fast premium)

### Setting API Key

```bash
# Setup: Add API keys to .env file
echo "OPENROUTER_API_KEY=your-openrouter-key" >> .env
echo "ANTHROPIC_API_KEY=your-anthropic-key" >> .env
echo "MISTRAL_API_KEY=your-mistral-key" >> .env

# Use default API key (OPENROUTER_API_KEY)
scribe process https://docs.example.com/ -o output/

# Use a different API key variable
scribe process https://docs.example.com/ -o output/ --api-key-env ANTHROPIC_API_KEY
```
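
To make the idea behind `--api-key-env` concrete, here is a hedged sketch of how a tool might resolve a key from a named variable. `parse_dotenv` and `resolve_api_key` are hypothetical helpers, and letting the process environment take precedence over `.env` is an assumption; ScrollScribe's real loading logic may differ.

```python
import os


def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(env_name="OPENROUTER_API_KEY", dotenv_text="", environ=None):
    """Look up env_name in the environment first, then in the .env contents."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_name) or parse_dotenv(dotenv_text).get(env_name)
    if not key:
        raise KeyError(f"API key not found: set {env_name} in your environment or .env")
    return key


dotenv = "OPENROUTER_API_KEY=your-openrouter-key\nANTHROPIC_API_KEY=your-anthropic-key\n"
print(resolve_api_key("ANTHROPIC_API_KEY", dotenv, environ={}))
```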

### Changing Models

```bash
# Use a different model with its corresponding API key
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/anthropic/claude-3-haiku \
  --api-key-env ANTHROPIC_API_KEY

# Use free model (still needs OpenRouter key)
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/google/gemini-2.0-flash-exp:free
```

Get a free API key at [OpenRouter](https://openrouter.ai/).

## Workflow Examples

### Complete Workflow (Most Common)

```bash
# One command to rule them all
scribe process https://docs.fastapi.com/ -o fastapi-docs/
```

### Curated Workflow (Manual Selection)

```bash
# Step 1: Discover all pages
scribe discover https://docs.fastapi.com/ -o urls.txt

# Step 2: Edit urls.txt - remove pages you don't want
# Step 3: Process only the pages you kept
scribe scrape urls.txt -o fastapi-docs/
```

### Single Page

```bash
# Process just one specific page
scribe scrape https://docs.fastapi.com/tutorial/first-steps/ -o output/
```

### For Developers

- **Offline Documentation**: Work with docs without internet
- **AI Tools**: Feed clean docs into Claude, ChatGPT, or local AI
- **Documentation Search**: Build custom search for your team
- **Backup**: Archive documentation that might change or disappear

### For Teams

- **Internal Knowledge Base**: Convert internal wikis to searchable Markdown
- **Compliance**: Archive API documentation for regulatory requirements
- **Training Data**: Clean documentation for training custom models

### For Researchers

- **Literature Review**: Convert technical documentation for analysis
- **Comparative Studies**: Analyze documentation across different tools
- **Academic Research**: Study how projects document their APIs

## Advanced Usage

### Separate Discovery and Processing

```bash
# Step 1: Discover all URLs (fast)
scribe discover https://docs.fastapi.com/ -o urls.txt

# Step 2: Process URLs to Markdown
scribe scrape urls.txt -o fastapi-docs/
```

### Resume Processing

```bash
# Resume from URL #50 in the list if processing was interrupted
scribe scrape urls.txt -o output/ --start-at 50
```
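
Conceptually, resuming means skipping the entries that were already processed while keeping the original numbering. The sketch below illustrates this with a hypothetical `remaining_urls` helper. It treats the start position as a zero-based index, which is an assumption; the README does not say whether `--start-at` counts from 0 or 1.

```python
def remaining_urls(urls, start_at=0):
    """Yield (position, url) pairs from start_at onward, keeping original numbering."""
    for position in range(start_at, len(urls)):
        yield position, urls[position]


urls = [f"https://docs.example.com/page-{n}/" for n in range(5)]
for position, url in remaining_urls(urls, start_at=3):
    print(position, url)
```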

### Custom Settings

```bash
# Use different model with custom timeout
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/anthropic/claude-3-haiku \
  --timeout 120000 \
  --verbose

# Use different API key variable
scribe process https://docs.example.com/ -o output/ \
  --api-key-env ANTHROPIC_API_KEY

# Combine custom model and API key variable
scribe process https://docs.example.com/ -o output/ \
  --model openrouter/mistralai/codestral-2501 \
  --api-key-env OPENROUTER_API_KEY \
  --verbose
```

## Output Structure

ScrollScribe saves one Markdown file per documentation page in the output folder you specify.
You choose the folder name—organize by language, project, or however you like.

```bash
scribe process https://docs.python.org/3/ -o python-docs/
scribe process https://developer.mozilla.org/en-US/docs/Web/JavaScript -o javascript-docs/
```

```
output/
├── python-docs/
│   ├── index.md            # Homepage
│   ├── getting-started.md  # Getting started guide
│   └── ...                 # Other pages
└── javascript-docs/
    ├── index.md
    └── ...
```

Each file contains:

- Clean Markdown formatting
- Preserved code blocks and syntax highlighting
- Working internal links (converted to relative paths)
- Original page title as the filename
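
As an illustration of how a page title can become a safe filename, here is a minimal slug sketch. `title_to_filename` is a hypothetical helper, and the exact rules (lowercasing, hyphens, falling back to `index`) are assumptions rather than ScrollScribe's documented behavior.

```python
import re
import unicodedata


def title_to_filename(title: str, extension: str = ".md") -> str:
    """Turn a page title into a lowercase, hyphen-separated filename."""
    # Decompose accented characters, then drop anything outside ASCII.
    normalized = unicodedata.normalize("NFKD", title)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")
    # Assumption: an empty title falls back to "index".
    return (slug or "index") + extension


print(title_to_filename("Getting Started"))
```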

This flexible structure makes it easy to build your own docs library, organize by project or language, and prepare for **future features like serving docs with an MCP server**.

### Large Sites (Use Fast Mode)

```bash
# Large documentation sites - use fast mode for speed
scribe process https://docs.microsoft.com/en-us/azure/ -o azure-docs/ --fast
scribe process https://developer.mozilla.org/en-US/docs/ -o mdn-docs/ --fast
```

## Troubleshooting

### "API key not found"

Create a `.env` file with your OpenRouter API key:

```bash
echo "OPENROUTER_API_KEY=your-key-here" > .env
```

### "Rate limit error"

ScrollScribe automatically retries with backoff. For persistent issues:

- Try the free models first
- Use `--fast` mode to avoid API calls entirely
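
For readers unfamiliar with retry-with-backoff, the general pattern looks like the sketch below. It is a generic illustration, not ScrollScribe's implementation: `RateLimitError` is a stand-in for whatever exception an LLM client raises, and the delays and jitter are arbitrary choices.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the error an LLM client raises when it is rate limited."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 25% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.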

### "Some pages failed"

Some sites block automated access. ScrollScribe will:

- Show which URLs failed
- Continue processing other pages
- Let you retry failed URLs later
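
The continue-on-failure behavior described above follows a common pattern, sketched here. `process_all` and the `convert` callback are hypothetical names used for illustration only.

```python
def process_all(urls, convert):
    """Convert each URL, collecting failures instead of stopping at the first one."""
    succeeded, failed = {}, []
    for url in urls:
        try:
            succeeded[url] = convert(url)
        except Exception as error:  # Broad on purpose: one bad page should not end the run.
            failed.append((url, str(error)))
    return succeeded, failed


def fake_convert(url):
    """Toy converter that fails for one URL to demonstrate the pattern."""
    if "blocked" in url:
        raise RuntimeError("403 Forbidden")
    return f"# Markdown for {url}"


done, failed = process_all(
    ["https://docs.example.com/a/", "https://docs.example.com/blocked/"],
    fake_convert,
)
print("Failed:", failed)
```

Writing the failed URLs to a `.txt` file, one per line, would give a list that can be passed back to `scribe scrape` for a later retry.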

### Site-specific issues

```bash
# Increase timeout for slow sites
scribe process https://slow-site.com/ -o output/ --timeout 120000

# Use verbose mode to see what's happening
scribe process https://site.com/ -o output/ --verbose
```

## What's Different About ScrollScribe

Unlike simple web scrapers, ScrollScribe:

- **Understands documentation structure** - follows internal links intelligently
- **Cleans content** - removes navigation, ads, and irrelevant elements
- **Preserves formatting** - maintains code blocks, headers, and structure
- **Handles modern sites** - works with JavaScript-heavy documentation
- **Scales efficiently** - processes hundreds of pages reliably
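
To give a sense of what following internal links involves, the sketch below collects same-host links from a page's HTML using only the standard library. It is a simplified illustration: `LinkCollector` and `internal_links` are hypothetical names, and static parsing like this does not execute JavaScript, which is the part ScrollScribe handles with browser automation.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_url: str, html: str) -> list[str]:
    """Return unique absolute links on the same host as page_url, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments so each page appears once.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```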

## Contributing

Found a bug or want to add a feature?

1. Open an issue describing the problem
2. Fork the repository
3. Make your changes
4. Submit a pull request

### Building & Publishing

This project uses [Hatch](https://hatch.pypa.io/) for building and publishing. Contributors should have it installed.

## License

Apache License 2.0 - use ScrollScribe for any purpose, commercial or personal, under the terms in `LICENSE`.

---

**ScrollScribe** - Turn any documentation site into clean Markdown files or structured metadata for AI processing.
