PDF Liberation MCP Server - Break large PDFs into digestible chunks for Claude
- Host: GitHub
- URL: https://github.com/terry-li-hm/prometheus
- Owner: terry-li-hm
- License: agpl-3.0
- Created: 2025-08-30T09:52:36.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-08-30T11:17:39.000Z (6 months ago)
- Last Synced: 2025-08-30T11:27:39.709Z (6 months ago)
- Topics: ai-tools, claude-code, document-processing, fastmcp, mcp-server, pdf-processing, pdf-splitter, prometheus, pymupdf, python, text-extraction
- Language: Python
- Size: 72.3 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Prometheus - PDF Liberation MCP Server
[License: AGPL-3.0](https://www.gnu.org/licenses/agpl-3.0)
[Python](https://www.python.org/downloads/)
[FastMCP](https://github.com/jlowin/fastmcp)
[Ruff](https://github.com/astral-sh/ruff)
[PyMuPDF](https://pymupdf.io/)
> Like the Titan who stole fire from the gods to give to humanity, Prometheus liberates knowledge trapped in massive PDFs, breaking them into digestible chunks that AI can consume.
## Why Prometheus?
**Claude's Read tool fails with large PDFs** - it times out, truncates content, or simply refuses to open files over 10MB. When you're dealing with 300-page banking regulations, 700-page research reports, or massive technical documentation, you need a better solution.
**Prometheus solves this by:**
- **Splitting PDFs** while preserving charts, graphs, and formatting
- **Token-aware chunking** that respects Claude's context limits
- **Direct MCP integration** - no manual file management
- **Intelligent analysis** that recommends optimal chunking strategies
### Before Prometheus vs After
| Task | Without Prometheus | With Prometheus |
|------|-------------------|-----------------|
| 700-page Meeker Report | ❌ "File too large" | ✅ Split into 35 chunks, fully readable |
| Banking Regulations PDF | ❌ Timeout after 30s | ✅ Processed in 8 seconds |
| Technical Manual with Diagrams | ❌ Text only, loses visuals | ✅ All diagrams preserved |
| Multi-chapter Textbook | ❌ Manual splitting required | ✅ Auto-chunks by size/tokens |
## Quick Start
```bash
# Install and add to Claude Code in 30 seconds
claude mcp add -s user prometheus "uvx --from git+https://github.com/terry-li-hm/prometheus prometheus"
# That's it! Prometheus is ready to use in Claude
```
## Performance Benchmarks
| PDF Size | Pages | Processing Time | Memory Usage | Token Efficiency |
|----------|-------|----------------|--------------|------------------|
| 10 MB | 50 | 0.8s | 45 MB | 98% utilized |
| 50 MB | 200 | 3.2s | 120 MB | 97% utilized |
| 100 MB | 400 | 6.5s | 180 MB | 96% utilized |
| 300 MB | 1200 | 18s | 320 MB | 95% utilized |
*Benchmarked on M2 MacBook Pro with PyMuPDF 1.26.0*
## Core Tools
### `prometheus_info` - Intelligent PDF Analysis
```python
# Analyzes PDF structure and recommends processing strategy
result = await prometheus_info("massive_report.pdf")
# Returns: page count, file size, complexity level, optimal chunk size
```
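Under the hood, this kind of analysis only needs a few PyMuPDF calls. Here is a minimal sketch of the idea using PyMuPDF (`fitz`) directly; the density heuristic and returned field names are illustrative, not the server's actual code:
```python
# Illustrative sketch of PDF analysis (not the repo's implementation).
import os
import fitz  # PyMuPDF

def analyze_pdf(path: str) -> dict:
    """Gather basic structure info and suggest a pages-per-chunk value."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    with fitz.open(path) as doc:
        page_count = doc.page_count
        # Rough complexity proxy: average extracted text per sampled page.
        sample = [doc[i].get_text() for i in range(min(page_count, 10))]
    avg_chars = sum(len(t) for t in sample) / max(len(sample), 1)
    # Heuristic: denser pages -> smaller chunks (thresholds are illustrative).
    pages_per_chunk = 10 if avg_chars > 4000 else 20
    return {
        "pages": page_count,
        "size_mb": round(size_mb, 1),
        "suggested_pages_per_chunk": pages_per_chunk,
    }
```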
### `prometheus_split` - Visual-Preserving Splitting
```python
# Splits PDF into smaller files, keeping all charts/graphs intact
result = await prometheus_split("document.pdf", pages_per_chunk=20)
# Creates: document_chunks/chunk_01_pages_001-020.pdf, etc.
```
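Why do charts and graphs survive splitting? Because pages are copied wholesale into a new PDF rather than re-rendered as text. A rough sketch of that approach with PyMuPDF's `insert_pdf` (illustrative, not the repo's exact implementation):
```python
# Illustrative page-range splitting with PyMuPDF: whole pages are copied,
# so charts, fonts, and layout stay intact.
import fitz  # PyMuPDF

def split_pdf(path: str, pages_per_chunk: int = 20, out_prefix: str = "chunk") -> list[str]:
    outputs = []
    with fitz.open(path) as doc:
        for start in range(0, doc.page_count, pages_per_chunk):
            end = min(start + pages_per_chunk, doc.page_count) - 1
            chunk = fitz.open()  # new, empty PDF
            chunk.insert_pdf(doc, from_page=start, to_page=end)
            out = f"{out_prefix}_{start + 1:03d}-{end + 1:03d}.pdf"
            chunk.save(out)
            chunk.close()
            outputs.append(out)
    return outputs
```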
### `prometheus_extract_text` - Token-Aware Extraction
```python
# Extracts text in LLM-optimized chunks with accurate token counting
result = await prometheus_extract_text("research.pdf", max_tokens_per_chunk=8000)
# Returns: Array of text chunks with token counts
```
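The key idea behind token-aware extraction is to measure chunks with a real tokenizer instead of guessing from character counts. A hedged sketch using `tiktoken` and PyMuPDF; the encoding name and packing strategy here are illustrative choices, not necessarily what Prometheus ships:
```python
# Illustrative token-aware chunking: count tokens with tiktoken and start a
# new chunk before the budget would be exceeded.
import fitz      # PyMuPDF
import tiktoken

def extract_text_chunks(path: str, max_tokens_per_chunk: int = 8000) -> list[dict]:
    enc = tiktoken.get_encoding("cl100k_base")
    chunks, current, current_tokens = [], [], 0
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            n = len(enc.encode(text))
            if current and current_tokens + n > max_tokens_per_chunk:
                chunks.append({"text": "".join(current), "tokens": current_tokens})
                current, current_tokens = [], 0
            current.append(text)
            current_tokens += n
            # Note: in this sketch, a single page larger than the budget still
            # becomes one oversized chunk.
    if current:
        chunks.append({"text": "".join(current), "tokens": current_tokens})
    return chunks
```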
### `prometheus_extract_range` - Surgical Extraction
```python
# Extract specific sections with precision
result = await prometheus_extract_range("manual.pdf", start_page=50, end_page=75)
# Creates: manual_pages_50-75.pdf
```
## Real-World Examples
### Banking Compliance Document (HKMA Guidelines)
```bash
# 300-page regulatory PDF with complex tables
prometheus_info("HKMA_AI_Guidelines_2024.pdf")
# Recommends: 15 pages/chunk due to table complexity
prometheus_split("HKMA_AI_Guidelines_2024.pdf", pages_per_chunk=15)
# Result: 20 chunks, all tables intact, ready for analysis
```
### Mary Meeker's Internet Trends (700 pages)
```bash
# Massive report with hundreds of charts
prometheus_split("Internet_Trends_2024.pdf", pages_per_chunk=20)
# Result: 35 chunks in 8 seconds, every chart preserved
# Extract just the AI section
prometheus_extract_range("Internet_Trends_2024.pdf", start_page=245, end_page=320)
```
### Academic Research Paper
```bash
# Extract text for semantic analysis
prometheus_extract_text("transformer_paper.pdf", max_tokens_per_chunk=6000)
# Result: 5 chunks optimized for Claude's context window
```
## Configuration
Prometheus adapts to your needs via environment variables:
```bash
# .env file configuration
PROMETHEUS_LOG_LEVEL=INFO # DEBUG for troubleshooting
PROMETHEUS_LOG_FORMAT=json # json or text
PROMETHEUS_MAX_FILE_SIZE_MB=500 # Increase for huge PDFs
PROMETHEUS_MAX_PAGES_PER_CHUNK=200 # Maximum chunk size
PROMETHEUS_MAX_TOKEN_LIMIT=32000 # For Claude 3.5's context
PROMETHEUS_MEMORY_OPT=true # Enable for large files
PROMETHEUS_TIMEOUT=300 # Processing timeout
```
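For reference, these variables map naturally onto a small settings object. One way to load them is sketched below; the example values above are used as fallbacks, and the repo's `config.py` may do this differently:
```python
# Illustrative config loading only; the repo's config.py may differ.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class PrometheusConfig:
    log_level: str = os.getenv("PROMETHEUS_LOG_LEVEL", "INFO")
    log_format: str = os.getenv("PROMETHEUS_LOG_FORMAT", "json")
    max_file_size_mb: int = int(os.getenv("PROMETHEUS_MAX_FILE_SIZE_MB", "500"))
    max_pages_per_chunk: int = int(os.getenv("PROMETHEUS_MAX_PAGES_PER_CHUNK", "200"))
    max_token_limit: int = int(os.getenv("PROMETHEUS_MAX_TOKEN_LIMIT", "32000"))
    memory_opt: bool = os.getenv("PROMETHEUS_MEMORY_OPT", "true").lower() == "true"
    timeout: int = int(os.getenv("PROMETHEUS_TIMEOUT", "300"))

config = PrometheusConfig()
```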
## Architecture
```mermaid
graph LR
A[Large PDF] --> B[Prometheus MCP Server]
B --> C{Analysis Engine}
C --> D[PyMuPDF Parser]
C --> E[Tiktoken Counter]
C --> F[Structure Analyzer]
D --> G[Split/Extract]
E --> G
F --> G
G --> H[Optimized Output]
H --> I[Claude Code]
```
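To make the diagram concrete, here is a stripped-down sketch of the MCP layer, assuming the `fastmcp` package's `FastMCP` / `@mcp.tool()` / `run()` API. The real `server.py` registers all four tools and delegates the heavy lifting to the PDF engine:
```python
# Skeleton sketch of the MCP layer (assumes fastmcp's FastMCP API);
# the actual server.py registers all four tools plus config, logging, errors.
import fitz  # PyMuPDF
from fastmcp import FastMCP

mcp = FastMCP("prometheus")

@mcp.tool()
async def prometheus_info(path: str) -> dict:
    """Report basic PDF structure (placeholder logic for illustration)."""
    with fitz.open(path) as doc:
        return {"pages": doc.page_count, "encrypted": doc.needs_pass}

if __name__ == "__main__":
    mcp.run()  # by default, serves MCP over stdio to the client
```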
### Why FastMCP + Python?
| Aspect | FastMCP + Python | JavaScript Alternative |
|--------|------------------|----------------------|
| **PDF Library** | PyMuPDF (Industrial-grade) | pdf.js (Limited) |
| **Performance** | 3-5x faster | Slower with large files |
| **Memory Management** | Context managers | Manual cleanup |
| **Token Counting** | Native tiktoken | Approximations |
| **Code Simplicity** | ~300 lines | ~800 lines |
## Common Issues & Solutions
### FAQ
**Q: Why do I see "DeprecationWarning: builtin type swigvarlink"?**
A: This is a harmless PyMuPDF warning that doesn't affect functionality. It will be fixed in PyMuPDF 1.27.
**Q: Can I process password-protected PDFs?**
A: Not currently. Prometheus will return a clear error message for encrypted PDFs.
**Q: Why AGPL license instead of MIT?**
A: PyMuPDF is distributed under AGPL (with a separate commercial option), so Prometheus must be AGPL as well. For personal or internal use this has zero impact; for commercial distribution, you'd need PyMuPDF's commercial license.
**Q: How does it handle scanned PDFs?**
A: Prometheus extracts embedded text. For scanned images without OCR, you'll get minimal text. Consider OCR preprocessing.
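One cheap way to tell the two cases apart before processing a large file is to sample a few pages for extractable text. This is an illustrative helper using PyMuPDF directly, not part of the current tool set:
```python
# Illustrative check: does this PDF have an embedded text layer, or is it
# likely a pure scan that would need OCR first?
import fitz  # PyMuPDF

def has_text_layer(path: str, sample_pages: int = 5, min_chars: int = 50) -> bool:
    with fitz.open(path) as doc:
        for i in range(min(sample_pages, doc.page_count)):
            if len(doc[i].get_text().strip()) >= min_chars:
                return True
    return False
```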
**Q: Memory usage with huge PDFs?**
A: Enable `PROMETHEUS_MEMORY_OPT=true` for files >100MB. Prometheus uses streaming and cleanup to minimize memory footprint.
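The pattern behind that option is simple: touch one page at a time and let each page's text be discarded as soon as it is consumed, instead of holding the whole document in memory. A generator sketch of the idea (not the repo's exact code):
```python
# Illustrative streaming pattern: yield one page's text at a time so memory
# stays bounded regardless of document size.
from collections.abc import Iterator
import fitz  # PyMuPDF

def iter_page_text(path: str) -> Iterator[str]:
    with fitz.open(path) as doc:   # closed automatically when iteration ends
        for page in doc:
            yield page.get_text()  # caller consumes and discards each page

# usage: total_chars = sum(len(t) for t in iter_page_text("huge.pdf"))
```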
## Roadmap
### v0.3.0 (Next Release)
- [ ] OCR support for scanned PDFs
- [ ] Smart chunking by document structure (chapters/sections)
- [ ] Parallel processing for faster extraction
- [ ] PDF merging capabilities
### v0.4.0 (Q2 2025)
- [ ] Web UI for visual chunk preview
- [ ] Custom extraction templates
- [ ] Integration with other MCP servers
- [ ] Batch processing multiple PDFs
### Future Vision
- [ ] AI-powered content summarization
- [ ] Automatic index generation
- [ ] Cross-reference detection
- [ ] Multi-language support
## Comparison with Alternatives
| Feature | Prometheus | Manual Splitting | pypdf | pdfplumber |
|---------|------------|-----------------|-------|------------|
| **MCP Integration** | ✅ Native | ❌ None | ❌ None | ❌ None |
| **Visual Preservation** | ✅ Perfect | ✅ Perfect | ⚠️ Limited | ❌ Text only |
| **Token Awareness** | ✅ Tiktoken | ❌ None | ❌ None | ❌ None |
| **Speed** | Fast | Manual | Fast | Slow |
| **Memory Efficiency** | ✅ Optimized | N/A | ⚠️ Basic | ❌ High usage |
| **Error Handling** | ✅ Robust | N/A | ⚠️ Basic | ⚠️ Basic |
## Development
### Setup
```bash
git clone https://github.com/terry-li-hm/prometheus.git
cd prometheus
uv venv
uv pip install -e ".[dev]"
```
### Testing
```bash
# Run tests
uv run pytest
# Linting
uv run ruff check .
uv run ruff format .
# Type checking
uv run mypy prometheus/
```
### Project Structure
```
prometheus/
├── prometheus/
│   ├── server.py          # FastMCP server & tools
│   ├── pdf_utils.py       # PDF processing engine
│   ├── config.py          # Configuration management
│   └── logging_setup.py   # Structured logging
├── tests/                 # Comprehensive test suite
├── scripts/               # CLI testing tools
└── README.md              # You are here
```
## Acknowledgments
- **PyMuPDF** - Industrial-strength PDF processing
- **FastMCP** - Elegant MCP server framework
- **Tiktoken** - OpenAI's token counting library
- **Claude Code** - The IDE that inspired this tool
## License
GNU Affero General Public License v3.0 - See [LICENSE](LICENSE) file.
**What this means for you:**
- ✅ **Personal use**: Unlimited, no restrictions
- ✅ **Internal company use**: Allowed without sharing code
- ⚠️ **Distribution**: Must share source code under AGPL
- ⚠️ **Web service**: Must provide source to users
This aligns with PyMuPDF's licensing. For commercial distribution needs, consider [PyMuPDF's commercial license](https://pymupdf.io/licensing/).
---
**Built by Terry** | [Report Issue](https://github.com/terry-li-hm/prometheus/issues) | [Star on GitHub](https://github.com/terry-li-hm/prometheus)
*Stealing fire from the gods, one PDF at a time.*