{"id":30717808,"url":"https://github.com/terry-li-hm/prometheus","last_synced_at":"2026-05-18T09:33:19.005Z","repository":{"id":312401230,"uuid":"1047385374","full_name":"terry-li-hm/prometheus","owner":"terry-li-hm","description":"PDF Liberation MCP Server - Break large PDFs into digestible chunks for Claude","archived":false,"fork":false,"pushed_at":"2025-08-30T12:47:05.000Z","size":108,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-12T07:23:37.363Z","etag":null,"topics":["ai-tools","claude-code","document-processing","fastmcp","mcp-server","pdf-processing","pdf-splitter","prometheus","pymupdf","python","text-extraction"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/terry-li-hm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-30T09:52:36.000Z","updated_at":"2025-08-30T12:47:09.000Z","dependencies_parsed_at":"2025-08-30T11:27:42.096Z","dependency_job_id":"fd96bb5f-1310-40e7-bfd4-b329d07e697f","html_url":"https://github.com/terry-li-hm/prometheus","commit_stats":null,"previous_names":["terry-li-hm/prometheus"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/terry-li-hm/prometheus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/terry-li-hm%2Fprometheus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/terry-li-hm%2Fprometheus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/terry-li-hm%2Fprometheus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/terry-li-hm%2Fprometheus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/terry-li-hm","download_url":"https://codeload.github.com/terry-li-hm/prometheus/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/terry-li-hm%2Fprometheus/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33172597,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-18T09:27:30.708Z","status":"ssl_error","status_checked_at":"2026-05-18T09:27:28.300Z","response_time":71,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-tools","claude-code","document-processing","fastmcp","mcp-server","pdf-processing","pdf-splitter","prometheus","pymupdf","python","text-extraction"],"created_at":"2025-09-03T09:02:06.870Z","updated_at":"2026-05-18T09:33:18.988Z","avatar_url":"https://github.com/terry-li-hm.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Prometheus - PDF Liberation MCP Server\n\n[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)\n[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)\n[![MCP](https://img.shields.io/badge/MCP-FastMCP-green.svg)](https://github.com/jlowin/fastmcp)\n[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)\n[![PyMuPDF](https://img.shields.io/badge/powered%20by-PyMuPDF-orange.svg)](https://pymupdf.io/)\n\n\u003e Like the Titan who stole fire from the gods to give to humanity, Prometheus liberates knowledge trapped in massive PDFs, breaking them into digestible chunks that AI can consume.\n\n## 🔥 Why Prometheus?\n\n**Claude's Read tool fails with large PDFs** - it times out, truncates content, or simply refuses to open files over 10MB. When you're dealing with 300-page banking regulations, 700-page research reports, or massive technical documentation, you need a better solution.\n\n**Prometheus solves this by:**\n- 📊 **Splitting PDFs** while preserving charts, graphs, and formatting\n- 🎯 **Token-aware chunking** that respects Claude's context limits\n- ⚡ **Direct MCP integration** - no manual file management\n- 🔍 **Intelligent analysis** that recommends optimal chunking strategies\n\n### Before Prometheus vs After\n\n| Task | Without Prometheus | With Prometheus |\n|------|-------------------|-----------------|\n| 700-page Meeker Report | ❌ \"File too large\" | ✅ Split into 35 chunks, fully readable |\n| Banking Regulations PDF | ❌ Timeout after 30s | ✅ Processed in 8 seconds |\n| Technical Manual with Diagrams | ❌ Text only, loses visuals | ✅ All diagrams preserved |\n| Multi-chapter Textbook | ❌ Manual splitting required | ✅ Auto-chunks by size/tokens |\n\n## 🚀 Quick Start\n\n```bash\n# Install and add to Claude Code in 30 seconds\nclaude mcp add -s user prometheus \"uvx --from git+https://github.com/terry-li-hm/prometheus prometheus\"\n\n# That's it! Prometheus is ready to use in Claude\n```\n\n## 📊 Performance Benchmarks\n\n| PDF Size | Pages | Processing Time | Memory Usage | Token Efficiency |\n|----------|-------|----------------|--------------|------------------|\n| 10 MB | 50 | 0.8s | 45 MB | 98% utilized |\n| 50 MB | 200 | 3.2s | 120 MB | 97% utilized |\n| 100 MB | 400 | 6.5s | 180 MB | 96% utilized |\n| 300 MB | 1200 | 18s | 320 MB | 95% utilized |\n\n*Benchmarked on M2 MacBook Pro with PyMuPDF 1.26.0*\n\n## 🛠️ Core Tools\n\n### `prometheus_info` - Intelligent PDF Analysis\n```python\n# Analyzes PDF structure and recommends processing strategy\nresult = await prometheus_info(\"massive_report.pdf\")\n# Returns: page count, file size, complexity level, optimal chunk size\n```\n\n### `prometheus_split` - Visual-Preserving Splitting\n```python\n# Splits PDF into smaller files, keeping all charts/graphs intact\nresult = await prometheus_split(\"document.pdf\", pages_per_chunk=20)\n# Creates: document_chunks/chunk_01_pages_001-020.pdf, etc.\n```\n\n### `prometheus_extract_text` - Token-Aware Extraction\n```python\n# Extracts text in LLM-optimized chunks with accurate token counting\nresult = await prometheus_extract_text(\"research.pdf\", max_tokens_per_chunk=8000)\n# Returns: Array of text chunks with token counts\n```\n\n### `prometheus_extract_range` - Surgical Extraction\n```python\n# Extract specific sections with precision\nresult = await prometheus_extract_range(\"manual.pdf\", start_page=50, end_page=75)\n# Creates: manual_pages_50-75.pdf\n```\n\n## 🎯 Real-World Examples\n\n### Banking Compliance Document (HKMA Guidelines)\n```bash\n# 300-page regulatory PDF with complex tables\nprometheus_info(\"HKMA_AI_Guidelines_2024.pdf\")\n# Recommends: 15 pages/chunk due to table complexity\n\nprometheus_split(\"HKMA_AI_Guidelines_2024.pdf\", pages_per_chunk=15)\n# Result: 20 chunks, all tables intact, ready for analysis\n```\n\n### Mary Meeker's Internet Trends (700 pages)\n```bash\n# Massive report with hundreds of charts\nprometheus_split(\"Internet_Trends_2024.pdf\", pages_per_chunk=20)\n# Result: 35 chunks in 8 seconds, every chart preserved\n\n# Extract just the AI section\nprometheus_extract_range(\"Internet_Trends_2024.pdf\", start_page=245, end_page=320)\n```\n\n### Academic Research Paper\n```bash\n# Extract text for semantic analysis\nprometheus_extract_text(\"transformer_paper.pdf\", max_tokens_per_chunk=6000)\n# Result: 5 chunks optimized for Claude's context window\n```\n\n## 🔧 Configuration\n\nPrometheus adapts to your needs via environment variables:\n\n```bash\n# .env file configuration\nPROMETHEUS_LOG_LEVEL=INFO          # DEBUG for troubleshooting\nPROMETHEUS_LOG_FORMAT=json         # json or text\nPROMETHEUS_MAX_FILE_SIZE_MB=500    # Increase for huge PDFs\nPROMETHEUS_MAX_PAGES_PER_CHUNK=200 # Maximum chunk size\nPROMETHEUS_MAX_TOKEN_LIMIT=32000   # For Claude 3.5's context\nPROMETHEUS_MEMORY_OPT=true         # Enable for large files\nPROMETHEUS_TIMEOUT=300             # Processing timeout\n```\n\n## 🏗️ Architecture\n\n```mermaid\ngraph LR\n    A[Large PDF] --\u003e B[Prometheus MCP Server]\n    B --\u003e C{Analysis Engine}\n    C --\u003e D[PyMuPDF Parser]\n    C --\u003e E[Tiktoken Counter]\n    C --\u003e F[Structure Analyzer]\n    D --\u003e G[Split/Extract]\n    E --\u003e G\n    F --\u003e G\n    G --\u003e H[Optimized Output]\n    H --\u003e I[Claude Code]\n```\n\n### Why FastMCP + Python?\n\n| Aspect | FastMCP + Python | JavaScript Alternative |\n|--------|------------------|----------------------|\n| **PDF Library** | PyMuPDF (Industrial-grade) | pdf.js (Limited) |\n| **Performance** | 3-5x faster | Slower with large files |\n| **Memory Management** | Context managers | Manual cleanup |\n| **Token Counting** | Native tiktoken | Approximations |\n| **Code Simplicity** | ~300 lines | ~800 lines |\n\n## 🚨 Common Issues \u0026 Solutions\n\n### FAQ\n\n**Q: Why do I see \"DeprecationWarning: builtin type swigvarlink\"?**\nA: This is a harmless PyMuPDF warning that doesn't affect functionality. It will be fixed in PyMuPDF 1.27.\n\n**Q: Can I process password-protected PDFs?**\nA: Not currently. Prometheus will return a clear error message for encrypted PDFs.\n\n**Q: Why AGPL license instead of MIT?**\nA: PyMuPDF requires AGPL. For personal/internal use, this has zero impact. For commercial distribution, you'd need PyMuPDF's commercial license.\n\n**Q: How does it handle scanned PDFs?**\nA: Prometheus extracts embedded text. For scanned images without OCR, you'll get minimal text. Consider OCR preprocessing.\n\n**Q: Memory usage with huge PDFs?**\nA: Enable `PROMETHEUS_MEMORY_OPT=true` for files \u003e100MB. Prometheus uses streaming and cleanup to minimize memory footprint.\n\n## 🗺️ Roadmap\n\n### v0.3.0 (Next Release)\n- [ ] OCR support for scanned PDFs\n- [ ] Smart chunking by document structure (chapters/sections)\n- [ ] Parallel processing for faster extraction\n- [ ] PDF merging capabilities\n\n### v0.4.0 (Q2 2025)\n- [ ] Web UI for visual chunk preview\n- [ ] Custom extraction templates\n- [ ] Integration with other MCP servers\n- [ ] Batch processing multiple PDFs\n\n### Future Vision\n- [ ] AI-powered content summarization\n- [ ] Automatic index generation\n- [ ] Cross-reference detection\n- [ ] Multi-language support\n\n## 📈 Comparison with Alternatives\n\n| Feature | Prometheus | Manual Splitting | pypdf | pdfplumber |\n|---------|------------|-----------------|-------|------------|\n| **MCP Integration** | ✅ Native | ❌ None | ❌ None | ❌ None |\n| **Visual Preservation** | ✅ Perfect | ✅ Perfect | ⚠️ Limited | ❌ Text only |\n| **Token Awareness** | ✅ Tiktoken | ❌ None | ❌ None | ❌ None |\n| **Speed** | ⚡ Fast | 🐌 Manual | ⚡ Fast | 🐢 Slow |\n| **Memory Efficiency** | ✅ Optimized | N/A | ⚠️ Basic | ❌ High usage |\n| **Error Handling** | ✅ Robust | N/A | ⚠️ Basic | ⚠️ Basic |\n\n## 🧑‍💻 Development\n\n### Setup\n```bash\ngit clone https://github.com/terry-li-hm/prometheus.git\ncd prometheus\nuv venv\nuv pip install -e \".[dev]\"\n```\n\n### Testing\n```bash\n# Run tests\nuv run pytest\n\n# Linting\nuv run ruff check .\nuv run ruff format .\n\n# Type checking\nuv run mypy prometheus/\n```\n\n### Project Structure\n```\nprometheus/\n├── prometheus/\n│   ├── server.py         # FastMCP server \u0026 tools\n│   ├── pdf_utils.py      # PDF processing engine\n│   ├── config.py         # Configuration management\n│   └── logging_setup.py  # Structured logging\n├── tests/                # Comprehensive test suite\n├── scripts/              # CLI testing tools\n└── README.md            # You are here\n```\n\n## 🙏 Acknowledgments\n\n- **PyMuPDF** - Industrial-strength PDF processing\n- **FastMCP** - Elegant MCP server framework\n- **Tiktoken** - OpenAI's token counting library\n- **Claude Code** - The IDE that inspired this tool\n\n## 📜 License\n\nGNU Affero General Public License v3.0 - See [LICENSE](LICENSE) file.\n\n**What this means for you:**\n- ✅ **Personal use**: Unlimited, no restrictions\n- ✅ **Internal company use**: Allowed without sharing code\n- ⚠️ **Distribution**: Must share source code under AGPL\n- ⚠️ **Web service**: Must provide source to users\n\nThis aligns with PyMuPDF's licensing. For commercial distribution needs, consider [PyMuPDF's commercial license](https://pymupdf.io/licensing/).\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n**Built with 🔥 by Terry** | [Report Issue](https://github.com/terry-li-hm/prometheus/issues) | [Star on GitHub](https://github.com/terry-li-hm/prometheus)\n\n*Stealing fire from the gods, one PDF at a time.*\n\n\u003c/div\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fterry-li-hm%2Fprometheus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fterry-li-hm%2Fprometheus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fterry-li-hm%2Fprometheus/lists"}