https://github.com/jztan/pdf-mcp
Production-ready MCP server for PDF processing with intelligent caching. Extract text, search, and analyze PDFs with AI agents.
agentic-ai ai claude codex-cli copilot document-processing llm mcp model-context-protocol opencode pdf python
- Host: GitHub
- URL: https://github.com/jztan/pdf-mcp
- Owner: jztan
- License: mit
- Created: 2026-01-28T14:50:28.000Z (26 days ago)
- Default Branch: master
- Last Pushed: 2026-01-29T22:51:27.000Z (25 days ago)
- Last Synced: 2026-01-30T12:16:27.784Z (24 days ago)
- Topics: agentic-ai, ai, claude, codex-cli, copilot, document-processing, llm, mcp, model-context-protocol, opencode, pdf, python
- Language: Python
- Homepage:
- Size: 23.4 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# pdf-mcp 📄
**Production-ready MCP server for PDF processing with intelligent caching.**
A Python implementation of the Model Context Protocol (MCP) server that enables AI agents like Claude to read, search, and extract content from PDF files efficiently.
**mcp-name: io.github.jztan/pdf-mcp**
## ✨ Features
- 🚀 **8 Specialized Tools** - Purpose-built tools for different PDF operations
- 💾 **SQLite Caching** - Persistent cache survives server restarts (essential for STDIO transport)
- 📄 **Smart Pagination** - Read large PDFs in manageable chunks
- 🔍 **Full-Text Search** - Find content without loading the entire document
- 🖼️ **Image Extraction** - Extract images as base64 PNG
- 🌐 **URL Support** - Read PDFs from HTTP/HTTPS URLs
- ⚡ **Fast Subsequent Access** - Cached pages load instantly
## 📦 Installation
```bash
pip install pdf-mcp
```
## 🚀 Quick Start
### Claude Code
```bash
claude mcp add pdf-mcp -- pdf-mcp
```
Or add to `~/.claude.json`:
```json
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}
```
### Claude Desktop
Add to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}
```
**Location of config file:**
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
After updating the config, restart Claude Desktop to load the MCP server.
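If the config file already lists other MCP servers, the `pdf-mcp` entry should be merged in rather than pasted over them. The following is a minimal sketch of that merge; the helper name `add_mcp_server` is hypothetical and not part of pdf-mcp, and it assumes the file, if present, contains valid JSON.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str) -> dict:
    """Merge one server entry into a config file that uses the "mcpServers" key."""
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    # Keep any servers that are already configured.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command}
    # Create the parent directory if needed, then write the merged config back.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_mcp_server(Path("claude_desktop_config.json"), "pdf-mcp", "pdf-mcp")` would produce the same entry as the JSON shown above while leaving other entries untouched.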
### Visual Studio Code (Native MCP Support)
VS Code has built-in MCP support via GitHub Copilot (requires VS Code 1.102+).
**Using CLI (Quickest):**
```bash
code --add-mcp '{"name":"pdf-mcp","command":"pdf-mcp"}'
```
**Using Command Palette:**
1. Open Command Palette (`Cmd/Ctrl+Shift+P`)
2. Run `MCP: Open User Configuration` (for global) or `MCP: Open Workspace Folder Configuration` (for project-specific)
3. Add the configuration:
```json
{
  "servers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}
```
4. Save the file. VS Code will automatically load the MCP server.
**Manual Configuration:**
Create `.vscode/mcp.json` in your workspace:
```json
{
  "servers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}
```
### Codex CLI
Add to Codex CLI using the command:
```bash
codex mcp add pdf-mcp -- pdf-mcp
```
Or configure manually in `~/.codex/config.toml`:
```toml
[mcp_servers.pdf-mcp]
command = "pdf-mcp"
```
### Kiro
Create or edit `.kiro/settings/mcp.json` in your workspace:
```json
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp",
      "args": [],
      "disabled": false
    }
  }
}
```
Save the file and restart Kiro. The PDF tools will appear in the MCP panel.
### Generic MCP Clients
Most MCP clients use a standard configuration format:
```json
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "pdf-mcp"
    }
  }
}
```
If using `uvx` (recommended for isolated environments):
```json
{
  "mcpServers": {
    "pdf-mcp": {
      "command": "uvx",
      "args": ["pdf-mcp"]
    }
  }
}
```
### Testing Your Setup
```bash
# Verify pdf-mcp is installed and working
pdf-mcp --help
```
## 🛠️ Tools
### 1. `pdf_info` - Get Document Information
**Always call this first** to understand the document before reading.
```
"Read the PDF at /path/to/document.pdf"
```
Returns: page count, metadata, table of contents, file size, estimated tokens.
### 2. `pdf_read_pages` - Read Specific Pages
Read pages in chunks to manage context size.
```
"Read pages 1-10 of the PDF"
"Read pages 15, 20, and 25-30"
```
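Page requests like the ones above mix single pages and ranges. As an illustration only, here is one way such a spec could be expanded into page numbers; `parse_page_spec` is a hypothetical helper, and the exact syntax pdf-mcp accepts may differ.

```python
def parse_page_spec(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "15, 20, 25-30" into sorted 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject ranges that are reversed or fall outside the document.
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

For example, `parse_page_spec("15, 20, 25-30", 200)` yields pages 15, 20, and 25 through 30.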
### 3. `pdf_read_all` - Read Entire Document
Use for small documents only; the tool enforces a safety limit on how much it returns.
```
"Read the entire PDF (it's only 10 pages)"
```
### 4. `pdf_search` - Search Within PDF
Find relevant pages before loading content.
```
"Search for 'quarterly revenue' in the PDF"
```
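The idea of searching first is to get back page numbers and short snippets instead of full page text. A minimal sketch of that idea over already-extracted text follows; `search_pages` and its return shape are assumptions for illustration, not the actual `pdf_search` implementation or output format.

```python
import re


def search_pages(page_texts: dict[int, str], query: str, context: int = 40) -> list[dict]:
    """Return one hit per matching page with a snippet around the first match."""
    # Escape the query so it is matched literally, ignoring case.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for page_number in sorted(page_texts):
        text = page_texts[page_number]
        match = pattern.search(text)
        if match is None:
            continue
        # Keep a little surrounding text so the caller can judge relevance.
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        hits.append({"page": page_number, "snippet": text[start:end].strip()})
    return hits
```

The caller can then pass only the matching page numbers to `pdf_read_pages`.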
### 5. `pdf_get_toc` - Get Table of Contents
```
"Show me the table of contents"
```
### 6. `pdf_extract_images` - Extract Images
```
"Extract images from pages 1-5"
```
### 7. `pdf_cache_stats` - View Cache Statistics
```
"Show PDF cache statistics"
```
### 8. `pdf_cache_clear` - Clear Cache
```
"Clear expired PDF cache entries"
```
## 📋 Example Workflow
For a large document (e.g., 200-page annual report):
```
User: "Summarize the risk factors in this annual report"

Claude's workflow:
1. pdf_info("report.pdf")
   → Learns: 200 pages, TOC shows "Risk Factors" on page 89
2. pdf_search("report.pdf", "risk factors")
   → Finds relevant pages: 89-110
3. pdf_read_pages("report.pdf", "89-100")
   → Reads first batch
4. pdf_read_pages("report.pdf", "101-110")
   → Reads second batch
5. Synthesizes answer from chunks
```
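Steps 3 and 4 above split the relevant range into batches. A small sketch of that splitting is shown below; `page_batches` is a hypothetical helper, and the batch size of 12 is chosen only to reproduce the ranges in the example.

```python
def page_batches(first: int, last: int, batch_size: int) -> list[str]:
    """Split an inclusive page range into "start-end" strings of at most batch_size pages."""
    if batch_size < 1 or first > last:
        raise ValueError("batch_size must be >= 1 and first must not exceed last")
    batches = []
    start = first
    while start <= last:
        # The final batch may be shorter than batch_size.
        end = min(start + batch_size - 1, last)
        batches.append(f"{start}-{end}")
        start = end + 1
    return batches


print(page_batches(89, 110, 12))  # ['89-100', '101-110']
```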
## 💾 Caching
The server uses **SQLite for persistent caching** because, with the STDIO transport, the MCP client spawns a new server process for each conversation, so an in-memory cache would be lost between conversations.
### Cache Location
- `~/.cache/pdf-mcp/cache.db`
### What's Cached
| Data | Benefit |
|------|---------|
| Metadata | Instant document info |
| Page text | Skip re-extraction |
| Images | Skip re-encoding |
| TOC | Fast navigation |
### Cache Invalidation
- Automatic when file modification time changes
- Manual via `pdf_cache_clear` tool
- TTL: 24 hours (configurable)
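
The modification-time and TTL rules above can be sketched with the standard `sqlite3` module. This is an illustration under stated assumptions: the table name `page_text`, its columns, and the helpers `open_cache`, `get_page`, and `put_page` are hypothetical and do not describe pdf-mcp's actual schema or API.

```python
import os
import sqlite3
import time

TTL_SECONDS = 24 * 3600  # mirrors the 24-hour default TTL


def open_cache(db_path: str) -> sqlite3.Connection:
    """Open (or create) the cache database with a hypothetical page_text table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_text ("
        "path TEXT, page INTEGER, mtime REAL, cached_at REAL, text TEXT, "
        "PRIMARY KEY (path, page))"
    )
    return conn


def get_page(conn, path, page, now=None):
    """Return cached text, or None if missing, stale (file changed), or expired."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT mtime, cached_at, text FROM page_text WHERE path = ? AND page = ?",
        (path, page),
    ).fetchone()
    if row is None:
        return None
    mtime, cached_at, text = row
    # Invalidate when the file's modification time changed or the TTL elapsed.
    if mtime != os.path.getmtime(path) or now - cached_at > TTL_SECONDS:
        return None
    return text


def put_page(conn, path, page, text, now=None):
    """Store page text together with the file's current modification time."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO page_text VALUES (?, ?, ?, ?, ?)",
        (path, page, os.path.getmtime(path), now, text),
    )
    conn.commit()
```

Because the database lives on disk, a freshly spawned server process can call `get_page` and reuse text extracted in an earlier conversation.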
## ⚙️ Configuration
Environment variables:
```bash
# Cache directory (default: ~/.cache/pdf-mcp)
PDF_MCP_CACHE_DIR=/path/to/cache
# Cache TTL in hours (default: 24)
PDF_MCP_CACHE_TTL=48
```
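For reference, this is roughly how a Python process could read these two variables and fall back to the documented defaults. The `CacheSettings` class and `load_settings` function are hypothetical names used for illustration; only the variable names and defaults come from this README.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CacheSettings:
    cache_dir: Path
    ttl_hours: float


def load_settings(env=None) -> CacheSettings:
    """Read cache settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    # Expand "~" so the default resolves to the user's home directory.
    cache_dir = Path(env.get("PDF_MCP_CACHE_DIR", "~/.cache/pdf-mcp")).expanduser()
    ttl_hours = float(env.get("PDF_MCP_CACHE_TTL", "24"))
    return CacheSettings(cache_dir=cache_dir, ttl_hours=ttl_hours)
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real environment.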
## 🔧 Development
```bash
# Clone
git clone https://github.com/jztan/pdf-mcp.git
cd pdf-mcp
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Type checking
mypy src/
# Linting
ruff check src/
```
## 📊 Comparison
| Feature | Traditional Approach | pdf-mcp |
|---------|---------------------|---------|
| Large PDFs | Context overflow | Chunked reading |
| Repeated access | Re-parse every time | SQLite cache |
| Find content | Load everything | Search first |
| Multiple tools | One monolithic tool | 8 specialized tools |
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## 📄 License
MIT License - see the [LICENSE](LICENSE) file.
## 🔗 Links
- [PyPI Package](https://pypi.org/project/pdf-mcp/)
- [MCP Documentation](https://modelcontextprotocol.io/)
- [GitHub Repository](https://github.com/jztan/pdf-mcp)
- [Blog Post: How I Built pdf-mcp](https://blog.jztan.com/how-i-built-pdf-mcp-solving-claude-large-pdf-limitations/)