{"id":44018217,"url":"https://github.com/jztan/pdf-mcp","last_synced_at":"2026-03-14T18:26:52.729Z","repository":{"id":335378097,"uuid":"1144308241","full_name":"jztan/pdf-mcp","owner":"jztan","description":"Production-ready MCP server for PDF processing with intelligent caching. Extract text, search, and analyze PDFs with AI agents.","archived":false,"fork":false,"pushed_at":"2026-01-29T22:51:27.000Z","size":24,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-01-30T12:16:27.784Z","etag":null,"topics":["agentic-ai","ai","claude","codex-cli","copilot","document-processing","llm","mcp","model-context-protocol","opencode","pdf","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jztan.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-28T14:50:28.000Z","updated_at":"2026-01-30T02:39:12.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jztan/pdf-mcp","commit_stats":null,"previous_names":["jztan/pdf-mcp"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/jztan/pdf-mcp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jztan%2Fpdf-mcp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jztan%2Fpdf-mcp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jztan%2Fpdf-mcp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jztan%2Fpdf-mcp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jztan","download_url":"https://codeload.github.com/jztan/pdf-mcp/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jztan%2Fpdf-mcp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29199519,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-07T14:35:27.868Z","status":"ssl_error","status_checked_at":"2026-02-07T14:25:51.081Z","response_time":63,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","ai","claude","codex-cli","copilot","document-processing","llm","mcp","model-context-protocol","opencode","pdf","python"],"created_at":"2026-02-07T16:00:31.719Z","updated_at":"2026-03-08T15:01:57.669Z","avatar_url":"https://github.com/jztan.png","language":"Python","funding_links":[],"categories":["カテゴリ"],"sub_categories":["📁 \u003ca name=\"file-system--storage\"\u003e\u003c/a\u003eファイルシステム・ストレージ"],"readme":"# pdf-mcp\n\n[![PyPI version](https://img.shields.io/pypi/v/pdf-mcp)](https://pypi.org/project/pdf-mcp/)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![GitHub Issues](https://img.shields.io/github/issues/jztan/pdf-mcp)](https://github.com/jztan/pdf-mcp/issues)\n[![CI](https://github.com/jztan/pdf-mcp/actions/workflows/ci.yml/badge.svg)](https://github.com/jztan/pdf-mcp/actions/workflows/ci.yml)\n[![codecov](https://codecov.io/gh/jztan/pdf-mcp/graph/badge.svg)](https://codecov.io/gh/jztan/pdf-mcp)\n[![Downloads](https://pepy.tech/badge/pdf-mcp)](https://pepy.tech/project/pdf-mcp)\n\nA [Model Context Protocol](https://modelcontextprotocol.io/) (MCP) server that enables AI agents to read, search, and extract content from PDF files. Built with Python and PyMuPDF, with SQLite-based caching for persistence across server restarts.\n\n**mcp-name: io.github.jztan/pdf-mcp**\n\n## Features\n\n- **8 specialized tools** for different PDF operations\n- **SQLite caching** — persistent cache survives server restarts (essential for STDIO transport)\n- **Paginated reading** — read large PDFs in manageable chunks\n- **Full-text search** — find content without loading the entire document\n- **Image extraction** — extract images as base64 PNG\n- **URL support** — read PDFs from HTTP/HTTPS URLs\n\n## Installation\n\n```bash\npip install pdf-mcp\n```\n\n## Quick Start\n\n\u003cdetails open\u003e\n\u003csummary\u003e\u003cstrong\u003eClaude Code\u003c/strong\u003e\u003c/summary\u003e\n\n```bash\nclaude mcp add pdf-mcp -- pdf-mcp\n```\n\nOr add to `~/.claude.json`:\n\n```json\n{\n  \"mcpServers\": {\n    \"pdf-mcp\": {\n      \"command\": \"pdf-mcp\"\n    }\n  }\n}\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eClaude Desktop\u003c/strong\u003e\u003c/summary\u003e\n\nAdd to your `claude_desktop_config.json`:\n\n```json\n{\n  \"mcpServers\": {\n    \"pdf-mcp\": {\n      \"command\": \"pdf-mcp\"\n    }\n  }\n}\n```\n\nConfig file location:\n- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`\n- Windows: `%APPDATA%\\Claude\\claude_desktop_config.json`\n\nRestart Claude Desktop after updating the config.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eVisual Studio Code\u003c/strong\u003e\u003c/summary\u003e\n\nRequires VS Code 1.102+ with GitHub Copilot.\n\n**CLI:**\n```bash\ncode --add-mcp '{\"name\":\"pdf-mcp\",\"command\":\"pdf-mcp\"}'\n```\n\n**Command Palette:**\n1. Open Command Palette (`Cmd/Ctrl+Shift+P`)\n2. Run `MCP: Open User Configuration` (global) or `MCP: Open Workspace Folder Configuration` (project-specific)\n3. Add the configuration:\n   ```json\n   {\n     \"servers\": {\n       \"pdf-mcp\": {\n         \"command\": \"pdf-mcp\"\n       }\n     }\n   }\n   ```\n4. Save. VS Code will automatically load the server.\n\n**Manual:** Create `.vscode/mcp.json` in your workspace:\n```json\n{\n  \"servers\": {\n    \"pdf-mcp\": {\n      \"command\": \"pdf-mcp\"\n    }\n  }\n}\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eCodex CLI\u003c/strong\u003e\u003c/summary\u003e\n\n```bash\ncodex mcp add pdf-mcp -- pdf-mcp\n```\n\nOr configure manually in `~/.codex/config.toml`:\n\n```toml\n[mcp_servers.pdf-mcp]\ncommand = \"pdf-mcp\"\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eKiro\u003c/strong\u003e\u003c/summary\u003e\n\nCreate or edit `.kiro/settings/mcp.json` in your workspace:\n\n```json\n{\n  \"mcpServers\": {\n    \"pdf-mcp\": {\n      \"command\": \"pdf-mcp\",\n      \"args\": [],\n      \"disabled\": false\n    }\n  }\n}\n```\n\nSave and restart Kiro.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eOther MCP Clients\u003c/strong\u003e\u003c/summary\u003e\n\nMost MCP clients use a standard configuration format:\n\n```json\n{\n  \"mcpServers\": {\n    \"pdf-mcp\": {\n      \"command\": \"pdf-mcp\"\n    }\n  }\n}\n```\n\nWith `uvx` (for isolated environments):\n\n```json\n{\n  \"mcpServers\": {\n    \"pdf-mcp\": {\n      \"command\": \"uvx\",\n      \"args\": [\"pdf-mcp\"]\n    }\n  }\n}\n```\n\n\u003c/details\u003e\n\n### Verify Installation\n\n```bash\npdf-mcp --help\n```\n\n## Tools\n\n### `pdf_info` — Get Document Information\n\nReturns page count, metadata, table of contents, file size, and estimated token count. **Call this first** to understand a document before reading it.\n\n```\n\"Read the PDF at /path/to/document.pdf\"\n```\n\n### `pdf_read_pages` — Read Specific Pages\n\nRead selected pages to manage context size.\n\n```\n\"Read pages 1-10 of the PDF\"\n\"Read pages 15, 20, and 25-30\"\n```\n\n### `pdf_read_all` — Read Entire Document\n\nRead a complete document in one call. Subject to a safety limit on page count.\n\n```\n\"Read the entire PDF (it's only 10 pages)\"\n```\n\n### `pdf_search` — Search Within PDF\n\nFind relevant pages before loading content.\n\n```\n\"Search for 'quarterly revenue' in the PDF\"\n```\n\n### `pdf_get_toc` — Get Table of Contents\n\n```\n\"Show me the table of contents\"\n```\n\n### `pdf_extract_images` — Extract Images\n\n```\n\"Extract images from pages 1-5\"\n```\n\n### `pdf_cache_stats` — View Cache Statistics\n\n```\n\"Show PDF cache statistics\"\n```\n\n### `pdf_cache_clear` — Clear Cache\n\n```\n\"Clear expired PDF cache entries\"\n```\n\n## Example Workflow\n\nFor a large document (e.g., a 200-page annual report):\n\n```\nUser: \"Summarize the risk factors in this annual report\"\n\nAgent workflow:\n1. pdf_info(\"report.pdf\")\n   → 200 pages, TOC shows \"Risk Factors\" on page 89\n\n2. pdf_search(\"report.pdf\", \"risk factors\")\n   → Relevant pages: 89-110\n\n3. pdf_read_pages(\"report.pdf\", \"89-100\")\n   → First batch\n\n4. pdf_read_pages(\"report.pdf\", \"101-110\")\n   → Second batch\n\n5. Synthesize answer from chunks\n```\n\n## Caching\n\nThe server uses SQLite for persistent caching. This is necessary because MCP servers using STDIO transport are spawned as a new process for each conversation.\n\n**Cache location:** `~/.cache/pdf-mcp/cache.db`\n\n**What's cached:**\n\n| Data | Benefit |\n|------|---------|\n| Metadata | Avoid re-parsing document info |\n| Page text | Skip re-extraction |\n| Images | Skip re-encoding |\n| TOC | Skip re-parsing |\n\n**Cache invalidation:**\n- Automatic when file modification time changes\n- Manual via the `pdf_cache_clear` tool\n- TTL: 24 hours (configurable)\n\n## Configuration\n\nEnvironment variables:\n\n```bash\n# Cache directory (default: ~/.cache/pdf-mcp)\nPDF_MCP_CACHE_DIR=/path/to/cache\n\n# Cache TTL in hours (default: 24)\nPDF_MCP_CACHE_TTL=48\n```\n\n## Development\n\n```bash\ngit clone https://github.com/jztan/pdf-mcp.git\ncd pdf-mcp\n\n# Install with dev dependencies\npip install -e \".[dev]\"\n\n# Run tests\npytest tests/ -v\n\n# Type checking\nmypy src/\n\n# Linting\nflake8 src/\n\n# Formatting\nblack src/\n```\n\n## Why pdf-mcp?\n\n| | Without pdf-mcp | With pdf-mcp |\n|---|---|---|\n| Large PDFs | Context overflow | Chunked reading |\n| Repeated access | Re-parse every time | SQLite cache |\n| Finding content | Load everything | Search first |\n| Tool design | Single monolithic tool | 8 specialized tools |\n\n## Contributing\n\nContributions are welcome. Please submit a pull request.\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n\n## Links\n\n- [PyPI](https://pypi.org/project/pdf-mcp/)\n- [GitHub](https://github.com/jztan/pdf-mcp)\n- [MCP Documentation](https://modelcontextprotocol.io/)\n- [How I Built pdf-mcp](https://blog.jztan.com/how-i-built-pdf-mcp-solving-claude-large-pdf-limitations/) — The story behind this project\n- [MCP Server Security: 8 Vulnerabilities](https://blog.jztan.com/mcp-server-security-8-vulnerabilities/) — Security lessons from building MCP servers\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjztan%2Fpdf-mcp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjztan%2Fpdf-mcp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjztan%2Fpdf-mcp/lists"}