{"id":29958305,"url":"https://github.com/vinit-source/pdf-to-markdown","last_synced_at":"2025-08-03T20:11:15.424Z","repository":{"id":307292566,"uuid":"1028875852","full_name":"Vinit-source/PDF-to-Markdown","owner":"Vinit-source","description":"PDF to Markdown Conversion Tool","archived":false,"fork":false,"pushed_at":"2025-07-30T12:24:02.000Z","size":48,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-30T14:49:07.660Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Vinit-source.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-30T07:33:37.000Z","updated_at":"2025-07-30T12:24:05.000Z","dependencies_parsed_at":"2025-07-30T14:49:11.675Z","dependency_job_id":"f3354893-0579-4e8d-b597-f0c45b50ff3d","html_url":"https://github.com/Vinit-source/PDF-to-Markdown","commit_stats":null,"previous_names":["vinit-source/pdf-to-markdown"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Vinit-source/PDF-to-Markdown","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vinit-source%2FPDF-to-Markdown","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vinit-source%2FPDF-to-Markdown/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vinit-source%2FPDF-to-Markdown/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vinit-source%2FPDF-to-
Markdown/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Vinit-source","download_url":"https://codeload.github.com/Vinit-source/PDF-to-Markdown/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vinit-source%2FPDF-to-Markdown/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268604282,"owners_count":24276998,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-03T02:00:12.545Z","response_time":2577,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-03T20:10:50.690Z","updated_at":"2025-08-03T20:11:15.413Z","avatar_url":"https://github.com/Vinit-source.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF to Markdown Converter with MCP Client Integration\n\nA sophisticated tool that converts PDF files to Markdown format while intelligently preserving document structure, formatting, links, lists, headings, and images. 
It uses an MCP (Model Context Protocol) client interface for advanced structure analysis, supporting both interactive and automated workflows.\n\n## Features\n\n### 🧠 Smart Structure Detection\n- **MCP Client Analysis**: Integrates with MCP clients (like GitHub Copilot) for semantic document structure classification\n- **Interactive \u0026 Automated Modes**: Supports interactive user prompts or automated callbacks for structure analysis\n- **Heuristic Fallback**: Robust font-size and pattern-based analysis when MCP/LLM is unavailable\n\n### 📄 Format Preservation\n- **Heading Hierarchy**: Maintains proper markdown heading levels (# ## ### ####)\n- **Lists**: Preserves bullet points, numbered lists, and nested structures\n- **Images**: Extracts images and creates proper markdown references\n- **Links**: Maintains internal and external link references\n- **Tables**: Handles table structures and formatting\n- **Styling**: Preserves bold, italic, and other text formatting\n\n### 🔧 MCP Client Integration\n- **Interactive Mode**: Prompts user to paste MCP/LLM JSON analysis, or falls back to heuristics\n- **Automated Mode**: Accepts a callback for programmatic structure analysis\n- **Legacy LLM Support**: Direct LLM API usage is deprecated, but still available for backward compatibility\n\n## Installation\n\n```bash\n# Clone or download the project\ncd pdf_to_markdown\n\n# Install dependencies\npip install -r requirements.txt\n```\n\n## Usage\n\n### As a Standalone Script\n\n```bash\n# Basic conversion (interactive MCP client mode by default)\npython pdf_to_markdown.py document.pdf\n\n# With custom output directory\npython pdf_to_markdown.py document.pdf -o /path/to/output\n\n# With custom output filename\npython pdf_to_markdown.py document.pdf -n output.md\n\n# Skip image extraction\npython pdf_to_markdown.py document.pdf --no-images\n\n# Disable MCP interactive mode (for automated/callback use)\npython pdf_to_markdown.py document.pdf --no-mcp-interactive\n\n# 
(Deprecated) With LLM enhancement\npython pdf_to_markdown.py document.pdf \\\n  --llm-api-url \"https://api.openai.com/v1/chat/completions\" \\\n  --llm-api-key \"your-api-key\"\n```\n\n### As an MCP Server\n\n```bash\n# Start the MCP server\npython mcp_server.py\n```\n\nThe server exposes three tools via the Model Context Protocol:\n\n#### 1. convert_pdf_to_markdown\nConverts a PDF file to Markdown format with intelligent structure detection.\n\n**Parameters:**\n- `pdf_path` (required): Path to the PDF file\n- `output_dir` (optional): Output directory for results\n- `output_name` (optional): Custom name for the markdown file\n- `extract_images` (optional): Whether to extract images (default: true)\n- `llm_api_url` (optional): LLM API endpoint for enhanced analysis\n- `llm_api_key` (optional): API key for LLM service\n\n#### 2. convert_pdf_from_base64\nConverts a base64-encoded PDF to Markdown format.\n\n**Parameters:**\n- `pdf_base64` (required): Base64-encoded PDF content\n- `filename` (required): Original filename for naming output\n- `output_dir` (optional): Output directory\n- `extract_images` (optional): Whether to extract images (default: true)\n- `llm_api_url` (optional): LLM API endpoint\n- `llm_api_key` (optional): API key for LLM service\n\n#### 3. 
analyze_pdf_structure\nAnalyzes PDF structure without full conversion, useful for understanding document layout.\n\n**Parameters:**\n- `pdf_path` (required): Path to the PDF file\n- `llm_api_url` (optional): LLM API endpoint for enhanced analysis\n- `llm_api_key` (optional): API key for LLM service\n\n### Python API\n\n```python\nfrom pdf_to_markdown import PDFToMarkdownConverter\n\ndef my_analysis_callback(prompt, text_blocks):\n    # Implement your automated MCP/LLM analysis here\n    ...\n\nconverter = PDFToMarkdownConverter(\n    pdf_path=\"document.pdf\",\n    output_dir=\"output\",\n    extract_images=True,\n    mcp_interactive=False  # Automated mode\n)\nconverter.set_mcp_analysis_callback(my_analysis_callback)\noutput_path = converter.convert()\nprint(f\"Conversion completed: {output_path}\")\n```\n\n## Smart Algorithm Details\n\n### MCP/LLM-Enhanced Structure Detection\n\nThe converter uses a multi-stage approach for intelligent document analysis:\n\n1. **Text Extraction with Metadata**: Extracts text along with font information, positioning, and formatting details\n2. **MCP/LLM Analysis**: Sends structured text blocks to an MCP client (or LLM) for semantic analysis and classification\n3. **Structure Mapping**: Maps classifications to appropriate markdown elements\n4. 
**Post-Processing**: Applies additional formatting rules and cleanup\n\n### Fallback Heuristics\n\nWhen MCP/LLM is not available, the system uses sophisticated heuristics:\n- **Font Size Analysis**: Larger fonts typically indicate headings\n- **Pattern Recognition**: Detects list markers, numbering patterns\n- **Positional Analysis**: Uses text positioning for structure hints\n- **Content Patterns**: Recognizes common document structures\n\n### Supported Document Elements\n\n- **Headings**: H1-H6 based on font size and MCP/LLM analysis\n- **Paragraphs**: Regular text content with proper spacing\n- **Lists**: Bullet points, numbered lists, nested structures\n- **Images**: Extracted as PNG files with markdown references\n- **Tables**: Basic table structure preservation\n- **Links**: Internal and external link maintenance\n- **Formatting**: Bold, italic, and other text styles\n\n## Configuration\n\n### MCP Client Integration\n\n- **Interactive Mode**: Default. Prompts user for JSON analysis from MCP client (e.g., Copilot)\n- **Automated Mode**: Use `mcp_interactive=False` and set a callback via `set_mcp_analysis_callback()`\n- **LLM API**: Direct LLM API usage (`--llm-api-url`, `--llm-api-key`) is deprecated\n\n### Output Customization\n\n- **Directory Structure**: Automatically creates organized output directories\n- **Image Handling**: Configurable image extraction and referencing\n- **Naming Conventions**: Flexible file naming with hash-based deduplication\n- **Format Options**: Clean markdown with proper spacing and hierarchy\n\n## Error Handling\n\nThe system includes comprehensive error handling:\n- **File Validation**: Checks for valid PDF files and permissions\n- **Network Resilience**: Handles MCP/LLM API failures gracefully\n- **Memory Management**: Efficient handling of large PDFs\n- **Corruption Recovery**: Attempts to process partially corrupted PDFs\n\n## Examples\n\n### Basic Document Conversion\n\n```python\nconverter = 
PDFToMarkdownConverter(\"report.pdf\")\noutput = converter.convert()\n```\n\n### Automated MCP Analysis\n\n```python\ndef my_analysis_callback(prompt, text_blocks):\n    # Implement your automated MCP/LLM analysis here\n    ...\n\nconverter = PDFToMarkdownConverter(\n    pdf_path=\"complex_document.pdf\",\n    mcp_interactive=False\n)\nconverter.set_mcp_analysis_callback(my_analysis_callback)\noutput = converter.convert(\"enhanced_output.md\")\n```\n\n### Structure Analysis Only\n\n```python\nimport json\n\nfrom pdf_to_markdown import PDFToMarkdownConverter\n\nconverter = PDFToMarkdownConverter(\"document.pdf\")\nconverter.open_pdf()\nblocks = converter.extract_text_with_formatting(converter.doc[0])\nstructure = converter.analyze_structure_with_mcp(blocks)\nprint(json.dumps(structure, indent=2))\n```\n\n## Dependencies\n\n- `pymupdf\u003e=1.23.0` - PDF processing\n- `Pillow\u003e=10.0.0` - Image handling\n- `requests\u003e=2.31.0` - HTTP requests for MCP/LLM APIs\n- `mcp\u003e=1.0.0` - Model Context Protocol client\n- `pydantic\u003e=2.0.0` - Data validation\n\n## License\n\nThis project is available under the terms specified in the LICENSE file.\n\n## Contributing\n\nContributions are welcome! Please ensure:\n1. Code follows Python best practices\n2. New features include appropriate tests\n3. Documentation is updated for new functionality\n4. MCP/LLM integration remains optional for accessibility\n\n## Troubleshooting\n\n### Common Issues\n\n**PyMuPDF Installation**: On some systems, you may need to install system dependencies:\n```bash\n# Ubuntu/Debian\nsudo apt-get install libmupdf-dev\n\n# macOS with Homebrew\nbrew install mupdf-tools\n```\n\n**Memory Issues**: For large PDFs, consider processing in chunks or using a machine with more RAM.\n\n**MCP/LLM API Errors**: The system gracefully falls back to heuristic analysis if MCP/LLM APIs are unavailable.\n\n**Image Extraction**: Some PDFs have embedded images that may not extract cleanly. 
The system handles these cases gracefully.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvinit-source%2Fpdf-to-markdown","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvinit-source%2Fpdf-to-markdown","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvinit-source%2Fpdf-to-markdown/lists"}