{"id":36981577,"url":"https://github.com/iamgerwin/php-pdf-to-markdown-parser","last_synced_at":"2026-01-13T22:51:26.035Z","repository":{"id":323775838,"uuid":"1067295110","full_name":"iamgerwin/php-pdf-to-markdown-parser","owner":"iamgerwin","description":"A lightweight PHP library to convert PDF documents into clean, structured Markdown. Supports text extraction, headings, lists, and code blocks for easier content reuse and publishing.","archived":false,"fork":false,"pushed_at":"2025-09-30T16:57:05.000Z","size":14,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-21T00:43:42.076Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iamgerwin.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-30T16:47:04.000Z","updated_at":"2025-09-30T16:57:08.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/iamgerwin/php-pdf-to-markdown-parser","commit_stats":null,"previous_names":["iamgerwin/php-pdf-to-markdown-parser"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/iamgerwin/php-pdf-to-markdown-parser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamgerwin%2Fphp-pdf-to-markdown-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamgerwin%2Fph
p-pdf-to-markdown-parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamgerwin%2Fphp-pdf-to-markdown-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamgerwin%2Fphp-pdf-to-markdown-parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iamgerwin","download_url":"https://codeload.github.com/iamgerwin/php-pdf-to-markdown-parser/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamgerwin%2Fphp-pdf-to-markdown-parser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28402176,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-13T14:36:09.778Z","status":"ssl_error","status_checked_at":"2026-01-13T14:35:19.697Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-13T22:51:25.340Z","updated_at":"2026-01-13T22:51:26.026Z","avatar_url":"https://github.com/iamgerwin.png","language":"PHP","readme":"# PHP PDF to Markdown Parser\n\n[![Tests](https://github.com/iamgerwin/php-pdf-to-markdown-parser/actions/workflows/tests.yml/badge.svg)](https://github.com/iamgerwin/php-pdf-to-markdown-parser/actions/workflows/tests.yml)\n[![Latest Version on 
Packagist](https://img.shields.io/packagist/v/iamgerwin/php-pdf-to-markdown-parser.svg?style=flat-square)](https://packagist.org/packages/iamgerwin/php-pdf-to-markdown-parser)\n[![Total Downloads](https://img.shields.io/packagist/dt/iamgerwin/php-pdf-to-markdown-parser.svg?style=flat-square)](https://packagist.org/packages/iamgerwin/php-pdf-to-markdown-parser)\n\nA lightweight PHP library to convert PDF documents into clean, structured Markdown. Supports text extraction, headings, lists, tables, diagrams and code blocks for easier content reuse and publishing.\n\nBecause sometimes PDFs just need to chill out and become Markdown.\n\n## Features\n\n- 📝 **Text Extraction with Styling** - Preserves headings, bold, italic, and strikethrough formatting\n- 📊 **Table Parsing** - Extracts tables with proper headers and body formatting\n- 🎨 **Diagram Support** - Converts diagrams to Mermaid and dbdiagram.io formats\n  - Flowcharts\n  - Sequence diagrams\n  - Entity Relationship Diagrams (ERD)\n  - Gantt charts\n  - Class diagrams\n  - State diagrams\n  - Pie charts\n- 📋 **List Detection** - Automatically converts bullet points and numbered lists\n- 💻 **Code Block Recognition** - Identifies and formats code snippets\n- 🚀 **PHP 8.3 Compatible** - Built with modern PHP features\n- ✅ **PSR-12 Compliant** - Follows PHP coding standards\n\n## Installation\n\nYou can install the package via composer:\n\n```bash\ncomposer require iamgerwin/php-pdf-to-markdown-parser\n```\n\n## Usage\n\n### Basic Usage\n\n```php\nuse Iamgerwin\\PdfToMarkdownParser\\PdfToMarkdownParser;\n\n$parser = new PdfToMarkdownParser();\n\n// Parse a PDF file\n$markdown = $parser-\u003eparseFile('path/to/document.pdf');\n\n// Parse PDF content\n$pdfContent = file_get_contents('path/to/document.pdf');\n$markdown = $parser-\u003eparseContent($pdfContent);\n\n// Output the markdown\necho $markdown;\n```\n\n### Working with Tables\n\nThe parser automatically detects and converts tables in your PDF:\n\n```markdown\n| 
Header 1 | Header 2 | Header 3 |\n| --- | --- | --- |\n| Row 1 Col 1 | Row 1 Col 2 | Row 1 Col 3 |\n| Row 2 Col 1 | Row 2 Col 2 | Row 2 Col 3 |\n```\n\n### Diagram Extraction\n\nDiagrams are automatically detected and converted to appropriate formats:\n\n**Mermaid Flowcharts:**\n````markdown\n```mermaid\nflowchart TD\n    Start --\u003e Process --\u003e End\n```\n````\n\n**ERD (dbdiagram.io format):**\n````markdown\n```dbdiagram\nTable users {\n  id int\n  name varchar\n  email varchar\n}\n```\n````\n\n**Sequence Diagrams:**\n````markdown\n```mermaid\nsequenceDiagram\n    User-\u003e\u003eSystem: Request\n    System-\u003e\u003eDatabase: Query\n    Database-\u003e\u003eSystem: Response\n    System-\u003e\u003eUser: Result\n```\n````\n\n### Text Styling\n\nThe parser preserves text styling from PDFs:\n\n- Headings (H1-H6) based on font size and formatting\n- **Bold text**\n- *Italic text*\n- ~~Strikethrough text~~\n- Lists (bulleted and numbered)\n- Code blocks\n\n## Advanced Configuration\n\n### Custom Extractors\n\nYou can extend the parser with custom extractors:\n\n```php\nuse Iamgerwin\\PdfToMarkdownParser\\PdfToMarkdownParser;\nuse Iamgerwin\\PdfToMarkdownParser\\Extractors\\TextExtractor;\nuse Iamgerwin\\PdfToMarkdownParser\\Extractors\\TableExtractor;\nuse Iamgerwin\\PdfToMarkdownParser\\Extractors\\DiagramExtractor;\n\n$parser = new PdfToMarkdownParser();\n\n// The parser uses these extractors internally:\n// - TextExtractor: Handles text and styling\n// - TableExtractor: Processes tables\n// - DiagramExtractor: Converts diagrams\n```\n\n## Testing\n\nRun the test suite:\n\n```bash\ncomposer test\n```\n\nRun tests with coverage:\n\n```bash\ncomposer test-coverage\n```\n\nRun PHPStan static analysis:\n\n```bash\ncomposer analyse\n```\n\nFormat code with Laravel Pint:\n\n```bash\ncomposer format\n```\n\n## Requirements\n\n- PHP 8.3 or higher\n- ext-mbstring\n\n## How It Works\n\nThe parser uses a multi-stage extraction process:\n\n1. 
**PDF Parsing** - Uses the robust smalot/pdfparser library to extract raw content\n2. **Text Analysis** - Identifies text styling, headings, and formatting patterns\n3. **Table Detection** - Recognizes table structures (pipe, tab, or space-separated)\n4. **Diagram Recognition** - Detects diagram patterns and converts to Mermaid/dbdiagram formats\n5. **Markdown Generation** - Combines all elements into properly formatted Markdown\n\n## Limitations\n\n- **Images**: Currently, images are not extracted (coming in future versions)\n- **Complex Layouts**: Multi-column layouts may require manual adjustment\n- **Font Styling**: Basic bold/italic detection is simplified (font metadata parsing is limited)\n- **Diagrams**: Pattern matching may not catch all diagram types\n\n## Changelog\n\nPlease see [CHANGELOG](CHANGELOG.md) for more information on what has changed recently.\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## Security\n\nIf you discover any security related issues, please email iamgerwin@live.com instead of using the issue tracker.\n\n## Credits\n\n- [iamgerwin](https://github.com/iamgerwin)\n\n## License\n\nThe MIT License (MIT). Please see [License File](LICENSE.md) for more information.\n\n## Acknowledgments\n\nBuilt with inspiration from the PHP community and the need to make PDF content more accessible and reusable. Special thanks to the maintainers of [smalot/pdfparser](https://github.com/smalot/pdfparser) for their excellent PDF parsing library.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiamgerwin%2Fphp-pdf-to-markdown-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiamgerwin%2Fphp-pdf-to-markdown-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiamgerwin%2Fphp-pdf-to-markdown-parser/lists"}