https://github.com/iamgerwin/php-pdf-to-markdown-parser
A lightweight PHP library to convert PDF documents into clean, structured Markdown. Supports text extraction, headings, lists, and code blocks for easier content reuse and publishing.
- Host: GitHub
- URL: https://github.com/iamgerwin/php-pdf-to-markdown-parser
- Owner: iamgerwin
- License: mit
- Created: 2025-09-30T16:47:04.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-09-30T16:57:05.000Z (6 months ago)
- Last Synced: 2025-12-21T00:43:42.076Z (4 months ago)
- Language: PHP
- Size: 13.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md
# PHP PDF to Markdown Parser
A lightweight PHP library to convert PDF documents into clean, structured Markdown. Supports text extraction, headings, lists, tables, diagrams, and code blocks for easier content reuse and publishing.
Because sometimes PDFs just need to chill out and become Markdown.
## Features
- 📝 **Text Extraction with Styling** - Preserves headings, bold, italic, and strikethrough formatting
- 📊 **Table Parsing** - Extracts tables with proper headers and body formatting
- 🎨 **Diagram Support** - Converts diagrams to Mermaid and dbdiagram.io formats
  - Flowcharts
  - Sequence diagrams
  - Entity Relationship Diagrams (ERD)
  - Gantt charts
  - Class diagrams
  - State diagrams
  - Pie charts
- 📋 **List Detection** - Automatically converts bullet points and numbered lists
- 💻 **Code Block Recognition** - Identifies and formats code snippets
- 🚀 **PHP 8.3 Compatible** - Built with modern PHP features
- ✅ **PSR-12 Compliant** - Follows PHP coding standards
## Installation
Install the package via Composer:
```bash
composer require iamgerwin/php-pdf-to-markdown-parser
```
## Usage
### Basic Usage
```php
// Composer autoloader (adjust the path to match your project layout)
require __DIR__ . '/vendor/autoload.php';

use Iamgerwin\PdfToMarkdownParser\PdfToMarkdownParser;

$parser = new PdfToMarkdownParser();

// Parse a PDF file from disk
$markdown = $parser->parseFile('path/to/document.pdf');

// Or parse raw PDF bytes that are already in memory
$pdfContent = file_get_contents('path/to/document.pdf');
$markdown = $parser->parseContent($pdfContent);

// Output the Markdown
echo $markdown;
```
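If you want the result on disk rather than echoed, a small wrapper can help. The sketch below is not part of the library: `convertToMarkdownFile` is a hypothetical helper that accepts any callable returning Markdown for a path, so it runs on its own and can be paired with `parseFile` once the package is installed.

```php
<?php

declare(strict_types=1);

/**
 * Convert one PDF with the given converter and write the result next to it
 * as a .md file. Hypothetical helper: $convert is any callable that takes a
 * file path and returns a Markdown string.
 */
function convertToMarkdownFile(callable $convert, string $pdfPath): string
{
    $markdown = $convert($pdfPath);

    // Swap a trailing ".pdf" (any case) for ".md"
    $target = preg_replace('/\.pdf$/i', '', $pdfPath) . '.md';

    if (file_put_contents($target, $markdown) === false) {
        throw new RuntimeException("Could not write {$target}");
    }

    return $target;
}

// With the library installed, the documented method can serve as the converter:
//   $parser = new PdfToMarkdownParser();
//   convertToMarkdownFile($parser->parseFile(...), 'docs/report.pdf');
```

Passing the converter as a callable keeps the file handling separate from the parsing, which also makes the helper easy to test with a stub.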
### Working with Tables
The parser automatically detects and converts tables in your PDF:
```markdown
| Header 1 | Header 2 | Header 3 |
| --- | --- | --- |
| Row 1 Col 1 | Row 1 Col 2 | Row 1 Col 3 |
| Row 2 Col 1 | Row 2 Col 2 | Row 2 Col 3 |
```
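To illustrate the kind of conversion involved, here is a minimal, standalone sketch that turns delimited text rows into the table format above. It is not the library's implementation: `splitTableRow` and `renderMarkdownTable` are hypothetical names, and the splitting rules (pipes, then tabs, then runs of two or more spaces) are an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Split one raw text line into cells. Hypothetical helper: tries pipes first,
 * then tabs, then runs of two or more spaces.
 *
 * @return list<string>
 */
function splitTableRow(string $line): array
{
    $line = trim($line);
    if (str_contains($line, '|')) {
        $cells = explode('|', trim($line, '|'));
    } elseif (str_contains($line, "\t")) {
        $cells = explode("\t", $line);
    } else {
        $cells = preg_split('/ {2,}/', $line) ?: [];
    }

    return array_map('trim', $cells);
}

/**
 * Render rows as a Markdown table; the first row is treated as the header.
 * Short rows are padded with empty cells.
 *
 * @param list<list<string>> $rows
 */
function renderMarkdownTable(array $rows): string
{
    if ($rows === []) {
        return '';
    }
    $width = max(array_map('count', $rows));
    $format = static fn (array $cells): string =>
        '| ' . implode(' | ', array_pad($cells, $width, '')) . ' |';

    $lines = [$format($rows[0]), $format(array_fill(0, $width, '---'))];
    foreach (array_slice($rows, 1) as $row) {
        $lines[] = $format($row);
    }

    return implode("\n", $lines);
}

$raw = ["Name\tRole", "Ada\tEngineer", "Linus\tMaintainer"];
echo renderMarkdownTable(array_map('splitTableRow', $raw)), "\n";
```

Space-separated input is the least reliable case, since a single space inside a cell is indistinguishable from ordinary text; requiring at least two spaces is one common compromise.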
### Diagram Extraction
Diagrams are automatically detected and converted to appropriate formats:
**Mermaid Flowcharts:**
````markdown
```mermaid
flowchart TD
    Start --> Process --> End
```
````
**ERD (dbdiagram.io format):**
````markdown
```dbdiagram
Table users {
  id int
  name varchar
  email varchar
}
```
````
**Sequence Diagrams:**
````markdown
```mermaid
sequenceDiagram
    User->>System: Request
    System->>Database: Query
    Database->>System: Response
    System->>User: Result
```
````
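As a rough illustration of how diagram source could be matched to a fence language, the sketch below checks the leading keyword. It is an assumption about one possible approach, not the library's detection logic; `diagramFenceLanguage` and `fenceDiagram` are hypothetical names. The Mermaid keywords are the standard diagram declarations.

```php
<?php

declare(strict_types=1);

/**
 * Guess the fence language for diagram source from its first keyword.
 * Returns null when nothing matches. Hypothetical helper.
 */
function diagramFenceLanguage(string $source): ?string
{
    $source = ltrim($source);
    $mermaid = '/^(?:flowchart|graph|sequenceDiagram|classDiagram|stateDiagram|erDiagram|gantt|pie)\b/';

    if (preg_match($mermaid, $source) === 1) {
        return 'mermaid';
    }
    // dbdiagram.io table definitions start with "Table <name> {"
    if (preg_match('/^Table\s+\w+\s*\{/', $source) === 1) {
        return 'dbdiagram';
    }

    return null;
}

/** Wrap recognized diagram source in a fenced block; leave other text untouched. */
function fenceDiagram(string $source): string
{
    $language = diagramFenceLanguage($source);
    if ($language === null) {
        return $source;
    }
    $fence = str_repeat('`', 3);

    return $fence . $language . "\n" . trim($source) . "\n" . $fence;
}

echo fenceDiagram("flowchart TD\n    Start --> Process --> End"), "\n";
```

Keyword matching of this kind is deliberately simple: ordinary prose that happens to begin with a word such as `pie` would be misclassified, so a real detector would need additional checks.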
### Text Styling
The parser preserves text styling from PDFs:
- Headings (H1-H6) based on font size and formatting
- **Bold text**
- *Italic text*
- ~~Strikethrough text~~
- Lists (bulleted and numbered)
- Code blocks
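Two of the items above lend themselves to a short sketch: mapping font size to a heading level, and normalizing list markers. The code below is illustrative only. The function names are hypothetical and the size-ratio thresholds are guesses, not the values the library uses.

```php
<?php

declare(strict_types=1);

/**
 * Map a font size to a Markdown heading prefix, relative to the body size.
 * The ratio thresholds are illustrative guesses, not the library's values.
 */
function headingPrefix(float $fontSize, float $bodySize): string
{
    if ($bodySize <= 0.0) {
        throw new InvalidArgumentException('Body font size must be positive.');
    }
    $ratio = $fontSize / $bodySize;
    $level = match (true) {
        $ratio >= 2.0 => 1,
        $ratio >= 1.5 => 2,
        $ratio >= 1.2 => 3,
        default => 0, // treat as body text
    };

    return $level > 0 ? str_repeat('#', $level) . ' ' : '';
}

/** Rewrite PDF-style bullets and "1)" numbering as Markdown list markers. */
function normalizeListLine(string $line): string
{
    // Bullet glyphs often found in extracted PDF text, plus ASCII * and -
    if (preg_match('/^\s*[•◦▪‣*-]\s+(.+)$/u', $line, $m) === 1) {
        return '- ' . $m[1];
    }
    // "1." or "1)" style numbering
    if (preg_match('/^\s*(\d+)[.)]\s+(.+)$/u', $line, $m) === 1) {
        return $m[1] . '. ' . $m[2];
    }

    return $line;
}

echo headingPrefix(24.0, 12.0), "Introduction\n";
echo normalizeListLine('• First item'), "\n";
echo normalizeListLine('2) Second item'), "\n";
```

Comparing against the body size rather than using absolute point sizes keeps the mapping stable across documents set in different base sizes.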
## Advanced Configuration
### Custom Extractors
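The parser delegates its work to three extractor classes. The snippet below only shows which classes are involved; it does not register a custom extractor: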
```php
use Iamgerwin\PdfToMarkdownParser\PdfToMarkdownParser;
use Iamgerwin\PdfToMarkdownParser\Extractors\TextExtractor;
use Iamgerwin\PdfToMarkdownParser\Extractors\TableExtractor;
use Iamgerwin\PdfToMarkdownParser\Extractors\DiagramExtractor;
$parser = new PdfToMarkdownParser();
// The parser uses these extractors internally:
// - TextExtractor: Handles text and styling
// - TableExtractor: Processes tables
// - DiagramExtractor: Converts diagrams
```
## Testing
Run the test suite:
```bash
composer test
```
Run tests with coverage:
```bash
composer test-coverage
```
Run PHPStan static analysis:
```bash
composer analyse
```
Format code with Laravel Pint:
```bash
composer format
```
## Requirements
- PHP 8.3 or higher
- ext-mbstring
## How It Works
The parser uses a multi-stage extraction process:
1. **PDF Parsing** - Uses the robust smalot/pdfparser library to extract raw content
2. **Text Analysis** - Identifies text styling, headings, and formatting patterns
3. **Table Detection** - Recognizes table structures (pipe-, tab-, or space-separated)
4. **Diagram Recognition** - Detects diagram patterns and converts to Mermaid/dbdiagram formats
5. **Markdown Generation** - Combines all elements into properly formatted Markdown
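The staged design above can be sketched as a list of callables applied in order. This is a simplified stand-in, not the library's code: `runPipeline` and the three stages are hypothetical, and they operate on text that has already been extracted.

```php
<?php

declare(strict_types=1);

/**
 * Run text through an ordered list of stages, each a callable string => string.
 *
 * @param list<callable(string): string> $stages
 */
function runPipeline(string $rawText, array $stages): string
{
    return array_reduce(
        $stages,
        static fn (string $text, callable $stage): string => $stage($text),
        $rawText
    );
}

// Hypothetical stages standing in for the analysis steps listed above.
$stages = [
    // Normalize line endings and collapse runs of blank lines.
    static fn (string $t): string =>
        preg_replace("/\n{3,}/", "\n\n", str_replace("\r\n", "\n", $t)) ?? $t,
    // Turn bullet glyphs at the start of a line into Markdown list markers.
    static fn (string $t): string =>
        preg_replace('/^[ \t]*[•▪]\s+/mu', '- ', $t) ?? $t,
    // End the document with exactly one newline.
    static fn (string $t): string => trim($t) . "\n",
];

// In a real run the raw text would come from the PDF parsing stage, for
// example via smalot/pdfparser:
//   $rawText = (new \Smalot\PdfParser\Parser())->parseFile('document.pdf')->getText();
$rawText = "Summary\r\n\r\n\r\n\r\n• First point\r\n• Second point\r\n";

echo runPipeline($rawText, $stages);
```

Keeping each stage a pure string-to-string function makes it straightforward to reorder stages or test one in isolation.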
## Limitations
- **Images**: Images are not currently extracted (planned for a future version)
- **Complex Layouts**: Multi-column layouts may require manual adjustment
- **Font Styling**: Bold/italic detection is basic because font metadata parsing is limited
- **Diagrams**: Pattern matching may not catch all diagram types
## Changelog
Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed recently.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Security
If you discover any security-related issues, please email iamgerwin@live.com instead of using the issue tracker.
## Credits
- [iamgerwin](https://github.com/iamgerwin)
## License
The MIT License (MIT). Please see [License File](LICENSE.md) for more information.
## Acknowledgments
Built with inspiration from the PHP community and the need to make PDF content more accessible and reusable. Special thanks to the maintainers of [smalot/pdfparser](https://github.com/smalot/pdfparser) for their excellent PDF parsing library.