https://github.com/iamgerwin/php-pdf-to-markdown-parser

A lightweight PHP library to convert PDF documents into clean, structured Markdown. Supports text extraction, headings, lists, and code blocks for easier content reuse and publishing.

Host: GitHub
URL: https://github.com/iamgerwin/php-pdf-to-markdown-parser
Owner: iamgerwin
License: mit
Created: 2025-09-30T16:47:04.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-09-30T16:57:05.000Z (6 months ago)
Last Synced: 2025-12-21T00:43:42.076Z (4 months ago)
Language: PHP
Size: 13.7 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md

# PHP PDF to Markdown Parser

[![Tests](https://github.com/iamgerwin/php-pdf-to-markdown-parser/actions/workflows/tests.yml/badge.svg)](https://github.com/iamgerwin/php-pdf-to-markdown-parser/actions/workflows/tests.yml)
[![Latest Version on Packagist](https://img.shields.io/packagist/v/iamgerwin/php-pdf-to-markdown-parser.svg?style=flat-square)](https://packagist.org/packages/iamgerwin/php-pdf-to-markdown-parser)
[![Total Downloads](https://img.shields.io/packagist/dt/iamgerwin/php-pdf-to-markdown-parser.svg?style=flat-square)](https://packagist.org/packages/iamgerwin/php-pdf-to-markdown-parser)

A lightweight PHP library to convert PDF documents into clean, structured Markdown. Supports text extraction, headings, lists, tables, diagrams and code blocks for easier content reuse and publishing.

Because sometimes PDFs just need to chill out and become Markdown.

## Features

- 📝 **Text Extraction with Styling** - Preserves headings, bold, italic, and strikethrough formatting
- 📊 **Table Parsing** - Extracts tables with proper headers and body formatting
- 🎨 **Diagram Support** - Converts diagrams to Mermaid and dbdiagram.io formats
  - Flowcharts
  - Sequence diagrams
  - Entity Relationship Diagrams (ERD)
  - Gantt charts
  - Class diagrams
  - State diagrams
  - Pie charts
- 📋 **List Detection** - Automatically converts bullet points and numbered lists
- 💻 **Code Block Recognition** - Identifies and formats code snippets
- 🚀 **PHP 8.3 Compatible** - Built with modern PHP features
- ✅ **PSR-12 Compliant** - Follows PHP coding standards

## Installation

Install the package via Composer:

```bash
composer require iamgerwin/php-pdf-to-markdown-parser
```

## Usage

### Basic Usage

```php
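use Iamgerwin\PdfToMarkdownParser\PdfToMarkdownParser;

// Load Composer's autoloader (adjust the path to your project layout)
require __DIR__ . '/vendor/autoload.php';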

$parser = new PdfToMarkdownParser();

// Parse a PDF file
$markdown = $parser->parseFile('path/to/document.pdf');

// Parse PDF content
$pdfContent = file_get_contents('path/to/document.pdf');
$markdown = $parser->parseContent($pdfContent);

// Output the markdown
echo $markdown;
```
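
For converting a whole folder, a small wrapper can help. The following is a minimal sketch, not part of the library: `convertDirectory` is a hypothetical helper, and it takes the conversion step as a callable so it makes no assumptions about the parser beyond the `parseFile` method shown above.

```php
<?php

declare(strict_types=1);

/**
 * Convert every *.pdf file in $sourceDir and write a .md file next to it.
 *
 * $convert is any callable that maps a PDF path to a Markdown string.
 * This helper is hypothetical and not shipped with the library.
 *
 * @return list<string> Paths of the Markdown files that were written.
 */
function convertDirectory(string $sourceDir, callable $convert): array
{
    $written = [];
    // glob() returns false on error, so fall back to an empty list.
    $pdfFiles = glob(rtrim($sourceDir, '/') . '/*.pdf') ?: [];

    foreach ($pdfFiles as $pdfPath) {
        // Swap the .pdf extension for .md to get the output path.
        $markdownPath = substr($pdfPath, 0, -4) . '.md';
        file_put_contents($markdownPath, $convert($pdfPath));
        $written[] = $markdownPath;
    }

    return $written;
}
```

With the library installed, the callable could be `fn (string $path): string => $parser->parseFile($path)`.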

### Working with Tables

The parser automatically detects and converts tables in your PDF:

```markdown
| Header 1 | Header 2 | Header 3 |
| --- | --- | --- |
| Row 1 Col 1 | Row 1 Col 2 | Row 1 Col 3 |
| Row 2 Col 1 | Row 2 Col 2 | Row 2 Col 3 |
```
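
To make the output format concrete, here is a rough sketch of how rows of cells can be rendered in the shape shown above. `renderMarkdownTable` is a hypothetical function written for illustration; it is not the library's `TableExtractor` and may differ from what the library actually does.

```php
<?php

declare(strict_types=1);

/**
 * Render rows of cells as a Markdown table; the first row is the header.
 * Illustrative only, not the library's implementation.
 *
 * @param list<list<string>> $rows
 */
function renderMarkdownTable(array $rows): string
{
    if ($rows === []) {
        return '';
    }

    $rows = array_values($rows);
    $columnCount = max(array_map('count', $rows));
    $lines = [];

    foreach ($rows as $index => $row) {
        // Pad short rows so every line has the same number of cells.
        $cells = array_pad($row, $columnCount, '');
        // Escape literal pipes so they do not split a cell.
        $cells = array_map(
            fn (string $cell): string => str_replace('|', '\\|', trim($cell)),
            $cells
        );
        $lines[] = '| ' . implode(' | ', $cells) . ' |';

        if ($index === 0) {
            // Separator row between the header and the body.
            $lines[] = '| ' . implode(' | ', array_fill(0, $columnCount, '---')) . ' |';
        }
    }

    return implode("\n", $lines);
}
```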

### Diagram Extraction

Diagrams are automatically detected and converted to appropriate formats:

**Mermaid Flowcharts:**
````markdown
```mermaid
flowchart TD
    Start --> Process --> End
```
````

**ERD (dbdiagram.io format):**
````markdown
```dbdiagram
Table users {
  id int
  name varchar
  email varchar
}
```
````

**Sequence Diagrams:**
````markdown
```mermaid
sequenceDiagram
    User->>System: Request
    System->>Database: Query
    Database->>System: Response
    System->>User: Result
```
````
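
As an illustration of the sequence-diagram output format, the sketch below builds a fenced Mermaid block from a list of messages. `renderSequenceDiagram` and its `[from, to, text]` input shape are assumptions made for this example; the library's `DiagramExtractor` detects diagrams from PDF content and may work quite differently.

```php
<?php

declare(strict_types=1);

/**
 * Render messages as a fenced Mermaid sequence diagram.
 * Hypothetical helper, shown only to illustrate the output format.
 *
 * @param list<array{0: string, 1: string, 2: string}> $messages [from, to, text]
 */
function renderSequenceDiagram(array $messages): string
{
    // Three backticks, built at runtime so this sample is easy to embed in Markdown.
    $fence = str_repeat('`', 3);
    $lines = [$fence . 'mermaid', 'sequenceDiagram'];

    foreach ($messages as [$from, $to, $text]) {
        // Mermaid's solid-arrow message syntax: Sender->>Receiver: Text
        $lines[] = sprintf('    %s->>%s: %s', $from, $to, $text);
    }

    $lines[] = $fence;

    return implode("\n", $lines);
}
```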

### Text Styling

The parser preserves text styling from PDFs:

- Headings (H1-H6) based on font size and formatting
- **Bold text**
- *Italic text*
- ~~Strikethrough text~~
- Lists (bulleted and numbered)
- Code blocks
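
Heading detection based on font size can be sketched roughly as follows. The function names and the size thresholds (2.0, 1.5, 1.25 times the body size) are assumptions chosen for this example, not values taken from the library.

```php
<?php

declare(strict_types=1);

/**
 * Map a font size to a Markdown heading prefix, relative to the body size.
 * Thresholds are illustrative assumptions, not the library's values.
 */
function headingPrefix(float $fontSize, float $bodySize): string
{
    if ($bodySize <= 0.0) {
        throw new InvalidArgumentException('Body size must be positive.');
    }

    $ratio = $fontSize / $bodySize;

    // Larger text maps to a higher-ranking heading; body-sized text gets none.
    $level = match (true) {
        $ratio >= 2.0 => 1,
        $ratio >= 1.5 => 2,
        $ratio >= 1.25 => 3,
        default => 0,
    };

    return $level === 0 ? '' : str_repeat('#', $level) . ' ';
}

/** Prefix a line of text with a heading marker when its font size warrants one. */
function styleLine(string $text, float $fontSize, float $bodySize = 12.0): string
{
    return headingPrefix($fontSize, $bodySize) . trim($text);
}
```

A real PDF rarely exposes font sizes this cleanly, which is one reason styling detection is listed under Limitations.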

## Advanced Configuration

### Custom Extractors

You can extend the parser with custom extractors. The snippet below lists the built-in extractors the parser uses internally:

```php
use Iamgerwin\PdfToMarkdownParser\PdfToMarkdownParser;
use Iamgerwin\PdfToMarkdownParser\Extractors\TextExtractor;
use Iamgerwin\PdfToMarkdownParser\Extractors\TableExtractor;
use Iamgerwin\PdfToMarkdownParser\Extractors\DiagramExtractor;

$parser = new PdfToMarkdownParser();

// The parser uses these extractors internally:
// - TextExtractor: Handles text and styling
// - TableExtractor: Processes tables
// - DiagramExtractor: Converts diagrams
```

## Testing

Run the test suite:

```bash
composer test
```

Run tests with coverage:

```bash
composer test-coverage
```

Run PHPStan static analysis:

```bash
composer analyse
```

Format code with Laravel Pint:

```bash
composer format
```

## Requirements

- PHP 8.3 or higher
- ext-mbstring
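
If you want to verify these requirements at runtime (for example in a deployment check), a sketch along these lines may be enough. `missingRequirements` is a hypothetical helper, not something the package provides.

```php
<?php

declare(strict_types=1);

/**
 * Return a list of unmet requirements; an empty list means all is well.
 * Hypothetical helper for illustration.
 *
 * @param list<string> $loadedExtensions
 * @return list<string>
 */
function missingRequirements(string $phpVersion, array $loadedExtensions): array
{
    $problems = [];

    if (version_compare($phpVersion, '8.3.0', '<')) {
        $problems[] = "PHP 8.3 or higher is required, found {$phpVersion}";
    }

    if (!in_array('mbstring', $loadedExtensions, true)) {
        $problems[] = 'The mbstring extension is not loaded';
    }

    return $problems;
}

// Check the running interpreter.
$problems = missingRequirements(PHP_VERSION, get_loaded_extensions());
```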

## How It Works

The parser uses a multi-stage extraction process:

1. **PDF Parsing** - Uses the robust smalot/pdfparser library to extract raw content
2. **Text Analysis** - Identifies text styling, headings, and formatting patterns
3. **Table Detection** - Recognizes table structures (pipe, tab, or space-separated)
4. **Diagram Recognition** - Detects diagram patterns and converts to Mermaid/dbdiagram formats
5. **Markdown Generation** - Combines all elements into properly formatted Markdown
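
To illustrate step 3, the sketch below splits a single line into cells using the three separators mentioned (pipe, tab, or runs of spaces). `splitTableRow` is a hypothetical function, and treating two or more consecutive spaces as a column gap is an assumption; the library's actual heuristics are not documented here.

```php
<?php

declare(strict_types=1);

/**
 * Split a line into table cells, or return null if it does not look like a row.
 * Illustrative heuristic only, not the library's implementation.
 *
 * @return list<string>|null
 */
function splitTableRow(string $line): ?array
{
    $line = trim($line);

    if ($line === '') {
        return null;
    }

    if (str_contains($line, '|')) {
        // Pipe-separated: drop the outer pipes, then split on the rest.
        $cells = explode('|', trim($line, '|'));
    } elseif (str_contains($line, "\t")) {
        // Tab-separated.
        $cells = explode("\t", $line);
    } else {
        // Space-separated: assume two or more spaces mark a column gap.
        $cells = preg_split('/ {2,}/', $line) ?: [];
    }

    $cells = array_map('trim', $cells);

    // A single cell is ordinary text, not a table row.
    return count($cells) >= 2 ? $cells : null;
}
```

A full detector would also need to check that neighbouring lines yield the same number of cells before treating them as one table.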

## Limitations

- **Images**: Currently, images are not extracted (coming in future versions)
- **Complex Layouts**: Multi-column layouts may require manual adjustment
- **Font Styling**: Bold/italic detection is approximate because font metadata parsing is limited
- **Diagrams**: Pattern matching may not catch all diagram types

## Changelog

Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed recently.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Security

If you discover any security-related issues, please email iamgerwin@live.com instead of using the issue tracker.

## Credits

- [iamgerwin](https://github.com/iamgerwin)

## License

The MIT License (MIT). Please see [License File](LICENSE.md) for more information.

## Acknowledgments

Built with inspiration from the PHP community and the need to make PDF content more accessible and reusable. Special thanks to the maintainers of [smalot/pdfparser](https://github.com/smalot/pdfparser) for their excellent PDF parsing library.
