https://github.com/no0bitah/pdf-highlight-extractor
A Python tool for extracting highlighted text from PDF files while preserving formatting attributes (headers, bold, italic) and removing unwanted line breaks and page breaks. Perfect for integrating with content management systems.
- Host: GitHub
- URL: https://github.com/no0bitah/pdf-highlight-extractor
- Owner: No0Bitah
- License: mit
- Created: 2025-05-15T05:26:39.000Z
- Default Branch: main
- Last Pushed: 2025-05-15T05:49:06.000Z
- Last Synced: 2025-06-10T07:08:27.667Z
- Topics: automation, crm, documentation-tool, numpy, opencv, pdf, pdf-document-processor, pillow, pymupdf, pypdfium2, python3, scrapping
- Language: Python
- Homepage:
- Size: 81.1 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# PDF Highlight Extractor
A powerful Python tool for extracting highlighted text from PDF documents while preserving formatting information such as headers, bold text, and italics.
## Features
- Extracts text from highlighted areas in PDF documents
- Preserves text formatting (headers, bold, italic)
- Outputs formatted text in Markdown or HTML
- Detects and preserves hierarchical structure of documents
- Command-line interface for easy integration into workflows
- Intelligent paragraph and formatting detection
## Use Cases
- Content research and collection
- Academic paper review and note-taking
- Legal document analysis and extraction
- Knowledge management systems
- Content migration to CMSs
## Requirements
- Python 3.7+
- Dependencies:
  - PyMuPDF (fitz) - For PDF text extraction and annotation handling
  - pypdfium2 - For PDF rendering
  - OpenCV (cv2) - For image processing and highlight detection
  - NumPy - For array operations
  - Pillow (PIL) - For image handling
## Installation
```bash
# Clone the repository
git clone https://github.com/No0Bitah/PDF-Highlight-Extractor.git
cd PDF-Highlight-Extractor
# Install dependencies
python main.py --install
```
Alternatively, you can install dependencies manually:
```bash
pip install PyMuPDF pypdfium2 numpy opencv-python pillow
```
## Usage
### Basic Usage
```bash
python main.py --input sample.pdf --format markdown
```
This will process `sample.pdf` and save the extracted highlighted text to `sample.txt` in Markdown format.
### Command Line Arguments
```bash
# Install dependencies
python main.py --install
# Process a PDF file with default settings (markdown output)
python main.py --input document.pdf
# Process a PDF file and specify output file
python main.py --input document.pdf --output extracted_highlights.md
# Generate HTML output
python main.py --input document.pdf --format html --output extracted_highlights.html
```
### Using as a Library
You can also use the `PDFHighlightExtractor` class directly in your Python code:
```python
from pdf_extractor import PDFHighlightExtractor
# Initialize the extractor
extractor = PDFHighlightExtractor("document.pdf")
# Run the full pipeline
formatted_text = extractor.extract_and_format(output_path="output.md", output_format="markdown")
# Or run individual steps
extractor.detect_highlights()
extractor.extract_text_from_highlights()
formatted_text = extractor.format_output(output_format="markdown")
```
## Limitations
- Currently, the tool only detects **yellow highlights** (RGB: 255, 255, 0). Other highlight colors are not supported yet.
- The highlight detection works best on clean, well-scanned PDFs. Poor quality scans may affect detection accuracy.
- Header detection is based on font size heuristics and may not be perfect for all PDF documents.
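To illustrate why a font-size heuristic can misfire, here is a minimal sketch of one such heuristic. It is not the tool's actual implementation: the function names, the `ratio` threshold, and the span dictionaries (with `text` and `size` keys, shaped like PyMuPDF text spans) are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List


def body_font_size(spans: List[Dict]) -> float:
    """Return the font size that covers the most characters (assumed body text)."""
    weights: Counter = Counter()
    for span in spans:
        weights[round(span["size"], 1)] += len(span["text"])
    return weights.most_common(1)[0][0]


def assign_header_levels(spans: List[Dict], ratio: float = 1.15, max_level: int = 6) -> List[int]:
    """Return one level per span: 0 for body text, 1 for the largest header size, and so on."""
    if not spans:
        return []
    threshold = body_font_size(spans) * ratio
    sizes = [round(span["size"], 1) for span in spans]
    # Distinct sizes above the threshold, largest first, become header levels.
    header_sizes = sorted({size for size in sizes if size >= threshold}, reverse=True)
    level_by_size = {size: min(i + 1, max_level) for i, size in enumerate(header_sizes)}
    return [level_by_size.get(size, 0) for size in sizes]


# Hand-written sample spans (hypothetical, not taken from a real PDF).
sample = [
    {"text": "Annual Report", "size": 20.0},
    {"text": "Summary", "size": 14.0},
    {"text": "Revenue grew steadily across all regions.", "size": 11.0},
    {"text": "Costs stayed flat.", "size": 11.0},
]
print(assign_header_levels(sample))  # [1, 2, 0, 0]
```

A document that uses large type for pull quotes, or headers set in bold at the body size, would be misclassified by this kind of rule, which is the limitation described above.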
## Future Improvements
1. Support for multiple highlight colors with color-coding in output
2. Improved header and structure detection
3. Option to extract annotations and comments
4. Support for PDF forms and fillable fields
5. Better handling of complex layouts (multi-column, mixed orientations)
6. CMS integration capabilities for direct publishing to content management systems
7. Web interface/API for remote processing
8. OCR integration for scanned documents
9. Batch processing for multiple PDFs
## How It Works
The tool uses a combination of image processing techniques (with OpenCV) and PDF parsing (with PyMuPDF) to:
1. Detect highlighted areas by color analysis
2. Extract text from those areas using PDF parsing libraries
3. Preserve formatting information from the original document
4. Reconstruct the logical structure of the highlighted content
5. Output in the desired format (Markdown or HTML)
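Steps 2 to 5 can be sketched without any PDF library, assuming the highlight boxes have already been found on a page image rendered at a known DPI. This is an illustration under stated assumptions, not the project's code: the helper names are hypothetical, the spans are hand-written dictionaries shaped like the spans PyMuPDF returns from `page.get_text("dict")`, and the page is assumed to be unrotated with the origin at the top-left. The bold and italic bit values follow PyMuPDF's documented span `flags`.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left

# Font style bits that PyMuPDF reports in each span's "flags" field.
ITALIC = 2
BOLD = 16


def pixel_box_to_points(box: Box, dpi: float) -> Box:
    """Map a box found on the rendered image to PDF points (72 per inch)."""
    scale = 72.0 / dpi
    x0, y0, x1, y1 = box
    return (x0 * scale, y0 * scale, x1 * scale, y1 * scale)


def overlaps(a: Box, b: Box) -> bool:
    """True when two boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def span_to_markdown(span: Dict) -> str:
    """Wrap a span's text in Markdown emphasis according to its style bits."""
    text = span["text"].strip()
    if not text:
        return ""
    bold = bool(span["flags"] & BOLD)
    italic = bool(span["flags"] & ITALIC)
    if bold and italic:
        return f"***{text}***"
    if bold:
        return f"**{text}**"
    if italic:
        return f"*{text}*"
    return text


def highlighted_markdown(spans: List[Dict], pixel_boxes: List[Box], dpi: float) -> str:
    """Keep spans that overlap any highlight box and join them as Markdown."""
    rects = [pixel_box_to_points(box, dpi) for box in pixel_boxes]
    parts = [
        span_to_markdown(span)
        for span in spans
        if any(overlaps(span["bbox"], rect) for rect in rects)
    ]
    return " ".join(part for part in parts if part)


# Hand-written sample data (hypothetical, not taken from a real PDF).
spans = [
    {"text": "Key term", "bbox": (72, 100, 130, 112), "flags": BOLD},
    {"text": "is defined", "bbox": (132, 100, 190, 112), "flags": 0},
    {"text": "elsewhere", "bbox": (72, 300, 130, 312), "flags": ITALIC},
]
boxes = [(140, 195, 400, 240)]  # one highlight box, in pixels at 144 DPI
print(highlighted_markdown(spans, boxes, dpi=144))  # **Key term** is defined
```

The conversion factor `72 / dpi` is the part that is easiest to get wrong: image pixels and PDF points only coincide when the page is rendered at 72 DPI.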
## Troubleshooting
- **No highlights detected**: Try adjusting the `tolerance` parameter for color detection
- **Missing formatting**: Some PDFs don't store formatting as expected; manual adjustments may be needed
- **Performance issues with large PDFs**: Process page ranges instead of the entire document
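To make the effect of a color tolerance concrete, the sketch below counts how many sample pixels match pure yellow at two tolerance values. How the tool itself defines `tolerance` is not documented here; this example assumes it is the maximum allowed per-channel deviation from the target color, and the function names are hypothetical. It is written without NumPy or OpenCV so that it runs anywhere; the same comparison can be vectorized over a whole image.

```python
from typing import Iterable, Tuple

RGB = Tuple[int, int, int]
YELLOW: RGB = (255, 255, 0)


def is_highlight(pixel: RGB, target: RGB = YELLOW, tolerance: int = 40) -> bool:
    """True when every channel is within `tolerance` of the target color."""
    return all(abs(channel - wanted) <= tolerance for channel, wanted in zip(pixel, target))


def highlight_fraction(pixels: Iterable[RGB], tolerance: int) -> float:
    """Fraction of pixels accepted as highlight at the given tolerance."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    matches = sum(is_highlight(pixel, tolerance=tolerance) for pixel in pixels)
    return matches / len(pixels)


# Hypothetical sample: pure yellow, a faded scanned yellow, white paper, black text.
sample_pixels = [(255, 255, 0), (250, 240, 60), (255, 255, 255), (0, 0, 0)]
print(highlight_fraction(sample_pixels, tolerance=20))  # 0.25
print(highlight_fraction(sample_pixels, tolerance=60))  # 0.5
```

A faded or scanned highlight often has a noticeably higher blue component than pure yellow, so a larger tolerance picks it up, at the cost of a higher chance of matching cream-colored paper or other light backgrounds.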
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Contributors
- 🔗 [No0Bitah](https://github.com/No0Bitah)
- 📧 [Contact me](mailto:jomari.daison@gmail.com)