{"id":28556922,"url":"https://github.com/no0bitah/pdf-highlight-extractor","last_synced_at":"2026-06-23T18:32:30.106Z","repository":{"id":293400546,"uuid":"983909182","full_name":"No0Bitah/PDF-Highlight-Extractor","owner":"No0Bitah","description":"A Python tool for extracting highlighted text from PDF files while preserving formatting attributes (headers, bold, italic) and removing unwanted line breaks and page breaks. Perfect for integrating with content management systems.","archived":false,"fork":false,"pushed_at":"2025-05-15T05:49:06.000Z","size":83,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-10T07:08:27.667Z","etag":null,"topics":["automation","crm","documentation-tool","numpy","opencv","pdf","pdf-document-processor","pillow","pymupdf","pypdfium2","python3","scrapping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/No0Bitah.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-15T05:26:39.000Z","updated_at":"2025-05-15T06:36:45.000Z","dependencies_parsed_at":"2025-05-15T06:46:55.402Z","dependency_job_id":null,"html_url":"https://github.com/No0Bitah/PDF-Highlight-Extractor","commit_stats":null,"previous_names":["no0bitah/pdf-highlight-extractor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/No0Bitah/PDF-Highlight-Extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/No0Bitah%2FPDF-Highlight-Extractor",
"tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/No0Bitah%2FPDF-Highlight-Extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/No0Bitah%2FPDF-Highlight-Extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/No0Bitah%2FPDF-Highlight-Extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/No0Bitah","download_url":"https://codeload.github.com/No0Bitah/PDF-Highlight-Extractor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/No0Bitah%2FPDF-Highlight-Extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34702913,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-23T02:00:07.161Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","crm","documentation-tool","numpy","opencv","pdf","pdf-document-processor","pillow","pymupdf","pypdfium2","python3","scrapping"],"created_at":"2025-06-10T07:08:25.704Z","updated_at":"2026-06-23T18:32:30.101Z","avatar_url":"https://github.com/No0Bitah.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF Highlight Extractor\n\nA powerful Python tool for extracting highlighted text from PDF documents while preserving formatting information such 
as headers, bold text, and italics.\n\n## Features\n\n- Extracts text from highlighted areas in PDF documents\n- Preserves text formatting (headers, bold, italic)\n- Outputs formatted text in Markdown or HTML\n- Detects and preserves hierarchical structure of documents\n- Command-line interface for easy integration into workflows\n- Intelligent paragraph and formatting detection\n\n## Use Cases\n\n- Content research and collection\n- Academic paper review and note-taking\n- Legal document analysis and extraction\n- Knowledge management systems\n- Content migration to CMSs\n\n\n## Requirements\n\n- Python 3.7+\n- Dependencies:\n    - PyMuPDF (fitz) - For PDF text extraction and annotation handling\n    - pypdfium2 - For PDF rendering\n    - OpenCV (cv2) - For image processing and highlight detection\n    - NumPy - For array operations\n    - Pillow (PIL) - For image handling\n\n## Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/No0Bitah/PDF-Highlight-Extractor.git\ncd PDF-Highlight-Extractor\n\n# Install dependencies\npython main.py --install\n```\n\nAlternatively, you can install dependencies manually:\n\n```bash\npip install PyMuPDF pypdfium2 numpy opencv-python pillow\n```\n\n## Usage\n\n### Basic Usage\n\n```bash\npython main.py --input sample.pdf --format markdown\n```\n\nThis will process `sample.pdf` and save the extracted highlighted text to `sample.txt` in Markdown format.\n\n### Command Line Arguments\n\n```bash\n# Install dependencies\npython main.py --install\n\n# Process a PDF file with default settings (markdown output)\npython main.py --input document.pdf\n\n# Process a PDF file and specify output file\npython main.py --input document.pdf --output extracted_highlights.md\n\n# Generate HTML output\npython main.py --input document.pdf --format html --output extracted_highlights.html\n```\n\n### Using as a Library\n\nYou can also use the `PDFHighlightExtractor` class directly in your Python code:\n\n```python\nfrom 
pdf_extractor import PDFHighlightExtractor\n\n# Initialize the extractor\nextractor = PDFHighlightExtractor(\"document.pdf\")\n\n# Run the full pipeline\nformatted_text = extractor.extract_and_format(output_path=\"output.md\", output_format=\"markdown\")\n\n# Or run individual steps\nextractor.detect_highlights()\nextractor.extract_text_from_highlights()\nformatted_text = extractor.format_output(output_format=\"markdown\")\n```\n\n## Limitations\n\n- Currently, the tool only detects **yellow highlights** (RGB: 255, 255, 0). Other highlight colors are not supported yet.\n- The highlight detection works best on clean, well-scanned PDFs. Poor quality scans may affect detection accuracy.\n- Header detection is based on font size heuristics and may not be perfect for all PDF documents.\n\n## Future Improvements\n\n1. Support for multiple highlight colors with color-coding in output\n2. Improved header and structure detection\n3. Option to extract annotations and comments\n4. Support for PDF forms and fillable fields\n5. Better handling of complex layouts (multi-column, mixed orientations)\n6. CMS integration capabilities for direct publishing to content management systems\n7. Web interface/API for remote processing\n8. OCR integration for scanned documents\n9. Batch processing for multiple PDFs\n\n## How It Works\n\nThe tool uses a combination of image processing techniques (with OpenCV) and PDF parsing (with PyMuPDF) to:\n\n1. Detect highlighted areas by color analysis\n2. Extract text from those areas using PDF parsing libraries\n3. Preserve formatting information from the original document\n4. Reconstruct the logical structure of the highlighted content\n5. 
Output in the desired format (Markdown or HTML)\n\n## Troubleshooting\n\n- **No highlights detected**: Try adjusting the `tolerance` parameter for color detection\n- **Missing formatting**: Some PDFs don't store formatting as expected; manual adjustments may be needed\n- **Performance issues with large PDFs**: Process page ranges instead of the entire document\n\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## Contributors\n\n- 🔗 [No0Bitah](https://github.com/No0Bitah)\n- 📧 [Contact me](mailto:jomari.daison@gmail.com)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fno0bitah%2Fpdf-highlight-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fno0bitah%2Fpdf-highlight-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fno0bitah%2Fpdf-highlight-extractor/lists"}