{"id":25936921,"url":"https://github.com/soloeinsteinmit/pdfmindforge","last_synced_at":"2025-03-04T02:55:06.348Z","repository":{"id":264483228,"uuid":"893489331","full_name":"soloeinsteinmit/PDFMindforge","owner":"soloeinsteinmit","description":"PDFMindforge: A powerful Python toolkit for intelligent PDF processing and transformation. It handles large-scale operations like smart splitting, markdown conversion, and GPU-accelerated batch processing. Ideal for ML training data prep, document analysis, and content transformation pipelines.","archived":false,"fork":false,"pushed_at":"2024-11-24T16:13:25.000Z","size":9,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-11-24T17:18:37.413Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/soloeinsteinmit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-24T15:25:24.000Z","updated_at":"2024-11-24T16:13:28.000Z","dependencies_parsed_at":"2024-11-24T17:18:41.918Z","dependency_job_id":"24822ce7-79d4-4557-bafa-8ddb6d15c6c2","html_url":"https://github.com/soloeinsteinmit/PDFMindforge","commit_stats":null,"previous_names":["soloeinsteinmit/pdfmindforge"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soloeinsteinmit%2FPDFMindforge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soloeinsteinmit%2FPDFMindforge/t
ags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soloeinsteinmit%2FPDFMindforge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soloeinsteinmit%2FPDFMindforge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/soloeinsteinmit","download_url":"https://codeload.github.com/soloeinsteinmit/PDFMindforge/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241773254,"owners_count":20018065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-04T02:55:06.246Z","updated_at":"2025-03-04T02:55:06.331Z","avatar_url":"https://github.com/soloeinsteinmit.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDFMindforge 🔄📚\n\n[![PyPI version](https://badge.fury.io/py/pdfmindforge.svg)](https://badge.fury.io/py/pdfmindforge)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)\n\nTransform your PDF documents into machine-learning ready formats with intelligence and ease! 
🚀\n\n## ✨ Features\n\n- 📊 Smart PDF splitting based on content size and complexity\n- 🔄 Batch processing with parallel workers (GPU-optimized)\n- 🎯 Markdown conversion with formatting preservation\n- 💨 GPU acceleration support for faster processing\n- 📦 Automatic ZIP archiving of processed documents\n- 🖼️ Automatic image extraction and linking\n- 🎛️ Configurable processing parameters\n- 📈 Progress tracking and logging\n- 🛡️ Error handling and recovery\n- 🔧 Easy-to-use API\n\n## 🎯 Purpose\n\nPDFMindforge is designed to streamline the process of converting PDF documents into machine-learning friendly formats. Whether you're building a training dataset, analyzing document structures, or need to process large volumes of PDFs, PDFMindforge provides the tools you need.\n\n## 🚀 Quick Start\n\n### Installation\n\n```bash\n# Basic installation\npip install pdfmindforge\n\n# Install with CUDA support (recommended for GPU acceleration)\npip install pdfmindforge[cuda]\n```\n\n### Basic Usage\n\n```python\nfrom pdfmindforge import PDFProcessor\n\n# Initialize processor with optimal settings\nprocessor = PDFProcessor()\n\n# Process a single PDF\nprocessor.process_pdf_to_md(\n    input_path=\"input.pdf\",\n    output_path=\"output_folder\"\n)\n\n# Process multiple PDFs in parallel\nprocessor.batch_process_directory(\n    input_dir=\"input_folder\",\n    output_dir=\"output_folder\",\n    workers=4  # Number of parallel workers\n)\n```\n\n## 📋 Requirements\n\n- Python 3.9+\n- PyTorch (for GPU acceleration)\n- PyPDF2\n- marker_single\n\n## 📚 API Reference\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| chunk_size | int | 100 | Pages per chunk when splitting PDFs |\n| batch_multiplier | int | 2 | VRAM multiplier (2 = ~6GB VRAM) |\n| langs | str | \"English\" | OCR language(s) |\n| clear_cuda_cache | bool | True | Clear GPU memory on init |\n| min_pages_for_split | int | 200 | When to split large PDFs |\n| max_pages | int | None | Max 
pages to process |\n| start_page | int | None | Starting page number |\n| workers | int | None | Parallel processing workers |\n| min_length | int | None | Min text length to process |\n\n## 🛠️ Advanced Configuration\n\n### GPU Acceleration\n\nPDFMindforge automatically detects and utilizes available GPU resources. The number of workers is optimized based on:\n- Available VRAM (each worker uses ~5GB peak)\n- CPU cores\n- Document complexity\n\n### Processing Options\n\n```python\nprocessor = PDFProcessor(\n    # Basic settings\n    chunk_size=100,              # Pages per chunk when splitting\n    batch_multiplier=2,          # VRAM multiplier (2 = ~6GB VRAM)\n    min_pages_for_split=200,     # When to split large PDFs\n    \n    # Processing control\n    max_pages=None,              # Max pages to process per PDF\n    start_page=None,             # Starting page number\n    workers=4,                   # Parallel processing workers\n    \n    # OCR settings\n    langs=\"English\",             # OCR language(s)\n    min_length=None,            # Min text length to process\n    \n    # Resource management\n    clear_cuda_cache=True       # Clear GPU memory on init\n)\n\n# Batch processing with options\nprocessor.batch_process_directory(\n    input_dir=\"pdfs\",\n    output_dir=\"output\",\n    recursive=True,         # Search subdirectories\n    create_zip=True,        # Create ZIP archive\n    max_files=None,         # Max files to process\n    use_marker_batch=True   # Use fast batch processing\n)\n```\n\n## 📊 Performance Tips\n\n1. 🚀 Adjust `batch_multiplier` based on available VRAM\n2. 💻 Use GPU acceleration for large documents\n3. 📈 Enable `use_marker_batch` for processing many files\n4. 🔧 Set `min_length` to skip image-heavy PDFs\n5. 📈 Optimize `chunk_size` for your specific use case\n6. 🔧 Configure `min_pages_for_split` based on document complexity\n\n## Troubleshooting\n\n### Common Issues\n\n1. 
**Out of Memory Errors**\n   - Reduce `batch_multiplier`\n   - Lower number of `workers`\n   - Enable `clear_cuda_cache`\n\n2. **Slow Processing**\n   - Enable GPU acceleration\n   - Use `use_marker_batch=True` for multiple files\n   - Set appropriate `min_length` to skip image-heavy PDFs\n\n## Documentation\n\nFor detailed documentation, visit our [Documentation Site](https://docs.pdfmindforge.com). Coming soon!\n\n## Contributing\n\nWe welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.\n\n### Development Setup\n\n```bash\ngit clone https://github.com/soloeinsteinmit/pdfmindforge.git\ncd pdfmindforge\npip install -e \".[dev]\"\n```\n\n## 📝 License\n\nMIT License - see the [LICENSE](LICENSE) file for details.\n\n## 🙏 Acknowledgments\n\n- PyPDF2 team for the excellent PDF processing library\n- Marker Single project for markdown conversion capabilities\n- PyTorch team for GPU acceleration support\n\n## 📮 Contact \u0026 Support\n\n- 📧 Create an issue for bug reports\n- 🌟 Star the repo if you find it useful\n- 🔄 Fork for your own modifications\n- 📞 Contact: [Solomon Eshun](mailto:solomoneshun373@gmail.com)\n\n## 🔜 Roadmap\n\n- [ ] Multi-language support enhancement\n- [ ] Advanced OCR integration\n- [ ] Progress tracking UI\n- [ ] Parallel processing optimization\n- [ ] Cloud storage integration\n- [ ] Read PDFs from a URL (from the web)\n- [ ] Split by file size option\n- And much more!\n\n---\n\nMade with ❤️ by Solomon Eshun\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoloeinsteinmit%2Fpdfmindforge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoloeinsteinmit%2Fpdfmindforge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoloeinsteinmit%2Fpdfmindforge/lists"}