{"id":50362490,"url":"https://github.com/nathannncurtis/ocr","last_synced_at":"2026-05-30T02:30:38.214Z","repository":{"id":349471584,"uuid":"879064019","full_name":"nathannncurtis/ocr","owner":"nathannncurtis","description":"Batch DICOM/TIFF/JPEG to searchable PDF with OCR. Docker, Tesseract, JBIG2, parallel processing.","archived":false,"fork":false,"pushed_at":"2026-04-06T04:25:51.000Z","size":110,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-06T06:12:49.530Z","etag":null,"topics":["batch-processing","cpp","docker","ocr","pdf","tesseract"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nathannncurtis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-10-26T21:41:43.000Z","updated_at":"2026-04-06T06:04:57.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/nathannncurtis/ocr","commit_stats":null,"previous_names":["nathannncurtis/ocr"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/nathannncurtis/ocr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathannncurtis%2Focr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathannncurtis%2Focr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathannncurtis%2Focr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathannncurtis%2Focr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nathannncurtis","download_url":"https://codeload.github.com/nathannncurtis/ocr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathannncurtis%2Focr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33678270,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["batch-processing","cpp","docker","ocr","pdf","tesseract"],"created_at":"2026-05-30T02:30:37.700Z","updated_at":"2026-05-30T02:30:38.205Z","avatar_url":"https://github.com/nathannncurtis.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OCRmyPDF Batch Processor\n\nBatch process multiple folders of scanned documents (TIFF, JPEG, PDF) into searchable, compressed PDFs using OCR.\n\n## Features\n\n- **Batch processing**: Process entire directories of document folders in parallel\n- **OCR with Tesseract**: Adds searchable text layer with automatic deskewing and rotation\n- **JBIG2 compression**: Optimizes file size while preserving OCR quality\n- **Parallel processing**: Utilize all CPU cores efficiently\n- **Docker-based**: Easy deployment, consistent environment\n\n## Quick Start\n\n```bash\n# Build the Docker image\ndocker build -t ocr-batch-processor .\n\n# Process all folders in /input, output to /output\n# 4 folders in parallel, 12 CPU cores per folder (48 total)\ndocker run -v /path/to/input:/input -v /path/to/output:/output \\\n  ocr-batch-processor python batch_process.py /input /output -p 4 -j 12\n```\n\n## What It Does\n\n```\nInput folder structure:       Output:\n/input/                       /output/\n  ├── 630666-01/             ├── 630666-01.pdf (searchable, compressed)\n  │   ├── page_001.tif       ├── 629157-02.pdf (searchable, compressed)\n  │   ├── page_002.tif       └── 633421-03.pdf (searchable, compressed)\n  │   └── page_003.tif\n  ├── 629157-02/\n  │   └── scan.pdf\n  └── 633421-03/\n      ├── img001.jpg\n      └── img002.jpg\n```\n\nEach subfolder becomes one output PDF with:\n- Full text OCR layer (searchable/copyable)\n- Automatic page rotation and deskewing\n- JBIG2 compression (smaller file size)\n- Preserved image quality\n\n## Usage\n\n### For 48-Core Machine (Recommended)\n\n**Balanced approach** (4 folders × 12 cores = 48 cores):\n```bash\ndocker run -v /input:/input -v /output:/output \\\n  ocr-batch-processor python batch_process.py /input /output -p 4 -j 12\n```\n\n**More parallelism** (6 folders × 8 cores = 48 cores):\n```bash\ndocker run -v /input:/input -v /output:/output \\\n  ocr-batch-processor python batch_process.py /input /output -p 6 -j 8\n```\n\n### Advanced Options\n\n```bash\n# High-quality OCR (slower, larger files)\ndocker run -v /input:/input -v /output:/output \\\n  ocr-batch-processor python batch_process.py /input /output -p 4 -j 12 --accurate-ocr\n\n# Preview what will be processed (no actual processing)\ndocker run -v /input:/input -v /output:/output \\\n  ocr-batch-processor python batch_process.py /input /output --dry-run\n\n# Longer timeout for very large documents\ndocker run -v /input:/input -v /output:/output \\\n  ocr-batch-processor python batch_process.py /input /output -p 4 -j 12 --timeout 7200\n```\n\n### Single Folder Processing\n\nIf you just need to process one folder:\n```bash\ndocker run -v /path/to/folder:/input -v /output:/output \\\n  ocr-batch-processor python FINAL.py /input/subfolder /output/result.pdf -j 48\n```\n\n## Documentation\n\nSee [DEPLOYMENT.md](DEPLOYMENT.md) for detailed deployment instructions, troubleshooting, and performance expectations.\n\n## Components\n\n- **batch_process.py**: Parallel batch processing wrapper\n- **FINAL.py**: Single folder processor (combine → OCR → compress)\n- **opencv_optimizer.py**: JBIG2 compression engine\n- **jbig2enc/**: JBIG2 encoder binaries and utilities\n\n## Requirements\n\n- Docker\n- Input folders containing TIFF, JPEG, or PDF files\n- Sufficient disk space for output\n\n## Performance\n\nOn a 48-core machine:\n- Small docs (10-20 pages): ~30-60 seconds each\n- Large docs (100+ pages): ~5-15 minutes each\n- 100 folders (~50 pages each): ~2 hours total\n\n## License\n\nUses OCRmyPDF, Tesseract OCR, and jbig2enc. See respective licenses.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnathannncurtis%2Focr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnathannncurtis%2Focr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnathannncurtis%2Focr/lists"}