https://github.com/metaory/tess-pdfseq
Configurable OCR for documents in any language. Built with Tesseract and optimized for sequential PDF processing.
https://github.com/metaory/tess-pdfseq
Last synced: 10 months ago
JSON representation
Configurable OCR for documents in any language. Built with Tesseract and optimized for sequential PDF processing.
- Host: GitHub
- URL: https://github.com/metaory/tess-pdfseq
- Owner: metaory
- License: mit
- Created: 2025-01-30T12:31:37.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-01-30T14:22:13.000Z (over 1 year ago)
- Last Synced: 2025-01-30T14:32:36.360Z (over 1 year ago)
- Language: Python
- Size: 17.6 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
tess-pdfseq
Configurable OCR for documents in any language. Built with Tesseract and optimized for PDF processing.
## Features
- Process PDFs with high accuracy
- Support for any language Tesseract supports
- Memory efficient processing
- Optional GPU acceleration via OpenCV
- Graceful fallback to CPU processing
- Docker-based deployment
- 7z archive support for batch processing
## Quick Start
```bash
# Build with default language (English)
docker build -t tess-pdfseq .
# Process a 7z archive (CPU mode)
./bin/pdfseq input.7z
# Process with GPU acceleration
GPU=1 ./bin/pdfseq input.7z
# Process with custom settings
LANGS="eng+ara" DPI=300 WIDTH=2400 GPU=1 ./bin/pdfseq docs.7z
```
## Configuration
### Shell Script Options
Environment variables for `bin/pdfseq`:
- `LANGS`: OCR languages (default: eng)
- `DPI`: PDF scan resolution (default: 500)
- `WIDTH`: Target width in pixels (default: 1800)
- `MAX_PAGES`: Max pages per PDF (default: 3)
- `WORKDIR`: Working directory (default: current directory)
- `GPU`: Enable GPU acceleration (default: 0)
- `TAG`: Docker image tag (default: latest)
### Languages
First, install languages during Docker build (plus-separated):
```bash
# Build with multiple languages available
docker build -t tess-pdfseq --build-arg TESS_LANGS="eng+fas+ara" .
```
Then use any of the installed languages at runtime:
```bash
# Use English (default)
./bin/pdfseq docs.7z
# Use Persian/Farsi
LANGS=fas ./bin/pdfseq docs.7z
# Use multiple languages together
LANGS="eng+fas" ./bin/pdfseq docs.7z
```
Note: You can only use languages at runtime that were installed during build.
List available languages:
```bash
# On Arch Linux
pacman -Ss tesseract-data | grep -Po 'tesseract-data-\K[^/\s]*(?=\s)'
# Using Docker
docker run --rm python:3.12-alpine sh -c \
"apk add --no-cache tesseract-ocr && tesseract --list-langs"
```
Common languages:
- `eng` - English
- `heb` - Hebrew
- `jpn` - Japanese
- `kor` - Korean
- `rus` - Russian
- `fas` - Persian/Farsi
- `chi_sim` - Chinese Simplified
### Processing
- Optimizes images for OCR (1800px width)
- GPU acceleration with OpenCV UMat when enabled
- Automatic fallback to CPU if GPU fails
- Outputs plain text files
## Requirements
- Docker (GPU support optional)
- NVIDIA Container Toolkit for GPU support
- ~1GB disk space for Docker image
- p7zip for archive handling
## Development
```bash
git clone https://github.com/metaory/tess-pdfseq
cd tess-pdfseq
# Build with custom languages
docker build -t tess-pdfseq --build-arg TESS_LANGS="eng+ara" .
```
## License
[MIT](LICENSE)