https://github.com/r-uben/socr
Multi-engine OCR with cascading fallback, quality audit, and figure extraction
deepseek document-processing gemini nougat ocr pdf
- Host: GitHub
- URL: https://github.com/r-uben/socr
- Owner: r-uben
- License: mit
- Created: 2025-12-23T00:23:42.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-04-11T18:04:08.000Z (about 2 months ago)
- Last Synced: 2026-04-11T19:27:30.179Z (about 2 months ago)
- Topics: deepseek, document-processing, gemini, nougat, ocr, pdf
- Language: Python
- Size: 919 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# socr
Multi-engine document OCR with cascading fallback and quality audit.
`socr` orchestrates multiple OCR engines — calling each as a CLI subprocess, auditing output quality, and falling back to a different engine when results are poor. Each engine is a standalone CLI tool (`gemini-ocr`, `deepseek-ocr`, `marker-ocr`, etc.) that can also be used independently.
## Install
```bash
pip install socr
# With specific engine backends
pip install "socr[gemini]"   # Google Gemini (cloud)
pip install "socr[local]"    # DeepSeek + Nougat (local/free)
pip install "socr[all]"      # All engines (quotes keep shells like zsh from globbing the brackets)
```
Engines are installed separately because they have different dependencies (torch, cloud SDKs, etc.). Install only what you need.
## Usage
```bash
# Process a PDF
socr paper.pdf
# Choose engine
socr paper.pdf --primary gemini
socr paper.pdf --primary marker
# Save extracted figures
socr paper.pdf --save-figures
# Batch process a directory
socr batch ~/Papers/ -o ./results/
socr batch ~/Papers/ --dry-run # preview what would be processed
socr batch ~/Papers/ --reprocess # force reprocess all
# Check which engines are available
socr engines
```
## How it works
```
PDF → Primary OCR → Quality Audit → (Fallback OCR if needed) → Markdown
```
1. **Primary OCR** — Calls the primary engine CLI on the whole PDF
2. **Quality audit** — Heuristic checks (word count, garbage ratio, repetition)
3. **Fallback** — If audit fails, tries a different engine
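The three heuristics named in step 2 could look roughly like the sketch below. This is an illustration, not `socr`'s actual audit code: the function name `audit_text` and the thresholds `max_garbage_ratio` and `max_repeat_ratio` are assumptions, and only `min_words=50` mirrors a value that appears in this README (`audit_min_words: 50`).

```python
import re
from collections import Counter


def audit_text(text, min_words=50, max_garbage_ratio=0.3, max_repeat_ratio=0.5):
    """Return (passed, reasons) for a piece of OCR output.

    Hypothetical heuristics; the ratio thresholds are assumptions.
    """
    reasons = []
    words = text.split()

    # 1. Word count: very little text suggests the engine produced nothing useful.
    if len(words) < min_words:
        reasons.append(f"too few words ({len(words)} < {min_words})")

    # 2. Garbage ratio: share of tokens that contain no letter or digit at all.
    if words:
        garbage = sum(1 for w in words if not re.search(r"[^\W_]", w))
        if garbage / len(words) > max_garbage_ratio:
            reasons.append("high garbage ratio")

    # 3. Repetition: a single line dominating the output suggests a decoding loop.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) >= 4:
        most_common_count = Counter(lines).most_common(1)[0][1]
        if most_common_count / len(lines) > max_repeat_ratio:
            reasons.append("repeated lines")

    return (not reasons, reasons)
```

A caller would treat `passed == False` as the signal to try the fallback engine, and could log `reasons` to explain why.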
Each engine is a separate CLI binary. `socr` calls it as a subprocess, reads the output markdown, and applies the quality pipeline.
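The subprocess-plus-fallback loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `run_engine` and `ocr_with_fallback` are hypothetical names, and reading the result from stdout is an assumption, since this README does not document how each engine CLI takes its input or reports its output. The demo uses the Python interpreter as a stand-in "engine" so the sketch runs without any OCR tool installed.

```python
import subprocess
import sys


def run_engine(command, timeout=300):
    """Run one engine CLI (full argv list) and return its stdout, or None on failure."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the binary is missing / not executable.
        return None
    if result.returncode != 0:
        return None
    return result.stdout


def ocr_with_fallback(commands, audit, timeout=300):
    """Try (engine_name, argv) pairs in order; return the first output passing `audit`."""
    for name, command in commands:
        text = run_engine(command, timeout=timeout)
        if text is not None and audit(text):
            return name, text
    return None, None


# Stand-in engines: the first yields empty output, the second yields usable text.
fake_engines = [
    ("empty", [sys.executable, "-c", "print('')"]),
    ("ok", [sys.executable, "-c", "print('some recognised text')"]),
]
print(ocr_with_fallback(fake_engines, audit=lambda t: len(t.split()) >= 3))
```

Passing the audit in as a callable keeps the cascade independent of which heuristics are used. The 300-second default matches the `--timeout` default in the CLI reference.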
## Engines
| Engine | Package | Type | Notes |
|--------|---------|------|-------|
| Gemini | `gemini-ocr-cli` | Cloud | Google Gemini, ~$0.0002/page |
| Mistral | `mistral-ocr-cli` | Cloud | Mistral AI, ~$0.001/page |
| Marker | `marker-ocr-cli` | Local | Layout-aware (Surya + Texify) |
| DeepSeek | `deepseek-ocr-cli` | Local | Via Ollama |
| Nougat | `nougat-ocr-cli` | Local | Academic papers, Python <3.13 |
Check availability:
```
$ socr engines
[+] gemini cloud, ~$0.0002/page
[+] marker local, layout-aware (Surya + Texify)
[+] mistral cloud, ~$0.001/page
[+] deepseek local via Ollama
[x] nougat local, academic papers
```
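Because each engine is a separate binary, an availability check can amount to looking for that binary on `PATH`. The sketch below shows one way to do this with `shutil.which`; it is not necessarily how `socr engines` works. The names `gemini-ocr`, `deepseek-ocr`, and `marker-ocr` come from the introduction, while `mistral-ocr` and `nougat-ocr` are assumed to follow the same pattern.

```python
import shutil

# gemini/deepseek/marker names are from the introduction; the mistral and
# nougat binary names are assumptions that follow the same pattern.
ENGINE_BINARIES = {
    "gemini": "gemini-ocr",
    "marker": "marker-ocr",
    "mistral": "mistral-ocr",
    "deepseek": "deepseek-ocr",
    "nougat": "nougat-ocr",
}


def available_engines(binaries=ENGINE_BINARIES):
    """Map each engine name to True if its binary is found on PATH."""
    return {
        name: shutil.which(binary) is not None
        for name, binary in binaries.items()
    }


def format_status(status):
    """Render availability in the same [+]/[x] style as `socr engines`."""
    return "\n".join(
        f"[{'+' if ok else 'x'}] {name}" for name, ok in status.items()
    )


print(format_status(available_engines()))
```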
## CLI reference
```
socr process [OPTIONS]
-o, --output-dir PATH Output directory
--primary ENGINE Primary OCR engine (gemini, marker, deepseek, etc.)
--fallback ENGINE Fallback engine
--no-audit Skip quality audit
--save-figures Save extracted figure images
--timeout SECONDS Subprocess timeout (default: 300)
--profile NAME Load ~/.config/socr/{name}.yaml
--config PATH Custom YAML config file
-q, --quiet Suppress non-error output
-v, --verbose Verbose output
--dry-run List files without processing
--reprocess Force reprocess already-done files
socr batch [OPTIONS]
Same options as process, plus:
--limit N Process first N files
socr engines Show available engines
```
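To drive `socr` from a Python script, the flags above can be assembled into an argument list and passed to `subprocess.run`. The helper below is a hypothetical convenience, not part of `socr`. It places the PDF path directly after `socr`, as in the Usage examples, and uses only flags listed in the reference.

```python
import shlex


def build_socr_command(pdf, primary=None, fallback=None, output_dir=None,
                       save_figures=False, timeout=None):
    """Build an argv list for a single-file socr run (hypothetical helper)."""
    cmd = ["socr", str(pdf)]
    if primary:
        cmd += ["--primary", primary]
    if fallback:
        cmd += ["--fallback", fallback]
    if output_dir:
        cmd += ["--output-dir", str(output_dir)]
    if save_figures:
        cmd.append("--save-figures")
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd


cmd = build_socr_command("paper.pdf", primary="gemini", fallback="marker")
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```

Building a list rather than a shell string avoids quoting problems with paths that contain spaces.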
## Output
```
output/<document>/
├── <document>.md # OCR text
├── metadata.json # Processing stats
└── figures/ # With --save-figures
└── figure_1_page3.png
```
## Configuration
Create `~/.config/socr/config.yaml`:
```yaml
primary_engine: gemini
fallback_engine: marker
timeout: 300
save_figures: false
audit_enabled: true
audit_min_words: 50
```
Or use profiles: save settings to `~/.config/socr/fast.yaml` and load them with `socr paper.pdf --profile fast`.
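The relationship between profiles, the config file, and command-line flags can be sketched as a lookup plus a layered merge. The precedence shown here (example values, then file settings, then CLI flags) is an assumption and not documented `socr` behaviour; `config_path` and `merge_settings` are hypothetical names. YAML parsing is left out to keep the sketch dependency-free.

```python
from pathlib import Path

CONFIG_DIR = Path.home() / ".config" / "socr"

# Keys and values copied from the example config.yaml in this section.
EXAMPLE_CONFIG = {
    "primary_engine": "gemini",
    "fallback_engine": "marker",
    "timeout": 300,
    "save_figures": False,
    "audit_enabled": True,
    "audit_min_words": 50,
}


def config_path(profile=None, config_dir=CONFIG_DIR):
    """Return <config_dir>/<profile>.yaml, or config.yaml when no profile is given."""
    name = profile if profile else "config"
    return Path(config_dir) / f"{name}.yaml"


def merge_settings(base, file_settings, cli_overrides):
    """Layer settings as base < file < CLI flags (assumed precedence).

    CLI values of None mean "flag not given" and are ignored.
    """
    merged = dict(base)
    merged.update(file_settings)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


print(config_path("fast"))
print(merge_settings(EXAMPLE_CONFIG, {"primary_engine": "marker"}, {"timeout": 60}))
```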
## Engine CLIs
Each backend is an independent CLI tool:
- [gemini-ocr-cli](https://github.com/r-uben/gemini-ocr-cli) — Google Gemini
- [deepseek-ocr-cli](https://github.com/r-uben/deepseek-ocr-cli) — DeepSeek via Ollama
- [mistral-ocr-cli](https://github.com/r-uben/mistral-ocr-cli) — Mistral AI
- [marker-ocr-cli](https://github.com/r-uben/marker-ocr-cli) — Marker (Surya + Texify)
- [nougat-ocr-cli](https://github.com/r-uben/nougat-ocr-cli) — Meta Nougat
## License
MIT