{"id":47872250,"url":"https://github.com/davanstrien/ocr-bench","last_synced_at":"2026-04-04T00:56:15.491Z","repository":{"id":342088578,"uuid":"1167785520","full_name":"davanstrien/ocr-bench","owner":"davanstrien","description":"Per-collection OCR leaderboards using VLM-as-judge","archived":false,"fork":false,"pushed_at":"2026-03-23T14:48:34.000Z","size":1267,"stargazers_count":57,"open_issues_count":8,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-04T00:56:09.621Z","etag":null,"topics":["evaluation","huggingface","ocr"],"latest_commit_sha":null,"homepage":"https://huggingface.co/spaces/davanstrien/ocr-bench-britannica-results-qwen35-viewer","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davanstrien.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-26T17:21:31.000Z","updated_at":"2026-04-02T06:25:26.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/davanstrien/ocr-bench","commit_stats":null,"previous_names":["davanstrien/ocr-bench"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/davanstrien/ocr-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davanstrien%2Focr-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davanstrien%2Focr-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davanstrien%2Focr-bench/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davanstrien%2Focr-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davanstrien","download_url":"https://codeload.github.com/davanstrien/ocr-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davanstrien%2Focr-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31383635,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-03T23:20:52.058Z","status":"ssl_error","status_checked_at":"2026-04-03T23:20:51.675Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation","huggingface","ocr"],"created_at":"2026-04-04T00:56:13.528Z","updated_at":"2026-04-04T00:56:15.483Z","avatar_url":"https://github.com/davanstrien.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ocr-bench\n\n**There is currently no single best OCR model.** Rankings change depending on your documents. Manuscript cards, printed books, historical texts all produce different winners.\n\n`ocr-bench` allows you to create **per-collection leaderboards** using a VLM-as-judge approach, so you can find what works best for _your_ documents rather than relying on generic benchmarks. 
You can validate the VLM's judgement with human votes, and share results via the Hugging Face Hub.\n\nThe underlying OCR model inference uv scripts are available at [uv-scripts/ocr](https://huggingface.co/datasets/uv-scripts/ocr). The majority of these use vLLM for efficient GPU inference, and are designed to run on a single consumer GPU (e.g. 24GB 3090/4090). The `ocr-bench` package orchestrates running these models at scale on the Hub, and judging outputs with a VLM. If you just want to run some OCR models on your data without the judging/leaderboard aspect, you can run the scripts directly.\n\n## Why?\n\nGeneric OCR benchmarks tell you which model wins _on average_. But if you're digitising 18th-century encyclopaedias, that average doesn't help — the best model for your documents might be the worst on someone else's. Inspired by [Datalab's Benchmarks + Evals](https://www.datalab.to/blog/datalab-benchmarks-evals) approach — pairwise VLM-as-judge with Bradley-Terry scoring on your own documents — ocr-bench brings this idea to the Hugging Face Hub as an open-source, self-serve tool.\n\nocr-bench lets you run the same set of OCR models on a sample of _your_ collection, then uses a vision-language model to judge which produces the best transcription for each document. 
The result is a leaderboard specific to your data.\n\n| Model              | BPL card catalog | Britannica 1771 |\n| ------------------ | :--------------: | :-------------: |\n| GLM-OCR (0.9B)     |    #2 (1535)     |  **#1** (1787)  |\n| LightOnOCR-2 (1B)  |  **#1** (1559)   |    #2 (1780)    |\n| FireRed-OCR (2.1B) |        —         |    #3 (1551)    |\n| DeepSeek-OCR (4B)  |    #4 (1452)     |    #4 (1437)    |\n| dots.ocr (1.7B)    |    #3 (1453)     |    #5 (945)     |\n\nRankings can flip completely between collections.\n\n![ELO vs Parameter Count — smaller models can win on the right documents](assets/elo-scatter.png)\n\n**[Try the live viewer](https://huggingface.co/spaces/davanstrien/ocr-bench-britannica-results-qwen35-viewer)** — browse the Britannica 1771 leaderboard, compare OCR outputs side-by-side, and vote on quality yourself.\n\n## Hub-native by design\n\nThe entire evaluation loop lives on the Hugging Face Hub:\n\n1. **Your dataset** on the Hub (images + optional ground truth)\n2. **OCR models** run via [HF Jobs](https://huggingface.co/docs/hub/jobs-overview) → outputs written as PRs on a Hub dataset\n3. **VLM judge** via [HF Inference Providers](https://huggingface.co/docs/inference-providers/index) — only needs an HF token\n4. **Results** published to a Hub dataset (leaderboard + pairwise comparisons)\n5. **Viewer** as a [HF Space](https://huggingface.co/spaces) for browsing and human validation\n\nNo local GPU required. Everything is shareable via Hub URLs.\n\n## Quickstart\n\n```bash\nuv pip install ocr-bench[viewer]\n\n# 1. Run OCR models on your dataset\nocr-bench run \u003cinput-dataset\u003e \u003coutput-repo\u003e --max-samples 50\n\n# 2. Judge outputs pairwise with a VLM\nocr-bench judge \u003coutput-repo\u003e\n\n# 3. Browse results + validate\nocr-bench view \u003coutput-repo\u003e-results\n```\n\n## How it works\n\n**`ocr-bench run`** launches OCR models on your dataset via [HF Jobs](https://huggingface.co/docs/hub/jobs-overview). 
Each model writes its output as a PR on the same Hub dataset, keeping everything together without merge conflicts.\n\n**`ocr-bench judge`** runs pairwise comparisons using a VLM judge (default: [Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) via HF Inference Providers). For each document, the judge sees the original image and two OCR outputs (anonymised as A/B) and picks the better transcription. Results are fit to a [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) to produce ELO ratings with bootstrap 95% confidence intervals. Adaptive stopping halts early when rankings are statistically resolved.\n\n**`ocr-bench view`** serves a local web viewer with a leaderboard, comparison browser, and human validation. Vote on comparisons to cross-check the automated judge with human judgement.\n\n## Available models\n\nocr-bench ships with 5 OCR models ready to run:\n\n| Model           | Size | Best for                   | Notes                        |\n| --------------- | ---- | -------------------------- | ---------------------------- |\n| `glm-ocr`       | 0.9B | Historical printed text    | Top performer on Britannica  |\n| `lighton-ocr-2` | 1B   | Card catalogs, manuscripts | Top performer on BPL         |\n| `firered-ocr`   | 2.1B | Clean printed text         | Mid-pack on degraded docs    |\n| `deepseek-ocr`  | 4B   | Diverse documents          | Most consistent across types |\n| `dots-ocr`      | 1.7B | General                    | Struggles on historical text |\n\nAll model scripts are available at [uv-scripts/ocr](https://huggingface.co/datasets/uv-scripts/ocr) on the Hub.\n\nBy default all 5 run. 
To pick specific models:\n\n```bash\nocr-bench run \u003cdataset\u003e \u003coutput\u003e --models glm-ocr lighton-ocr-2\n```\n\n## Example results\n\n![Leaderboard viewer with ELO ratings, confidence intervals, and human validation](assets/leaderboard.png)\n\nBrowse these on the Hub:\n\n- [davanstrien/ocr-bench-britannica-results-qwen35](https://huggingface.co/datasets/davanstrien/ocr-bench-britannica-results-qwen35) — Encyclopaedia Britannica 1771, 5 models, 50 samples\n- [davanstrien/bpl-ocr-bench-results](https://huggingface.co/datasets/davanstrien/bpl-ocr-bench-results) — Boston Public Library card catalog, 4 models, 150 samples\n- [Live viewer](https://huggingface.co/spaces/davanstrien/ocr-bench-britannica-results-qwen35-viewer) — Britannica leaderboard with ELO chart and comparison browser\n\n## Install\n\n```bash\npip install ocr-bench            # Core (run + judge)\npip install ocr-bench[viewer]    # With web UI\n```\n\nOr with [uv](https://docs.astral.sh/uv/):\n\n```bash\nuv pip install ocr-bench[viewer]\n```\n\nRequires Python \u003e= 3.11 and an [HF token](https://huggingface.co/settings/tokens).\n\n## Status\n\nWorking proof of concept. The core pipeline (run → judge → view) is functional. Not polished production software — expect rough edges. This is an early-stage project to explore the idea of VLM-judged OCR leaderboards, and gather feedback on the concept and implementation!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavanstrien%2Focr-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavanstrien%2Focr-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavanstrien%2Focr-bench/lists"}