{"id":50294388,"url":"https://github.com/tdiprima/phantom-glyphs","last_synced_at":"2026-05-28T08:01:18.973Z","repository":{"id":353765454,"uuid":"1219312999","full_name":"tdiprima/phantom-glyphs","owner":"tdiprima","description":"OCR stress-test toolkit that generates DICOM images with confusable glyphs (S/$, 0/O, 1/l/I) and benchmarks Chandra OCR 2 accuracy","archived":false,"fork":false,"pushed_at":"2026-04-25T12:14:07.000Z","size":47,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-25T13:28:32.609Z","etag":null,"topics":["computer-vision","dicom","medical-imaging","ocr","testing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tdiprima.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-23T18:45:49.000Z","updated_at":"2026-04-25T12:14:10.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tdiprima/phantom-glyphs","commit_stats":null,"previous_names":["tdiprima/phantom-glyphs"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/tdiprima/phantom-glyphs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdiprima%2Fphantom-glyphs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdiprima%2Fphantom-glyphs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/tdiprima%2Fphantom-glyphs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdiprima%2Fphantom-glyphs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tdiprima","download_url":"https://codeload.github.com/tdiprima/phantom-glyphs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdiprima%2Fphantom-glyphs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33599465,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","dicom","medical-imaging","ocr","testing"],"created_at":"2026-05-28T08:01:15.723Z","updated_at":"2026-05-28T08:01:18.961Z","avatar_url":"https://github.com/tdiprima.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Phantom Glyphs 👻 🌫️ 🫥 🌑 🕯️\n\nAn OCR stress-test toolkit that generates DICOM medical images packed with visually confusing characters and measures how well OCR handles them.\n\n## A Calibration Phantom for OCR\n\nIn medical imaging, a *phantom* is a standardized test object used to calibrate equipment. 
Phantom Glyphs applies the same idea to OCR: it generates a realistic radiology report embedded in a DICOM image, deliberately loaded with the character pairs that break OCR engines. Light scan noise simulates a real-world document. You run your OCR pipeline against it and see exactly where it fails.\n\nThe test report includes:\n\n| Confusable Pair | Context in Report |\n|----------------|-------------------|\n| S vs $ | SOLOMON, SOB, S5 vs $500, $5,250, $1,250 |\n| 0 vs O | O'BRIEN, OI01l0II01, 0.2cm, 0.51 |\n| 1 vs l vs I | Il1O0oO01l, 1.1cm, Claire I., MRN field |\n| 5 vs S | S5 segment, 5mm, 58-year-old, $5,250 |\n| 8 vs B | B8B88b badge, rib #8, 6-8 weeks |\n| Z vs 2 | Z-score vs -2.1 |\n\n## Getting Started\n\n### Requirements\n\n- Python 3.10+\n- NVIDIA GPU with CUDA (for GPU-based engines)\n- Tesseract system binary (optional, for Tesseract engine)\n\n### Install\n\n```bash\nbash install.sh\nsource .venv/bin/activate\n```\n\nFor Tesseract, you also need the system binary:\n\n```bash\n# Ubuntu / Debian\nsudo apt install tesseract-ocr\n\n# RHEL / Rocky\nsudo dnf install tesseract\n```\n\n### Run\n\nThe pipeline generates a test DICOM, runs all available OCR engines, times each one, and compares their accuracy:\n\n```bash\nbash run-pipeline.sh\n```\n\nOr run the pipeline directly:\n\n```bash\npython create_test_dicom.py          # generate test DICOM\npython pipeline.py test_ocr.dcm     # run all available engines, compare\n```\n\nThe pipeline automatically discovers which engines are available, skips the rest, and prints a comparison table with timing when two or more engines run.\n\nTo check a single output file against ground truth:\n\n```bash\npython check_ocr.py test_ocr_tesseract_output.md\n```\n\n## OCR Engines\n\nThe pipeline uses a plugin architecture. Each engine lives in `engines/` and is auto-discovered at runtime. 
Unavailable engines are skipped.\n\n### Tesseract\n\nRequires the `tesseract` system binary and `pytesseract` Python package (installed by `install.sh`).\n\n### Chandra OCR 2\n\nRuns via the `chandra` CLI. Install with:\n\n```bash\npip install \"chandra-ocr[hf]\"\n```\n\nFor the vLLM backend instead of HuggingFace:\n\n```bash\n# On a GPU server\npip install chandra-ocr\nchandra_vllm   # starts server on port 8000\n\n# Run pipeline with vLLM backend\nbash run-pipeline.sh --method vllm\n```\n\n### LightOn OCR\n\nLightOnOCR is a 1B-parameter model served through [vLLM](https://docs.vllm.ai/). It uses the standard OpenAI-compatible API, so the pipeline talks to it via the `openai` Python package.\n\n**1. Install the `openai` package:**\n\n```bash\npip install openai\n```\n\n**2. Start a vLLM server with the LightOnOCR model:**\n\n```bash\n# Docker (recommended) — needs vLLM \u003e= 0.18.0 and transformers \u003e= 5.4.0\ndocker run --gpus all -p 8000:8000 \\\n    vllm/vllm-openai:latest \\\n    --model lightonai/LightOnOCR \\\n    --max-model-len 4096\n\n# Or system install\npip install vllm\nvllm serve lightonai/LightOnOCR --max-model-len 4096\n```\n\nIf vLLM doesn't recognize the model class, build a custom image that upgrades `transformers`:\n\n```dockerfile\nFROM vllm/vllm-openai:latest\nRUN pip install --no-cache-dir --force-reinstall \"transformers\u003e=5.4.0\"\n```\n\n**3. Run the pipeline** (no extra flags needed — auto-detected):\n\n```bash\npython pipeline.py test_ocr.dcm\n```\n\n**Configuration via environment variables:**\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `LIGHTON_BASE_URL` | `http://localhost:8000/v1` | vLLM server URL |\n| `LIGHTON_MODEL` | `lightonai/LightOnOCR` | Model name as served by vLLM |\n\nExample pointing to a remote server:\n\n```bash\nexport LIGHTON_BASE_URL=http://gpu-box:8000/v1\npython pipeline.py test_ocr.dcm\n```\n\n## Adding a New Engine\n\n1. 
Create `engines/yourengine.py` with a class that extends `OCREngine`\n2. Implement `name`, `is_available()`, and `run(image, work_dir)`\n3. Import the class and add it to the `ENGINES` list in `engines/__init__.py`\n\nSee `engines/base.py` for the interface, and any existing engine for a working example.\n\n## Project Structure\n\n| File | Purpose |\n|------|---------|\n| `pipeline.py` | Run all available engines, time each, compare metrics |\n| `create_test_dicom.py` | Render a fake radiology report onto a DICOM image with scan noise |\n| `check_ocr.py` | Check any OCR output against ground truth, report accuracy |\n| `run-pipeline.sh` | Shell wrapper: generate DICOM then run pipeline |\n| `dicom_utils.py` | Shared DICOM-to-PIL-Image conversion |\n| `install.sh` | Set up virtualenv with dependencies |\n| `engines/` | OCR engine plugins (Chandra, Tesseract, LightOn) |\n| `engines/base.py` | `OCREngine` abstract base class |\n\n## License\n\nChandra OCR 2 code is licensed under Apache 2.0. Its model weights use a modified OpenRAIL-M license: free for research, personal use, and startups under $2M revenue. Larger commercial use requires a [Datalab license](https://datalab.to).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftdiprima%2Fphantom-glyphs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftdiprima%2Fphantom-glyphs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftdiprima%2Fphantom-glyphs/lists"}