{"id":50541775,"url":"https://github.com/notakeith/handscribe","last_synced_at":"2026-06-03T20:30:55.034Z","repository":{"id":358125684,"uuid":"1202986422","full_name":"notakeith/handscribe","owner":"notakeith","description":"Batch digitization tool for handwritten historical documents. Draw a template once — the system crops fields, runs OCR, and applies LLM correction","archived":false,"fork":false,"pushed_at":"2026-05-31T11:27:02.000Z","size":257,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-31T12:20:36.544Z","etag":null,"topics":["docker","document-processing","handwriting-recognition","hexagonal-architecture","java","llm","minio","ocr","opencv","postgresql","spring-boot"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/notakeith.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-06T15:54:48.000Z","updated_at":"2026-05-31T11:27:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/notakeith/handscribe","commit_stats":null,"previous_names":["notakeith/doclayoutparser","notakeith/handscribe"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/notakeith/handscribe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notakeith%2Fhandscribe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notakeith%2Fhandscribe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notakeith%2Fhandscribe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notakeith%2Fhandscribe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/notakeith","download_url":"https://codeload.github.com/notakeith/handscribe/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notakeith%2Fhandscribe/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33878990,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","document-processing","handwriting-recognition","hexagonal-architecture","java","llm","minio","ocr","opencv","postgresql","spring-boot"],"created_at":"2026-06-03T20:30:54.552Z","updated_at":"2026-06-03T20:30:55.029Z","avatar_url":"https://github.com/notakeith.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"banner-dark.svg\"\u003e\n    \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"banner-light.svg\"\u003e\n    \u003cimg alt=\"ITMOScript preview\" src=\"banner-light.svg\"\u003e\n    \u003c/picture\u003e\n\u003c/div\u003e\n\n\u003e [Русская версия](README_RU.md)\n\n[![Java](https://img.shields.io/badge/Java-17-ED8B00?logo=openjdk\u0026logoColor=white)](https://openjdk.org/)\n[![Spring Boot](https://img.shields.io/badge/Spring_Boot-3.2-6DB33F?logo=springboot\u0026logoColor=white)](https://spring.io/projects/spring-boot)\n[![PostgreSQL](https://img.shields.io/badge/PostgreSQL-15-4169E1?logo=postgresql\u0026logoColor=white)](https://www.postgresql.org/)\n[![OpenCV](https://img.shields.io/badge/OpenCV-4.9-5C3EE8?logo=opencv\u0026logoColor=white)](https://opencv.org/)\n[![Docker](https://img.shields.io/badge/Docker-2496ED?logo=docker\u0026logoColor=white)](https://www.docker.com/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n\nBatch digitization tool for handwritten historical documents. Draw a template once — mark the fields you need on a sample page. The system then processes any number of similar documents automatically: crops regions, runs OCR, and corrects errors with an LLM. Output is a structured table ready for analysis.\n\n## Why\n\nManual transcription of handwritten text is slow. A single document with a dozen fields takes minutes to half an hour. At the scale of thousands of archival items, that's years of work.\n\nExisting solutions are either unaffordable (Transkribus, ABBYY), poor quality on Russian handwriting (Tesseract, TrOCR), or cost hundreds of thousands of rubles (Smart Engines). This tool takes a different approach: no programming required, visual interface, modular OCR pipeline that can switch providers without touching business logic.\n\n## How it works\n\n```\nUpload sample → Mark fields → Process batch → Get CSV/JSON\n```\n\n1. User uploads a sample document and draws rectangles around target fields — coordinates are stored relative to the reference image size\n2. For each new document, a scale factor is computed; OpenCV crops the marked zones with padding\n3. Cropped fragments go to the OCR provider (currently Yandex Cloud OCR, `handwritten` model)\n4. Raw text is passed to an LLM for context-aware error correction\n5. Output: JSON with recognized fields; signatures are saved as Base64 images\n\n## Features\n\n- **Template editor** — visual field markup (text, numbers, signatures) via Canvas API, no coding\n- **Batch processing** — one template for thousands of documents\n- **LLM correction** — recovers meaning where OCR misread a character\n- **Multi-page PDFs** — separate markup per page\n- **Export** — JSON and CSV; full processing job history\n- **Modular architecture** — swap OCR or LLM provider by changing one interface implementation\n\n## Tech Stack\n\n| Layer | Technologies |\n|-------|-------------|\n| Backend | Java 17, Spring Boot 3.2.5, Spring Data JPA, Hibernate |\n| Database | PostgreSQL 15, Liquibase |\n| Object storage | MinIO (S3-compatible), AWS SDK v2, presigned URLs |\n| Image processing | OpenCV 4.9 (openpnp) — alignment, crop, homography |\n| PDF | Apache PDFBox 3.0.2 — page rendering to JPEG |\n| OCR | Yandex Cloud OCR API, `handwritten` model |\n| LLM post-processing | Deepseek V3.2 via Yandex AI Studio |\n| Frontend | Thymeleaf, Vanilla JS (ES modules), Canvas API |\n| Infrastructure | Docker, Docker Compose, Eclipse Temurin 17 |\n| API docs | SpringDoc OpenAPI (Swagger UI) |\n| Mapping | MapStruct, Lombok |\n\n## Architecture\n\nBuilt on **Hexagonal Architecture** (ports \u0026 adapters): the domain layer knows nothing about specific external services — it works only through port interfaces. This allowed swapping the OCR provider three times during development without touching the core.\n\n```mermaid\nflowchart TB\n    subgraph UI[\"Browser (Thymeleaf + Vanilla JS)\"]\n        ED[\"Template Editor\\nCanvas API\"]\n        RC[\"Recognition Page\\npolling / drag-and-drop\"]\n        JH[\"Job History\"]\n    end\n\n    subgraph API[\"Spring Boot — Presentation\"]\n        TC[\"TemplateController\"]\n        RCtl[\"RecognitionController\"]\n        JC[\"RecognitionJobController\"]\n    end\n\n    subgraph Domain[\"Spring Boot — Domain\"]\n        TS[\"TemplateService\"]\n        BRS[\"BatchRecognitionService\\n@Async executor\"]\n        RS[\"RecognitionService\\n(pipeline core)\"]\n    end\n\n    subgraph Infra[\"Spring Boot — Infrastructure\"]\n        S3A[\"MinIO Adapter\\nAWS SDK v2\"]\n        OCRA[\"YandexOcrService\"]\n        LLMA[\"LlmCorrectionService\"]\n        ALIGNER[\"OpenCvDocumentAligner\\nORB + homography\"]\n        PDF[\"PdfPageExtractor\\nPDFBox 3\"]\n    end\n\n    subgraph Storage[\"Storage\"]\n        PG[(\"PostgreSQL 15\\nLiquibase migrations\")]\n        S3[(\"MinIO S3\\nuploads · reference-images\")]\n    end\n\n    subgraph External[\"External APIs\"]\n        YOCR[\"Yandex OCR\\nhandwritten / ru\"]\n        YLLM[\"LLM provider\\nDeepseek V3.2\"]\n    end\n\n    ED --\u003e|\"multipart: dto + file\"| TC\n    RC --\u003e|\"POST /submit\"| RCtl\n    RCtl --\u003e BRS\n    BRS --\u003e|\"@Async\"| RS\n    TC --\u003e TS\n    TS --\u003e S3A\n    TS --\u003e PG\n    BRS --\u003e PG\n    RS --\u003e S3A\n    RS --\u003e ALIGNER\n    RS --\u003e OCRA\n    RS --\u003e LLMA\n    S3A --\u003e S3\n    OCRA --\u003e YOCR\n    LLMA --\u003e YLLM\n    RC --\u003e|\"GET /api/jobs/{id}\\nevery 1.5s\"| JC\n    JC --\u003e PG\n    JH --\u003e JC\n```\n\n## Document pipeline\n\n```mermaid\nflowchart LR\n    INPUT[\"Scan / PDF\\nfrom user\"]\n\n    subgraph PREP[\"Preparation\"]\n        direction TB\n        P1[\"PDFBox: render\\npages to JPEG\\n200 DPI\"]\n        P2[\"Upload\\nto MinIO S3\"]\n        P1 --\u003e P2\n    end\n\n    subgraph ALIGN[\"Alignment (OpenCV)\"]\n        direction TB\n        A1[\"ORB — detect\\n500 keypoints\"]\n        A2[\"BFMatcher (Hamming)\\n+ Lowe ratio test\"]\n        A3[\"findHomography\\n(RANSAC, \u003e= 12 points)\"]\n        A4[\"warpPerspective\\nto reference size\"]\n        A1 --\u003e A2 --\u003e A3 --\u003e A4\n        A3 --\u003e|\"\u003c 12 matches\\nfallback\"| A4\n    end\n\n    subgraph EXTRACT[\"Field extraction\"]\n        direction TB\n        E1[\"Scale bbox\\nscaleX/Y = doc / base\"]\n        E2[\"OpenCV crop\\n+ padding\"]\n        E1 --\u003e E2\n    end\n\n    subgraph RECOGNIZE[\"Recognition\"]\n        direction TB\n        R1{\"Field type\"}\n        R2[\"OCR provider\"]\n        R3[\"LLM correction\"]\n        R4[\"Crop \u0026\\nsave to S3\"]\n        R1 --\u003e|\"TEXT / NUMERIC\\nDATE / TABLE\"| R2\n        R2 --\u003e R3\n        R1 --\u003e|\"SIGNATURE\"| R4\n        R1 --\u003e|\"ANCHOR\"| SKIP[\"skip\"]\n    end\n\n    subgraph OUT[\"Output\"]\n        O1[\"recognition_results\\nin PostgreSQL\"]\n        O2[\"JSON / CSV\\nfor download\"]\n        O1 --\u003e O2\n    end\n\n    INPUT --\u003e PREP\n    PREP --\u003e ALIGN\n    ALIGN --\u003e EXTRACT\n    EXTRACT --\u003e RECOGNIZE\n    R3 --\u003e O1\n    R4 --\u003e O1\n```\n\n## Quality metrics\n\nTested on real archival documents. Test corpus used full pages without field markup — a harder condition than production (where the system receives clean cropped fragments).\n\n| Document type | CER without LLM | CER with LLM | WER without LLM | WER with LLM |\n|---|---|---|---|---|\n| Typewritten (Soviet era) | 7.21% | **4.35%** | 37.78% | **16.74%** |\n| Handwritten (early 20th c.) | 22.41% | **16.98%** | 57.71% | **41.91%** |\n\n\u003e CER \u003c 10% is acceptable for research; CER \u003c 3% is the professional archival standard.\n\nLLM correction reduces typewritten CER to 4.35% and halves WER. On early 20th-century handwriting the improvement is more modest — Yandex OCR was trained predominantly on modern handwriting, a structural limitation that post-processing alone cannot fully overcome.\n\n\u003cdetails\u003e\n\u003csummary\u003eTest sources\u003c/summary\u003e\n\n- Typewritten: GASO, fond R-2020, inventory №1, pp. [231](https://yandex.ru/archive/catalog/742f3d4a-4dab-4a2c-91c9-04c7a136a4cf/231), [232](https://yandex.ru/archive/catalog/742f3d4a-4dab-4a2c-91c9-04c7a136a4cf/232), [233](https://yandex.ru/archive/catalog/742f3d4a-4dab-4a2c-91c9-04c7a136a4cf/233), [236](https://yandex.ru/archive/catalog/742f3d4a-4dab-4a2c-91c9-04c7a136a4cf/236), [237](https://yandex.ru/archive/catalog/742f3d4a-4dab-4a2c-91c9-04c7a136a4cf/237), [238](https://yandex.ru/archive/catalog/742f3d4a-4dab-4a2c-91c9-04c7a136a4cf/238)\n- Handwritten: Kopylov case (1906), pp. [9](https://yandex.ru/archive/catalog/065eadb5-c558-42c6-86ef-d113eaee71b3/9), [10](https://yandex.ru/archive/catalog/065eadb5-c558-42c6-86ef-d113eaee71b3/10), [12](https://yandex.ru/archive/catalog/065eadb5-c558-42c6-86ef-d113eaee71b3/12), [14](https://yandex.ru/archive/catalog/065eadb5-c558-42c6-86ef-d113eaee71b3/14)\n\n\u003c/details\u003e\n\n## OCR stack selection\n\n| Solution | Result |\n|---|---|\n| Tesseract | Good on print; unacceptable on handwriting |\n| Surya (neural) | Better than Tesseract, insufficient for Soviet handwriting |\n| PaddleOCR | Unstable results |\n| HuggingFace (Church Slavonic models) + Surya | ~7–10 sec/word — hours per document |\n| **Yandex OCR + Deepseek V3.2** | ✅ Best quality on Russian handwriting |\n\n## Run\n\n```bash\ngit clone https://github.com/notakeith/handscribe.git\ndocker compose up --build\n```\n\nTemplate editor: [http://localhost:8080/templates/editor](http://localhost:8080/templates/editor)  \nSwagger UI: [http://localhost:8080/swagger-ui.html](http://localhost:8080/swagger-ui.html)\n\n## Known limitations\n\n**Variable table geometry** — the system works with fixed rectangles. If column widths vary between documents, markup drifts. Fix: detect table lines via OpenCV as a first pass.\n\n**Perspective distortion** — documents must be scanned reasonably flat. Auto-alignment via anchor points (warp perspective) is not implemented.\n\n**Prompt injection** — if a document contains text like \"ignore previous instructions\", the LLM will follow it. Basic filtering is in place; proper protection requires dedicated work.\n\n**No async feedback** — batch processing takes ~1 minute; the user waits without progress updates. SSE or WebSocket needed instead of polling.\n\n## Roadmap\n\n- Automatic table boundary detection from document lines\n- Fine-tuning OCR model on a specific document type\n- Server-Sent Events for job completion notifications\n- Support for multiple OCR providers selectable by the user\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnotakeith%2Fhandscribe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnotakeith%2Fhandscribe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnotakeith%2Fhandscribe/lists"}