{"id":15764594,"url":"https://github.com/soodoku/recognize","last_synced_at":"2025-03-31T10:21:49.305Z","repository":{"id":144885776,"uuid":"42123777","full_name":"soodoku/recognize","owner":"soodoku","description":"Assess OCR quality: Compare OCR to human transcription","archived":false,"fork":false,"pushed_at":"2015-11-06T22:53:37.000Z","size":2476,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-11T12:16:24.656Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/soodoku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-09-08T16:10:50.000Z","updated_at":"2020-10-25T03:38:44.000Z","dependencies_parsed_at":"2023-04-12T04:41:44.623Z","dependency_job_id":null,"html_url":"https://github.com/soodoku/recognize","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soodoku%2Frecognize","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soodoku%2Frecognize/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soodoku%2Frecognize/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soodoku%2Frecognize/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/soodoku","download_url":"https://codeload.github.com/soodoku/recognize/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246450641,"owners_c
ount":20779454,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-04T12:04:10.612Z","updated_at":"2025-03-31T10:21:49.279Z","avatar_url":"https://github.com/soodoku.png","language":"R","readme":"### Recognize: Assess OCR Quality\n\n[![GPL-3.0](http://img.shields.io/:license-gpl-blue.svg)](http://opensource.org/licenses/GPL-3.0)\n[![Build Status](https://travis-ci.org/soodoku/recognize.svg?branch=master)](https://travis-ci.org/soodoku/recognize)\n[![Build status](https://ci.appveyor.com/api/projects/status/yovq8wfrw813usv2?svg=true)](https://ci.appveyor.com/project/soodoku/recognize)\n\n### Testing Quality\n\nSay you run 10,000 documents through OCR software and want to know how well the software recognized the text. A simple solution exists: take a random sample of the documents, have humans transcribe the sample, and treat the human transcriptions as the gold standard. Then compare the OCR text to the gold standard; in particular, calculate the edit distance between the two documents. \n\nSome cleanup may be necessary first, such as removing extra spaces. When a document contains images with no text, OCR software typically adds a lot of extra whitespace, and we want to account for that.\n\nMore complex formulations of this simple plan are easy to imagine. The above method doesn't take account of text **decorations**, headers, etc., and capturing these may be essential. Extensions to Unicode would also be useful. 
Analysis of how well we have done with regard to more complex data structures like tables needs further thought. But this is a start. \n\n### Installation\n\nTo get the current development version from GitHub:\n\n```r\n# install.packages(\"devtools\")\ndevtools::install_github(\"soodoku/recognize\")\n```\n\nThe package depends on [readr](https://github.com/hadley/readr) and [RecordLinkage](https://cran.r-project.org/web/packages/RecordLinkage/index.html).\n\n### Usage\n\n```r\nsetwd(path.package(\"recognize\"))\ncompare_txt(\"inst/extdata/abbyyR_wisc_out/PA_Casey_Auditor_General\", \"inst/extdata/gold_wisc_out/PA_Casey_Auditor_General\")\n```\n\n```r\n## 163\n```\n\n#### License\nScripts are released under [GNU GPL v3](http://www.gnu.org/licenses/gpl-3.0.en.html).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoodoku%2Frecognize","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoodoku%2Frecognize","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoodoku%2Frecognize/lists"}