{"id":21629752,"url":"https://github.com/harryr/ghettoocr","last_synced_at":"2025-08-07T20:23:51.753Z","repository":{"id":3301034,"uuid":"4342704","full_name":"HarryR/GhettoOCR","owner":"HarryR","description":"Demonstration of fixed-width OCR and text extraction.","archived":false,"fork":false,"pushed_at":"2012-05-16T02:40:32.000Z","size":128,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-24T23:27:31.109Z","etag":null,"topics":["algorithm","courier","ocr","php"],"latest_commit_sha":null,"homepage":null,"language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HarryR.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-05-16T02:15:06.000Z","updated_at":"2015-01-12T05:28:37.000Z","dependencies_parsed_at":"2022-09-04T01:10:46.341Z","dependency_job_id":null,"html_url":"https://github.com/HarryR/GhettoOCR","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HarryR%2FGhettoOCR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HarryR%2FGhettoOCR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HarryR%2FGhettoOCR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HarryR%2FGhettoOCR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HarryR","download_url":"https://codeload.github.com/HarryR/GhettoOCR/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":
244306709,"owners_count":20431877,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","courier","ocr","php"],"created_at":"2024-11-25T02:08:39.121Z","updated_at":"2025-03-18T21:22:46.827Z","avatar_url":"https://github.com/HarryR.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"## GhettoOCR\n\nThis project was hacked together in a day to extract data tables from images.\nBecause of the quick and rather naive implementation, it only handles fixed-width,\nnon-antialiased \"Courier New\" at 12pt, but it manages to do a reasonable job of it.\n\n### Problems / Features\n\n * Written in PHP / Very Portable\n * Extremely slow / Easy to Debug\n * Nearly no testing / Open-source\n * Does nothing fancy / Easy to understand\n * Cannot differentiate between '0' (zero) and 'O' in Courier New 12pt.\n\n### Design Concepts\n\nThe software uses an automatically generated 'possibility intersection table' (made-up term)\nto logically deduce which character exists within the search area.\n\nThe table is built by counting which letters have white or black pixels at each\nposition within the fixed-size font area. 
The 'possible letters' are eliminated\nposition by position: the letters which have a black or white pixel at a position\nare intersected with the previous list of possible letters.\n\nExample:\n```\nLetter E  | Letter L\n          |\n######    | ###\n #   #    |  #\n # #      |  #\n ###      |  #\n #        |  #\n```\n\nThis would build a table with:\n```\n0x0: E,L\n1x0: E,L\n2x0: E,L\n3x0: E\n```\n\nThe starts of lines are identified by performing a brute-force search for the first\nletter; afterwards, each whole line is scanned from left to right to produce the text\nin the correct output order.\n\n### Copyright\n\nThe code should be considered public domain; the fonts and included images/text may not\nbe used with modified versions of the software, as they are for demonstration purposes only.\n\n### Why?\n\nBecause I went through 6 commercially available pieces of OCR software which were\nunable to extract the demo data with reasonable accuracy (even with extensive training\nand manual tweaking).\n\nBig respect to the developers of PrimeOCR, the only commercially available software which\nwas able to achieve this deceptively simple task.\n\nhttp://primerecognition.com/augprime/prime_ocr.htm\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharryr%2Fghettoocr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharryr%2Fghettoocr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharryr%2Fghettoocr/lists"}