https://github.com/harryr/ghettoocr
Demonstration of fixed-width OCR and text extraction.
algorithm courier ocr php
- Host: GitHub
- URL: https://github.com/harryr/ghettoocr
- Owner: HarryR
- Created: 2012-05-16T02:15:06.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2012-05-16T02:40:32.000Z (over 12 years ago)
- Last Synced: 2023-03-11T13:02:20.230Z (almost 2 years ago)
- Topics: algorithm, courier, ocr, php
- Language: PHP
- Size: 125 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## GhettoOCR
This project was hacked together in a day to extract data tables from images.
Because of the quick and rather naive implementation, it only handles fixed-width,
non-antialiased "Courier New" at 12pt, but it manages to do a reasonable job of it.

### Problems / Features
* Written in PHP / Very Portable
* Extremely slow / Easy to Debug
* Nearly no testing / Open-source
* Does nothing fancy / Easy to understand
* Cannot differentiate between '0' (zero) and 'O' in Courier New 12pt.

### Design Concepts
The software uses an automatically generated 'possibility intersection table' (made-up term)
to logically deduce which character exists within the search area.

The table is built by recording which letters have a white or a black pixel at each
position within the fixed-size font area. During recognition, the 'possible letters' are
narrowed down one pixel at a time: the letters which have the observed colour (black or
white) at that position are intersected with the previous list of possible letters.

Example:
```
Letter E | Letter L
|
###### | ###
# # | #
# # | #
### | #
# | #
```

This would build a table with:
```
0x0: E,L
1x0: E,L
2x0: E,L
3x0: E
```

The start of each line is identified by a brute-force search for its first letter.
Each line is then scanned from left to right so that the text is produced in the
correct output order.

### Copyright
The code should be considered public domain. The fonts and included images/text may not
be used with modified versions of the software, as they are for demonstration purposes only.

### Why?
Because I went through 6 commercially available pieces of OCR software which were
unable to extract the demo data with reasonable accuracy (even with extensive training
and manual tweaking).

Big respect to the developers of PrimeOCR: the only commercially available software which
was able to achieve this deceptively simple task.

http://primerecognition.com/augprime/prime_ocr.htm
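
### Example sketch

The 'possibility intersection table' described under Design Concepts can be sketched roughly as follows. This is not the project's actual code: the 4x5 glyph bitmaps and the names `buildTable`, `recognise` and `readLine` are made up for illustration, and `readLine` assumes the start of the line and the cell width are already known (the brute-force search for the first letter is left out).

```php
<?php
// Hypothetical 4x5 glyph bitmaps ('#' = black, '.' = white).
// These are illustrative only, not the real Courier New 12pt data.
$glyphs = [
    'E' => ['####', '#...', '###.', '#...', '####'],
    'L' => ['#...', '#...', '#...', '#...', '####'],
    'T' => ['####', '.#..', '.#..', '.#..', '.#..'],
];

// Build the table: for every "XxY" position and pixel colour,
// list the letters whose glyph has that colour at that position.
function buildTable(array $glyphs): array
{
    $table = [];
    foreach ($glyphs as $letter => $rows) {
        foreach ($rows as $y => $row) {
            foreach (str_split($row) as $x => $pixel) {
                $table["{$x}x{$y}"][$pixel][] = $letter;
            }
        }
    }
    return $table;
}

// Recognise one character cell: intersect the running candidate list with
// the letters that match the observed colour at each position.
function recognise(array $table, array $cell): array
{
    $candidates = null;
    foreach ($cell as $y => $row) {
        foreach (str_split($row) as $x => $pixel) {
            $matching = $table["{$x}x{$y}"][$pixel] ?? [];
            $candidates = $candidates === null
                ? $matching
                : array_values(array_intersect($candidates, $matching));
            if ($candidates === []) {
                return []; // nothing left to eliminate
            }
        }
    }
    return $candidates ?? [];
}

// Scan a line of fixed-width cells from left to right.
// Cells that do not resolve to exactly one letter are emitted as '?'.
function readLine(array $table, array $rows, int $cellWidth): string
{
    $text = '';
    $lineWidth = strlen($rows[0]);
    for ($x = 0; $x + $cellWidth <= $lineWidth; $x += $cellWidth) {
        $cell = array_map(function ($row) use ($x, $cellWidth) {
            return substr($row, $x, $cellWidth);
        }, $rows);
        $found = recognise($table, $cell);
        $text .= count($found) === 1 ? $found[0] : '?';
    }
    return $text;
}

$table = buildTable($glyphs);
echo implode(',', recognise($table, $glyphs['L'])), "\n"; // prints: L
```

Two glyphs with identical bitmaps would both survive every intersection, so a cell containing one of them would come out as '?' in this sketch. That is the same kind of ambiguity as the '0' versus 'O' limitation listed under Problems / Features.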