https://github.com/harryr/ghettoocr

Demonstration of fixed-width OCR and text extraction.

Host: GitHub
URL: https://github.com/harryr/ghettoocr
Owner: HarryR
Created: 2012-05-16T02:15:06.000Z (over 13 years ago)
Default Branch: master
Last Pushed: 2012-05-16T02:40:32.000Z (over 13 years ago)
Last Synced: 2025-01-24T23:27:31.109Z (10 months ago)
Topics: algorithm, courier, ocr, php
Language: PHP
Size: 125 KB
Stars: 2
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

## GhettoOCR

This project was hacked together in a day to extract data tables from images.
Because of the quick and rather naive implementation, it only handles fixed-width,
non-antialiased "Courier New" at 12pt, but it manages to do a reasonable job of it.

### Problems / Features

* Written in PHP / Very Portable
* Extremely slow / Easy to Debug
* Nearly no testing / Open-source
* Does nothing fancy / Easy to understand
* Cannot differentiate between '0' (zero) and 'O' in Courier New 12pt.

### Design Concepts

The software uses an automatically generated 'possibility intersection table' (made-up term)
to logically deduce which character exists within the search area.

The table is built by recording, for each pixel position within the fixed-size
font area, which letters have a white pixel there and which have a black one.
To identify a character, the list of 'possible letters' is narrowed one position
at a time: it is intersected with the set of letters that have the observed pixel
colour (black or white) at that position.

Example:
```
Letter E | Letter L
         |
######   | ###
# #      | #
# #      | #
###      | #
#        | #
```

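This would build a table beginning with the following entries (written as column x row, listing the letters that have a black pixel at that position):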
```
0x0: E,L
1x0: E,L
2x0: E,L
3x0: E
```
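
The table construction and elimination described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the project's PHP. The glyph bitmaps and the names `build_table` and `identify` are hypothetical, invented for this example and not taken from the GhettoOCR source. It assumes every glyph occupies the same fixed-size cell and that pixels are strictly black or white.

```python
# Hypothetical 6x5 glyph cells: '#' is a black pixel, '.' is white.
# These shapes are made up for illustration; they are not real Courier New glyphs.
GLYPHS = {
    "E": ["######",
          "#.....",
          "####..",
          "#.....",
          "######"],
    "L": ["###...",
          "#.....",
          "#.....",
          "#.....",
          "######"],
}


def build_table(glyphs):
    """Map (x, y, is_black) to the set of letters with that pixel colour there."""
    table = {}
    for letter, rows in glyphs.items():
        for y, row in enumerate(rows):
            for x, ch in enumerate(row):
                table.setdefault((x, y, ch == "#"), set()).add(letter)
    return table


def identify(cell, table, alphabet):
    """Intersect the candidates with the table entry for every pixel in the cell."""
    candidates = set(alphabet)
    for y, row in enumerate(cell):
        for x, ch in enumerate(row):
            candidates &= table.get((x, y, ch == "#"), set())
            if not candidates:
                return candidates  # no known glyph matches this cell
    return candidates


TABLE = build_table(GLYPHS)
print(identify(GLYPHS["E"], TABLE, GLYPHS))
```

With these made-up glyphs, the black-pixel entries for the top row match the table above: positions 0 to 2 list both letters, while position 3 lists only E. If two glyphs had identical bitmaps, both would survive every intersection, which may be what happens with '0' and 'O'.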

The starts of lines are identified by performing a brute-force search for the first
letter. Afterwards, whole lines are scanned from left to right to produce the text
in the correct output order.
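
The line search and left-to-right scan could look something like the sketch below. It is written in Python rather than the project's PHP, and everything in it is an assumption made for illustration: the 3x3 cell size, the made-up glyphs, the exact-match lookup (standing in for the intersection table), and the names `cell_at`, `find_line_starts`, `read_line`, and `extract_text`. It also assumes a rectangular image and stops a line at the first unrecognised cell, so it does not handle spaces.

```python
CELL_W, CELL_H = 3, 3  # hypothetical fixed cell size

# Made-up 3x3 glyphs keyed by their pixel rows; not real Courier New shapes.
GLYPHS = {
    ("#.#", "###", "#.#"): "H",
    ("###", ".#.", "###"): "I",
}


def cell_at(image, x, y):
    """Return the CELL_W x CELL_H block whose top-left corner is (x, y)."""
    return tuple(row[x:x + CELL_W] for row in image[y:y + CELL_H])


def find_line_starts(image):
    """Brute force: try every offset and keep the leftmost glyph match per line."""
    starts = []
    y = 0
    while y + CELL_H <= len(image):
        hit = next((x for x in range(len(image[y]) - CELL_W + 1)
                    if cell_at(image, x, y) in GLYPHS), None)
        if hit is None:
            y += 1
        else:
            starts.append((hit, y))
            y += CELL_H  # skip past the line just found
    return starts


def read_line(image, x, y):
    """Step right one cell at a time until a cell is not a known glyph."""
    text = ""
    while x + CELL_W <= len(image[y]) and cell_at(image, x, y) in GLYPHS:
        text += GLYPHS[cell_at(image, x, y)]
        x += CELL_W
    return text


def extract_text(image):
    """Read every detected line, top to bottom."""
    return [read_line(image, x, y) for x, y in find_line_starts(image)]


IMAGE = [
    "..........",
    "..#.####..",
    "..###.#...",
    "..#.####..",
    "..........",
    "..####.#..",
    "...#.###..",
    "..####.#..",
]
print(extract_text(IMAGE))
```

Because the font is fixed-width, only the first letter of each line needs the expensive search; every later letter sits exactly one cell width to the right of the previous one.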

### Copyright

The code should be considered public domain. The fonts and included images/text may not
be used with modified versions of the software, as they are for demonstration purposes only.

### Why?

Because I went through 6 commercially available pieces of OCR software which were
unable to extract the demo data with reasonable accuracy (even with extensive training
and manual tweaking).

Big respect to the developers of PrimeOCR, the only commercially available software which
was able to achieve this deceptively simple task.

http://primerecognition.com/augprime/prime_ocr.htm
