Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/creisle/creisle.print_ocr
https://github.com/creisle/creisle.print_ocr
Last synced: 7 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/creisle/creisle.print_ocr
- Owner: creisle
- License: gpl-2.0
- Created: 2014-03-28T02:39:43.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2015-12-27T19:34:45.000Z (almost 9 years ago)
- Last Synced: 2023-08-19T06:10:28.457Z (about 1 year ago)
- Language: Java
- Size: 5.51 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Optical Character Recognition of printed text
=================**Authors: Caralyn Reisle and Sunette Mynhardt**
Date: 2014 March 27
Project Goal: take jpg file input, extract text, output to plain text file (Final Project for a datamining course)
Notes: currently only works for clean (no noise) straight images. Later versions will attempt more image cleaning in the pre-processing steps as well as a more robust scanning algorithm for isolating putative text components. See presentation.pdf for more details and references.
##How it works?
- Pre-Processing:
- Convert Image to Binary
- Clustering to isolate text components
- Classification: Training
- Compute Attribute vectors for each component binary array
- Output the training image vectors to an arff file
- Classification: Model
- build the classification model using the weka (http://www.cs.waikato.ac.nz/ml/weka/) java api and the output arff file
- Other processing:
- for test images, finding lines of text. Separating text components based on overlapping range values
- finding spaces between words. Makes the assumption that there are two types of spaces in the image (words vs characters). Computes a list of values for spaces and separates it into two groups attempting to minimize the maximum sum squared error (sse). Compares values when outputing text to the averages of these two groups##Library Structure for training images:
alpha characters: \/\/\/\
symbols: \/\/\
numbers: \/\/\
each directory must contain at leat two test images (better results with more images) that contain only 1 instance of that character
to train/add a new character, create a directory for that character in the approriate location and then add the same character to the global variable alpha_tnr. note that at this point in time this program is built to train on TimesNewRoman fonts only
##Runnning/Compiling on command line
**compile**: javac -cp weka.jar PrintOcr.java
note: you will need to have the weka.jar file in the same directory or specify a different classpath
**run**: java -cp weka.jar:. PrintOcr \ \
optional parameters:
1. **train**: generates the arff file. note you will need the training image library for this
2. **output**: additional ouput for debugging purposes. outputs to command line as well as producing a copy of the original image that shows the performance of the clustering algorithm
3. **svm**: if included weka will generate the classifier model using SMO model, else will use the MultilayerPerceptron model
4. **eval**: outputs an evaluation summary of the model tested on the training data