https://github.com/techplexengineer/scan-extractor
Extract structured data from scans
- Host: GitHub
- URL: https://github.com/techplexengineer/scan-extractor
- Owner: TechplexEngineer
- Created: 2019-12-15T17:22:05.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-12-15T17:41:32.000Z (over 6 years ago)
- Last Synced: 2025-02-25T09:19:01.318Z (about 1 year ago)
- Language: Python
- Size: 2.93 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
Scan Extractor
==============
This repo documents the process of extracting structured data from images of textual data.
## Step 1
Convert HEIC images to JPG.
`one_two.sh`
If your images are already in JPG format, you can skip this step.
If your images are not in HEIC, you may be able to use ImageMagick's `convert` program.
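
The contents of `one_two.sh` are not shown here, so the following is only a sketch of what a batch conversion could look like. It assumes ImageMagick is installed with HEIC support and is available as `convert`. The function name `heic_to_jpg_commands` and the `jpg` output directory are hypothetical.

```python
from pathlib import Path


def heic_to_jpg_commands(heic_paths, out_dir="jpg"):
    """Build one ImageMagick `convert` command per HEIC file.

    Returns a list of argument lists suitable for subprocess.run().
    Nothing is executed here, so the commands can be inspected first.
    """
    commands = []
    for src in map(Path, heic_paths):
        if src.suffix.lower() != ".heic":
            continue  # skip files that are not HEIC
        dst = Path(out_dir) / (src.stem + ".jpg")
        commands.append(["convert", str(src), str(dst)])
    return commands
```

To actually run the conversion, create the output directory and pass each command to `subprocess.run(cmd, check=True)`.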
## Step 2
Straighten, dewarp, remove the paper background, and convert to black and white.
I recommend using [Scan Tailor](https://scantailor.org/).
If your images are really good, you might be able to use a variation on
`two_three.sh`, which uses [textcleaner by Fred Weinhaus](http://www.fmwconcepts.com/imagemagick/textcleaner/).
## Step 3
Use Tesseract OCR to extract text from the images.
`three_four.sh`
If you know your input has a limited character set, I have found that setting
`tessedit_char_whitelist` eliminates post-processing work that would otherwise be needed.
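
The exact invocation in `three_four.sh` is not shown, so this is a minimal sketch of how the whitelist might be passed to the Tesseract command line through its `-c` option. The function name `tesseract_command`, the file names, and the example character set are hypothetical.

```python
import string


def tesseract_command(image_path, output_base, whitelist=None):
    """Build a Tesseract CLI command as an argument list.

    Tesseract writes its text output to `<output_base>.txt`.
    """
    cmd = ["tesseract", str(image_path), str(output_base)]
    if whitelist:
        # -c sets a config variable; here it restricts recognized characters.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


# Hypothetical example: a scan that contains only digits and a few separators.
digits_only = string.digits + ".,-"
print(tesseract_command("page_01.jpg", "page_01", digits_only))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once Tesseract is installed.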
## Step 4
Post-process the OCR output and generate a CSV.
`process.py`
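
`process.py` itself is not reproduced here, so the following only sketches the general shape of this step. It assumes, purely for illustration, that each OCR line holds one record with whitespace-separated fields. The function name `ocr_text_to_csv` and the sample data are hypothetical.

```python
import csv
import io


def ocr_text_to_csv(ocr_text, header=None):
    """Turn raw OCR text into CSV text, one row per non-empty line.

    Assumes (hypothetically) that fields are separated by whitespace.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    if header:
        writer.writerow(header)
    for line in ocr_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # drop blank lines that OCR often leaves between rows
        writer.writerow(fields)
    return buf.getvalue()


sample = "1001 12.50\n\n1002  7.25\n"
print(ocr_text_to_csv(sample, header=["id", "amount"]))
```

Real scans usually need extra cleanup rules for misread characters, which depend on the data being extracted.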