https://github.com/techplexengineer/scan-extractor

Extract structured data from scans

Host: GitHub
URL: https://github.com/techplexengineer/scan-extractor
Owner: TechplexEngineer
Created: 2019-12-15T17:22:05.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-12-15T17:41:32.000Z (over 6 years ago)
Last Synced: 2025-11-17T19:04:07.179Z (7 months ago)
Language: Python
Size: 2.93 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md

README

Scan Extractor
==============

This repo documents a process for extracting structured data from images of text.

## Step 1
Convert HEIC images to JPG.
`one_two.sh`

If your images are already in JPG format, you can skip this step.
If your images are in a format other than HEIC, you may be able to use ImageMagick's `convert` program instead.
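
As a rough illustration of this step, here is a minimal Python sketch that shells out to ImageMagick. It is not the contents of `one_two.sh`. The function names `build_convert_command` and `convert_all` are hypothetical, and the sketch assumes an ImageMagick install that can read HEIC files (this typically requires HEIC delegate support).

```python
import subprocess
from pathlib import Path


def build_convert_command(src: Path, dest_dir: Path) -> list[str]:
    """Build an ImageMagick command that writes `src` as a JPG in `dest_dir`.

    Hypothetical helper for illustration only.
    """
    dest = dest_dir / (src.stem + ".jpg")
    # "convert" is the ImageMagick 6 entry point; ImageMagick 7 installs
    # usually expose the same functionality as "magick".
    return ["convert", str(src), str(dest)]


def convert_all(src_dir: Path, dest_dir: Path) -> list[Path]:
    """Convert every .heic file in `src_dir` and return the output paths."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(src_dir.iterdir()):
        # Match the extension case-insensitively (.heic and .HEIC).
        if src.suffix.lower() != ".heic":
            continue
        cmd = build_convert_command(src, dest_dir)
        subprocess.run(cmd, check=True)  # raises if ImageMagick fails
        outputs.append(Path(cmd[-1]))
    return outputs
```

Calling `convert_all(Path("heic"), Path("jpg"))` would convert each HEIC file in a `heic` directory into a `jpg` directory; both directory names are placeholders.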

## Step 2
Straighten, dewarp, remove the paper background, and convert to black and white.
I recommend using [Scan Tailor](https://scantailor.org/).

If your images are really good, you might be able to use a variation on
`two_three.sh`, which uses [textcleaner by Fred Weinhaus](http://www.fmwconcepts.com/imagemagick/textcleaner/).

## Step 3
Use Tesseract OCR to extract text from the images.
`three_four.sh`

If you know your input has a limited character set, I have found that using
`tessedit_char_whitelist` eliminates post-processing work that would otherwise be needed.
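
To show how the whitelist can be passed, here is a minimal Python sketch that builds a Tesseract command line using the `-c` option to set a config variable. It is not the contents of `three_four.sh`. The helper name `build_tesseract_command`, the file names, and the digits-only whitelist are assumptions for illustration.

```python
import subprocess


def build_tesseract_command(image_path: str, output_base: str, whitelist: str) -> list[str]:
    """Build a Tesseract command restricted to the characters in `whitelist`.

    Hypothetical helper for illustration only.
    """
    return [
        "tesseract",
        image_path,
        output_base,  # Tesseract appends ".txt" to this base name itself
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path: str, output_base: str, whitelist: str) -> None:
    """Run Tesseract on one image; raises if the command fails."""
    subprocess.run(build_tesseract_command(image_path, output_base, whitelist), check=True)
```

For example, `run_ocr("page1.jpg", "page1", "0123456789.")` would write `page1.txt` containing only digits and periods, assuming `tesseract` is on the `PATH`.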

## Step 4
Post process the OCR output and generate a CSV.
`process.py`
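
The details of post-processing depend on what your scans contain, so the following is only a sketch of the general idea, not the contents of `process.py`. It assumes a hypothetical layout in which each OCR line holds a fixed number of whitespace-separated fields. The function names and the sample column names are made up for illustration.

```python
import csv
import io


def ocr_text_to_rows(text: str, expected_fields: int) -> list[list[str]]:
    """Split OCR text into rows, keeping only lines with the expected field count."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and lines where OCR merged or dropped a field.
        if len(fields) != expected_fields:
            continue
        rows.append(fields)
    return rows


def rows_to_csv(rows: list[list[str]], header: list[str]) -> str:
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Lines that do not match the expected field count are dropped silently here. In practice you may prefer to log them so they can be corrected by hand.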
