https://github.com/c0ntradicti0n/picktab

Host: GitHub
URL: https://github.com/c0ntradicti0n/picktab
Owner: c0ntradicti0n
Created: 2019-05-08T15:53:54.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2019-05-09T15:39:18.000Z (about 6 years ago)
Last Synced: 2025-02-15T00:42:12.699Z (4 months ago)
Language: Python
Size: 476 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

picktab
-------

picktab first extracts the table layout and then runs OCR on the individual cells.

Based on a script from [pdftabextract](https://github.com/WZBSocialScienceCenter/pdftabextract) by Markus Konrad.

The reason for this approach is that Tesseract makes many mistakes when it processes pages laid out as tables, for example by skipping some columns or cells.
There are two ways around this: replace the cell borders with characters the OCR engine can read, or cut the table into subimages. I tried the latter.

This script therefore uses the layout recognition of pdftabextract to locate the cells, then creates a small image for each cell and runs optical character recognition (OCR) on it.
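
The cell-by-cell idea can be sketched as follows. This is a minimal illustration and not the code of this repository: the names `cell_boxes` and `read_table` are hypothetical, and the sketch assumes the layout step has already produced the pixel positions of the column and row lines. The OCR call is passed in as a function so the example runs without Tesseract installed.

```python
from typing import Callable, List, Sequence, Tuple

# (left, top, right, bottom) in pixels, the same order Pillow's Image.crop expects.
Box = Tuple[int, int, int, int]


def cell_boxes(col_edges: Sequence[int],
               row_edges: Sequence[int],
               margin: int = 0) -> List[List[Box]]:
    """Build one bounding box per cell from column and row line positions.

    `col_edges` and `row_edges` are assumed to come from a layout
    detection step; they are hypothetical inputs for this sketch.
    """
    xs = sorted(col_edges)
    ys = sorted(row_edges)
    grid = []
    for top, bottom in zip(ys, ys[1:]):
        row = []
        for left, right in zip(xs, xs[1:]):
            # Shrink each box by `margin` so the border lines
            # themselves are not handed to the OCR engine.
            row.append((left + margin, top + margin,
                        right - margin, bottom - margin))
        grid.append(row)
    return grid


def read_table(boxes: List[List[Box]],
               ocr_cell: Callable[[Box], str]) -> List[List[str]]:
    """Run the supplied OCR function on every cell box, row by row."""
    return [[ocr_cell(box).strip() for box in row] for row in boxes]
```

With Pillow and pytesseract installed, `ocr_cell` could be something like `lambda box: pytesseract.image_to_string(page.crop(box))`, where `page` is a `PIL.Image` of the scanned page. Because each call only sees one cell, Tesseract no longer has to interpret the table borders.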

References
----------
[pdftabextract](https://github.com/WZBSocialScienceCenter/pdftabextract)

[Camelot](https://hackernoon.com/an-open-source-science-tool-to-extract-tables-from-pdfs-into-excels-3ed3cc7f22e1) (an alternative approach)
