https://github.com/c0ntradicti0n/picktab
- Host: GitHub
- URL: https://github.com/c0ntradicti0n/picktab
- Owner: c0ntradicti0n
- Created: 2019-05-08T15:53:54.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2019-05-09T15:39:18.000Z (about 6 years ago)
- Last Synced: 2025-02-15T00:42:12.699Z (4 months ago)
- Language: Python
- Size: 476 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
picktab
-------

It first extracts the tabular layout and then runs OCR on it.
Based on a script from [pdftabextract](https://github.com/WZBSocialScienceCenter/pdftabextract) by Markus Konrad.
The reason for this order is that Tesseract makes many mistakes when it processes tables with a visible layout, e.g. it ignores some columns or cells.
One can either replace the cell borders with readable characters or cut the table into sub-images. I tried the latter. This script therefore uses the layout recognition of pdftabextract and creates a small image for each cell, on which optical character recognition (OCR) is then run.
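
The sub-image idea can be sketched roughly as follows. This is not the code from this repository; it is a minimal illustration that assumes the layout step has already produced the pixel positions of the column and row lines (`col_xs`, `row_ys`). The function names `cell_boxes` and `ocr_table` are hypothetical, and the OCR engine is passed in as a plain callable.

```python
from typing import Callable, List, Sequence, Tuple

# (left, upper, right, lower), the box convention used by Pillow's Image.crop
Box = Tuple[int, int, int, int]


def cell_boxes(col_xs: Sequence[int], row_ys: Sequence[int], pad: int = 0) -> List[List[Box]]:
    """Build one bounding box per cell from column and row line positions."""
    xs, ys = sorted(col_xs), sorted(row_ys)
    grid = []
    for top, bottom in zip(ys, ys[1:]):
        row = []
        for left, right in zip(xs, xs[1:]):
            # Shrink each box by `pad` pixels so the border lines are left out.
            row.append((left + pad, top + pad, right - pad, bottom - pad))
        grid.append(row)
    return grid


def ocr_table(image, col_xs: Sequence[int], row_ys: Sequence[int],
              ocr: Callable[[object], str], pad: int = 2) -> List[List[str]]:
    """Crop every cell out of `image` and run `ocr` on each sub-image.

    `image` only needs a `crop(box)` method (assumption: a Pillow image).
    """
    return [[ocr(image.crop(box)).strip() for box in row]
            for row in cell_boxes(col_xs, row_ys, pad)]
```

With Pillow and pytesseract installed, a call could look like `ocr_table(Image.open("page.png"), col_xs, row_ys, pytesseract.image_to_string)`, where `page.png` is a placeholder file name. Passing the OCR function as an argument keeps the cropping logic independent of any particular engine.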
References
----------
[pdftabextract](https://github.com/WZBSocialScienceCenter/pdftabextract)

[other approach: Camelot](https://hackernoon.com/an-open-source-science-tool-to-extract-tables-from-pdfs-into-excels-3ed3cc7f22e1)