https://github.com/cseas/ocr-table
Extract tables from scanned image PDFs using Optical Character Recognition.
- Host: GitHub
- URL: https://github.com/cseas/ocr-table
- Owner: cseas
- License: MIT
- Created: 2018-05-07T18:12:36.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-06-09T03:34:45.000Z (over 4 years ago)
- Last Synced: 2024-11-03T09:33:36.374Z (9 days ago)
- Topics: extract-tables, ocr, ocr-table, optical-character-recognition, pdfminer, python, scanned-image-pdfs, shell, tesseract
- Language: Python
- Homepage:
- Size: 12.8 MB
- Stars: 265
- Watchers: 14
- Forks: 64
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ocr-table
This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

# Install Requirements
1. Tesseract OCR
```sh
sudo apt-get install tesseract-ocr
```
2. ImageMagick
```sh
sudo apt-get install imagemagick
```
3. PDF Utilities
```sh
sudo apt-get install poppler-utils
```
4. Python packages
```sh
sudo pip install -r requirements.txt
```

# Usage
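Before running the steps in this section, it can help to confirm that the command-line tools from Install Requirements are on your `PATH`. The following is a minimal sketch, not part of this repository; the command names (`tesseract`, `convert`, `pdftoppm`) are assumptions about what the apt packages provide, and `missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed command names for the packages listed under Install Requirements:
# tesseract-ocr -> "tesseract", imagemagick -> "convert", poppler-utils -> "pdftoppm".
REQUIRED_TOOLS = ("tesseract", "convert", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Newer ImageMagick releases may expose `magick` instead of `convert`, so adjust the tuple if your installation differs.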
1. Clear the [pdf/](pdf) folder and copy all the PDF files you want to scan into it.
2. Run the OCR:
```sh
python3 shellocr.py
```
3. The extracted text files will be available in the [txt/](txt) folder once the process completes.
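
To give a rough idea of how the installed tools can fit together, here is a hedged sketch of a PDF-to-text pass: rasterize each page with `pdftoppm`, then OCR each page image with `tesseract`. This is an illustration under assumptions, not the code in `shellocr.py`; the function names, the `work` directory, and the 300 DPI default are all hypothetical.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, image_prefix, dpi=300):
    """Build a pdftoppm call that writes one PNG per page as <image_prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def ocr_command(image_path, txt_dir="txt"):
    """Build a tesseract call; tesseract appends ".txt" to the output base itself."""
    output_base = Path(txt_dir) / Path(image_path).stem
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf(pdf_path, work_dir="work", txt_dir="txt"):
    """Rasterize one PDF, then OCR every page image that was produced."""
    work, pdf = Path(work_dir), Path(pdf_path)
    work.mkdir(exist_ok=True)
    Path(txt_dir).mkdir(exist_ok=True)
    subprocess.run(rasterize_command(pdf, work / pdf.stem), check=True)
    for image in sorted(work.glob(pdf.stem + "-*.png")):
        subprocess.run(ocr_command(image, txt_dir), check=True)
```

Calling `ocr_pdf("pdf/report.pdf")` (a hypothetical file name) would leave one text file per page in `txt/`. Keeping the command builders separate from the code that runs them makes the exact commands easy to inspect before anything is executed.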
# Alternate
1. If the method above doesn't work for you, try this alternate method.
2. Save your file as `input.pdf` in the repository's root directory.
3. Run the script:
```sh
python3 pdf_miner.py
```
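
For reference, text extraction with the pdfminer.six library can look roughly like the sketch below. It is not the contents of `pdf_miner.py`; `pdf_to_text`, the `output.txt` default, and the injectable `extract` parameter are hypothetical, and only `pdfminer.high_level.extract_text` is a real library call. Note that pdfminer reads a PDF's embedded text layer, so a purely scanned PDF without one is likely to yield little or no text.

```python
from pathlib import Path


def pdf_to_text(pdf_path="input.pdf", out_path="output.txt", extract=None):
    """Write the text layer of `pdf_path` to `out_path` and return the text."""
    if extract is None:
        # Imported lazily so this module loads even without pdfminer.six installed.
        from pdfminer.high_level import extract_text as extract
    text = extract(str(pdf_path))
    Path(out_path).write_text(text, encoding="utf-8")
    return text


if __name__ == "__main__":
    # Only attempt extraction when the expected input file is present.
    if Path("input.pdf").exists():
        print(pdf_to_text()[:500])
```

The `extract` parameter exists only so the function can be exercised with a stand-in extractor; in normal use it is left as `None`.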