https://github.com/cseas/ocr-table
Extract tables from scanned image PDFs using Optical Character Recognition.
extract-tables ocr ocr-table optical-character-recognition pdfminer python scanned-image-pdfs shell tesseract
- Host: GitHub
- URL: https://github.com/cseas/ocr-table
- Owner: cseas
- License: mit
- Created: 2018-05-07T18:12:36.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2020-06-09T03:34:45.000Z (about 5 years ago)
- Last Synced: 2025-04-09T21:19:20.519Z (3 months ago)
- Topics: extract-tables, ocr, ocr-table, optical-character-recognition, pdfminer, python, scanned-image-pdfs, shell, tesseract
- Language: Python
- Homepage:
- Size: 12.8 MB
- Stars: 273
- Watchers: 13
- Forks: 67
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# ocr-table
This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

# Install Requirements
1. Tesseract OCR
```sh
sudo apt-get install tesseract-ocr
```
2. ImageMagick
```sh
sudo apt-get install imagemagick
```
3. PDF Utilities
```sh
sudo apt-get install poppler-utils
```
4. Python packages
```sh
sudo pip install -r requirements.txt
```

# Usage
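Before starting, it can help to confirm that the command-line tools from the install steps are on the `PATH`. The following is a minimal sketch, not part of this repository: the executable names `tesseract`, `convert`, and `pdftoppm` are assumptions about what the apt packages provide, and the helper `missing_tools` is a hypothetical name.

```python
import shutil

# Assumed executable names for the apt packages listed in the install steps;
# the scripts in this repository may call different binaries.
REQUIRED_TOOLS = ("tesseract", "convert", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

The `which` parameter exists only so the lookup can be swapped out when checking the helper without the tools installed.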
1. Clear the [pdf/](pdf) folder and copy all the PDF files you want to scan into it.
2. Run the OCR:
```sh
python3 shellocr.py
```
3. The extracted text files will be available in the [txt/](txt) folder once the process completes.
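For readers curious about what such a pipeline involves, here is a rough sketch of one way to go from the PDFs in `pdf/` to text files in `txt/`. It is an illustration under stated assumptions, not a description of what `shellocr.py` actually does: it assumes each page is rendered with `pdftoppm` (from poppler-utils) and then passed to the `tesseract` CLI, and the `pages` working directory and all function names are hypothetical.

```python
import subprocess
from pathlib import Path


def pdftoppm_cmd(pdf_path, image_prefix, dpi=300):
    """Command that renders each PDF page to <image_prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def tesseract_cmd(image_path, output_base):
    """Command that OCRs one image; Tesseract writes <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf(pdf_path, work_dir, txt_dir):
    """Render one PDF to page images, then OCR each page into txt_dir."""
    pdf_path, work_dir, txt_dir = Path(pdf_path), Path(work_dir), Path(txt_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    txt_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / pdf_path.stem
    subprocess.run(pdftoppm_cmd(pdf_path, prefix), check=True)
    for image in sorted(work_dir.glob(pdf_path.stem + "-*.png")):
        subprocess.run(tesseract_cmd(image, txt_dir / image.stem), check=True)


if __name__ == "__main__":
    # "pages" is a hypothetical scratch directory for the rendered images.
    for pdf in sorted(Path("pdf").glob("*.pdf")):
        ocr_pdf(pdf, "pages", "txt")
```

Rendering at around 300 DPI is a common starting point for OCR of scanned pages; lower resolutions run faster but may reduce recognition quality.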
# Alternate
1. If the steps above don't work for you, try this alternate method.
2. Save your file as `input.pdf` in the repository's root directory.
3. Run:
```sh
python3 pdf_miner.py
```
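As a rough idea of what a pdfminer-based approach can look like, the sketch below reads the text of `input.pdf` and splits each line into cells. It is not taken from `pdf_miner.py`: it assumes the `pdfminer.six` package and that the PDF carries an embedded text layer, and the column-splitting rule (two or more spaces) and the function names are hypothetical heuristics chosen for illustration.

```python
import re


def extract_pdf_text(pdf_path="input.pdf"):
    """Return the embedded text of a PDF using pdfminer.six."""
    # Imported lazily so rows_from_text() works without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def rows_from_text(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


if __name__ == "__main__":
    for row in rows_from_text(extract_pdf_text()):
        print(row)
```

The same `rows_from_text` helper could be applied to the OCR output in `txt/`, although real tables often need a more careful rule than whitespace splitting.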