https://github.com/cseas/ocr-table

Extract tables from scanned image PDFs using Optical Character Recognition.


Host: GitHub
URL: https://github.com/cseas/ocr-table
Owner: cseas
License: mit
Created: 2018-05-07T18:12:36.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2020-06-09T03:34:45.000Z (about 5 years ago)
Last Synced: 2025-04-09T21:19:20.519Z (3 months ago)
Topics: extract-tables, ocr, ocr-table, optical-character-recognition, pdfminer, python, scanned-image-pdfs, shell, tesseract
Language: Python
Homepage:
Size: 12.8 MB
Stars: 273
Watchers: 13
Forks: 67
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# ocr-table
This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

# Install Requirements

1. Tesseract OCR
```sh
sudo apt-get install tesseract-ocr
```

2. ImageMagick
```sh
sudo apt-get install imagemagick
```

3. PDF Utilities
```sh
sudo apt-get install poppler-utils
```

4. Python packages
```sh
python3 -m pip install -r requirements.txt
```
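Before running the OCR, it can help to confirm that the command-line tools installed above are actually on your `PATH`. The following is a minimal sketch, not part of this repository. The executable names (`tesseract`, `convert`, `pdftoppm`) are assumptions based on what the three apt packages usually provide, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed executable names: tesseract (tesseract-ocr), convert (imagemagick),
# pdftoppm (poppler-utils). Adjust them if your system uses different names.
REQUIRED_TOOLS = ("tesseract", "convert", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

An empty result means every listed tool was found. Any name that is printed points to a package that still needs to be installed.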

# Usage

1. Clear the [pdf/](pdf) folder, then copy all the PDF files you want to scan into it.

2. Run the OCR:
```sh
python3 shellocr.py
```

3. Once the process completes, the extracted text files are available in the [txt/](txt) folder.
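To give a rough idea of what a PDF-to-text OCR pass over the installed tools can look like, here is a minimal sketch. It is not the contents of `shellocr.py`. It assumes each page is rendered to PNG with `pdftoppm` at 300 DPI and then passed to `tesseract`, and the function names (`pdftoppm_command`, `tesseract_command`, `ocr_pdf`) are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Build the pdftoppm call that renders each PDF page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def tesseract_command(image_path):
    """Build the tesseract call; the output base `stdout` prints the text."""
    return ["tesseract", str(image_path), "stdout"]


def ocr_pdf(pdf_path, txt_dir="txt"):
    """OCR one PDF page by page and write the text to txt_dir/<name>.txt."""
    pdf_path = Path(pdf_path)
    out_path = Path(txt_dir) / (pdf_path.stem + ".txt")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_command(pdf_path, prefix), check=True)
        # pdftoppm zero-pads page numbers, so a sorted glob keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                tesseract_command(image),
                check=True,
                capture_output=True,
                encoding="utf-8",
            )
            pages.append(result.stdout)
    out_path.write_text("\n".join(pages), encoding="utf-8")
    return out_path
```

Calling `ocr_pdf` on a file inside `pdf/` would write a text file of the same base name into `txt/`. Table layout is only preserved as far as Tesseract's plain-text output preserves it, so the result usually needs further parsing to recover rows and columns.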

# Alternate Method

If the steps above don't work for you, try this alternate method:

1. Save your file as `input.pdf` in the root directory.

2. Run:
```sh
python3 pdf_miner.py
```
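For context on the pdfminer route, here is a minimal sketch that is not taken from `pdf_miner.py`. It assumes the `pdfminer.six` package, whose `pdfminer.high_level.extract_text` function returns the text embedded in a PDF. pdfminer reads an existing text layer and does not perform OCR, so this only helps when `input.pdf` actually contains one. The function name `extract_text_layer` and the output file name `output.txt` are hypothetical.

```python
from pathlib import Path


def extract_text_layer(pdf_path="input.pdf", out_path="output.txt"):
    """Write the PDF's embedded text to out_path and return the text."""
    # Imported here so the module loads even if pdfminer.six is not installed.
    from pdfminer.high_level import extract_text  # assumes pdfminer.six

    text = extract_text(str(pdf_path))
    Path(out_path).write_text(text, encoding="utf-8")
    return text
```

If the returned string is empty or nearly empty, the PDF most likely holds only page images, and the OCR route is the one to use.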