Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/cseas/ocr-table

Extract tables from scanned image PDFs using Optical Character Recognition.

Host: GitHub
URL: https://github.com/cseas/ocr-table
Owner: cseas
License: mit
Created: 2018-05-07T18:12:36.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2020-06-09T03:34:45.000Z (about 4 years ago)
Last Synced: 2024-01-16T02:48:15.506Z (5 months ago)
Topics: extract-tables, ocr, ocr-table, optical-character-recognition, pdfminer, python, scanned-image-pdfs, shell, tesseract
Language: Python
Homepage:
Size: 12.8 MB
Stars: 243
Watchers: 14
Forks: 64
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

awesome-ocr - ocr-table - Extract tables from scanned image PDFs using Optical Character Recognition. (Table detection / Form Segmentation)
awesome-stars - cseas/ocr-table - Extract tables from scanned image PDFs using Optical Character Recognition. (Python)

# ocr-table
This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

# Install Requirements

1. Tesseract OCR
```sh
sudo apt-get install tesseract-ocr
```

2. ImageMagick
```sh
sudo apt-get install imagemagick
```

3. PDF Utilities
```sh
sudo apt-get install poppler-utils
```

4. Python packages
```sh
python3 -m pip install -r requirements.txt
```
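
Before running anything, it can help to confirm that the command-line tools from steps 1 to 3 are actually on your `PATH`. The following is a minimal sketch, not part of this repository: the helper name `missing_tools` is hypothetical, and the executable names (`tesseract`, `convert`, `pdftoppm`) are assumptions based on what the Debian/Ubuntu packages above usually install.

```python
import shutil

# Assumed executable -> apt package mapping. The executable names are what
# the Debian/Ubuntu packages usually provide; adjust them for your system
# (for example, newer ImageMagick releases ship `magick` instead of `convert`).
REQUIRED_TOOLS = {
    "tesseract": "tesseract-ocr",
    "convert": "imagemagick",
    "pdftoppm": "poppler-utils",
}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return {executable: package} for every executable not found on PATH."""
    return {exe: pkg for exe, pkg in tools.items() if shutil.which(exe) is None}


def print_report(tools=REQUIRED_TOOLS):
    """Print one install hint per missing executable, or a success message."""
    missing = missing_tools(tools)
    for exe, pkg in missing.items():
        print(f"{exe} not found; try: sudo apt-get install {pkg}")
    if not missing:
        print("All required tools were found.")
```

Calling `print_report()` from a Python shell prints an install hint for each tool that could not be located.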

# Usage

1. Clear the [pdf/](pdf) folder and copy all the PDF files you want to scan into it.

2. Run the OCR:
```sh
python3 shellocr.py
```

3. Once the process completes, the extracted text files are available in the [txt/](txt) folder.
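
For readers curious how a PDF-to-text OCR pass over the installed tools can be wired together, here is a rough sketch. It is an illustration under stated assumptions and not necessarily what `shellocr.py` does: it renders pages with `pdftoppm` (from poppler-utils) and recognizes each page with the `tesseract` CLI. The function names and the `img` folder are hypothetical; only the `pdf/` and `txt/` folder names come from the steps above.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Command that renders every PDF page to <image_prefix>-<page>.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(image_prefix)]


def tesseract_command(image_path, output_base):
    """Command that writes the recognized text to <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf(pdf_path, image_dir, txt_dir):
    """Render one PDF to page images, OCR each page, return the .txt paths."""
    pdf_path, image_dir, txt_dir = Path(pdf_path), Path(image_dir), Path(txt_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    txt_dir.mkdir(parents=True, exist_ok=True)

    # pdftoppm appends the page number to the prefix, one PNG per page.
    prefix = image_dir / pdf_path.stem
    subprocess.run(pdftoppm_command(pdf_path, prefix), check=True)

    outputs = []
    for image in sorted(image_dir.glob(f"{pdf_path.stem}-*.png")):
        base = txt_dir / image.stem
        # tesseract adds the .txt extension to the output base itself.
        subprocess.run(tesseract_command(image, base), check=True)
        outputs.append(Path(f"{base}.txt"))
    return outputs
```

With the tools installed, `ocr_pdf("pdf/report.pdf", "img", "txt")` would write one text file per page into `txt/` (`report.pdf` is a placeholder name). A higher `dpi` generally helps recognition of small table text at the cost of slower processing.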

# Alternate Method

1. If the steps above don't work for you, try this alternate method.

2. Save your file as `input.pdf` in the repository's root directory.

3. Run the alternate script:
```sh
python3 pdf_miner.py
```
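
If you use the alternate method on several files, the copy-and-run steps can be scripted. The sketch below is a convenience wrapper under assumptions, not part of this repository: `stage_input` and `run_alternate` are hypothetical names, and the only facts taken from the steps above are the `input.pdf` file name and the `pdf_miner.py` script.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def stage_input(pdf_path, repo_root="."):
    """Copy pdf_path to <repo_root>/input.pdf and return the new path."""
    source = Path(pdf_path)
    if source.suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got {source.name}")
    target = Path(repo_root) / "input.pdf"
    # Skip the copy when the file is already in place.
    if source.resolve() != target.resolve():
        shutil.copyfile(source, target)
    return target


def run_alternate(pdf_path, repo_root="."):
    """Stage the PDF as input.pdf, then run pdf_miner.py from the repo root."""
    stage_input(pdf_path, repo_root)
    # Use the current interpreter so the installed Python packages are found.
    return subprocess.run([sys.executable, "pdf_miner.py"], cwd=repo_root, check=True)
```

For example, `run_alternate("scans/invoice.pdf")` (a placeholder path) would overwrite any existing `input.pdf` in the current directory before running the script, so keep a copy if you still need the previous one.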