https://github.com/preetraj2002/tablify
Converts a snapshot of a table (an image) into tabular data using OCR. It uses image processing and enhancement techniques to improve the OCR results.
ocr opencv otsu-thresholding pytesseract pytesseract-ocr python tabular-data
- Host: GitHub
- URL: https://github.com/preetraj2002/tablify
- Owner: Preetraj2002
- Created: 2024-12-05T18:22:22.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-05T19:38:15.000Z (over 1 year ago)
- Last Synced: 2025-04-09T00:07:20.021Z (about 1 year ago)
- Topics: ocr, opencv, otsu-thresholding, pytesseract, pytesseract-ocr, python, tabular-data
- Language: Python
- Homepage:
- Size: 872 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# **Tablify** - Convert Images to Tabular Data using OCR
**Tablify** is a Python-based tool that converts tabular data from images into CSV files using Optical Character Recognition (OCR). It processes images, extracts the text using `pytesseract`, and organizes it into rows and columns for easy data extraction and analysis.
## **Features:**
- Converts images of tables into structured CSV files.
- Uses `pytesseract` to perform OCR on images.
- Processes images to detect individual text blocks, sort them by coordinates, and group them into rows.
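The last feature (sorting text blocks by coordinates and grouping them into rows) can be illustrated with a minimal sketch. It is not taken from the repository: the `TextBlock` type, the `group_into_rows` function, and the `y_tolerance` value are hypothetical names and defaults chosen for illustration, and the actual grouping rule in Tablify may differ.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A detected text block: bounding box plus OCR text (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int
    text: str

    @property
    def cy(self) -> float:
        # Vertical centre of the bounding box.
        return self.y + self.h / 2


def group_into_rows(blocks, y_tolerance=10):
    """Group blocks into rows by vertical centre, then order cells left to right.

    A block joins the current row if its centre is within `y_tolerance`
    pixels of the centre of the first block in that row (assumed rule).
    """
    rows = []
    for block in sorted(blocks, key=lambda b: b.cy):
        if rows and abs(block.cy - rows[-1][0].cy) <= y_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [[b.text for b in sorted(row, key=lambda b: b.x)] for row in rows]


if __name__ == "__main__":
    demo = [
        TextBlock(120, 12, 40, 20, "Age"),
        TextBlock(10, 10, 60, 20, "Name"),
        TextBlock(10, 50, 60, 20, "Ada"),
        TextBlock(120, 53, 40, 20, "36"),
    ]
    print(group_into_rows(demo))  # [['Name', 'Age'], ['Ada', '36']]
```

A fixed pixel tolerance is the simplest choice; for images with varying font sizes, a tolerance derived from the median block height would likely be more robust.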
## **Installation:**
1. **Clone the repository:**
```bash
git clone https://github.com/Preetraj2002/Tablify.git
cd Tablify
```
2. **Install required dependencies:**
Make sure you have Python 3.x installed. Then, install the required libraries:
```bash
pip install -r requirements.txt
```
3. **Install Tesseract OCR:**
- **Windows:** Download the Tesseract installer from the [UB Mannheim Tesseract wiki](https://github.com/UB-Mannheim/tesseract/wiki), run it, and add the Tesseract install directory to your system `PATH` environment variable.
- **Linux:** Install Tesseract using:
```bash
sudo apt install tesseract-ocr
```
- **macOS:** Use Homebrew to install Tesseract:
```bash
brew install tesseract
```
## **How to Use:**
1. **Prepare an Image:**
Ensure the image contains tabular data that you want to extract. The tool works best with clear, well-contrasted images.
2. **Run the Script:**
After setting up, simply run the script on your image:
```bash
python tablify.py path/to/your/image.jpg
```
This will generate an `output.csv` file in the same directory.
3. **Check the Output:**
Open `output.csv` to see the extracted table data in tabular format.
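If you prefer to inspect the result programmatically, the following sketch loads the CSV with Python's standard `csv` module and flags rows whose cell count differs from the first row, which is a common symptom of OCR or grouping mistakes. The helper names `load_table` and `find_ragged_rows` are hypothetical, and UTF-8 encoding is an assumption about the output file.

```python
import csv


def load_table(path="output.csv"):
    """Read a CSV file into a list of rows, each a list of cell strings."""
    # UTF-8 is assumed here; adjust if the file uses another encoding.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


def find_ragged_rows(rows):
    """Return indices of rows whose length differs from the first row."""
    if not rows:
        return []
    expected = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != expected]
```

For example, `find_ragged_rows(load_table("output.csv"))` returns an empty list when every row has the same number of columns as the header row.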
## **Process Inside Tablify:**
1. **Image Preprocessing:**
The image is converted to grayscale, binarized with Otsu thresholding to make the text clearer for OCR, and then dilated.
Figures: Original, Grayscale, After Otsu thresholding, Dilation.
2. **Contour Detection:**
Using OpenCV, contours of the text blocks are identified to group text into rows and columns.
Figures: Marked contours, Marked centroids of the contours.
3. **Text Extraction:**
Each text block is processed with `pytesseract` to extract the text, which is then organized into a structured CSV format.
4. **CSV Generation:**
The processed text is organized into rows based on vertical alignment and saved as a CSV file.
## **License:**
This project is licensed under the MIT License.