https://github.com/preetraj2002/tablify

Converts a snapshot of a table (an image) into tabular data using OCR. It uses image processing and enhancement techniques to improve the OCR results.

ocr opencv otsu-thresholding pytesseract pytesseract-ocr python tabular-data

Host: GitHub
URL: https://github.com/preetraj2002/tablify
Owner: Preetraj2002
Created: 2024-12-05T18:22:22.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-12-05T19:38:15.000Z (over 1 year ago)
Last Synced: 2025-04-09T00:07:20.021Z (about 1 year ago)
Topics: ocr, opencv, otsu-thresholding, pytesseract, pytesseract-ocr, python, tabular-data
Language: Python
Homepage:
Size: 872 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# **Tablify** - Convert Images to Tabular Data using OCR

**Tablify** is a Python-based tool that converts tabular data from images into CSV files using Optical Character Recognition (OCR). It processes images, extracts the text using `pytesseract`, and organizes it into rows and columns for easy data extraction and analysis.

## **Features:**
- Converts images of tables into structured CSV files.
- Uses `pytesseract` to perform OCR on images.
- Processes images to detect individual text blocks, sort them by coordinates, and group them into rows.

## **Installation:**

1. **Clone the repository:**

```bash
git clone https://github.com/Preetraj2002/Tablify.git
cd Tablify
```

2. **Install required dependencies:**

Make sure you have Python 3.x installed. Then, install the required libraries:

```bash
pip install -r requirements.txt
```

3. **Install Tesseract OCR:**

- **Windows:** Download the Tesseract installer from the [UB Mannheim Tesseract wiki](https://github.com/UB-Mannheim/tesseract/wiki) and add the installation directory to your system `PATH` environment variable.
- **Linux:** Install Tesseract using:

```bash
sudo apt install tesseract-ocr
```

- **macOS:** Use Homebrew to install Tesseract:

```bash
brew install tesseract
```
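
To confirm that the Tesseract executable is reachable before running the tool, a small check like the one below can help. This is only a sketch that is not part of the repository: the function name `tesseract_available` is hypothetical, and it relies solely on the standard-library `shutil.which` lookup.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found at", shutil.which("tesseract"))
    else:
        print("Tesseract not found; check the installation and your PATH")
```

If the executable is installed but not on `PATH`, `pytesseract` also lets you point to it directly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the binary.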

## **How to Use:**

1. **Prepare an Image:**
Ensure the image contains tabular data that you want to extract. The tool works best with clear, well-contrasted images.

2. **Run the Script:**
After setting up, simply run the script on your image:

```bash
python tablify.py path/to/your/image.jpg
```

This will generate an `output.csv` file in the same directory.

3. **Check the Output:**
Open `output.csv` to see the extracted table data in tabular format.

## **Process Inside Tablify:**
1. **Image Preprocessing:**
The image is converted to grayscale, binarized with Otsu thresholding to make the text clearer for OCR, and then dilated.

Original:
![Original](images/image_csv.jpeg)

Grayscale:
![Gray](images/gray_image.png)

After Otsu thresholding:
![Thresholded_image](images/thresholded_image.png)

Dilation:
![Dilation](images/dilation.png)

2. **Contour Detection:**
Using OpenCV, contours of the text blocks are identified to group text into rows and columns.

Marked contours:
![Contours](images/countours.png)

Marked centroids of the contours:
![centroids](images/centroids_with_labels.png)

3. **Text Extraction:**
Each text block is processed with `pytesseract` to extract the text, which is then organized into a structured CSV format.

4. **CSV Generation:**
The processed text is organized into rows based on vertical alignment and saved as a CSV file.
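
Steps 3 and 4 can be sketched as below. This is an illustration under stated assumptions rather than the repository's implementation: the helper names `ocr_block`, `group_into_rows`, and `write_csv` are hypothetical, as are the `--psm 7` page segmentation mode (single text line) and the 10-pixel row tolerance. The OCR function is passed in as a parameter so the cropping and grouping logic can be tried without Tesseract installed.

```python
import csv


def ocr_block(image, box, ocr=None):
    """Crop one text block from a NumPy image and return its text."""
    if ocr is None:
        import pytesseract  # imported lazily; needs Tesseract installed

        def ocr(crop):
            # --psm 7 treats the crop as a single line of text (assumption).
            return pytesseract.image_to_string(crop, config="--psm 7")

    x, y, w, h = box
    return ocr(image[y:y + h, x:x + w]).strip()


def group_into_rows(cells, tolerance=10):
    """Group (cx, cy, text) cells into rows of text.

    Cells whose vertical centre lies within `tolerance` pixels of the
    first cell in the current row are placed in that row. Rows are
    ordered top to bottom and cells left to right.
    """
    rows = []
    for cx, cy, text in sorted(cells, key=lambda c: c[1]):
        if rows and abs(cy - rows[-1]["y"]) <= tolerance:
            rows[-1]["cells"].append((cx, text))
        else:
            rows.append({"y": cy, "cells": [(cx, text)]})
    return [
        [text for _, text in sorted(row["cells"], key=lambda c: c[0])]
        for row in rows
    ]


def write_csv(rows, path="output.csv"):
    """Write the grouped rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    demo_cells = [(120, 52, "30"), (10, 50, "Alice"),
                  (12, 10, "Name"), (118, 11, "Age")]
    print(group_into_rows(demo_cells))
```

The tolerance should be smaller than the distance between adjacent table rows; otherwise separate rows would be merged.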

## **License:**

This project is licensed under the MIT License.
