https://github.com/ohimoiza1205/ocr-label-studio-automation-
This project is an end-to-end workflow for processing a sample invoice using OCR and manual annotation. It demonstrates how to extract text from an invoice using Tesseract OCR, refine the results in Label Studio, and prepare a high-quality dataset for AI training. It includes configuration files, scripts, and documentation for document processing.
ai-training google-colab label-studio machine-learning ocr tesseract
- Host: GitHub
- URL: https://github.com/ohimoiza1205/ocr-label-studio-automation-
- Owner: Ohimoiza1205
- License: mit
- Created: 2025-03-10T05:22:37.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-03-10T06:01:18.000Z (7 months ago)
- Last Synced: 2025-03-10T06:27:05.744Z (7 months ago)
- Topics: ai-training, google-colab, label-studio, machine-learning, ocr, tesseract
- Homepage:
- Size: 46.9 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Invoice Annotation Project
This repository contains all the code, configuration files, and documentation for the Invoice Annotation Project. The objective is to build an end-to-end workflow for processing a sample invoice using OCR, refining annotations with Label Studio, and preparing a dataset for AI training.
## Table of Contents
- [Overview](#overview)
- [Project Workflow](#project-workflow)
- [Requirements](#requirements)
- [Setup and Installation](#setup-and-installation)
- [Step-by-Step Instructions](#invoice-annotation-project-step-by-step-instructions)
  - [Step 1: Document Analysis & Label Identification](#step-1-document-analysis--label-identification)
  - [Step 2: XML Label Configuration](#step-2-xml-label-configuration)
  - [Step 3: Generate OCR Data](#step-3-generate-ocr-data)
  - [Step 4: Label Refinement in Label Studio](#step-4-label-refinement-in-label-studio)
  - [Step 5: Reflection and Final Report](#step-5-reflection-and-final-report)
- [License](#license)
- [Contact](#contact)

## Overview
The project demonstrates:
- Extraction of text from a scanned invoice using Tesseract OCR.
- Creation of a custom XML configuration for Label Studio.
- Refinement of OCR output and manual labeling of invoice fields.
- Compilation of a final annotated dataset and a reflection on the process.

## Project Workflow
1. **Document Analysis & Label Identification:** Identify key invoice fields.
2. **XML Label Configuration:** Create an XML file to define labels for annotation.
3. **Generate OCR Data:** Use a Colab Notebook to extract text and bounding boxes.
4. **Label Refinement in Label Studio:** Import and refine annotations.
5. **Reflection and Final Report:** Write a reflection on your process and improvements.

## Requirements
- Python 3.7+
- Tesseract OCR (installed and in your PATH)
- Google Colab account (for running the OCR notebook)
- Label Studio (installed locally)

## Setup and Installation
1. **Clone the Repository:**
   ```bash
   git clone https://github.com/ohimoiza1205/ocr-label-studio-automation-.git
   cd ocr-label-studio-automation-
   ```
2. **Install Label Studio:**
   ```bash
   pip install label-studio
   label-studio start
   ```
3. **Set Up Tesseract OCR:**
   Follow the installation instructions for your operating system.

# Invoice Annotation Project: Step-by-Step Instructions
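Before starting Step 1, it can help to confirm that the Tesseract requirement (installed and on your PATH) is actually met. The following is a minimal sketch using only the Python standard library; `find_tesseract` is a hypothetical helper name and is not part of this repository.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return (path, version_line), or (None, None) if the binary is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None, None
    # `tesseract --version` prints a banner whose first line names the version.
    # Some builds write the banner to stderr, so fall back to it when stdout is empty.
    proc = subprocess.run([path, "--version"], capture_output=True, text=True)
    banner = (proc.stdout or proc.stderr).strip().splitlines()
    return path, (banner[0] if banner else "")


if __name__ == "__main__":
    path, version = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; install it before running the OCR step.")
    else:
        print(f"Found Tesseract at {path}: {version}")
```

If the script reports that Tesseract is missing even though it is installed, the install directory most likely needs to be added to PATH.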
## Step 1: Document Analysis & Label Identification
**Task:**
Review the sample invoice image and identify key fields.

**Deliverable:**
A bullet list of fields saved in `document_analysis.txt`.

**Example:**
- Invoice Number
- Invoice Date
- Customer Name
- Item Descriptions
- Quantity, Unit Price
- Tax
- Total Amount

---
## Step 2: XML Label Configuration
**Task:**
Create an XML configuration file for Label Studio to define the labels for annotation.

**Deliverable:**
Save the following as `invoice_label_config.xml`.

**Example:**
```xml
```
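The exact configuration used in the project is not reproduced in this README. As a hedged sketch, the script below builds a configuration modeled on Label Studio's OCR template from the Step 1 field list and writes it to `invoice_label_config.xml`. The control names (`image`, `label`, `bbox`, `transcription`), the `$ocr` data key, and the helper name `build_label_config` are assumptions; they need to match whatever the OCR notebook emits.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Field names taken from the Step 1 example list; adjust to your own analysis.
FIELDS = [
    "Invoice Number", "Invoice Date", "Customer Name", "Item Description",
    "Quantity", "Unit Price", "Tax", "Total Amount",
]


def build_label_config(fields):
    """Return a Label Studio labeling config (XML string) with one label per field."""
    view = ET.Element("View")
    # "$ocr" means the image URL is read from the "ocr" key of each task's data.
    ET.SubElement(view, "Image", name="image", value="$ocr")
    labels = ET.SubElement(view, "Labels", name="label", toName="image")
    for field in fields:
        ET.SubElement(labels, "Label", value=field)
    # Rectangles hold the bounding boxes; the per-region text area holds the OCR text.
    ET.SubElement(view, "Rectangle", name="bbox", toName="image", strokeWidth="3")
    ET.SubElement(
        view, "TextArea", name="transcription", toName="image",
        editable="true", perRegion="true", rows="5",
        placeholder="Recognized Text", displayMode="region-list",
    )
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(view)
    return ET.tostring(view, encoding="unicode")


if __name__ == "__main__":
    Path("invoice_label_config.xml").write_text(build_label_config(FIELDS), encoding="utf-8")
```

Generating the file from the field list keeps the Step 1 analysis and the label configuration in sync if fields are added or renamed later.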
## Step 3: Generate OCR Data

**Task:**
Run the provided Google Colab Notebook to extract text and bounding boxes from the sample invoice.

**Instructions:**
1. Open the Pre-Configured Colab Notebook.
2. Update the `image_url` variable to:
   ```text
   https://allies-assets.s3.us-east-1.amazonaws.com/birthplan_builder_assets/extra_images/invoice_sample.png
   ```
3. Click **Runtime > Run all** to execute all cells.
4. When prompted (or via the Files sidebar), download the generated file as `invoice_ocr_output.json`.
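The notebook itself is not shown here, so the following is only a sketch of what the conversion step might look like: it turns word-level Tesseract output into a Label Studio task with pre-annotations. It assumes the input has the dictionary shape returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)` and that the labeling config uses the control names `bbox` and `transcription` on an image named `image` fed by the `ocr` data key. The function name `tesseract_to_label_studio` is hypothetical.

```python
import uuid


def tesseract_to_label_studio(data, image_url, img_w, img_h, min_conf=0.0):
    """Convert Tesseract word boxes (pixel units) into one Label Studio task."""
    results, confidences = [], []
    for i, raw_text in enumerate(data["text"]):
        text = raw_text.strip()
        conf = float(data["conf"][i])  # Tesseract reports -1 for non-word rows
        if not text or conf < min_conf:
            continue
        # Label Studio expects coordinates as percentages of the image size.
        box = {
            "x": 100 * data["left"][i] / img_w,
            "y": 100 * data["top"][i] / img_h,
            "width": 100 * data["width"][i] / img_w,
            "height": 100 * data["height"][i] / img_h,
            "rotation": 0,
        }
        # The rectangle and its transcription share an id so they form one region.
        common = {
            "id": uuid.uuid4().hex[:10],
            "to_name": "image",
            "original_width": img_w,
            "original_height": img_h,
            "image_rotation": 0,
        }
        results.append({**common, "from_name": "bbox", "type": "rectangle", "value": dict(box)})
        results.append({**common, "from_name": "transcription", "type": "textarea",
                        "value": {**box, "text": [text]}})
        confidences.append(conf)
    # Average word confidence, rescaled from 0-100 to 0-1, as the prediction score.
    score = sum(confidences) / len(confidences) / 100 if confidences else 0.0
    return {"data": {"ocr": image_url}, "predictions": [{"result": results, "score": score}]}
```

In a notebook, `data` would typically come from `pytesseract.image_to_data` and the dimensions from `image.size` on a Pillow image. Dumping `[task]` with `json.dump` gives a list-of-tasks file that Label Studio can import. The provided notebook may structure its output differently.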
**Deliverable:**
`invoice_ocr_output.json`

## Step 4: Label Refinement in Label Studio
**Task:**
Import the OCR JSON file into Label Studio, refine the bounding boxes, correct OCR text, and assign the correct labels.

**Instructions:**
1. Launch Label Studio at http://localhost:8080 and create a new project named "Invoice Annotation."
2. In the project settings, paste the XML configuration from Step 2 and click Save.
3. Import the `invoice_ocr_output.json` file into your project.
4. Open the task, adjust bounding boxes, delete any irrelevant annotations, and manually correct any OCR errors.
5. Once satisfied, export the final labeled dataset as `label_studio_output.json`.
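A quick sanity check on the export can catch fields that were never labeled. The sketch below counts how often each label appears. It assumes the file was produced with Label Studio's JSON export (a list of tasks, each with `annotations` containing `result` regions); `count_labels` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def count_labels(tasks):
    """Count label occurrences across all annotations in a Label Studio JSON export."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for region in annotation.get("result", []):
                value = region.get("value", {})
                # Labels live under a key named after the control type,
                # for example "labels" or "rectanglelabels".
                for key in ("labels", "rectanglelabels"):
                    counts.update(value.get(key, []))
    return counts


if __name__ == "__main__":
    export_path = Path("label_studio_output.json")
    if export_path.exists():
        tasks = json.loads(export_path.read_text(encoding="utf-8"))
        for label, n in sorted(count_labels(tasks).items()):
            print(f"{label}: {n}")
    else:
        print(f"{export_path} not found; export the project from Label Studio first.")
```

A label from the configuration that is missing from the printed counts is a hint that the corresponding invoice field was skipped during annotation.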
**Deliverable:**
`label_studio_output.json`

## Step 5: Reflection and Final Report
**Task:**
Write a reflection (150–200 words) discussing:
1. Label Selection: Why you chose the specific labels.
2. Challenges: What issues you encountered during OCR and annotation, and how you addressed them.
3. Workflow Improvements: Suggestions for streamlining the process in the future.
**Deliverable:**
Save your reflection in `reflection.md`.

## License
This project is licensed under the MIT License.

## Contact
For any questions, please contact <omoiza@ttu.edu>.