Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/priyasingh26/financial_document-data_extraction
This project extracts key information from financial documents like invoices and receipts using text recognition. It processes images, classifies documents, and extracts data, which is then stored in a CSV file. The aim is to automate data collection from scanned documents, reducing manual work and increasing accuracy.
data-extraction numpy ocr pandas pillow preprocessing pytesseract-ocr python sklearn torch transformers
Last synced: 19 days ago
- Host: GitHub
- URL: https://github.com/priyasingh26/financial_document-data_extraction
- Owner: priyasingh26
- License: apache-2.0
- Created: 2024-10-06T09:44:29.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-10-06T10:19:41.000Z (about 1 month ago)
- Last Synced: 2024-10-31T11:07:18.216Z (19 days ago)
- Topics: data-extraction, numpy, ocr, pandas, pillow, preprocessing, pytesseract-ocr, python, sklearn, torch, transformers
- Language: Jupyter Notebook
- Homepage:
- Size: 53.1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Financial Document Information Extraction
![Financial Document Information Extraction cover image](./Cover.png)
This project automates the extraction of key details from financial documents like invoices, receipts, and bills using Optical Character Recognition (OCR) and LayoutLM-based document classification.
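As a rough illustration of what pulling "key details" out of OCR text can look like, the sketch below finds date-like and amount-like strings with regular expressions. The patterns and the `extract_fields` helper are assumptions made for this example; the README does not show the repository's actual extraction rules, which may differ.

```python
import re

# Hypothetical patterns for illustration only; not taken from the repository.
# Amounts: digits with optional thousands separators and two decimal places.
AMOUNT_RE = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2}")
# Dates: day/month/year style with "/" or "-", or ISO-style YYYY-MM-DD.
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b\d{4}-\d{2}-\d{2}\b")


def extract_fields(ocr_text: str) -> dict:
    """Return the date-like and amount-like strings found in OCR text."""
    dates = DATE_RE.findall(ocr_text)
    # Blank out dates first so their digits are not picked up as amounts.
    remainder = DATE_RE.sub(" ", ocr_text)
    amounts = AMOUNT_RE.findall(remainder)
    return {"dates": dates, "amounts": amounts}


# Made-up text standing in for the OCR output of an invoice image.
sample = "Invoice Date: 12/03/2024\nSubtotal 1,250.00\nTax 99.50\nTotal 1,349.50"
fields = extract_fields(sample)
print(fields)
```

Real OCR output is noisier than this sample, so patterns like these usually need tuning per document type.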
## Features
- **Image Processing**: Handles multiple image formats (PNG, JPEG, TIFF, etc.) and converts them to grayscale.
- **Optical Character Recognition (OCR)**: Extracts text from images using Tesseract OCR.
- **Document Classification**: Classifies documents into predefined categories using LayoutLM, a state-of-the-art model for document understanding.
- **Information Extraction**: Extracts important financial details such as numeric values and dates.
- **CSV Storage**: Saves the extracted data, including document details, predicted labels, and accuracy, into a CSV file for easy review and analysis.
## How It Works
1. **Load Images**: The system reads images of financial documents from a specified directory.
2. **OCR Process**: The images are converted to text using OCR.
3. **Document Classification**: The extracted text is used to classify the document type.
4. **Information Extraction**: Key financial data, like amounts and dates, are extracted from the text.
5. **Store Data**: The extracted information is saved in a CSV file for further use.
## Installation
1. Clone the repository:
```bash
git clone https://github.com/priyasingh26/Financial_Document-Data_Extraction.git
```
2. Install the required libraries:
```bash
pip install -r requirements.txt
```
3. Ensure you have [Tesseract OCR](https://tesseract-ocr.github.io/tessdoc/Downloads.html) installed and properly configured.
## Usage
1. Place your financial document images inside the `archive` folder, organized by document type.
2. Run the script:
```bash
python main.py
```
3. After processing, check the `extracted_document_info.csv` file for the extracted data.
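One way to review the results is to compare the true and predicted labels row by row. The sketch below does this with Python's standard `csv` module. The column names `true_label` and `predicted_label` and the `label_accuracy` helper are assumptions for illustration; check the header row of your own CSV and adjust them.

```python
import csv
import io


def label_accuracy(csv_file, true_col="true_label", pred_col="predicted_label"):
    """Fraction of rows whose predicted label equals the true label.

    The default column names are hypothetical; pass the ones your CSV uses.
    """
    rows = list(csv.DictReader(csv_file))
    if not rows:
        return 0.0
    correct = sum(1 for row in rows if row[true_col] == row[pred_col])
    return correct / len(rows)


# In-memory stand-in for extracted_document_info.csv, with made-up rows.
demo = io.StringIO(
    "file_name,true_label,predicted_label\n"
    "a.png,invoice,invoice\n"
    "b.png,receipt,invoice\n"
    "c.png,receipt,receipt\n"
    "d.png,invoice,invoice\n"
)
print(label_accuracy(demo))
```

To run it on real output, open the file with `open("extracted_document_info.csv", newline="")` and pass the file object, along with the actual column names, to `label_accuracy`.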
## Output
### CSV file containing:
- File names
- True and predicted document labels
- Extracted text details
- Prediction accuracy
### Technologies Used
- Python
- Tesseract OCR
- LayoutLM (via Hugging Face Transformers)
- OpenCV
- Pandas
### Contributing
[Ronak Parmar](https://github.com/ronak-create)
### License
This project is licensed under the MIT License.