{"id":18075793,"url":"https://github.com/priyasingh26/financial_document-data_extraction","last_synced_at":"2026-04-08T20:02:50.924Z","repository":{"id":259089904,"uuid":"868401946","full_name":"priyasingh26/Financial_Document-Data_Extraction","owner":"priyasingh26","description":"This project extracts key information from financial documents like invoices and receipts using text recognition. It processes images, classifies documents, and extracts data, which is then stored in a CSV file. The aim is to automate data collection from scanned documents, reducing manual work and increasing accuracy.","archived":false,"fork":false,"pushed_at":"2024-10-06T10:19:41.000Z","size":55628,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-19T17:47:49.948Z","etag":null,"topics":["data-extraction","numpy","ocr","pandas","pillow","preprocessing","pytesseract-ocr","python","sklearn","torch","transformers"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/priyasingh26.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-06T09:44:29.000Z","updated_at":"2024-10-06T10:23:37.000Z","dependencies_parsed_at":"2024-10-22T14:27:00.529Z","dependency_job_id":null,"html_url":"https://github.com/priyasingh26/Financial_Document-Data_Extraction","commit_stats":null,"previous_names":["priyasingh26/financial_document-data_extraction"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/priyasingh26/Financial_Document-Data_Extraction","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyasingh26%2FFinancial_Document-Data_Extraction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyasingh26%2FFinancial_Document-Data_Extraction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyasingh26%2FFinancial_Document-Data_Extraction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyasingh26%2FFinancial_Document-Data_Extraction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/priyasingh26","download_url":"https://codeload.github.com/priyasingh26/Financial_Document-Data_Extraction/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyasingh26%2FFinancial_Document-Data_Extraction/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31571601,"ico
n_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-extraction","numpy","ocr","pandas","pillow","preprocessing","pytesseract-ocr","python","sklearn","torch","transformers"],"created_at":"2024-10-31T11:07:29.939Z","updated_at":"2026-04-08T20:02:50.894Z","avatar_url":"https://github.com/priyasingh26.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Financial Document Information Extraction\n\n![Financial Document Information Extraction cover image](./Cover.png)\n\nThis project automates the extraction of key details from financial documents like invoices, receipts, and bills using Optical Character Recognition (OCR) and LayoutLM-based document classification.\n\n## Features\n\n- **Image Processing**: Handles multiple image formats (PNG, JPEG, TIFF, etc.) 
and converts them to grayscale.\n- **Optical Character Recognition (OCR)**: Extracts text from images using Tesseract OCR.\n- **Document Classification**: Classifies documents into predefined categories using LayoutLM, a state-of-the-art model for document understanding.\n- **Information Extraction**: Extracts important financial details such as numeric values and dates.\n- **CSV Storage**: Saves the extracted data, including document details, predicted labels, and accuracy, into a CSV file for easy review and analysis.\n\n## How It Works\n\n1. **Load Images**: The system reads images of financial documents from a specified directory.\n2. **OCR Process**: The images are converted to text using OCR.\n3. **Document Classification**: The extracted text is used to classify the document type.\n4. **Information Extraction**: Key financial data, like amounts and dates, are extracted from the text.\n5. **Store Data**: The extracted information is saved in a CSV file for further use.\n\n## Installation\n\n1. Clone the repository:\n   ```bash\n   git clone https://github.com/priyasingh26/Financial_Document-Data_Extraction.git\n   ```\n2. Install the required libraries:\n   ```bash\n   pip install -r requirements.txt\n   ```\n3. Ensure you have [Tesseract OCR](https://tesseract-ocr.github.io/tessdoc/Downloads.html) installed and properly configured.\n\n## Usage\n\n1. Place your financial document images inside the `archive` folder, organized by document type.\n2. Run the script:\n   ```bash\n   python main.py\n   ```\n3. 
After processing, check the `extracted_document_info.csv` file for the extracted data.\n\n## Output\n\nThe CSV file contains:\n\n- File names\n- True and predicted document labels\n- Extracted text details\n- Prediction accuracy\n\n## Technologies Used\n\n- Python\n- Tesseract OCR\n- LayoutLM (via Hugging Face Transformers)\n- OpenCV\n- Pandas\n\n## Contributing\n\n[Ronak Parmar](https://github.com/ronak-create)\n\n## License\n\nThis project is licensed under the Apache License 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpriyasingh26%2Ffinancial_document-data_extraction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpriyasingh26%2Ffinancial_document-data_extraction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpriyasingh26%2Ffinancial_document-data_extraction/lists"}