Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/403errors/ai-docparser
An application framework that uses AI techniques to extract the values of specific predefined keys from a given PDF document. It also generates a document summary from the extracted keys and values.
automation csv-export nlp pdf-files python3 regex reinforcement-learning spacy
Last synced: 15 days ago
- Host: GitHub
- URL: https://github.com/403errors/ai-docparser
- Owner: 403errors
- Created: 2024-11-30T18:53:01.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-01-18T16:27:12.000Z (17 days ago)
- Last Synced: 2025-01-18T16:43:27.628Z (17 days ago)
- Topics: automation, csv-export, nlp, pdf-files, python3, regex, reinforcement-learning, spacy
- Language: Jupyter Notebook
- Homepage: https://www.kaggle.com/code/sitama/ai-docparser
- Size: 20.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# AI DocParser
**AI DocParser** is an AI-powered document parsing tool designed to extract, process, and analyze data from various document formats. It leverages state-of-the-art machine learning models to **automate** the processing of structured and unstructured data.
[![Kaggle](https://img.shields.io/badge/Kaggle-Visit%20Project-blue?logo=kaggle)](https://www.kaggle.com/code/sitama/ai-docparser)
## Example
### Input:
![input image](imgs/input_sample.png)
### Output:
![output image](imgs/output_sample.png)
## Features
- **Document Parsing**: Extract data from PDFs, images, and other document types.
- **AI-Powered Analysis**: Use machine learning models to understand and process text.
- **Customizable Workflows**: Easily adapt to different use cases by modifying parameters or integrating additional models.
- **Model Retraining**: Fine-tune the parsing model with custom datasets for improved accuracy.
## Tech Stack
- Uses spaCy for Named Entity Recognition and fitz (PyMuPDF) for text extraction, with a reported accuracy of 99.22%.
- Uses regular expressions (RegEx) to extract special value types, such as dates, from legal documents.
- Optimizes data extraction with reinforcement learning to achieve high performance on dynamic PDFs.
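As a rough illustration of the RegEx step, the sketch below pulls values for predefined keys out of already-extracted text and writes them as CSV. It is not the project's code: the key names, the patterns, the sample sentence, and the helpers `extract_values` and `to_csv` are all hypothetical, and it assumes the PDF text has already been obtained (for example with fitz).

```python
import csv
import io
import re

# Hypothetical keys and patterns; the project's real ones are not shown in this README.
KEY_PATTERNS = {
    # Numeric dates such as 12/03/2021 or 1-2-2020.
    "agreement_date": re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{4})\b"),
    # A run of capitalized words following the word "between".
    "party_name": re.compile(r"between\s+([A-Z][\w&.]*(?:\s+[A-Z][\w&.]*)*)"),
}


def extract_values(text):
    """Return the first match for each predefined key, or None when absent."""
    values = {}
    for key, pattern in KEY_PATTERNS.items():
        match = pattern.search(text)
        values[key] = match.group(1) if match else None
    return values


def to_csv(rows):
    """Serialize a list of key/value dicts to CSV text, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(KEY_PATTERNS), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for text extracted from a PDF page.
sample_text = "This agreement is made on 12/03/2021 between Acme Corp and the tenant."
row = extract_values(sample_text)
print(to_csv([row]))
```

Keeping the patterns in a single dictionary means a new key can be supported by adding one entry, which fits the customizable-workflow goal described under Features.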