https://github.com/403errors/ai-docparser

An application framework built with current AI techniques to extract the values of specific predefined keys from a given PDF document. It also generates a document summary from the keys and values extracted along the way.
automation csv-export nlp pdf-files python3 regex reinforcement-learning spacy


# AI DocParser

**AI DocParser** is an AI-powered document parsing tool designed to extract, process, and analyze data from various document formats. It leverages state-of-the-art machine learning models to **automate** the processing of structured and unstructured data.

[![Kaggle](https://img.shields.io/badge/Kaggle-Visit%20Project-blue?logo=kaggle)](https://www.kaggle.com/code/sitama/ai-docparser)

## Example

### Input:
![input image](imgs/input_sample.png)

### Output:
![output image](imgs/output_sample.png)

## Features

- **Document Parsing**: Extract data from PDFs, images, and other document types.
- **AI-Powered Analysis**: Use machine learning models to understand and process text.
- **Customizable Workflows**: Easily adapt to different use cases by modifying parameters or integrating additional models.
- **Model Retraining**: Fine-tune the parsing model with custom datasets for improved accuracy.
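
As a rough illustration of extracting values for predefined keys and exporting them to CSV, here is a minimal sketch using only the Python standard library. It does not reproduce the project's model-based pipeline: it swaps in a plain `Key: value` line match, and the key names, `extract_values`, and `to_csv` are all hypothetical names chosen for this example.

```python
import csv
import io
import re

# Hypothetical keys, chosen only for this example.
KEYS = ["Invoice Number", "Customer Name", "Total Amount"]


def extract_values(text, keys):
    """Map each key to the text after 'Key:' on its line, or None if absent."""
    values = {}
    for key in keys:
        # Match "Key: value" on a single line, ignoring case and outer spaces.
        pattern = re.compile(
            rf"^\s*{re.escape(key)}\s*:\s*(.+?)\s*$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(text)
        values[key] = match.group(1) if match else None
    return values


def to_csv(values):
    """Serialize the key/value mapping as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["key", "value"])
    for key, value in values.items():
        # Missing keys are written as empty cells.
        writer.writerow([key, value if value is not None else ""])
    return buffer.getvalue()


sample = "Invoice Number: INV-001\nCustomer Name: Jane Doe\nNotes: none\n"
values = extract_values(sample, KEYS)
print(to_csv(values))
```

Keys that do not appear in the text are kept with an empty value, so every exported file has the same rows regardless of which fields a given document contains.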

## Tech Stack

- Uses spaCy for Named Entity Recognition and fitz (PyMuPDF) for text extraction, with a reported accuracy of 99.22%.
- Uses regular expressions to extract special field types, such as dates, from legal documents.
- Optimizes data extraction with reinforcement learning, achieving high performance on dynamic PDFs.
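
The regex-based date extraction mentioned above could look something like the following sketch. The patterns here are assumptions made for illustration (day-first numeric dates and written dates such as "5th March 2021"); the project's actual expressions are not shown in this README, and `extract_dates` is a hypothetical helper name.

```python
import re
from datetime import date

# English month names mapped to month numbers (1-12).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Assumed formats: day-first numeric dates (04/03/2024 or 04-03-2024)
# and written dates with an optional ordinal suffix (5th March 2021).
NUMERIC = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")
WRITTEN = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)?\s+(" + "|".join(MONTHS) + r"),?\s+(\d{4})\b"
)


def extract_dates(text):
    """Return the valid dates found in text, in order of appearance."""
    found = []
    for m in NUMERIC.finditer(text):
        found.append((m.start(), int(m.group(1)), int(m.group(2)), int(m.group(3))))
    for m in WRITTEN.finditer(text):
        found.append((m.start(), int(m.group(1)), MONTHS[m.group(2)], int(m.group(3))))

    dates = []
    for _, day, month, year in sorted(found, key=lambda item: item[0]):
        try:
            dates.append(date(year, month, day))
        except ValueError:
            # Skip strings that look like dates but are not valid (e.g. 31/02/2020).
            continue
    return dates


clause = "This agreement is made on 5th March 2021 and expires on 04/03/2024."
print(extract_dates(clause))
```

Validating each match with `datetime.date` filters out strings that only look like dates, which matters in legal text where clause numbers and reference codes can resemble numeric dates.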