https://github.com/alinababer/document-analysis-identification-with-rag-vector-database-and-mistral-llm
This Document Analysis pipeline is a comprehensive document analysis system, designed to automate the processing and analysis of documents from acquisition to consumption. It integrates advanced machine learning & AI models like RAG (Retrieval Augmented Generation) & Mistral LLM to efficiently extract, match, enrich, process document
https://github.com/alinababer/document-analysis-identification-with-rag-vector-database-and-mistral-llm
document-analysis document-analysis-recognition document-pipeline document-uploader llm mistral paddleocr python rag tesseract
Last synced: about 2 months ago
JSON representation
This Document Analysis pipeline is a comprehensive document analysis system, designed to automate the processing and analysis of documents from acquisition to consumption. It integrates advanced machine learning & AI models like RAG (Retrieval Augmented Generation) & Mistral LLM to efficiently extract, match, enrich, process document
- Host: GitHub
- URL: https://github.com/alinababer/document-analysis-identification-with-rag-vector-database-and-mistral-llm
- Owner: AlinaBaber
- Created: 2024-09-17T07:56:26.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-10-29T12:10:26.000Z (7 months ago)
- Last Synced: 2025-02-09T00:47:03.681Z (3 months ago)
- Topics: document-analysis, document-analysis-recognition, document-pipeline, document-uploader, llm, mistral, paddleocr, python, rag, tesseract
- Language: Python
- Homepage:
- Size: 14.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Document Analysis and identification with RAG Vector Database and Mistral LLM
## Overview
This project is a comprehensive document analysis system, designed to automate the processing and analysis of documents from acquisition to consumption. It integrates advanced machine learning and AI models like **RAG (Retrieval Augmented Generation) Vector Databases** and **Mistral LLM** to efficiently extract, match, enrich, and process document information. The pipeline covers:
- **Document Acquisition**
- **Document OCR (Optical Character Recognition)**
- **Document Preprocessing**
- **Document Information and Title Extraction**
- **Document Matching and Enrichment**
- **Document Consumption and Uploading**## Features
- **Document Acquisition**: Automated gathering of documents from specified sources.
- **OCR Integration**: Converts scanned images or PDFs into machine-readable text using Optical Character Recognition (OCR).
- **Preprocessing**: Cleans and prepares documents for downstream analysis (removal of noise, handling of incomplete documents).
- **Information Extraction**: Title and relevant document information extraction using **RAG vector database** and **Mistral LLM**.
- **Document Matching & Enrichment**: Leverages AI models to match and enrich documents with metadata and other contextual information.
- **Consumption & Uploading**: Final processed document output, ready for consumption, and automatically uploaded to a specified location.## Architecture
1. **Document Acquisition**:
- Connects to a variety of data sources (local, cloud storage, web scraping).
2. **Document OCR**:
- Uses popular OCR libraries (e.g., Tesseract) for converting images or scanned PDFs into editable text.
3. **Preprocessing**:
- Text cleaning, noise reduction, normalization, and segmentation.
4. **Information Extraction**:
- Uses **RAG** and **Mistral LLM** for extracting meaningful information like document title, summary, and other details.
5. **Matching and Enrichment**:
- Matches documents with existing records and enriches them with metadata using contextual AI models.
6. **Document Uploading**:
- Final step to upload or store the enriched documents in a specified repository or database.## Getting Started
### Prerequisites
- Python 3.x
- Libraries: `pandas`, `numpy`, `spacy`, `torch`, `transformers`, `tesseract`, `opencv`, `sentence-transformers`
- Access to a RAG-compatible Vector Database
- **Mistral LLM** API Key### Installation
1. Clone the repository:
```bash
git clone https://github.com/username/Document-Analysis-Pipeline-LLM-RAG.git
```2. Navigate to the project directory:
```bash
cd Document-Analysis-Pipeline-LLM-RAG
```3. Install the required dependencies:
```bash
pip install -r requirements.txt
```### Usage
1. **Configure Sources:**
- Define the document sources for acquisition (local storage, cloud storage, etc.).
2. **Run the Pipeline**:```bash
python main.py --source [path_to_documents] --output [output_path]
```3. **OCR & Preprocessing**:
- The system will automatically run OCR on documents and clean the text.4. **Information Extraction**:
- Extract document titles and key information using the integrated **Mistral LLM** and **RAG** models.5. **Enrichment & Matching**:
- The processed documents will be enriched with additional metadata and matched to existing datasets if applicable.6. **Final Output**:
- The final output is ready for upload or further analysis.### Configuration
- **Document Sources**: Define paths to documents in `config.json`.
- **Vector Database**: Set up the connection details for the **RAG Vector Database** in `rag_config.yaml`.
- **Mistral LLM**: Provide API keys for **Mistral LLM** integration in `ml_config.yaml`.### Contributor
### License
This project is licensed under the MIT License.