Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bangla-rag/porag
Fully Configurable RAG Pipeline for Bengali Language RAG Applications. Supports both Local and Huggingface Models, Built with Langchain.
- Host: GitHub
- URL: https://github.com/bangla-rag/porag
- Owner: Bangla-RAG
- License: MIT
- Created: 2024-06-07T09:45:02.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-08-16T17:26:38.000Z (3 months ago)
- Last Synced: 2024-09-27T04:04:03.493Z (about 2 months ago)
- Topics: ai, bengali, bengali-nlp, chromadb, langchain, llama3, llm, nlp, rag, transformers
- Language: Python
- Homepage:
- Size: 678 KB
- Stars: 36
- Watchers: 3
- Forks: 6
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
README
# PoRAG (পরাগ), Bangla Retrieval-Augmented Generation (RAG) Pipeline
![Banner](/banner.png)

[![LinkedIn: Abdullah Al Asif](https://img.shields.io/badge/LinkedIn-Abdullah%20Al%20Asif-blue)](https://www.linkedin.com/in/abdullahalasif-bd/)
[![LinkedIn: Hasan Ali Emon](https://img.shields.io/badge/LinkedIn-Hasan%20Ali%20Emon-blue)](https://www.linkedin.com/in/hassan-ali-emon/)

Welcome to the **Bangla Retrieval-Augmented Generation (RAG) Pipeline**! This repository provides a pipeline for interacting with Bengali text data using natural language.
## Use Cases
- Interact with your Bengali data in Bengali.
- Ask questions about your Bengali text and get answers.

## How It Works
- **LLM Framework:** [Transformers](https://huggingface.co/docs/transformers/index)
- **RAG Framework:** [Langchain](https://www.langchain.com/)
- **Chunking:** [Recursive Character Split](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/)
- **Vector Store:** [ChromaDB](https://www.trychroma.com/)
- **Data Ingestion:** Currently supports text (.txt) files only due to the lack of reliable Bengali PDF parsing tools.
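
The same stack can be assembled directly with LangChain. The snippet below is a minimal, illustrative sketch of the ingestion and retrieval side, not the repository's actual code; it assumes the default embedding model `l3cube-pune/bengali-sentence-similarity-sbert`, the documented chunking defaults, and `test.txt` as the corpus file.

```python
# Illustrative sketch of ingestion + retrieval; variable names are assumptions,
# not the repository's actual main.py.
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the Bengali .txt corpus (PDF is not supported, as noted above).
docs = TextLoader("test.txt", encoding="utf-8").load()

# Recursive character splitting with the documented default sizes.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=150)
chunks = splitter.split_documents(docs)

# Sentence-Transformers-compatible embedding model (768-dimensional vectors).
embeddings = HuggingFaceEmbeddings(
    model_name="l3cube-pune/bengali-sentence-similarity-sbert"
)

# Store chunks in ChromaDB and expose a retriever that returns k documents.
vectorstore = Chroma.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```
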
## Configurability

- **Customizable LLM Integration:** Supports Hugging Face or local LLMs compatible with Transformers.
- **Flexible Embedding:** Supports embedding models compatible with Sentence Transformers (embedding dimension: 768).
- **Hyperparameter Control:** Adjust `max_new_tokens`, `top_p`, `top_k`, `temperature`, `chunk_size`, `chunk_overlap`, and `k`.
- **Quantization Toggle:** Pass the `--quantization` flag to switch between model variants, including LoRA and 4-bit quantized models.
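
As background for the quantization toggle, loading a chat model in 4-bit with Transformers and bitsandbytes typically looks like the sketch below. This is an illustrative example of the technique; the exact way `main.py` handles `--quantization` may differ.

```python
# Illustrative sketch of loading the default chat model with optional
# 4-bit quantization via bitsandbytes; not the repository's exact code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "hassanaliemon/bn_rag_llama3-8b"  # default chat model
use_quantization = True  # corresponds to passing the --quantization flag

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

if use_quantization:
    # Load weights in 4-bit to reduce GPU memory usage.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, quantization_config=bnb_config, device_map="auto"
    )
else:
    # Half-precision load without quantization.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, device_map="auto"
    )
```

Four-bit loading trades a small amount of generation quality for a large reduction in GPU memory, which is what makes an 8B model practical on consumer hardware.
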
## Installation

1. **Install Python:** Download and install Python from [python.org](https://www.python.org/).
2. **Clone the Repository:**
```bash
git clone https://github.com/Bangla-RAG/PoRAG.git
cd PoRAG
```
3. **Install Required Libraries:**
```bash
pip install -r requirements.txt
```

Click to view example `requirements.txt`:

```txt
langchain==0.2.3
langchain-community==0.2.4
langchain-core==0.2.5
chromadb==0.5.0
accelerate==0.31.0
peft==0.11.1
transformers==4.40.1
bitsandbytes==0.41.3
sentence-transformers==3.0.1
rich==13.7.1
```

## Running the Pipeline
1. **Prepare Your Bangla Text Corpus:** Create a text file (e.g., `test.txt`) with the Bengali text you want to use.
2. **Run the RAG Pipeline:**
```bash
python main.py --text_path test.txt
```
3. **Interact with the System:** Type your question and press Enter to get a response based on the retrieved information.
## Example

```bash
আপনার প্রশ্ন: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কোথায়?
উত্তর: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কলকাতার জোড়াসাঁকোর 'ঠাকুরবাড়ি'তে।
```
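
Under the hood, this interaction boils down to a retrieve-then-generate loop. The sketch below is hypothetical and builds on the `retriever`, `model`, and `tokenizer` objects from the earlier sketches; it is not the repository's actual `main.py`, but it uses the documented default generation settings.

```python
# Hypothetical sketch of the question-answering loop; the real main.py
# may structure its prompt and generation differently.
from transformers import pipeline

# `model`, `tokenizer`, and `retriever` as constructed in the sketches above.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.6,
    top_k=2,
)

while True:
    question = input("আপনার প্রশ্ন: ").strip()
    if not question:
        break

    # Retrieve the k most relevant chunks and build a context-grounded prompt.
    context = "\n\n".join(d.page_content for d in retriever.invoke(question))
    prompt = f"প্রসঙ্গ:\n{context}\n\nপ্রশ্ন: {question}\nউত্তর:"

    answer = generator(prompt, return_full_text=False)[0]["generated_text"]
    print(f"উত্তর: {answer.strip()}")
```
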
## Parameters Description

You can pass these arguments and adjust their values on each run.

| Flag Name | Type | Description | Instructions |
| --- | --- | --- | --- |
| `chat_model` | str | The ID of the chat model. It can be either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide the local path to the model. The default value is `"hassanaliemon/bn_rag_llama3-8b"`. |
| `embed_model` | str | The ID of the embedding model. It can be either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide the local path to the model. The default value is `"l3cube-pune/bengali-sentence-similarity-sbert"`. |
| `k` | int | The number of documents to retrieve. | The default value is `4`. |
| `top_k` | int | The top_k parameter for the chat model. | The default value is `2`. |
| `top_p` | float | The top_p parameter for the chat model. | The default value is `0.6`. |
| `temperature` | float | The temperature parameter for the chat model. | The default value is `0.6`. |
| `max_new_tokens` | int | The maximum number of new tokens to generate. | The default value is `256`. |
| `chunk_size` | int | The chunk size for text splitting. | The default value is `500`. |
| `chunk_overlap` | int | The chunk overlap for text splitting. | The default value is `150`. |
| `text_path` | str | The path to the input text (.txt) file. | This is a required field. Provide the path to the text file you want to use. |
| `show_context` | bool | Whether to show the retrieved context or not. | Use the `--show_context` flag to enable this feature. |
| `quantization` | bool | Whether to enable 4-bit quantization or not. | Use the `--quantization` flag to enable this feature. |
| `hf_token` | str | Your Hugging Face API token. | The default value is `None`. Provide your Hugging Face API token if necessary. |
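
For reference, the table above maps naturally onto an `argparse` parser like the illustrative one below; the exact argument definitions in `main.py` may differ.

```python
# Illustrative argparse setup mirroring the flags documented above;
# the actual parser in main.py may differ in its details.
import argparse

parser = argparse.ArgumentParser(description="Bangla RAG pipeline (PoRAG)")
parser.add_argument("--chat_model", type=str, default="hassanaliemon/bn_rag_llama3-8b")
parser.add_argument("--embed_model", type=str, default="l3cube-pune/bengali-sentence-similarity-sbert")
parser.add_argument("--k", type=int, default=4)
parser.add_argument("--top_k", type=int, default=2)
parser.add_argument("--top_p", type=float, default=0.6)
parser.add_argument("--temperature", type=float, default=0.6)
parser.add_argument("--max_new_tokens", type=int, default=256)
parser.add_argument("--chunk_size", type=int, default=500)
parser.add_argument("--chunk_overlap", type=int, default=150)
parser.add_argument("--text_path", type=str, required=True)
parser.add_argument("--show_context", action="store_true")
parser.add_argument("--quantization", action="store_true")
parser.add_argument("--hf_token", type=str, default=None)

args = parser.parse_args()  # e.g. python main.py --text_path test.txt --quantization
```
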
## Key Milestones
- **Default LLM:** Trained a LLaMA-3 8B model `hassanaliemon/bn_rag_llama3-8b` for context-based QA.
- **Embedding Model:** Tested `sagorsarker/bangla-bert-base`, `csebuetnlp/banglabert`, and found `l3cube-pune/bengali-sentence-similarity-sbert` to be most effective.
- **Retrieval Pipeline:** Implemented Langchain Retrieval pipeline and tested with our fine-tuned LLM and embedding model.
- **Ingestion System:** Settled on text files after testing several PDF parsing solutions.
- **Question Answering Chat Loop:** Developed a multi-turn chat system for terminal testing.
- **Generation Configuration Control:** Attempted to use generation config in the LLM pipeline.
- **Model Testing:** Tested with the following models (quantized and LoRA versions):
1. [`asif00/bangla-llama`](https://huggingface.co/asif00/bangla-llama)
2. [`hassanaliemon/bn_rag_llama3-8b`](https://huggingface.co/hassanaliemon/bn_rag_llama3-8b)
3. [`asif00/mistral-bangla`](https://huggingface.co/asif00/mistral-bangla)
4. [`KillerShoaib/llama-3-8b-bangla-4bit`](https://huggingface.co/KillerShoaib/llama-3-8b-bangla-4bit)

## Limitations
- **PDF Parsing:** Currently, only text (.txt) files are supported due to the lack of reliable Bengali PDF parsing tools.
- **Quality of answers:** The quality of the answers depends heavily on your chosen LLM, embedding model, and Bengali text corpus.
- **Scarcity of pre-trained models:** There are currently no high-fidelity Bengali LLMs pre-trained for QA tasks, which makes it difficult to achieve impressive RAG performance. Overall performance may vary depending on the model used.

## Future Steps
- **PDF Parsing:** Develop a reliable Bengali-specific PDF parser.
- **User Interface:** Design a chat-like UI for easier interaction.
- **Chat History Management:** Implement a system for maintaining and accessing chat history.

## Contribution and Feedback
We welcome contributions! If you have suggestions, bug reports, or enhancements, please open an issue or submit a pull request.
### Top Contributors
- [![LinkedIn: Abdullah Al Asif](https://img.shields.io/badge/LinkedIn-Abdullah%20Al%20Asif-blue)](https://www.linkedin.com/in/abdullahalasif-bd/) [Abdullah Al Asif](https://github.com/asiff00)
- [![LinkedIn: Hasan Ali Emon](https://img.shields.io/badge/LinkedIn-Hasan%20Ali%20Emon-blue)](https://www.linkedin.com/in/hassan-ali-emon/) [Hasan Ali Emon](https://github.com/hassanaliemon)
## Disclaimer
This is a work-in-progress and may require further refinement. The results depend on the quality of your Bengali text corpus and the chosen models.
### References
1. [Transformers](https://huggingface.co/docs/transformers/index)
2. [Langchain](https://www.langchain.com/)
3. [ChromaDB](https://www.trychroma.com/)
4. [Sentence Transformers](https://www.sbert.net/)
5. [hassanaliemon/bn_rag_llama3-8b](https://huggingface.co/hassanaliemon/bn_rag_llama3-8b)
6. [l3cube-pune/bengali-sentence-similarity-sbert](https://huggingface.co/l3cube-pune/bengali-sentence-similarity-sbert)
7. [sagorsarker/bangla-bert-base](https://huggingface.co/sagorsarker/bangla-bert-base)
8. [csebuetnlp/banglabert](https://huggingface.co/csebuetnlp/banglabert)
9. [asif00/bangla-llama](https://huggingface.co/asif00/bangla-llama)
10. [KillerShoaib/llama-3-8b-bangla-4bit](https://huggingface.co/KillerShoaib/llama-3-8b-bangla-4bit)
11. [asif00/mistral-bangla](https://huggingface.co/asif00/mistral-bangla)