Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/sinha532/a-step-towards-genai

Building a RAG pipeline for a set of most Frequently Asked Questions(FAQ's), enhancing the abilities of LLM's providing most accurate and relevant answers , to which the LLM are not trained.
https://github.com/sinha532/a-step-towards-genai

Last synced: about 2 months ago
JSON representation

Building a RAG pipeline for a set of most Frequently Asked Questions(FAQ's), enhancing the abilities of LLM's providing most accurate and relevant answers , to which the LLM are not trained.

Awesome Lists containing this project

README

        

# RAG Pipeline Using Elasticsearch and Groq

## Project Overview
This project implements a Retrieval-Augmented Generation (RAG) pipeline leveraging Elasticsearch for document retrieval and Groq for text generation. The pipeline is designed to answer questions based on a dataset of frequently asked questions (FAQs). The project involves indexing the FAQ data into Elasticsearch, retrieving relevant documents based on user queries, and generating responses using a language model provided by Groq.

To get started with this project, you need to install the required Python packages. These include pandas for data manipulation, elasticsearch-serverless for interacting with Elasticsearch, and groq for text generation. You can install these dependencies using a requirements.txt file.

Copy code
- pandas==1.5.3
- elasticsearch-serverless==8.8.0
- groq==0.4.2
- accelerate==0.20.3
Install the dependencies using pip

## Data Preparation
The project starts with reading a dataset of FAQs from a JSON file named Amazon_sagemaker_Faq.txt. This file contains questions and answers which are loaded into a Pandas DataFrame. The DataFrame facilitates easy manipulation and access to the data.

## Indexing Documents
The FAQ data is indexed into an Elasticsearch instance hosted on AWS Elastic Cloud. A unique index name is generated using a random number. The Elasticsearch index is created, and each row in the DataFrame is indexed as a separate document. This setup allows efficient retrieval of relevant documents based on user queries.

## Querying Elasticsearch
To retrieve documents from Elasticsearch, the project defines a function that takes a user query as input and searches the indexed documents. The search results are processed to extract the relevant documents which match the user's query. These documents are then used in the text generation step.

## Text Generation
Text generation is handled by the Groq library. The project initializes a Groq client using an API key and defines a function to generate responses. This function takes the retrieved documents, combines their content, and sends it to the Groq model for generating a coherent answer.

## RAG Pipeline
The Retrieval-Augmented Generation pipeline combines the retrieval and generation steps. When a user submits a query, the pipeline first retrieves relevant documents from Elasticsearch. These documents are then passed to the Groq model, which generates an answer based on the content of the retrieved documents.

## Running the Project
To run the project, follow these steps:

Ensure all dependencies are installed by running pip install -r requirements.txt.

Index the FAQ data by executing the indexing function.

Input your question when prompted. The project will retrieve relevant documents from Elasticsearch and generate an answer using the Groq model.

The generated answer will be displayed as the output.

Note: Ensure you have the necessary API keys for both Elasticsearch and Groq configured correctly in the script.

## Contributing
Contributions to this project are welcome. If you have any suggestions, improvements, or bug fixes, please create a pull request or open an issue on the project's GitHub repository.

## License
This project is licensed under the MIT License. You are free to use, modify, and distribute this project in accordance with the terms of the license.