https://github.com/praadnya/rag-for-q-a
End-to-end Retrieval-Augmented Generation (RAG) pipeline for PDF-based question answering, utilizing FAISS for efficient vector search, LangChain for orchestration, and LLMs for generating accurate responses
- Host: GitHub
- URL: https://github.com/praadnya/rag-for-q-a
- Owner: Praadnya
- Created: 2024-08-21T03:13:10.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-08-21T03:26:14.000Z (5 months ago)
- Last Synced: 2024-11-10T09:09:30.200Z (3 months ago)
- Topics: question-answering, rag, textgeneration
- Language: Python
- Homepage:
- Size: 1.07 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# PDF Document Q&A with RAG
This project is a Streamlit-based application that allows users to upload PDF documents and ask questions about the content. It uses a Retrieval-Augmented Generation (RAG) pipeline to provide accurate answers by leveraging embeddings and a vector database. The pipeline includes steps for document reading, chunking, embedding, vector storage, and language model inference.
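For a concrete sense of the chunking step, here is a minimal sketch using LangChain's `RecursiveCharacterTextSplitter`; the chunk size and overlap are placeholder values, not necessarily what `adapter.py` uses.

```python
# Minimal chunking sketch (illustrative only; chunk_size/chunk_overlap are assumed
# values, not necessarily the project's settings).
# On newer LangChain versions the same class is importable from langchain_text_splitters.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split raw PDF text into overlapping chunks suitable for embedding."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    return splitter.split_text(text)
```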
## Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Technologies Used](#technologies-used)
- [Project Structure](#project-structure)

## Overview
This application reads a PDF document, extracts its text, and processes it into chunks suitable for embedding. The chunks are then embedded using a pre-trained model and stored in a vector database. When a user inputs a query, the application retrieves the most relevant chunks based on the query's embedding and generates an answer using a language model.
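As an illustration of the embedding and retrieval steps, the sketch below encodes chunks with a sentence-transformers model and searches a flat L2 FAISS index; the model name and `k` are assumptions rather than the project's actual configuration.

```python
# Illustrative embedding + FAISS retrieval sketch.
# The embedding model and k are assumptions; the project may use different choices.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def build_index(chunks: list[str]) -> faiss.IndexFlatL2:
    """Embed each chunk and store the vectors in a flat L2 FAISS index."""
    vectors = model.encode(chunks, convert_to_numpy=True).astype("float32")
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index

def retrieve(query: str, index: faiss.IndexFlatL2, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query embedding."""
    query_vec = model.encode([query], convert_to_numpy=True).astype("float32")
    _, ids = index.search(query_vec, k)
    return [chunks[i] for i in ids[0]]
```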
## Features
- **PDF Document Processing:** Extract text and paragraphs from PDF files.
- **Text Chunking:** Divide documents into manageable chunks for embedding.
- **Embeddings Generation:** Use a pre-trained Hugging Face model to generate embeddings.
- **Vector Database:** Store and retrieve document chunks using FAISS for efficient similarity search.
- **Language Model Inference:** Generate answers using a pre-trained language model based on retrieved chunks.
- **Streamlit Interface:** Easy-to-use web interface for uploading documents, entering queries, and displaying results.

## Technologies Used
- **Python**: Core programming language.
- **Streamlit**: For building the web application interface.
- **PyPDF2**: To extract text from PDF documents.
- **LangChain**: For text chunking and embeddings management.
- **FAISS**: For efficient similarity search in the vector database.
- **Hugging Face Transformers**: For embedding generation and language model inference.
- **DistilBART**: Pre-trained model used for generating answers.

## Project Structure
```plaintext
.
├── adapter.py # Handles text chunking and preparation for embedding.
├── embedder.py # Manages embedding generation using Hugging Face models.
├── reader.py # Extracts text and paragraphs from PDF documents.
├── vector_db.py # Manages the FAISS-based vector database for storing and retrieving embeddings.
├── app.py # Main Streamlit application file.
└── README.md # Project documentation.
```
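To show how these modules could fit together, here is a hypothetical wiring of the Streamlit flow: PyPDF2 extracts the text, the helper sketches from the earlier sections handle chunking, indexing, and retrieval, and a DistilBART checkpoint generates the answer. The checkpoint name, prompt format, and control flow are assumptions, not a copy of `app.py`.

```python
# Hypothetical wiring of the app (not the actual app.py); the DistilBART checkpoint
# and prompt format below are assumptions based on the README's technology list.
# chunk_text, build_index, and retrieve refer to the sketches shown earlier.
import streamlit as st
from PyPDF2 import PdfReader
from transformers import pipeline

generator = pipeline("text2text-generation", model="sshleifer/distilbart-cnn-12-6")

st.title("PDF Document Q&A")
uploaded = st.file_uploader("Upload a PDF", type="pdf")
question = st.text_input("Ask a question about the document")

if uploaded and question:
    # 1. Read: extract raw text from every page of the uploaded PDF.
    text = "\n".join(page.extract_text() or "" for page in PdfReader(uploaded).pages)
    # 2. Chunk, embed, and index the text, then retrieve the most relevant chunks.
    chunks = chunk_text(text)
    index = build_index(chunks)
    context = "\n".join(retrieve(question, index, chunks))
    # 3. Generate an answer conditioned on the retrieved context.
    prompt = f"question: {question} context: {context}"
    answer = generator(prompt, max_length=128)[0]["generated_text"]
    st.write(answer)
```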