https://github.com/farhaj499/rag_with_chromadb
This project implements an Extractive Question Answering (EQA) system that extracts answers from a set of downloaded text files based on user queries.
- Host: GitHub
- URL: https://github.com/farhaj499/rag_with_chromadb
- Owner: Farhaj499
- License: mit
- Created: 2024-12-29T12:31:21.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-01-12T17:39:50.000Z (9 months ago)
- Last Synced: 2025-03-23T19:46:05.423Z (7 months ago)
- Topics: chromadb, eqa, extractive-question-answering, huggingface-transformers, langchain, large-language-models, python, rag, retrieval-augmented-generation, semantic-search, text-retrieval, vector-database
- Language: Jupyter Notebook
- Homepage:
- Size: 177 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Extractive Question Answering with ChromaDB and Gemini
This project implements an Extractive Question Answering (EQA) system that extracts answers from a set of downloaded text files based on user queries. It utilizes ChromaDB as a vector database to store document embeddings for efficient retrieval of relevant information. Gemini is then used to refine the extracted information, providing more relevant, concise, and human-like responses.
## Overview
The project follows these steps:
1. **Data Download:** The script downloads the compressed archive "new_articles.zip" from Dropbox using the provided link.
2. **Text Extraction:** The downloaded archive is unzipped, and all text files within are extracted.
3. **Text Preprocessing:** Each text file is preprocessed to clean and normalize the text content (optional).
4. **Embedding Generation:** Embeddings are created for each preprocessed text document using a chosen embedding model (e.g., Sentence Transformers).
5. **Data Storage in ChromaDB:** Text documents and their corresponding embeddings are stored in ChromaDB.
6. **Retrieval:** When a user asks a question, the system generates an embedding for the query and retrieves the most similar documents from ChromaDB using cosine similarity.
7. **Response Refinement with Gemini:** The retrieved text snippets are passed to the Gemini LLM, which refines the information to provide a more relevant, concise, and human-like response to the user's query (see the sketches below).
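A minimal sketch of the ingestion half of the pipeline (steps 1–5) is shown below. The Dropbox URL, file paths, collection name, and the `all-MiniLM-L6-v2` Sentence Transformers model are illustrative assumptions rather than values taken from the notebook; the cosine distance setting mirrors the similarity measure used at retrieval time.

```python
import os
import zipfile

import chromadb
import requests
from sentence_transformers import SentenceTransformer

DROPBOX_URL = "https://www.dropbox.com/..."  # placeholder: use the direct-download link provided with the project
ARCHIVE_PATH = "new_articles.zip"
EXTRACT_DIR = "new_articles"

# Step 1: download the compressed archive from Dropbox.
response = requests.get(DROPBOX_URL)
response.raise_for_status()
with open(ARCHIVE_PATH, "wb") as f:
    f.write(response.content)

# Step 2: unzip the archive and collect the text files.
with zipfile.ZipFile(ARCHIVE_PATH) as archive:
    archive.extractall(EXTRACT_DIR)

documents, ids = [], []
for name in sorted(os.listdir(EXTRACT_DIR)):
    if name.endswith(".txt"):
        with open(os.path.join(EXTRACT_DIR, name), encoding="utf-8") as f:
            # Step 3 (optional preprocessing) is reduced to a simple whitespace strip here.
            documents.append(f.read().strip())
            ids.append(name)

# Step 4: generate one embedding per document (embedding model choice is an assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents).tolist()

# Step 5: store documents and embeddings in a persistent ChromaDB collection,
# configured here for cosine similarity.
client = chromadb.PersistentClient(path="chroma_store")
collection = client.get_or_create_collection(
    name="news_articles", metadata={"hnsw:space": "cosine"}
)
collection.add(ids=ids, documents=documents, embeddings=embeddings)
```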
## Technologies Used
* ChromaDB: Vector database for storing and retrieving embeddings.
* Hugging Face Transformers (optional): For generating document embeddings.
* LangChain (optional): For streamlining document preprocessing and loading.
* Python: Programming language for implementation.
* Requests: To download the data from Dropbox.
* File manipulation libraries (e.g., os, zipfile): For handling file downloads and extraction.
* Gemini: Large Language Model for refining extracted information and generating human-like responses.
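Putting steps 6 and 7 together, retrieval plus Gemini refinement can be sketched as a small query function over the collection built above. The prompt wording, the `gemini-1.5-flash` model name, and the `GEMINI_API_KEY` environment variable are assumptions for illustration; the notebook itself may wire these pieces up differently (for example through LangChain helpers).

```python
import os

import chromadb
import google.generativeai as genai
from sentence_transformers import SentenceTransformer

# Reuse the collection and embedding model created during ingestion.
client = chromadb.PersistentClient(path="chroma_store")
collection = client.get_collection(name="news_articles")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumes the key is supplied via an environment variable
gemini = genai.GenerativeModel("gemini-1.5-flash")      # model name is an assumption


def answer(query: str, n_results: int = 3) -> str:
    # Step 6: embed the query and retrieve the most similar documents from ChromaDB.
    query_embedding = embedder.encode([query]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=n_results)
    context = "\n\n".join(results["documents"][0])

    # Step 7: pass the retrieved snippets to Gemini for a concise, refined answer.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return gemini.generate_content(prompt).text


print(answer("What are the main topics covered in the downloaded articles?"))
```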