https://github.com/farhaj499/rag_with_chromadb
This project implements an Extractive Question Answering (EQA) system that extracts answers from a set of downloaded text files based on user queries.
- Host: GitHub
- URL: https://github.com/farhaj499/rag_with_chromadb
- Owner: Farhaj499
- License: mit
- Created: 2024-12-29T12:31:21.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-01-12T17:39:50.000Z (9 months ago)
- Last Synced: 2025-03-23T19:46:05.423Z (7 months ago)
- Topics: chromadb, eqa, extractive-question-answering, huggingface-transformers, langchain, large-language-models, python, rag, retrieval-augmented-generation, semantic-search, text-retrieval, vector-database
- Language: Jupyter Notebook
- Homepage:
- Size: 177 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Extractive Question Answering with ChromaDB and Gemini
This project implements an Extractive Question Answering (EQA) system that extracts answers from a set of downloaded text files based on user queries. It utilizes ChromaDB as a vector database to store document embeddings for efficient retrieval of relevant information. Gemini is then used to refine the extracted information, providing more relevant, concise, and human-like responses.
## Overview
The project follows these steps:
1. **Data Download:** The script downloads the compressed archive "new_articles.zip" from Dropbox using the provided link.
2. **Text Extraction:** The downloaded archive is unzipped, and all text files within are extracted.
3. **Text Preprocessing:** Each text file is preprocessed to clean and normalize the text content (optional).
4. **Embedding Generation:** Embeddings are created for each preprocessed text document using a chosen embedding model (e.g., Sentence Transformers).
5. **Data Storage in ChromaDB:** Text documents and their corresponding embeddings are stored in ChromaDB.
6. **Retrieval:** When a user asks a question, the system generates an embedding for the query and retrieves the most similar documents from ChromaDB using cosine similarity.
7. **Response Refinement with Gemini:** The retrieved text snippets are passed to the Gemini LLM, which refines the information to provide a more relevant, concise, and human-like response to the user's query (see the sketches below).
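A minimal sketch of the ingestion half of the pipeline (steps 1–5) is shown below. The Dropbox URL, file paths, collection name, and the `all-MiniLM-L6-v2` Sentence Transformers model are illustrative assumptions rather than values taken from the notebook; the cosine distance setting mirrors the similarity measure used at retrieval time.

```python
import os
import zipfile

import chromadb
import requests
from sentence_transformers import SentenceTransformer

DROPBOX_URL = "https://www.dropbox.com/..."  # placeholder: use the direct-download link provided with the project
ARCHIVE_PATH = "new_articles.zip"
EXTRACT_DIR = "new_articles"

# Step 1: download the compressed archive from Dropbox.
response = requests.get(DROPBOX_URL)
response.raise_for_status()
with open(ARCHIVE_PATH, "wb") as f:
    f.write(response.content)

# Step 2: unzip the archive and collect the text files.
with zipfile.ZipFile(ARCHIVE_PATH) as archive:
    archive.extractall(EXTRACT_DIR)

documents, ids = [], []
for name in sorted(os.listdir(EXTRACT_DIR)):
    if name.endswith(".txt"):
        with open(os.path.join(EXTRACT_DIR, name), encoding="utf-8") as f:
            # Step 3 (optional preprocessing) is reduced to a simple whitespace strip here.
            documents.append(f.read().strip())
            ids.append(name)

# Step 4: generate one embedding per document (embedding model choice is an assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents).tolist()

# Step 5: store documents and embeddings in a persistent ChromaDB collection,
# configured here for cosine similarity.
client = chromadb.PersistentClient(path="chroma_store")
collection = client.get_or_create_collection(
    name="news_articles", metadata={"hnsw:space": "cosine"}
)
collection.add(ids=ids, documents=documents, embeddings=embeddings)
```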
## Technologies Used
* ChromaDB: Vector database for storing and retrieving embeddings.
* Hugging Face Transformers (optional): For generating document embeddings.
* LangChain (optional): For streamlining document preprocessing and loading.
* Python: Programming language for implementation.
* Requests: To download the data from Dropbox.
* File manipulation libraries (e.g., os, zipfile): For handling file downloads and extraction.
* Gemini: Large Language Model for refining extracted information and generating human-like responses.
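Putting steps 6 and 7 together, retrieval plus Gemini refinement can be sketched as a small query function over the collection built above. The prompt wording, the `gemini-1.5-flash` model name, and the `GEMINI_API_KEY` environment variable are assumptions for illustration; the notebook itself may wire these pieces up differently (for example through LangChain helpers).

```python
import os

import chromadb
import google.generativeai as genai
from sentence_transformers import SentenceTransformer

# Reuse the collection and embedding model created during ingestion.
client = chromadb.PersistentClient(path="chroma_store")
collection = client.get_collection(name="news_articles")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumes the key is supplied via an environment variable
gemini = genai.GenerativeModel("gemini-1.5-flash")      # model name is an assumption


def answer(query: str, n_results: int = 3) -> str:
    # Step 6: embed the query and retrieve the most similar documents from ChromaDB.
    query_embedding = embedder.encode([query]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=n_results)
    context = "\n\n".join(results["documents"][0])

    # Step 7: pass the retrieved snippets to Gemini for a concise, refined answer.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return gemini.generate_content(prompt).text


print(answer("What are the main topics covered in the downloaded articles?"))
```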