https://github.com/julianvelandia/simpleraghuggingface
Designed to implement retrieval-augmented generation systems. It uses datasets from Hugging Face, vectorizes them, and allows fast queries based on cosine similarity.
- Host: GitHub
- URL: https://github.com/julianvelandia/simpleraghuggingface
- Owner: julianVelandia
- Created: 2025-01-15T03:01:09.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-01-25T19:15:47.000Z (over 1 year ago)
- Last Synced: 2025-01-25T19:16:02.556Z (over 1 year ago)
- Topics: dataset, embedings, huggingface, rag
- Language: Python
- Homepage:
- Size: 10.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Simple RAG HuggingFace
## Description
Designed to implement retrieval-augmented generation systems. It uses datasets from Hugging Face, vectorizes them, and allows fast queries based on cosine similarity.

## Installation
```bash
pip install SimpleRAGHuggingFace
```
## Usage
### Initial Setup
On the first run, the dataset is loaded and vectorized, and the resulting embeddings are stored:
```python
from rag import Rag
RAG_HF_DATASET = "JulianVelandia/unal-repository-dataset-alternative-format"
rag = Rag(hf_dataset=RAG_HF_DATASET)
query = "What is the lighting design, control, and beautification of the field at Alfonso López Stadium?"
response = rag.retrieval_augmented_generation(query)
print(response)
```
After the first run, the dataset can be queried by cosine similarity using the following parameters:
```
Parameters:
- query (str): The input question or statement to be processed.
- max_sections (int): Maximum number of context sections to retrieve (range: 1 to 10).
  Higher values provide more context but may dilute relevance.
- threshold (float): Minimum similarity score for a section to be included (range: 0.0 to 1.0).
  Higher values ensure stricter relevance.
- max_words (int, optional): Maximum number of words in the combined context (default: 1000).
  Longer limits provide more detail but may reduce conciseness.

Returns:
- str: The combined query and relevant context, or just the query if no context is found.
```
The initial setup generates:
- **Original Database**: Stored in memory as a list of documents.
- **Vectorized Database**: Saved as a `.npy` file in the `embeddings/` folder.
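
As an illustration of the storage step, the sketch below writes a small matrix to a `.npy` file inside an `embeddings/` folder and reads it back with NumPy. It is not the library's own code: the helper names `save_embeddings` and `load_embeddings` and the file name `embeddings.npy` are assumptions made for this example, and a temporary directory stands in for the project folder.

```python
import os
import tempfile

import numpy as np


def save_embeddings(matrix, folder, name="embeddings.npy"):
    """Save a 2-D matrix to <folder>/<name> and return the file path.

    The file name is a hypothetical default, not taken from the library.
    """
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, name)
    np.save(path, matrix)
    return path


def load_embeddings(path):
    """Load a matrix previously written with np.save."""
    return np.load(path)


# Toy matrix: 3 documents x 4 features.
matrix = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
])

# A temporary directory keeps the example from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    path = save_embeddings(matrix, os.path.join(tmp, "embeddings"))
    restored = load_embeddings(path)

print(restored.shape)
```

Caching the matrix this way means later runs can skip vectorization and load the file directly.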
### Query and Retrieval
Once the setup is complete, you can perform queries:
```python
query = "What is the lighting design, control, and beautification of the field at Alfonso López Stadium?"
response = rag.retrieval_augmented_generation(query)
print(response)
```
The result is the original `query` combined with the most relevant context sections:
```
What is the lighting design, control, and beautification of the field at Alfonso López Stadium?
Keep in mind this context:
Lighting design ... Alfonso López Stadium, as well as the results obtained, understanding that a soccer team ...
...
```
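
To make the roles of `max_sections`, `threshold`, and `max_words` concrete, here is a minimal sketch of how scored sections could be filtered and merged into the output format shown above. The function `build_prompt` is hypothetical and not part of the library. The inclusive threshold comparison, the truncation at a word boundary, and joining sections with spaces are all assumptions of this sketch.

```python
def build_prompt(query, scored_sections, max_sections, threshold, max_words=1000):
    """Combine a query with its most relevant context sections.

    scored_sections is a list of (similarity, text) pairs. This is a
    hypothetical helper, not the library's implementation.
    """
    # Keep sections at or above the threshold (assumed inclusive), best first.
    relevant = sorted(
        (pair for pair in scored_sections if pair[0] >= threshold),
        key=lambda pair: pair[0],
        reverse=True,
    )[:max_sections]

    # With no relevant context, return the query unchanged.
    if not relevant:
        return query

    # Apply the word budget to the combined context (assumed truncation rule).
    words = []
    for _, text in relevant:
        words.extend(text.split())
    context = " ".join(words[:max_words])

    return f"{query}\nKeep in mind this context:\n{context}"


sections = [
    (0.80, "Lighting design of the stadium"),
    (0.05, "Unrelated text"),
    (0.40, "Field control systems"),
]
print(build_prompt("What is the lighting design?", sections,
                   max_sections=2, threshold=0.1))
```

Raising `threshold` or lowering `max_sections` shrinks the context, and a threshold above every score returns the bare query.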
## Workflow
1. **Setup (Preprocessing)**:
   - Load the dataset from Hugging Face.
   - Vectorize the documents using TF-IDF.
   - Save the embeddings in `.npy` format.

   ```plaintext
   HF Dataset -> Load -> Vectorization -> Embeddings (.npy)
   ```
2. **Querying**:
   - Vectorize the prompt.
   - Calculate cosine similarity between the prompt and the vectorized documents.
   - Retrieve the most relevant sections.
   - Combine the prompt with the retrieved context.

   ```plaintext
   Prompt -> Vectorization -> Cosine Similarity -> Retrieval -> Combined Context
   ```
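
The vectorization and similarity steps of this workflow can be sketched with the standard library alone. This is a hand-rolled TF-IDF with smoothed inverse document frequency, written for illustration only: the library's actual vectorizer, tokenization, and weighting may differ, and every function name here (`tokenize`, `fit_idf`, `vectorize`, `cosine`, `rank`) is an assumption of the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every term."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}


def vectorize(text, idf):
    """Return a sparse TF-IDF vector as a dict; unseen terms are ignored."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Score every document against the query, most similar first."""
    idf = fit_idf(docs)
    query_vec = vectorize(query, idf)
    scored = [(cosine(query_vec, vectorize(doc, idf)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "lighting design for the stadium field",
    "soil analysis of coffee crops",
    "control systems for stadium lighting",
]
ranked = rank("stadium lighting design", docs)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

Documents that share no terms with the query score 0.0, which is why a similarity threshold can drop them before the context is assembled.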