{"id":24906272,"url":"https://github.com/julianvelandia/simpleraghuggingface","last_synced_at":"2025-10-16T18:30:42.948Z","repository":{"id":273281234,"uuid":"916931886","full_name":"julianVelandia/SimpleRAGHuggingFace","owner":"julianVelandia","description":"Designed to implement retrieval-augmented generation systems. It uses datasets from Hugging Face, vectorizes them, and allows fast queries based on cosine similarity.","archived":false,"fork":false,"pushed_at":"2025-01-25T19:15:47.000Z","size":11,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-25T19:16:02.556Z","etag":null,"topics":["dataset","embedings","huggingface","rag"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/julianVelandia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-15T03:01:09.000Z","updated_at":"2025-01-25T19:14:07.000Z","dependencies_parsed_at":"2025-01-25T19:16:03.497Z","dependency_job_id":null,"html_url":"https://github.com/julianVelandia/SimpleRAGHuggingFace","commit_stats":null,"previous_names":["julianvelandia/raggradeworksunaldataset","julianvelandia/simpleraghuggingface"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/julianVelandia%2FSimpleRAGHuggingFace","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/julianVelandia%2FSimpleRAGHuggingFace/tags","releases_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/julianVelandia%2FSimpleRAGHuggingFace/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/julianVelandia%2FSimpleRAGHuggingFace/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/julianVelandia","download_url":"https://codeload.github.com/julianVelandia/SimpleRAGHuggingFace/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":236738656,"owners_count":19196962,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","embedings","huggingface","rag"],"created_at":"2025-02-02T00:39:19.581Z","updated_at":"2025-10-16T18:30:42.939Z","avatar_url":"https://github.com/julianVelandia.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\r\n# Simple RAG HuggingFace\r\n\r\n## Description\r\nDesigned to implement retrieval-augmented generation systems. 
It uses datasets from Hugging Face, vectorizes them, and allows fast queries based on cosine similarity.\r\n![image](https://github.com/user-attachments/assets/ea271b48-376e-4496-a554-48ae915cecd4)\r\n\r\n## Installation\r\n\r\n```bash\r\npip install SimpleRAGHuggingFace\r\n```\r\n\r\n## Usage\r\n\r\n### Initial Setup\r\nDuring the first execution, the dataset is loaded and vectorized, and the embeddings are stored:\r\n\r\n```python\r\nfrom rag import Rag\r\n\r\nRAG_HF_DATASET = \"JulianVelandia/unal-repository-dataset-alternative-format\"\r\nrag = Rag(hf_dataset=RAG_HF_DATASET)\r\nquery = \"What is the lighting design, control, and beautification of the field at Alfonso López Stadium?\"\r\nresponse = rag.retrieval_augmented_generation(query)\r\nprint(response)\r\n```\r\n\r\nAfter the first run, the dataset can be queried by cosine similarity using the following parameters:\r\n\r\n```\r\n Parameters:\r\n        - query (str): The input question or statement to be processed.\r\n        - max_sections (int): Maximum number of context sections to retrieve (range: 1 to 10).\r\n        Higher values provide more context but may dilute relevance.\r\n        - threshold (float): Minimum similarity score for a section to be included (range: 0.0 to 1.0).\r\n        Higher values ensure stricter relevance.\r\n        - max_words (int, optional): Maximum number of words in the combined context (default: 1000).\r\n        Longer limits provide more detail but may reduce conciseness.\r\n\r\n        Returns:\r\n        - str: The combined query and relevant context, or just the query if no context is found.\r\n```\r\n\r\nThe initial setup generates:\r\n- **Original Database**: Stored in memory as a list of documents.\r\n- **Vectorized Database**: Saved as a `.npy` file in the `embeddings/` folder.\r\n\r\n### Query and Retrieval\r\nOnce the setup is complete, you can perform queries:\r\n\r\n```python\r\nquery = \"What is the lighting design, control, and beautification of the field at 
Alfonso López Stadium?\"\r\nresponse = rag.retrieval_augmented_generation(query)\r\nprint(response)\r\n```\r\n\r\nThe result will be the initial `prompt` combined with the most relevant sections of context:\r\n\r\n```\r\nWhat is the lighting design, control, and beautification of the field at Alfonso López Stadium?\r\n\r\nKeep in mind this context:\r\nLighting design ... Alfonso López Stadium, as well as the results obtained, understanding that a soccer team ...\r\n...\r\n```\r\n\r\n## Workflow\r\n\r\n1. **Setup (Preprocessing)**:\r\n   - Load the dataset from Hugging Face.\r\n   - Vectorize the documents using TF-IDF.\r\n   - Save the embeddings in `.npy` format.\r\n\r\n   ```plaintext\r\n   HF Dataset -\u003e Load -\u003e Vectorization -\u003e Embeddings (.npy)\r\n   ```\r\n\r\n2. **Querying**:\r\n   - Vectorize the prompt.\r\n   - Calculate cosine similarity between the prompt and the vectorized documents.\r\n   - Retrieve the most relevant sections.\r\n   - Combine the prompt with the retrieved context.\r\n\r\n   ```plaintext\r\n   Prompt -\u003e Vectorization -\u003e Cosine Similarity -\u003e Retrieval -\u003e Combined Context\r\n   ```\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjulianvelandia%2Fsimpleraghuggingface","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjulianvelandia%2Fsimpleraghuggingface","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjulianvelandia%2Fsimpleraghuggingface/lists"}