{"id":30219156,"url":"https://github.com/j4nn0/llm-rag","last_synced_at":"2025-08-14T07:48:17.475Z","repository":{"id":213714009,"uuid":"705258213","full_name":"J4NN0/llm-rag","owner":"J4NN0","description":"LLMs prompt augmentation with RAG by integrating external custom data from a variety of sources, allowing chat with such documents","archived":false,"fork":false,"pushed_at":"2024-07-22T13:57:36.000Z","size":129,"stargazers_count":20,"open_issues_count":0,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-08-10T06:35:57.991Z","etag":null,"topics":["chat-application","chatapp","chatbot","chatgpt","custom-data","llama","llama-index","llama2","llamacpp","llm","llm-apps","llm-framework","llms","local-llama","local-llm","mixtral","mixtral-8x7b","rag","rag-embeddings","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/J4NN0.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-15T14:09:37.000Z","updated_at":"2025-04-19T21:49:48.000Z","dependencies_parsed_at":"2023-12-27T16:25:49.272Z","dependency_job_id":"8ab02728-9185-4954-8d99-0855595aaeb6","html_url":"https://github.com/J4NN0/llm-rag","commit_stats":{"total_commits":69,"total_committers":2,"mean_commits":34.5,"dds":"0.23188405797101452","last_synced_commit":"dee2c3c0874aeda4cbd8c5bd14aee78a26956891"},"previous_names":["j4nn0/themis-ai","j4nn0/llama-index-rag"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/J4NN0/llm-rag","reposi
tory_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J4NN0%2Fllm-rag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J4NN0%2Fllm-rag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J4NN0%2Fllm-rag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J4NN0%2Fllm-rag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/J4NN0","download_url":"https://codeload.github.com/J4NN0/llm-rag/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J4NN0%2Fllm-rag/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270385318,"owners_count":24574544,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-14T02:00:10.309Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chat-application","chatapp","chatbot","chatgpt","custom-data","llama","llama-index","llama2","llamacpp","llm","llm-apps","llm-framework","llms","local-llama","local-llm","mixtral","mixtral-8x7b","rag","rag-embeddings","retrieval-augmented-generation"],"created_at":"2025-08-14T07:48:03.517Z","updated_at":"2025-08-14T07:48:17.453Z","avatar_url":"https://github.com/J4NN0.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# llm-rag\n\nThis repository provides documentation and resources for 
understanding the basic concepts behind Large Language Models (LLMs) and the process of augmenting LLM prompts with Retrieval Augmented Generation (RAG) by integrating external custom data from a variety of sources (e.g. text files, web pages, PDFs, etc.) using the [LlamaIndex](https://www.llamaindex.ai/) framework. This allows you to ask questions about such documents.\n\n# Table of Contents\n\n- [Retrieval Augmented Generation (RAG)](#retrieval-augmented-generation-rag)\n- [Environment Setup](#environment-setup)\n- [Ingest your data](#ingest-your-data)\n- [Chat with your documents](#chat-with-your-documents)\n- [Local LLM vs Cloud-based LLM](#local-llm-vs-cloud-based-llm)\n- [Quantization methods](#quantization-methods)\n- [Resources](#resources)\n\n# Retrieval Augmented Generation (RAG)\n\nLLMs are a type of artificial intelligence model designed to understand and generate human-like text based on the patterns and structures present in vast amounts of textual data. These models have become increasingly sophisticated thanks to advances in deep learning, particularly using transformer architectures.\n\nWhile LLMs are trained on large datasets, they lack knowledge of your specific data. Retrieval-Augmented Generation (RAG) bridges this gap by integrating your data. In RAG, your data is loaded and prepared for queries, or \"indexed\". User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response. 
For chatbot or agent development, mastering RAG techniques is essential for seamless data integration into your application.\n\nWithin RAG there are five key stages:\n- **Loading**: This involves acquiring your data from its source, whether it's stored in text files, PDFs, another website, a database, or an API.\n- **Indexing**: Involves generating vector embeddings and employing various metadata strategies to facilitate accurate retrieval of contextually relevant information.\n- **Storage**: After indexing, it is often beneficial to store the index and associated metadata to avoid the need for future reindexing.\n- **Retrieve**: With various indexing strategies available, you can use LLMs data structures for querying, using techniques such as sub-queries, multi-step queries, and hybrid strategies.\n- **Evaluation**: It provides objective metrics to measure the accuracy, fidelity, and speed of your responses to queries.\n\n# Environment Setup\n\n1. The project has been tested with Python `3.10` (version `3.10.11` to be exact). To check your Python version run\n\n       python3 --version\n\n   If you have a different one, you can download version `3.10.X` from the [Python releases archive](https://www.python.org/downloads/). \n\n2. Clone the repository\n\n       git clone https://github.com/J4NN0/llm-rag.git\n       cd llm-rag\n\n3. Install requirements\n\n       pip install -r requirements.txt\n\n4. Copy the `.sample.env` template into `.env` and source it however you like\n       \n       cp .sample.env .env\n\n5. 
Decide if you want to use a local LLM or OpenAI model (in case you don't know what to choose, refer to the sections [Local LLM vs Cloud-based LLM](#local-llm-vs-cloud-based-llm) and [Quantization methods](#quantization-methods) below)\n   - If you want to use a **local LLM**:\n     - Set `MODEL_TYPE` to the LLM you want to use between the supported ones:\n       - `LLAMA2-7B_Q4` - medium, balanced quality (7 billion parameters)\n       - `LLAMA2-7B_Q5` - large, very low-quality loss (7 billion parameters)\n       - `LLAMA2-13B_Q4` - medium, balanced quality (13 billion parameters)\n       - `LLAMA2-13B_Q5` - large, very low-quality loss (13 billion parameters)\n       - `MIXTRAL-7B_Q4` - medium, balanced quality (7 billion parameters)\n       - `MIXTRAL-7B_Q5` - large, very low-quality loss (7 billion parameters)\n       \n     Each downloaded model is cached in `~/Users/$USER/Library/Caches/llama_index` to avoid downloading it again.\n\n   - If you want to use **OpenAI model**:\n     - Set `MODEL_TYPE` to `DEFAULT`.\n     - Set `OPENAI_API_KEY` to your OpenAI API key. If you don't have one, you can get one from [platform.openai](https://platform.openai.com/api-keys).\n\n6. Optionally, you can update the following variables\n   - `LOGGING_LEVEL` to set the logging verbosity:\n     - Set to `DEBUG` for verbose output.\n     - Set to `INFO` for less verbose output.\n   - `INDEX_STORAGE` to set the path where the index is stored. By default, it is set to `./vector_store`.\n   - `DATA_DIR` to set the path where your custom documents are stored. By default, it is set to `./data`.\n\n# Ingest your data\n\nAdd all the files you want to chat with in the `data` folder. 
The following file types are supported:\n   - `.csv` - comma-separated values \n   - `.docx` - Microsoft Word \n   - `.epub` - EPUB ebook format \n   - `.hwp` - Hangul Word Processor \n   - `.ipynb` - Jupyter Notebook \n   - `.jpeg`, `.jpg` - JPEG image \n   - `.mbox` - MBOX email archive \n   - `.md` - Markdown \n   - `.mp3`, `.mp4` - audio and video \n   - `.pdf` - Portable Document Format \n   - `.png` - Portable Network Graphics \n   - `.ppt`, `.pptm`, `.pptx` - Microsoft PowerPoint\n   - `.json` - JSON file\n\nYou can also ingest data from Wikipedia pages. To do so, you can use `.wikipedia` file extension and insert as many Wikipedia page titles as you want in the file.\n   - Note that only the page name is required, not the full URL.\n   - For instance, for the Berlin Wikipedia page (at [wikipedia.org/wiki/Berlin](https://en.wikipedia.org/wiki/Berlin)), just insert `Berlin` in the file.\n\nIn case you want to connect it to more data sources, please refer to [Data Connectors for LlamaIndex](https://docs.llamaindex.ai/en/stable/api_reference/readers.html#classes), [LlamaHub](https://llamahub.ai/) or write your own data reader.\n\nTo ingest all the data, run the following command\n\n    python3 main.py --load-data\n\nOr just\n\n    python3 main.py -L\n\nIt will create a folder (named `vector_store` by default) containing the local vectorstore. Ingestion time depends on the size of the documents.\n\n# Chat with your documents\n\nTo start chatting with your documents, run the following command\n\n    python3 main.py --query-data\n\nOr just\n\n    python3 main.py -Q\n\nWait for the local vectorstore to be loaded, and then you can start chatting with your documents. Write your query and hit enter. 
The model consumes the prompt and prepares the answer (waiting time depends on your machine in case of local LLM, or OpenAI system load).\n\nFor instance, asking about myself based on the custom documents fed before:\n\n```\nQ: Why is Federico's nickname J4NN0?\n```\n\nThe model's answer should be:\n\n\u003e Federico's nickname \"J4NN0\" was given to him by a friend during one of his League of Legends games. The friend started calling him \"J4NN0\" because he was playing so well that it sounded like \"Janna,\" which is a character in the game. Federico found the nickname funny and decided to keep it as his nickname.\n\nSuch information - which is actually not true at all (it was proposed by GitHub Copilot and I accepted it) - is contained in [data/j4nn0.md](https://github.com/J4NN0/llm-rag/blob/main/data/j4nn0.md).\n\nType `exit` to finish chatting with the documents.\n\n# Local LLM vs Cloud-based LLM\n\nWhen it comes to running an LLM locally versus using a cloud-based service (such as [ChatGPT](https://chat.openai.com/)), the main differences often concern where the model is hosted and where the computation takes place. But privacy issues are also an important aspect of this discussion.\n\nRunning an LLM locally means that the model is deployed on your own device (e.g., your computer or a server you control). The data and computations associated with the model are confined to your local environment, providing a higher level of privacy as your data doesn't leave your device.\n\nUsing a cloud-based LLM typically involves interacting with a model hosted on a (cloud) server. When a request is sent, the input is processed by the model on the server side. 
This means your input data is temporarily stored and processed on external servers, raising privacy concerns as the service provider has access to the data you input, at least temporarily.\n\n# Quantization methods\n\nThe names of the quantization methods follow the naming convention: \"q\" + the number of bits + the variant used (in the *attention* and *feedforward* layers). The `S`, `M` and `L` suffixes refer to \"Small\", \"Medium\" and \"Large\" respectively. In the models above, the variant used is omitted as it is always the same i.e., `K_M`. The [lower the quantization](https://github.com/ggerganov/llama.cpp/pull/1684#issuecomment-1579252501), the lower the memory consumption but also the higher the perplexity loss (a metric indicating a model's proficiency in predicting the subsequent word based on the context provided).\n\nAs a rule of thumb, it is recommended to use `Q5_K_M` as it preserves most of the model's performance. Alternatively, you can use `Q4_K_M` to save some memory.\n\nDifferences between quantization methods:\n```\n 2  or  Q4_0   :  3.50G, +0.2499 ppl @ 7B - small, very high quality loss - legacy, prefer using Q3_K_M\n 3  or  Q4_1   :  3.90G, +0.1846 ppl @ 7B - small, substantial quality loss - legacy, prefer using Q3_K_L\n 8  or  Q5_0   :  4.30G, +0.0796 ppl @ 7B - medium, balanced quality - legacy, prefer using Q4_K_M\n 9  or  Q5_1   :  4.70G, +0.0415 ppl @ 7B - medium, low quality loss - legacy, prefer using Q5_K_M\n10  or  Q2_K   :  2.67G, +0.8698 ppl @ 7B - smallest, extreme quality loss - not recommended\n12  or  Q3_K   : alias for Q3_K_M\n11  or  Q3_K_S :  2.75G, +0.5505 ppl @ 7B - very small, very high quality loss\n12  or  Q3_K_M :  3.06G, +0.2437 ppl @ 7B - very small, very high quality loss\n13  or  Q3_K_L :  3.35G, +0.1803 ppl @ 7B - small, substantial quality loss\n15  or  Q4_K   : alias for Q4_K_M\n14  or  Q4_K_S :  3.56G, +0.1149 ppl @ 7B - small, significant quality loss\n15  or  Q4_K_M :  3.80G, +0.0535 ppl @ 7B - 
medium, balanced quality - *recommended*\n17  or  Q5_K   : alias for Q5_K_M\n16  or  Q5_K_S :  4.33G, +0.0353 ppl @ 7B - large, low quality loss - *recommended*\n17  or  Q5_K_M :  4.45G, +0.0142 ppl @ 7B - large, very low quality loss - *recommended*\n18  or  Q6_K   :  5.15G, +0.0044 ppl @ 7B - very large, extremely low quality loss\n 7  or  Q8_0   :  6.70G, +0.0004 ppl @ 7B - very large, extremely low quality loss - not recommended\n 1  or  F16    : 13.00G              @ 7B - extremely large, virtually no quality loss - not recommended\n 0  or  F32    : 26.00G              @ 7B - absolutely huge, lossless - not recommended\n```\n\n# Resources\n\n- [LlamaIndex Documentation](https://docs.llamaindex.ai/en/stable/index.html#)\n- [Large language models, explained with a minimum of math and jargon](https://seantrott.substack.com/p/large-language-models-explained)\n- [Building LLM applications for production](https://huyenchip.com/2023/04/11/llm-engineering.html)\n- [TheBloke - Hugging Face](https://huggingface.co/TheBloke)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fj4nn0%2Fllm-rag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fj4nn0%2Fllm-rag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fj4nn0%2Fllm-rag/lists"}