{"id":27736770,"url":"https://github.com/umbertogriffo/rag-chatbot","last_synced_at":"2025-04-28T14:32:59.140Z","repository":{"id":176934557,"uuid":"659723393","full_name":"umbertogriffo/rag-chatbot","owner":"umbertogriffo","description":"RAG (Retrieval-augmented generation) ChatBot that provides answers based on contextual information extracted from a collection of Markdown files.","archived":false,"fork":false,"pushed_at":"2025-04-26T10:35:42.000Z","size":23619,"stargazers_count":251,"open_issues_count":1,"forks_count":62,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-26T11:24:11.976Z","etag":null,"topics":["chatbot","chromadb","gpu","lamacpp","llama3","llm","rag","streamlit","vector-database"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/umbertogriffo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-06-28T12:26:05.000Z","updated_at":"2025-04-26T10:35:46.000Z","dependencies_parsed_at":"2023-12-22T18:35:58.958Z","dependency_job_id":null,"html_url":"https://github.com/umbertogriffo/rag-chatbot","commit_stats":null,"previous_names":["umbertogriffo/contextual-chatbot-gpt4all","umbertogriffo/rag-chatbot"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/umbertogriffo%2Frag-chatbot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/umbertogriffo%2Frag-chatbot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/umbertogriffo%2Frag-chatbot/releases","m
anifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/umbertogriffo%2Frag-chatbot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/umbertogriffo","download_url":"https://codeload.github.com/umbertogriffo/rag-chatbot/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251330115,"owners_count":21572230,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","chromadb","gpu","lamacpp","llama3","llm","rag","streamlit","vector-database"],"created_at":"2025-04-28T14:30:37.340Z","updated_at":"2025-04-28T14:32:59.110Z","avatar_url":"https://github.com/umbertogriffo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RAG (Retrieval-augmented generation) ChatBot\n\n[![CI](https://github.com/umbertogriffo/rag-chatbot/workflows/CI/badge.svg)](https://github.com/umbertogriffo/rag-chatbot/actions/workflows/ci.yaml)\n[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)\n[![Code style: Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)\n\n\u003e [!IMPORTANT]\n\u003e Disclaimer:\n\u003e The code has been tested on:\n\u003e   * `Ubuntu 22.04.2 LTS` running on a Lenovo Legion 5 Pro with twenty `12th Gen Intel® Core™ i7-12700H` and\n      an `NVIDIA GeForce RTX 3060`.\n\u003e   * `MacOS Sonoma 14.3.1` running on a MacBook Pro M1 (2020).\n\u003e\n\u003e If you are 
using another Operating System or different hardware, and you can't load the models, please\n\u003e take a look at the official `llama-cpp-python`\n\u003e GitHub [issues](https://github.com/abetlen/llama-cpp-python/issues).\n\n\u003e [!WARNING]\n\u003e - `llama-cpp-python` doesn't use the `GPU` on `M1` if you are running an `x86` version of `Python`. More\n    info [here](https://github.com/abetlen/llama-cpp-python/issues/756#issuecomment-1870324323).\n\u003e - It's important to note that the large language model sometimes generates hallucinations or false information.\n\n## Table of contents\n\n- [Introduction](#introduction)\n- [Prerequisites](#prerequisites)\n    - [Install Poetry](#install-poetry)\n- [Bootstrap Environment](#bootstrap-environment)\n    - [How to use the make file](#how-to-use-the-make-file)\n- [Using the Open-Source Models Locally](#using-the-open-source-models-locally)\n    - [Supported Models](#supported-models)\n- [Supported Response Synthesis strategies](#supported-response-synthesis-strategies)\n- [Example Data](#example-data)\n- [Build the memory index](#build-the-memory-index)\n- [Run the Chatbot](#run-the-chatbot)\n- [Run the RAG Chatbot](#run-the-rag-chatbot)\n- [How to debug the Streamlit app on Pycharm](#how-to-debug-the-streamlit-app-on-pycharm)\n- [References](#references)\n\n## Introduction\n\nThis project combines the power\nof [Llama.cpp](https://github.com/abetlen/llama-cpp-python), [Chroma](https://github.com/chroma-core/chroma)\nand [Streamlit](https://discuss.streamlit.io/) to build:\n\n* a Conversation-aware Chatbot (ChatGPT-like experience).\n* a RAG (Retrieval-augmented generation) ChatBot.\n\nThe RAG Chatbot works by taking a collection of Markdown files as input and, when asked a question, provides the\ncorresponding answer\nbased on the context provided by those files.\n\n![rag-chatbot-architecture-1.png](images/rag-chatbot-architecture-1.png)\n\n\u003e [!NOTE]\n\u003e We decided to grab and refactor the 
`RecursiveCharacterTextSplitter` class from `LangChain` to effectively chunk\n\u003e Markdown files without adding LangChain as a dependency.\n\nThe `Memory Builder` component of the project loads Markdown pages from the `docs` folder.\nIt then divides these pages into smaller sections, calculates the embeddings (a numerical representation) of these\nsections with the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)\n`sentence-transformers` model, and saves them in an embedding database called [Chroma](https://github.com/chroma-core/chroma)\nfor later use.\n\nWhen a user asks a question, the RAG ChatBot retrieves the most relevant sections from the embedding database.\nBecause the original question is not always well suited for retrieval, we first prompt an LLM to rewrite the\nquestion, then conduct retrieval-augmented reading.\nThe most relevant sections are then used as context to generate the final answer using a local large language model (LLM).\nAdditionally, the chatbot is designed to remember previous interactions. 
It saves the chat history and considers the\nrelevant context from previous conversations to provide more accurate answers.\n\nTo deal with context overflows, we implemented three approaches:\n\n* `Create And Refine the Context`: synthesize a response sequentially through all retrieved contents.\n    * ![create-and-refine-the-context.png](images/create-and-refine-the-context.png)\n* `Hierarchical Summarization of Context`: generate an answer for each relevant section independently, and then\n  hierarchically combine the answers.\n    * ![hierarchical-summarization.png](images/hierarchical-summarization.png)\n* `Async Hierarchical Summarization of Context`: a parallelized version of the Hierarchical Summarization of Context, which\n  leads to big speedups in response synthesis.\n\n## Prerequisites\n\n* Python 3.10+\n* GPU supporting CUDA 12.1+\n* Poetry 1.7.0\n\n### Install Poetry\n\nInstall Poetry with the official installer by following\nthis [link](https://python-poetry.org/docs/#installing-with-the-official-installer).\n\nYou must use the currently adopted version of Poetry\ndefined [here](https://github.com/umbertogriffo/rag-chatbot/blob/main/version/poetry).\n\nIf Poetry is already installed but is not the right version, you can downgrade (or upgrade) it with:\n\n```\npoetry self update \u003cversion\u003e\n```\n\n## Bootstrap Environment\n\nTo easily install the dependencies, we created a make file.\n\n### How to use the make file\n\n\u003e [!IMPORTANT]\n\u003e Run `Setup` as your init command (or after `Clean`).\n\n* Check: ```make check```\n    * Use it to check that `which pip3` and `which python3` point to the right path.\n* Setup:\n    * Setup with NVIDIA CUDA acceleration: ```make setup_cuda```\n        * Creates an environment and installs all dependencies with NVIDIA CUDA acceleration.\n    * Setup with Metal GPU acceleration: ```make setup_metal```\n        * Creates an environment and installs all dependencies with Metal GPU acceleration for 
macOS systems only.\n* Update: ```make update```\n    * Updates the environment and installs all updated dependencies.\n* Tidy up the code: ```make tidy```\n    * Runs Ruff check and format.\n* Clean: ```make clean```\n    * Removes the environment and all cached files.\n* Test: ```make test```\n    * Runs all tests.\n    * Uses [pytest](https://pypi.org/project/pytest/)\n\n## Using the Open-Source Models Locally\n\nWe utilize the open-source library [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a binding\nfor [llama-cpp](https://github.com/ggerganov/llama.cpp),\nallowing us to utilize it within a Python environment.\n`llama-cpp` serves as a C++ backend designed to work efficiently with transformer-based models.\nRunning these LLMs at full precision on a local PC is impractical due to their large number of parameters (~7 billion).\nThis library enables us to run them on either a `CPU` or a `GPU`.\nAdditionally, we use quantization with 4-bit precision to reduce the number of bits required to represent the model weights.\nThe quantized models are stored in [GGML/GGUF](https://medium.com/@phillipgimmi/what-is-gguf-and-ggml-e364834d241c)\nformat.\n\n### Supported Models\n\n| 🤖 Model                                   | Supported | Model Size | Max Context Window | Notes and link to the model card                                                                                                                                     |\n|--------------------------------------------|-----------|------------|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `llama-3.2` Meta Llama 3.2 Instruct        | ✅         | 1B         | 128k               | Optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF)                                            |\n| `llama-3.2` Meta Llama 3.2 
Instruct        | ✅         | 3B         | 128k               | Optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF)                                            |\n| `llama-3.1` Meta Llama 3.1 Instruct        | ✅         | 8B         | 128k               | **Recommended model** [Card](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF)                                                                       |\n| `openchat-3.6` - OpenChat 3.6              | ✅         | 8B         | 8192               | [Card](https://huggingface.co/bartowski/openchat-3.6-8b-20240522-GGUF)                                                                                               |\n| `openchat-3.5` - OpenChat 3.5              | ✅         | 7B         | 8192               | [Card](https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF)                                                                                                       |\n| `starling` Starling Beta                   | ✅         | 7B         | 8192               | Is trained from `Openchat-3.5-0106`. 
It's recommended if you prefer more verbosity over OpenChat - [Card](https://huggingface.co/bartowski/Starling-LM-7B-beta-GGUF) |\n| `phi-3.5` Phi-3.5 Mini  Instruct           | ✅         | 3.8B       | 128k               | [Card](https://huggingface.co/MaziyarPanahi/Phi-3.5-mini-instruct-GGUF)                                                                                              |\n| `stablelm-zephyr` StableLM Zephyr OpenOrca | ✅         | 3B         | 4096               | [Card](https://huggingface.co/TheBloke/stablelm-zephyr-3b-GGUF)                                                                                                      |\n\n## Supported Response Synthesis strategies\n\n| ✨ Response Synthesis strategy                                           | Supported | Notes |\n|-------------------------------------------------------------------------|-----------|-------|\n| `create-and-refine` Create and Refine                                   | ✅         |       |\n| `tree-summarization` Tree Summarization                                 | ✅         |       |\n| `async-tree-summarization` - **Recommended** - Async Tree Summarization | ✅         |       |\n\n## Example Data\n\nYou could download some Markdown pages from\nthe [Blendle Employee Handbook](https://blendle.notion.site/Blendle-s-Employee-Handbook-7692ffe24f07450785f093b94bbe1a09)\nand put them under `docs`.\n\n## Build the memory index\n\nRun:\n\n```shell\npython chatbot/memory_builder.py --chunk-size 1000 --chunk-overlap 50\n```\n\n## Run the Chatbot\n\nTo interact with a GUI type:\n\n```shell\nstreamlit run chatbot/chatbot_app.py -- --model llama-3.1 --max-new-tokens 1024\n```\n\n![conversation-aware-chatbot.gif](images/conversation-aware-chatbot.gif)\n\n## Run the RAG Chatbot\n\nTo interact with a GUI type:\n\n```shell\nstreamlit run chatbot/rag_chatbot_app.py -- --model llama-3.1 --k 2 --synthesis-strategy 
async-tree-summarization\n```\n\n![rag_chatbot_example.gif](images%2Frag_chatbot_example.gif)\n\n## How to debug the Streamlit app on Pycharm\n\n![debug_streamlit.png](images/debug_streamlit.png)\n\n## References\n\n* Large Language Models (LLMs):\n    * [LLMs as a repository of vector programs](https://fchollet.substack.com/p/how-i-think-about-llm-prompt-engineering)\n    * [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/)\n    * [Calculating GPU memory for serving LLMs](https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/)\n    * [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c)\n    * [Uncensor any LLM with abliteration](https://huggingface.co/blog/mlabonne/abliteration)\n    * [Understanding Multimodal LLMs](https://www.linkedin.com/comm/pulse/understanding-multimodal-llms-sebastian-raschka-phd-t7h5c)\n    * [Direct preference optimization (DPO): Complete overview](https://www.superannotate.com/blog/direct-preference-optimization-dpo)\n* LLM Frameworks:\n    * llama.cpp:\n        * [llama.cpp](https://github.com/ggerganov/llama.cpp)\n        * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)\n    * Ollama:\n        * [Ollama](https://github.com/ollama/ollama/tree/main)\n        * [Ollama Python Library](https://github.com/ollama/ollama-python/tree/main)\n        * [On the architecture of ollama](https://blog.inoki.cc/2024/04/15/Ollama/)\n        * [Analysis of Ollama Architecture and Conversation Processing Flow for AI LLM Tool](https://medium.com/@rifewang/analysis-of-ollama-architecture-and-conversation-processing-flow-for-ai-llm-tool-ead4b9f40975)\n        * [How to Customize Ollama’s Storage Directory](https://medium.com/@chhaybunsy/unleash-your-machine-learning-models-how-to-customize-ollamas-storage-directory-c9ea1ea2961a#:~:text=By%20default%2C%20Ollama%20saves%20its,making%20predictions%20or%20further%20training)\n        * Use 
[CodeGPT](https://plugins.jetbrains.com/plugin/21056-codegpt) to access self-hosted models from Ollama for\n          a code assistant in PyCharm. More info [here](https://docs.codegpt.ee/providers/local/ollama).\n    * DeepEval - A framework for evaluating LLMs:\n      * https://github.com/confident-ai/deepeval\n* LLM Datasets:\n    * [High-quality datasets](https://github.com/mlabonne/llm-datasets)\n* Agents:\n    * [Agents](https://huyenchip.com//2025/01/07/agents.html)\n    * [Building effective agents](https://www.anthropic.com/research/building-effective-agents)\n* Agent Frameworks:\n    * [PydanticAI](https://ai.pydantic.dev/)\n    * [Atomic Agents](https://github.com/BrainBlend-AI/atomic-agents)\n      * [Want to Build AI Agents? Tired of LangChain, CrewAI, AutoGen \u0026 Other AI Agent Frameworks?](https://ai.gopubby.com/want-to-build-ai-agents-c83ab4535411)\n* Embeddings:\n    * To find the list of the best embedding models for the retrieval task in your language, go to\n      the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)\n    * [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)\n        * This is a `sentence-transformers` model: it maps sentences \u0026 paragraphs to a 384-dimensional dense vector\n          space (Max Tokens 512) and can be used for tasks like classification or semantic search.\n* Vector Databases:\n    * Indexing algorithms:\n        * There are many algorithms for building indexes to optimize vector search. Most vector databases\n          implement `Hierarchical Navigable Small World (HNSW)` and/or `Inverted File Index (IVF)`. 
Here are some great\n          articles explaining them, and the trade-off between `speed`, `memory` and `quality`:\n            * [Nearest Neighbor Indexes for Similarity Search](https://www.pinecone.io/learn/series/faiss/vector-indexes/)\n            * [Hierarchical Navigable Small World (HNSW)](https://towardsdatascience.com/similarity-search-part-4-hierarchical-navigable-small-world-hnsw-2aad4fe87d37)\n            * [From NVIDIA - Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT](https://developer.nvidia.com/blog/accelerating-vector-search-using-gpu-powered-indexes-with-rapids-raft/)\n            * [From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/)\n            * \u003e PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the\n              expense of speed.\n    * [Chroma](https://www.trychroma.com/)\n        * [chroma](https://github.com/chroma-core/chroma)\n    * [Qdrant](https://qdrant.tech/):\n        * [Qdrant Internals: Immutable Data Structures](https://qdrant.tech/articles/immutable-data-structures/)\n        * [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#)\n* Retrieval Augmented Generation (RAG):\n    * [Building A Generative AI Platform](https://huyenchip.com/2024/07/25/genai-platform.html)\n    * [Rewrite-Retrieve-Read](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb)\n        * \u003e Because the original query can not be always optimal to retrieve for the LLM, especially in the real world,\n          we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading.\n    * [Rerank](https://txt.cohere.com/rag-chatbot/#implement-reranking)\n    * [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#)\n    * 
[Conversational awareness](https://langstream.ai/2023/10/13/rag-chatbot-with-conversation/)\n    * [RAG is Dead, Again?](https://jina.ai/news/rag-is-dead-again/)\n* Chatbot UI:\n    * [Streamlit](https://discuss.streamlit.io/):\n        * [Build a basic LLM chat app](https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps#build-a-chatgpt-like-app)\n        * [Layouts and Containers](https://docs.streamlit.io/library/api-reference/layout)\n        * [st.chat_message](https://docs.streamlit.io/library/api-reference/chat/st.chat_message)\n        * [Add statefulness to apps](https://docs.streamlit.io/library/advanced-features/session-state)\n            * [Why session state is not persisting between refresh?](https://discuss.streamlit.io/t/why-session-state-is-not-persisting-between-refresh/32020)\n        * [st.cache_resource](https://docs.streamlit.io/library/api-reference/performance/st.cache_resource)\n        * [Handling External Command Line Arguments](https://github.com/streamlit/streamlit/issues/337)\n    * [Open WebUI](https://github.com/open-webui/open-webui)\n        * [Running AI Locally Using Ollama on Ubuntu Linux](https://itsfoss.com/ollama-setup-linux/)\n* Text Processing and Cleaning:\n    * [clean-text](https://github.com/jfilter/clean-text/tree/main)\n    * [Fast Semantic Text Deduplication](https://github.com/MinishLab/semhash)\n* Inspirational Open Source Repositories:\n    * [lit-gpt](https://github.com/Lightning-AI/lit-gpt)\n    * [api-for-open-llm](https://github.com/xusenlinzy/api-for-open-llm)\n    * [AnythingLLM](https://useanything.com/)\n    * [FastServe - Serve Llama-cpp with FastAPI](https://github.com/aniketmaurya/fastserve)\n    * 
[Alpaca](https://github.com/Jeffser/Alpaca?tab=readme-ov-file)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fumbertogriffo%2Frag-chatbot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fumbertogriffo%2Frag-chatbot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fumbertogriffo%2Frag-chatbot/lists"}