{"id":23676058,"url":"https://github.com/bessouat40/raglight","last_synced_at":"2026-03-04T18:05:29.426Z","repository":{"id":270133368,"uuid":"902570217","full_name":"Bessouat40/RAGLight","owner":"Bessouat40","description":"RAGLight is a lightweight and modular Python library for implementing Retrieval-Augmented Generation (RAG), Agentic RAG and RAT (Retrieval augmented thinking)..","archived":false,"fork":false,"pushed_at":"2025-03-24T13:53:55.000Z","size":12750,"stargazers_count":23,"open_issues_count":7,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-24T22:40:47.847Z","etag":null,"topics":["agent","agentic-ai","agentic-rag","agentic-workflow","artificial-intelligence","automation","data-science","embeddings","framework","huggingface","inference","llm","lmstudio","mistral-api","mistralai","ollama","rag","retrieval-augmented","retrieval-augmented-generation","vector-database"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/raglight/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Bessouat40.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-12T20:37:11.000Z","updated_at":"2025-03-24T05:28:08.000Z","dependencies_parsed_at":"2025-01-22T12:23:48.918Z","dependency_job_id":"009dafcf-af0d-4eac-9b3e-6b6e634b8e9e","html_url":"https://github.com/Bessouat40/RAGLight","commit_stats":null,"previous_names":["bessouat40/rag-example"],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/Bessouat40%2FRAGLight","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bessouat40%2FRAGLight/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bessouat40%2FRAGLight/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bessouat40%2FRAGLight/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Bessouat40","download_url":"https://codeload.github.com/Bessouat40/RAGLight/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248325285,"owners_count":21084901,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","agentic-ai","agentic-rag","agentic-workflow","artificial-intelligence","automation","data-science","embeddings","framework","huggingface","inference","llm","lmstudio","mistral-api","mistralai","ollama","rag","retrieval-augmented","retrieval-augmented-generation","vector-database"],"created_at":"2024-12-29T14:41:21.803Z","updated_at":"2026-03-04T18:05:29.418Z","avatar_url":"https://github.com/Bessouat40.png","language":"Python","funding_links":[],"categories":["NLP"],"sub_categories":[],"readme":"# RAGLight\n\n![License](https://img.shields.io/github/license/Bessouat40/RAGLight)\n[![Downloads](https://static.pepy.tech/personalized-badge/raglight?period=total\u0026units=international_system\u0026left_color=grey\u0026right_color=red\u0026left_text=Downloads)](https://pepy.tech/projects/raglight)\n[![Run 
Test](https://github.com/Bessouat40/RAGLight/actions/workflows/test.yml/badge.svg)](https://github.com/Bessouat40/RAGLight/actions/workflows/test.yml)\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg alt=\"RAGLight\" height=\"200px\" src=\"./media/raglight.png\"\u003e\n\u003c/div\u003e\n\n**RAGLight** is a lightweight and modular Python library for implementing **Retrieval-Augmented Generation (RAG)**. It enhances the capabilities of Large Language Models (LLMs) by combining document retrieval with natural language inference.\n\nDesigned for simplicity and flexibility, RAGLight provides modular components to easily integrate various LLMs, embeddings, and vector stores, making it an ideal tool for building context-aware AI solutions.\n\n---\n\n## 📚 Table of Contents\n\n- [Requirements](#⚠️-requirements)\n\n- [Features](#features)\n\n- [Import library](#import-library-🛠️)\n\n- [Chat with Your Documents Instantly With CLI](#chat-with-your-documents-instantly-with-cli-💬)\n\n  - [Ignore Folders Feature](#ignore-folders-feature-🚫)\n  - [Ignore Folders in Configuration Classes](#ignore-folders-in-configuration-classes-🚫)\n\n- [Deploy as a REST API (raglight serve)](#deploy-as-a-rest-api-raglight-serve-🌐)\n\n  - [Start the server](#start-the-server)\n  - [Endpoints](#endpoints)\n  - [Configuration via environment variables](#configuration-via-environment-variables)\n  - [Deploy with Docker Compose](#deploy-with-docker-compose)\n\n- [Environment Variables](#environment-variables)\n\n- [Providers and Databases](#providers-and-databases)\n\n  - [LLM](#llm)\n  - [Embeddings](#embeddings)\n  - [Vector Store](#vector-store)\n\n- [Quick Start](#quick-start-🚀)\n\n  - [Knowledge Base](#knowledge-base)\n  - [RAG](#rag)\n  - [Agentic RAG](#agentic-rag)\n  - [MCP Integration](#mcp-integration)\n  - [Use Custom Pipeline](#use-custom-pipeline)\n  - [Override Default Processors](#override-default-processors)\n  - [Hybrid Search](#hybrid-search-bm25--semantic--rrf-)\n\n- [Use RAGLight 
with Docker](#use-raglight-with-docker)\n\n  - [Build your image](#build-your-image)\n  - [Run your image](#run-your-image)\n\n---\n\n\u003e ## ⚠️ Requirements\n\u003e\n\u003e RAGLight currently supports:\n\u003e\n\u003e - Ollama\n\u003e - Google\n\u003e - LMStudio\n\u003e - vLLM\n\u003e - OpenAI API\n\u003e - Mistral API\n\u003e\n\u003e If you use LMStudio, you need to have the model you want to use loaded in LMStudio.\n\n## Features\n\n- **Embeddings Model Integration**: Plug in your preferred embedding models (e.g., HuggingFace **all-MiniLM-L6-v2**) for compact and efficient vector embeddings.\n- **LLM Agnostic**: Seamlessly integrates with LLMs from different providers (Ollama, LMStudio, vLLM, OpenAI, Mistral, and Google).\n- **RAG Pipeline**: Combines document retrieval and language generation in a unified workflow.\n- **Agentic RAG Pipeline**: Use an agent to improve your RAG performance.\n- 🔌 **MCP Integration**: Add external tool capabilities (e.g. code execution, database access) via MCP servers.\n- **Flexible Document Support**: Ingest and index various document types (e.g., PDF, TXT, DOCX, Python, JavaScript, ...).\n- **Extensible Architecture**: Easily swap vector stores, embedding models, or LLMs to suit your needs.\n- 🔍 **Hybrid Search (BM25 + Semantic + RRF)**: Combine keyword-based BM25 retrieval with dense vector search using Reciprocal Rank Fusion for best-of-both-worlds results.\n\n---\n\n## Import library 🛠️\n\nTo install the library, run:\n\n```bash\npip install raglight\n```\n\n---\n\n## Chat with Your Documents Instantly With CLI 💬\n\nFor the quickest and easiest way to get started, RAGLight provides an interactive command-line wizard. 
It will guide you through every step, from selecting your documents to chatting with them, without writing a single line of Python.\nPrerequisite: Ensure you have a local LLM service like Ollama running.\n\nJust run this one command in your terminal:\n\n```bash\nraglight chat\n```\n\nYou can also launch the Agentic RAG wizard with:\n\n```bash\nraglight agentic-chat\n```\n\nThe wizard will guide you through the setup process. Here is what it looks like:\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg alt=\"RAGLight\" src=\"./media/cli.png\"\u003e\n\u003c/div\u003e\n\nThe wizard will ask you for:\n\n- 📂 Data Source: The path to your local folder containing the documents.\n- 🚫 Ignore Folders: Configure which folders to exclude during indexing (e.g., `.venv`, `node_modules`, `__pycache__`).\n- 💾 Vector Database: Where to store the indexed data and what to name it.\n- 🧠 Embeddings Model: Which model to use for understanding your documents.\n- 🤖 Language Model (LLM): Which LLM to use for generating answers.\n\nAfter configuration, it will automatically index your documents and start a chat session.\n\n### Ignore Folders Feature 🚫\n\nRAGLight automatically excludes common directories that shouldn't be indexed, such as:\n\n- Virtual environments (`.venv`, `venv`, `env`)\n- Node.js dependencies (`node_modules`)\n- Python cache files (`__pycache__`)\n- Build artifacts (`build`, `dist`, `target`)\n- IDE files (`.vscode`, `.idea`)\n- And many more...\n\nYou can customize this list during the CLI setup or use the default configuration. 
This ensures that only relevant code and documentation are indexed, improving performance and reducing noise in your search results.\n\n### Ignore Folders in Configuration Classes 🚫\n\nThe ignore folders feature is also available in all configuration classes, allowing you to specify which directories to exclude during indexing:\n\n- **RAGConfig**: Use `ignore_folders` parameter to exclude folders during RAG pipeline indexing\n- **AgenticRAGConfig**: Use `ignore_folders` parameter to exclude folders during AgenticRAG pipeline indexing\n- **VectorStoreConfig**: Use `ignore_folders` parameter to exclude folders during vector store operations\n\nAll configuration classes use `Settings.DEFAULT_IGNORE_FOLDERS` as the default value, but you can override this with your custom list:\n\n```python\n# Example: Custom ignore folders for any configuration\ncustom_ignore_folders = [\n    \".venv\",\n    \"venv\",\n    \"node_modules\",\n    \"__pycache__\",\n    \".git\",\n    \"build\",\n    \"dist\",\n    \"temp_files\",  # Your custom folders\n    \"cache\"\n]\n\n# Use in any configuration class\nconfig = RAGConfig(\n    llm=Settings.DEFAULT_LLM,\n    provider=Settings.OLLAMA,\n    ignore_folders=custom_ignore_folders  # Override default\n)\n```\n\nSee the complete example in [examples/ignore_folders_config_example.py](examples/ignore_folders_config_example.py) for all configuration types.\n\n---\n\n## Deploy as a REST API (raglight serve) 🌐\n\n`raglight serve` starts a **FastAPI** server configured entirely via environment variables — no Python code required.\n\n### Start the server\n\n```bash\nraglight serve\n```\n\nOptions :\n\n```\n--host    Host to bind (default: 0.0.0.0)\n--port    Port to listen on (default: 8000)\n--reload  Enable auto-reload for development (default: false)\n--workers Number of worker processes (default: 1)\n```\n\nExample :\n\n```bash\nRAGLIGHT_LLM_MODEL=mistral-small-latest \\\nRAGLIGHT_LLM_PROVIDER=Mistral \\\nraglight serve --port 8080\n```\n\n### 
Endpoints\n\n| Method | Path | Body | Response |\n|---|---|---|---|\n| `GET` | `/health` | — | `{\"status\": \"ok\"}` |\n| `POST` | `/generate` | `{\"question\": \"...\"}` | `{\"answer\": \"...\"}` |\n| `POST` | `/ingest` | `{\"data_path\": \"...\", \"file_paths\": [...], \"github_url\": \"...\", \"github_branch\": \"main\"}` | `{\"message\": \"...\"}` |\n| `POST` | `/ingest/upload` | `multipart/form-data`, field `files` (one or more files) | `{\"message\": \"...\"}` |\n| `GET` | `/collections` | — | `{\"collections\": [...]}` |\n\nThe interactive API documentation (Swagger UI) is automatically available at `http://localhost:8000/docs`.\n\n#### Examples with curl\n\n```bash\n# Health check\ncurl http://localhost:8000/health\n\n# Ask a question\ncurl -X POST http://localhost:8000/generate \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"question\": \"What is RAGLight?\"}'\n\n# Ingest a local folder\ncurl -X POST http://localhost:8000/ingest \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"data_path\": \"./my_documents\"}'\n\n# Ingest a GitHub repository\ncurl -X POST http://localhost:8000/ingest \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"github_url\": \"https://github.com/Bessouat40/RAGLight\", \"github_branch\": \"main\"}'\n\n# Upload files directly (multipart)\ncurl -X POST http://localhost:8000/ingest/upload \\\n  -F \"files=@./rapport.pdf\" \\\n  -F \"files=@./notes.txt\"\n\n# List collections\ncurl http://localhost:8000/collections\n```\n\n### Configuration via environment variables\n\nAll server settings are read from `RAGLIGHT_*` environment variables. 
Copy `examples/serve_example/.env.example` to `.env` and adjust the values.\n\n| Variable | Default | Description |\n|---|---|---|\n| `RAGLIGHT_LLM_MODEL` | `llama3` | LLM model name |\n| `RAGLIGHT_LLM_PROVIDER` | `Ollama` | LLM provider (`Ollama`, `Mistral`, `OpenAI`, `LmStudio`, `GoogleGemini`) |\n| `RAGLIGHT_LLM_API_BASE` | `http://localhost:11434` | LLM API base URL |\n| `RAGLIGHT_EMBEDDINGS_MODEL` | `all-MiniLM-L6-v2` | Embeddings model name |\n| `RAGLIGHT_EMBEDDINGS_PROVIDER` | `HuggingFace` | Embeddings provider (`HuggingFace`, `Ollama`, `OpenAI`, `GoogleGemini`) |\n| `RAGLIGHT_EMBEDDINGS_API_BASE` | `http://localhost:11434` | Embeddings API base URL |\n| `RAGLIGHT_PERSIST_DIR` | `./raglight_db` | Local ChromaDB persistence directory |\n| `RAGLIGHT_COLLECTION` | `default` | ChromaDB collection name |\n| `RAGLIGHT_K` | `5` | Number of documents retrieved per query |\n| `RAGLIGHT_SYSTEM_PROMPT` | *(default prompt)* | Custom system prompt for the LLM |\n| `RAGLIGHT_CHROMA_HOST` | — | Remote Chroma host (leave unset for local storage) |\n| `RAGLIGHT_CHROMA_PORT` | — | Remote Chroma port |\n\n### Deploy with Docker Compose\n\nThe quickest way to deploy in production :\n\n```bash\ncd examples/serve_example\ncp .env.example .env   # edit values as needed\ndocker-compose up\n```\n\nThe `docker-compose.yml` uses `extra_hosts: host.docker.internal:host-gateway` so the container can reach an Ollama instance running on the host machine.\n\n---\n\n## Environment Variables\n\nYou can set several environment variables to change **RAGLight** settings :\n\n**Provider credentials \u0026 URLs**\n\n- `MISTRAL_API_KEY` if you want to use Mistral API\n- `OLLAMA_CLIENT_URL` if you have a custom Ollama URL\n- `LMSTUDIO_CLIENT` if you have a custom LMStudio URL\n- `OPENAI_CLIENT_URL` if you have a custom OpenAI URL or vLLM URL\n- `OPENAI_API_KEY` if you need an OpenAI key\n- `GEMINI_API_KEY` if you need a Google Gemini API key\n\n**REST API server (`raglight serve`)**\n\nSee the 
full list in the [Configuration via environment variables](#configuration-via-environment-variables) section above.\n\n## Providers and databases\n\n### LLM\n\nFor LLM inference, you can use these providers:\n\n- LMStudio (`Settings.LMSTUDIO`)\n- Ollama (`Settings.OLLAMA`)\n- Mistral API (`Settings.MISTRAL`)\n- vLLM (`Settings.VLLM`)\n- OpenAI (`Settings.OPENAI`)\n- Google (`Settings.GOOGLE_GEMINI`)\n\n### Embeddings\n\nFor embedding models, you can use these providers:\n\n- HuggingFace (`Settings.HUGGINGFACE`)\n- Ollama (`Settings.OLLAMA`)\n- vLLM (`Settings.VLLM`)\n- OpenAI (`Settings.OPENAI`)\n- Google (`Settings.GOOGLE_GEMINI`)\n\n### Vector Store\n\nFor your vector store, you can use:\n\n- Chroma (`Settings.CHROMA`)\n\n## Quick Start 🚀\n\n### Knowledge Base\n\nA knowledge base defines the data you want to ingest into your vector store when your RAG is initialized.\nIt is the data ingested when you call the `build` function:\n\n```python\nfrom raglight import RAGPipeline, FolderSource, GitHubSource\nfrom raglight.config.settings import Settings\n\npipeline = RAGPipeline(knowledge_base=[\n    FolderSource(path=\"\u003cpath to your folder with pdf\u003e/knowledge_base\"),\n    GitHubSource(url=\"https://github.com/Bessouat40/RAGLight\")\n    ],\n    model_name=\"llama3\",\n    provider=Settings.OLLAMA,\n    k=5)\n\npipeline.build()\n```\n\nYou can define two kinds of knowledge base:\n\n1. Folder Knowledge Base\n\nAll files and folders in this directory will be ingested into the vector store:\n\n```python\nfrom raglight import FolderSource\nFolderSource(path=\"\u003cpath to your folder with pdf\u003e/knowledge_base\")\n```\n\n2. 
GitHub Knowledge Base\n\nYou can declare the GitHub repositories you want to ingest into your vector store:\n\n```python\nfrom raglight import GitHubSource\nGitHubSource(url=\"https://github.com/Bessouat40/RAGLight\")\n```\n\n### RAG\n\nYou can easily set up your RAG with RAGLight:\n\n```python\nfrom raglight.rag.simple_rag_api import RAGPipeline\nfrom raglight.models.data_source_model import FolderSource, GitHubSource\nfrom raglight.config.settings import Settings\nfrom raglight.config.rag_config import RAGConfig\nfrom raglight.config.vector_store_config import VectorStoreConfig\n\nSettings.setup_logging()\n\nknowledge_base=[\n    FolderSource(path=\"\u003cpath to your folder with pdf\u003e/knowledge_base\"),\n    GitHubSource(url=\"https://github.com/Bessouat40/RAGLight\")\n    ]\n\nvector_store_config = VectorStoreConfig(\n    embedding_model = Settings.DEFAULT_EMBEDDINGS_MODEL,\n    api_base = Settings.DEFAULT_OLLAMA_CLIENT,\n    provider=Settings.HUGGINGFACE,\n    database=Settings.CHROMA,\n    persist_directory = './defaultDb',\n    collection_name = Settings.DEFAULT_COLLECTION_NAME\n)\n\nconfig = RAGConfig(\n        llm = Settings.DEFAULT_LLM,\n        provider = Settings.OLLAMA,\n        # k = Settings.DEFAULT_K,\n        # cross_encoder_model = Settings.DEFAULT_CROSS_ENCODER_MODEL,\n        # system_prompt = Settings.DEFAULT_SYSTEM_PROMPT,\n        # knowledge_base = knowledge_base\n    )\n\npipeline = RAGPipeline(config, vector_store_config)\n\npipeline.build()\n\nresponse = pipeline.generate(\"How can I create an easy RAGPipeline using raglight framework ? Give me python implementation\")\nprint(response)\n```\n\nYou only need to specify the model you want to use.\n\n\u003e ⚠️\n\u003e By default, the LLM provider is Ollama.\n\n### Agentic RAG\n\nThis pipeline extends the Retrieval-Augmented Generation (RAG) concept by incorporating\nan additional Agent. 
This agent can retrieve data from your vector store.\n\nYou can modify several parameters in your config:\n\n- `provider`: Your LLM provider (Ollama, LMStudio, Mistral)\n- `model`: The model you want to use\n- `k`: The number of documents to retrieve\n- `max_steps`: Maximum number of reflection steps used by your agent\n- `api_key`: Your Mistral API key\n- `api_base`: Your API URL (Ollama URL, LM Studio URL, ...)\n- `num_ctx`: Your maximum context length\n- `verbosity_level`: Your logs' verbosity level\n- `ignore_folders`: List of folders to exclude during indexing (e.g., `[\".venv\", \"node_modules\", \"__pycache__\"]`)\n\n```python\nfrom raglight.config.settings import Settings\nfrom raglight.rag.simple_agentic_rag_api import AgenticRAGPipeline\nfrom raglight.config.agentic_rag_config import AgenticRAGConfig\nfrom raglight.config.vector_store_config import VectorStoreConfig\nfrom dotenv import load_dotenv\n\nload_dotenv()\nSettings.setup_logging()\n\npersist_directory = './defaultDb'\nmodel_embeddings = Settings.DEFAULT_EMBEDDINGS_MODEL\ncollection_name = Settings.DEFAULT_COLLECTION_NAME\n\nvector_store_config = VectorStoreConfig(\n    embedding_model = model_embeddings,\n    api_base = Settings.DEFAULT_OLLAMA_CLIENT,\n    database=Settings.CHROMA,\n    persist_directory = persist_directory,\n    # host='localhost',\n    # port='8001',\n    provider = Settings.HUGGINGFACE,\n    collection_name = collection_name\n)\n\n# Custom ignore folders - you can override the default list\ncustom_ignore_folders = [\n    \".venv\",\n    \"venv\",\n    \"node_modules\",\n    \"__pycache__\",\n    \".git\",\n    \"build\",\n    \"dist\",\n    \"my_custom_folder_to_ignore\"  # Add your custom folders here\n]\n\nconfig = AgenticRAGConfig(\n            provider = Settings.MISTRAL,\n            model = \"mistral-large-2411\",\n            k = 10,\n            system_prompt = Settings.DEFAULT_AGENT_PROMPT,\n            max_steps = 4,\n           
 api_key = Settings.MISTRAL_API_KEY, # os.environ.get('MISTRAL_API_KEY')\n            ignore_folders = custom_ignore_folders,  # Use custom ignore folders\n            # api_base = ... # If you have a custom client URL\n            # num_ctx = ... # Max context length\n            # verbosity_level = ... # Default = 2\n            # knowledge_base = knowledge_base\n        )\n\nagenticRag = AgenticRAGPipeline(config, vector_store_config)\nagenticRag.build()\n\nresponse = agenticRag.generate(\"Please implement for me AgenticRAGPipeline inspired by RAGPipeline and AgenticRAG and RAG\")\n\nprint('response : ', response)\n```\n\n### MCP Integration\n\nRAGLight supports MCP Server integration to enhance the reasoning capabilities of your agent. MCP allows the agent to interact with external tools (e.g., code execution environments, database tools, or search agents) via a standardized server interface.\n\nTo use MCP, pass an `mcp_config` parameter to your `AgenticRAGConfig`; each entry defines the `url` (and optionally the `transport`) of an MCP server.\n\nAdd this parameter to your `AgenticRAGConfig`:\n\n```python\nconfig = AgenticRAGConfig(\n    provider = Settings.OPENAI,\n    model = \"gpt-4o\",\n    k = 10,\n    mcp_config = [\n        {\"url\": \"http://127.0.0.1:8001/sse\"}  # Your MCP server URL\n    ],\n    ...\n)\n```\n\n\u003e 📚 Documentation: Learn how to configure and launch an MCP server using [MCPClient.server_parameters](https://huggingface.co/docs/smolagents/en/reference/tools#smolagents.MCPClient.server_parameters)\n\n### Use Custom Pipeline\n\n**1. 
Configure Your Pipeline**\n\nYou can also set up your own pipeline:\n\n```python\nfrom raglight.rag.builder import Builder\nfrom raglight.config.settings import Settings\n\n# model_embeddings, persist_directory, collection_name, model_name and\n# system_prompt_directory are assumed to be defined beforehand.\nrag = Builder() \\\n    .with_embeddings(Settings.HUGGINGFACE, model_name=model_embeddings) \\\n    .with_vector_store(Settings.CHROMA, persist_directory=persist_directory, collection_name=collection_name) \\\n    .with_llm(Settings.OLLAMA, model_name=model_name, system_prompt_file=system_prompt_directory) \\\n    .build_rag(k = 5)\n```\n\n**2. Ingest Documents Inside Your Vector Store**\n\nThen you can ingest data into your vector store.\n\n1. You can use the default pipeline, which ingests non-code data:\n\n```python\nrag.vector_store.ingest(data_path='./data')\n```\n\n2. Or you can use the code pipeline:\n\n```python\nrag.vector_store.ingest(repos_path=['./repository1', './repository2'])\n```\n\nThis pipeline ingests code embeddings into your collection: **collection_name**.\nIt also extracts all signatures from your code base and ingests them into: **collection_name_classes**.\n\nThe `VectorStore` class provides two functions, `similarity_search` and `similarity_search_class`, to search each of these collections.\n\n**3. Query the Pipeline**\n\nRetrieve and generate answers using the RAG pipeline:\n\n```python\nresponse = rag.generate(\"How can I optimize my marathon training?\")\nprint(response)\n```\n\n\u003e ### ✚ More Examples\n\u003e\n\u003e You can find more examples for all these use cases in the [examples](https://github.com/Bessouat40/RAGLight/blob/main/examples) directory.\n\n### Override Default Processors\n\nRAGLight ships with built-in document processors based on file extension:\n\n- `pdf` → `PDFProcessor`\n- `py`, `js`, `ts`, `java`, `cpp`, `cs` → `CodeProcessor`\n- `txt`, `md`, `html` → `TextProcessor`\n\nYou can override these defaults using the `custom_processors` argument when building your vector store. 
This is especially useful if you want to handle certain file types with a custom logic, such as using a **Vision-Language Model (VLM)** for PDFs with diagrams and images. RAGLight provides a VLM based Processor too.\n\n#### Register the Custom Processor in the Builder\n\n```python\nfrom raglight.document_processing.vlm_pdf_processor import VlmPDFProcessor\nfrom raglight.llm.ollama_model import OllamaModel\nfrom raglight.rag.builder import Builder\nfrom raglight.config.settings import Settings\n\nfrom dotenv import load_dotenv\nimport os\n\nload_dotenv()\nSettings.setup_logging()\n\npersist_directory = './defaultDb'\nmodel_embeddings = Settings.DEFAULT_EMBEDDINGS_MODEL\ncollection_name = Settings.DEFAULT_COLLECTION_NAME\ndata_path = os.environ.get('DATA_PATH')\n\n# Vision-Language Model (example with Ollama)\nvlm = OllamaModel(\n    model_name=\"ministral-3:3b\",\n    system_prompt=\"You are a technical documentation visual assistant.\",\n)\n\ncustom_processors = {\n    \"pdf\": VlmPDFProcessor(vlm),  # Override default PDF processor\n}\n\nvector_store = Builder() \\\n    .with_embeddings(Settings.HUGGINGFACE, model_name=model_embeddings) \\\n    .with_vector_store(\n        Settings.CHROMA,\n        persist_directory=persist_directory,\n        collection_name=collection_name,\n        custom_processors=custom_processors,\n    ) \\\n    .build_vector_store()\n\nvector_store.ingest(data_path=data_path)\n```\n\nWith this setup, all `.pdf` files will be processed by your custom `VlmPDFProcessor`, while other file types keep using the default processors.\n\n### Hybrid Search (BM25 + Semantic + RRF) 🔍\n\nRAGLight supports three retrieval strategies, configurable via the `search_type` parameter:\n\n| Mode | Description |\n|---|---|\n| `\"semantic\"` | Dense vector similarity search (default) |\n| `\"bm25\"` | Keyword-based BM25 search |\n| `\"hybrid\"` | BM25 + semantic merged with Reciprocal Rank Fusion (RRF) |\n\n#### With the Builder API\n\n```python\nfrom 
raglight.rag.builder import Builder\nfrom raglight.config.settings import Settings\n\nrag = (\n    Builder()\n    .with_embeddings(Settings.HUGGINGFACE, model_name=\"all-MiniLM-L6-v2\")\n    .with_vector_store(\n        Settings.CHROMA,\n        persist_directory=\"./myDb\",\n        collection_name=\"my_collection\",\n        search_type=Settings.SEARCH_HYBRID,  # \"semantic\" | \"bm25\" | \"hybrid\"\n        alpha=0.5,                           # weight between semantic and BM25 in RRF\n    )\n    .with_llm(Settings.OLLAMA, model_name=\"llama3.1:8b\")\n    .build_rag(k=5)\n)\n\nrag.vector_store.ingest(data_path=\"./docs\")\nresponse = rag.generate(\"What is Reciprocal Rank Fusion?\")\nprint(response)\n```\n\n#### With the high-level RAGPipeline API\n\n```python\nfrom raglight.rag.simple_rag_api import RAGPipeline\nfrom raglight.config.rag_config import RAGConfig\nfrom raglight.config.vector_store_config import VectorStoreConfig\nfrom raglight.config.settings import Settings\nfrom raglight.models.data_source_model import FolderSource\n\nvector_store_config = VectorStoreConfig(\n    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,\n    provider=Settings.HUGGINGFACE,\n    database=Settings.CHROMA,\n    persist_directory=\"./myDb\",\n    collection_name=\"my_collection\",\n    search_type=Settings.SEARCH_HYBRID,   # or SEARCH_SEMANTIC / SEARCH_BM25\n    hybrid_alpha=0.5,\n)\n\nconfig = RAGConfig(\n    llm=\"llama3.1:8b\",\n    provider=Settings.OLLAMA,\n    k=5,\n    knowledge_base=[FolderSource(path=\"./docs\")],\n)\n\npipeline = RAGPipeline(config, vector_store_config)\npipeline.build()\nresponse = pipeline.generate(\"Explain the retrieval pipeline\")\nprint(response)\n```\n\n\u003e **How RRF works**: each search mode returns its own ranked list of documents. RRF assigns a score of `1 / (k + rank)` to each document per list and sums them — documents appearing high in both lists are promoted, while documents unique to one list are kept but ranked lower. 
This typically gives the hybrid mode better recall and precision than either mode alone.\n\n\u003e See the full working example in [examples/hybrid_search_example.py](examples/hybrid_search_example.py).\n\n---\n\n## Use RAGLight with Docker\n\nYou can easily use RAGLight inside a Docker container.\nAn example Dockerfile is available here: [examples/Dockerfile.example](https://github.com/Bessouat40/RAGLight/blob/main/examples/Dockerfile.example)\n\n### Build your image\n\nGo to the **examples** directory and run:\n\n```bash\ndocker build -t docker-raglight -f Dockerfile.example .\n```\n\n### Run your image\n\nSo that your container can communicate with Ollama or LMStudio, you need to add a custom host-to-IP mapping:\n\n```bash\ndocker run --add-host=host.docker.internal:host-gateway docker-raglight\n```\n\nThe `--add-host` flag lets the container reach the Ollama instance running on the host.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbessouat40%2Fraglight","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbessouat40%2Fraglight","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbessouat40%2Fraglight/lists"}