{"id":23510800,"url":"https://github.com/adaptaware/ragit","last_synced_at":"2025-04-18T15:03:08.738Z","repository":{"id":267043847,"uuid":"900078325","full_name":"adaptaware/ragit","owner":"adaptaware","description":"A RAG back and front end application","archived":false,"fork":false,"pushed_at":"2025-01-14T21:23:32.000Z","size":1003,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-29T06:23:08.207Z","etag":null,"topics":["ai","chatbot","chroma","embeddings","langchain","llama","llm","machine-learning","milvus","openai","pdf-to-text","postrgresql","python","rag","sqlite3","vagrant"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adaptaware.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-07T19:58:20.000Z","updated_at":"2025-02-08T07:37:06.000Z","dependencies_parsed_at":"2025-01-06T23:43:01.653Z","dependency_job_id":null,"html_url":"https://github.com/adaptaware/ragit","commit_stats":null,"previous_names":["adaptaware/ragit"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adaptaware%2Fragit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adaptaware%2Fragit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adaptaware%2Fragit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adaptaware%2Fragit/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adaptaware","download_url":"https://codeload.github.com/adaptaware/ragit/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249509254,"owners_count":21283555,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","chatbot","chroma","embeddings","langchain","llama","llm","machine-learning","milvus","openai","pdf-to-text","postrgresql","python","rag","sqlite3","vagrant"],"created_at":"2024-12-25T12:12:11.376Z","updated_at":"2025-04-18T15:03:08.717Z","avatar_url":"https://github.com/adaptaware.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg src=\"ragit/front_end/static/ragit.jpeg\"  width=\"100\" 
height=\"100\"\u003e\n\u003c/p\u003e\n\n[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/dev_ragit.svg?style=social\u0026label=Follow%20%40dev_ragit)](https://twitter.com/dev_ragit)\n[![RAGit](https://img.shields.io/badge/RAGit-Chat%20with%20Ragit-green?labelColor=blue\u0026style=flat\u0026link=https://adaptaware.org)](https://adaptaware.org)\n[![Docker](https://img.shields.io/badge/Docker-Front%20End-green?labelColor=blue\u0026style=flat\u0026link=https://hub.docker.com/repository/docker/adaptaware/ragit/general)](https://hub.docker.com/repository/docker/adaptaware/ragit/general)\n[![Docker](https://img.shields.io/badge/Docker-Back%20End-green?labelColor=blue\u0026style=flat\u0026link=https://hub.docker.com/repository/docker/adaptaware/ragit-back-end/general)](https://hub.docker.com/repository/docker/adaptaware/ragit-back-end/general)\n[![Licence](https://img.shields.io/badge/Licence-GNU%20v3.0-green?labelColor=blue\u0026style=flat\u0026link=https://github.com/adaptaware/ragit/blob/main/LICENSE)](https://github.com/adaptaware/ragit/blob/main/LICENSE)\n\n\n# Table of Contents\n\n- [What is RAGit](#what-is-ragit)\n- [Development Environment](#development-environment)\n- [Running RAGit using docker](#running-ragit-using-docker)\n- [RAG Collection](#rag-collection)\n- [The env file for Docker](#the-env-file-for-docker)\n- [Back up and Restore](#back-up-and-restore)\n- [Data pipeline overview](#data-pipeline-overview)\n- [Why Use Vagrant For Development](#why-use-vagrant-for-development)\n\n# What is RAGit\n\nWelcome to the official documentation for **RAGit**, an open-source framework designed to streamline the development and management of Retrieval-Augmented Generation (RAG) solutions. 
RAGit eliminates the complexities associated with data handling, model selection, and infrastructure setup, empowering developers to concentrate on application logic and customization.\n\nWhether you're working for a small to medium-sized business or seeking a personal solution for creating custom chatbots, RAGit offers a versatile platform to meet your needs. It supports document integration through a command-line interface capable of handling various Large Language Models (LLMs) and vector databases. Meanwhile, the intuitive, web-based front end ensures a user-friendly experience for deploying and managing your applications.\n\n\nThe core values and principles behind RAGit can be summarized as follows:\n\n**Open Source**\n\nRAGit is proudly offered under the GPL license, ensuring it remains open source\nfor all current and future users.\n\n**Generality**\n\nRAGit is adaptable to any dataset, accommodating a wide array of data types.\nThis flexibility provides a robust foundation for crafting customized RAG\napplications.\n\n**Simplicity**\n\nOur framework prioritizes user-friendliness by abstracting complex data\nmanagement processes. With RAGit, you can focus on refining document selection\nand optimizing outcomes without delving into intricate implementation details.\n\n**Configurability**\n\nRAGit offers a high degree of customization. 
Experiment with hyperparameters,\nexplore various chunk splitting strategies, adjust vector distance algorithms,\nand apply prompt engineering to gain full control over your RAG pipeline.\n\n**Comprehensiveness**\n\nBeyond model training and inference, RAGit equips you with tools for efficient\ndata ingestion, processing, and management, supporting every phase of your\nproject.\n\n**Vendor Neutrality**\n\nRAGit remains agnostic to specific technologies, allowing for easy integration\nand switching between diverse components and services.\n\nBy embracing these guiding principles, RAGit accelerates the creation of\neffective and robust RAG solutions. Explore our framework to harness its full\npotential and drive your projects forward.\n\n\n# Development Environment\n\nThe recommended development platform for RAGit is based on Vagrant. To get a fully\nfunctional Vagrant box, follow these steps:\n\n## Build the virtual machine\n\nThe easiest way to install and run RAGIT locally is to use a virtual machine\ncreated with [vagrant](https://developer.hashicorp.com/vagrant/install#darwin).\nYou will need to have vagrant installed on your machine.\n\nAssuming you already have vagrant installed, follow these steps\nto install the repository under your home directory (you can always install it\nin any other directory if needed).\n\n```\ncd ~\ngit clone git@github.com:adaptaware/ragit.git\nmkdir ~/ragit-data\ncd ragit\nvagrant up\nvagrant ssh\n```\n\nAfter `vagrant ssh` you are logged in to the newly created virtual machine, which should be ready\nto go.\n\n## The ragit-data directory\n\nThe ragit-data directory is shared between the host and guest machine and it\ncontains all the RAG collections.\n\nEach RAG Collection is stored under a subdirectory of the ragit-data directory.\nThe name of the subdirectory is also the name of the RAG collection.\nInside this subdirectory there must be another subdirectory called `documents`, which\ncontains all the documents (pdf, 
markdown, docx etc) that are used for the RAG\ncollection.\n\n**The dummy RAG Collection**\n\nFor testing purposes we need one special collection named dummy which will be\nused by some of the integration and functional tests for the application.\n\nTo create the dummy collection, and assuming you are inside the vagrant box,\nfollow these steps:\n\n```\nmkdir -p ~/ragit-data/dummy/documents\ncp -R /vagrant/ragit/libs/testing_data/* ~/ragit-data/dummy/documents\n```\n\nAt this point your ragit-data directory should look similar to the following:\n\n```\nragit-data/\n├── dummy\n│   └── documents\n│       ├── hello-world.docx\n│       ├── hello_world.py\n│       ├── nested_dir\n│       │   ├── method-chaining.md\n│       │   └── nested_2\n│       │       └── sample1.pdf\n│       ├── patents.pdf\n│       ├── sql-alchemy-sucks.md\n│       └── sunhaven.md\n```\n\n## Install the python dependencies\n\nThe required python libraries can be installed as follows:\n\n```\npip3 install -r /vagrant/requirements.txt\n```\n\n## Create the settings file.\n\nUnder the vagrant machine's home directory `/home/vagrant` create a new file\ncalled `settings.json` and store in it a valid OpenAI key, the vector database\nprovider and a LlamaCloud key in the following format:\n\n```json\n{\n    \"OPENAI_API_KEY\": \"\u003cvalid-open-ai-key\u003e\",\n    \"VECTOR_DB_PROVIDER\": \"\u003csupported-vector-db\u003e\",\n    \"LLAMA_CLOUD_API_KEY\": \"\u003cLLAMA_CLOUD_API_KEY\u003e\"\n}\n```\n\nYou can get the LLAMA_CLOUD_API_KEY from this link: https://cloud.llamaindex.ai/\n\nThe LLAMA_CLOUD_API_KEY is needed for the pdf to markdown transformation.\n\nThe supported vector databases are the following:\n\n- CHROMA\n- MILVUS\n\n\n## Run the tests\n\nYou can verify your setup by running all the tests with the following commands:\n\n```\ncd /vagrant/ragit/\npt\n```\n\n## Run Ragit Backend Utility\n\nAt this point you should have a successfully installed RAGit application along\nwith a RAG collection that you can use similarly to any other RAG 
collection\nyou are going to create.\n\n**Build the backend**\n\nThe backend processing includes the following steps:\n\n- Preprocessing the documents under the RAG collection\n- Splitting the documents to chunks\n- Creating the embeddings for each chunk\n- Inserting the updates into the vector database\n\nTo run them, use the `ragit` command line application, which you can start by running\nthe command `ragit` from anywhere inside the vagrant box; this should bring up\nthe following prompt:\n\n```\n(ssh)vagrant@ragit:/vagrant/ragit$ ragit\nWelcome to the RAG Collection Tracker. Type help or ? to list commands.\n\n(RAGit)\n```\n\n**Show all the commands**\n\nEntering `help` or `?` shows all the available commands:\n\n```\n(RAGit) ?\n\nl (list): List all available collections.\ns (stats) \u003cname\u003e: Print its stats for the pass in collection.\np (process): Process the data the passed in collection.\nm (create_markdowns): Creates missing markdowns.\nh (help): Prints this help message.\ne (exit): Exit.\n```\n\n**Show all the RAG collections**\n\nTo see all the available collections, enter `list` or `l`:\n\n```\n(RAGit) list\ndummy\n(RAGit)\n```\n\nSince we have created only the `dummy` collection, this is the only one we\nsee.\n\n**Show all the statistics for a RAG collection**\n\nIf we enter `stats dummy` or `s dummy` we see the statistics for\nthe `dummy` collection:\n\n```\n\n(RAGit) s dummy\nname.....................: dummy\nfull path................: /home/vagrant/ragit-data/dummy/documents\ntotal documents..........: 5\ntotal documents in db....: 0\ntotal chunks.............: 0\nwith embeddings..........: 0\nwithout embeddings.......: 0\ninserted to vectordb.....: 0\nto insert to vector db...: 0\ntotal pdf files..........: 2\npdf missing markdowns....: 2\n(RAGit)\n\n```\n\nThe above result tells us that we have 5 non-pdf documents and 2 pdf files in the RAG\ncollection.\n\nIt also tells us that the pdf files are missing their 
conversion to\nmarkdowns, while nothing has been inserted into the database or the vectordb.\n\n**Create the markdowns for the pdf**\n\nTo create the missing markdowns for the dummy collection we need to enter the\nfollowing command: `m dummy` or `create_markdowns dummy` as can be seen here:\n\n```\n(RAGit) create_markdowns dummy\n```\nor\n\n```\n(RAGit) m dummy\n```\n\nNow the statistics for the dummy collection look as follows:\n\n```\n(RAGit) s dummy\nname.....................: dummy\nfull path................: /home/vagrant/ragit-data/dummy/documents\ntotal documents..........: 9\ntotal documents in db....: 0\ntotal chunks.............: 0\nwith embeddings..........: 0\nwithout embeddings.......: 0\ninserted to vectordb.....: 0\nto insert to vector db...: 0\ntotal pdf files..........: 2\npdf missing markdowns....: 0\n(RAGit)\n\n```\n\nAs we can see, there are no more missing markdowns. We still\nneed to process the documents and insert them into the vector db, which will\nmake the RAG collection ready to serve client queries.\n\n**Process the documents**\n\nTo process the documents we need to enter the `process dummy` or `p dummy`\ncommand:\n\n\n```\nfix this\n```\n\nNow the stats for the collection look as follows:\n\n```\n(RAGit) s dummy\nname.....................: dummy\nfull path................: /home/vagrant/ragit-data/dummy/documents\ntotal documents..........: 9\ntotal documents in db....: 9\ntotal chunks.............: 73\nwith embeddings..........: 73\nwithout embeddings.......: 0\ninserted to vectordb.....: 73\nto insert to vector db...: 0\ntotal pdf files..........: 2\npdf missing markdowns....: 0\n(RAGit)\n\n```\n\nAs we can see, we now have 73 chunks, all of which were inserted into the\nvector db and are ready for use.\n\n## Run Ragit Front End\n\n**Starting RAGit web server**\n\nThe front end of RAGit is a web-based service that can be started as follows:\n\n```\n(ssh)vagrant@ragit:/vagrant/ragit$ cd 
/vagrant/ragit/front_end/\n(ssh)vagrant@ragit:.../ragit/front_end$ python3 app.py dummy\n```\n\nAs we can see, we pass the name of the RAG collection to serve (here\n`dummy`) on the command line; doing so produces output similar to the\nfollowing:\n\n```\nRunning the RAGIT UI as not ADMIN\nLoading vector db, using collection dummy\nfix this\n```\n\nThe query / answer we see here is just a self test to validate the connectivity\nto the LLM and the vectordb and can differ from run to run.\n\nThe server was started on port `13131`, while the port to access it from the\n'outside' is the one we have specified in the Vagrantfile in the following\nline:\n\n```\nconfig.vm.network \"forwarded_port\", guest: 13131, host: 13132\n```\n\nThus, in this case the port will be 13132 (it can be any other port that is convenient\nfor the host machine).\n\n**Accessing the RAGit web page**\n\nFrom the browser we can access the RAGit web page using the following url:\n\n```\nlocalhost:13132\n```\n\n**Sign up / Login**\n\nThe first time we access the RAGit web server we need to sign up and\nthen use the user name and password to log in to the front end.\n\n**Start querying the RAG collection**\n\nOnce we are signed in, the environment should look familiar from other chatbots.\n\n\n![image](https://github.com/user-attachments/assets/6d8951a6-c382-44b3-8689-9b0d70a8243a)\n\n**Checking the chunks for the response**\n\nIf we want to see the documents from which the above response originated, we\nclick the `History` option, as can be seen here:\n\n![image](https://github.com/user-attachments/assets/33400191-ab3d-44c7-b644-5caac3640c34)\n\n\n**Viewing the full document where the chunk is coming from**\n\nBy clicking the link with the name of the document we can see the full document where the chunk is coming from:\n\n![image](https://github.com/user-attachments/assets/69b93cd7-ded8-4262-8ed2-fb3e92c91bc7)\n\n\n# Running RAGit using docker\n\n# Deployment Steps\n\nThis 
documentation provides instructions for deploying a standalone  RAGIT application using Docker.\n\n## Create the docker-compose file\n\n- Create a new directory in any location you prefer where you will store all necessary files for deployment.\n\n- Under this directory create a file named `docker-compose.yaml` with the following content:\n\n```\nversion: \"3.8\"\n\nservices:\n    frontend:\n      image: adaptaware/ragit:1.0\n      environment:\n        - OPENAI_API_KEY=${OPENAI_API_KEY}\n        - SERVICE_PORT=${INTERNAL_FRONT_END_PORT}\n        - RAG_COLLECTION=${RAG_COLLECTION}\n        - VECTOR_DB_PROVIDER=${VECTOR_DB_PROVIDER}\n      volumes:\n        - ${SHARED_DIR}:/root/ragit-data\n      ports:\n        - \"${EXTERNAL_FRONT_END_PORT}:${INTERNAL_FRONT_END_PORT}\"\n    backend:\n        image: adaptaware/ragit-back-end:1.0\n        environment:\n            - OPENAI_API_KEY=${OPENAI_API_KEY}\n            - POSTGRES_DB=${POSTGRES_DB}\n            - POSTGRES_USER=${POSTGRES_USER}\n            - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}\n            - POSTGRES_PORT=${POSTGRES_PORT}\n            - POSTGRES_HOST=${POSTGRES_HOST}\n            - VECTOR_DB_PROVIDER=${VECTOR_DB_PROVIDER}\n        volumes:\n            - ${SHARED_DIR}:/root/ragit-data\n        stdin_open: true  # Keep stdin open even if not attached\n        tty: true         # Allocate a pseudo-TTY\n    db_host:\n      image: postgres:latest\n      environment:\n        - POSTGRES_DB=${POSTGRES_DB}\n        - POSTGRES_USER=${POSTGRES_USER}\n        - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}\n        - POSTGRES_PORT=${POSTGRES_PORT}\n        - POSTGRES_HOST=${POSTGRES_HOST}\n      volumes:\n        - pgdata:/var/lib/postgresql/data\nvolumes:\n  pgdata:\n```\n\n## Specify the Environment Variables\n\n   - In the directory containing `docker-compose.yaml`, create a file named `.env` with the following content:\n\n     ```plaintext\n     OPENAI_API_KEY=\u003cvalid-api-key\u003e\n     BIND_ADDRESS=0.0.0.0\n     
EXTERNAL_FRONT_END_PORT=13133\n     INTERNAL_FRONT_END_PORT=13131\n     VECTOR_DB_PROVIDER=CHROMA\n     SHARED_DIR=\u003cyour-home-directory\u003e/ragit-data\n     RAG_COLLECTION=\u003ccollection-name\u003e\n     ```\n     In the above content replace `\u003cvalid-api-key\u003e` with a valid API key and the `\u003ccollection-name\u003e` with a valid collection name (following the above example the collection name would be stories).\n\n\n   - **Customize Variables**:\n     - Replace `valid-api-key` with your actual OpenAI API key.\n     - Substitute `\u003cyour-home-directory\u003e` with the full path to your **home directory**.\n     - Set `collection-name` to your specific RAG collection name (e.g., `stories`).\n\n\n## Create Shared Directory\n\nUnder the **HOME directory** create a `ragit-data` directory where the data for\nall the collections will exist.\n\nThe structure should be as follows:\n\n\n     \u003cHOME-DIR\u003e\n     └── ragit-data\n         └── \u003cyour-collection-name\u003e\n             └── documents\n\n\n   - Note: Replace `your-collection-name` with your actual RAG collection name.\n\nThe directory name `documents` must always exist under the collection and this\nis where you are placing all the documents that will consist the RAG\nCollection. 
Documents can be nested and the directory structure of the\n`documents` directory is completely up to the user of RAGit to decide (again,\nthe only restriction is that everything must be under the `documents` directory as seen above).\n\n\n\n## Running the Service\n\nNavigate to your deployment directory (where `docker-compose.yaml` is located) and execute the following:\n\n#### Start the database\n\n```bash\ndocker-compose up -d db_host\n```\n\n#### Start the backend processor\n\n```bash\ndocker-compose run backend\n```\n\n#### From another CLI window start the front end\n\n   - Navigate to your deployment directory (where `docker-compose.yaml` is located) and execute the following command to launch your Dockerized front end:\n\n     ```bash\n     docker-compose up -d frontend\n     ```\n\n#### Access the service\n\nYour RAGIT front end should now be operational and accessible locally through the specified external port as follows:\n\n```\nlocalhost:13133\n```\n\nor remotely\n\n```\n\u003cip-address\u003e:13133\n```\n\n## Clean Up\n\nIf you need to start fresh with the Docker installation, you can run the following commands:\n\n```sh\ndocker stop $(docker ps -aq); docker rm $(docker ps -aq); docker image rm -f $(docker images -q)\ndocker compose down -v\n```\n\n\n# RAG Collection\n\n## Overview\n\nA RAG collection is a fundamental component of the RAGit system. It is uniquely\nidentified by a `collection name` or simply `name`. This document outlines the\nsteps involved in creating and managing a custom RAG collection.\n\n## Definition: RAG Collection (or simply Collection)\n\nA `RAG Collection` is a collection of documents stored under the shared\ndirectory (`ragit-data`). 
Assuming we have a collection called `mydata`, its\nrelated data will exist under the following directory:\n\n```sh\n~/ragit-data/mydata/documents\n```\n\n## Prepare the Documents Directory\n\nTo create a new RAG collection, you need to prepare the documents directory\nwhere your collection's documents will be stored.\n\n**Create a Directory**\n\nCreate a directory to store your collection's documents:\n\n```bash\nmkdir -p ~/ragit-data/\u003ccollection-name\u003e/documents\n```\n\nReplace `\u003ccollection-name\u003e` with the desired name for your collection.\n\n## Copy Relevant Documents\n\nAfter you create the above directory, copy all relevant documents into the\nnewly created `documents` directory:\n\n```bash\ncp path/to/your/documents/* ~/ragit-data/\u003ccollection-name\u003e/documents/\n```\n\n## Process Documents and Create Index\n\nThe `ragit` command is available from anywhere under the VM and can be used to\ninteract with the backend of the RAGit service. More precisely, the following\nis the available functionality:\n\n## Display Available RAG Collections\n\nList all available RAG collections using the following command:\n\n```sh\nragit -l\n```\n\nExample output:\n\n```sh\ndummy\nmycode\nstories\n```\n\n## Show the Statistics for a RAG Collection\n\nDisplay statistics for a specific RAG collection using the following command,\nreplacing `\u003ccollection-name\u003e` with your collection's name:\n\n```sh\nragit -n \u003ccollection-name\u003e\n```\n\nExample output:\n\n```sh\nname.....................: stories\nfull path................: /home/vagrant/ragit-data/stories/documents\ntotal documents..........: 4\ntotal documents in db....: 4\ntotal chunks.............: 21\nwith embeddings..........: 21\nwithout embeddings.......: 0\ninserted to vectordb.....: 21\nto insert to vector db...: 0\n```\n\n## Process the Data for a RAG Collection\n\nProcess the available documents for a specific RAG collection using the\nfollowing command, replacing 
`\u003ccollection-name\u003e` with your collection's name:\n\n```sh\nragit -n \u003ccollection-name\u003e -p\n```\n\nExample output:\n\n```sh\nWill insert all available chunks to the database.\nInserted 0 chunks.\nWill insert all available embeddings to the database.\nInserted 0 embeddings.\nupdating the vector db.\nTotally inserted records: 0\nInserted 0 chunks to the vector db.\n```\n\n## Summary\n\nBy following these steps, you can create and manage a custom RAG collection\nwithin the RAGit framework. This process involves setting up the documents\ndirectory, copying relevant documents, and using RAGit's command-line tools to\nprocess and manage your collection. This ensures that your data is properly\nindexed and ready for use in RAG-based applications.\n\n\n# The env file (for Docker)\n\n## Overview\n\nThe `.env` file is crucial for configuring environment-specific variables for\nthe RAGIT application when running inside DOCKER.\n\nThis file allows you to manage various settings such as\nAPI keys, database credentials, and application ports in a centralized manner.\nThis guide explains the purpose of each variable in the `.env` file and how to\nset them up correctly.\n\n## Example .env File\n\nBelow is an example of a `.env` file:\n\n```env\nOPENAI_API_KEY=\u003cvalid-openai-key\u003e\nPOSTGRES_DB=postgres\nPOSTGRES_USER=postgres\nPOSTGRES_PASSWORD=mypassword\nPOSTGRES_PORT=5432\nPOSTGRES_HOST=db_host\nEXTERNAL_FRONT_END_PORT=13133\nINTERNAL_FRONT_END_PORT=8789\nVECTOR_DB_PROVIDER=\u003cCHROMA or MILVUS\u003e\nSHARED_DIR=\u003cpath-to-shared-directory\u003e\nRAG_COLLECTION=\u003cyour-rag-collection-name\u003e\n```\n\nThe Configuration Parameters are the following:\n\n## OPENAI_API_KEY\n- **Description**: The API key for accessing OpenAI's services.\n- **Format**: String\n- **Example**: `OPENAI_API_KEY=sk-xxxxxx`\n\n## POSTGRES_DB\n- **Description**: The name of the PostgreSQL database.\n- **Format**: String\n- **Example**: `POSTGRES_DB=postgres`\n\n## 
POSTGRES_USER\n- **Description**: The username for accessing the PostgreSQL database.\n- **Format**: String\n- **Example**: `POSTGRES_USER=postgres`\n\n## POSTGRES_PASSWORD\n- **Description**: The password for accessing the PostgreSQL database.\n- **Format**: String\n- **Example**: `POSTGRES_PASSWORD=mypassword`\n\n## POSTGRES_PORT\n- **Description**: The port number on which the PostgreSQL database is running.\n- **Format**: Integer\n- **Example**: `POSTGRES_PORT=5432`\n\n## POSTGRES_HOST\n- **Description**: The hostname or IP address where the PostgreSQL database is hosted.\n- **Format**: String\n- **Example**: `POSTGRES_HOST=db_host`\n\n## EXTERNAL_FRONT_END_PORT\n- **Description**: The external port number for accessing the RAGIT web application.\n- **Format**: Integer\n- **Example**: `EXTERNAL_FRONT_END_PORT=13133`\n\n## INTERNAL_FRONT_END_PORT\n- **Description**: The internal port number the frontend service listens on.\n- **Format**: Integer\n- **Example**: `INTERNAL_FRONT_END_PORT=8789`\n\n## VECTOR_DB_PROVIDER\n- **Description**: Specifies the vector database provider to be used.\n- **Options**: `CHROMA` or `MILVUS`\n- **Example**: `VECTOR_DB_PROVIDER=CHROMA`\n\n## SHARED_DIR\n- **Description**: The path to the shared directory that holds the data collection.\n- **Format**: Full path (String)\n- **Example**: `SHARED_DIR=/home/user/ragit-data`\n- **Note**: This path should be accessible by both the host and guest machine/containers.\n\n## RAG_COLLECTION\n- **Description**: The name of the RAG collection to be used.\n- **Format**: String\n- **Example**: `RAG_COLLECTION=stories`\n- **Note**: This should correspond to the sub-directory under `SHARED_DIR` where the collection data is stored.\n\n## Setting Up the .env File\n\n1. **Create the .env File**: In the root directory of the RAGIT repository, create a file named `.env`.\n\n   ```sh\n   touch .env\n   ```\n\n2. 
**Add Configuration Parameters**: Open the `.env` file in your preferred text editor and add the configuration parameters as shown in the example above. Replace placeholder values (like `\u003cvalid-openai-key\u003e`, `\u003cpath-to-shared-directory\u003e`, and `\u003cyour-rag-collection-name\u003e`) with actual values appropriate for your environment.\n\nExample:\n\n```env\nOPENAI_API_KEY=sk-abcdefghijklmnopqrstuvwxy1234567890\nPOSTGRES_DB=postgres\nPOSTGRES_USER=postgres\nPOSTGRES_PASSWORD=secretpassword\nPOSTGRES_PORT=5432\nPOSTGRES_HOST=db_host\nEXTERNAL_FRONT_END_PORT=13133\nINTERNAL_FRONT_END_PORT=8789\nVECTOR_DB_PROVIDER=CHROMA\nSHARED_DIR=/home/user/ragit-data\nRAG_COLLECTION=stories\n```\n\n3. **Save and Close**: Save the changes and close the text editor.\n\n## Summary\n\nThe `.env` file plays a vital role in configuring the RAGIT application by\ncentralizing critical environment variables. By properly setting up this file,\nyou ensure that all components of the RAGIT application can easily access the\nnecessary configuration settings, leading to a smoother and more efficient\ndeployment process.\n\n\n# Back up and Restore\n\nThe structure of the directory holding your ragit collection should be the\nfollowing:\n\n```\n\u003cyour-collection-name\u003e\n├── documents\n├── registry\n└── vectordb\n    └── \u003cyour-collection-name\u003e-chroma-vector.db\n```\n\n## Backup the psql database\n\nIn postgres there should be a database named \u003cyour-collection-name\u003e which holds\nthe chunks used and also information about what has been inserted to the\nvectordb to keep housekeeping clean. 
We need to create a backup of it and add it\nto the registry and the vectordb files so that we can completely restore\nall the related data.\n\nTo back up a RAG collection from the vagrant machine, run the\n`make_backup.sh` utility from the `/vagrant/ragit/utils` directory, passing the name of the\ncollection as the only argument to the command as can be seen here:\n\n```\ncd /vagrant/ragit/utils\n./make_backup.sh dummy\n```\n\nThe above command will create a tarball named `dummy.tar.gz` under the root\nof the shared directory (by default `~/ragit-data`).\n\nWe can move this tarball to any directory we want to use as the backup\nlocation.\n\n\n## Restoring a RAG collection\n\nTo restore a RAG collection using its backup tarball we can use the\n`restore_backup.py` python script that is located under the\n`/vagrant/ragit/utils` directory as can be seen here:\n\n\n```\ncd /vagrant/ragit/utils\npython3 restore_backup.py dummy.tar.gz\n```\n\nUpon completion of the above script the `dummy` directory will be created under\nthe shared directory (by default `~/ragit-data`).\n\nWe still need to create the database manually as can be seen here:\n\n```\ncd ~/ragit-data/dummy\ndropdb -U postgres dummy\ncreatedb -U postgres dummy\npsql -U postgres -d dummy -f dummy.sql\n```\n\nAt this point you should have a fully functional database and RAG collection,\nand you should be able to start the front end and run it against the restored collection.\n\n# Data pipeline overview\n\nThe data pipeline in RAGit is engineered to ensure the seamless transformation of raw documents into actionable insights within a Retrieval Augmented Generation (RAG) solution. This document provides a high-level overview of each stage in the pipeline, emphasizing key processes involved.\n\n## Document Ingestion\n\nDocuments can be ingested simply by placing them into the designated documents directory for the RAG collection within the shared directory. 
The system currently supports PDF, DOCX, Python, and Markdown formats, with additional data types to be incorporated as the project evolves.\n\n## Document ETL Process\n\nAfter documents are placed in the designated directory, they must undergo processing to facilitate splitting, embedding, and storage into the vector database, enabling RAG querying capabilities.\n\n## PDF File Processing\n\nPDF files are initially converted to images, with each page represented as a separate image. These images are then transformed into Markdown format. Following this conversion, the subsequent steps—embedding calculation and storage in the vector database—proceed just as they do for other Markdown documents.\n\n## Document Splitting\n\nOnce collected, documents are divided into smaller chunks to enhance processing efficiency and searchability. Splitting is based on specific criteria, such as paragraph breaks or sentence boundaries.\n\n## Database Insertion\n\nThe resultant chunks are stored incrementally in a relational database, allowing for updates without the necessity of upfront ingestion of all documents.\n\n## Embedding Calculation and Storage\n\nTo facilitate vector-based search, embeddings—numerical representations capturing the semantic meaning of text—are computed for each chunk in the database.\n\n- **Embedding Calculation**: A dedicated process identifies chunks lacking embeddings.\n- **Embedding Storage**: Calculated embeddings are stored back in the database.\n\n## Vector Database Construction\n\nUsing the stored embeddings, the vector database is either constructed or updated, enabling efficient retrieval of relevant chunks based on semantic similarity.\n\n- **Vector Database Update**: Embeddings are indexed, allowing for search operations that return similar chunks in response to a query.\n\n## Frontend Deployment\n\nThe vector database and web service frontend are deployed on a web server, making the RAG solution accessible to users.\n\n- **Web Service**: This 
service interfaces with the vector database to fetch relevant chunks based on user queries.\n\n## Evaluation and Enhancement\n\nUser interactions with the frontend are monitored to gather feedback and evaluate the RAG solution’s performance. This feedback loop informs regular updates and enhancements to the data pipeline.\n\n- **User Feedback**: Captures user assessments (e.g., thumbs up or down) to evaluate response quality.\n- **Periodic Updates**: Utilizes feedback to regularly update the vector database, refine prompts, and improve other solution components.\n\n## High-Level Data Pipeline Workflow\n\n1. **Document Collection**\n   - Gather supported documents (PDF, DOCX, Python, Markdown).\n\n2. **Document Splitting and Database Insertion**\n   - Split documents into chunks.\n   - Insert chunks into the relational database.\n\n3. **Embedding Calculation and Storage**\n   - Identify chunks lacking embeddings.\n   - Calculate and store embeddings in the database.\n\n4. **Vector Database Construction**\n   - Index embeddings within the vector database.\n\n5. **Frontend Deployment**\n   - Deploy the vector database and web service frontend on a web server.\n\n6. **Evaluation and Enhancement**\n   - Gather user feedback for performance evaluation.\n   - Regularly update and improve the data pipeline based on feedback.\n\n## Conclusion\n\nThe data pipeline in RAGit is a detailed and iterative process that transforms raw documents into a robust and efficient RAG solution. By adhering to these stages, RAGit ensures precise processing, indexing, and accessibility of data for retrieval-augmented generation tasks, supporting effective applications.\n\n# Why Use Vagrant For Development\n\nVagrant is a powerful tool for building and managing virtualized development environments. By using Vagrant, developers can ensure a consistent and reproducible environment across all stages of development. 
This is particularly useful for complex applications like RAGIT, which involve multiple components and dependencies.\n\n## Key Benefits\n\n## 1. Consistent Development Environment\n\nVagrant allows developers to create and distribute a consistent development environment. This ensures that all team members work in the same setup, eliminating the \"it works on my machine\" problem. With a Vagrantfile, the entire development environment can be version-controlled and shared.\n\n- **Example**: Every developer working on RAGIT can use the same OS version, dependencies, and configurations, ensuring consistency across different setups.\n\n## 2. Simplified Setup\n\nSetting up a development environment manually can be time-consuming and error-prone. Vagrant automates this process, allowing developers to get up and running quickly with a single command.\n\n- **Example**: Running `vagrant up` in the RAGIT repository can automatically provision a virtual machine with all the necessary dependencies, making it easier for new developers to get started.\n\n## 3. Isolation of Development Environments\n\nVagrant provides isolated environments for different projects. This ensures that dependencies and configurations for RAGIT do not interfere with other projects on the developer's machine.\n\n- **Example**: RAGIT can run in its Vagrant-managed virtual machine, completely isolated from other applications the developer may be working on.\n\n## 4. Enhanced Testing\n\nBy using Vagrant, developers can easily create and manage multiple virtual machines, which is useful for testing different configurations and environments. This can be particularly useful for testing RAGIT in various deployment scenarios.\n\n- **Example**: Developers can test RAGIT on different operating systems and configurations by simply modifying the Vagrantfile and spinning up new VMs.\n\n## 5. Reproducible Builds\n\nVagrant ensures that the development environment is reproducible. 
By providing a consistent environment, it reduces the chances of environment-related bugs and issues.\n\n- **Example**: Bugs that appear in a developer's environment can be reproduced by others using the same Vagrant setup, making debugging and collaboration more efficient.\n\n## 6. Integration with Provisioning Tools\n\nVagrant integrates well with provisioning tools like Ansible, Puppet, and Chef. This allows for more complex setups and configurations to be automated and managed easily.\n\n- **Example**: For RAGIT, additional services and dependencies can be provisioned using Ansible scripts directly within the Vagrantfile.\n\n## 7. Ease of Collaboration\n\nVagrant makes it easy to share development environments. Team members can share the Vagrantfile, ensuring everyone is working in the same environment.\n\n- **Example**: The RAGIT team can distribute the Vagrantfile along with the code repository. Any developer can clone the repository, run `vagrant up`, and start developing without manual configuration.\n\n## 8. Production Parity\n\nVagrant allows developers to create environments that closely match production. This reduces the risk of issues when deploying the application to production.\n\n- **Example**: RAGIT can be developed and tested on an environment that mirrors the production setup, ensuring compatibility and minimizing deployment issues.\n\n## Summary\n\nUsing Vagrant for developing an application like RAGIT provides numerous benefits, including consistent and reproducible environments, simplified setup, isolation, enhanced testing, and ease of collaboration. 
By leveraging Vagrant, developers can ensure that their development and testing environments closely mimic production, reducing the risk of deployment issues and improving overall efficiency and collaboration.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadaptaware%2Fragit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadaptaware%2Fragit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadaptaware%2Fragit/lists"}