{"id":30188694,"url":"https://github.com/rhecosystemappeng/rag-blueprint","last_synced_at":"2025-08-12T17:45:57.340Z","repository":{"id":284796022,"uuid":"955511824","full_name":"RHEcosystemAppEng/RAG-Blueprint","owner":"RHEcosystemAppEng","description":"RAG blueprint","archived":false,"fork":false,"pushed_at":"2025-05-07T15:48:20.000Z","size":10301,"stargazers_count":4,"open_issues_count":9,"forks_count":20,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-07T16:50:03.900Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RHEcosystemAppEng.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-26T18:52:41.000Z","updated_at":"2025-05-07T15:48:25.000Z","dependencies_parsed_at":"2025-04-30T03:27:21.120Z","dependency_job_id":"fc9c636d-b66f-4da7-a330-3b0b16edb219","html_url":"https://github.com/RHEcosystemAppEng/RAG-Blueprint","commit_stats":null,"previous_names":["rhecosystemappeng/rag-blueprint"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/RHEcosystemAppEng/RAG-Blueprint","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RHEcosystemAppEng%2FRAG-Blueprint","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RHEcosystemAppEng%2FRAG-Blueprint/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RHEcosystemAppEng%2FRAG-Blueprint/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
RHEcosystemAppEng%2FRAG-Blueprint/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RHEcosystemAppEng","download_url":"https://codeload.github.com/RHEcosystemAppEng/RAG-Blueprint/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RHEcosystemAppEng%2FRAG-Blueprint/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270108978,"owners_count":24528772,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-12T02:00:09.011Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-12T17:45:55.988Z","updated_at":"2025-08-12T17:45:57.184Z","avatar_url":"https://github.com/RHEcosystemAppEng.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RAG Reference Architecture using LLaMA Stack, OpenShift AI, and PGVector\n\n## Description\n\nRetrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant external knowledge to improve accuracy, reduce hallucinations, and support domain-specific conversations. 
This architecture uses:\n\n- **OpenShift AI** for orchestration\n- **LLaMA Stack** for standardizing the core building blocks and simplifying AI application development\n- **PGVector** for semantic search\n- **Kubeflow Pipelines** for data ingestion\n- **Streamlit UI** for a user-friendly chatbot interface\n\n\n---\n\n## Architecture Diagram\n\n![RAG System Architecture](docs/img/rag-architecture.png)\n\n*The architecture illustrates both the ingestion pipeline for document processing and the RAG pipeline for query handling. For more details, see the [reference architecture document](docs/rag-reference-architecture.md).*\n\n---\n\n## Features\n\n- Multi-modal data ingestion for unstructured data\n- Preprocessing pipelines for cleaning, chunking, and embedding generation using language models\n- Vector store integration to store dense embeddings\n- Integration with LLMs to generate responses based on retrieved documents\n- Streamlit-based web application\n- Runs on OpenShift AI for container orchestration and GPU acceleration\n- Llama Stack to standardize the core building blocks and simplify AI application development\n- Safety guardrails to block harmful requests and responses\n- Integration with MCP servers\n\n---\n\n## Ingestion Use Cases\n\n### 1. BYOD (Bring Your Own Document)\n\nEnd users can upload files through a UI and receive contextual answers based on uploaded content.\n\n### 2. 
Pre-Ingestion\n\nEnterprise documents are pre-processed and ingested via OpenShift AI/Kubeflow Pipelines for later querying.\n\n---\n\n## Key Components\n\n| Layer            | Component                      | Description |\n|------------------|--------------------------------|-------------|\n| **UI Layer**     | Streamlit / React              | Chat-based user interaction |\n| **Retrieval**    | Retriever                      | Vector search |\n| **Embedding**    | `all-MiniLM-L6-v2`             | Converts text to vectors |\n| **Vector DB**    | PostgreSQL + PGVector          | Stores embeddings |\n| **LLM**          | `Llama-3.2-3B-Instruct`        | Generates responses |\n| **Ingestor**     |  Kubeflow Pipeline             | Embeds documents and stores vectors |\n| **Storage**      |  S3 Bucket                     | Document source |\n\n---\n\n## Scalability \u0026 Performance\n\n- KServe for auto-scaling the model and embedding pods\n- GPU-based inference optimized using node selectors\n- Horizontal scaling of ingestion and retrieval components\n\n---\n\nThe kickstart supports two deployment modes:\n\n- Local\n- OpenShift\n\n## OpenShift Installation\n\n### Minimum Requirements\n\n- OpenShift Cluster 4.16+ with OpenShift AI\n- OpenShift Client CLI - [oc](https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/cli_tools/openshift-cli-oc#installing-openshift-cli)\n- Helm CLI - helm\n- [huggingface-cli](https://huggingface.co/docs/huggingface_hub/guides/cli) (Optional)\n- 1 GPU with 24GB of VRAM for the LLM (see the Supported Models table below)\n- 1 GPU with 24GB of VRAM for the safety/shield model (optional)\n- [Hugging Face Token](https://huggingface.co/settings/tokens)\n- Access to the [Meta Llama](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/) model.\n- Access to the [Meta Llama Guard](https://huggingface.co/meta-llama/Llama-Guard-3-8B/) model.\n- Some of the example scripts use `jq`, a JSON parsing utility, which you can 
acquire via `brew install jq`\n\n### Supported Models\n\n| Function    | Model Name                             | GPU         | AWS instance type |\n|-------------|----------------------------------------|-------------|-------------------|\n| Embedding   | `all-MiniLM-L6-v2`                     | CPU or GPU  |                   |\n| Generation  | `meta-llama/Llama-3.2-3B-Instruct`     | L4          | g6.2xlarge        |\n| Generation  | `meta-llama/Llama-3.1-8B-Instruct`     | L4          | g6.2xlarge        |\n| Generation  | `meta-llama/Meta-Llama-3-70B-Instruct` | A100 x2     | p4d.24xlarge      |\n| Safety      | `meta-llama/Llama-Guard-3-8B`          | L4          | g6.2xlarge        |\n\nNote: the 70B model is NOT required for initial testing of this example.  The safety/shield model `Llama-Guard-3-8B` is also optional.\n\n---\n\n#### Installation steps\n\n1. Clone the repo so you have a working copy\n\n```bash\ngit clone https://github.com/RHEcosystemAppEng/RAG-Blueprint\n```\n\n2. Log in to your OpenShift cluster\n\n```bash\noc login --server=\"\u003ccluster-api-endpoint\u003e\" --token=\"sha256~XYZ\"\n```\n\n3. If the GPU nodes are tainted, find the taint key. You will have to pass it to the\n   make command to ensure that the LLM pods are deployed on the tainted GPU nodes.\n   In the example below, the taint key is `nvidia.com/gpu`.\n\n\n```bash\noc get nodes -o yaml | grep -A 3 taint\n```\nThe output will look similar to the following:\n```\n  taints:\n    - effect: NoSchedule\n      key: nvidia.com/gpu\n      value: \"true\"\n--\n    taints:\n    - effect: NoSchedule\n      key: nvidia.com/gpu\n      value: \"true\"\n```\n\nYou can work with your OpenShift cluster admin team to determine what labels and taints identify GPU-enabled worker nodes.  It is also possible that all your worker nodes have GPUs and therefore have no distinguishing taint.\n\n4. Navigate to the Helm deploy directory\n\n```bash\ncd deploy/helm\n```\n\n5. 
List available models\n\n```bash\nmake list-models\n```\n\nThis command lists the models you can pass to the install command in the next step:\n\n```bash\n(Output)\nmodel: llama-3-1-8b-instruct (meta-llama/Llama-3.1-8B-Instruct)\nmodel: llama-3-2-1b-instruct (meta-llama/Llama-3.2-1B-Instruct)\nmodel: llama-3-2-3b-instruct (meta-llama/Llama-3.2-3B-Instruct)\nmodel: llama-3-3-70b-instruct (meta-llama/Llama-3.3-70B-Instruct)\nmodel: llama-guard-3-1b (meta-llama/Llama-Guard-3-1B)\nmodel: llama-guard-3-8b (meta-llama/Llama-Guard-3-8B)\n```\n\nThe \"guard\" models can be used to test shields for profanity, hate speech, violence, etc.\n\n6. Install via make\n\nUse the taint key found in step 3 as the value of `LLM_TOLERATION` and `SAFETY_TOLERATION`.\n\nThe namespace is created automatically.\n\nTo install only the RAG example, without shields, use the following command:\n\n```bash\nmake install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION=\"nvidia.com/gpu\"\n```\n\nTo install both the RAG example and the guard model to allow for shields, use the following command:\n\n```bash\nmake install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION=\"nvidia.com/gpu\" SAFETY=llama-guard-3-8b SAFETY_TOLERATION=\"nvidia.com/gpu\"\n```\n\nIf you have no tainted nodes (for example, because every worker node has a GPU), you can use a simplified version of the make command:\n\n```bash\nmake install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b\n```\n\nWhen prompted, enter your **[Hugging Face Token](https://huggingface.co/settings/tokens)**.\n\nNote: This process often takes 11 to 30 minutes.\n\n7. 
Watch/Monitor\n\n```bash\noc get pods -n llama-stack-rag\n```\n\n```\n(Output)\nNAME                                                               READY   STATUS      RESTARTS   AGE\ndemo-rag-vector-db-v1-0-2ssgk                                      0/1     Error       0          7m49s\ndemo-rag-vector-db-v1-0-fhlpw                                      0/1     Completed   0          7m15s\ndemo-rag-vector-db-v1-0-zx9q9                                      0/1     Error       0          8m16s\nds-pipeline-dspa-6899c9df7c-4j459                                  2/2     Running     0          7m53s\nds-pipeline-metadata-envoy-dspa-7659ddc8d9-vh24q                   2/2     Running     0          7m51s\nds-pipeline-metadata-grpc-dspa-8665cd5c6c-4z9g6                    1/1     Running     0          7m51s\nds-pipeline-persistenceagent-dspa-56f888bc78-h2mtr                 1/1     Running     0          7m53s\nds-pipeline-scheduledworkflow-dspa-c94d5c95d-j4874                 1/1     Running     0          7m52s\nds-pipeline-workflow-controller-dspa-5799548b68-bs6pj              1/1     Running     0          7m52s\nfetch-and-store-pipeline-pf6nr-system-container-driver-692373917   0/2     Completed   0          6m38s\nfetch-and-store-pipeline-pf6nr-system-container-impl-2125359307    0/2     Error       0          6m28s\nfetch-and-store-pipeline-pf6nr-system-dag-driver-3719582226        0/2     Completed   0          6m59s\nllama-3-2-3b-instruct-predictor-00001-deployment-6b85857bd4nfhr    3/3     Running     0          12m\nllamastack-6f55c69f7c-ctctl                                        1/1     Running     0          8m54s\nmariadb-dspa-74744d65bd-gqnzb                                      1/1     Running     0          8m17s\nmcp-servers-weather-65cff98c8b-42n8h                               1/1     Running     0          8m58s\nminio-0                                                            1/1     Running     0          8m52s\npgvector-0                        
                                 1/1     Running     0          8m53s\nrag-pipeline-notebook-0                                            2/2     Running     0          8m17s\nrag-rag-ui-6c756945bf-st6hm                                        1/1     Running     0          8m55s\n```\n\n8. Verify:\n\n```bash\noc get pods -n llama-stack-rag\noc get svc -n llama-stack-rag\noc get routes -n llama-stack-rag\n```\n\n### Using the RAG UI\n\n1. Get the route URL for the application\n\n```bash\nURL=http://$(oc get routes -l app.kubernetes.io/name=rag-ui -o jsonpath=\"{range .items[*]}{.status.ingress[0].host}{end}\")\necho $URL\nopen $URL\n```\n\n![RAG UI Main](./docs/img/rag-ui-1.png)\n\n2. Click on RAG\n\n3. Upload your document\n\n4. Create a Vector Database\n\n![RAG UI Main 2](./docs/img/rag-ui-2.png)\n\n5. Once you've received `Vector database created successfully!`, select the Vector Database you created\n\n6. Ask a question pertaining to your document!\n\n![RAG UI Main 3](./docs/img/rag-ui-3.png)\n\nRefer to the [post installation](docs/post_installation.md) document for batch document ingestion.\n\n## Uninstalling the RAG application\n\n```bash\nmake uninstall NAMESPACE=llama-stack-rag\n```\nor\n\n```bash\noc delete project llama-stack-rag\n```\n\n## Defining a new model\nTo deploy a new model using the `llm-service` Helm chart or connect to an existing vLLM server, follow these steps:\n\n1. 
Deploying a Model via `llm-service`\n\n    If you're deploying the model with `llm-service`, edit the file `deploy/helm/llm-service/values-gpu.yaml` and add a new model definition under the `.models` section. The definition specifies the model to deploy with the `llm-service` chart and its args:\n    ```yaml\n      models:\n        llama-3-2-3b-instruct:\n          id: meta-llama/Llama-3.2-3B-Instruct\n          enabled: false\n          inferenceService:\n            args:\n            - --enable-auto-tool-choice\n            - --chat-template\n            - /vllm-workspace/examples/tool_chat_template_llama3.2_json.jinja\n            - --tool-call-parser\n            - llama3_json\n            - --max-model-len\n            - \"30544\"\n    ```\n\n2. Update `llama-stack` Configuration\n\n    Edit the file `deploy/helm/rag-ui/charts/llama-stack/values.yaml` and add a corresponding entry under `.models` for the LLaMA stack configuration.\n    ```yaml\n      llama-3-2-3b-instruct:\n        id: meta-llama/Llama-3.2-3B-Instruct\n        enabled: false\n        url: local-ns\n    ```\n\nNotes:\n* If the model is not deployed with `llm-service` in the same namespace as `llama-stack`, you do not need to modify the `llm-service` values.  Instead, configure the external model in `llama-stack`, replacing `local-ns` with the model's URL and, optionally, adding an `apiToken`.\n* To use the new model, set the `enabled` flags to `true`.\n\n\n## Local Development Setup\n\nRefer to the [local setup guide](docs/local_setup_guide.md) document for configuring your workstation for code changes and local testing.\n\n1. From the root of the project, switch to the `ui` directory\n\n```bash\ncd ui\n```\n\n2. Create a virtual environment (Python-based development often works better with a virtual environment)\n\n```bash\npython3.11 -m venv venv\nsource venv/bin/activate\n```\n\n3. Download the dependencies\n\n```bash\npip install -r requirements.txt\n```\n\n4. 
Port-forward the `llamastack` service from OpenShift to port 8321 on your local machine\n\n```bash\noc port-forward svc/llamastack 8321:8321\n```\n\n5. Launch the application with the `streamlit` command, which also opens a browser tab\n\n```bash\nstreamlit run app.py\n```\n\n6. Test the MCP-based weather tool: toggle on \"mcp::weather\" and ask for real-time weather information for a US city\n\n![RAG UI MCP weather](./docs/img/rag-ui-3.png)\n\n### Redeploy Changes\n\nMake your changes to `app.py`.\n\nDeploying your changes requires rebuilding the container image with either `docker` or `podman`.  Replace `docker.io` with your target container registry, such as `quay.io`.\n\n```bash\ndocker buildx build --platform linux/amd64,linux/arm64 -t docker.io/burrsutter/rag-ui:v1 -f Containerfile .\n```\n\n```bash\ndocker push docker.io/burrsutter/rag-ui:v1\n```\n\nUpdate the image settings in `deploy/helm/rag-ui/values.yaml`:\n\n```\nimage:\n  repository: docker.io/burrsutter/rag-ui\n  pullPolicy: IfNotPresent\n  tag: v1\n```\n\nTo redeploy to the cluster, run the same `make` command as you did before.\n\n### Shields\n\n```bash\nexport LLAMA_STACK_ENDPOINT=http://localhost:8321\n```\n\nFirst, check which models are available:\n\n```bash\ncurl -sS $LLAMA_STACK_ENDPOINT/v1/models -H \"Content-Type: application/json\" | jq -r '.data[].identifier'\n```\n\n```\n(Output)\nmeta-llama/Llama-3.2-3B-Instruct\nmeta-llama/Llama-Guard-3-8B\nall-MiniLM-L6-v2\n```\n\nThe \"Guard\" model is the one appropriate for adding as a Llama Stack Shield.\n\nFrom within the `ui` directory, or whichever directory has the `venv` with the dependencies:\n\n- Register the shield\n\n```\npython ../shields/register-shield.py\n```\n\n- List shields\n\n```\npython ../shields/list-shields.py\n```\n\n- Test the shield\n\n```\npython ../shields/test-shield.py\n```\n\n```\n(Output)\nLLAMA_STACK_ENDPOINT: http://localhost:8321\nLLAMA_STACK_MODEL: meta-llama/Llama-3.2-3B-Instruct\nSafety violation detected: I can't answer that. 
Can I help with something else?\n'response: \u003cgenerator object Agent._create_turn_streaming at 0x1052ecd60\u003e'\nshield_call\u003e No Violation\ninference\u003e The friendly stranger smiled and said hello as she approached the table where I was sitting alone.\n'response: \u003cgenerator object Agent._create_turn_streaming at 0x1052ed000\u003e'\nshield_call\u003e {'violation_type': 'S1'} I can't answer that. Can I help with something else?\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frhecosystemappeng%2Frag-blueprint","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frhecosystemappeng%2Frag-blueprint","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frhecosystemappeng%2Frag-blueprint/lists"}