{"id":28000917,"url":"https://github.com/kuadrant/inferno","last_synced_at":"2025-08-17T21:12:53.684Z","repository":{"id":290600009,"uuid":"974175625","full_name":"Kuadrant/inferno","owner":"Kuadrant","description":null,"archived":false,"fork":false,"pushed_at":"2025-05-21T11:57:53.000Z","size":74,"stargazers_count":3,"open_issues_count":2,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-07-21T08:43:06.082Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Kuadrant.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-28T11:20:39.000Z","updated_at":"2025-07-11T09:31:56.000Z","dependencies_parsed_at":"2025-05-08T23:55:57.446Z","dependency_job_id":"02fe3e90-71c9-49c5-9657-ff68d9338964","html_url":"https://github.com/Kuadrant/inferno","commit_stats":null,"previous_names":["kuadrant/inferno"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Kuadrant/inferno","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kuadrant%2Finferno","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kuadrant%2Finferno/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kuadrant%2Finferno/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kuadrant%2Finferno/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Kuadrant","download_url":"https://codeload.github.com/Kuadrant/infe
rno/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kuadrant%2Finferno/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270906556,"owners_count":24665804,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-17T02:00:09.016Z","response_time":129,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-08T23:55:47.785Z","updated_at":"2025-08-17T21:12:53.656Z","avatar_url":"https://github.com/Kuadrant.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Inferno\n\nA service providing a suite of `ext_proc` services for LLM use-cases:\n\n- **Semantic Cache**: Caches responses based on semantic similarity of prompts\n- **Prompt Guard**: Filters and blocks potentially harmful prompts using LLM-based risk detection\n- **Token Usage Metrics**: Parses token usage for monitoring and rate limiting use-cases\n\n## Running Locally\n\n### Prerequisites\n\n- Go 1.23+\n- Docker / Podman / Kubernetes\n\n### Building and Running\n\nCurrently, we offer a way to run a demo version of the service, alongside a pre-configured Envoy instance.\n\n```bash\n# Builds `inferno` and deploys Envoy \u0026 configures it to use inferno filter\ndocker-compose up --build\n```\n\nLater, we'll offer more options to deploy on Kubernetes, or as part of Kuadrant.\n\n### Environment 
Variables\n\nThe following environment variables can be configured:\n\n#### General Settings\n- `EXT_PROC_PORT`: Port for the ext_proc server (default: 50051)\n\n#### Semantic Cache Settings\n- `EMBEDDING_MODEL_SERVER`: URL for the embedding model server\n- `EMBEDDING_MODEL_HOST`: Host header for the embedding model server\n- `SIMILARITY_THRESHOLD`: Threshold for semantic similarity (default: 0.75)\n\n#### Prompt Guard Settings\n- `GUARDIAN_API_KEY`: API key for the risk assessment model\n- `GUARDIAN_URL`: Base URL for the risk assessment model\n- `DISABLE_PROMPT_RISK_CHECK`: Set to \"yes\" to disable prompt risk checking\n- `DISABLE_RESPONSE_RISK_CHECK`: Set to \"yes\" to disable response risk checking\n\n#### API Endpoint Settings\n- `OPENAI_API_HOST`: Hostname for OpenAI API requests (default: api.openai.com)\n- `KSERVE_API_HOST`: Hostname/IP for KServe API requests (default: 192.168.97.4)\n- `KSERVE_API_HOST_HEADER`: Host header value for KServe API requests (default: huggingface-llm-default.example.com)\n\n## Sample Requests\n\n### OpenAI proxied requests\n\nThe demo setup with `docker compose` configures Envoy to proxy chat completion and embeddings requests to OpenAI's API, as well as our sample filter chain with the `ext_proc` services we provision and run. 
Ensure you have a valid OpenAI API key exported as an environment variable:\n\n```bash\nexport OPENAI_API_KEY=xxx\n```\n\n\n#### Completion\n\n```bash\ncurl \"http://localhost:10000/v1/completions\" \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $OPENAI_API_KEY\" \\\n  -d '{\n      \"model\": \"gpt-3.5-turbo-instruct\",\n      \"prompt\": \"Write a one-sentence bedtime story about Kubernetes.\"\n  }'\n```\n\n#### Chat completion\n\nChat completions:\n\n```bash\ncurl -v \"http://localhost:10000/v1/chat/completions\" \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $OPENAI_API_KEY\" \\\n  -d '{\n      \"model\": \"gpt-4.1\",\n      \"messages\": [\n        {\n          \"role\": \"user\",\n          \"content\": \"Write a one-sentence bedtime story about Kubernetes.\"\n        }\n      ]\n  }'\n```\n\n\nResponses:\n\n```bash\ncurl -v http://localhost:10000/v1/responses \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $OPENAI_API_KEY\" \\\n  -d '{\n    \"model\": \"gpt-4.1\",\n    \"input\": \"Tell me a three sentence bedtime story about Kubernetes.\"\n  }'\n```\n\n#### Embeddings\n\n```bash\ncurl http://localhost:10000/v1/embeddings \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $OPENAI_API_KEY\" \\\n  -d '{\n    \"input\": \"Your text string goes here\",\n    \"model\": \"text-embedding-3-small\"\n  }'\n```\n\n### KServe Hugging Face LLM Runtime\n\nInferno supports KServe's Hugging Face LLM runtime API endpoints. These endpoints use the `/openai/v1/` prefix instead of `/v1/`. 
You can configure the KServe host using environment variables.\n\n#### Configuration\n\nIf you run the embedding and LLM models as KServe inference services, set the following environment variables to configure the integration:\n\n```bash\n# Set the KServe destination address/IP (default: 192.168.97.4)\nexport KSERVE_API_HOST=192.168.97.4\n\n# Set the KServe Host header separately (default: huggingface-llm-default.example.com)\nexport KSERVE_API_HOST_HEADER=huggingface-llm-default.example.com\n\nexport EMBEDDING_MODEL_SERVER=http://192.168.97.4/v1/models/embedding-model:predict\n\n# Optional: Set the Host header for the embedding model server (only if it differs; otherwise leave unset)\n# export EMBEDDING_MODEL_HOST=\"embedding-model-default.example.com\"\n\n\n# Or set these dynamically, for example:\nexport KSERVE_API_HOST=\"$(kubectl get gateway -n kserve kserve-ingress-gateway -o jsonpath='{.status.addresses[0].value}')\"\nexport KSERVE_API_HOST_HEADER=\"$(kubectl get inferenceservice huggingface-llm -o jsonpath='{.status.url}' | cut -d '/' -f 3)\"\nexport EMBEDDING_MODEL_SERVER=\"http://$(kubectl get gateway -n kserve kserve-ingress-gateway -o jsonpath='{.status.addresses[0].value}')/v1/models/embedding-model:predict\"\nexport EMBEDDING_MODEL_HOST=\"$(kubectl get inferenceservice embedding-model -o jsonpath='{.status.url}' | cut -d '/' -f 3)\"\n\n\n# Start Inferno with the KServe configuration\ndocker-compose up --build\n```\n\n**Note:** KServe's Hugging Face LLM runtime expects requests at `/openai/v1/...` paths, not `/v1/...` paths; `inferno` preserves these paths and does not rewrite them.\n\nWith this configuration, you can make simplified requests to your local Inferno instance:\n\n```bash\n# Without needing to specify the Host header in each request\ncurl -v http://localhost:10000/openai/v1/completions \\\n  -H \"content-type: application/json\" \\\n  -d '{\"model\": \"llm\", \"prompt\": \"What is Kubernetes\", \"stream\": false, \"max_tokens\": 50}'\n```\n\n#### 
Completions\n\n```bash\ncurl -v http://localhost:10000/openai/v1/completions \\\n  -H \"content-type: application/json\" \\\n  -d '{\"model\": \"llm\", \"prompt\": \"What is Kubernetes\", \"stream\": false, \"max_tokens\": 50}'\n```\n\n#### Chat Completions\n\n```bash\ncurl -v \"http://localhost:10000/openai/v1/chat/completions\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n      \"model\": \"llm\",\n      \"messages\": [\n        {\n          \"role\": \"system\",\n          \"content\": \"You are an assistant that knows everything about Kubernetes.\"\n        },\n        {\n          \"role\": \"user\",\n          \"content\": \"What is Kubernetes\"\n        }\n      ],\n      \"max_tokens\": 30,\n      \"stream\": false\n  }'\n```\n\nResponses from the KServe Hugging Face LLM server follow the OpenAI API format and include token usage metrics, which Inferno extracts and adds as headers on the response.\n\n### Semantic Cache\n\n```bash\ncurl -v \n```\n\n### Prompt Guard\n\n```bash\ncurl -v \n```\n\n### Token Usage Metrics\n\n```bash\ncurl -v \n```\n\n## Testing\n\nTo run the unit tests locally, use the following command:\n\n```bash\nmake test\n```\n\n**Note:** The test suite is at an early stage and is not yet comprehensive. More tests will be added in the future.\n\n\n  # TODO completions vs chat-completions\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuadrant%2Finferno","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkuadrant%2Finferno","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuadrant%2Finferno/lists"}