{"id":42196534,"url":"https://github.com/shengyanli1982/llmproxy","last_synced_at":"2026-01-27T00:17:46.586Z","repository":{"id":293951035,"uuid":"985580445","full_name":"shengyanli1982/llmproxy","owner":"shengyanli1982","description":"🧭🧭 An intelligent load balancer with smart scheduling that unifies diverse LLMs.","archived":false,"fork":false,"pushed_at":"2025-06-23T07:10:13.000Z","size":1862,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-23T07:39:46.992Z","etag":null,"topics":["agent","agentic-ai","ai-gateway","azure","gateway","langchain","large-language-model","llm","llm-gateway","llmapi","llmops","moa","openai","openai-api","optimization","prompt-engineering","proxy","proxy-server","vertex-ai"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shengyanli1982.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-18T04:31:04.000Z","updated_at":"2025-06-23T07:10:17.000Z","dependencies_parsed_at":"2025-05-18T06:24:01.141Z","dependency_job_id":"808e53eb-e4cf-4ecf-be70-bc882da9c96a","html_url":"https://github.com/shengyanli1982/llmproxy","commit_stats":null,"previous_names":["shengyanli1982/llmproxy"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/shengyanli1982/llmproxy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shengyanli1982%2Fllmproxy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sheng
yanli1982%2Fllmproxy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shengyanli1982%2Fllmproxy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shengyanli1982%2Fllmproxy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shengyanli1982","download_url":"https://codeload.github.com/shengyanli1982/llmproxy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shengyanli1982%2Fllmproxy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28792666,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-26T21:49:50.245Z","status":"ssl_error","status_checked_at":"2026-01-26T21:48:29.455Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","agentic-ai","ai-gateway","azure","gateway","langchain","large-language-model","llm","llm-gateway","llmapi","llmops","moa","openai","openai-api","optimization","prompt-engineering","proxy","proxy-server","vertex-ai"],"created_at":"2026-01-27T00:17:46.170Z","updated_at":"2026-01-27T00:17:46.451Z","avatar_url":"https://github.com/shengyanli1982.png","language":"Rust","readme":"English | [中文](./README_CN.md)\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./images/logo.png\" alt=\"logo\" width=\"650\"\u003e\n\u003c/div\u003e\n\n**LLMProxy: Enterprise-grade 
intelligent proxy and load balancer designed specifically for large language models. Unify management and orchestration of various LLM services (public cloud APIs, privately deployed vLLM/Ollama, etc.), enabling efficient, stable, and scalable LLM application access in multi-cloud/hybrid cloud architectures while minimizing client code modifications.**\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#introduction\"\u003eIntroduction\u003c/a\u003e\n  |\n  \u003ca href=\"#quick-start\"\u003eQuick Start\u003c/a\u003e\n  |\n  \u003ca href=\"#core-features\"\u003eCore Features\u003c/a\u003e\n  |\n  \u003ca href=\"#use-cases\"\u003eUse Cases\u003c/a\u003e\n  |\n  \u003ca href=\"#configuration-guide\"\u003eConfiguration Guide\u003c/a\u003e\n  |\n  \u003ca href=\"#advanced-deployment\"\u003eAdvanced Deployment\u003c/a\u003e\n  |\n  \u003ca href=\"#understanding-llmproxy\"\u003eUnderstanding LLMProxy\u003c/a\u003e\n  |\n  \u003ca href=\"#api-endpoints\"\u003eAPI Endpoints\u003c/a\u003e\n  |\n  \u003ca href=\"#prometheus-metrics\"\u003ePrometheus Metrics\u003c/a\u003e\n  |\n  \u003ca href=\"#license\"\u003eLicense\u003c/a\u003e\n\u003c/p\u003e\n\n## Introduction\n\n**LLMProxy** is an enterprise-grade high-availability intelligent proxy and load balancing solution designed specifically for large language model (LLM) APIs. It serves as a unified entry point, receiving client requests and efficiently distributing them to various upstream LLM services (such as OpenAI, Anthropic, or privately deployed vLLM, Ollama, etc.) through flexible routing strategies and LLM-optimized load balancing algorithms, then safely returning responses. 
LLMProxy aims to solve the challenges of performance, cost, availability, and management complexity faced when directly calling LLM APIs, significantly improving the stability, response speed, and resource utilization efficiency of LLM applications through fine-grained traffic scheduling, connection management, and fault tolerance mechanisms.\n\n### Why Choose LLMProxy?\n\nLLMProxy effectively addresses key challenges in enterprise-level LLM API deployments:\n\n-   **Unified LLM Access \u0026 High Availability**: Aggregates multiple LLM services (different cloud providers, private models like vLLM/Ollama), eliminating single points of failure and ensuring business continuity through intelligent routing and failover.\n-   **LLM-Optimized Load Balancing**: Built-in strategies (round-robin, weighted, response-time aware) specifically optimized for LLM long connections and streaming responses, dynamically allocating requests to the best service nodes, balancing cost and performance.\n-   **Powerful Fault Tolerance \u0026 Resilience**: Integrated circuit breaker pattern automatically isolates failing upstream services to prevent cascading failures; supports request retries to improve success rates of LLM calls in complex network environments.\n-   **Easy Scaling \u0026 Cost Control**: Add or reduce upstream LLM services as needed, seamlessly expanding processing capacity; optimize LLM call expenses by prioritizing low-cost or high-performance resources through load balancing strategies.\n-   **Simplified Integration \u0026 Management**: Provides a unified API entry point, shielding backend LLM service differences and simplifying client integration; centrally manage routing, authentication, and security policies through configuration files.\n-   **Enhanced Observability**: Provides detailed Prometheus metrics for real-time monitoring of LLM service calls, proxy performance, and fault diagnosis.\n\n## Quick Start\n\n### 1. 
Running the Application Directly\n\nThis is the quickest way to experience LLMProxy without complex environment setup.\n\n**Step 1: Download the Pre-compiled Binary**\n\nVisit the project's [GitHub Releases](https://github.com/shengyanli1982/llmproxy/releases) page and download the latest pre-compiled binary package for your operating system (Windows, Linux, macOS), such as `llmproxyd-Linux-x64-\u003cversion\u003e.zip` or `llmproxyd-Windows-x64-\u003cversion\u003e.zip`.\n\nAfter downloading, extract the file to get an executable named `llmproxyd-\u003cos\u003e-\u003carch\u003e` (Linux/macOS) or `llmproxyd-windows-x64.exe` (Windows).\n\n**Step 2: Create a Configuration File**\n\nIn the same directory as the executable, create a file named `config.yaml`. LLMProxy is designed to proxy large language models, and here's a minimal configuration example that forwards requests from local port `3000` to the OpenAI API. Note that you'll need to replace `YOUR_OPENAI_API_KEY_HERE` with your actual OpenAI API key for it to work:\n\n```yaml\nhttp_server:\n    forwards:\n        - name: \"llm_openai_service\"      # Forward service name\n          port: 3000                      # Port LLMProxy listens on\n          address: \"0.0.0.0\"              # Listen on all network interfaces\n          default_group: \"openai_main_group\" # Link to the upstream group defined below\n    admin:\n        port: 9000                      # Admin port for monitoring\n        address: \"127.0.0.1\"            # Recommended local-only access for security\n\nupstreams:\n    - name: \"openai_chat_api\"           # Upstream service name, e.g., OpenAI\n      url: \"https://api.openai.com/v1\"  # Base URL for OpenAI API\n      auth:\n          type: \"bearer\"                # Authentication type is Bearer Token\n          token: \"YOUR_OPENAI_API_KEY_HERE\" # !!IMPORTANT!! 
Replace with your actual OpenAI API key\n                                      # If you don't have an OpenAI key, you can choose another LLM service or use a mock service for testing\n\nupstream_groups:\n    - name: \"openai_main_group\"         # Upstream group name\n      upstreams:\n          - name: \"openai_chat_api\"       # Reference to the openai_chat_api upstream defined above\n      # [Optional] Configure a longer timeout for LLM requests\n      http_client:\n          timeout:\n              request: 300 # LLM requests typically need more time; 300 seconds or more is recommended. Enforced only when http_client.stream is false\n```\n\nThis configuration defines a forwarding service listening on port `3000` that routes requests to an upstream group named `openai_main_group`. The upstream group is configured with the OpenAI API as its backend service and sets a 300-second request timeout. For more detailed and advanced configuration options, refer to the `config.default.yaml` file in the project's root directory.\n\n**Step 3: Run LLMProxy**\n\nOpen a terminal or command prompt, navigate to the directory containing `llmproxyd-\u003cos\u003e-\u003carch\u003e` (or `llmproxyd-windows-x64.exe`) and `config.yaml`, then execute the following commands:\n\n-   For Linux/macOS:\n    ```bash\n    mv llmproxyd-\u003cos\u003e-\u003carch\u003e llmproxyd\n    chmod +x llmproxyd # You may need to add execute permissions the first time\n    ./llmproxyd --config config.yaml\n    ```\n-   For Windows:\n    ```bash\n    .\\llmproxyd-windows-x64.exe --config config.yaml\n    ```\n\nIf all goes well, you'll see LLMProxy start up and begin listening on the configured ports.\n\n**Step 4: Test the Proxy Service**\n\nOpen another terminal and use `curl` or a similar tool to send an LLM API request to the forwarding port configured in LLMProxy. 
For example, if your `config.yaml` has the `llm_openai_service` service listening on port `3000` and the upstream is the OpenAI API, you can try sending a chat request (make sure your request body follows the OpenAI API format and replace the API key in the `Authorization` header as needed):\n\n```bash\n# Note: The Authorization key below is typically provided by the client application and forwarded by LLMProxy.\n# LLMProxy itself also configures a server-side key in the upstreams section.\ncurl http://localhost:3000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer YOUR_CLIENT_SIDE_OPENAI_API_KEY\" \\\n  -d '{\n    \"model\": \"gpt-3.5-turbo\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hello, LLMProxy! Please introduce yourself.\"}],\n    \"stream\": false\n  }'\n```\n\nThis request will be received by LLMProxy and forwarded to the OpenAI upstream service (`https://api.openai.com/v1`) defined in your `config.yaml`, according to the configuration (including authentication, load balancing, etc.). You should see a JSON response from the OpenAI API. If the upstream service supports streaming responses and you enable streaming in your request (e.g., `\"stream\": true`), LLMProxy will also correctly handle the streaming data.\n\nYou've now successfully run LLMProxy! Next, you can explore more advanced configurations and features.\n\n### 2. Deployment with Docker (Recommended)\n\nUsing Docker Compose is one of the most convenient ways to deploy LLMProxy. Complete Docker Compose configuration examples are provided in the project's `examples/config` directory.\n\n1. **Prepare the Configuration File**:\n\n    Place your custom `config.yaml` file in the same directory as the `docker-compose.yaml` file.\n\n2. **Start the Service**:\n\n    ```bash\n    docker-compose up -d\n    ```\n\n3. **View Running Logs**:\n\n    ```bash\n    docker-compose logs -f llmproxy # llmproxy is the service name defined in the compose file\n    ```\n\n4. 
**Stop the Service**:\n\n    ```bash\n    docker-compose down\n    ```\n\nDocker Compose Configuration Example (refer to `examples/config/docker-compose.yaml` in the project for the latest version):\n\n```yaml\nversion: \"3.8\" # Using a newer compose version is recommended\n\nservices:\n    llmproxy:\n        image: shengyanli1982/llmproxy:latest # For production, using a specific tag version is recommended\n        container_name: llmproxy_service\n        restart: unless-stopped\n        ports:\n            # Map ports according to the forwards defined in your config.yaml\n            - \"3000:3000\" # Example: mapping the forwarding service listening on port 3000 in the config file\n            # - \"3001:3001\"\n            # Admin interface port mapping\n            - \"127.0.0.1:9000:9000\" # Recommended to map the admin port only to the local loopback address\n        volumes:\n            - ./config.yaml:/app/config.yaml:ro # Mount your configuration file into the container\n            # If you need to persist logs, you can mount a log directory\n            # - ./logs:/app/logs\n        command: [\"--config\", \"/app/config.yaml\"]\n        environment:\n            - TZ=America/New_York # Set container timezone\n            # You can override some configurations with environment variables, for example:\n            # - LLMPROXY_UPSTREAMS__0__AUTH__TOKEN=your_env_openai_key\n        networks:\n            - llmproxy_net\n\nnetworks:\n    llmproxy_net:\n        driver: bridge\n```\n\n## Core Features\n\n-   🔄 **Intelligent LLM Routing \u0026 Request Handling**\n\n    -   Configure independent forwarding services for different business scenarios or LLM models through `http_server.forwards`, enabling fine-grained management.\n    -   Customize dedicated listening addresses and ports for each forwarding service.\n    -   Flexibly route requests to specified upstream LLM service groups based on paths, headers, or other request characteristics.\n\n-   🌐 
**Unified Upstream LLM Service Management**\n\n    -   Centrally define and manage various upstream LLM services (public cloud APIs, private model services like vLLM/Ollama, etc.) through `upstreams`.\n    -   Independently name, configure URLs, and manage health for each upstream LLM service.\n    -   Built-in authentication proxy mechanisms (Bearer Token, API Key Header Injection, Basic Auth) to securely connect with different types of LLM services.\n    -   Flexible HTTP header operations (add, delete, modify) to adapt to special requirements of different LLM APIs or inject tracking information.\n\n-   ⚡ **LLM-Optimized Load Balancing**\n\n    -   Use `upstream_groups` to organize functionally similar or mutually backup LLM services into upstream groups for unified scheduling and high availability.\n    -   Provide multiple load balancing strategies optimized for LLMs:\n        -   **Round Robin** - Distribute requests evenly among upstream LLM services.\n        -   **Weighted Round Robin** - Distribute requests to different LLM services according to preset weights (e.g., service processing capacity, cost considerations).\n        -   **Random** - Randomly select an available LLM service.\n        -   **Response Aware** - Especially suitable for LLM services, monitoring node performance in real-time (response latency, concurrent load, success rate) and dynamically directing requests to the currently optimal node, maximizing throughput and user experience.\n        -   **Failover** - Try upstream services in the order they are listed. 
If the current upstream is unavailable, automatically switch to the next one, providing sequential backup capability.\n    -   Set weights for each upstream LLM service in the weighted round-robin strategy.\n    -   **Dynamic Load Balancer Updates** - Dynamically update the upstream list for any load balancer at runtime through API calls, allowing for seamless addition, removal, or modification of upstream services without service interruption or restart.\n\n-   🔁 **Flexible Traffic Control \u0026 QoS Assurance**\n\n    -   Configure rate limits based on IP or other identifiers (requests/second, concurrent request peaks) for each forwarding service.\n    -   Protect backend LLM services from malicious attacks or traffic surges, ensuring service quality (QoS) for core business.\n\n-   🔌 **LLM-Optimized Connection Management**\n\n    -   **Inbound Connection Management:** Configure precise connection timeouts for client connections.\n    -   **Outbound Connection \u0026 Request Optimization (for upstream LLM services):**\n        -   Custom User-Agent for upstream service identification and statistics.\n        -   TCP Keepalive to maintain long-lasting connections with upstream LLM services, reducing handshake latency, especially beneficial for streaming responses.\n        -   Fine-grained timeout control (connection timeout, request timeout, idle timeout) to accommodate diverse response time characteristics of LLM services (from seconds to minutes).\n        -   Configure intelligent retry strategies for transient errors that may occur with LLM APIs (configurable attempt counts, initial backoff intervals, and exponential backoff).\n        -   Support connecting to upstream LLM services through outbound HTTP/HTTPS proxies to meet enterprise network security and compliance requirements.\n        -   **Native Streaming Support \u0026 Timeout Control**: Natively handles LLM streaming responses (Server-Sent Events) via the `http_client.stream` setting. 
When enabled (default), it disables the request timeout, which is crucial for long-lived streaming connections that would otherwise be prematurely terminated. When disabled, it enforces a fixed request timeout, making it ideal for non-streaming API calls.\n\n-   🛡️ **Robust Fault Tolerance \u0026 Failover**\n\n    -   **Intelligent Circuit Breaker:** Automatically monitor the health status of upstream LLM services (based on error rates) and quickly isolate failing nodes when thresholds are reached.\n    -   **Configurable Circuit Breaking Policies:** Customize circuit breaking thresholds (e.g., failure rate) and cooldown times (waiting time to enter half-open state after breaking) for each upstream LLM service.\n    -   **Automatic Recovery \u0026 Probing:** Periodically attempt to send probe requests to failed nodes after circuit breaking, automatically reintegrating them into the load balancing pool once service is restored.\n    -   **Seamless Failover:** When an upstream LLM service in a group fails or trips the circuit breaker, automatically and smoothly switch traffic to other healthy nodes in the group, transparent to clients, ensuring business continuity.\n\n-   📊 **Observability \u0026 Management Interface**\n    -   Provide independent management interface and API endpoints through `http_server.admin`.\n    -   Offer `/health` health check endpoint for integration with various monitoring and automated operations systems.\n    -   Expose rich Prometheus metrics through the `/metrics` endpoint, providing comprehensive insights into LLM proxy performance, traffic, errors, latency, upstream LLM service health status, and circuit breaker states.\n\n## Use Cases\n\nLLMProxy is designed for enterprise-level application scenarios that require efficient, reliable, and scalable access to and management of large language model APIs:\n\n-   **Enterprise AI Application Gateway**:\n\n    -   Provides a unified LLM API access entry point for multiple applications or 
teams within an enterprise.\n    -   Centrally implements access authentication and API key configuration for large language models.\n\n-   **Multi-tenancy and Service Isolation**:\n\n    -   Achieve a multi-tenant architecture within a single LLMProxy instance by configuring independent `forwards` and `upstream_groups` for different teams, applications, or customers (tenants).\n    -   Assign a unique access endpoint (port) to each tenant and apply separate routing rules, API keys, rate limits, and load balancing strategies.\n    -   This is particularly useful for SaaS platforms that need to provide customized LLM services to different customers, or for isolating resources and billing for different departments within an enterprise.\n\n-   **High-Availability, High-Concurrency LLM Services**:\n\n    -   Build high-traffic AI products for end users (such as intelligent customer service, content generation tools, AI assistants).\n    -   Ensure uninterrupted service through load balancing and failover across instances from multiple LLM providers (such as OpenAI, Anthropic, Azure OpenAI, Google Gemini) or self-built models (vLLM, Ollama).\n    -   Utilize advanced strategies like response-time awareness to dynamically allocate traffic to the best-performing nodes, enhancing user experience.\n\n-   **LLM Application Development \u0026 Testing Acceleration**:\n\n    -   Simplify the complexity of developer integration with multiple LLM APIs, decoupling application code from specific LLM services.\n    -   Easily switch between different LLM models or providers for effect evaluation and cost comparison.\n    -   Simulate different upstream responses (such as latency, errors) for test environments, or isolate test traffic.\n\n-   **Multi-Cloud/Hybrid Cloud LLM Strategy Implementation**:\n\n    -   Provide a unified LLM API access layer in complex cloud environments (such as AWS, Azure, GCP, and on-premises data centers in hybrid deployment).\n    -   Route requests to 
specific geographic locations or specific types of LLM services based on data sovereignty, compliance requirements, or cost factors.\n    -   Deploy as an independent service in container orchestration platforms like Kubernetes to provide LLM access capabilities for microservices.\n\n-   **API Version \u0026 Compatibility Management**:\n    -   When backend LLM APIs upgrade or undergo incompatible changes, LLMProxy can serve as an adaptation layer, maintaining compatibility with older clients through header operations or lightweight transformations (possibly supported in future versions).\n\nBy applying LLMProxy in these scenarios, enterprises can significantly enhance the reliability, performance, and manageability of their LLM applications while reducing integration and operational complexity.\n\n## Configuration Guide\n\nLLMProxy uses structured YAML files for configuration, offering flexible and powerful configuration options. Below is a detailed explanation of key configuration sections:\n\n### Configuration Options Explained\n\n#### HTTP Server Configuration Options\n\n| Configuration Item                              | Type    | Default   | Description                                                                                    |\n| ----------------------------------------------- | ------- | --------- | ---------------------------------------------------------------------------------------------- |\n| `http_server.forwards[].name`                   | String  | -         | **[Required]** Unique identifier name for the forwarding service                               |\n| `http_server.forwards[].port`                   | Integer | 3000      | **[Required]** Listening port for the forwarding service                                       |\n| `http_server.forwards[].address`                | String  | \"0.0.0.0\" | Binding network address for the forwarding service                                             |\n| `http_server.forwards[].default_group`      
    | String  | -         | **[Required]** Name of the default upstream group when no routing rules match                  |\n| `http_server.forwards[].routing`                | Array   | null      | **[Optional]** Advanced routing rules configuration. If omitted, routing is disabled           |\n| `http_server.forwards[].routing[].path`         | String  | -         | **[Required]** Path pattern for this routing rule                                              |\n| `http_server.forwards[].routing[].target_group` | String  | -         | **[Required]** Name of the upstream group for this route, must be defined in `upstream_groups` |\n| `http_server.forwards[].ratelimit`              | Object  | null      | **[Optional]** Rate limiting configuration. If omitted, rate limiting is disabled              |\n| `http_server.forwards[].ratelimit.per_second`   | Integer | 100       | Maximum number of requests allowed per second per IP (range: 1-10000)                          |\n| `http_server.forwards[].ratelimit.burst`        | Integer | 200       | Number of burst requests allowed per IP (buffer size) (range: 1-20000)                         |\n| `http_server.forwards[].timeout`                | Object  | null      | **[Optional]** Timeout configuration. If omitted, default values are used                      |\n| `http_server.forwards[].timeout.connect`        | Integer | 10        | Timeout for client connections to LLMProxy (seconds)                                           |\n| `http_server.admin.port`                        | Integer | 9000      | Optional listening port for the admin service                                                  |\n| `http_server.admin.address`                     | String  | \"0.0.0.0\" | Binding network address for the admin service                                                  |\n| `http_server.admin.timeout`                     | Object  | null      | **[Optional]** Timeout configuration. 
If omitted, default values are used                      |\n| `http_server.admin.timeout.connect`             | Integer | 10        | Timeout for connections to the admin interface (seconds)                                       |\n\n#### Upstream Service Configuration Options (Upstream LLM Services)\n\n| Configuration Item              | Type    | Default | Description                                                                                                                    |\n| ------------------------------- | ------- | ------- | ------------------------------------------------------------------------------------------------------------------------------ |\n| `upstreams[].name`              | String  | -       | **[Required]** Unique identifier name for the upstream LLM service                                                             |\n| `upstreams[].url`               | String  | -       | **[Required]** Full URL for the upstream LLM service (e.g., `https://api.openai.com/v1/chat/completions`)                      |\n| `upstreams[].auth.type`         | String  | \"none\"  | Authentication type: `bearer`, `basic`, or `none`                                                                              |\n| `upstreams[].auth.token`        | String  | -       | API key or token when `type` is `bearer`                                                                                       |\n| `upstreams[].auth.username`     | String  | -       | Username when `type` is `basic`                                                                                                |\n| `upstreams[].auth.password`     | String  | -       | Password when `type` is `basic`                                                                                                |\n| `upstreams[].headers[].op`      | String  | -       | HTTP header operation type: `insert` (add if not exists), `replace` (replace or add), `remove`                                 |\n| 
`upstreams[].headers[].key`     | String  | -       | Name of the HTTP header to operate on                                                                                          |\n| `upstreams[].headers[].value`   | String  | -       | Header value for `insert` or `replace` operations                                                                              |\n| `upstreams[].breaker.threshold` | Float   | 0.5     | Circuit breaker trigger threshold, representing failure rate (0.01-1.0), e.g., 0.5 means 50% failures trigger circuit breaking |\n| `upstreams[].breaker.cooldown`  | Integer | 30      | Circuit breaker cooldown time (seconds), i.e., how long after breaking to try half-open state (1-3600)                         |\n\n#### Upstream Group Configuration Options (Upstream LLM Groups)\n\n\u003e [!NOTE]\n\u003e\n\u003e The parameter `upstreams[].url` should be configured with the full URL of the upstream service, e.g., `https://api.openai.com/v1/chat/completions`, not `https://api.openai.com/v1` or `https://api.openai.com`.\n\n| Configuration Item                              | Type    | Default        | Description                                                                                                                                                                                                                                        |\n| ----------------------------------------------- | ------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `upstream_groups[].name`                        | String  | -              | **[Required]** Unique identifier name for the upstream group                                                                                                                                                             
                          |\n| `upstream_groups[].upstreams[].name`            | String  | -              | **[Required]** Referenced upstream LLM service name, must be defined in the `upstreams` section                                                                                                                                                    |\n| `upstream_groups[].upstreams[].weight`          | Integer | 1              | Weight value effective only when `balance.strategy` is `weighted_roundrobin`, used for proportional request allocation (range: 1-65535)                                                                                                            |\n| `upstream_groups[].balance.strategy`            | String  | \"roundrobin\"   | Load balancing strategy: `roundrobin`, `weighted_roundrobin`, `random`, `response_aware` or `failover`                                                                                                                                             |\n| `upstream_groups[].http_client.agent`           | String  | \"LLMProxy/1.0\" | User-Agent header value sent to upstream LLM services                                                                                                                                                                                              |\n| `upstream_groups[].http_client.keepalive`       | Integer | 30             | TCP Keepalive time (seconds), range 5-600, 0 is not allowed. Helps keep connections with upstream LLM services active, reducing latency                                                                                                            |\n| `upstream_groups[].http_client.stream`          | Boolean | true           | Controls the request timeout behavior. If `true` (default), the request timeout is disabled, which is **essential** for LLM streaming responses (Server-Sent Events). If `false`, `timeout.request` is enforced, suitable for non-streaming calls. 
|\n| `upstream_groups[].http_client.timeout`         | Object  | null           | **[Optional]** Timeout configuration. If omitted, default values are used                                                                                                                                                                          |\n| `upstream_groups[].http_client.timeout.connect` | Integer | 10             | Timeout for connecting to upstream LLM services (seconds) (range: 1-120)                                                                                                                                                                           |\n| `upstream_groups[].http_client.timeout.request` | Integer | 300            | Request timeout (seconds) for non-streaming requests. Only effective when `http_client.stream` is `false`. Defines the maximum waiting time for a complete upstream response. (range: 1-1200)                                                      |\n| `upstream_groups[].http_client.timeout.idle`    | Integer | 60             | Timeout (seconds) after which a connection with an upstream LLM service is considered idle and closed if no activity (range: 5-1800)                                                                                                               |\n| `upstream_groups[].http_client.retry`           | Object  | null           | **[Optional]** Request retry configuration. 
If omitted, retry functionality is disabled                                                                                                                                                            |\n| `upstream_groups[].http_client.retry.attempts`  | Integer | 3              | Maximum number of retry attempts (excluding the first attempt) (range: 1-100)                                                                                                                                                                      |\n| `upstream_groups[].http_client.retry.initial`   | Integer | 500            | Initial waiting time (milliseconds) before the first retry, subsequent retry intervals may use exponential backoff (range: 100-10000)                                                                                                              |\n| `upstream_groups[].http_client.proxy`           | Object  | null           | **[Optional]** Outbound proxy configuration. If omitted, no proxy will be used                                                                                                                                                                     |\n| `upstream_groups[].http_client.proxy.url`       | String  | -              | Outbound proxy server URL (e.g., `http://user:pass@proxy.example.com:8080`)                                                                                                                                                                        |\n\n### HTTP Server Configuration\n\n```yaml\nhttp_server:\n    # Forward service configuration (handling client inbound requests)\n    forwards:\n        - name: \"to_openai_llm_group\" # [Required] Unique name for the forwarding service\n          port: 3000 # [Required] Port this service listens on\n          address: \"0.0.0.0\" # [Optional] Binding network address (default: \"0.0.0.0\", listen on all interfaces)\n          default_group: \"openai_llm_group\" # [Required] Default upstream 
group when no routing rules match\n          # [Optional] Advanced path-based routing rules\n          routing:\n              # Static path rule - highest priority matching\n              - path: \"/api/specific-endpoint\"\n                target_group: \"specialized_llm_group\"\n              # Named parameter rule\n              - path: \"/api/users/:id/history\"\n                target_group: \"user_history_group\"\n              # Regular expression rule - only matches numeric IDs\n              - path: \"/api/items/{id:[0-9]+}\"\n                target_group: \"item_api_group\"\n          ratelimit: # [Optional] IP rate limiting\n              enabled: true # Whether to enable rate limiting (default: false)\n              per_second: 100 # Maximum requests per second per IP\n              burst: 200 # Burst request capacity per IP\n          timeout: # [Optional] Client connection timeout\n              connect: 10 # Timeout for client connections to LLMProxy (seconds)\n\n    # Admin interface configuration\n    admin:\n        port: 9000 # [Optional] Admin interface port (for /metrics, /health)\n        address: \"127.0.0.1\" # [Optional] Binding network address (default: \"0.0.0.0\", recommended \"127.0.0.1\" for production)\n        timeout:\n            connect: 10 # Timeout for connections to the admin interface (seconds)\n```\n\n### Advanced Path-Based Routing\n\nLLMProxy supports sophisticated path-based routing within each forward service, allowing requests to be directed to different upstream groups based on the request path. This feature enables creating complex routing topologies and service meshes for LLM services.\n\n#### Routing Rule Configuration\n\nRouting rules are defined under the `routing` array in each forward service configuration. When a request arrives, LLMProxy evaluates all configured routing rules in order and selects the first matching rule. 
If no rules match, the request is sent to the `default_group`.\n\n```yaml\nforwards:\n    - name: \"advanced_routing_service\"\n      port: 3000\n      address: \"0.0.0.0\"\n      default_group: \"fallback_group\" # Used when no routing rules match\n      routing:\n          - path: \"/path/to/match\"\n            target_group: \"target_upstream_group\"\n          # Additional rules...\n```\n\n#### Path Matching Patterns\n\nLLMProxy supports several path matching patterns, from simple static paths to complex regex-based patterns:\n\n1. **Static Paths**: Exact path matching with highest priority.\n\n    ```yaml\n    - path: \"/api/users/admin\"\n      target_group: \"admin_api_group\"\n    ```\n\n2. **Named Parameters**: Match variable path segments using `:param_name` syntax.\n\n    ```yaml\n    - path: \"/api/users/:id\"\n      target_group: \"user_api_group\" # Matches \"/api/users/123\", \"/api/users/abc\", etc.\n    ```\n\n3. **Regex-Constrained Parameters**: Match path segments with regex constraints using `{param:regex}` syntax.\n\n    ```yaml\n    - path: \"/api/items/{id:[0-9]+}\"\n      target_group: \"item_api_group\" # Matches \"/api/items/42\", but not \"/api/items/abc\"\n    ```\n\n4. **Complex Regex Patterns**: Match specific formats with more complex regex patterns.\n\n    ```yaml\n    - path: \"/api/products/{code:[A-Z][A-Z][A-Z][0-9][0-9][0-9]}\"\n      target_group: \"product_api_group\" # Matches \"/api/products/ABC123\"\n    ```\n\n5. **Wildcards**: Use `*` to match single path segments or trailing content.\n\n    ```yaml\n    - path: \"/api/*/docs\"\n      target_group: \"api_docs_group\" # Matches \"/api/v1/docs\", \"/api/v2/docs\"\n\n    - path: \"/files/*\"\n      target_group: \"file_server_group\" # Matches any path starting with \"/files/\"\n    ```\n\n6. 
**Mixed Patterns**: Combine various patterns for complex routing needs.\n    ```yaml\n    - path: \"/api/:version/users/{id:[0-9]+}/profile\"\n      target_group: \"user_profile_group\" # Matches \"/api/v2/users/42/profile\"\n    ```\n\n#### Routing Matching Priority\n\nWhen multiple rules could match a path, `LLMProxy` follows these priority rules:\n\n1. Static paths always have the highest priority\n2. Parameterized paths with more static segments have higher priority\n3. Regex-constrained parameters have higher priority than unconstrained parameters\n4. Longer path patterns have higher priority than shorter ones\n5. Rules are evaluated in the order they are defined\n\n#### Use Cases for Advanced Routing\n\n-   **Model-Specific Routing**: Direct different model requests to specialized upstream groups.\n\n    ```yaml\n    - path: \"/v1/chat/completions\" # Default chat completions\n      target_group: \"standard_models_group\"\n    - path: \"/v1/embeddings\"\n      target_group: \"embedding_models_group\"\n    ```\n\n-   **Version-Based Routing**: Support API versioning or A/B testing.\n\n    ```yaml\n    - path: \"/api/v1/*\"\n      target_group: \"api_v1_group\"\n    - path: \"/api/v2/*\"\n      target_group: \"api_v2_group\"\n    ```\n\n-   **Resource-Based Routing**: Route different resource types to specialized services.\n    ```yaml\n    - path: \"/api/images/*\"\n      target_group: \"image_generation_group\"\n    - path: \"/api/text/*\"\n      target_group: \"text_generation_group\"\n    ```\n\n### Upstream Service Configuration (Defining Backend LLM API Services)\n\n```yaml\nupstreams:\n    - name: \"openai_gpt4_primary\" # [Required] Unique identifier name for the upstream LLM service\n      url: \"https://api.openai.com/v1\" # [Required] Base URL for the upstream LLM API\n      auth: # [Optional] Authentication configuration\n          type: \"bearer\" # Authentication type: \"bearer\", \"basic\", or \"none\" (default)\n          token: 
\"YOUR_OPENAI_API_KEY\" # [Required for bearer auth] API key/token\n          # username: \"user\"         # [Required for basic auth] Username\n          # password: \"pass\"         # [Required for basic auth] Password\n      headers: # [Optional] HTTP header operations (modify before forwarding request to this upstream)\n          - op: \"insert\" # Operation type: \"insert\", \"replace\", or \"remove\"\n            key: \"X-Custom-Proxy-Header\" # Name of the HTTP header to operate on\n            value: \"LLMProxy-OpenAI-GPT4\" # Header value (for \"insert\" or \"replace\" operations)\n      breaker: # [Optional] Circuit breaker configuration for this upstream\n          threshold: 0.5 # Failure rate threshold to trigger the circuit breaker (0.01-1.0, default: 0.5)\n          cooldown: 30 # Cooling period before entering half-open state (seconds) (1-3600, default: 30)\n\n    - name: \"anthropic_claude_haiku\"\n      url: \"https://api.anthropic.com\" # Example: Anthropic API\n      auth:\n          type: \"bearer\" # Anthropic also typically uses Bearer Token\n          token: \"YOUR_ANTHROPIC_API_KEY\"\n      headers: # Anthropic may require anthropic-version in header\n          - op: \"insert\"\n            key: \"anthropic-version\"\n            value: \"2023-06-01\"\n          - op: \"insert\" # anthropic-beta: messages-2023-12-15, max-tokens-3-5-sonnet-2024-07-15, etc.\n            key: \"anthropic-beta\"\n            value: \"max-tokens-3-5-sonnet-2024-07-15\"\n      breaker:\n          threshold: 0.4\n          cooldown: 45\n```\n\n### Upstream Group Configuration (Organizing Upstreams and Defining Load Balancing Behavior)\n\n```yaml\nupstream_groups:\n    - name: \"openai_llm_group\" # [Required] Unique identifier name for the upstream group\n      upstreams: # [Required] List of upstream LLM services in this group (at least one)\n          - name: \"openai_gpt4_primary\" # Reference to the `name` defined in the `upstreams` section\n            
weight: 8 # [Optional] Weight, only effective when `balance.strategy` is \"weighted_roundrobin\" (default: 1)\n          # - name: \"another_openai_backup_service\"\n          #   weight: 2\n      balance:\n          strategy:\n              \"weighted_roundrobin\" # Load balancing strategy:\n              # \"roundrobin\" (default round-robin),\n              # \"weighted_roundrobin\" (weighted round-robin),\n              # \"random\" (random),\n              # \"response_aware\" (response time aware, recommended for LLM),\n              # \"failover\" (failover strategy, tries upstreams in order)\n      http_client: # [Optional] Define how LLMProxy communicates with upstream LLM services in this group\n          agent: \"LLMProxy/1.0 (OpenAI-Group)\" # [Optional] Custom User-Agent header\n          keepalive: 90 # [Optional] TCP keepalive time (seconds) (0-600, 0=disabled, default: 60)\n          stream: true # [Optional] Enable streaming mode (important for LLM streaming responses, default: true)\n          timeout:\n              connect: 15 # Timeout for connecting to upstream LLM services (seconds) (default: 10)\n              request: 360 # Request timeout (seconds) (default: 300, may need higher for time-consuming LLMs)\n              idle: 90 # Idle connection timeout (seconds) (default: 60)\n          retry: # [Optional] Request retry configuration\n              attempts: 3 # Maximum retry attempts\n              initial: 1000 # Initial waiting time before first retry (milliseconds)\n          proxy: # [Optional] Outbound proxy configuration\n              url: \"http://user:pass@your-proxy-server.com:8080\" # Proxy server URL\n```\n\n### Configuration Best Practices\n\n1. **Security Recommendations**:\n\n    - Strictly limit access to the admin interface (`admin` service). 
Bind it to the local loopback address (`address: \"127.0.0.1\"`), and consider using firewall rules or a reverse proxy (like Nginx) to add additional authentication and access control.\n\n2. **Performance \u0026 Cost Optimization**:\n\n    - Fine-tune timeout configurations (`timeout.connect`, `timeout.request`, `timeout.idle`) and retry strategies (`retry`) based on the characteristics of different LLM service providers' APIs (such as response time, concurrency limits, billing models).\n    - For LLM services that support streaming responses, ensure `http_client.stream: true` (this is the default value) to receive and forward data with minimal latency.\n    - Configure rate limits (`ratelimit`) reasonably to both protect backend LLM services from overload and meet business needs during peak periods.\n    - Leverage the `weighted_roundrobin` load balancing strategy, combined with the cost and performance of different LLM services, to direct more traffic to services with better price-performance ratios.\n    - For latency-sensitive applications, prioritize using the `response_aware` load balancing strategy, which can dynamically select the best-performing upstream service at any given time.\n\n3. **Reliability \u0026 Resilience Design**:\n    - Configure reasonable circuit breaker parameters (`breaker.threshold`, `breaker.cooldown`) for each upstream LLM service (`upstreams`). Threshold settings need to balance the sensitivity of fault detection and the inherent volatility of the service.\n    - Configure multiple upstream LLM service instances in each upstream group (`upstream_groups`) to achieve redundancy and automatic failover. These can be different regional nodes of the same provider or alternative services from different providers.\n    - Enable request retries (by configuring the `http_client.retry` object) only for idempotent or safely retryable LLM API calls. 
Note that some LLM operations (such as content generation) may not be idempotent.\n    - Regularly monitor Prometheus metrics data about circuit breaker status, upstream error rates, request latency, etc., and use this information to optimize configurations and troubleshoot potential issues.\n\nFor detailed explanations of all available configuration options, please refer to the `config.default.yaml` file included with the LLMProxy project as a complete reference.\n\n### Example: Multi-tenancy Configuration\n\nLLMProxy can easily achieve multi-tenancy or service isolation by mapping different `forwards` (listening on different ports) to different `upstream_groups`. Each `upstream_group` can have its own independent upstream LLM services, load balancing strategies, and client behavior configurations. This allows a single LLMProxy instance to serve multiple independent clients or applications while maintaining configuration and traffic isolation.\n\nThe following example shows how to configure independent proxy services for two tenants (`tenant-a` and `tenant-b`):\n\n-   `tenant-a` accesses the service on port `3001` with its own dedicated OpenAI API key and rate limiting policy.\n-   `tenant-b` accesses the service on port `3002`, uses a different API key, is configured with a Failover strategy, and has stricter rate limits.\n\n```yaml\nhttp_server:\n    forwards:\n        - name: \"tenant-a-service\"\n          port: 3001\n          address: \"0.0.0.0\"\n          default_group: \"tenant-a-group\"\n          ratelimit:\n              enabled: true\n              per_second: 50 # Rate limit for Tenant A\n              burst: 100\n        - name: \"tenant-b-service\"\n          port: 3002\n          address: \"0.0.0.0\"\n          default_group: \"tenant-b-group\"\n          ratelimit:\n              enabled: true\n              per_second: 20 # Rate limit for Tenant B\n              burst: 40\n\nupstreams:\n    - name: \"openai_primary_for_a\"\n      url: 
\"https://api.openai.com/v1\"\n      auth:\n          type: \"bearer\"\n          token: \"TENANT_A_OPENAI_API_KEY\" # API key for Tenant A\n    - name: \"openai_primary_for_b\"\n      url: \"https://api.openai.com/v1\"\n      auth:\n          type: \"bearer\"\n          token: \"TENANT_B_OPENAI_API_KEY\" # API key for Tenant B\n    - name: \"openai_backup_for_b\"\n      url: \"https://api.openai.com/v1\"\n      auth:\n          type: \"bearer\"\n          token: \"TENANT_B_BACKUP_API_KEY\" # Backup API key for Tenant B\n\nupstream_groups:\n    # Configuration group for Tenant A\n    - name: \"tenant-a-group\"\n      upstreams:\n          - name: \"openai_primary_for_a\"\n      balance:\n          strategy: \"roundrobin\" # Simple round-robin\n      http_client:\n          timeout:\n              request: 300\n\n    # Configuration group for Tenant B\n    - name: \"tenant-b-group\"\n      upstreams:\n          - name: \"openai_primary_for_b\" # Primary service\n          - name: \"openai_backup_for_b\" # Backup service\n      balance:\n          strategy: \"failover\" # Automatically switch to backup on failure\n      http_client:\n          timeout:\n              request: 360\n```\n\n## Advanced Deployment\n\nLLMProxy supports various flexible deployment methods, including Kubernetes cluster deployment and traditional Linux system service deployment. Here are detailed instructions for each deployment method:\n\n### Kubernetes Deployment\n\nFor Kubernetes environments, we provide recommended deployment configuration files (Deployment, Service, ConfigMap, etc.) in the `examples/config/kubernetes` directory.\n\n1. 
**Prepare Configuration Files and Secrets**:\n\n    - Put your `config.yaml` content into `examples/config/kubernetes/configmap.yaml`, or create a ConfigMap through other means.\n    - **Strongly recommended**: Store sensitive information like API keys in Kubernetes Secrets, and inject them into the Pod via environment variables or volume mounts, then reference these environment variables in `config.yaml`.\n\n2. **Apply Deployment Manifests**:\n\n    ```bash\n    # (Optional) Create namespace\n    kubectl apply -f examples/config/kubernetes/namespace.yaml\n\n    # Create ConfigMap (containing LLMProxy configuration)\n    kubectl apply -f examples/config/kubernetes/configmap.yaml -n llmproxy\n\n    # (Optional) Create Secrets (for storing API keys)\n    # kubectl create secret generic llm-api-keys -n llmproxy \\\n    #   --from-literal=OPENAI_API_KEY='your_openai_key' \\\n    #   --from-literal=ANTHROPIC_API_KEY='your_anthropic_key'\n\n    # Create Deployment\n    kubectl apply -f examples/config/kubernetes/deployment.yaml -n llmproxy\n\n    # Create Service (exposing LLMProxy service)\n    kubectl apply -f examples/config/kubernetes/service.yaml -n llmproxy\n    ```\n\n3. **Verify Deployment Status**:\n\n    ```bash\n    kubectl get pods -n llmproxy -l app=llmproxy\n    kubectl get services -n llmproxy llmproxy\n    kubectl logs -n llmproxy -l app=llmproxy -f # View logs\n    ```\n\n4. **Access the Service**:\n\n    - **Internal Cluster Access** (via Service name):\n      `http://llmproxy.llmproxy.svc.cluster.local:\u003cport\u003e` (port according to Service definition)\n    - **External Cluster Access**:\n      Typically achieved by configuring an Ingress resource, or using a `LoadBalancer` type Service (if your K8s environment supports it). 
For port forwarding testing:\n        ```bash\n        kubectl port-forward svc/llmproxy -n llmproxy 3000:3000 9000:9000 # Forward service ports to local\n        ```\n        Then you can access the forwarding service via `http://localhost:3000` and the admin interface via `http://localhost:9000`.\n\n### Linux System Service Deployment (using systemd)\n\nFor traditional Linux server environments, you can use systemd to manage the LLMProxy service.\n\n1. **Download and Install the Binary**:\n   Visit [GitHub Releases](https://github.com/shengyanli1982/llmproxy/releases) to download the latest `llmproxyd-Linux-x64-\u003cversion\u003e.zip`.\n\n    ```bash\n    # Example version, please replace with the latest version\n    VERSION=\"0.1.0\" # Assuming this is the version you downloaded\n    curl -L -o llmproxyd-Linux-x64.zip https://github.com/shengyanli1982/llmproxy/releases/download/v${VERSION}/llmproxyd-Linux-x64-${VERSION}.zip\n    unzip llmproxyd-Linux-x64.zip\n    sudo mv llmproxyd-Linux-x64 /usr/local/bin/llmproxyd\n    sudo chmod +x /usr/local/bin/llmproxyd\n    ```\n\n2. **Create Configuration Directory and File**:\n\n    ```bash\n    sudo mkdir -p /etc/llmproxy\n    sudo nano /etc/llmproxy/config.yaml\n    # Paste your configuration content in the editor, and ensure sensitive information like API keys is properly handled\n    # (e.g., future versions may support reading from environment variables)\n    ```\n\n3. **Create a Dedicated System User (Recommended)**:\n\n    ```bash\n    sudo useradd --system --no-create-home --shell /usr/sbin/nologin llmproxyuser\n    sudo chown -R llmproxyuser:llmproxyuser /etc/llmproxy\n    # If there's a log directory, it should also be authorized\n    ```\n\n4. 
**Create systemd Service File**:\n   Copy the `examples/config/llmproxy.service` file to `/etc/systemd/system/llmproxy.service`.\n\n    ```bash\n    sudo cp examples/config/llmproxy.service /etc/systemd/system/llmproxy.service\n    ```\n\n    Edit `/etc/systemd/system/llmproxy.service` as needed, especially `User`, `Group` (if you created a dedicated user), and the configuration file path in `ExecStart` (`--config /etc/llmproxy/config.yaml`).\n\n5. **Reload systemd, Start, and Enable Auto-start**:\n\n    ```bash\n    sudo systemctl daemon-reload\n    sudo systemctl start llmproxy\n    sudo systemctl enable llmproxy\n    ```\n\n6. **Check Service Status and Logs**:\n    ```bash\n    sudo systemctl status llmproxy\n    sudo journalctl -u llmproxy -f # View real-time logs\n    ```\n\n### Security Best Practices (Common to All Deployment Methods)\n\n1.  **API Key and Credential Management**:\n\n    -   **Absolutely avoid** hardcoding API keys in plain text in configuration files.\n    -   **Containerization/Kubernetes**: Prioritize using Secrets to manage credentials and inject them into LLMProxy containers via environment variables or file mounts.\n    -   **System Service**: Restrict configuration file read permissions, consider using external configuration sources or environment variables (pending LLMProxy support).\n    -   Regularly rotate all API keys.\n\n2.  **Network Security**:\n\n    -   **Admin Interface Security**: Bind the admin port (`admin.port`, default 9000) to the local loopback address (`127.0.0.1`) or an internal trusted network, never expose it to the public internet. Consider deploying a reverse proxy in front of it and adding authentication.\n    -   **TLS/SSL Encryption**: For forwarding services facing the public internet, it's recommended to deploy a reverse proxy (such as Nginx, Traefik, Caddy, or cloud provider's load balancer) in front of LLMProxy to handle TLS termination and certificate management. 
LLMProxy itself focuses on proxy logic.\n    -   Use firewalls to restrict unnecessary port access.\n\n3.  **Principle of Least Privilege**:\n\n    -   **Container**: Run the LLMProxy process as a non-root user (future version images will support this). Ensure the container filesystem is as read-only as possible.\n    -   **System Service**: Use a dedicated, low-privilege system user to run the LLMProxy service.\n\n4.  **Logging and Monitoring**:\n    -   Configure a log collection solution to aggregate LLMProxy logs (including access logs and error logs) to a centralized log management system (such as ELK Stack, Grafana Loki, Splunk).\n    -   Continuously monitor Prometheus metrics exposed through the `/metrics` endpoint, and set up alerts for critical metrics (such as error rates, latency, circuit breaker status).\n\n## Understanding LLMProxy\n\n### Architecture\n\nLLMProxy uses a modular, high-performance asynchronous architecture. Core components include:\n\n-   **HTTP Listeners (Forward Servers)**: Based on asynchronous HTTP servers, responsible for listening for client requests, with each `forwards` configuration item corresponding to an independent listening instance.\n-   **Request Processor \u0026 Router**: Parses incoming requests and routes them to the specified upstream group according to configured routing rules.\n-   **Upstream Group Manager**: Manages a group of logically related upstream LLM services, containing load balancers and HTTP client pools.\n-   **Upstream Service Instances**: Each represents a specific backend LLM API endpoint, including its URL, authentication information, circuit breaker configuration, etc.\n-   **Load Balancer**: Embedded in each upstream group, distributes requests among healthy available upstream services according to the selected strategy (round-robin, weighted round-robin, random, response-time aware, or failover).\n-   **HTTP Client**: Responsible for establishing connections with upstream LLM services and sending 
requests, supporting connection pooling, timeout control, retries, streaming, etc.\n-   **Circuit Breaker**: Attached to each upstream service instance, continuously monitors its health, opens the circuit when the failure rate reaches the configured threshold to prevent failure propagation, and probes for recovery after the cooldown period.\n-   **Metrics Collector**: Based on the Prometheus client, collects and exposes detailed performance and operational metrics in real-time.\n-   **Configuration Manager**: Responsible for loading, parsing, and validating the `config.yaml` file.\n\n![architecture](./images/architecture.png)\n_Figure: LLMProxy core architecture diagram (simplified version)_\n\n### Warm Restarts on Linux\n\nTo enhance service availability, LLMProxy leverages the `SO_REUSEPORT` socket option on `Linux` systems for both its forwarding and admin services. This feature allows multiple instances of LLMProxy to listen on the same port, enabling seamless restarts and upgrades with near-zero downtime. When a new process starts, it can immediately begin accepting new connections on the shared port, while the old process completes any ongoing requests before gracefully shutting down. **A very small number of connections may still be dropped during the handover, which is negligible in practice.** This mechanism minimizes connection drops during deployments and significantly simplifies high-availability setups. Please note that this feature is specific to `Linux` and is not available on other operating systems like `Windows` or `macOS`.\n\n### Response-Time Aware Load Balancing Algorithm\n\nLLMProxy's response-time aware (`response_aware`) load balancing algorithm is an intelligent scheduling strategy designed specifically for large language models, which typically have high and variable response times and are computationally intensive. 
Unlike traditional round-robin or random strategies, this algorithm is specifically designed for services like LLMs with highly variable response times. It dynamically allocates new requests to the best service node by analyzing the comprehensive performance of upstream nodes in real-time (combining average response time, current concurrent load, and request success rate).\n\n#### How It Works\n\n1.  **Real-time Performance Sampling \u0026 Smoothing**: The system continuously collects and records key performance metrics for each upstream LLM service node:\n\n    -   **Average Response Time**: Calculated using an exponentially weighted moving average algorithm, smoothing short-term fluctuations to better reflect recent performance trends.\n    -   **In-flight Requests**: The number of concurrent requests currently being processed but not yet completed by the node.\n    -   **Request Success Rate**: The percentage of requests successfully completed recently.\n\n2.  **Dynamic Health \u0026 Comprehensive Scoring**: Combined with circuit breaker status, only healthy (non-broken) nodes are considered. For healthy nodes, a comprehensive performance score is calculated using a formula similar to the following, where a lower score indicates better node performance:\n\n    $$\\text{Score} = \\text{ResponseTime} \\times (\\text{ProcessingRequests} + 1) \\times \\frac{1}{\\text{SuccessRate}}$$\n\n    Where:\n\n    -   $\\text{ResponseTime}$ is the node's average response time (milliseconds)\n    -   $\\text{ProcessingRequests}$ is the number of concurrent requests currently being processed by the node\n    -   $\\text{SuccessRate}$ is the node's request success rate (a value between 0-1)\n\n![score](./images/response_aware_parameter_impact_en.png)\n_Figure: Impact of various parameters on selection probability in the response-time aware algorithm (illustration)_\n\n3.  
**Intelligent Node Selection**:\n\n    -   When a new request arrives, the load balancer traverses all healthy (non-broken) nodes in the current upstream group.\n    -   Calculates the real-time performance score for each healthy node.\n    -   Selects the node with the lowest score (i.e., best overall performance) to handle the current request.\n    -   The in-flight request count for the selected node is incremented accordingly.\n\n4.  **Continuous Adaptive Adjustment**:\n    -   After a request is completed, its actual response time, success status, etc. are recorded.\n    -   This information is used to update the node's average response time, success rate, and other statistics.\n    -   The in-flight request count is decremented accordingly.\n    -   This continuous feedback loop enables the algorithm to dynamically adapt to real-time changes in upstream LLM service performance.\n\n#### Advantages\n\n-   **Dynamic Adaptability**: Automatically adapts to real-time fluctuations and sudden load spikes in upstream LLM service performance without manual intervention.\n-   **LLM Optimization**: Particularly suitable for handling the high latency and latency uncertainty of LLM requests (applicable from millisecond to minute-level responses).\n-   **Multi-dimensional Consideration**: Comprehensively considers latency, concurrency, and success rate, avoiding concentrating traffic on a single slow node or overloaded node.\n-   **Smooth Distribution**: Smoothing techniques like exponentially weighted moving averages avoid decision oscillations due to momentary jitters, providing more stable load distribution.\n-   **High-concurrency Performance**: Algorithm design emphasizes efficiency, ensuring minimal overhead in high-concurrency scenarios.\n-   **Fault Avoidance**: Tightly integrated with the circuit breaker mechanism, automatically excluding faulty or broken nodes.\n\n#### Applicable Scenarios\n\nThis algorithm is particularly suitable for the following application 
scenarios:\n\n-   **Large Language Model API Proxying**: Effectively handles the high latency and latency uncertainty of LLM requests (applicable from millisecond to minute-level responses).\n-   **Heterogeneous Upstream Services**: When the upstream group contains LLM services with different performance characteristics and costs (e.g., mixing GPT-4 and GPT-3.5, or models from different vendors).\n-   **Service Quality Sensitive Applications**: Scenarios with strict requirements for LLM response latency.\n\n#### Configuration Example\n\n```yaml\nupstream_groups:\n    - name: \"mixed_llm_services\"\n      upstreams:\n          # The response-time aware strategy doesn't directly use the 'weight' parameter, it evaluates dynamically\n          - name: \"fast_but_expensive_llm\"\n          - name: \"slower_but_cheaper_llm\"\n          - name: \"another_llm_provider_model\"\n      balance:\n          strategy: \"response_aware\" # Enable response-time aware load balancing\n      http_client:\n          # ... (recommended to configure appropriate timeout and retry for LLM requests)\n          timeout:\n              request: 300 # Adjust according to the expected time of the slowest model\n```\n\n### Circuit Breaker Mechanism\n\nLLMProxy integrates a circuit breaker for each upstream service instance. It improves the resilience and stability of the whole system and keeps a failure in one upstream from propagating to the rest of the system or to clients.\n\n#### How It Works\n\nThe circuit breaker emulates the behavior of a fuse in an electrical circuit, following a three-state lifecycle model:\n\n1.  **Closed State**:\n\n    -   Initial and normal operating state. 
All requests directed to this upstream service are allowed through.\n    -   LLMProxy continuously monitors the success and failure of requests sent to this upstream (typically based on HTTP status codes or connection errors).\n    -   If the failure rate within the defined statistical window reaches the configured threshold (`breaker.threshold`), the circuit breaker transitions to the \"Open\" state.\n\n2.  **Open State**:\n\n    -   The circuit breaker is activated (\"tripped\"). At this point, LLMProxy will **immediately reject** all new requests directed to this failing upstream service without actually attempting to connect.\n    -   This avoids unnecessary timeout waits, enables fast failure, and reduces pressure on the failing upstream, giving it time to recover.\n    -   If this upstream service belongs to an upstream group, the load balancer will direct traffic to other healthy members in the group (if they exist).\n    -   After the \"Open\" state persists for the configured cooldown period (`breaker.cooldown`), the circuit breaker transitions to the \"Half-Open\" state.\n\n3.  **Half-Open State**:\n    -   The circuit breaker allows a small number of \"probe\" requests (typically a single one) through, attempting to connect to the previously failing upstream service.\n    -   **If the probe request succeeds**: The system assumes that the upstream service has likely recovered. The circuit breaker resets and transitions back to the \"Closed\" state, resuming normal traffic.\n    -   **If the probe request fails**: The system treats the upstream service as still unstable. 
The circuit breaker returns to the \"Open\" state and starts a new cooling timer.\n\n#### Coordination with Load Balancing\n\n-   When an upstream service's circuit breaker is in the \"Open\" or \"Half-Open\" (after a failed probe) state, the load balancer treats it as an unavailable node and will not assign new user requests to it.\n-   Intelligent failover: If an upstream group has multiple upstream services, and one or more are broken, the load balancer will automatically distribute traffic to the remaining healthy (\"Closed\" state) services.\n-   Only when all upstream services in a group are unavailable (e.g., all broken) will requests to that group fail.\n\n#### Advantages\n\n-   **Fast Failure \u0026 Resource Protection**: Quickly identifies and isolates failing upstreams, preventing client requests from waiting for long periods or depleting proxy resources due to unresponsive upstreams.\n-   **Preventing Cascading Failures**: By isolating problem services, prevents their failure pressure from propagating to other parts of the system or causing client retry storms.\n-   **Automatic Recovery Detection**: No manual intervention needed; automatically probes whether upstream services have recovered and reintegrates them into service when they have.\n-   **Fine-grained Configuration**: Can independently configure circuit breaking thresholds and cooldown times for each upstream service instance (`upstreams[].breaker`) to adapt to different service characteristics.\n-   **Enhancing Overall System Resilience**: Enables the system to maintain partial functionality or gracefully degrade when faced with instability or failure in some backend LLM services.\n-   **Observability**: Monitor circuit breaker status changes and behavior through Prometheus metrics for operational purposes.\n\n#### Configuration Example\n\n```yaml\nupstreams:\n    - name: \"openai_service_main\"\n      url: \"https://api.openai.com/v1\"\n      # ... 
auth config ...\n      breaker:\n          threshold: 0.5 # When the failure rate of requests to this upstream reaches 50% within the statistical window, the circuit breaker opens\n          cooldown: 30 # After breaking, wait 30 seconds to enter the half-open state to attempt recovery\n\n    - name: \"critical_custom_llm_service\"\n      url: \"http://my.custom.llm:8080\"\n      breaker:\n          threshold: 0.3 # For more critical or less volatile services, a lower failure rate threshold can be set\n          cooldown: 60 # If the service recovers more slowly, the cooling time can be appropriately extended\n```\n\n## API Endpoints\n\nLLMProxy exposes the following main types of HTTP API endpoints:\n\n### Forwarding Endpoints\n\n-   **Path \u0026 Port**: Determined by the `port`, `address` in the `http_server.forwards[]` configuration and the original path of the client request.\n    -   _Example_: If configured with `port: 3000`, `address: \"0.0.0.0\"`, then client requests to `http://\u003cllmproxy_host\u003e:3000/v1/chat/completions` will be handled by this forwarding service.\n-   _Description_: These are the main working endpoints of LLMProxy. 
Client applications (such as your AI application backend) send standard LLM API requests (e.g., OpenAI, Anthropic format) to these endpoints.\n-   _Protocol_: HTTP (LLMProxy itself does not currently handle HTTPS termination directly; it's recommended to use a reverse proxy like Nginx in front to handle TLS).\n-   _Usage_: LLMProxy receives these requests, processes them according to the configured `upstream_group` (load balancing, authentication injection, header modification, circuit breaking, etc.), then forwards the request to the selected upstream LLM API service, and returns the upstream's response (including streaming responses) to the client.\n\n### Admin Endpoints\n\nThese endpoints are defined by the `http_server.admin` configuration block, by default listening on `0.0.0.0:9000` (recommended to change to `127.0.0.1:9000` in production environments).\n\n#### Standard Endpoints\n\n-   **GET /health**\n\n    -   _Description_: Provides basic health checking. Mainly used for automated systems (such as Kubernetes Liveness/Readiness Probes, load balancer health checks) to determine if the LLMProxy service process is running and able to respond to requests.\n    -   _Returns_:\n        -   `200 OK`: Indicates the service is basically normal.\n        -   (Future versions may add more detailed health status, such as whether configuration loading was successful, etc.)\n    -   _Content Type_: `text/plain` or `application/json`\n\n-   **GET /metrics**\n\n    -   _Description_: Exposes Prometheus format monitoring metrics. 
Prometheus servers scrape this endpoint to monitor LLMProxy's performance, traffic, errors, upstream status, etc.\n    -   _Returns_: Prometheus metrics in text format.\n    -   _Content Type_: `text/plain; version=0.0.4; charset=utf-8`\n\n#### Configuration Management API\n\n![openapi_ui](./images/openapi-ui.png)\n_Figure: OpenAPI UI Example_\n\nFor enhanced operational visibility and easier debugging, the admin service provides a comprehensive Configuration Management API. This RESTful API allows you to inspect and modify the live, in-memory configuration of LLMProxy at any time without a service restart, which is useful for auditing, troubleshooting, dynamic reconfiguration, and integration with automated operational workflows.\n\nThe API is versioned under the `/api/v1` path prefix. For security, access to these API endpoints can be protected by setting the `LLMPROXY_ADMIN_AUTH_TOKEN` environment variable, which enforces Bearer token authentication.\n\n**API Endpoints**\n\nThe API offers a structured set of endpoints to retrieve and modify all key configuration entities:\n\n-   **Forwards**:\n    -   `GET /api/v1/forwards`: Retrieves a list of all configured forward services.\n    -   `GET /api/v1/forwards/{name}`: Fetches the details of a specific forward service.\n-   **Upstream Groups**:\n    -   `GET /api/v1/upstream-groups`: Lists all configured upstream groups.\n    -   `GET /api/v1/upstream-groups/{name}`: Fetches the details of a specific group.\n    -   `PATCH /api/v1/upstream-groups/{name}`: Updates the upstreams list of a specific group. 
This operation atomically replaces the entire upstreams list with the new one provided.\n-   **Upstreams**:\n    -   `GET /api/v1/upstreams`: Lists all configured upstream services.\n    -   `GET /api/v1/upstreams/{name}`: Fetches the details of a specific upstream.\n    -   `POST /api/v1/upstreams`: Creates a new upstream service.\n    -   `PUT /api/v1/upstreams/{name}`: Updates an existing upstream service.\n    -   `DELETE /api/v1/upstreams/{name}`: Deletes an upstream service (with dependency protection to prevent deletion if the service is referenced by any upstream group).\n\n**Dynamic Configuration**\n\nThe dynamic configuration API enables hot-reloading of LLMProxy's configuration without service restart:\n\n-   **Upstream Service Management**: Create, update, or delete upstream services on-the-fly.\n-   **Upstream Group Management**: Reconfigure upstream groups by modifying their upstreams list.\n-   **Dependency Protection**: Built-in safeguards prevent breaking changes, such as deleting an upstream that's currently in use.\n-   **Configuration Consistency**: All modifications maintain the integrity of LLMProxy's configuration.\n\n**Interactive OpenAPI UI**\n\nTo make exploring and interacting with the Configuration Management API as easy as possible, LLMProxy includes a built-in OpenAPI UI. This interactive interface provides comprehensive documentation for all endpoints and allows you to execute API calls directly from your browser.\n\n-   **Access**: The OpenAPI UI is available at the `/api/v1/docs` path on the admin port. For security reasons, it is **only enabled** when LLMProxy is launched in debug mode (e.g., by using the `-d` or `--debug` command-line flag).\n\n## Prometheus Metrics\n\nLLMProxy exposes comprehensive Prometheus metrics through the admin endpoint's `/metrics` path for real-time monitoring of system performance, request handling, upstream service health status, and internal component states. 
These metrics are crucial for operations, troubleshooting, and capacity planning.\n\nBelow are the key metric categories and examples:\n\n### HTTP Server \u0026 Request Metrics (for inbound requests from clients to LLMProxy)\n\n-   `llmproxy_http_requests_total` (Counter)\n    -   Description: Total number of HTTP requests received.\n    -   Labels: `forward` (forwarding service name), `method` (HTTP method), `path` (request path).\n-   `llmproxy_http_request_duration_seconds` (Histogram)\n    -   Description: Latency distribution of HTTP request processing.\n    -   Labels: `forward`, `method`, `path`.\n-   `llmproxy_http_request_errors_total` (Counter)\n    -   Description: Total number of errors that occurred while processing HTTP requests.\n    -   Labels: `forward`, `error`, `status`.\n-   `llmproxy_ratelimit_total` (Counter)\n    -   Description: Total number of requests rejected due to rate limiting.\n    -   Labels: `forward`.\n\n### Upstream Client Metrics (for outbound requests from LLMProxy to backend LLM services)\n\n-   `llmproxy_upstream_requests_total` (Counter)\n    -   Description: Total number of requests sent to upstream LLM services.\n    -   Labels: `group` (upstream group name), `upstream` (upstream service name).\n-   `llmproxy_upstream_duration_seconds` (Histogram)\n    -   Description: Latency distribution of sending requests to upstream LLM services and receiving responses.\n    -   Labels: `group`, `upstream`.\n-   `llmproxy_upstream_errors_total` (Counter)\n    -   Description: Total number of errors that occurred when communicating with upstream LLM services.\n    -   Labels: `error` (error type), `group`, `upstream`.\n\n### Circuit Breaker Metrics\n\n-   `llmproxy_circuitbreaker_state_changes_total` (Counter)\n    -   Description: Total number of circuit breaker state transitions.\n    -   Labels: `group`, `upstream`, `url`, `from` (original state), `to` (new state).\n-   `llmproxy_circuitbreaker_calls_total` (Counter)\n    -   
Description: Total number of calls processed through the circuit breaker (including successful, failed, rejected ones).\n    -   Labels: `group`, `upstream`, `url`, `result` (result type).\n\nThese metrics can be scraped by Prometheus and then visualized and configured for alerting using tools like Grafana, enabling comprehensive monitoring of the LLMProxy service and the LLM API calls it proxies.\n\n## License\n\n[MIT License](LICENSE)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshengyanli1982%2Fllmproxy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshengyanli1982%2Fllmproxy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshengyanli1982%2Fllmproxy/lists"}