{"id":26296102,"url":"https://github.com/k2/olol","last_synced_at":"2025-06-11T23:02:52.652Z","repository":{"id":281726825,"uuid":"944817508","full_name":"K2/olol","owner":"K2","description":"# Ollama\u003c=\u003eOllama Inference Cluster","archived":false,"fork":false,"pushed_at":"2025-03-10T20:49:05.000Z","size":4086,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-10T21:22:44.564Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/K2.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-08T02:59:10.000Z","updated_at":"2025-03-10T20:49:09.000Z","dependencies_parsed_at":"2025-03-10T21:22:48.184Z","dependency_job_id":"fcc78cdf-2944-4b6c-b430-5a4eb2f38527","html_url":"https://github.com/K2/olol","commit_stats":null,"previous_names":["k2/olol"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/K2/olol","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/K2%2Folol","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/K2%2Folol/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/K2%2Folol/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/K2%2Folol/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/K2","download_url":"https://codeload.github.com/K2/olol/tar.gz/refs/heads/main","sbom_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/K2%2Folol/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259360728,"owners_count":22845817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-15T04:16:58.842Z","updated_at":"2025-06-11T23:02:52.636Z","avatar_url":"https://github.com/K2.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OLOL - Ollama Load Balancing and Clustering\n\nA distributed inference system that allows you to build a powerful multi-host cluster for Ollama AI models with transparent scaling and fault tolerance.\n\nOLOL (Ollama Load Balancer) is a Python package providing gRPC interfaces with both synchronous and asynchronous support for distributed inference across multiple Ollama instances.\n\n![OLOL imagery](olol.png \"OLOL project mascot\")\n\n## Overview\n\nThis system provides a unified API endpoint that transparently distributes inference requests across multiple Ollama instances running on different hosts. It maintains compatibility with the Ollama API while adding clustering capabilities.  
\n\n### Key Features\n\n- **Transparent API Compatibility**: Drop-in replacement for the Ollama API\n- **Automatic Load Balancing**: Distributes requests across available servers\n- **Model Awareness**: Routes requests to servers that have the requested model\n- **Session Affinity**: Maintains chat history consistently across requests\n- **Redundancy**: Can pull models to multiple servers for high availability\n- **Monitoring**: Built-in status endpoint to monitor the cluster\n- **Distributed Inference**: Automatically splits large models across multiple servers for faster inference\n\n## Architecture\n\nThe system consists of these main components:\n\n1. **gRPC Server**: Runs on each inference host with local Ollama installed\n2. **API Proxy Client**: Provides a unified Ollama API endpoint for applications\n3. **RPC Server**: Enables distributed inference by splitting model layers across servers\n4. **Inference Coordinator**: Manages distributed inference and model partitioning\n5. **Protocol Buffer Definition**: Defines the communication contract\n\n### Distributed Inference\n\nOLOL supports distributed inference, which allows you to split large models across multiple servers:\n\n- **Layer Partitioning**: Automatically splits model layers across available servers\n- **Auto-Detection**: Automatically uses distributed inference for large models (13B+)\n- **Hardware Optimization**: Allocates model layers based on server hardware capabilities\n- **Transparent API**: No changes required to your client code\n- **Advanced Options**: Fine-tune distribution with API options\n\nDistributed inference is particularly useful for:\n- Running very large models (\u003e13B parameters) across multiple smaller machines\n- Speeding up inference by parallelizing model computation\n- Enabling models too large to fit in a single machine's memory\n\n### Model Quantization\n\nOLOL intelligently handles model quantization:\n\n- **Smart Quantization**: Automatically selects the best 
quantization level based on hardware and model size\n- **Compatibility Detection**: Checks if a compatible quantization is already loaded\n- **On-Demand Loading**: Can load models with appropriate quantization when needed\n- **Quantization Fallbacks**: Can use higher-quality quantization to serve lower-quality requests\n\n### Auto-Discovery\n\nOLOL includes an auto-discovery system for zero-configuration clustering:\n\n- **Server Auto-Registration**: RPC servers automatically find and register with the proxy\n- **Capability Broadcasting**: Servers advertise their hardware capabilities and device types\n- **Dynamic Scaling**: New servers are automatically added to the cluster when they come online\n- **Subnet Detection**: Servers automatically scan the subnet to find the proxy\n\nQuantization compatibility rules:\n- `q4_0` (smallest memory usage): Compatible with models loaded as q4_0, q4_1, q5_0, q5_1, or q8_0\n- `q5_0` (balanced): Compatible with models loaded as q5_0, q5_1, or q8_0\n- `q8_0` (highest quality): Only compatible with q8_0\n- `f16` (unquantized): Only compatible with f16\n\nWhen requested quantization isn't available, OLOL will:\n1. Try to find a model with compatible (higher-quality) quantization\n2. Try to load the model with the requested quantization\n3. 
If that fails, load with the best quantization for the available hardware\n\n### System Architecture\n\n```mermaid\nflowchart TD\n    Client[Client Application] --\u003e|HTTP API Requests| Proxy[API Proxy]\n    \n    subgraph Load Balancer\n        Proxy --\u003e|Model Registry| Registry[Model Registry]\n        Proxy --\u003e|Session Tracking| Sessions[Session Manager]\n        Proxy --\u003e|Server Monitoring| Monitor[Server Monitor]\n    end\n    \n    Registry -.-\u003e|Updates| Proxy\n    Sessions -.-\u003e|State| Proxy\n    Monitor -.-\u003e|Status Updates| Proxy\n    \n    Proxy --\u003e|gRPC| Server1[Inference Server 1]\n    Proxy --\u003e|gRPC| Server2[Inference Server 2]\n    Proxy --\u003e|gRPC| Server3[Inference Server 3]\n    \n    subgraph \"Inference Server 1\"\n        Server1 --\u003e|CLI Commands| Ollama1[Ollama]\n        Ollama1 --\u003e|Local Models| ModelDir1[Model Storage]\n    end\n    \n    subgraph \"Inference Server 2\"\n        Server2 --\u003e|CLI Commands| Ollama2[Ollama]\n        Ollama2 --\u003e|Local Models| ModelDir2[Model Storage]\n    end\n    \n    subgraph \"Inference Server 3\" \n        Server3 --\u003e|CLI Commands| Ollama3[Ollama]\n        Ollama3 --\u003e|Local Models| ModelDir3[Model Storage]\n    end\n    \n    class Client,Proxy,Registry,Sessions,Monitor,Server1,Server2,Server3 componentNode;\n    class Ollama1,Ollama2,Ollama3,ModelDir1,ModelDir2,ModelDir3 resourceNode;\n    \n    classDef componentNode fill:#b3e0ff,stroke:#9cc,stroke-width:2px;\n    classDef resourceNode fill:#ffe6cc,stroke:#d79b00,stroke-width:1px;\n```\n\n### Entity Relationship Model\n\n```mermaid\nerDiagram\n    APIProxy ||--o{ InferenceServer : \"routes-requests-to\"\n    APIProxy ||--o{ Session : \"manages\"\n    APIProxy ||--o{ ModelRegistry : \"maintains\"\n    APIProxy ||--o{ LoadBalancer : \"uses\"\n    InferenceServer ||--o{ Model : \"hosts\"\n    InferenceServer ||--o{ Session : \"maintains-state-for\"\n    InferenceServer ||--o{ Metrics : 
\"generates\"\n    Session }o--|| Model : \"uses\"\n    Model }|--|| ModelRegistry : \"registered-in\"\n    Client }|--o{ APIProxy : \"connects-to\"\n    Session }o--o{ ChatHistory : \"contains\"\n    LoadBalancer }o--|| HealthCheck : \"performs\"\n    \n    APIProxy {\n        string host\n        int port\n        array servers\n        object session_map\n        object model_map\n        int max_workers\n        bool async_mode\n    }\n    \n    InferenceServer {\n        string host\n        int port\n        int current_load\n        bool online\n        array loaded_models\n        array active_sessions\n        float cpu_usage\n        float memory_usage\n        int gpu_memory\n        timestamp last_heartbeat\n    }\n    \n    Model {\n        string name\n        string tag\n        string family\n        int size_mb\n        string digest\n        array compatible_servers\n        json parameters\n        float quantization\n        string architecture\n    }\n    \n    Session {\n        string session_id\n        string model_name\n        array messages\n        timestamp created_at\n        timestamp last_active\n        string server_host\n        json model_parameters\n        float timeout\n        string status\n    }\n    \n    ChatHistory {\n        string role\n        string content\n        timestamp timestamp\n        float temperature\n        int tokens_used\n        float completion_time\n        json metadata\n    }\n    \n    Client {\n        string application_type\n        string api_version\n        string client_id\n        json preferences\n        timestamp connected_at\n    }\n    \n    ModelRegistry {\n        map model_to_servers\n        int total_models\n        timestamp last_updated\n        json model_stats\n        array pending_pulls\n        json version_info\n    }\n\n    LoadBalancer {\n        string algorithm\n        int max_retries\n        float timeout\n        json server_weights\n        bool 
sticky_sessions\n        json routing_rules\n    }\n\n    Metrics {\n        string server_id\n        float response_time\n        int requests_per_second\n        float error_rate\n        json resource_usage\n        timestamp collected_at\n    }\n\n    HealthCheck {\n        string check_id\n        string status\n        int interval_seconds\n        timestamp last_check\n        json error_details\n        int consecutive_failures\n    }\n```\n\n### Request Flow Sequence\n```mermaid\nsequenceDiagram\n    participant Client as Client Application\n    participant Proxy as API Proxy\n    participant Registry as Model Registry\n    participant SessionMgr as Session Manager\n    participant Server1 as Inference Server 1\n    participant Server2 as Inference Server 2\n    participant Ollama as Ollama CLI/HTTP\n    \n    Client-\u003e\u003e+Proxy: POST /api/chat (model: llama2)\n    Proxy-\u003e\u003e+Registry: Find servers with llama2\n    Registry--\u003e\u003e-Proxy: Server1 and Server2 available\n    \n    Proxy-\u003e\u003eSessionMgr: Create/Get Session\n    alt New Session\n        SessionMgr--\u003e\u003eProxy: New Session ID\n        Note over Proxy,Server1: Select Server1 (lowest load)\n        Proxy-\u003e\u003e+Server1: CreateSession(session_id, llama2)\n        Server1-\u003e\u003eOllama: ollama run llama2\n        Server1--\u003e\u003e-Proxy: Session Created\n    else Existing Session\n        SessionMgr--\u003e\u003eProxy: Existing Session on Server2\n        Note over Proxy,Server2: Maintain Session Affinity\n    end\n    \n    Proxy-\u003e\u003e+Server1: ChatMessage(session_id, message)\n    Server1-\u003e\u003eOllama: ollama run with history\n    Ollama--\u003e\u003eServer1: Response\n    Server1--\u003e\u003e-Proxy: Chat Response\n    Proxy--\u003e\u003e-Client: JSON Response\n    \n    Note right of Client: Later: Model Update\n    \n    Client-\u003e\u003e+Proxy: POST /api/pull (model: mistral)\n    Proxy-\u003e\u003e+Registry: Check model 
status\n    Registry--\u003e\u003e-Proxy: Not available\n    \n    par Pull to Server1\n        Proxy-\u003e\u003e+Server1: PullModel(\"mistral\")\n        Server1-\u003e\u003eOllama: ollama pull mistral\n        Server1--\u003e\u003eProxy: Stream Progress\n    and Pull to Server2\n        Proxy-\u003e\u003e+Server2: PullModel(\"mistral\") \n        Server2-\u003e\u003eOllama: ollama pull mistral\n        Server2--\u003e\u003eProxy: Stream Progress\n    end\n    \n    Server1--\u003e\u003e-Proxy: Pull Complete\n    Server2--\u003e\u003e-Proxy: Pull Complete\n    Proxy-\u003e\u003eRegistry: Update model-\u003eserver map\n    Proxy--\u003e\u003e-Client: Pull Complete Response\n```\n\n\n## Installation\n\n```bash\n# Install from PyPI (once published)\nuv pip install olol\n\n# Install with extras\nuv pip install \"olol[proxy,async]\"\n\n# Development installation\ngit clone https://github.com/K2/olol.git\ncd olol\nuv pip install -e \".[dev]\"\n\n# Build and install from source\ncd olol\n./tools/build-simple.sh\nuv pip install dist/olol-0.1.0-py3-none-any.whl\n```\n\n## Quick Start\n\n### 1. Start Ollama instances\n\nStart multiple Ollama instances on different machines or ports.\n\n### 2. Start the gRPC servers\n\n```bash\n# Start a synchronous server\nolol server --host 0.0.0.0 --port 50051 --ollama-host http://localhost:11434\n\n# Start an asynchronous server (on another machine)\nolol server --host 0.0.0.0 --port 50052 --ollama-host http://localhost:11434 --async\n```\n\n### 3. 
Start the load balancing proxy\n\n```bash\n# Basic proxy with load balancing\nolol proxy --host 0.0.0.0 --port 8000 --servers \"192.168.1.10:50051,192.168.1.11:50051\"\n\n# Start with distributed inference enabled\nolol proxy --host 0.0.0.0 --port 8000 --servers \"192.168.1.10:50051,192.168.1.11:50051\" --distributed\n\n# With custom RPC servers for distributed inference\nolol proxy --servers \"192.168.1.10:50051,192.168.1.11:50051\" --distributed --rpc-servers \"192.168.1.10:50052,192.168.1.11:50052\"\n\n# Auto-discovery mode (will automatically find and add new servers)\nolol proxy --distributed --discovery\n\n# Specify network interface for multi-network-interface setups\nolol proxy --distributed --interface 10.0.0.5\n```\n\n### 4. Set up distributed inference (optional)\n\nFor large models, you can shard the model across multiple servers for faster inference:\n\n```bash\n# Start RPC servers on each machine that will participate in distributed inference\nolol rpc-server --host 0.0.0.0 --port 50052 --device auto\n\n# With optimized settings for large models\nolol rpc-server --device cuda --flash-attention --context-window 16384 --quantize q5_0\n\n# Auto-discovery mode (servers will automatically find and register with proxies)\nolol rpc-server --discovery\n\n# Specify preferred network interface when multiple are available\nolol rpc-server --device cuda --interface 192.168.1.10\n\n# Testing distributed inference directly\nolol dist --servers \"192.168.1.10:50052,192.168.1.11:50052\" --model llama2:13b --prompt \"Hello, world!\"\n```\n\n### 5. 
Use the client\n\n```bash\n# Test with the command-line client\nolol client --host localhost --port 8000 --model llama2 --prompt \"Hello, world!\"\n\n# Or use the async client\nolol client --host localhost --port 8000 --model llama2 --prompt \"Hello, world!\" --async\n```\n\n## Python API\n\n### Synchronous Client\n\n```python\nfrom olol.sync import OllamaClient\n\nclient = OllamaClient(host=\"localhost\", port=8000)\ntry:\n    # Stream text generation\n    for response in client.generate(\"llama2\", \"What is the capital of France?\"):\n        if not response.done:\n            print(response.response, end=\"\", flush=True)\n        else:\n            print(f\"\\nCompleted in {response.total_duration}ms\")\nfinally:\n    client.close()\n```\n\n### Asynchronous Client\n\n```python\nimport asyncio\nimport importlib\n\n# `async` is a reserved keyword in Python 3.7+, so `from olol.async import ...`\n# is a SyntaxError; load the module by name instead.\nAsyncOllamaClient = importlib.import_module(\"olol.async\").AsyncOllamaClient\n\nasync def main():\n    client = AsyncOllamaClient(host=\"localhost\", port=8000)\n    try:\n        # Stream text generation\n        async for response in client.generate(\"llama2\", \"What is the capital of France?\"):\n            if not response.done:\n                print(response.response, end=\"\", flush=True)\n            else:\n                print(f\"\\nCompleted in {response.total_duration}ms\")\n    finally:\n        await client.close()\n\nasyncio.run(main())\n```\n\n### HTTP API Usage\n\nOnce the proxy is running, connect your client applications to it using the standard Ollama API:\n\n```bash\n# Example: Chat with a model\ncurl -X POST http://localhost:8000/api/chat -d '{\n  \"model\": \"llama2\",\n  \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]\n}'\n\n# Using distributed inference explicitly\ncurl -X POST http://localhost:8000/api/generate -d '{\n  \"model\": \"llama2:13b\",\n  \"prompt\": \"Write a poem about distributed computing\",\n  \"options\": {\n    \"distributed\": true\n  }\n}'\n```\n\n### Status Endpoint\n\nMonitor the status of your cluster:\n\n```bash\ncurl 
http://localhost:8000/api/status\n```\n\n## Configuration\n\n### Command-Line Interface\n\nThe `olol` command-line interface exposes several subcommands; use `--help` to list them and their options:\n\n```bash\n# Show available commands\nolol --help\n\n# Show options for a specific command\nolol server --help\nolol proxy --help\n```\n\n### Direct Command Tools\n\nOLOL also provides direct command tools that can be used with `uv run`:\n\n```bash\n# Start a proxy server\nuv run olol-proxy --distributed --discovery\n\n# Start an RPC server\nuv run olol-rpc --device cuda --quantize q5_0 --context-window 8192\n\n# Start a standard server\nuv run olol-server --host 0.0.0.0 --port 50051\n\n# Run distributed inference\nuv run olol-dist --servers \"server1:50052,server2:50052\" --model llama2:13b --prompt \"Hello!\"\n\n# Use the client\nuv run olol-client --model llama2 --prompt \"Tell me about distributed systems\"\n```\n\nThese command tools accept the same options as their corresponding `olol` commands.\n\n### Environment Variables\n\n**OLOL configuration:**\n- `OLLAMA_SERVERS`: Comma-separated list of gRPC server addresses (default: \"localhost:50051\")\n- `OLOL_PORT`: HTTP port for the API proxy (default: 8000)\n- `OLOL_LOG_LEVEL`: Set logging level (default: INFO)\n\n**Ollama optimization settings:**\n- `OLLAMA_FLASH_ATTENTION`: Enable FlashAttention for faster inference\n- `OLLAMA_NUMA`: Enable NUMA optimization if available\n- `OLLAMA_KEEP_ALIVE`: How long to keep models loaded (e.g., \"1h\")\n- `OLLAMA_MEMORY_LOCK`: Lock memory to prevent swapping\n- `OLLAMA_LOAD_TIMEOUT`: Longer timeout for loading large models\n- `OLLAMA_QUANTIZE`: Quantization level (e.g., \"q8_0\", \"q5_0\", \"f16\")\n- `OLLAMA_CONTEXT_WINDOW`: Default context window size (e.g., \"8192\", \"16384\")\n- `OLLAMA_DEBUG`: Enable debug mode with additional logging\n- `OLLAMA_LOG_LEVEL`: Set Ollama log level\n\n## Contributing\n\nContributions are welcome! 
Please check out our [Contribution Guidelines](CONTRIBUTING.md).\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fk2%2Folol","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fk2%2Folol","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fk2%2Folol/lists"}