{"id":29544730,"url":"https://github.com/kolosalai/kolosal-server","last_synced_at":"2025-07-17T16:02:15.416Z","repository":{"id":280273047,"uuid":"932977988","full_name":"KolosalAI/kolosal-server","owner":"KolosalAI","description":"Kolosal AI is an OpenSource and Lightweight alternative to Ollama to run LLMs 100% offline on your device.","archived":false,"fork":false,"pushed_at":"2025-07-13T20:12:20.000Z","size":293820,"stargazers_count":4,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-13T21:26:08.784Z","etag":null,"topics":["c","cpp","deepseek","gemma","gemma3","gemma3n","llama","llama2","llama3","llava","llm","llms","mistral","ollama","phi4","qwen"],"latest_commit_sha":null,"homepage":"https://kolosal.ai","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KolosalAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-14T21:57:23.000Z","updated_at":"2025-07-11T02:37:48.000Z","dependencies_parsed_at":"2025-07-13T21:20:36.454Z","dependency_job_id":null,"html_url":"https://github.com/KolosalAI/kolosal-server","commit_stats":null,"previous_names":["genta-technology/kolosal-server","kolosalai/kolosal-server"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/KolosalAI/kolosal-server","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KolosalAI%2Fkolosal-server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KolosalAI%2Fkolosal-server/tags","releases_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KolosalAI%2Fkolosal-server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KolosalAI%2Fkolosal-server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KolosalAI","download_url":"https://codeload.github.com/KolosalAI/kolosal-server/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KolosalAI%2Fkolosal-server/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265625563,"owners_count":23800623,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","cpp","deepseek","gemma","gemma3","gemma3n","llama","llama2","llama3","llava","llm","llms","mistral","ollama","phi4","qwen"],"created_at":"2025-07-17T16:00:53.360Z","updated_at":"2025-07-17T16:02:15.396Z","avatar_url":"https://github.com/KolosalAI.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Kolosal Server\n\nA high-performance inference server for large language models with OpenAI-compatible API endpoints. 
Now available for both **Windows** and **Linux** systems!\n\n## Platform Support\n\n- 🪟 **Windows**: Full support with Visual Studio and MSVC\n- 🐧 **Linux**: Native support with GCC/Clang\n- 🎮 **GPU Acceleration**: NVIDIA CUDA and Vulkan support\n- 📦 **Easy Installation**: Direct binary installation or build from source\n\n## Features\n\n- 🚀 **Fast Inference**: Built with llama.cpp for optimized model inference\n- 🔗 **OpenAI Compatible**: Drop-in replacement for OpenAI API endpoints\n- 📡 **Streaming Support**: Real-time streaming responses for chat completions\n- 🎛️ **Multi-Model Management**: Load and manage multiple models simultaneously\n- 📊 **Real-time Metrics**: Monitor completion performance with TPS, TTFT, and success rates\n- ⚙️ **Lazy Loading**: Defer model loading until first request with `load_immediately=false`\n- 🔧 **Configurable**: Flexible model loading parameters and inference settings\n- 🔒 **Authentication**: API key and rate limiting support\n- 🌐 **Cross-Platform**: Windows and Linux native builds\n\n## Quick Start\n\n### Linux (Recommended)\n\n#### Prerequisites\n\n**System Requirements:**\n- Ubuntu 20.04+ or equivalent Linux distribution (CentOS 8+, Fedora 32+, Arch Linux)\n- GCC 9+ or Clang 10+\n- CMake 3.14+\n- Git with submodule support\n- At least 4GB RAM (8GB+ recommended for larger models)\n- CUDA Toolkit 11.0+ (optional, for NVIDIA GPU acceleration)\n- Vulkan SDK (optional, for alternative GPU acceleration)\n\n**Install Dependencies:**\n\n**Ubuntu/Debian:**\n```bash\n# Update package list\nsudo apt update\n\n# Install essential build tools\nsudo apt install -y build-essential cmake git pkg-config\n\n# Install required libraries\nsudo apt install -y libcurl4-openssl-dev libyaml-cpp-dev\n\n# Optional: Install CUDA for GPU support\n# Follow NVIDIA's official installation guide for your distribution\n```\n\n**CentOS/RHEL/Fedora:**\n```bash\n# For CentOS/RHEL 8+\nsudo dnf groupinstall \"Development Tools\"\nsudo dnf install cmake git 
curl-devel yaml-cpp-devel\n\n# For Fedora\nsudo dnf install gcc-c++ cmake git libcurl-devel yaml-cpp-devel\n```\n\n**Arch Linux:**\n```bash\nsudo pacman -S base-devel cmake git curl yaml-cpp\n```\n\n#### Building from Source\n\n**1. Clone the Repository:**\n```bash\ngit clone https://github.com/kolosalai/kolosal-server.git\ncd kolosal-server\n```\n\n**2. Initialize Submodules:**\n```bash\ngit submodule update --init --recursive\n```\n\n**3. Create Build Directory:**\n```bash\nmkdir build \u0026\u0026 cd build\n```\n\n**4. Configure Build:**\n\n**Standard Build (CPU-only):**\n```bash\ncmake -DCMAKE_BUILD_TYPE=Release ..\n```\n\n**With CUDA Support:**\n```bash\ncmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON ..\n```\n\n**With Vulkan Support:**\n```bash\ncmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_VULKAN=ON ..\n```\n\n**Debug Build:**\n```bash\ncmake -DCMAKE_BUILD_TYPE=Debug ..\n```\n\n**5. Build the Project:**\n```bash\n# Use all available CPU cores\nmake -j$(nproc)\n\n# Or specify number of cores manually\nmake -j4\n```\n\n**6. 
Verify Build:**\n```bash\n# Check if the executable was created\nls -la kolosal-server\n\n# Test basic functionality\n./kolosal-server --help\n```\n\n#### Running the Server\n\n**Start the Server:**\n```bash\n# From build directory\n./kolosal-server\n\n# Or specify a config file\n./kolosal-server --config ../config.yaml\n```\n\n**Background Service:**\n```bash\n# Run in background\nnohup ./kolosal-server \u003e server.log 2\u003e\u00261 \u0026\n\n# Check if running\nps aux | grep kolosal-server\n```\n\n**Check Server Status:**\n```bash\n# Test if server is responding\ncurl http://localhost:8080/v1/health\n```\n\n#### Alternative Installation Methods\n\n**Install to System Path:**\n```bash\n# Install binary to /usr/local/bin\nsudo cp build/kolosal-server /usr/local/bin/\n\n# Make it executable\nsudo chmod +x /usr/local/bin/kolosal-server\n\n# Now you can run from anywhere\nkolosal-server --help\n```\n\n**Install with Package Manager (Future):**\n```bash\n# Note: Package manager installation will be available in future releases\n# For now, use the build from source method above\n```\n\n#### Installation as System Service\n\n**Create Service File:**\n```bash\nsudo tee /etc/systemd/system/kolosal-server.service \u003e /dev/null \u003c\u003c EOF\n[Unit]\nDescription=Kolosal Server - LLM Inference Server\nAfter=network.target\n\n[Service]\nType=simple\nUser=kolosal\nGroup=kolosal\nWorkingDirectory=/opt/kolosal-server\nExecStart=/opt/kolosal-server/kolosal-server --config /etc/kolosal-server/config.yaml\nRestart=always\nRestartSec=5\nStandardOutput=journal\nStandardError=journal\n\n[Install]\nWantedBy=multi-user.target\nEOF\n```\n\n**Enable and Start Service:**\n```bash\n# Create user for service\nsudo useradd -r -s /bin/false kolosal\n\n# Install binary and config\nsudo mkdir -p /opt/kolosal-server /etc/kolosal-server\nsudo cp build/kolosal-server /opt/kolosal-server/\nsudo cp config.example.yaml /etc/kolosal-server/config.yaml\nsudo chown -R kolosal:kolosal 
/opt/kolosal-server\n\n# Enable and start service\nsudo systemctl daemon-reload\nsudo systemctl enable kolosal-server\nsudo systemctl start kolosal-server\n\n# Check status\nsudo systemctl status kolosal-server\n```\n\n#### Troubleshooting\n\n**Common Build Issues:**\n\n1. **Missing dependencies:**\n   ```bash\n   # Check for missing packages\n   ldd build/kolosal-server\n   \n   # Install missing development packages\n   sudo apt install -y libssl-dev libcurl4-openssl-dev\n   ```\n\n2. **CMake version too old:**\n   ```bash\n   # Install newer CMake from Kitware APT repository\n   wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2\u003e/dev/null | gpg --dearmor - | sudo tee /etc/apt/trusted.gpg.d/kitware.gpg \u003e/dev/null\n   sudo apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main'\n   sudo apt update \u0026\u0026 sudo apt install cmake\n   ```\n\n3. **CUDA compilation errors:**\n   ```bash\n   # Verify CUDA installation\n   nvcc --version\n   nvidia-smi\n   \n   # Set CUDA environment variables if needed\n   export CUDA_HOME=/usr/local/cuda\n   export PATH=$CUDA_HOME/bin:$PATH\n   export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH\n   ```\n\n4. **Permission issues:**\n   ```bash\n   # Fix ownership\n   sudo chown -R $USER:$USER ./build\n   \n   # Make executable\n   chmod +x build/kolosal-server\n   ```\n\n**Performance Optimization:**\n\n1. **CPU Optimization:**\n   ```bash\n   # Build with native optimizations\n   cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=\"-march=native\" ..\n   ```\n\n2. **Memory Settings:**\n   ```bash\n   # For systems with limited RAM, reduce parallel jobs\n   make -j2\n   \n   # Set memory limits in config\n   echo \"server.max_memory_mb: 4096\" \u003e\u003e config.yaml\n   ```\n\n3. 
**GPU Memory:**\n   ```bash\n   # Monitor GPU usage\n   watch nvidia-smi\n   \n   # Adjust GPU layers in model config\n   # Reduce n_gpu_layers if running out of VRAM\n   ```\n\n### Windows\n\n**Prerequisites:**\n- Windows 10/11\n- Visual Studio 2019 or later\n- CMake 3.20+\n- CUDA Toolkit (optional, for GPU acceleration)\n\n**Building:**\n```bash\ngit clone https://github.com/kolosalai/kolosal-server.git\ncd kolosal-server\nmkdir build \u0026\u0026 cd build\ncmake ..\ncmake --build . --config Debug\n```\n\n### Running the Server\n\n```bash\n./Debug/kolosal-server.exe\n```\n\nThe server will start on `http://localhost:8080` by default.\n\n## Configuration\n\nKolosal Server supports configuration through JSON and YAML files for advanced setup including authentication, logging, model preloading, and server parameters.\n\n### Quick Configuration Examples\n\n#### Minimal Configuration (`config.yaml`)\n\n```yaml\nserver:\n  port: \"8080\"\n\nmodels:\n  - id: \"my-model\"\n    path: \"./models/model.gguf\"\n    load_immediately: true\n```\n\n#### Production Configuration\n\n```yaml\nserver:\n  port: \"8080\"\n  max_connections: 500\n  worker_threads: 8\n\nauth:\n  enabled: true\n  require_api_key: true\n  api_keys:\n    - \"sk-your-api-key-here\"\n\nmodels:\n  - id: \"gpt-3.5-turbo\"\n    path: \"./models/gpt-3.5-turbo.gguf\"\n    load_immediately: true\n    main_gpu_id: 0\n    load_params:\n      n_ctx: 4096\n      n_gpu_layers: 50\n\nfeatures:\n  metrics: true  # Enable /metrics and /completion-metrics\n```\n\nFor complete configuration documentation including all parameters, authentication setup, CORS configuration, and more examples, see the **[Configuration Guide](docs/CONFIGURATION.md)**.\n\n## API Usage\n\n### 1. 
Add a Model Engine\n\nBefore using chat completions, you need to add a model engine:\n\n```bash\ncurl -X POST http://localhost:8080/engines \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"engine_id\": \"my-model\",\n    \"model_path\": \"path/to/your/model.gguf\",\n    \"load_immediately\": true,\n    \"n_ctx\": 2048,\n    \"n_gpu_layers\": 0,\n    \"main_gpu_id\": 0\n  }'\n```\n\n#### Lazy Loading\n\nFor faster startup times, you can defer model loading until first use:\n\n```bash\ncurl -X POST http://localhost:8080/engines \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"engine_id\": \"my-model\",\n    \"model_path\": \"https://huggingface.co/model-repo/model.gguf\",\n    \"load_immediately\": false,\n    \"n_ctx\": 4096,\n    \"n_gpu_layers\": 30,\n    \"main_gpu_id\": 0\n  }'\n```\n\n### 2. Chat Completions\n\n#### Non-Streaming Chat Completion\n\n```bash\ncurl -X POST http://localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"my-model\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello, how are you today?\"\n      }\n    ],\n    \"stream\": false,\n    \"temperature\": 0.7,\n    \"max_tokens\": 100\n  }'\n```\n\n**Response:**\n```json\n{\n  \"choices\": [\n    {\n      \"finish_reason\": \"stop\",\n      \"index\": 0,\n      \"message\": {\n        \"content\": \"Hello! I'm doing well, thank you for asking. 
How can I help you today?\",\n        \"role\": \"assistant\"\n      }\n    }\n  ],\n  \"created\": 1749981228,\n  \"id\": \"chatcmpl-80HTkM01z7aaaThFbuALkbTu\",\n  \"model\": \"my-model\",\n  \"object\": \"chat.completion\",\n  \"system_fingerprint\": \"fp_4d29efe704\",\n  \"usage\": {\n    \"completion_tokens\": 15,\n    \"prompt_tokens\": 9,\n    \"total_tokens\": 24\n  }\n}\n```\n\n#### Streaming Chat Completion\n\n```bash\ncurl -X POST http://localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Accept: text/event-stream\" \\\n  -d '{\n    \"model\": \"my-model\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"Tell me a short story about a robot.\"\n      }\n    ],\n    \"stream\": true,\n    \"temperature\": 0.8,\n    \"max_tokens\": 150\n  }'\n```\n\n**Response (Server-Sent Events):**\n```\ndata: {\"choices\":[{\"delta\":{\"content\":\"\",\"role\":\"assistant\"},\"finish_reason\":null,\"index\":0}],\"created\":1749981242,\"id\":\"chatcmpl-1749981241-1\",\"model\":\"my-model\",\"object\":\"chat.completion.chunk\",\"system_fingerprint\":\"fp_4d29efe704\"}\n\ndata: {\"choices\":[{\"delta\":{\"content\":\"Once\"},\"finish_reason\":null,\"index\":0}],\"created\":1749981242,\"id\":\"chatcmpl-1749981241-1\",\"model\":\"my-model\",\"object\":\"chat.completion.chunk\",\"system_fingerprint\":\"fp_4d29efe704\"}\n\ndata: {\"choices\":[{\"delta\":{\"content\":\" upon\"},\"finish_reason\":null,\"index\":0}],\"created\":1749981242,\"id\":\"chatcmpl-1749981241-1\",\"model\":\"my-model\",\"object\":\"chat.completion.chunk\",\"system_fingerprint\":\"fp_4d29efe704\"}\n\ndata: {\"choices\":[{\"delta\":{\"content\":\"\"},\"finish_reason\":\"stop\",\"index\":0}],\"created\":1749981242,\"id\":\"chatcmpl-1749981241-1\",\"model\":\"my-model\",\"object\":\"chat.completion.chunk\",\"system_fingerprint\":\"fp_4d29efe704\"}\n\ndata: [DONE]\n```\n\n#### Multi-Message Conversation\n\n```bash\ncurl -X POST 
http://localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"my-model\",\n    \"messages\": [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful programming assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"How do I create a simple HTTP server in Python?\"\n      },\n      {\n        \"role\": \"assistant\",\n        \"content\": \"You can create a simple HTTP server in Python using the built-in http.server module...\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"Can you show me the code?\"\n      }\n    ],\n    \"stream\": false,\n    \"temperature\": 0.7,\n    \"max_tokens\": 200\n  }'\n```\n\n#### Advanced Parameters\n\n```bash\ncurl -X POST http://localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"my-model\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"What is the capital of France?\"\n      }\n    ],\n    \"stream\": false,\n    \"temperature\": 0.1,\n    \"top_p\": 0.9,\n    \"max_tokens\": 50,\n    \"seed\": 42,\n    \"presence_penalty\": 0.0,\n    \"frequency_penalty\": 0.0\n  }'\n```\n\n### 3. Completions\n\n#### Non-Streaming Completion\n\n```bash\ncurl -X POST http://localhost:8080/v1/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"my-model\",\n    \"prompt\": \"The future of artificial intelligence is\",\n    \"stream\": false,\n    \"temperature\": 0.7,\n    \"max_tokens\": 100\n  }'\n```\n\n**Response:**\n```json\n{\n  \"choices\": [\n    {\n      \"finish_reason\": \"stop\",\n      \"index\": 0,\n      \"text\": \" bright and full of possibilities. 
As we continue to advance in machine learning and deep learning technologies, we can expect to see significant improvements in various fields...\"\n    }\n  ],\n  \"created\": 1749981288,\n  \"id\": \"cmpl-80HTkM01z7aaaThFbuALkbTu\",\n  \"model\": \"my-model\",\n  \"object\": \"text_completion\",\n  \"usage\": {\n    \"completion_tokens\": 25,\n    \"prompt_tokens\": 8,\n    \"total_tokens\": 33\n  }\n}\n```\n\n#### Streaming Completion\n\n```bash\ncurl -X POST http://localhost:8080/v1/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Accept: text/event-stream\" \\\n  -d '{\n    \"model\": \"my-model\",\n    \"prompt\": \"Write a haiku about programming:\",\n    \"stream\": true,\n    \"temperature\": 0.8,\n    \"max_tokens\": 50\n  }'\n```\n\n**Response (Server-Sent Events):**\n```\ndata: {\"choices\":[{\"finish_reason\":\"\",\"index\":0,\"text\":\"\"}],\"created\":1749981290,\"id\":\"cmpl-1749981289-1\",\"model\":\"my-model\",\"object\":\"text_completion\"}\n\ndata: {\"choices\":[{\"finish_reason\":\"\",\"index\":0,\"text\":\"Code\"}],\"created\":1749981290,\"id\":\"cmpl-1749981289-1\",\"model\":\"my-model\",\"object\":\"text_completion\"}\n\ndata: {\"choices\":[{\"finish_reason\":\"\",\"index\":0,\"text\":\" flows\"}],\"created\":1749981290,\"id\":\"cmpl-1749981289-1\",\"model\":\"my-model\",\"object\":\"text_completion\"}\n\ndata: {\"choices\":[{\"finish_reason\":\"stop\",\"index\":0,\"text\":\"\"}],\"created\":1749981290,\"id\":\"cmpl-1749981289-1\",\"model\":\"my-model\",\"object\":\"text_completion\"}\n\ndata: [DONE]\n```\n\n#### Multiple Prompts\n\n```bash\ncurl -X POST http://localhost:8080/v1/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"my-model\",\n    \"prompt\": [\n      \"The weather today is\",\n      \"In other news,\"\n    ],\n    \"stream\": false,\n    \"temperature\": 0.5,\n    \"max_tokens\": 30\n  }'\n```\n\n#### Advanced Completion Parameters\n\n```bash\ncurl -X POST 
http://localhost:8080/v1/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"my-model\",\n    \"prompt\": \"Explain quantum computing:\",\n    \"stream\": false,\n    \"temperature\": 0.2,\n    \"top_p\": 0.9,\n    \"max_tokens\": 100,\n    \"seed\": 123,\n    \"presence_penalty\": 0.0,\n    \"frequency_penalty\": 0.1\n  }'\n```\n\n### 4. Engine Management\n\n#### List Available Engines\n\n```bash\ncurl -X GET http://localhost:8080/v1/engines\n```\n\n#### Get Engine Status\n\n```bash\ncurl -X GET http://localhost:8080/engines/my-model/status\n```\n\n#### Remove an Engine\n\n```bash\ncurl -X DELETE http://localhost:8080/engines/my-model\n```\n\n### 5. Completion Metrics and Monitoring\n\nThe server provides real-time completion metrics for monitoring performance and usage:\n\n#### Get Completion Metrics\n\n```bash\ncurl -X GET http://localhost:8080/completion-metrics\n```\n\n**Response:**\n```json\n{\n  \"completion_metrics\": {\n    \"summary\": {\n      \"total_requests\": 15,\n      \"completed_requests\": 14,\n      \"failed_requests\": 1,\n      \"success_rate_percent\": 93.33,\n      \"total_input_tokens\": 120,\n      \"total_output_tokens\": 350,\n      \"avg_turnaround_time_ms\": 1250.5,\n      \"avg_tps\": 12.8,\n      \"avg_output_tps\": 8.4,\n      \"avg_ttft_ms\": 245.2,\n      \"avg_rps\": 0.85\n    },\n    \"per_engine\": [\n      {\n        \"model_name\": \"my-model\",\n        \"engine_id\": \"default\",\n        \"total_requests\": 15,\n        \"completed_requests\": 14,\n        \"failed_requests\": 1,\n        \"total_input_tokens\": 120,\n        \"total_output_tokens\": 350,\n        \"tps\": 12.8,\n        \"output_tps\": 8.4,\n        \"avg_ttft\": 245.2,\n        \"rps\": 0.85,\n        \"last_updated\": \"2025-06-16T17:04:12.123Z\"\n      }\n    ],\n    \"timestamp\": \"2025-06-16T17:04:12.123Z\"\n  }\n}\n```\n\n**Alternative endpoints:**\n```bash\n# OpenAI-style endpoint\ncurl -X GET 
http://localhost:8080/v1/completion-metrics\n\n# Alternative path\ncurl -X GET http://localhost:8080/completion/metrics\n```\n\n#### Metrics Explained\n\n| Metric | Description |\n|--------|-------------|\n| `total_requests` | Total number of completion requests received |\n| `completed_requests` | Number of successfully completed requests |\n| `failed_requests` | Number of requests that failed |\n| `success_rate_percent` | Success rate as a percentage |\n| `total_input_tokens` | Total input tokens processed |\n| `total_output_tokens` | Total output tokens generated |\n| `avg_turnaround_time_ms` | Average time from request to completion (ms) |\n| `avg_tps` | Average tokens per second (input + output) |\n| `avg_output_tps` | Average output tokens per second |\n| `avg_ttft_ms` | Average time to first token (ms) |\n| `avg_rps` | Average requests per second |\n\n#### PowerShell Example\n\n```powershell\n# Get completion metrics\n$metrics = Invoke-RestMethod -Uri \"http://localhost:8080/completion-metrics\" -Method GET\nWrite-Output \"Success Rate: $($metrics.completion_metrics.summary.success_rate_percent)%\"\nWrite-Output \"Average TPS: $($metrics.completion_metrics.summary.avg_tps)\"\n```\n\n### 6. 
Health Check\n\n```bash\ncurl -X GET http://localhost:8080/v1/health\n```\n\n## Parameters Reference\n\n### Chat Completion Parameters\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `model` | string | required | The ID of the model to use |\n| `messages` | array | required | List of message objects |\n| `stream` | boolean | false | Whether to stream responses |\n| `temperature` | number | 1.0 | Sampling temperature (0.0-2.0) |\n| `top_p` | number | 1.0 | Nucleus sampling parameter |\n| `max_tokens` | integer | 128 | Maximum tokens to generate |\n| `seed` | integer | random | Random seed for reproducible outputs |\n| `presence_penalty` | number | 0.0 | Presence penalty (-2.0 to 2.0) |\n| `frequency_penalty` | number | 0.0 | Frequency penalty (-2.0 to 2.0) |\n\n### Completion Parameters\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `model` | string | required | The ID of the model to use |\n| `prompt` | string/array | required | Text prompt or array of prompts |\n| `stream` | boolean | false | Whether to stream responses |\n| `temperature` | number | 1.0 | Sampling temperature (0.0-2.0) |\n| `top_p` | number | 1.0 | Nucleus sampling parameter |\n| `max_tokens` | integer | 16 | Maximum tokens to generate |\n| `seed` | integer | random | Random seed for reproducible outputs |\n| `presence_penalty` | number | 0.0 | Presence penalty (-2.0 to 2.0) |\n| `frequency_penalty` | number | 0.0 | Frequency penalty (-2.0 to 2.0) |\n\n### Message Object\n\n| Field | Type | Description |\n|-------|------|-------------|\n| `role` | string | Role: \"system\", \"user\", or \"assistant\" |\n| `content` | string | The content of the message |\n\n### Engine Loading Parameters\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `engine_id` | string | required | Unique identifier for the engine |\n| `model_path` | string | required | Path to the 
GGUF model file or URL |\n| `load_immediately` | boolean | true | Whether to load the model immediately or defer until first use |\n| `n_ctx` | integer | 4096 | Context window size |\n| `n_gpu_layers` | integer | 100 | Number of layers to offload to GPU |\n| `main_gpu_id` | integer | 0 | Primary GPU device ID |\n\n## Error Handling\n\nThe server returns standard HTTP status codes and JSON error responses:\n\n```json\n{\n  \"error\": {\n    \"message\": \"Model 'non-existent-model' not found or could not be loaded\",\n    \"type\": \"invalid_request_error\",\n    \"param\": null,\n    \"code\": null\n  }\n}\n```\n\nCommon error codes:\n- `400` - Bad Request (invalid JSON, missing parameters)\n- `404` - Not Found (model/engine not found)\n- `500` - Internal Server Error (inference failures)\n\n## Examples with PowerShell\n\nFor Windows users, here are PowerShell equivalents:\n\n### Add Engine\n```powershell\n$body = @{\n    engine_id = \"my-model\"\n    model_path = \"C:\\path\\to\\model.gguf\"\n    load_immediately = $true\n    n_ctx = 2048\n    n_gpu_layers = 0\n} | ConvertTo-Json\n\nInvoke-RestMethod -Uri \"http://localhost:8080/engines\" -Method POST -Body $body -ContentType \"application/json\"\n```\n\n### Chat Completion\n```powershell\n$body = @{\n    model = \"my-model\"\n    messages = @(\n        @{\n            role = \"user\"\n            content = \"Hello, how are you?\"\n        }\n    )\n    stream = $false\n    temperature = 0.7\n    max_tokens = 100\n} | ConvertTo-Json -Depth 3\n\nInvoke-RestMethod -Uri \"http://localhost:8080/v1/chat/completions\" -Method POST -Body $body -ContentType \"application/json\"\n```\n\n### Completion\n```powershell\n$body = @{\n    model = \"my-model\"\n    prompt = \"The future of AI is\"\n    stream = $false\n    temperature = 0.7\n    max_tokens = 50\n} | ConvertTo-Json\n\nInvoke-RestMethod -Uri \"http://localhost:8080/v1/completions\" -Method POST -Body $body -ContentType \"application/json\"\n```\n\n## 📚 Developer 
Documentation\n\nFor developers looking to contribute to or extend Kolosal Server, comprehensive documentation is available in the [`docs/`](docs/) directory:\n\n### 🚀 Getting Started\n- **[Developer Guide](docs/DEVELOPER_GUIDE.md)** - Complete setup, architecture, and development workflows\n- **[Configuration Guide](docs/CONFIGURATION.md)** - Complete server configuration in JSON and YAML formats\n- **[Architecture Overview](docs/ARCHITECTURE.md)** - Detailed system design and component relationships\n\n### 🔧 Implementation Guides\n- **[Adding New Routes](docs/ADDING_ROUTES.md)** - Step-by-step guide for implementing API endpoints\n- **[Adding New Models](docs/ADDING_MODELS.md)** - Guide for creating data models and JSON handling\n- **[API Specification](docs/API_SPECIFICATION.md)** - Complete API reference with examples\n\n### 📖 Quick Links\n- [Documentation Index](docs/README.md) - Complete documentation overview\n- [Project Structure](docs/DEVELOPER_GUIDE.md#project-structure) - Understanding the codebase\n- [Contributing Guidelines](docs/DEVELOPER_GUIDE.md#contributing) - How to contribute\n\n## Acknowledgments\n\nKolosal Server is built on top of excellent open-source projects and we want to acknowledge their contributions:\n\n### llama.cpp\nThis project is powered by [llama.cpp](https://github.com/ggml-org/llama.cpp), developed by [Georgi Gerganov](https://github.com/ggerganov) and the [ggml-org](https://github.com/ggml-org) community. 
llama.cpp provides the high-performance inference engine that makes Kolosal Server possible.\n\n- **Project**: [https://github.com/ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp)\n- **License**: MIT License\n- **Description**: Inference of Meta's LLaMA model (and others) in pure C/C++\n\nWe extend our gratitude to the llama.cpp team for their incredible work on optimized LLM inference, which forms the foundation of our server's performance capabilities.\n\n### Other Dependencies\n- **[yaml-cpp](https://github.com/jbeder/yaml-cpp)**: YAML parsing and emitting library\n- **[nlohmann/json](https://github.com/nlohmann/json)**: JSON library for Modern C++\n- **[libcurl](https://curl.se/libcurl/)**: Client-side URL transfer library\n- **[prometheus-cpp](https://github.com/jupp0r/prometheus-cpp)**: Prometheus metrics library for C++\n\n## License\n\nThis project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.\n\n## Contributing\n\nWe welcome contributions! Please see our [Developer Documentation](docs/) for detailed guides on:\n\n1. **Getting Started**: [Developer Guide](docs/DEVELOPER_GUIDE.md)\n2. **Understanding the System**: [Architecture Overview](docs/ARCHITECTURE.md)\n3. **Adding Features**: [Route](docs/ADDING_ROUTES.md) and [Model](docs/ADDING_MODELS.md) guides\n4. **API Changes**: [API Specification](docs/API_SPECIFICATION.md)\n\n### Quick Contributing Steps\n1. Fork the repository\n2. Follow the [Developer Guide](docs/DEVELOPER_GUIDE.md) for setup\n3. Create a feature branch\n4. Implement your changes following our guides\n5. Add tests and update documentation\n6. 
Submit a Pull Request\n\n## Support\n\n- **Issues**: Report bugs and feature requests on [GitHub Issues](https://github.com/KolosalAI/kolosal-server/issues)\n- **Documentation**: Check the [docs/](docs/) directory for comprehensive guides\n- **Discussions**: Join the [Kolosal AI Discord](https://discord.gg/NCufxNCB) for questions and community support\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkolosalai%2Fkolosal-server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkolosalai%2Fkolosal-server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkolosalai%2Fkolosal-server/lists"}