https://github.com/kolosalai/kolosal-server
Kolosal AI is an open-source, lightweight alternative to Ollama for running LLMs 100% offline on your device.
- Host: GitHub
- URL: https://github.com/kolosalai/kolosal-server
- Owner: KolosalAI
- License: apache-2.0
- Created: 2025-02-14T21:57:23.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-07-13T20:12:20.000Z (3 months ago)
- Last Synced: 2025-07-13T21:26:08.784Z (3 months ago)
- Topics: c, cpp, deepseek, gemma, gemma3, gemma3n, llama, llama2, llama3, llava, llm, llms, mistral, ollama, phi4, qwen
- Language: C++
- Homepage: https://kolosal.ai
- Size: 280 MB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Kolosal Server
A high-performance inference server for large language models with OpenAI-compatible API endpoints. Now available for both **Windows** and **Linux** systems!
## Platform Support
- 🪟 **Windows**: Full support with Visual Studio and MSVC
- 🐧 **Linux**: Native support with GCC/Clang
- 🎮 **GPU Acceleration**: NVIDIA CUDA and Vulkan support
- 📦 **Easy Installation**: Direct binary installation or build from source

## Features
- 🚀 **Fast Inference**: Built with llama.cpp for optimized model inference
- 🔗 **OpenAI Compatible**: Drop-in replacement for OpenAI API endpoints
- 📡 **Streaming Support**: Real-time streaming responses for chat completions
- 🎛️ **Multi-Model Management**: Load and manage multiple models simultaneously
- 📊 **Real-time Metrics**: Monitor completion performance with TPS, TTFT, and success rates
- ⚙️ **Lazy Loading**: Defer model loading until first request with `load_immediately=false`
- 🔧 **Configurable**: Flexible model loading parameters and inference settings
- 🔒 **Authentication**: API key and rate limiting support
- 🌐 **Cross-Platform**: Windows and Linux native builds

## Quick Start
### Linux (Recommended)
#### Prerequisites
**System Requirements:**
- Ubuntu 20.04+ or equivalent Linux distribution (CentOS 8+, Fedora 32+, Arch Linux)
- GCC 9+ or Clang 10+
- CMake 3.14+
- Git with submodule support
- At least 4GB RAM (8GB+ recommended for larger models)
- CUDA Toolkit 11.0+ (optional, for NVIDIA GPU acceleration)
- Vulkan SDK (optional, for alternative GPU acceleration)

**Install Dependencies:**
**Ubuntu/Debian:**
```bash
# Update package list
sudo apt update

# Install essential build tools
sudo apt install -y build-essential cmake git pkg-config

# Install required libraries
sudo apt install -y libcurl4-openssl-dev libyaml-cpp-dev

# Optional: Install CUDA for GPU support
# Follow NVIDIA's official installation guide for your distribution
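
# Optional sanity check (not from the original docs): confirm the toolchain
# meets the minimums listed above (GCC 9+, CMake 3.14+)
gcc --version
cmake --version
git --version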
```

**CentOS/RHEL/Fedora:**
```bash
# For CentOS/RHEL 8+
sudo dnf groupinstall "Development Tools"
sudo dnf install cmake git curl-devel yaml-cpp-devel

# For Fedora
sudo dnf install gcc-c++ cmake git libcurl-devel yaml-cpp-devel
```

**Arch Linux:**
```bash
sudo pacman -S base-devel cmake git curl yaml-cpp
```

#### Building from Source
**1. Clone the Repository:**
```bash
git clone https://github.com/kolosalai/kolosal-server.git
cd kolosal-server
```

**2. Initialize Submodules:**
```bash
git submodule update --init --recursive
```

**3. Create Build Directory:**
```bash
mkdir build && cd build
```

**4. Configure Build:**
**Standard Build (CPU-only):**
```bash
cmake -DCMAKE_BUILD_TYPE=Release ..
```

**With CUDA Support:**
```bash
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON ..
```

**With Vulkan Support:**
```bash
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_VULKAN=ON ..
```

**Debug Build:**
```bash
cmake -DCMAKE_BUILD_TYPE=Debug ..
```

**5. Build the Project:**
```bash
# Use all available CPU cores
make -j$(nproc)

# Or specify number of cores manually
make -j4
```

**6. Verify Build:**
```bash
# Check if the executable was created
ls -la kolosal-server

# Test basic functionality
./kolosal-server --help
```

#### Running the Server
**Start the Server:**
```bash
# From build directory
./kolosal-server

# Or specify a config file
./kolosal-server --config ../config.yaml
```

**Background Service:**
```bash
# Run in background
nohup ./kolosal-server > server.log 2>&1 &

# Check if running
ps aux | grep kolosal-server
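
# Optional (sketch, not from the original docs): stop the background instance
# by process name when you are done
pkill -f kolosal-server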
```

**Check Server Status:**
```bash
# Test if server is responding
curl http://localhost:8080/v1/health
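
# Optional sketch: block until the server reports healthy before sending
# requests (assumes /v1/health returns HTTP 2xx once the model server is ready)
until curl -sf http://localhost:8080/v1/health > /dev/null; do
  echo "Waiting for kolosal-server..."
  sleep 1
done
echo "Server is up"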
```

#### Alternative Installation Methods
**Install to System Path:**
```bash
# Install binary to /usr/local/bin
sudo cp build/kolosal-server /usr/local/bin/

# Make it executable
sudo chmod +x /usr/local/bin/kolosal-server

# Now you can run from anywhere
kolosal-server --help
```

**Install with Package Manager (Future):**
```bash
# Note: Package manager installation will be available in future releases
# For now, use the build from source method above
```

#### Installation as System Service
**Create Service File:**
```bash
sudo tee /etc/systemd/system/kolosal-server.service > /dev/null << EOF
[Unit]
Description=Kolosal Server - LLM Inference Server
After=network.target

[Service]
Type=simple
User=kolosal
Group=kolosal
WorkingDirectory=/opt/kolosal-server
ExecStart=/opt/kolosal-server/kolosal-server --config /etc/kolosal-server/config.yaml
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF
```

**Enable and Start Service:**
```bash
# Create user for service
sudo useradd -r -s /bin/false kolosal

# Install binary and config
sudo mkdir -p /opt/kolosal-server /etc/kolosal-server
sudo cp build/kolosal-server /opt/kolosal-server/
sudo cp config.example.yaml /etc/kolosal-server/config.yaml
sudo chown -R kolosal:kolosal /opt/kolosal-server

# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable kolosal-server
sudo systemctl start kolosal-server

# Check status
sudo systemctl status kolosal-server
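
# Follow the service logs via the systemd journal
sudo journalctl -u kolosal-server -f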
```

#### Troubleshooting
**Common Build Issues:**
1. **Missing dependencies:**
```bash
# Check for missing packages
ldd build/kolosal-server
# Install missing development packages
sudo apt install -y libssl-dev libcurl4-openssl-dev
```

2. **CMake version too old:**
```bash
# Install newer CMake from Kitware APT repository
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | sudo tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null
sudo apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main'
sudo apt update && sudo apt install cmake
```

3. **CUDA compilation errors:**
```bash
# Verify CUDA installation
nvcc --version
nvidia-smi
# Set CUDA environment variables if needed
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
```

4. **Permission issues:**
```bash
# Fix ownership
sudo chown -R $USER:$USER ./build
# Make executable
chmod +x build/kolosal-server
```

**Performance Optimization:**
1. **CPU Optimization:**
```bash
# Build with native optimizations
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-march=native" ..
```

2. **Memory Settings:**
```bash
# For systems with limited RAM, reduce parallel jobs
make -j2
# Set memory limits in config
echo "server.max_memory_mb: 4096" >> config.yaml
```

3. **GPU Memory:**
```bash
# Monitor GPU usage
watch nvidia-smi
# Adjust GPU layers in model config
# Reduce n_gpu_layers if running out of VRAM
```

### Windows
**Prerequisites:**
- Windows 10/11
- Visual Studio 2019 or later
- CMake 3.20+
- CUDA Toolkit (optional, for GPU acceleration)

**Building:**
```bash
git clone https://github.com/kolosalai/kolosal-server.git
cd kolosal-server
mkdir build && cd build
cmake ..
cmake --build . --config Debug
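
# Optional (sketch): build an optimized binary instead of Debug
cmake --build . --config Release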
```

### Running the Server
```bash
./Debug/kolosal-server.exe
```

The server will start on `http://localhost:8080` by default.
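
Once the server is running, a quick sanity check is to query the health endpoint (the same endpoint used on Linux; `curl` is bundled with recent Windows 10/11 builds):

```bash
curl http://localhost:8080/v1/health
```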
## Configuration
Kolosal Server supports configuration through JSON and YAML files for advanced setup including authentication, logging, model preloading, and server parameters.
### Quick Configuration Examples
#### Minimal Configuration (`config.yaml`)
```yaml
server:
port: "8080"models:
- id: "my-model"
path: "./models/model.gguf"
load_immediately: true
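  # Hypothetical additional entry (sketch, not from the original docs): more
  # models can be listed here; load_immediately: false defers loading until
  # the first request.
  # - id: "second-model"
  #   path: "./models/second-model.gguf"
  #   load_immediately: false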
```

#### Production Configuration
```yaml
server:
  port: "8080"
  max_connections: 500
  worker_threads: 8

auth:
  enabled: true
  require_api_key: true
  api_keys:
    - "sk-your-api-key-here"

models:
  - id: "gpt-3.5-turbo"
    path: "./models/gpt-3.5-turbo.gguf"
    load_immediately: true
    main_gpu_id: 0
    load_params:
      n_ctx: 4096
      n_gpu_layers: 50

features:
  metrics: true # Enable /metrics and /completion-metrics
```

For complete configuration documentation including all parameters, authentication setup, CORS configuration, and more examples, see the **[Configuration Guide](docs/CONFIGURATION.md)**.
## API Usage
### 1. Add a Model Engine
Before using chat completions, you need to add a model engine:
```bash
curl -X POST http://localhost:8080/engines \
-H "Content-Type: application/json" \
-d '{
"engine_id": "my-model",
"model_path": "path/to/your/model.gguf",
"load_immediately": true,
"n_ctx": 2048,
"n_gpu_layers": 0,
"main_gpu_id": 0
}'
```

#### Lazy Loading
For faster startup times, you can defer model loading until first use:
```bash
curl -X POST http://localhost:8080/engines \
-H "Content-Type: application/json" \
-d '{
"engine_id": "my-model",
"model_path": "https://huggingface.co/model-repo/model.gguf",
"load_immediately": false,
"n_ctx": 4096,
"n_gpu_layers": 30,
"main_gpu_id": 0
}'
```

### 2. Chat Completions
#### Non-Streaming Chat Completion
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [
{
"role": "user",
"content": "Hello, how are you today?"
}
],
"stream": false,
"temperature": 0.7,
"max_tokens": 100
}'
```

**Response:**
```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "Hello! I'm doing well, thank you for asking. How can I help you today?",
"role": "assistant"
}
}
],
"created": 1749981228,
"id": "chatcmpl-80HTkM01z7aaaThFbuALkbTu",
"model": "my-model",
"object": "chat.completion",
"system_fingerprint": "fp_4d29efe704",
"usage": {
"completion_tokens": 15,
"prompt_tokens": 9,
"total_tokens": 24
}
}
```

#### Streaming Chat Completion
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "my-model",
"messages": [
{
"role": "user",
"content": "Tell me a short story about a robot."
}
],
"stream": true,
"temperature": 0.8,
"max_tokens": 150
}'
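
# Optional sketch (not from the official docs): pipe the SSE stream through
# standard tools to print only the incremental content. Assumes GNU sed/grep
# and `jq` are installed; the filters below are illustrative only.
curl -sN -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hi"}], "stream": true}' \
  | sed -u -n 's/^data: //p' \
  | grep --line-buffered -v '^\[DONE\]$' \
  | jq --unbuffered -r -j '.choices[0].delta.content // empty'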
```

**Response (Server-Sent Events):**
```
data: {"choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}data: {"choices":[{"delta":{"content":"Once"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":" upon"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":""},"finish_reason":"stop","index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: [DONE]
```

#### Multi-Message Conversation
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [
{
"role": "system",
"content": "You are a helpful programming assistant."
},
{
"role": "user",
"content": "How do I create a simple HTTP server in Python?"
},
{
"role": "assistant",
"content": "You can create a simple HTTP server in Python using the built-in http.server module..."
},
{
"role": "user",
"content": "Can you show me the code?"
}
],
"stream": false,
"temperature": 0.7,
"max_tokens": 200
}'
```

#### Advanced Parameters
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
],
"stream": false,
"temperature": 0.1,
"top_p": 0.9,
"max_tokens": 50,
"seed": 42,
"presence_penalty": 0.0,
"frequency_penalty": 0.0
}'
```

### 3. Completions
#### Non-Streaming Completion
```bash
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"prompt": "The future of artificial intelligence is",
"stream": false,
"temperature": 0.7,
"max_tokens": 100
}'
```

**Response:**
```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"text": " bright and full of possibilities. As we continue to advance in machine learning and deep learning technologies, we can expect to see significant improvements in various fields..."
}
],
"created": 1749981288,
"id": "cmpl-80HTkM01z7aaaThFbuALkbTu",
"model": "my-model",
"object": "text_completion",
"usage": {
"completion_tokens": 25,
"prompt_tokens": 8,
"total_tokens": 33
}
}
```

#### Streaming Completion
```bash
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "my-model",
"prompt": "Write a haiku about programming:",
"stream": true,
"temperature": 0.8,
"max_tokens": 50
}'
```

**Response (Server-Sent Events):**
```
data: {"choices":[{"finish_reason":"","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}data: {"choices":[{"finish_reason":"","index":0,"text":"Code"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"","index":0,"text":" flows"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"stop","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: [DONE]
```

#### Multiple Prompts
```bash
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"prompt": [
"The weather today is",
"In other news,"
],
"stream": false,
"temperature": 0.5,
"max_tokens": 30
}'
```

#### Advanced Completion Parameters
```bash
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"prompt": "Explain quantum computing:",
"stream": false,
"temperature": 0.2,
"top_p": 0.9,
"max_tokens": 100,
"seed": 123,
"presence_penalty": 0.0,
"frequency_penalty": 0.1
}'
```

### 4. Engine Management
#### List Available Engines
```bash
curl -X GET http://localhost:8080/v1/engines
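
# Optional: pretty-print the JSON response (assumes `jq` is installed)
curl -s http://localhost:8080/v1/engines | jq .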
```

#### Get Engine Status
```bash
curl -X GET http://localhost:8080/engines/my-model/status
```

#### Remove an Engine
```bash
curl -X DELETE http://localhost:8080/engines/my-model
```

### 5. Completion Metrics and Monitoring
The server provides real-time completion metrics for monitoring performance and usage:
#### Get Completion Metrics
```bash
curl -X GET http://localhost:8080/completion-metrics
```

**Response:**
```json
{
"completion_metrics": {
"summary": {
"total_requests": 15,
"completed_requests": 14,
"failed_requests": 1,
"success_rate_percent": 93.33,
"total_input_tokens": 120,
"total_output_tokens": 350,
"avg_turnaround_time_ms": 1250.5,
"avg_tps": 12.8,
"avg_output_tps": 8.4,
"avg_ttft_ms": 245.2,
"avg_rps": 0.85
},
"per_engine": [
{
"model_name": "my-model",
"engine_id": "default",
"total_requests": 15,
"completed_requests": 14,
"failed_requests": 1,
"total_input_tokens": 120,
"total_output_tokens": 350,
"tps": 12.8,
"output_tps": 8.4,
"avg_ttft": 245.2,
"rps": 0.85,
"last_updated": "2025-06-16T17:04:12.123Z"
}
],
"timestamp": "2025-06-16T17:04:12.123Z"
}
}
```

**Alternative endpoints:**
```bash
# OpenAI-style endpoint
curl -X GET http://localhost:8080/v1/completion-metrics

# Alternative path
curl -X GET http://localhost:8080/completion/metrics
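
# Optional sketch: extract just the summary block (assumes `jq` is installed)
curl -s http://localhost:8080/completion-metrics | jq '.completion_metrics.summary'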
```

#### Metrics Explained
| Metric | Description |
|--------|-------------|
| `total_requests` | Total number of completion requests received |
| `completed_requests` | Number of successfully completed requests |
| `failed_requests` | Number of requests that failed |
| `success_rate_percent` | Success rate as a percentage |
| `total_input_tokens` | Total input tokens processed |
| `total_output_tokens` | Total output tokens generated |
| `avg_turnaround_time_ms` | Average time from request to completion (ms) |
| `avg_tps` | Average tokens per second (input + output) |
| `avg_output_tps` | Average output tokens per second |
| `avg_ttft_ms` | Average time to first token (ms) |
| `avg_rps` | Average requests per second |

#### PowerShell Example
```powershell
# Get completion metrics
$metrics = Invoke-RestMethod -Uri "http://localhost:8080/completion-metrics" -Method GET
Write-Output "Success Rate: $($metrics.completion_metrics.summary.success_rate_percent)%"
Write-Output "Average TPS: $($metrics.completion_metrics.summary.avg_tps)"
```

### 6. Health Check
```bash
curl -X GET http://localhost:8080/v1/health
```

## Parameters Reference
### Chat Completion Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | string | required | The ID of the model to use |
| `messages` | array | required | List of message objects |
| `stream` | boolean | false | Whether to stream responses |
| `temperature` | number | 1.0 | Sampling temperature (0.0-2.0) |
| `top_p` | number | 1.0 | Nucleus sampling parameter |
| `max_tokens` | integer | 128 | Maximum tokens to generate |
| `seed` | integer | random | Random seed for reproducible outputs |
| `presence_penalty` | number | 0.0 | Presence penalty (-2.0 to 2.0) |
| `frequency_penalty` | number | 0.0 | Frequency penalty (-2.0 to 2.0) |

### Completion Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | string | required | The ID of the model to use |
| `prompt` | string/array | required | Text prompt or array of prompts |
| `stream` | boolean | false | Whether to stream responses |
| `temperature` | number | 1.0 | Sampling temperature (0.0-2.0) |
| `top_p` | number | 1.0 | Nucleus sampling parameter |
| `max_tokens` | integer | 16 | Maximum tokens to generate |
| `seed` | integer | random | Random seed for reproducible outputs |
| `presence_penalty` | number | 0.0 | Presence penalty (-2.0 to 2.0) |
| `frequency_penalty` | number | 0.0 | Frequency penalty (-2.0 to 2.0) |

### Message Object
| Field | Type | Description |
|-------|------|-------------|
| `role` | string | Role: "system", "user", or "assistant" |
| `content` | string | The content of the message |

### Engine Loading Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `engine_id` | string | required | Unique identifier for the engine |
| `model_path` | string | required | Path to the GGUF model file or URL |
| `load_immediately` | boolean | true | Whether to load the model immediately or defer until first use |
| `n_ctx` | integer | 4096 | Context window size |
| `n_gpu_layers` | integer | 100 | Number of layers to offload to GPU |
| `main_gpu_id` | integer | 0 | Primary GPU device ID |

## Error Handling
The server returns standard HTTP status codes and JSON error responses:
```json
{
"error": {
"message": "Model 'non-existent-model' not found or could not be loaded",
"type": "invalid_request_error",
"param": null,
"code": null
}
}
```

Common error codes:
- `400` - Bad Request (invalid JSON, missing parameters)
- `404` - Not Found (model/engine not found)
- `500` - Internal Server Error (inference failures)

## Examples with PowerShell
For Windows users, here are PowerShell equivalents:
### Add Engine
```powershell
$body = @{
engine_id = "my-model"
model_path = "C:\path\to\model.gguf"
load_immediately = $true
n_ctx = 2048
n_gpu_layers = 0
} | ConvertTo-Json

Invoke-RestMethod -Uri "http://localhost:8080/engines" -Method POST -Body $body -ContentType "application/json"
```

### Chat Completion
```powershell
$body = @{
model = "my-model"
messages = @(
@{
role = "user"
content = "Hello, how are you?"
}
)
stream = $false
temperature = 0.7
max_tokens = 100
} | ConvertTo-Json -Depth 3

Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" -Method POST -Body $body -ContentType "application/json"
```

### Completion
```powershell
$body = @{
model = "my-model"
prompt = "The future of AI is"
stream = $false
temperature = 0.7
max_tokens = 50
} | ConvertTo-Json

Invoke-RestMethod -Uri "http://localhost:8080/v1/completions" -Method POST -Body $body -ContentType "application/json"
```

## 📚 Developer Documentation
For developers looking to contribute to or extend Kolosal Server, comprehensive documentation is available in the [`docs/`](docs/) directory:
### 🚀 Getting Started
- **[Developer Guide](docs/DEVELOPER_GUIDE.md)** - Complete setup, architecture, and development workflows
- **[Configuration Guide](docs/CONFIGURATION.md)** - Complete server configuration in JSON and YAML formats
- **[Architecture Overview](docs/ARCHITECTURE.md)** - Detailed system design and component relationships

### 🔧 Implementation Guides
- **[Adding New Routes](docs/ADDING_ROUTES.md)** - Step-by-step guide for implementing API endpoints
- **[Adding New Models](docs/ADDING_MODELS.md)** - Guide for creating data models and JSON handling
- **[API Specification](docs/API_SPECIFICATION.md)** - Complete API reference with examples

### 📖 Quick Links
- [Documentation Index](docs/README.md) - Complete documentation overview
- [Project Structure](docs/DEVELOPER_GUIDE.md#project-structure) - Understanding the codebase
- [Contributing Guidelines](docs/DEVELOPER_GUIDE.md#contributing) - How to contribute

## Acknowledgments
Kolosal Server is built on top of excellent open-source projects and we want to acknowledge their contributions:
### llama.cpp
This project is powered by [llama.cpp](https://github.com/ggml-org/llama.cpp), developed by [Georgi Gerganov](https://github.com/ggerganov) and the [ggml-org](https://github.com/ggml-org) community. llama.cpp provides the high-performance inference engine that makes Kolosal Server possible.

- **Project**: [https://github.com/ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp)
- **License**: MIT License
- **Description**: Inference of Meta's LLaMA model (and others) in pure C/C++

We extend our gratitude to the llama.cpp team for their incredible work on optimized LLM inference, which forms the foundation of our server's performance capabilities.
### Other Dependencies
- **[yaml-cpp](https://github.com/jbeder/yaml-cpp)**: YAML parsing and emitting library
- **[nlohmann/json](https://github.com/nlohmann/json)**: JSON library for Modern C++
- **[libcurl](https://curl.se/libcurl/)**: Client-side URL transfer library
- **[prometheus-cpp](https://github.com/jupp0r/prometheus-cpp)**: Prometheus metrics library for C++

## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
## Contributing
We welcome contributions! Please see our [Developer Documentation](docs/) for detailed guides on:
1. **Getting Started**: [Developer Guide](docs/DEVELOPER_GUIDE.md)
2. **Understanding the System**: [Architecture Overview](docs/ARCHITECTURE.md)
3. **Adding Features**: [Route](docs/ADDING_ROUTES.md) and [Model](docs/ADDING_MODELS.md) guides
4. **API Changes**: [API Specification](docs/API_SPECIFICATION.md)

### Quick Contributing Steps
1. Fork the repository
2. Follow the [Developer Guide](docs/DEVELOPER_GUIDE.md) for setup
3. Create a feature branch
4. Implement your changes following our guides
5. Add tests and update documentation
6. Submit a Pull Request

## Support
- **Issues**: Report bugs and feature requests on [GitHub Issues](https://github.com/kolosalai/kolosal-server/issues)
- **Documentation**: Check the [docs/](docs/) directory for comprehensive guides
- **Discussions**: Join [Kolosal AI Discord](https://discord.gg/NCufxNCB) for questions and community support