https://github.com/lhzn-io/kanoa-mlops
https://github.com/lhzn-io/kanoa-mlops
docker gpu inference llm local-llm machine-learning mlops molmo python self-hosted vision-language-model vllm vlm
Last synced: 5 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/lhzn-io/kanoa-mlops
- Owner: lhzn-io
- License: mit
- Created: 2025-11-30T15:31:44.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-20T14:08:58.000Z (6 months ago)
- Last Synced: 2025-12-21T05:12:23.527Z (6 months ago)
- Topics: docker, gpu, inference, llm, local-llm, machine-learning, mlops, molmo, python, self-hosted, vision-language-model, vllm, vlm
- Language: Python
- Size: 653 KB
- Stars: 1
- Watchers: 0
- Forks: 1
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Agents: agents.md
Awesome Lists containing this project
README
# kanoa-mlops
**The infrastructure backbone for privacy-first AI interpretation.**
[](https://github.com/lhzn-io/kanoa-mlops/actions/workflows/pre-commit.yml)
[](https://opensource.org/licenses/MIT)
[](https://www.python.org/downloads/)
`kanoa-mlops` provides the local compute layer for the [`kanoa`](https://github.com/lhzn-io/kanoa) library — enabling you to interpret data science outputs (plots, tables, models) using state-of-the-art vision-language models, all running on your own hardware.
- **Privacy First** — Your data never leaves your machine
- **Multiple Backends** — Choose Ollama (easy), vLLM (fast), or cloud GPU (scalable)
- **Full Observability** — Prometheus + Grafana + NVIDIA DCGM monitoring stack
- **Seamless Integration** — Extends `kanoa` CLI with `serve` and `stop` commands
## Installation
### For Users (add to your project)
```bash
# Base install (local inference with Ollama/vLLM)
pip install kanoa-mlops
# With GCP support (for cloud deployment)
pip install kanoa-mlops[gcp]
# Everything (GCP + dev tools)
pip install kanoa-mlops[all]
```
### For Contributors
```bash
# Clone and install in editable mode
git clone https://github.com/lhzn-io/kanoa-mlops.git
cd kanoa-mlops
# Base environment (no GCP tools)
conda env create -f environment.yml
conda activate kanoa-mlops
# Or with GCP infrastructure tools (terraform, gcloud)
conda env create -f environment-gcp.yml
conda activate kanoa-mlops-gcp
```
## Quick Start
### Option A: Add to Existing Project (pip)
For users adding local AI to a science project:
```bash
# Install the package
pip install kanoa-mlops
# Initialize docker templates in your project
kanoa mlops init --dir .
# Start Ollama
kanoa mlops serve ollama
# Done! Your project now has local AI
```
### Option B: Clone Repository (full setup)
For contributors or those wanting the full monitoring stack:
```bash
# Clone the repo
git clone https://github.com/lhzn-io/kanoa-mlops.git
cd kanoa-mlops
# Start Ollama (pulls model on first run)
kanoa mlops serve ollama
# Start monitoring (optional)
kanoa mlops serve monitoring
```
### The Performance Path (vLLM)
For maximum throughput on NVIDIA GPUs:
```bash
# Download model (~14GB)
huggingface-cli download allenai/Molmo-7B-D-0924
# Start vLLM server
docker compose -f docker/vllm/docker-compose.molmo.yml up -d
# Verify
curl http://localhost:8000/health
```
## Platform Guidance
- **x86/CUDA GPUs** (RTX 4090, 5080, etc.): Use vLLM for best performance
- **Jetson Thor/ARM64**: Use Ollama for Scout (vLLM Thor image lacks bitsandbytes quantization)
- **Easy Setup**: Ollama handles quantization automatically (65GB Q4 vs 203GB full)
- **Production**: vLLM offers more control over inference parameters
#### Advanced Models
For the Olmo3 32B Think model (requires significant GPU memory):
```bash
# Download model (~32GB) - requires Hugging Face authentication for gated models
huggingface-cli download allenai/Olmo-3-32B-Think
# Start vLLM server (optimized for Jetson Thor with 128GB Unified Memory)
make serve-olmo3-32b
# Verify
curl http://localhost:8000/health
```
## Choose Your Backend
| Backend | Best For | Hardware | Throughput | Setup |
| :--- | :--- | :--- | :--- | :--- |
| **Ollama** | Getting started, CPU/Apple Silicon | Any | ~15 tok/s | `kanoa mlops serve ollama` |
| **vLLM** | Production, maximum speed | NVIDIA GPU | ~31 tok/s | Docker Compose |
| **GCP L4** | No local GPU, team sharing | Cloud | ~25 tok/s | Terraform |
### Ollama (Easiest)
Perfect for development, VSCode integration, and broader hardware support.
```bash
kanoa mlops serve ollama
# → Ollama running at http://localhost:11434
```
Supports: Gemma 3 (4B/12B), Llama 3, Mistral, and [many more](https://ollama.com/library).
### vLLM (Fastest)
Optimized inference with CUDA, batching, and OpenAI-compatible API.
```bash
# Molmo 7B (best for vision)
docker compose -f docker/vllm/docker-compose.molmo.yml up -d
# Gemma 3 12B (best for reasoning)
docker compose -f docker/vllm/docker-compose.gemma.yml up -d
```
Endpoints: `http://localhost:8000/v1/chat/completions`
### GCP Cloud GPU (Scalable)
For users without local GPUs or production workloads.
```bash
cd infrastructure/gcp
cp terraform.tfvars.example terraform.tfvars # Configure
terraform apply
```
Features: L4 GPU (~$0.70/hr), auto-shutdown, IP-restricted firewall.
## Using with kanoa
Once a backend is running, `kanoa` automatically detects it:
```python
from kanoa import Interpreter
# Uses local backend (Ollama or vLLM) automatically
interpreter = Interpreter(backend="local")
# Interpret your matplotlib figure
result = interpreter.interpret(fig)
print(result.text)
```
Or explicitly configure:
```python
from kanoa.backends import VLLMBackend
backend = VLLMBackend(
api_base="http://localhost:8000/v1",
model="allenai/Molmo-7B-D-0924"
)
```
## CLI Integration
`kanoa-mlops` extends the `kanoa` CLI with infrastructure commands:
```bash
# Initialize (for pip users)
kanoa mlops init --dir . # Scaffold docker templates
# Start services
kanoa mlops serve ollama # Start Ollama
kanoa mlops serve monitoring # Start Prometheus + Grafana
kanoa mlops serve all # Start everything
# Stop services
kanoa mlops stop # Stop all services
kanoa mlops stop ollama # Stop specific service
# Status
kanoa mlops status # Show config and running services
# Restart services
kanoa mlops restart ollama # Restart Ollama
```
## Monitoring Stack
Real-time observability for your inference workloads:
```bash
kanoa mlops serve monitoring
# → Grafana: http://localhost:3000 (admin/admin)
# → Prometheus: http://localhost:9090
```
**Dashboard Features:**
| Section | Metrics |
| :--- | :--- |
| Token Odometers | Cumulative prompt/generated tokens, request counts |
| Latency | TTFT and TPOT percentiles (p50, p90, p95, p99) |
| GPU Hardware | Temperature, power, utilization, memory (via NVIDIA DCGM) |
| vLLM Performance | KV cache usage, request queue, throughput |
See [monitoring/README.md](monitoring/README.md) for full documentation.
## Project Structure
```text
kanoa-mlops/
├── kanoa_mlops/ # CLI plugin (serve, stop commands)
│ └── plugin.py
├── docker/
│ ├── vllm/ # vLLM Docker Compose configs
│ └── ollama/ # Ollama Docker Compose config
├── monitoring/
│ ├── grafana/ # Dashboards and provisioning
│ └── prometheus/ # Scrape configs
├── infrastructure/
│ └── gcp/ # Terraform for cloud GPU
├── examples/ # Jupyter notebooks
├── scripts/ # Model download utilities
└── tests/integration/ # Backend integration tests
```
## Performance
### Benchmark Results (RTX 5080 16GB eGPU)
| Model | Backend | Throughput | Notes |
| :--- | :--- | :--- | :--- |
| **Molmo 7B** | vLLM | **31.1 tok/s** | Best for vision tasks |
| Gemma 3 12B | vLLM | 10.3 tok/s | Strong text reasoning |
| Gemma 3 4B | Ollama | ~15 tok/s | Good balance |
**Why vLLM is faster:**
- Continuous batching for concurrent requests
- PagedAttention for efficient KV cache
- FP8 quantization support
## Supported Models
### Vision-Language Models
| Model | Size | vLLM | Ollama | Notes |
| :--- | :--- | :--- | :--- | :--- |
| Molmo 7B | 14GB | [✓] | — | Best vision performance |
| Gemma 3 | 4B-27B | [✓] | [✓] | Excellent all-rounder |
| Olmo 3 32B Think | 32GB | [✓] | — | Advanced reasoning, code generation |
| LLaVa-Next | 7B-34B | [ ] | [✓] | Planned for vLLM |
### Text-Only Models (via Ollama)
Llama 3.1, Mistral, Qwen 2.5, and [100+ more](https://ollama.com/library).
## Hardware Compatibility
| Platform | Status | Notes |
| :--- | :--- | :--- |
| NVIDIA RTX (Desktop/Laptop) | [✓] Verified | RTX 3080+ recommended |
| NVIDIA RTX (eGPU) | [✓] Verified | TB3/TB4 bandwidth sufficient |
| NVIDIA Jetson Thor | [✓] Verified | 128GB Unified Memory, Blackwell GPU |
| Apple Silicon | [✓] Ollama | M1/M2/M3 via Ollama |
| GCP L4 GPU | [✓] Verified | 24GB VRAM, ~$0.70/hr |
| Intel/AMD GPU | — | Not supported |
## Development Setup
### Plugin Architecture
`kanoa-mlops` is a **plugin** for the `kanoa` CLI. The `kanoa` package provides the CLI framework, and `kanoa-mlops` registers additional commands (`serve`, `stop`, `restart`) via Python entry points.
```text
kanoa (CLI) ──loads──► kanoa-mlops (plugin)
│ │
└── entry points ◄───────┘
```
### Co-Development Setup
To develop both packages simultaneously, install both in **editable mode**:
```bash
# Clone both repos
git clone https://github.com/lhzn-io/kanoa.git
git clone https://github.com/lhzn-io/kanoa-mlops.git
# Create and activate environment
conda env create -f kanoa-mlops/environment.yml
conda activate kanoa-mlops
# Install BOTH packages in editable mode
pip install -e ./kanoa # Provides 'kanoa' CLI
pip install -e ./kanoa-mlops # Registers plugin commands
# Verify
kanoa --help # Should show: serve, stop, restart
```
> **Why both?** The `kanoa` package provides the CLI entry point. The `kanoa-mlops` package registers its commands as plugins. Both must be installed for the full CLI to work.
### Quick Reinstall
If you switch conda environments or commands are missing:
```bash
pip install -e /path/to/kanoa -e /path/to/kanoa-mlops
```
## Prerequisites
- **Docker** and Docker Compose
- **NVIDIA GPU + Drivers** (for vLLM)
- **Python 3.11+**
**WSL2/eGPU Users**: See the [Local GPU Setup Guide](docs/source/local-gpu-setup.md) for platform-specific instructions.
## Roadmap
- [✓] Ollama integration (Dec 2025)
- [✓] CLI plugin system (Dec 2025)
- [✓] NVIDIA DCGM monitoring (Dec 2025)
- [✓] NVIDIA Jetson Thor support (Dec 2025)
- [ ] PostgreSQL + pgvector for RAG
- [ ] Kubernetes / Helm charts
- [ ] NVIDIA Jetson Orin support
## Contributing
We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
- **Adding New Models?** Check out the [Model Contribution Guide](docs/adding-models.md).
> **Pro Tip**: We find **Claude Code** to be an excellent DevOps buddy for this project. If you use AI tools, just remember our [Human-in-the-Loop policy](CONTRIBUTING.md#4-ai-contribution-policy).
## License
MIT License — see [LICENSE](LICENSE) for details.