https://github.com/lhzn-io/kanoa-mlops

docker gpu inference llm local-llm machine-learning mlops molmo python self-hosted vision-language-model vllm vlm
Last synced: 5 months ago
JSON representation
Host: GitHub
URL: https://github.com/lhzn-io/kanoa-mlops
Owner: lhzn-io
License: mit
Created: 2025-11-30T15:31:44.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-12-20T14:08:58.000Z (6 months ago)
Last Synced: 2025-12-21T05:12:23.527Z (6 months ago)
Topics: docker, gpu, inference, llm, local-llm, machine-learning, mlops, molmo, python, self-hosted, vision-language-model, vllm, vlm
Language: Python
Size: 653 KB
Stars: 1
Watchers: 0
Forks: 1
Open Issues: 3
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Agents: agents.md
Awesome Lists containing this project

README

          # kanoa-mlops

**The infrastructure backbone for privacy-first AI interpretation.**

[![Tests](https://github.com/lhzn-io/kanoa-mlops/actions/workflows/pre-commit.yml/badge.svg)](https://github.com/lhzn-io/kanoa-mlops/actions/workflows/pre-commit.yml)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)

`kanoa-mlops` provides the local compute layer for the [`kanoa`](https://github.com/lhzn-io/kanoa) library — enabling you to interpret data science outputs (plots, tables, models) using state-of-the-art vision-language models, all running on your own hardware.

- **Privacy First** — Your data never leaves your machine

- **Multiple Backends** — Choose Ollama (easy), vLLM (fast), or cloud GPU (scalable)

- **Full Observability** — Prometheus + Grafana + NVIDIA DCGM monitoring stack

- **Seamless Integration** — Extends `kanoa` CLI with `serve` and `stop` commands

## Installation

### For Users (add to your project)

```bash

# Base install (local inference with Ollama/vLLM)

pip install kanoa-mlops

# With GCP support (for cloud deployment)

pip install kanoa-mlops[gcp]

# Everything (GCP + dev tools)

pip install kanoa-mlops[all]

```

### For Contributors

```bash

# Clone and install in editable mode

git clone https://github.com/lhzn-io/kanoa-mlops.git

cd kanoa-mlops

# Base environment (no GCP tools)

conda env create -f environment.yml

conda activate kanoa-mlops

# Or with GCP infrastructure tools (terraform, gcloud)

conda env create -f environment-gcp.yml

conda activate kanoa-mlops-gcp

```

## Quick Start

### Option A: Add to Existing Project (pip)

For users adding local AI to a science project:

```bash

# Install the package

pip install kanoa-mlops

# Initialize docker templates in your project

kanoa mlops init --dir .

# Start Ollama

kanoa mlops serve ollama

# Done! Your project now has local AI

```

### Option B: Clone Repository (full setup)

For contributors or those wanting the full monitoring stack:

```bash

# Clone the repo

git clone https://github.com/lhzn-io/kanoa-mlops.git

cd kanoa-mlops

# Start Ollama (pulls model on first run)

kanoa mlops serve ollama

# Start monitoring (optional)

kanoa mlops serve monitoring

```

### The Performance Path (vLLM)

For maximum throughput on NVIDIA GPUs:

```bash

# Download model (~14GB)

huggingface-cli download allenai/Molmo-7B-D-0924

# Start vLLM server

docker compose -f docker/vllm/docker-compose.molmo.yml up -d

# Verify

curl http://localhost:8000/health

```

## Platform Guidance

- **x86/CUDA GPUs** (RTX 4090, 5080, etc.): Use vLLM for best performance

- **Jetson Thor/ARM64**: Use Ollama for Scout (vLLM Thor image lacks bitsandbytes quantization)

- **Easy Setup**: Ollama handles quantization automatically (65GB Q4 vs 203GB full)

- **Production**: vLLM offers more control over inference parameters

#### Advanced Models

For the Olmo3 32B Think model (requires significant GPU memory):

```bash

# Download model (~32GB) - requires Hugging Face authentication for gated models

huggingface-cli download allenai/Olmo-3-32B-Think

# Start vLLM server (optimized for Jetson Thor with 128GB Unified Memory)

make serve-olmo3-32b

# Verify

curl http://localhost:8000/health

```

## Choose Your Backend

| Backend | Best For | Hardware | Throughput | Setup |

| :--- | :--- | :--- | :--- | :--- |

| **Ollama** | Getting started, CPU/Apple Silicon | Any | ~15 tok/s | `kanoa mlops serve ollama` |

| **vLLM** | Production, maximum speed | NVIDIA GPU | ~31 tok/s | Docker Compose |

| **GCP L4** | No local GPU, team sharing | Cloud | ~25 tok/s | Terraform |

### Ollama (Easiest)

Perfect for development, VSCode integration, and broader hardware support.

```bash

kanoa mlops serve ollama

# → Ollama running at http://localhost:11434

```

Supports: Gemma 3 (4B/12B), Llama 3, Mistral, and [many more](https://ollama.com/library).

### vLLM (Fastest)

Optimized inference with CUDA, batching, and OpenAI-compatible API.

```bash

# Molmo 7B (best for vision)

docker compose -f docker/vllm/docker-compose.molmo.yml up -d

# Gemma 3 12B (best for reasoning)

docker compose -f docker/vllm/docker-compose.gemma.yml up -d

```

Endpoints: `http://localhost:8000/v1/chat/completions`

### GCP Cloud GPU (Scalable)

For users without local GPUs or production workloads.

```bash

cd infrastructure/gcp

cp terraform.tfvars.example terraform.tfvars  # Configure

terraform apply

```

Features: L4 GPU (~$0.70/hr), auto-shutdown, IP-restricted firewall.

## Using with kanoa

Once a backend is running, `kanoa` automatically detects it:

```python

from kanoa import Interpreter

# Uses local backend (Ollama or vLLM) automatically

interpreter = Interpreter(backend="local")

# Interpret your matplotlib figure

result = interpreter.interpret(fig)

print(result.text)

```

Or explicitly configure:

```python

from kanoa.backends import VLLMBackend

backend = VLLMBackend(

    api_base="http://localhost:8000/v1",

    model="allenai/Molmo-7B-D-0924"

)

```

## CLI Integration

`kanoa-mlops` extends the `kanoa` CLI with infrastructure commands:

```bash

# Initialize (for pip users)

kanoa mlops init --dir .  # Scaffold docker templates

# Start services

kanoa mlops serve ollama       # Start Ollama

kanoa mlops serve monitoring   # Start Prometheus + Grafana

kanoa mlops serve all          # Start everything

# Stop services

kanoa mlops stop               # Stop all services

kanoa mlops stop ollama        # Stop specific service

# Status

kanoa mlops status             # Show config and running services

# Restart services

kanoa mlops restart ollama     # Restart Ollama

```

## Monitoring Stack

Real-time observability for your inference workloads:

```bash

kanoa mlops serve monitoring

# → Grafana:    http://localhost:3000 (admin/admin)

# → Prometheus: http://localhost:9090

```

**Dashboard Features:**

| Section | Metrics |

| :--- | :--- |

| Token Odometers | Cumulative prompt/generated tokens, request counts |

| Latency | TTFT and TPOT percentiles (p50, p90, p95, p99) |

| GPU Hardware | Temperature, power, utilization, memory (via NVIDIA DCGM) |

| vLLM Performance | KV cache usage, request queue, throughput |

See [monitoring/README.md](monitoring/README.md) for full documentation.

## Project Structure

```text

kanoa-mlops/

├── kanoa_mlops/           # CLI plugin (serve, stop commands)

│   └── plugin.py

├── docker/

│   ├── vllm/              # vLLM Docker Compose configs

│   └── ollama/            # Ollama Docker Compose config

├── monitoring/

│   ├── grafana/           # Dashboards and provisioning

│   └── prometheus/        # Scrape configs

├── infrastructure/

│   └── gcp/               # Terraform for cloud GPU

├── examples/              # Jupyter notebooks

├── scripts/               # Model download utilities

└── tests/integration/     # Backend integration tests

```

## Performance

### Benchmark Results (RTX 5080 16GB eGPU)

| Model | Backend | Throughput | Notes |

| :--- | :--- | :--- | :--- |

| **Molmo 7B** | vLLM | **31.1 tok/s** | Best for vision tasks |

| Gemma 3 12B | vLLM | 10.3 tok/s | Strong text reasoning |

| Gemma 3 4B | Ollama | ~15 tok/s | Good balance |

**Why vLLM is faster:**

- Continuous batching for concurrent requests

- PagedAttention for efficient KV cache

- FP8 quantization support

## Supported Models

### Vision-Language Models

| Model | Size | vLLM | Ollama | Notes |

| :--- | :--- | :--- | :--- | :--- |

| Molmo 7B | 14GB | [✓] | — | Best vision performance |

| Gemma 3 | 4B-27B | [✓] | [✓] | Excellent all-rounder |

| Olmo 3 32B Think | 32GB | [✓] | — | Advanced reasoning, code generation |

| LLaVa-Next | 7B-34B | [ ] | [✓] | Planned for vLLM |

### Text-Only Models (via Ollama)

Llama 3.1, Mistral, Qwen 2.5, and [100+ more](https://ollama.com/library).

## Hardware Compatibility

| Platform | Status | Notes |

| :--- | :--- | :--- |

| NVIDIA RTX (Desktop/Laptop) | [✓] Verified | RTX 3080+ recommended |

| NVIDIA RTX (eGPU) | [✓] Verified | TB3/TB4 bandwidth sufficient |

| NVIDIA Jetson Thor | [✓] Verified | 128GB Unified Memory, Blackwell GPU |

| Apple Silicon | [✓] Ollama | M1/M2/M3 via Ollama |

| GCP L4 GPU | [✓] Verified | 24GB VRAM, ~$0.70/hr |

| Intel/AMD GPU | — | Not supported |

## Development Setup

### Plugin Architecture

`kanoa-mlops` is a **plugin** for the `kanoa` CLI. The `kanoa` package provides the CLI framework, and `kanoa-mlops` registers additional commands (`serve`, `stop`, `restart`) via Python entry points.

```text

kanoa (CLI)  ──loads──►  kanoa-mlops (plugin)

     │                        │

     └── entry points ◄───────┘

```

### Co-Development Setup

To develop both packages simultaneously, install both in **editable mode**:

```bash

# Clone both repos

git clone https://github.com/lhzn-io/kanoa.git

git clone https://github.com/lhzn-io/kanoa-mlops.git

# Create and activate environment

conda env create -f kanoa-mlops/environment.yml

conda activate kanoa-mlops

# Install BOTH packages in editable mode

pip install -e ./kanoa           # Provides 'kanoa' CLI

pip install -e ./kanoa-mlops     # Registers plugin commands

# Verify

kanoa --help  # Should show: serve, stop, restart

```

> **Why both?** The `kanoa` package provides the CLI entry point. The `kanoa-mlops` package registers its commands as plugins. Both must be installed for the full CLI to work.

### Quick Reinstall

If you switch conda environments or commands are missing:

```bash

pip install -e /path/to/kanoa -e /path/to/kanoa-mlops

```

## Prerequisites

- **Docker** and Docker Compose

- **NVIDIA GPU + Drivers** (for vLLM)

- **Python 3.11+**

**WSL2/eGPU Users**: See the [Local GPU Setup Guide](docs/source/local-gpu-setup.md) for platform-specific instructions.

## Roadmap

- [✓] Ollama integration (Dec 2025)

- [✓] CLI plugin system (Dec 2025)

- [✓] NVIDIA DCGM monitoring (Dec 2025)

- [✓] NVIDIA Jetson Thor support (Dec 2025)

- [ ] PostgreSQL + pgvector for RAG

- [ ] Kubernetes / Helm charts

- [ ] NVIDIA Jetson Orin support

## Contributing

We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

- **Adding New Models?** Check out the [Model Contribution Guide](docs/adding-models.md).

> **Pro Tip**: We find **Claude Code** to be an excellent DevOps buddy for this project. If you use AI tools, just remember our [Human-in-the-Loop policy](CONTRIBUTING.md#4-ai-contribution-policy).

## License

MIT License — see [LICENSE](LICENSE) for details.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/lhzn-io/kanoa-mlops

Awesome Lists containing this project

README