https://github.com/r-aas/platform_monorepo

Local-first AI/ML platform on k3d — 35 Helm charts, 8 agents, 9 MCP servers, 243 federated tools. macOS Apple Silicon reference system.
https://github.com/r-aas/platform_monorepo
agentgateway agents apple-silicon argocd gitops helm k3d kagent local-first mcp mlops
Last synced: 4 months ago
JSON representation
Local-first AI/ML platform on k3d — 35 Helm charts, 8 agents, 9 MCP servers, 243 federated tools. macOS Apple Silicon reference system.
Host: GitHub
URL: https://github.com/r-aas/platform_monorepo
Owner: r-aas
License: mit
Created: 2026-03-30T03:02:37.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-30T03:14:02.000Z (4 months ago)
Last Synced: 2026-03-30T05:53:06.323Z (4 months ago)
Topics: agentgateway, agents, apple-silicon, argocd, gitops, helm, k3d, kagent, local-first, mcp, mlops
Language: Python
Size: 2.26 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # Platform Monorepo

[![Build Images](https://github.com/r-aas/platform_monorepo/actions/workflows/build-images.yml/badge.svg)](https://github.com/r-aas/platform_monorepo/actions/workflows/build-images.yml)

[![Lint Charts](https://github.com/r-aas/platform_monorepo/actions/workflows/lint-charts.yml/badge.svg)](https://github.com/r-aas/platform_monorepo/actions/workflows/lint-charts.yml)

[![Release](https://github.com/r-aas/platform_monorepo/releases/latest)](https://github.com/r-aas/platform_monorepo/releases)

A reference implementation for **AgentOps end-to-end** — running entirely on a single Mac. No cloud. No API keys. Everything local, reproducible, and observable.

This is what a complete agent operations system looks like when you wire together best-of-breed open-source tools: agent definitions, prompt lifecycle, experiment tracking, tool federation, autonomous scheduling, LLM observability, and GitOps deployment — all in one repo.

```

task up    # ~20 min from zero to fully operational

```

## Why This Exists

Building AI agents is easy. Operating them is hard. There's no single tool that covers the full lifecycle:

| Problem | What's needed | What this repo uses |

|---------|---------------|---------------------|

| Agents need tools | Standardized tool interface | **9 MCP servers** behind a unified gateway (243 tools) |

| Agents need memory | Persistent context across runs | **pgvector** with per-agent TTL and memory categories |

| Agents need scheduling | Autonomous execution on cadence | **kagent** CRDs + k8s CronJobs |

| Prompts need versioning | Track what changed and when | **MLflow** prompt registry with aliases and canary routing |

| Prompts need evaluation | Know if changes help or hurt | **LLM-as-judge** eval pipeline with A/B testing |

| LLM calls need observability | Cost, latency, quality tracking | **Langfuse** tracing + drift detection |

| Everything needs deployment | Reproducible, auditable, rollback-able | **ArgoCD** GitOps from in-cluster **GitLab** |

| Everything needs to talk | Standard protocols | **OpenAI-compatible API**, **A2A protocol**, **MCP** |

This repo is the answer to "how do I actually run all of this together?"

## Architecture

```

┌─────────────────────────────────────────────────────────┐

│                    macOS (Apple Silicon)                  │

│                                                          │

│  Ollama (native, Metal GPU)                              │

│     ↑                                                    │

│  ┌──┴────────────────────────────────────────────────┐   │

│  │  k3d cluster "mewtwo"                             │   │

│  │                                                   │   │

│  │  LiteLLM ──→ Ollama (192.168.65.254:11434)           │   │

│  │     ↑                                             │   │

│  │  n8n (17 workflows) ──→ MLflow (prompts/evals)    │   │

│  │     ↑                       ↓                     │   │

│  │  agentgateway ←── 9 MCP servers (243 tools)       │   │

│  │     ↑                                             │   │

│  │  kagent (6 agents on CronJob schedules)           │   │

│  │     ↑                                             │   │

│  │  ArgoCD ←── GitLab CE (in-cluster git)            │   │

│  └───────────────────────────────────────────────────┘   │

│                                                          │

│  Colima VM (dockerd, 8 CPU, 32 GB RAM)                   │

└─────────────────────────────────────────────────────────┘

```

### How the layers connect

**You ask a question** → n8n `/webhook/chat` receives it → looks up the agent's system prompt from MLflow → classifies the task → picks the right task-specific prompt → calls LiteLLM → LiteLLM routes to Ollama → response is traced to Langfuse and MLflow → returned to you with a `trace_id` for feedback.

**An agent runs autonomously** → kagent CronJob fires → POSTs to the agent's A2A endpoint → agent loads its memory from pgvector → picks tools from the MCP gateway → executes → writes results back to memory → logs trace.

**You change a prompt** → commit to GitLab → ArgoCD syncs → n8n workflow promotion hook patches URLs and upserts workflows → new prompt version is live. Roll back by reverting the commit.

## Services

| Service | Purpose | URL |

|---------|---------|-----|

| **n8n** | Workflow automation — 17 workflows expose webhook APIs for chat, prompts, eval, tracing | http://n8n.platform.127.0.0.1.nip.io |

| **MLflow** | Prompt registry (versioned, with aliases), experiment tracking, eval storage | http://mlflow.platform.127.0.0.1.nip.io |

| **LiteLLM** | OpenAI-compatible proxy — routes to Ollama, handles auth, per-key access | http://litellm.platform.127.0.0.1.nip.io |

| **Langfuse** | LLM observability — traces, scores, cost analysis, drift detection | http://langfuse.platform.127.0.0.1.nip.io |

| **Gateway** | Unified entry point — agent-gateway (Python) + agentgateway MCP proxy (Rust) | http://gateway.platform.127.0.0.1.nip.io |

| **ArgoCD** | GitOps controller — syncs all 35 Helm charts from GitLab | http://argocd.platform.127.0.0.1.nip.io |

| **GitLab CE** | In-cluster git — source of truth for ArgoCD, no external dependencies | http://gitlab.platform.127.0.0.1.nip.io |

| **ODD Platform** | Data catalog — metadata, lineage, quality (PostgreSQL-only, ARM64 native) | http://odd.platform.127.0.0.1.nip.io |

| **Plane** | Project management — issues, sprints, backlogs | http://plane.platform.127.0.0.1.nip.io |

| **MinIO** | S3-compatible object storage for MLflow artifacts | http://minio.platform.127.0.0.1.nip.io |

All URLs use [nip.io](https://nip.io) wildcard DNS — `*.platform.127.0.0.1.nip.io` resolves to `127.0.0.1`. No `/etc/hosts` editing needed.

## Agents

Six autonomous agents, each with a defined role, schedule, tool access, and memory:

| Agent | Schedule | What it does | Tools |

|-------|----------|--------------|-------|

| **platform-admin** | Every 15 min | Watches cluster health, responds to incidents, manages k8s resources | kubernetes, gitlab, kagent, ollama |

| **project-coordinator** | Hourly | Triages backlog, manages sprints, reports status across Plane and GitLab | kubernetes, plane, gitlab |

| **data-engineer** | Every 2 hours | Manages data catalog, traces lineage, runs quality checks, handles ingestion | kubernetes, minio, mlflow |

| **mlops** | Every 4 hours | Tracks experiments, manages model lifecycle, monitors drift | kubernetes, mlflow, langfuse, minio, ollama |

| **developer** | Every 6 hours | Generates code, reviews PRs, runs security audits, manages CI/CD | kubernetes, gitlab |

| **qa-eval** | Nightly (2 AM) | Runs benchmarks, detects regressions, evaluates prompt quality | kubernetes, mlflow, langfuse |

Each agent has:

- **Autonomy guardrails** — max budget ($2-10), max turns (100-300), approval gates for destructive ops

- **Persistent memory** — pgvector with categorized memories (incidents, decisions, patterns, baselines)

- **Collaboration graph** — agents can delegate to each other via A2A protocol

Agent definitions live in `agents/` as YAML specs.

## Workflows (n8n)

The 17 n8n workflows are the API layer of the platform. Key endpoints:

| Endpoint | Workflow | What it does |

|----------|----------|--------------|

| `POST /webhook/chat` | Chat | Unified chat — agent mode (with MCP tools) or plain LLM. Task classification, session management, trace logging |

| `POST /webhook/prompts` | Prompt CRUD | Create, update, delete, promote, diff, canary config for versioned prompts |

| `POST /webhook/eval` | Prompt Eval | Run test cases with LLM-as-judge scoring. A/B eval between production and staging |

| `POST /webhook/traces` | Tracing | Log executions, submit feedback, set baselines, detect drift |

| `POST /webhook/sessions` | Sessions | Persistent conversation history with append/close lifecycle |

| `POST /webhook/datasets` | Datasets | Upload and manage evaluation datasets |

| `POST /webhook/experiments` | Experiments | Browse MLflow experiments, runs, and metrics |

| `POST /webhook/agents` | Agent Catalog | Query and manage registered agents |

| `GET /webhook/v1/models` | OpenAI Compat | Drop-in OpenAI API — lists prompts as models |

| `POST /webhook/v1/chat/completions` | OpenAI Compat | Chat completions with prompt-enhanced routing and canary support |

| `POST /webhook/a2a` | A2A Server | Google Agent-to-Agent protocol (JSON-RPC 2.0) |

Workflow JSONs live in `n8n-data/workflows/` and are promoted to the cluster via a Helm post-install hook that clones the repo, patches service URLs, and upserts via the n8n REST API.

## MCP Servers (Tool Federation)

Nine MCP servers expose platform capabilities as tools. The **agentgateway** (Rust, Linux Foundation) federates all of them into a single endpoint:

| Server | Wraps | Example tools |

|--------|-------|---------------|

| **mcp-kubernetes** | k8s API | `kubectl_get`, `kubectl_logs`, `exec_in_pod`, `kubectl_apply` |

| **mcp-gitlab** | GitLab CE | `create_issue`, `create_merge_request`, `list_pipelines` |

| **mcp-mlflow** | MLflow API | `search_experiments`, `log_metric`, `get_model_version` |

| **mcp-langfuse** | Langfuse API | `get_traces`, `create_score`, `get_usage` |

| **mcp-n8n** | n8n API | `list_workflows`, `execute_workflow`, `get_executions` |

| **mcp-plane** | Plane API | `create_issue`, `list_cycles`, `update_sprint` |

| **mcp-minio** | MinIO S3 | `list_buckets`, `get_object`, `put_object` |

| **mcp-ollama** | Ollama API | `list_models`, `pull_model`, `generate` |

| **mcp-odd-platform** | ODD Platform | `search_catalog`, `get_upstream_lineage`, `get_quality_tests` |

**243 tools** available in a single MCP session at `gateway.platform.127.0.0.1.nip.io/mcp/all`. Tool names are prefixed by backend (e.g., `kubernetes_kubectl_get`, `gitlab_create_issue`).

Per-backend routes also available: `/mcp/kubernetes`, `/mcp/gitlab`, `/mcp/mlflow`, etc.

## Prompt Lifecycle

The platform implements a complete prompt-as-code lifecycle:

```

seed → eval + judge → optimize → benchmark → promote → canary → monitor

```

1. **Version** — Prompts stored in MLflow with semantic versioning, `production` and `staging` aliases

2. **Evaluate** — Run test cases against prompts, score with LLM-as-judge (built-in or custom criteria)

3. **A/B test** — Canary routing sends configurable traffic percentage to staging version

4. **Promote** — Move staging to production when metrics improve

5. **Monitor** — Trace every LLM call, detect drift against baselines, alert on regression

## Prerequisites

**Minimum**: Apple Silicon Mac with 64 GB RAM (see [Hardware Requirements](#hardware-requirements) for details).

| Tool | Version | Install |

|------|---------|---------|

| macOS | 14+ (Apple Silicon) | — |

| Colima | 0.8+ | `brew install colima` |

| Docker CLI | 27+ | `brew install docker` |

| k3d | 5.7+ | `brew install k3d` |

| kubectl | 1.30+ | `brew install kubectl` |

| Helm | 3.16+ | `brew install helm` |

| helmfile | 0.169+ | `brew install helmfile` |

| Task | 3.40+ | `brew install go-task` |

| Ollama | 0.6+ | [ollama.com](https://ollama.com) |

| uv | 0.6+ | `brew install uv` |

| jq, yq | latest | `brew install jq yq` |

```bash

brew install colima docker k3d kubectl helm helmfile go-task uv jq yq

brew install --cask ollama

```

## Quick Start

```bash

# 1. Clone

git clone https://github.com/r-aas/platform_monorepo.git

cd platform_monorepo

# 2. Create secrets

cp envs/secrets.env.example envs/secrets.env

# Defaults work for local dev — edit if you want custom passwords

# 3. Bootstrap everything

task up

```

That's it. `task up` handles everything: preflight checks → starts Colima VM → starts Ollama → pulls models (glm-4.7-flash + nomic-embed-text) → creates k3d cluster → helmfile bootstrap → GitLab setup → pulls/builds images → ArgoCD sync → n8n workflow promotion → agent deployment → smoke tests. ~20 minutes on first run.

### Verify it works

```bash

task smoke           # Hit every endpoint, report pass/fail

task urls            # Print all URLs with credentials

task status          # Full platform health check

```

### Try it

```bash

# Chat with an agent

curl -X POST http://n8n.platform.127.0.0.1.nip.io/webhook/chat \

  -H 'Content-Type: application/json' \

  -d '{"message": "what experiments are running?", "agent_name": "mlops"}'

# Chat with MCP tools (agent can call any of 243 tools)

curl -X POST http://n8n.platform.127.0.0.1.nip.io/webhook/chat \

  -H 'Content-Type: application/json' \

  -d '{"message": "list pods in the genai namespace", "config": {"mcp_tools": "all"}}'

# Use the OpenAI-compatible API

curl -X POST http://n8n.platform.127.0.0.1.nip.io/webhook/v1/chat/completions \

  -H 'Content-Type: application/json' \

  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "hello"}]}'

# Initialize an MCP session (all 243 tools)

curl -X POST http://gateway.platform.127.0.0.1.nip.io/mcp/all \

  -H 'Content-Type: application/json' \

  -H 'Accept: application/json, text/event-stream' \

  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}}}'

```

## Day-to-Day Operations

```bash

task stop            # Pause (preserves state, frees RAM — ~21s)

task start           # Resume (~7 min)

task restart         # Full restart (handles crashes)

task down            # Destroy cluster (PV data preserved)

task smoke           # Verify all endpoints

task status          # Platform health overview

task urls            # All URLs + credentials

task doctor          # Preflight + smoke combined

```

## Repo Structure

```

charts/                 # 35 Helm charts (ArgoCD-managed after bootstrap)

  genai-n8n/            #   n8n + workflow promotion hook

  genai-mlflow/         #   MLflow tracking server

  genai-litellm/        #   LLM proxy

  genai-agentgateway/   #   Rust MCP proxy (Linux Foundation)

  genai-agent-gateway/  #   Python agent gateway (custom)

  genai-langfuse/       #   LLM observability

  genai-odd-platform/   #   Data catalog (ODD Platform)

  genai-plane/          #   Project management

  genai-minio/          #   Object storage

  genai-agentregistry/  #   Agent/skill catalog + semantic search

  genai-mcp-*/          #   Individual MCP server charts (9)

  genai-pg-*/           #   PostgreSQL instances (n8n, mlflow, plane)

  gitlab-ce/            #   In-cluster source control

  argocd/               #   GitOps controller

  ingress-nginx/        #   Ingress controller

n8n-data/workflows/     # 17 workflow JSONs (promoted to cluster via Helm hook)

agents/                 # Agent YAML definitions (6 agents + shared config)

skills/                 # 21 skill definitions

scripts/                # Bootstrap, import, benchmark, smoke test scripts

specs/                  # Feature specs (001-029, spec-driven development)

services/

  agent-gateway/        # Python FastAPI — agent registry, skill catalog, MCP proxy

images/                 # 19 Docker image builds (MCP servers, custom services)

mcp-servers/            # MCP server catalog config

data/                   # Seed prompts, benchmarks, training data

docs/                   # Architecture docs, environment guide

envs/                   # Environment config (global.env, secrets.env)

manifests/              # Raw k8s manifests + CRDs

taskfiles/              # Taskfile includes (mcp, mlops, quality, status, workflows)

helmfile.yaml           # Bootstrap-only (seeds ArgoCD + infra)

Taskfile.yml            # Root task runner

```

## Secrets

All secrets live in `envs/secrets.env` (gitignored). The example file ships with safe defaults for local dev:

| Secret | What it's for |

|--------|---------------|

| `PG_MLFLOW_PASSWORD` | MLflow PostgreSQL |

| `PG_LANGFUSE_PASSWORD` | Langfuse PostgreSQL |

| `PG_PLANE_PASSWORD` | Plane PostgreSQL |

| `MLFLOW_FLASK_SECRET_KEY` | MLflow session signing |

| `MINIO_ROOT_USER` / `PASSWORD` | MinIO S3 storage |

| `LITELLM_API_KEY` | LiteLLM proxy auth |

| `GITLAB_PAT` | Auto-generated during `task up` |

| `PLANE_API_TOKEN` | Plane project management API |

| `LANGFUSE_PUBLIC_KEY` / `SECRET_KEY` | Langfuse observability |

| `PGVECTOR_PASSWORD` | Shared pgvector (ODD Platform, agent registry) |

`task seed-secrets` creates k8s secrets from this file. Use `--force` to recreate.

## Troubleshooting

### Common issues

| Symptom | Cause | Fix |

|---------|-------|-----|

| Pods stuck in `Pending` | Colima VM out of resources | `colima stop && colima start --cpu 8 --memory 32 --disk 200` |

| `ImagePullBackOff` on MCP pods | Custom images not on all k3d nodes | `task build-images` (imports to all nodes) |

| n8n webhooks return 404 | Workflows not activated after import | `task smoke` to check; workflow promotion hook runs on deploy |

| MLflow rejects requests | DNS rebinding protection | Chart sets `disableSecurityMiddleware` flag — check ArgoCD sync |

| Ollama unreachable from cluster | Wrong host IP | Pods must use `192.168.65.254` (Colima gateway), not `localhost` |

| LiteLLM 401 errors | Missing API key in n8n | Check `LITELLM_API_KEY` in secrets.env, run `task seed-secrets` |

| ArgoCD sync stuck | Failed Helm hook blocking | Delete the failed Job: `kubectl delete job  -n genai` |

| PostgreSQL won't start | sshfs doesn't support chown | local-path provisioner must use `/var/lib/rancher/k3s/local-storage` (overlay FS) |

| `disk-pressure` taint after restart | Colima VM disk was full | Free space, restart Colima, then `kubectl taint nodes  node.kubernetes.io/disk-pressure-` |

### Diagnostic commands

```bash

task doctor          # Preflight checks + smoke tests

task status          # Full platform status

task smoke           # Hit every endpoint

kubectl get pods -n genai | grep -v Running    # Find unhealthy pods

kubectl logs deploy/genai-n8n -n genai         # n8n logs

kubectl logs deploy/genai-mlflow -n genai      # MLflow logs

```

## How it's deployed

1. **helmfile** bootstraps the cluster — installs ingress-nginx, creates namespaces, deploys GitLab CE and ArgoCD

2. **ArgoCD** takes over — watches GitLab for changes to `charts/`, auto-syncs all 35 Helm charts

3. **Workflow promotion** — a Helm post-install hook on the n8n chart clones this repo, patches service URLs for k8s DNS, and upserts all workflows via the n8n REST API

4. **Agent scheduling** — kagent CRDs define agents, k8s CronJobs POST to their A2A endpoints on cadence

Change anything in `charts/` → push to GitLab → ArgoCD syncs automatically. That's it.

## Hardware Requirements

The platform runs 35 Helm charts totaling **9.3 CPU cores** and **12.5 GB memory** in requests (26.5 GB limits), plus Ollama running natively on the Mac for GPU inference. Here's what you actually need:

### Resource breakdown

| Component | CPU | RAM | Disk |

|-----------|-----|-----|------|

| Colima VM (k3d cluster) | 8 cores | 24-32 GB | 200 GB |

| Ollama (native, Metal GPU) | shared | 19 GB (glm-4.7-flash) | 20 GB (model files) |

| macOS + apps | shared | ~6 GB | — |

### Machine tiers

With glm-4.7-flash (19 GB) + Colima VM + macOS:

| Machine | RAM | Works? | Notes |

|---------|-----|--------|-------|

| 24 GB | 24 GB | No | Not enough for VM + model + macOS |

| 36 GB | 36 GB | No | Can't fit 19 GB model + usable VM |

| 48 GB | 48 GB | Tight | 22 GB VM, expect memory pressure under load |

| 64 GB | 64 GB | Yes | 32 GB VM, comfortable headroom |

| 96-128 GB | 96-128 GB | Ideal | Full resource limits, room for larger models |

### LLM model

The platform uses `glm-4.7-flash` (19 GB) for inference and `nomic-embed-text` (274 MB) for embeddings:

```bash

ollama pull glm-4.7-flash

ollama pull nomic-embed-text

```

### Adjusting for smaller machines

If you have 36-48 GB RAM, reduce the Colima VM size:

```bash

# In task up, or manually:

colima start --cpu 6 --memory 20 --disk 150

```

The cluster will still run — pods will schedule with less headroom and some may restart under memory pressure.

### Disk space

| What | Size |

|------|------|

| Colima VM disk | 200 GB (thin-provisioned, grows as needed) |

| Persistent volumes (databases, storage) | ~59 GB |

| Docker images (all services) | ~25 GB |

| Ollama models | ~20 GB (glm-4.7-flash + nomic-embed-text) |

| Repo + tools | ~2 GB |

Developed on a MacBook Pro M4 Max (128 GB RAM, 16 cores).

## License

MIT
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/r-aas/platform_monorepo

Awesome Lists containing this project

README