https://github.com/ingero-io/ingero-fleet

GPU cluster straggler detection - custom OTEL Collector distribution

# Ingero Fleet - GPU Cluster Straggler Detection

[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)

**Version: 0.12.7**

Fleet is a lightweight central service that answers one question across your entire GPU cluster: **"Which node is the straggler?"**

Each [Ingero](https://github.com/ingero-io/ingero) agent computes a health score from its GPU workload and pushes it to Fleet via OTLP. Fleet computes a fleet-wide threshold using outlier-resistant statistics and returns it in the push response; agents can also poll for it as a fallback. Each agent then classifies itself against that threshold. Agents need no inbound network access - everything is outbound push and pull.

## Quick Start

> **Looking for a worked end-to-end example?** Three multi-node
> quickstart guides take you from zero to a detected straggler on
> three GPU hosts in about 20 minutes. Pick the deployment style that
> matches your environment:
>
> - [Kubernetes (Helm)](docs/quickstart-k8s.md)
> - [Bare-metal binary](docs/quickstart-binary.md)
> - [Docker](docs/quickstart-docker.md)
>
> See [`docs/quickstart.md`](docs/quickstart.md) for a one-page
> comparison if you are not sure which to pick.

### Option A: Add to your existing OTEL Collector

Add the Ingero Go modules to your [OCB](https://opentelemetry.io/docs/collector/custom-collector/) manifest and rebuild:

```yaml
# builder-config.yaml
processors:
  # ingero-version:builder-gomod-processor product=ingero-fleet channel=stable
  - gomod: github.com/ingero-io/ingero-fleet/processor v0.12.7

extensions:
  # ingero-version:builder-gomod-extension product=ingero-fleet channel=stable
  - gomod: github.com/ingero-io/ingero-fleet/extension v0.12.7
```

```bash
ocb --config builder-config.yaml
```

### Option B: Use the pre-built Fleet distribution

```bash
# Docker
docker run -p 4317:4317 -p 8080:8080 ghcr.io/ingero-io/ingero-fleet:latest

# Binary
# ingero-version:install-curl-version product=ingero-fleet channel=stable
VERSION=0.12.7
curl -fsSL "https://github.com/ingero-io/ingero-fleet/releases/download/v${VERSION}/ingero-fleet_${VERSION}_linux_amd64.tar.gz" | tar xz
./ingero-fleet --config fleet-config.yaml
```

### Option C: Kubernetes (Helm)

```bash
helm install ingero-fleet ./helm/ingero-fleet
```

(The chart default is `replicaCount: 1`; see the High Availability section below for multi-replica guidance.)

### Configure the agent

Point your Ingero agent at Fleet:

```yaml
# ingero.yaml (on each GPU node)
fleet:
  endpoint: https://fleet.example.com:4317
```

## Architecture

Ingero architecture: per-node agent emits OTLP to Fleet collector; Fleet aggregates via ingeroprocessor + providerprocessor; backends are Prometheus, Grafana, MCP clients, and UDS sinks

The detailed Fleet-Agent component view is below.

### Fleet-Agent Overview

```mermaid
graph TB
    subgraph GPU Cluster
        A1[Ingero Agent<br/>gpu-node-01<br/>score: 0.92]
        A2[Ingero Agent<br/>gpu-node-02<br/>score: 0.91]
        A3[Ingero Agent<br/>gpu-node-03<br/>score: 0.58]
        A4[Ingero Agent<br/>gpu-node-04<br/>score: 0.93]
    end

    subgraph Fleet Service
        R[OTLP Receiver<br/>gRPC :4317 / HTTP :4318]
        P[Ingero Processor<br/>Score Map + MAD + EMA]
        E[Ingero Extension<br/>Threshold API :8080<br/>Middleware Piggyback]
        EX[Prometheus Exporter]
    end

    subgraph Observability
        PR[Prometheus]
        GR[Grafana]
    end

    A1 -->|OTLP push| R
    A2 -->|OTLP push| R
    A3 -->|OTLP push| R
    A4 -->|OTLP push| R

    R --> P
    P -->|Set threshold| E
    P --> EX

    EX --> PR
    PR --> GR

    E -.->|threshold in<br/>push response| A1
    E -.->|threshold in<br/>push response| A2
    E -.->|threshold in<br/>push response| A3
    E -.->|threshold in<br/>push response| A4

    style A3 fill:#f66,stroke:#333,color:#fff
```

*Node 03 (score 0.58) is below the fleet threshold (0.87) - detected as straggler.*

### Detailed Communication Flow

```mermaid
sequenceDiagram
    participant A as Ingero Agent<br/>(GPU Node)
    participant R as OTLP Receiver<br/>:4317 gRPC / :4318 HTTP
    participant M as Ingero Extension<br/>(Middleware)
    participant P as Ingero Processor
    participant S as ThresholdStore<br/>(in-memory)
    participant T as Timer<br/>(every 10s)
    participant API as Threshold API<br/>:8080

    Note over A: Computes health score<br/>from 4 GPU signals

    A->>R: OTLP push (HTTP :4318)<br/>metric: ingero.node.health_score = 0.92<br/>attrs: node.id, cluster.id, state<br/>header: ingero.cluster.id = cluster-prod

    R->>M: HTTP request passes through middleware
    M->>R: Injects response headers:<br/>X-Ingero-Threshold: 0.87<br/>X-Ingero-Quorum-Met: true

    R->>P: ConsumeMetrics(OTLP payload)
    P->>P: Extract health_score from payload<br/>Write to score map[cluster:node]

    R-->>A: HTTP 200 + threshold headers

    Note over A: Reads X-Ingero-Threshold<br/>0.92 > 0.87 = healthy

    T->>P: Timer tick (every push_interval)
    P->>P: Read score map (RLock)<br/>Compute MAD per cluster<br/>Apply EMA smoothing<br/>Check quorum, panic mode
    P->>S: Set(cluster_id, ThresholdResult)

    Note over M: Next push reads<br/>updated threshold from store

    A->>API: GET /api/v1/threshold?cluster_id=cluster-prod<br/>(fallback if no piggyback)
    API->>S: Get(cluster_id)
    S-->>API: ThresholdResult
    API-->>A: {"threshold": 0.87, "quorum_met": true}
```

### Ports and Protocols

| Port | Protocol | Component | Direction | Purpose |
|------|----------|-----------|-----------|---------|
| 4317 | gRPC | OTLP Receiver | Agent -> Fleet | Health score push (binary protobuf) |
| 4318 | HTTP | OTLP Receiver | Agent -> Fleet | Health score push (JSON). Threshold returned in response headers. |
| 8080 | HTTP | Ingero Extension | Agent -> Fleet | `GET /api/v1/threshold` fallback endpoint |
| 8081 | HTTP | Ingero Extension | Admin -> Fleet | Diagnostics endpoint (loopback only; future release) |
| 55679 | HTTP | zPages | Internal | Health/readiness probes |
| 8888 | HTTP | OTEL Telemetry | Prometheus -> Fleet | Fleet self-monitoring metrics |

### Data Sent per Push

**Agent -> Fleet (OTLP metric payload):**
```
Resource attributes:
  ingero.node.id: "gpu-node-01"
  ingero.cluster.id: "cluster-prod"

Gauge: ingero.node.health_score = 0.92
  ingero.node.state: "active"
  ingero.workload_type: "training"

HTTP header:
  ingero.cluster.id: cluster-prod (for middleware routing)
```

**Fleet -> Agent (push response headers):**
```
X-Ingero-Threshold: 0.870348
X-Ingero-Quorum-Met: true
```

**Fleet -> Agent (GET fallback response):**
```json
{"threshold":0.870348,"quorum_met":true}
```
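
On the agent side, consuming these two response shapes might look like the sketch below. It is illustrative only, not the agent's actual code: the function names are hypothetical, and the case-insensitive header lookup is a precaution rather than documented behavior. Only the header names and JSON keys come from the examples above.

```python
import json


def threshold_from_headers(headers):
    """Return (threshold, quorum_met) from push-response headers, or None if absent."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-ingero-threshold")
    if raw is None:
        return None
    quorum_met = lowered.get("x-ingero-quorum-met", "false").strip().lower() == "true"
    return float(raw), quorum_met


def threshold_from_json(body):
    """Parse the body returned by the GET /api/v1/threshold fallback."""
    doc = json.loads(body)
    return float(doc["threshold"]), bool(doc["quorum_met"])


def resolve_threshold(push_headers, fetch_fallback):
    """Prefer the piggybacked headers; call the fallback only when they are missing.

    fetch_fallback is a hypothetical zero-argument callable that performs the
    GET request and returns the response body as a string.
    """
    result = threshold_from_headers(push_headers)
    if result is not None:
        return result
    return threshold_from_json(fetch_fallback())
```

Passing the fallback as a callable keeps the network call out of the parsing logic, so the GET request is only issued when the push response carried no threshold.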

Fleet is built as a custom [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) distribution. It adds two custom components; everything else is standard OTEL:

| Component | Type | What it does |
|-----------|------|-------------|
| Ingero Processor | OTEL processor | Accumulates health scores, computes the MAD-based threshold with EMA smoothing |
| Ingero Extension | OTEL extension | Serves threshold API for agent polling and diagnostics |
| Everything else | Standard OTEL | OTLP receiver, exporters, TLS, auth, batching - zero custom code |

## Why OTEL Collector

- Enterprises already deploy OTEL Collectors - familiar operational model
- Multi-backend export is free (Prometheus, Datadog, Grafana Cloud - just configure an exporter)
- TLS, auth, retry, batching all handled by the framework
- Composable - add the Ingero processor to your existing collector via [OCB](https://opentelemetry.io/docs/collector/custom-collector/)

## Key Properties

- **Stateless.** No database, no disk. Health scores and thresholds live in memory. A restart rebuilds state from incoming pushes in about 10 seconds.
- **Fail-open.** If Fleet goes down, agents use their cached threshold, then fall back to local baselines. Straggler detection degrades gracefully, never blocks workloads.
- **Outbound-only.** Agents push to Fleet and poll from Fleet - all outbound connections from GPU nodes. Zero firewall changes for enterprise GPU clusters with restricted inbound access.
- **Tiny.** ~50MB RAM, negligible CPU for typical clusters.
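
The fail-open behavior can be pictured as a small tier selection on the agent. The sketch below is an assumption about how such a fallback could be written, not the agent's implementation: the function name and parameters are hypothetical, the tier names mirror the `ingero_agent_detection_mode` values (fleet/cached/local/none), and the 300-second grace period reflects the "~5 min" figure in the High Availability section.

```python
def detection_mode(fleet_threshold, cached_threshold, cache_age_s,
                   local_baseline, grace_s=300.0):
    """Pick the detection tier and the threshold to classify against.

    Each argument is None when that source is unavailable. grace_s is an
    assumed default for how long a cached threshold stays usable.
    """
    if fleet_threshold is not None:
        return "fleet", fleet_threshold      # fresh threshold from Fleet
    if cached_threshold is not None and cache_age_s <= grace_s:
        return "cached", cached_threshold    # Fleet unreachable, cache still fresh
    if local_baseline is not None:
        return "local", local_baseline       # cache expired, use local baseline
    return "none", None                      # nothing to classify against
```

Because every branch returns a usable answer or an explicit "none", a Fleet outage only lowers detection quality; it never blocks the workload.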

## How It Works

### Health Score

Each agent computes a health score (0.0 - 1.0) from four signals:

| Signal | Weight | What it measures |
|--------|--------|-----------------|
| CUDA throughput | 0.40 | CUDA operations/sec relative to baseline |
| Compute efficiency | 0.25 | Kernel launch rate relative to baseline |
| Memory headroom | 0.20 | Available VRAM fraction |
| CPU availability | 0.15 | Inverse of scheduler contention |

The throughput signal is workload-agnostic - it works for both training (step throughput) and inference (request processing rate). Baselines adapt via exponential moving average.

All four signals are normalized to [0.0, 1.0] against the agent's rolling fast-window baseline, then combined as a weighted sum. A hard floor per signal catches "close to zero" conditions (deep stalls, OOM pressure) that a weighted average could otherwise hide. The agent classifies itself against its local baseline during warmup and switches to the Fleet-computed peer threshold once quorum is met.
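
As a rough illustration of the weighted sum and the per-signal floor, consider the sketch below. The weights come from the table above; everything else is an assumption made for the example: the signal key names, the floor value of 0.10, and the rule that a signal below the floor caps the overall score at that signal's value. The agent's real floor logic may differ.

```python
WEIGHTS = {
    "cuda_throughput": 0.40,
    "compute_efficiency": 0.25,
    "memory_headroom": 0.20,
    "cpu_availability": 0.15,
}
HARD_FLOOR = 0.10  # hypothetical value, not taken from the Ingero agent


def health_score(signals, floor=HARD_FLOOR):
    """Combine four normalized signals into a single score in [0.0, 1.0]."""
    # Clamp each signal so a noisy baseline cannot push it outside [0, 1].
    clamped = {name: min(1.0, max(0.0, signals[name])) for name in WEIGHTS}
    weighted = sum(WEIGHTS[name] * clamped[name] for name in WEIGHTS)
    worst = min(clamped.values())
    # Assumed floor rule: a near-zero signal caps the score so that healthy
    # signals cannot average it away.
    if worst < floor:
        return min(weighted, worst)
    return weighted
```

With CUDA throughput at 0.02 and the other three signals at 1.0, the plain weighted sum would be about 0.61, which looks only mildly degraded; the floor rule in this sketch reports 0.02 instead.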

### Threshold

Fleet computes the straggler threshold using [Median Absolute Deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) (MAD):

```
threshold = median(scores) - k * MAD * 1.4826
```

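MAD resists outliers (50% breakdown point vs 0% for mean/stddev), and the factor 1.4826 scales it to be comparable to a standard deviation for normally distributed scores. A single straggler - or even several, as long as they remain a minority - has little effect on the threshold. The `k` parameter (default 2.0) controls sensitivity.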

The threshold is delivered to agents via the OTLP push response headers, eliminating a separate polling round-trip.
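
The formula is short enough to sketch directly. The code below is a minimal illustration, not the processor's implementation: `mad_threshold` follows the formula above, while `ema` uses the textbook exponential moving average, which is assumed (not confirmed) to be the form behind `ema_alpha`.

```python
from statistics import median

MAD_SCALE = 1.4826  # makes MAD comparable to a standard deviation for normal data


def mad_threshold(scores, k=2.0):
    """threshold = median(scores) - k * MAD * 1.4826"""
    med = median(scores)
    mad = median([abs(score - med) for score in scores])
    return med - k * mad * MAD_SCALE


def ema(previous, current, alpha=0.2):
    """Standard EMA smoothing; previous is None on the first tick."""
    if previous is None:
        return current
    return alpha * current + (1.0 - alpha) * previous


scores = [0.92, 0.91, 0.58, 0.93]  # the four nodes from the overview diagram
raw = mad_threshold(scores)
stragglers = [score for score in scores if score < raw]
```

For these four scores the median is 0.915 and the MAD is 0.01, so the raw threshold before any smoothing is about 0.885. Only the 0.58 node falls below it, and the low score barely moves the median it is compared against.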

### Straggler Classification

```
if my_score < threshold:
    I am a straggler
```

That's it. The agent then emits a straggler event via OTLP and through the existing remediation protocol (`--remediate` flag).

## High Availability

### Single replica (recommended for most clusters)

`replicaCount: 1` is the chart default. Vertical scale is the path to larger clusters: a single g4dn.xlarge-class node carries 100+ pushing agents at 5s intervals with p99 handler latency under 20 ms.

If a single Fleet pod dies, agents use their cached threshold (~5 min grace), then fall back to local baseline. Restarts repopulate within 1-2 push intervals.

### Multi-replica HA (when you need it)

Each Fleet replica maintains its own in-memory score map. An agent push reaches ONE replica (selected by DNS or the service mesh); that replica's map is the only one that sees the score. Each replica computes its own threshold from its subset of agents.

For multi-replica deployments, put an L7 load balancer with **consistent-hash on the `cluster_id` query parameter** (Envoy / nginx / service mesh) in front of Fleet. Every agent from one cluster lands on the same replica, eliminating cross-replica drift.
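
To see why hashing on the cluster ID keeps each cluster on one replica, here is a small sketch using rendezvous (highest-random-weight) hashing, one common way to get consistent-hash behavior. It is purely illustrative: in production the load balancer (Envoy, nginx, or the mesh) does this for you, and the replica names and `pick_replica` function are hypothetical.

```python
import hashlib


def pick_replica(cluster_id, replicas):
    """Return the replica with the highest hash weight for this cluster_id."""
    def weight(replica):
        digest = hashlib.sha256(f"{replica}|{cluster_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    # Every agent in the cluster computes the same winner, whatever the list order.
    return max(replicas, key=weight)


replicas = ["fleet-0", "fleet-1", "fleet-2"]  # hypothetical replica names
chosen = pick_replica("cluster-prod", replicas)
```

The choice depends only on the cluster ID and the replica set, so all agents in `cluster-prod` land on the same replica. Removing a replica other than the chosen one leaves the assignment unchanged.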

Size `statistical_min` for the per-replica visible node count, not the cluster-wide count. Alert on `sum_over_replicas(ingero_fleet_active_nodes) < expected_total_nodes` for replica starvation.

Larger-cluster topologies (gateway-based shared state) are out of scope for this release. Talk to us if you're approaching the per-replica vertical-scale ceiling.

See `docs/ARCHITECTURE.md` for the full behavior model and rationale.

## Fleet Configuration

```yaml
# fleet-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  ingero:
    threshold:
      k: 2.0                      # MAD sensitivity (default: 2.0)
      ema_alpha: 0.2              # Threshold smoothing (default: 0.2)
    quorum:
      statistical_min: 5          # Min active nodes for valid threshold
      coverage_fraction: 0.80     # Coverage alert threshold
    push_interval: 10s            # Expected agent push interval
    ttl_multiplier: 5             # Node expiry = push_interval * ttl_multiplier

extensions:
  ingero_threshold:
    agent_endpoint: 0.0.0.0:8080      # Agent threshold poll (fallback)
    admin_endpoint: 127.0.0.1:8081    # Diagnostics (management plane only)

exporters:
  prometheus:
    endpoint: 0.0.0.0:9090

service:
  extensions: [ingero_threshold]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [ingero]
      exporters: [prometheus]
```
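
To make the quorum and expiry parameters concrete, the sketch below works through what the defaults imply. It is an interpretation of the comments in the config, not the processor's code: the class and function names are hypothetical, and `expected_nodes` plus the exact coverage comparison are assumptions.

```python
from dataclasses import dataclass


@dataclass
class QuorumConfig:
    statistical_min: int = 5
    coverage_fraction: float = 0.80
    push_interval_s: float = 10.0
    ttl_multiplier: int = 5


def node_ttl_s(cfg):
    """Node expiry = push_interval * ttl_multiplier (50 s with the defaults)."""
    return cfg.push_interval_s * cfg.ttl_multiplier


def active_nodes(last_seen, now, cfg):
    """Keep nodes whose last push is no older than the TTL."""
    ttl = node_ttl_s(cfg)
    return [node for node, ts in last_seen.items() if now - ts <= ttl]


def quorum_status(active_count, expected_nodes, cfg):
    """Return (quorum_met, coverage_low) under the assumed semantics."""
    quorum_met = active_count >= cfg.statistical_min
    coverage_low = (expected_nodes > 0
                    and active_count / expected_nodes < cfg.coverage_fraction)
    return quorum_met, coverage_low
```

With the defaults, a node that misses five consecutive pushes drops out of the score map. Six active nodes out of ten expected would satisfy `statistical_min` but still trip the coverage alert at 60% coverage.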

## Observability

Fleet emits its own metrics:

| Metric | Type | Description |
|--------|------|-------------|
| `ingero_fleet_threshold` | Gauge | Current straggler threshold per cluster |
| `ingero_fleet_active_nodes` | Gauge | Nodes actively reporting |
| `ingero_fleet_idle_nodes` | Gauge | Nodes in idle state |
| `ingero_fleet_coverage_low` | Gauge | 1 if coverage quorum not met |
| `ingero_fleet_panic_mode` | Gauge | 1 if panic mode active |
| `ingero_fleet_median` | Gauge | Fleet health score median |
| `ingero_fleet_mad` | Gauge | Fleet MAD value |

Agent-side metrics:

| Metric | Type | Description |
|--------|------|-------------|
| `ingero_agent_health_score` | Gauge | This node's health score |
| `ingero_agent_detection_mode` | Gauge | Current detection tier (fleet/cached/local/none) |
| `ingero_agent_fleet_reachable` | Gauge | 1 if Fleet is reachable |

## Documentation

- [Architecture](docs/ARCHITECTURE.md) - components, data flow, threshold computation
- [Deployment Guide](docs/deployment.md) - K8s, Slurm, bare metal
- [Configuration Reference](docs/configuration.md) - all config parameters
- [API Reference](docs/api.md) - threshold endpoints
- [End-to-end walkthrough on Lambda Cloud](examples/lambda-e2e/) - A100 + GH200 (arm64) reference deploy

## Requirements

- Ingero agent v0.10+ on each GPU node
- Go 1.22+ (for building from source)
- Kubernetes 1.24+ (for Helm deployment)

## License

Apache License 2.0. See [LICENSE](LICENSE).