https://github.com/ingero-io/ingero-fleet
GPU cluster straggler detection - custom OTEL Collector distribution
anomaly-detection distributed-training gpu gpu-observability kubernetes llm-inference machine-learning observability opentelemetry opentelemetry-collector otlp sre straggler-detection
- Host: GitHub
- URL: https://github.com/ingero-io/ingero-fleet
- Owner: ingero-io
- Created: 2026-04-14T15:43:46.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-02T19:08:52.000Z (about 2 months ago)
- Last Synced: 2026-05-02T20:32:24.803Z (about 2 months ago)
- Topics: anomaly-detection, distributed-training, gpu, gpu-observability, kubernetes, llm-inference, machine-learning, observability, opentelemetry, opentelemetry-collector, otlp, sre, straggler-detection
- Language: Go
- Homepage:
- Size: 8.03 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
README
# Ingero Fleet - GPU Cluster Straggler Detection
**Version: 0.12.7**
Fleet is a lightweight central service that answers one question across your entire GPU cluster: **"Which node is the straggler?"**
Each [Ingero](https://github.com/ingero-io/ingero) agent computes a health score from its GPU workload, pushes it to Fleet via OTLP, and Fleet computes a fleet-wide threshold using outlier-resistant statistics. Agents poll the threshold and self-classify. No agents need inbound network access - everything is outbound push and pull.
## Quick Start
> **Looking for a worked end-to-end example?** Three multi-node
> quickstart guides take you from zero to a detected straggler on
> three GPU hosts in about 20 minutes. Pick the deployment style that
> matches your environment:
>
> - [Kubernetes (Helm)](docs/quickstart-k8s.md)
> - [Bare-metal binary](docs/quickstart-binary.md)
> - [Docker](docs/quickstart-docker.md)
>
> See [`docs/quickstart.md`](docs/quickstart.md) for a one-page
> comparison if you are not sure which to pick.
### Option A: Add to your existing OTEL Collector
Add the Ingero Go modules to your [OCB](https://opentelemetry.io/docs/collector/custom-collector/) manifest and rebuild:
```yaml
# builder-config.yaml
processors:
  # ingero-version:builder-gomod-processor product=ingero-fleet channel=stable
  - gomod: github.com/ingero-io/ingero-fleet/processor v0.12.7
extensions:
  # ingero-version:builder-gomod-extension product=ingero-fleet channel=stable
  - gomod: github.com/ingero-io/ingero-fleet/extension v0.12.7
```
```bash
ocb --config builder-config.yaml
```
### Option B: Use the pre-built Fleet distribution
```bash
# Docker
docker run -p 4317:4317 -p 8080:8080 ghcr.io/ingero-io/ingero-fleet:latest
# Binary
# ingero-version:install-curl-version product=ingero-fleet channel=stable
VERSION=0.12.7
curl -fsSL "https://github.com/ingero-io/ingero-fleet/releases/download/v${VERSION}/ingero-fleet_${VERSION}_linux_amd64.tar.gz" | tar xz
./ingero-fleet --config fleet-config.yaml
```
### Option C: Kubernetes (Helm)
```bash
helm install ingero-fleet ./helm/ingero-fleet
```
(Chart default is `replicaCount: 1`; see High Availability section below for multi-replica guidance.)
### Configure the agent
Point your Ingero agent at Fleet:
```yaml
# ingero.yaml (on each GPU node)
fleet:
  endpoint: https://fleet.example.com:4317
```
## Architecture

The detailed Fleet-Agent component view is below.
### Fleet-Agent Overview
```mermaid
graph TB
    subgraph GPU Cluster
        A1["Ingero Agent<br/>gpu-node-01<br/>score: 0.92"]
        A2["Ingero Agent<br/>gpu-node-02<br/>score: 0.91"]
        A3["Ingero Agent<br/>gpu-node-03<br/>score: 0.58"]
        A4["Ingero Agent<br/>gpu-node-04<br/>score: 0.93"]
    end
    subgraph Fleet Service
        R["OTLP Receiver<br/>gRPC :4317 / HTTP :4318"]
        P["Ingero Processor<br/>Score Map + MAD + EMA"]
        E["Ingero Extension<br/>Threshold API :8080<br/>Middleware Piggyback"]
        EX[Prometheus Exporter]
    end
    subgraph Observability
        PR[Prometheus]
        GR[Grafana]
    end
    A1 -->|OTLP push| R
    A2 -->|OTLP push| R
    A3 -->|OTLP push| R
    A4 -->|OTLP push| R
    R --> P
    P -->|Set threshold| E
    P --> EX
    EX --> PR
    PR --> GR
    E -.->|threshold in<br/>push response| A1
    E -.->|threshold in<br/>push response| A2
    E -.->|threshold in<br/>push response| A3
    E -.->|threshold in<br/>push response| A4
    style A3 fill:#f66,stroke:#333,color:#fff
```
*Node 03 (score 0.58) is below the fleet threshold (0.87) - detected as straggler.*
### Detailed Communication Flow
```mermaid
sequenceDiagram
    participant A as Ingero Agent<br/>(GPU Node)
    participant R as OTLP Receiver<br/>:4317 gRPC / :4318 HTTP
    participant M as Ingero Extension<br/>(Middleware)
    participant P as Ingero Processor
    participant S as ThresholdStore<br/>(in-memory)
    participant T as Timer<br/>(every 10s)
    participant API as Threshold API<br/>:8080

    Note over A: Computes health score<br/>from 4 GPU signals
    A->>R: OTLP push (HTTP :4318)<br/>metric: ingero.node.health_score = 0.92<br/>attrs: node.id, cluster.id, state<br/>header: ingero.cluster.id = cluster-prod
    R->>M: HTTP request passes through middleware
    M->>R: Injects response headers:<br/>X-Ingero-Threshold: 0.87<br/>X-Ingero-Quorum-Met: true
    R->>P: ConsumeMetrics(OTLP payload)
    P->>P: Extract health_score from payload<br/>Write to score map[cluster:node]
    R-->>A: HTTP 200 + threshold headers
    Note over A: Reads X-Ingero-Threshold<br/>0.92 > 0.87 = healthy

    T->>P: Timer tick (every push_interval)
    P->>P: Read score map (RLock)<br/>Compute MAD per cluster<br/>Apply EMA smoothing<br/>Check quorum, panic mode
    P->>S: Set(cluster_id, ThresholdResult)
    Note over M: Next push reads<br/>updated threshold from store

    A->>API: GET /api/v1/threshold?cluster_id=cluster-prod<br/>(fallback if no piggyback)
    API->>S: Get(cluster_id)
    S-->>API: ThresholdResult
    API-->>A: {"threshold": 0.87, "quorum_met": true}
```
### Ports and Protocols
| Port | Protocol | Component | Direction | Purpose |
|------|----------|-----------|-----------|---------|
| 4317 | gRPC | OTLP Receiver | Agent -> Fleet | Health score push (binary protobuf) |
| 4318 | HTTP | OTLP Receiver | Agent -> Fleet | Health score push (JSON). Threshold returned in response headers. |
| 8080 | HTTP | Ingero Extension | Agent -> Fleet | `GET /api/v1/threshold` fallback endpoint |
| 8081 | HTTP | Ingero Extension | Admin -> Fleet | Diagnostics endpoint (loopback only; future release) |
| 55679 | HTTP | zPages | Internal | Health/readiness probes |
| 8888 | HTTP | OTEL Telemetry | Prometheus -> Fleet | Fleet self-monitoring metrics |
### Data Sent per Push
**Agent -> Fleet (OTLP metric payload):**
```
Resource attributes:
  ingero.node.id: "gpu-node-01"
  ingero.cluster.id: "cluster-prod"
Gauge: ingero.node.health_score = 0.92
  ingero.node.state: "active"
  ingero.workload_type: "training"
HTTP header:
  ingero.cluster.id: cluster-prod  (for middleware routing)
```
**Fleet -> Agent (push response headers):**
```
X-Ingero-Threshold: 0.870348
X-Ingero-Quorum-Met: true
```
**Fleet -> Agent (GET fallback response):**
```json
{"threshold":0.870348,"quorum_met":true}
```
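On the agent side, consuming these two response shapes could look roughly like the sketch below. It is not the Ingero agent's code: `ThresholdResult`, `parseThreshold`, `parseFallback`, and `isStraggler` are hypothetical names, and the rule that a node is only classified once `quorum_met` is true is an assumption drawn from the warmup behavior described under Health Score.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
)

// ThresholdResult mirrors the two fields shown in the responses above.
type ThresholdResult struct {
	Threshold float64 `json:"threshold"`
	QuorumMet bool    `json:"quorum_met"`
}

// parseThreshold decodes the raw values of the X-Ingero-Threshold and
// X-Ingero-Quorum-Met push-response headers.
func parseThreshold(threshold, quorumMet string) (ThresholdResult, error) {
	t, err := strconv.ParseFloat(threshold, 64)
	if err != nil {
		return ThresholdResult{}, fmt.Errorf("threshold header: %w", err)
	}
	q, err := strconv.ParseBool(quorumMet)
	if err != nil {
		return ThresholdResult{}, fmt.Errorf("quorum header: %w", err)
	}
	return ThresholdResult{Threshold: t, QuorumMet: q}, nil
}

// parseFallback decodes the JSON body of the GET fallback response.
func parseFallback(body []byte) (ThresholdResult, error) {
	var r ThresholdResult
	if err := json.Unmarshal(body, &r); err != nil {
		return ThresholdResult{}, fmt.Errorf("fallback body: %w", err)
	}
	return r, nil
}

// isStraggler applies the fleet threshold only when quorum is met
// (assumption: without quorum the agent keeps using its local baseline).
func isStraggler(score float64, r ThresholdResult) bool {
	return r.QuorumMet && score < r.Threshold
}

func main() {
	// Stand-in for the headers on a real push response.
	h := http.Header{}
	h.Set("X-Ingero-Threshold", "0.870348")
	h.Set("X-Ingero-Quorum-Met", "true")

	r, err := parseThreshold(h.Get("X-Ingero-Threshold"), h.Get("X-Ingero-Quorum-Met"))
	if err != nil {
		// Piggyback missing or malformed: use the GET fallback body instead.
		r, err = parseFallback([]byte(`{"threshold":0.870348,"quorum_met":true}`))
	}
	if err != nil {
		panic(err)
	}
	fmt.Println(isStraggler(0.58, r), isStraggler(0.92, r))
}
```

Keeping header parsing separate from the HTTP call makes the fallback path easy to exercise without a running Fleet instance.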
Fleet is built as a custom [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) distribution. It adds two custom components; everything else is standard OTEL:
| Component | Type | What it does |
|-----------|------|-------------|
| Ingero Processor | OTEL processor | Accumulates health scores, computes the MAD-based threshold with EMA smoothing |
| Ingero Extension | OTEL extension | Serves the threshold API for agent polling and diagnostics; injects the threshold into push response headers |
| Everything else | Standard OTEL | OTLP receiver, exporters, TLS, auth, batching - zero custom code |
## Why OTEL Collector
- Enterprises already deploy OTEL Collectors - familiar operational model
- Multi-backend export is free (Prometheus, Datadog, Grafana Cloud - just configure an exporter)
- TLS, auth, retry, batching all handled by the framework
- Composable - add the Ingero processor to your existing collector via [OCB](https://opentelemetry.io/docs/collector/custom-collector/)
## Key Properties
- **Stateless.** No database, no disk. Health scores and threshold live in memory. Restart rebuilds state from incoming pushes in ~10 seconds.
- **Fail-open.** If Fleet goes down, agents use their cached threshold, then fall back to local baselines. Straggler detection degrades gracefully, never blocks workloads.
- **Outbound-only.** Agents push to Fleet and poll from Fleet - all outbound connections from GPU nodes. Zero firewall changes for enterprise GPU clusters with restricted inbound access.
- **Tiny.** ~50MB RAM, negligible CPU for typical clusters.
## How It Works
### Health Score
Each agent computes a health score (0.0 - 1.0) from four signals:
| Signal | Weight | What it measures |
|--------|--------|-----------------|
| CUDA throughput | 0.40 | CUDA operations/sec relative to baseline |
| Compute efficiency | 0.25 | Kernel launch rate relative to baseline |
| Memory headroom | 0.20 | Available VRAM fraction |
| CPU availability | 0.15 | Inverse of scheduler contention |
The throughput signal is workload-agnostic - it works for both training (step throughput) and inference (request processing rate). Baselines adapt via exponential moving average.
All four signals are normalized to [0.0, 1.0] against the agent's rolling fast-window baseline, then combined as a weighted sum. A hard floor per signal catches "close to zero" conditions (deep stalls, OOM pressure) that a weighted average could otherwise hide. The agent classifies itself against its local baseline during warmup and switches to the Fleet-computed peer threshold once quorum is met.
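As a rough illustration of the weighted sum and hard floor, here is a minimal sketch. The weights come from the table above; the floor value (`signalFloor = 0.10`) and the rule that a signal below the floor caps the whole score are assumptions, since this README does not specify either. The `Signals` type and `HealthScore` function are hypothetical names.

```go
package main

import "fmt"

// Signals holds the four inputs, each already normalized to [0.0, 1.0].
type Signals struct {
	CUDAThroughput    float64
	ComputeEfficiency float64
	MemoryHeadroom    float64
	CPUAvailability   float64
}

// Weights from the signal table.
var weights = [4]float64{0.40, 0.25, 0.20, 0.15}

// signalFloor is a HYPOTHETICAL value; the real floor is not documented here.
const signalFloor = 0.10

// clamp01 keeps a signal inside [0.0, 1.0].
func clamp01(x float64) float64 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}

// HealthScore returns the weighted sum of the signals. If any signal is
// below signalFloor, the score is capped at that signal's value so a deep
// stall cannot hide behind three healthy signals (assumed floor behavior).
func HealthScore(s Signals) float64 {
	vals := [4]float64{
		clamp01(s.CUDAThroughput),
		clamp01(s.ComputeEfficiency),
		clamp01(s.MemoryHeadroom),
		clamp01(s.CPUAvailability),
	}
	score := 0.0
	for i, v := range vals {
		score += weights[i] * v
	}
	for _, v := range vals {
		if v < signalFloor && v < score {
			score = v
		}
	}
	return score
}

func main() {
	healthy := Signals{CUDAThroughput: 0.95, ComputeEfficiency: 0.9, MemoryHeadroom: 0.8, CPUAvailability: 0.9}
	oom := Signals{CUDAThroughput: 0.95, ComputeEfficiency: 0.9, MemoryHeadroom: 0.02, CPUAvailability: 0.9}
	fmt.Printf("healthy: %.3f\n", HealthScore(healthy))
	fmt.Printf("near-OOM: %.3f\n", HealthScore(oom))
}
```

Without the floor, the near-OOM node would still score above 0.7, because memory headroom carries only 0.20 of the weight.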
### Threshold
Fleet computes the straggler threshold using [Median Absolute Deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) (MAD):
```
threshold = median(scores) - k * MAD * 1.4826
```
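MAD resists outliers (50% breakdown point vs 0% for mean/stddev): as long as stragglers are a minority of nodes, a single straggler - or even several - cannot drag the threshold down with them. The `k` parameter (default 2.0) controls sensitivity; a larger `k` lowers the threshold and flags fewer nodes.
The threshold is delivered to agents in the OTLP/HTTP push response headers (`X-Ingero-Threshold`, `X-Ingero-Quorum-Met`), which avoids a separate polling round-trip. `GET /api/v1/threshold` on port 8080 remains available as a fallback.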
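The formula can be sketched in a few lines of Go. This is an illustration, not the processor's actual code: the function names are hypothetical, and the EMA convention used here (`alpha*raw + (1-alpha)*prev`) is the common textbook form, assumed rather than confirmed for Fleet.

```go
package main

import (
	"fmt"
	"sort"
)

// madScale is the 1.4826 constant from the formula above; it makes MAD
// comparable to a standard deviation for normally distributed data.
const madScale = 1.4826

// median returns the median of xs without modifying the caller's slice.
func median(xs []float64) float64 {
	n := len(xs)
	if n == 0 {
		return 0
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// madThreshold computes median(scores) - k * MAD * 1.4826.
func madThreshold(scores []float64, k float64) float64 {
	med := median(scores)
	dev := make([]float64, len(scores))
	for i, s := range scores {
		d := s - med
		if d < 0 {
			d = -d
		}
		dev[i] = d
	}
	return med - k*median(dev)*madScale
}

// emaSmooth blends a new raw threshold into the previous smoothed value.
func emaSmooth(prev, raw, alpha float64) float64 {
	return alpha*raw + (1-alpha)*prev
}

func main() {
	// The four scores from the overview diagram.
	scores := []float64{0.92, 0.91, 0.58, 0.93}
	raw := madThreshold(scores, 2.0)
	fmt.Printf("raw threshold: %.4f\n", raw)
	fmt.Printf("smoothed from 0.87: %.4f\n", emaSmooth(0.87, raw, 0.2))
}
```

For those four scores the median is 0.915 and the MAD is 0.01, so the raw threshold comes out near 0.885. The 0.58 node barely affects it, which is the point of using MAD. The published threshold also passes through EMA smoothing, so it need not equal the raw value on any given tick.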
### Straggler Classification
```
if my_score < threshold:
    I am a straggler
```
That's it. The agent emits a straggler event via OTLP and the existing remediation protocol (`--remediate` flag).
## High Availability
### Single replica (recommended for most clusters)
`replicaCount: 1` is the chart default. Vertical scale is the path to larger clusters: a single g4dn.xlarge-class node carries 100+ pushing agents at 5s intervals with p99 handler latency under 20 ms.
If a single Fleet pod dies, agents use their cached threshold (~5 min grace), then fall back to local baseline. Restarts repopulate within 1-2 push intervals.
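The fail-open behavior amounts to a tiered choice on the agent. The sketch below uses the tier names reported by `ingero_agent_detection_mode` (fleet/cached/local/none); the `AgentView` type, the `pickMode` function, and the exact ordering of checks are assumptions made for illustration, not the agent's real logic.

```go
package main

import (
	"fmt"
	"time"
)

// Mode is the detection tier an agent is currently using.
type Mode string

const (
	ModeFleet  Mode = "fleet"
	ModeCached Mode = "cached"
	ModeLocal  Mode = "local"
	ModeNone   Mode = "none"
)

// cacheGrace approximates the "~5 min grace" described above.
const cacheGrace = 5 * time.Minute

// AgentView is a HYPOTHETICAL snapshot of what the agent knows.
type AgentView struct {
	FleetReachable bool          // last push or poll to Fleet succeeded
	QuorumMet      bool          // Fleet reported quorum_met = true
	HaveCached     bool          // a previously received threshold exists
	CacheAge       time.Duration // time since that threshold was received
	HaveBaseline   bool          // local baseline has finished warming up
}

// pickMode walks the tiers from most to least preferred.
func pickMode(v AgentView) Mode {
	switch {
	case v.FleetReachable && v.QuorumMet:
		return ModeFleet
	case v.HaveCached && v.CacheAge <= cacheGrace:
		return ModeCached
	case v.HaveBaseline:
		return ModeLocal
	default:
		return ModeNone
	}
}

func main() {
	// Fleet went down two minutes ago: still inside the grace window.
	fmt.Println(pickMode(AgentView{HaveCached: true, CacheAge: 2 * time.Minute, HaveBaseline: true}))
	// Ten minutes later the cache has expired and the local baseline takes over.
	fmt.Println(pickMode(AgentView{HaveCached: true, CacheAge: 12 * time.Minute, HaveBaseline: true}))
}
```

Because every branch returns a usable mode, a Fleet outage changes only how the threshold is sourced; the workload itself is never blocked.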
### Multi-replica HA (when you need it)
Each Fleet replica maintains its own in-memory score map. An agent push reaches ONE replica (selected by DNS or the service mesh); that replica's map is the only one that sees the score. Each replica computes its own threshold from its subset of agents.
For multi-replica deployments, put an L7 load balancer with **consistent-hash on the `cluster_id` query parameter** (Envoy / nginx / service mesh) in front of Fleet. Every agent from one cluster lands on the same replica, eliminating cross-replica drift.
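To see why hashing on the cluster ID keeps a cluster pinned to one replica, consider the sketch below. It uses rendezvous (highest-random-weight) hashing as a stand-in: real load balancers such as Envoy or nginx implement their own consistent-hash schemes, and `pickReplica` and the replica names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickReplica returns the replica with the highest hash(clusterID, replica)
// score. Every request carrying the same clusterID picks the same replica,
// and removing some other replica does not change the choice.
func pickReplica(clusterID string, replicas []string) string {
	var best string
	var bestScore uint64
	for i, r := range replicas {
		h := fnv.New64a()
		h.Write([]byte(clusterID))
		h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
		h.Write([]byte(r))
		if s := h.Sum64(); i == 0 || s > bestScore {
			best, bestScore = r, s
		}
	}
	return best
}

func main() {
	replicas := []string{"fleet-0", "fleet-1", "fleet-2"}
	for _, c := range []string{"cluster-prod", "cluster-staging", "cluster-dev"} {
		fmt.Printf("%s -> %s\n", c, pickReplica(c, replicas))
	}
}
```

With all agents of a cluster landing on one replica, that replica's score map holds the full cluster and its threshold matches what a single-replica deployment would compute.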
Size `statistical_min` for the per-replica visible node count, not the cluster-wide count. Alert on `sum_over_replicas(ingero_fleet_active_nodes) < expected_total_nodes` for replica starvation.
Larger-cluster topologies (gateway-based shared state) are out of scope for this release. Talk to us if you're approaching the per-replica vertical-scale ceiling.
See `docs/ARCHITECTURE.md` for the full behavior model and rationale.
## Fleet Configuration
```yaml
# fleet-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  ingero:
    threshold:
      k: 2.0                    # MAD sensitivity (default: 2.0)
      ema_alpha: 0.2            # Threshold smoothing (default: 0.2)
    quorum:
      statistical_min: 5        # Min active nodes for valid threshold
      coverage_fraction: 0.80   # Coverage alert threshold
    push_interval: 10s          # Expected agent push interval
    ttl_multiplier: 5           # Node expiry = push_interval * ttl_multiplier

extensions:
  ingero_threshold:
    agent_endpoint: 0.0.0.0:8080    # Agent threshold poll (fallback)
    admin_endpoint: 127.0.0.1:8081  # Diagnostics (management plane only)

exporters:
  prometheus:
    endpoint: 0.0.0.0:9090

service:
  extensions: [ingero_threshold]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [ingero]
      exporters: [prometheus]
```
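The quorum and expiry settings interact as sketched below. With the values above, a node expires after `10s * 5 = 50s` without a push, and a threshold is only considered valid once at least 5 nodes are active. The `Config` type and its methods are hypothetical, and the `expected` node count used for the coverage check is an assumption: this README does not say where Fleet gets that number.

```go
package main

import (
	"fmt"
	"time"
)

// Config holds the processor settings relevant to expiry and quorum.
type Config struct {
	PushInterval     time.Duration
	TTLMultiplier    int
	StatisticalMin   int
	CoverageFraction float64
}

// defaultConfig returns the values from the example fleet-config.yaml.
func defaultConfig() Config {
	return Config{PushInterval: 10 * time.Second, TTLMultiplier: 5, StatisticalMin: 5, CoverageFraction: 0.80}
}

// nodeTTL is push_interval * ttl_multiplier, per the config comment.
func (c Config) nodeTTL() time.Duration {
	return c.PushInterval * time.Duration(c.TTLMultiplier)
}

// activeNodes counts nodes whose last push falls within the TTL.
func (c Config) activeNodes(lastPush map[string]time.Time, now time.Time) int {
	n := 0
	for _, t := range lastPush {
		if now.Sub(t) <= c.nodeTTL() {
			n++
		}
	}
	return n
}

// quorumMet reports whether enough nodes are active for a valid threshold.
func (c Config) quorumMet(active int) bool {
	return active >= c.StatisticalMin
}

// coverageLow reports whether active/expected is below coverage_fraction.
// expected is an ASSUMED input (for example, an operator-supplied node count).
func (c Config) coverageLow(active, expected int) bool {
	if expected <= 0 {
		return false
	}
	return float64(active)/float64(expected) < c.CoverageFraction
}

func main() {
	c := defaultConfig()
	now := time.Now()
	lastPush := map[string]time.Time{
		"gpu-node-01": now.Add(-5 * time.Second),
		"gpu-node-02": now.Add(-12 * time.Second),
		"gpu-node-03": now.Add(-90 * time.Second), // past the 50s TTL
	}
	active := c.activeNodes(lastPush, now)
	fmt.Println("ttl:", c.nodeTTL(), "active:", active, "quorum:", c.quorumMet(active))
}
```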
## Observability
Fleet emits its own metrics:
| Metric | Type | Description |
|--------|------|-------------|
| `ingero_fleet_threshold` | Gauge | Current straggler threshold per cluster |
| `ingero_fleet_active_nodes` | Gauge | Nodes actively reporting |
| `ingero_fleet_idle_nodes` | Gauge | Nodes in idle state |
| `ingero_fleet_coverage_low` | Gauge | 1 if coverage quorum not met |
| `ingero_fleet_panic_mode` | Gauge | 1 if panic mode active |
| `ingero_fleet_median` | Gauge | Fleet health score median |
| `ingero_fleet_mad` | Gauge | Fleet MAD value |
Agent-side metrics:
| Metric | Type | Description |
|--------|------|-------------|
| `ingero_agent_health_score` | Gauge | This node's health score |
| `ingero_agent_detection_mode` | Gauge | Current detection tier (fleet/cached/local/none) |
| `ingero_agent_fleet_reachable` | Gauge | 1 if Fleet is reachable |
## Documentation
- [Architecture](docs/ARCHITECTURE.md) - components, data flow, threshold computation
- [Deployment Guide](docs/deployment.md) - K8s, Slurm, bare metal
- [Configuration Reference](docs/configuration.md) - all config parameters
- [API Reference](docs/api.md) - threshold endpoints
- [End-to-end walkthrough on Lambda Cloud](examples/lambda-e2e/) - A100 + GH200 (arm64) reference deploy
## Requirements
- Ingero agent v0.10+ on each GPU node
- Go 1.22+ (for building from source)
- Kubernetes 1.24+ (for Helm deployment)
## License
Apache License 2.0. See [LICENSE](LICENSE).