https://github.com/ingero-io/ingero-fleet

GPU cluster straggler detection - custom OTEL Collector distribution
https://github.com/ingero-io/ingero-fleet
anomaly-detection distributed-training gpu gpu-observability kubernetes llm-inference machine-learning observability opentelemetry opentelemetry-collector otlp sre straggler-detection
Last synced: about 2 months ago
JSON representation
GPU cluster straggler detection - custom OTEL Collector distribution
Host: GitHub
URL: https://github.com/ingero-io/ingero-fleet
Owner: ingero-io
Created: 2026-04-14T15:43:46.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-05-02T19:08:52.000Z (about 2 months ago)
Last Synced: 2026-05-02T20:32:24.803Z (about 2 months ago)
Topics: anomaly-detection, distributed-training, gpu, gpu-observability, kubernetes, llm-inference, machine-learning, observability, opentelemetry, opentelemetry-collector, otlp, sre, straggler-detection
Language: Go
Homepage:
Size: 8.03 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
Awesome Lists containing this project

README

          # Ingero Fleet - GPU Cluster Straggler Detection

[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)

**Version: 0.12.7**

Fleet is a lightweight central service that answers one question across your entire GPU cluster: **"Which node is the straggler?"**

Each [Ingero](https://github.com/ingero-io/ingero) agent computes a health score from its GPU workload, pushes it to Fleet via OTLP, and Fleet computes a fleet-wide threshold using outlier-resistant statistics. Agents poll the threshold and self-classify. No agents need inbound network access - everything is outbound push and pull.

## Quick Start

> **Looking for a worked end-to-end example?** Three multi-node

> quickstart guides take you from zero to a detected straggler on

> three GPU hosts in about 20 minutes. Pick the deployment style that

> matches your environment:

>

> - [Kubernetes (Helm)](docs/quickstart-k8s.md)

> - [Bare-metal binary](docs/quickstart-binary.md)

> - [Docker](docs/quickstart-docker.md)

>

> See [`docs/quickstart.md`](docs/quickstart.md) for a one-page

> comparison if you are not sure which to pick.

### Option A: Add to your existing OTEL Collector

Add the Ingero Go modules to your [OCB](https://opentelemetry.io/docs/collector/custom-collector/) manifest and rebuild:

```yaml

# builder-config.yaml

processors:

  # ingero-version:builder-gomod-processor product=ingero-fleet channel=stable

  - gomod: github.com/ingero-io/ingero-fleet/processor v0.12.7

extensions:

  # ingero-version:builder-gomod-extension product=ingero-fleet channel=stable

  - gomod: github.com/ingero-io/ingero-fleet/extension v0.12.7

```

```bash

ocb --config builder-config.yaml

```

### Option B: Use the pre-built Fleet distribution

```bash

# Docker

docker run -p 4317:4317 -p 8080:8080 ghcr.io/ingero-io/ingero-fleet:latest

# Binary

# ingero-version:install-curl-version product=ingero-fleet channel=stable

VERSION=0.12.7

curl -fsSL "https://github.com/ingero-io/ingero-fleet/releases/download/v${VERSION}/ingero-fleet_${VERSION}_linux_amd64.tar.gz" | tar xz

./ingero-fleet --config fleet-config.yaml

```

### Option C: Kubernetes (Helm)

```bash

helm install ingero-fleet ./helm/ingero-fleet

```

(Chart default is `replicaCount: 1`; see High Availability section below for multi-replica guidance.)

### Configure the agent

Point your Ingero agent at Fleet:

```yaml

# ingero.yaml (on each GPU node)

fleet:

  endpoint: https://fleet.example.com:4317

```

## Architecture



The detailed Fleet-Agent component view is below.

### Fleet-Agent Overview

```mermaid

graph TB

    subgraph GPU Cluster

        A1[Ingero Agent
gpu-node-01
score: 0.92]

        A2[Ingero Agent
gpu-node-02
score: 0.91]

        A3[Ingero Agent
gpu-node-03
score: 0.58]

        A4[Ingero Agent
gpu-node-04
score: 0.93]

    end

    subgraph Fleet Service

        R[OTLP Receiver
gRPC :4317 / HTTP :4318]

        P[Ingero Processor
Score Map + MAD + EMA]

        E[Ingero Extension
Threshold API :8080
Middleware Piggyback]

        EX[Prometheus Exporter]

    end

    subgraph Observability

        PR[Prometheus]

        GR[Grafana]

    end

    A1 -->|OTLP push| R

    A2 -->|OTLP push| R

    A3 -->|OTLP push| R

    A4 -->|OTLP push| R

    R --> P

    P -->|Set threshold| E

    P --> EX

    EX --> PR

    PR --> GR

    E -.->|threshold in
push response| A1

    E -.->|threshold in
push response| A2

    E -.->|threshold in
push response| A3

    E -.->|threshold in
push response| A4

    style A3 fill:#f66,stroke:#333,color:#fff

```

*Node 03 (score 0.58) is below the fleet threshold (0.87) - detected as straggler.*

### Detailed Communication Flow

```mermaid

sequenceDiagram

    participant A as Ingero Agent
(GPU Node)

    participant R as OTLP Receiver
:4317 gRPC / :4318 HTTP

    participant M as Ingero Extension
(Middleware)

    participant P as Ingero Processor

    participant S as ThresholdStore
(in-memory)

    participant T as Timer
(every 10s)

    participant API as Threshold API
:8080

    Note over A: Computes health score
from 4 GPU signals

    A->>R: OTLP push (HTTP :4318)
metric: ingero.node.health_score = 0.92
attrs: node.id, cluster.id, state
header: ingero.cluster.id = cluster-prod

    R->>M: HTTP request passes through middleware

    M->>R: Injects response headers:
X-Ingero-Threshold: 0.87
X-Ingero-Quorum-Met: true

    R->>P: ConsumeMetrics(OTLP payload)

    P->>P: Extract health_score from payload
Write to score map[cluster:node]

    R-->>A: HTTP 200 + threshold headers

    Note over A: Reads X-Ingero-Threshold
0.92 > 0.87 = healthy

    T->>P: Timer tick (every push_interval)

    P->>P: Read score map (RLock)
Compute MAD per cluster
Apply EMA smoothing
Check quorum, panic mode

    P->>S: Set(cluster_id, ThresholdResult)

    Note over M: Next push reads
updated threshold from store

    A->>API: GET /api/v1/threshold?cluster_id=cluster-prod
(fallback if no piggyback)

    API->>S: Get(cluster_id)

    S-->>API: ThresholdResult

    API-->>A: {"threshold": 0.87, "quorum_met": true}

```

### Ports and Protocols

| Port | Protocol | Component | Direction | Purpose |

|------|----------|-----------|-----------|---------|

| 4317 | gRPC | OTLP Receiver | Agent -> Fleet | Health score push (binary protobuf) |

| 4318 | HTTP | OTLP Receiver | Agent -> Fleet | Health score push (JSON). Threshold returned in response headers. |

| 8080 | HTTP | Ingero Extension | Agent -> Fleet | `GET /api/v1/threshold` fallback endpoint |

| 8081 | HTTP | Ingero Extension | Admin -> Fleet | Diagnostics endpoint (loopback only; future release) |

| 55679 | HTTP | zPages | Internal | Health/readiness probes |

| 8888 | HTTP | OTEL Telemetry | Prometheus -> Fleet | Fleet self-monitoring metrics |

### Data Sent per Push

**Agent -> Fleet (OTLP metric payload):**

```

Resource attributes:

  ingero.node.id:      "gpu-node-01"

  ingero.cluster.id:   "cluster-prod"

Gauge: ingero.node.health_score = 0.92

  ingero.node.state:      "active"

  ingero.workload_type:   "training"

HTTP header:

  ingero.cluster.id: cluster-prod    (for middleware routing)

```

**Fleet -> Agent (push response headers):**

```

X-Ingero-Threshold:  0.870348

X-Ingero-Quorum-Met: true

```

**Fleet -> Agent (GET fallback response):**

```json

{"threshold":0.870348,"quorum_met":true}

```

Fleet is built as a custom [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) distribution. Two custom components, everything else is standard OTEL:

| Component | Type | What it does |

|-----------|------|-------------|

| Ingero Processor | OTEL processor | Accumulates health scores, computes percentile threshold with EMA smoothing |

| Ingero Extension | OTEL extension | Serves threshold API for agent polling and diagnostics |

| Everything else | Standard OTEL | OTLP receiver, exporters, TLS, auth, batching - zero custom code |

## Why OTEL Collector

- Enterprises already deploy OTEL Collectors - familiar operational model

- Multi-backend export is free (Prometheus, Datadog, Grafana Cloud - just configure an exporter)

- TLS, auth, retry, batching all handled by the framework

- Composable - add the Ingero processor to your existing collector via [OCB](https://opentelemetry.io/docs/collector/custom-collector/)

## Key Properties

- **Stateless.** No database, no disk. Health scores and threshold live in memory. Restart rebuilds state from incoming pushes in ~10 seconds.

- **Fail-open.** If Fleet goes down, agents use their cached threshold, then fall back to local baselines. Straggler detection degrades gracefully, never blocks workloads.

- **Outbound-only.** Agents push to Fleet and poll from Fleet - all outbound connections from GPU nodes. Zero firewall changes for enterprise GPU clusters with restricted inbound access.

- **Tiny.** ~50MB RAM, negligible CPU for typical clusters.

## How It Works

### Health Score

Each agent computes a health score (0.0 - 1.0) from four signals:

| Signal | Weight | What it measures |

|--------|--------|-----------------|

| CUDA throughput | 0.40 | CUDA operations/sec relative to baseline |

| Compute efficiency | 0.25 | Kernel launch rate relative to baseline |

| Memory headroom | 0.20 | Available VRAM fraction |

| CPU availability | 0.15 | Inverse of scheduler contention |

The throughput signal is workload-agnostic - it works for both training (step throughput) and inference (request processing rate). Baselines adapt via exponential moving average.

All four signals are normalized to [0.0, 1.0] against the agent's rolling fast-window baseline, then combined as a weighted sum. A hard floor per signal catches "close to zero" conditions (deep stalls, OOM pressure) that a weighted average could otherwise hide. The agent classifies itself against its local baseline during warmup and switches to the Fleet-computed peer threshold once quorum is met.

### Threshold

Fleet computes the straggler threshold using [Median Absolute Deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) (MAD):

```

threshold = median(scores) - k * MAD * 1.4826

```

MAD resists outliers (50% breakdown point vs 0% for mean/stddev). A single straggler - or even several - cannot shift the threshold. The `k` parameter (default 2.0) controls sensitivity.

The threshold is delivered to agents via the OTLP push response headers, eliminating a separate polling round-trip.

### Straggler Classification

```

if my_score < threshold:

    I am a straggler

```

That's it. The agent emits a straggler event via OTLP and the existing remediation protocol (`--remediate` flag).

## High Availability

### Single replica (recommended for most clusters)

`replicaCount: 1` is the chart default. Vertical scale is the path to larger clusters: a single g4dn.xlarge-class node carries 100+ pushing agents at 5s intervals with p99 handler latency under 20 ms.

If a single Fleet pod dies, agents use their cached threshold (~5 min grace), then fall back to local baseline. Restarts repopulate within 1-2 push intervals.

### Multi-replica HA (when you need it)

Each Fleet replica maintains its own in-memory score map. An agent push reaches ONE replica (selected by DNS or the service mesh); that replica's map is the only one that sees the score. Each replica computes its own threshold from its subset of agents.

For multi-replica deployments, put an L7 load balancer with **consistent-hash on the `cluster_id` query parameter** (Envoy / nginx / service mesh) in front of Fleet. Every agent from one cluster lands on the same replica, eliminating cross-replica drift.

Size `statistical_min` for the per-replica visible node count, not the cluster-wide count. Alert on `sum_over_replicas(ingero_fleet_active_nodes) < expected_total_nodes` for replica starvation.

Larger-cluster topologies (gateway-based shared state) are out of scope for this release. Talk to us if you're approaching the per-replica vertical-scale ceiling.

See `docs/ARCHITECTURE.md` for the full behavior model and rationale.

## Fleet Configuration

```yaml

# fleet-config.yaml

receivers:

  otlp:

    protocols:

      grpc:

        endpoint: 0.0.0.0:4317

      http:

        endpoint: 0.0.0.0:4318

processors:

  ingero:

    threshold:

      k: 2.0                    # MAD sensitivity (default: 2.0)

      ema_alpha: 0.2             # Threshold smoothing (default: 0.2)

    quorum:

      statistical_min: 5         # Min active nodes for valid threshold

      coverage_fraction: 0.80    # Coverage alert threshold

    push_interval: 10s           # Expected agent push interval

    ttl_multiplier: 5            # Node expiry = push_interval * ttl_multiplier

extensions:

  ingero_threshold:

    agent_endpoint: 0.0.0.0:8080       # Agent threshold poll (fallback)

    admin_endpoint: 127.0.0.1:8081     # Diagnostics (management plane only)

exporters:

  prometheus:

    endpoint: 0.0.0.0:9090

service:

  extensions: [ingero_threshold]

  pipelines:

    metrics:

      receivers: [otlp]

      processors: [ingero]

      exporters: [prometheus]

```

## Observability

Fleet emits its own metrics:

| Metric | Type | Description |

|--------|------|-------------|

| `ingero_fleet_threshold` | Gauge | Current straggler threshold per cluster |

| `ingero_fleet_active_nodes` | Gauge | Nodes actively reporting |

| `ingero_fleet_idle_nodes` | Gauge | Nodes in idle state |

| `ingero_fleet_coverage_low` | Gauge | 1 if coverage quorum not met |

| `ingero_fleet_panic_mode` | Gauge | 1 if panic mode active |

| `ingero_fleet_median` | Gauge | Fleet health score median |

| `ingero_fleet_mad` | Gauge | Fleet MAD value |

Agent-side metrics:

| Metric | Type | Description |

|--------|------|-------------|

| `ingero_agent_health_score` | Gauge | This node's health score |

| `ingero_agent_detection_mode` | Gauge | Current detection tier (fleet/cached/local/none) |

| `ingero_agent_fleet_reachable` | Gauge | 1 if Fleet is reachable |

## Documentation

- [Architecture](docs/ARCHITECTURE.md) - components, data flow, threshold computation

- [Deployment Guide](docs/deployment.md) - K8s, Slurm, bare metal

- [Configuration Reference](docs/configuration.md) - all config parameters

- [API Reference](docs/api.md) - threshold endpoints

- [End-to-end walkthrough on Lambda Cloud](examples/lambda-e2e/) - A100 + GH200 (arm64) reference deploy

## Requirements

- Ingero agent v0.10+ on each GPU node

- Go 1.22+ (for building from source)

- Kubernetes 1.24+ (for Helm deployment)

## License

Apache License 2.0. See [LICENSE](LICENSE).
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/ingero-io/ingero-fleet

Awesome Lists containing this project

README