{"id":49548799,"url":"https://github.com/ingero-io/ingero-fleet","last_synced_at":"2026-05-02T21:00:57.870Z","repository":{"id":353822785,"uuid":"1210625155","full_name":"ingero-io/ingero-fleet","owner":"ingero-io","description":"GPU cluster straggler detection - custom OTEL Collector distribution","archived":false,"fork":false,"pushed_at":"2026-05-02T19:08:52.000Z","size":8425,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-02T20:32:24.803Z","etag":null,"topics":["anomaly-detection","distributed-training","gpu","gpu-observability","kubernetes","llm-inference","machine-learning","observability","opentelemetry","opentelemetry-collector","otlp","sre","straggler-detection"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ingero-io.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-14T15:43:46.000Z","updated_at":"2026-05-02T19:08:55.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ingero-io/ingero-fleet","commit_stats":null,"previous_names":["ingero-io/ingero-fleet"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/ingero-io/ingero-fleet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ingero-io%2Fingero-fleet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ingero-io%2Fin
gero-fleet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ingero-io%2Fingero-fleet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ingero-io%2Fingero-fleet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ingero-io","download_url":"https://codeload.github.com/ingero-io/ingero-fleet/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ingero-io%2Fingero-fleet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32549387,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-02T19:18:06.202Z","status":"ssl_error","status_checked_at":"2026-05-02T19:16:21.335Z","response_time":132,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anomaly-detection","distributed-training","gpu","gpu-observability","kubernetes","llm-inference","machine-learning","observability","opentelemetry","opentelemetry-collector","otlp","sre","straggler-detection"],"created_at":"2026-05-02T21:00:20.962Z","updated_at":"2026-05-02T21:00:57.842Z","avatar_url":"https://github.com/ingero-io.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Ingero Fleet - GPU Cluster Straggler 
Detection\n\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)\n\n\u003c!-- ingero-version:install-header product=ingero-fleet channel=stable --\u003e\n**Version: 0.12.7**\n\nFleet is a lightweight central service that answers one question across your entire GPU cluster: **\"Which node is the straggler?\"**\n\nEach [Ingero](https://github.com/ingero-io/ingero) agent computes a health score from its GPU workload, pushes it to Fleet via OTLP, and Fleet computes a fleet-wide threshold using outlier-resistant statistics. Agents poll the threshold and self-classify. No agents need inbound network access - everything is outbound push and pull.\n\n## Quick Start\n\n\u003e **Looking for a worked end-to-end example?** Three multi-node\n\u003e quickstart guides take you from zero to a detected straggler on\n\u003e three GPU hosts in about 20 minutes. Pick the deployment style that\n\u003e matches your environment:\n\u003e\n\u003e - [Kubernetes (Helm)](docs/quickstart-k8s.md)\n\u003e - [Bare-metal binary](docs/quickstart-binary.md)\n\u003e - [Docker](docs/quickstart-docker.md)\n\u003e\n\u003e See [`docs/quickstart.md`](docs/quickstart.md) for a one-page\n\u003e comparison if you are not sure which to pick.\n\n### Option A: Add to your existing OTEL Collector\n\nAdd the Ingero Go modules to your [OCB](https://opentelemetry.io/docs/collector/custom-collector/) manifest and rebuild:\n\n```yaml\n# builder-config.yaml\nprocessors:\n  # ingero-version:builder-gomod-processor product=ingero-fleet channel=stable\n  - gomod: github.com/ingero-io/ingero-fleet/processor v0.12.7\n\nextensions:\n  # ingero-version:builder-gomod-extension product=ingero-fleet channel=stable\n  - gomod: github.com/ingero-io/ingero-fleet/extension v0.12.7\n```\n\n```bash\nocb --config builder-config.yaml\n```\n\n### Option B: Use the pre-built Fleet distribution\n\n```bash\n# Docker\ndocker run -p 4317:4317 -p 8080:8080 ghcr.io/ingero-io/ingero-fleet:latest\n\n# Binary\n# 
ingero-version:install-curl-version product=ingero-fleet channel=stable\nVERSION=0.12.7\ncurl -fsSL \"https://github.com/ingero-io/ingero-fleet/releases/download/v${VERSION}/ingero-fleet_${VERSION}_linux_amd64.tar.gz\" | tar xz\n./ingero-fleet --config fleet-config.yaml\n```\n\n### Option C: Kubernetes (Helm)\n\n```bash\nhelm install ingero-fleet ./helm/ingero-fleet\n```\n\n(Chart default is `replicaCount: 1`; see High Availability section below for multi-replica guidance.)\n\n### Configure the agent\n\nPoint your Ingero agent at Fleet:\n\n```yaml\n# ingero.yaml (on each GPU node)\nfleet:\n  endpoint: https://fleet.example.com:4317\n```\n\n## Architecture\n\n\u003cimg src=\"docs/assets/architecture.svg\" width=\"800\" alt=\"Ingero architecture: per-node agent emits OTLP to Fleet collector; Fleet aggregates via ingeroprocessor + providerprocessor; backends are Prometheus, Grafana, MCP clients, and UDS sinks\"\u003e\n\nThe detailed Fleet-Agent component view is below.\n\n### Fleet-Agent Overview\n\n```mermaid\ngraph TB\n    subgraph GPU Cluster\n        A1[Ingero Agent\u003cbr/\u003egpu-node-01\u003cbr/\u003escore: 0.92]\n        A2[Ingero Agent\u003cbr/\u003egpu-node-02\u003cbr/\u003escore: 0.91]\n        A3[Ingero Agent\u003cbr/\u003egpu-node-03\u003cbr/\u003escore: 0.58]\n        A4[Ingero Agent\u003cbr/\u003egpu-node-04\u003cbr/\u003escore: 0.93]\n    end\n\n    subgraph Fleet Service\n        R[OTLP Receiver\u003cbr/\u003egRPC :4317 / HTTP :4318]\n        P[Ingero Processor\u003cbr/\u003eScore Map + MAD + EMA]\n        E[Ingero Extension\u003cbr/\u003eThreshold API :8080\u003cbr/\u003eMiddleware Piggyback]\n        EX[Prometheus Exporter]\n    end\n\n    subgraph Observability\n        PR[Prometheus]\n        GR[Grafana]\n    end\n\n    A1 --\u003e|OTLP push| R\n    A2 --\u003e|OTLP push| R\n    A3 --\u003e|OTLP push| R\n    A4 --\u003e|OTLP push| R\n\n    R --\u003e P\n    P --\u003e|Set threshold| E\n    P --\u003e EX\n\n    EX --\u003e PR\n    PR --\u003e 
GR\n\n    E -.-\u003e|threshold in\u003cbr/\u003epush response| A1\n    E -.-\u003e|threshold in\u003cbr/\u003epush response| A2\n    E -.-\u003e|threshold in\u003cbr/\u003epush response| A3\n    E -.-\u003e|threshold in\u003cbr/\u003epush response| A4\n\n    style A3 fill:#f66,stroke:#333,color:#fff\n```\n\n*Node 03 (score 0.58) is below the fleet threshold (0.87) - detected as straggler.*\n\n### Detailed Communication Flow\n\n```mermaid\nsequenceDiagram\n    participant A as Ingero Agent\u003cbr/\u003e(GPU Node)\n    participant R as OTLP Receiver\u003cbr/\u003e:4317 gRPC / :4318 HTTP\n    participant M as Ingero Extension\u003cbr/\u003e(Middleware)\n    participant P as Ingero Processor\n    participant S as ThresholdStore\u003cbr/\u003e(in-memory)\n    participant T as Timer\u003cbr/\u003e(every 10s)\n    participant API as Threshold API\u003cbr/\u003e:8080\n\n    Note over A: Computes health score\u003cbr/\u003efrom 4 GPU signals\n\n    A-\u003e\u003eR: OTLP push (HTTP :4318)\u003cbr/\u003emetric: ingero.node.health_score = 0.92\u003cbr/\u003eattrs: node.id, cluster.id, state\u003cbr/\u003eheader: ingero.cluster.id = cluster-prod\n\n    R-\u003e\u003eM: HTTP request passes through middleware\n    M-\u003e\u003eR: Injects response headers:\u003cbr/\u003eX-Ingero-Threshold: 0.87\u003cbr/\u003eX-Ingero-Quorum-Met: true\n\n    R-\u003e\u003eP: ConsumeMetrics(OTLP payload)\n    P-\u003e\u003eP: Extract health_score from payload\u003cbr/\u003eWrite to score map[cluster:node]\n\n    R--\u003e\u003eA: HTTP 200 + threshold headers\n\n    Note over A: Reads X-Ingero-Threshold\u003cbr/\u003e0.92 \u003e 0.87 = healthy\n\n    T-\u003e\u003eP: Timer tick (every push_interval)\n    P-\u003e\u003eP: Read score map (RLock)\u003cbr/\u003eCompute MAD per cluster\u003cbr/\u003eApply EMA smoothing\u003cbr/\u003eCheck quorum, panic mode\n    P-\u003e\u003eS: Set(cluster_id, ThresholdResult)\n\n    Note over M: Next push reads\u003cbr/\u003eupdated threshold from store\n\n    
A-\u003e\u003eAPI: GET /api/v1/threshold?cluster_id=cluster-prod\u003cbr/\u003e(fallback if no piggyback)\n    API-\u003e\u003eS: Get(cluster_id)\n    S--\u003e\u003eAPI: ThresholdResult\n    API--\u003e\u003eA: {\"threshold\": 0.87, \"quorum_met\": true}\n```\n\n### Ports and Protocols\n\n| Port | Protocol | Component | Direction | Purpose |\n|------|----------|-----------|-----------|---------|\n| 4317 | gRPC | OTLP Receiver | Agent -\u003e Fleet | Health score push (binary protobuf) |\n| 4318 | HTTP | OTLP Receiver | Agent -\u003e Fleet | Health score push (JSON). Threshold returned in response headers. |\n| 8080 | HTTP | Ingero Extension | Agent -\u003e Fleet | `GET /api/v1/threshold` fallback endpoint |\n| 8081 | HTTP | Ingero Extension | Admin -\u003e Fleet | Diagnostics endpoint (loopback only; future release) |\n| 55679 | HTTP | zPages | Internal | Health/readiness probes |\n| 8888 | HTTP | OTEL Telemetry | Prometheus -\u003e Fleet | Fleet self-monitoring metrics |\n\n### Data Sent per Push\n\n**Agent -\u003e Fleet (OTLP metric payload):**\n```\nResource attributes:\n  ingero.node.id:      \"gpu-node-01\"\n  ingero.cluster.id:   \"cluster-prod\"\n\nGauge: ingero.node.health_score = 0.92\n  ingero.node.state:      \"active\"\n  ingero.workload_type:   \"training\"\n\nHTTP header:\n  ingero.cluster.id: cluster-prod    (for middleware routing)\n```\n\n**Fleet -\u003e Agent (push response headers):**\n```\nX-Ingero-Threshold:  0.870348\nX-Ingero-Quorum-Met: true\n```\n\n**Fleet -\u003e Agent (GET fallback response):**\n```json\n{\"threshold\":0.870348,\"quorum_met\":true}\n```\n\nFleet is built as a custom [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) distribution. 
It adds two custom components; everything else is standard OTEL:\n\n| Component | Type | What it does |\n|-----------|------|-------------|\n| Ingero Processor | OTEL processor | Accumulates health scores, computes the MAD-based straggler threshold with EMA smoothing |\n| Ingero Extension | OTEL extension | Serves threshold API for agent polling and diagnostics |\n| Everything else | Standard OTEL | OTLP receiver, exporters, TLS, auth, batching - zero custom code |\n\n## Why OTEL Collector\n\n- Enterprises already deploy OTEL Collectors - familiar operational model\n- Multi-backend export is free (Prometheus, Datadog, Grafana Cloud - just configure an exporter)\n- TLS, auth, retry, batching all handled by the framework\n- Composable - add the Ingero processor to your existing collector via [OCB](https://opentelemetry.io/docs/collector/custom-collector/)\n\n## Key Properties\n\n- **Stateless.** No database, no disk. Health scores and threshold live in memory. Restart rebuilds state from incoming pushes in ~10 seconds.\n- **Fail-open.** If Fleet goes down, agents use their cached threshold, then fall back to local baselines. Straggler detection degrades gracefully, never blocks workloads.\n- **Outbound-only.** Agents push to Fleet and poll from Fleet - all outbound connections from GPU nodes. 
Zero firewall changes for enterprise GPU clusters with restricted inbound access.\n- **Tiny.** ~50MB RAM, negligible CPU for typical clusters.\n\n## How It Works\n\n### Health Score\n\nEach agent computes a health score (0.0 - 1.0) from four signals:\n\n| Signal | Weight | What it measures |\n|--------|--------|-----------------|\n| CUDA throughput | 0.40 | CUDA operations/sec relative to baseline |\n| Compute efficiency | 0.25 | Kernel launch rate relative to baseline |\n| Memory headroom | 0.20 | Available VRAM fraction |\n| CPU availability | 0.15 | Inverse of scheduler contention |\n\nThe throughput signal is workload-agnostic - it works for both training (step throughput) and inference (request processing rate). Baselines adapt via exponential moving average.\n\nAll four signals are normalized to [0.0, 1.0] against the agent's rolling fast-window baseline, then combined as a weighted sum. A hard floor per signal catches \"close to zero\" conditions (deep stalls, OOM pressure) that a weighted average could otherwise hide. The agent classifies itself against its local baseline during warmup and switches to the Fleet-computed peer threshold once quorum is met.\n\n### Threshold\n\nFleet computes the straggler threshold using [Median Absolute Deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) (MAD):\n\n```\nthreshold = median(scores) - k * MAD * 1.4826\n```\n\nMAD resists outliers (50% breakdown point vs 0% for mean/stddev). A single straggler - or even several - cannot shift the threshold. The `k` parameter (default 2.0) controls sensitivity.\n\nThe threshold is delivered to agents via the OTLP push response headers, eliminating a separate polling round-trip.\n\n### Straggler Classification\n\n```\nif my_score \u003c threshold:\n    I am a straggler\n```\n\nThat's it. 
The agent emits a straggler event via OTLP and the existing remediation protocol (`--remediate` flag).\n\n## High Availability\n\n### Single replica (recommended for most clusters)\n\n`replicaCount: 1` is the chart default. Vertical scale is the path to larger clusters: a single g4dn.xlarge-class node carries 100+ pushing agents at 5s intervals with p99 handler latency under 20 ms.\n\nIf a single Fleet pod dies, agents use their cached threshold (~5 min grace), then fall back to local baseline. Restarts repopulate within 1-2 push intervals.\n\n### Multi-replica HA (when you need it)\n\nEach Fleet replica maintains its own in-memory score map. An agent push reaches ONE replica (selected by DNS or the service mesh); that replica's map is the only one that sees the score. Each replica computes its own threshold from its subset of agents.\n\nFor multi-replica deployments, put an L7 load balancer with **consistent-hash on the `cluster_id` query parameter** (Envoy / nginx / service mesh) in front of Fleet. Every agent from one cluster lands on the same replica, eliminating cross-replica drift.\n\nSize `statistical_min` for the per-replica visible node count, not the cluster-wide count. Alert on `sum_over_replicas(ingero_fleet_active_nodes) \u003c expected_total_nodes` for replica starvation.\n\nLarger-cluster topologies (gateway-based shared state) are out of scope for this release. 
Talk to us if you're approaching the per-replica vertical-scale ceiling.\n\nSee `docs/ARCHITECTURE.md` for the full behavior model and rationale.\n\n## Fleet Configuration\n\n```yaml\n# fleet-config.yaml\nreceivers:\n  otlp:\n    protocols:\n      grpc:\n        endpoint: 0.0.0.0:4317\n      http:\n        endpoint: 0.0.0.0:4318\n\nprocessors:\n  ingero:\n    threshold:\n      k: 2.0                    # MAD sensitivity (default: 2.0)\n      ema_alpha: 0.2             # Threshold smoothing (default: 0.2)\n    quorum:\n      statistical_min: 5         # Min active nodes for valid threshold\n      coverage_fraction: 0.80    # Coverage alert threshold\n    push_interval: 10s           # Expected agent push interval\n    ttl_multiplier: 5            # Node expiry = push_interval * ttl_multiplier\n\nextensions:\n  ingero_threshold:\n    agent_endpoint: 0.0.0.0:8080       # Agent threshold poll (fallback)\n    admin_endpoint: 127.0.0.1:8081     # Diagnostics (management plane only)\n\nexporters:\n  prometheus:\n    endpoint: 0.0.0.0:9090\n\nservice:\n  extensions: [ingero_threshold]\n  pipelines:\n    metrics:\n      receivers: [otlp]\n      processors: [ingero]\n      exporters: [prometheus]\n```\n\n## Observability\n\nFleet emits its own metrics:\n\n| Metric | Type | Description |\n|--------|------|-------------|\n| `ingero_fleet_threshold` | Gauge | Current straggler threshold per cluster |\n| `ingero_fleet_active_nodes` | Gauge | Nodes actively reporting |\n| `ingero_fleet_idle_nodes` | Gauge | Nodes in idle state |\n| `ingero_fleet_coverage_low` | Gauge | 1 if coverage quorum not met |\n| `ingero_fleet_panic_mode` | Gauge | 1 if panic mode active |\n| `ingero_fleet_median` | Gauge | Fleet health score median |\n| `ingero_fleet_mad` | Gauge | Fleet MAD value |\n\nAgent-side metrics:\n\n| Metric | Type | Description |\n|--------|------|-------------|\n| `ingero_agent_health_score` | Gauge | This node's health score |\n| `ingero_agent_detection_mode` | Gauge | Current 
detection tier (fleet/cached/local/none) |\n| `ingero_agent_fleet_reachable` | Gauge | 1 if Fleet is reachable |\n\n## Documentation\n\n- [Architecture](docs/ARCHITECTURE.md) - components, data flow, threshold computation\n- [Deployment Guide](docs/deployment.md) - K8s, Slurm, bare metal\n- [Configuration Reference](docs/configuration.md) - all config parameters\n- [API Reference](docs/api.md) - threshold endpoints\n- [End-to-end walkthrough on Lambda Cloud](examples/lambda-e2e/) - A100 + GH200 (arm64) reference deploy\n\n## Requirements\n\n- Ingero agent v0.10+ on each GPU node\n- Go 1.22+ (for building from source)\n- Kubernetes 1.24+ (for Helm deployment)\n\n## License\n\nApache License 2.0. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fingero-io%2Fingero-fleet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fingero-io%2Fingero-fleet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fingero-io%2Fingero-fleet/lists"}