https://github.com/kuldeep-poonia/loadequilibrium

Predictive infrastructure congestion control engine that models distributed systems as coupled queues to forecast saturation risk and generate proactive scaling or load-shedding signals.
Host: GitHub
URL: https://github.com/kuldeep-poonia/loadequilibrium
Owner: kuldeep-poonia
Created: 2026-03-21T11:51:25.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-06-06T04:16:21.000Z (13 days ago)
Last Synced: 2026-06-06T06:11:46.240Z (12 days ago)
Topics: autoscaling, control-theory, distributed-systems, infrastructure, observability, queueing-theory
Language: Go
Homepage:
Size: 1.13 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# LoadEquilibrium

![Go](https://img.shields.io/badge/Go-1.20+-blue?style=flat-square)

![Docker](https://img.shields.io/badge/Docker-Ready-brightgreen?style=flat-square)

![Kubernetes](https://img.shields.io/badge/Kubernetes-Compatible-326ce5?style=flat-square)

![License](https://img.shields.io/badge/License-MIT-green?style=flat-square)

> Predictive auto-scaling for Docker & Kubernetes. Watches your services, predicts failures 60 seconds ahead, and scales automatically using control theory (MPC + RL).

---

## What It Does

You add one label to each service you want monitored. LoadEquilibrium does the rest:

- Watches your services every 2 seconds

- Builds a live mathematical model of each service's queue, latency, and load

- Predicts when a service is about to fail — before it actually does

- Issues precise scaling decisions automatically

- Shows you everything in a live dashboard

It is not a threshold alarm: it does not wait for latency to spike before reacting. It uses feedback control of the kind found in aircraft autopilots, here model predictive control (MPC) combined with reinforcement learning (RL), applied to your software.

---

## Getting Started in 3 Steps

### Step 1 — Add a label to each service you want monitored

```yaml

# your existing docker-compose.yml

services:

  my-api:

    image: your-app:latest

    labels:

      le.enable: "true"    # ← add this one line

```

That is the only change you make to your existing service. No agents to install. No config files to write.

### Step 2 — Add LoadEquilibrium to your compose file

```yaml

  loadequilibrium:

    image: ghcr.io/your-org/loadequilibrium:latest

    ports:

      - "8080:8080"

    volumes:

      - /var/run/docker.sock:/var/run/docker.sock:ro

```

### Step 3 — Start it

```bash

docker compose up -d

```

Open your browser at `http://localhost:8080`. You will see your services appear automatically within about 5 seconds.

---

## What Your Services Need to Expose

Your services must expose a `/metrics` endpoint in Prometheus format. This is standard for any service built with:

- Go (`prometheus/client_golang`)

- Python (`prometheus_client`)

- Node.js (`prom-client`)

- Java (`micrometer`)

- Any language with a Prometheus client library

The collector auto-detects the port. If your service exposes metrics on a non-standard port, add one more label:

```yaml

labels:

  le.enable: "true"

  le.port: "9100"    # only needed for non-standard ports

```
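
If a service has no client library available, the endpoint can be hand-rolled. The following is a minimal sketch in Go using only the standard library. The metric names (`http_requests_total`, `queue_depth`) are illustrative assumptions, not names the LoadEquilibrium collector is documented to require, and a real Go service would normally use `prometheus/client_golang` instead.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Hypothetical counters; the metric names below are illustrative only.
var (
	requestsTotal atomic.Int64
	queueDepth    atomic.Int64
)

// renderMetrics formats the counters in the Prometheus text exposition format.
func renderMetrics() string {
	return fmt.Sprintf(
		"# TYPE http_requests_total counter\nhttp_requests_total %d\n"+
			"# TYPE queue_depth gauge\nqueue_depth %d\n",
		requestsTotal.Load(), queueDepth.Load())
}

// metricsHandler serves the rendered metrics as plain text.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderMetrics())
}

func main() {
	requestsTotal.Add(3)
	queueDepth.Store(2)

	// Exercise the handler in-process so the sketch runs without opening a port.
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

In a real service you would register the handler with `http.HandleFunc("/metrics", metricsHandler)` and serve it with `http.ListenAndServe`, on the port named by `le.port` if it is non-standard.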

---

## What You See in the Dashboard

**Monitor page** — read-only view of what is happening:

- Live metrics per service: requests per second, queue depth, wait time, capacity used

- Failure risk score — how likely each service is to stop responding in the next 60 seconds

- Incident timeline — every problem detected, what the engine predicted, what action it took

- Engine reasoning feed — plain-English explanation of what the autopilot is thinking right now

- Live event stream — everything happening across all services in chronological order

**Control page** — actions you can take:

- Enable or freeze the autopilot (freeze = it keeps watching but stops issuing commands)

- Switch operating policy: Safe Mode / Normal / Performance

- Run a stress test on any service to see how the autopilot responds

- Force the engine to step manually or retrain its model

---

## Architecture: One Image, One Port

The entire system runs in a single Docker container exposing port `8080`:

```

Your Services (with le.enable=true label)
        │
        │  Auto-discovered via Docker socket
        │  Scraped every 2 seconds
        ▼
┌─────────────────────────────────────────┐
│         LoadEquilibrium :8080           │
│                                         │
│  Collector ──► Telemetry Store          │
│                      │                  │
│               Tick Engine (2s)          │
│                 │         │             │
│          Autopilot    Reasoning         │
│          (MPC+RL)     (Events)          │
│                 │         │             │
│          WebSocket Broadcast            │
│                 │                       │
│            UI (React)                   │
└─────────────────────────────────────────┘
        │
        ▼
  http://localhost:8080

```

- Port `8080` — dashboard UI, WebSocket live feed, REST API, Prometheus metrics

- No separate collector container

- No separate nginx

- No Prometheus required (it is optional, for Grafana users)

- No database required (optional, for persistent history)

---

## Optional: Grafana + Prometheus

If you already use Grafana, LoadEquilibrium exposes a `/metrics` endpoint that Prometheus can scrape. A pre-built dashboard is included.

Add to your compose file:

```yaml

  prometheus:

    image: prom/prometheus:v2.52.0

    volumes:

      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro

    ports:

      - "9090:9090"

  grafana:

    image: grafana/grafana:10.4.0

    ports:

      - "3000:3000"

    volumes:

      - ./monitoring/grafana/datasource.yml:/etc/grafana/provisioning/datasources/datasource.yml:ro

      - ./monitoring/grafana/provider.yml:/etc/grafana/provisioning/dashboards/provider.yml:ro

      - ./monitoring/grafana/loadequilibrium-dashboard.json:/var/lib/grafana/dashboards/loadequilibrium-dashboard.json:ro

```

Grafana dashboard shows: traffic, latency, queue depth, autopilot decisions, signal quality, and Go runtime metrics.

---

## Optional: Persistent History (PostgreSQL)

Without a database, the engine runs entirely in memory. Data is lost on restart, but real-time monitoring and control work without it.

To keep a history of engine snapshots:

```yaml

  loadequilibrium:

    environment:

      DATABASE_URL: "postgres://le:yourpassword@postgres:5432/le?sslmode=disable"

  postgres:

    image: postgres:16-alpine

    environment:

      POSTGRES_USER:     le

      POSTGRES_PASSWORD: yourpassword

      POSTGRES_DB:       le

```

The schema is created automatically on first start. No migrations to run.

---

## Environment Variables

All settings have safe defaults. In production, the only one you need to set is `INGEST_TOKEN`.

| Variable | Default | What it does |
|---|---|---|
| `INGEST_TOKEN` | *(empty)* | Auth token for the ingest API. Set this in production. |
| `DATABASE_URL` | *(empty)* | Postgres DSN. Leave empty to run in-memory. |
| `LISTEN_ADDR` | `:8080` | Address and port the HTTP server binds to. |
| `TICK_INTERVAL` | `2s` | How often the engine runs. Do not change unless you have a reason. |
| `UTILISATION_SETPOINT` | `0.70` | Target capacity utilisation (70%), which leaves 30% headroom. |
| `MAX_SERVICES` | `200` | Maximum number of services to track. |
| `LE_PORT` | `8080` | Host port (compose only). Change if 8080 is taken on your machine. |

Full list of all variables is in the [Configuration Reference](#configuration-reference) section below.

---

## Kubernetes Deployment

For production on Kubernetes, use the manifests in `k8s/`. The image is the same — just one Deployment.

```bash

# 1. Create the namespace first; the secret below is created inside it
kubectl apply -f k8s/00-namespace.yml

# 2. Create secrets (never put real values in the YAML files)
kubectl create secret generic loadequilibrium-secrets \
  --from-literal=database-url='postgres://le:YOURPASS@postgres-svc:5432/le?sslmode=require' \
  --from-literal=ingest-token='your-real-token' \
  -n loadequilibrium

# 3. Apply the remaining manifests
kubectl apply -f k8s/01-secrets.yml
kubectl apply -f k8s/02-configmap.yml
kubectl apply -f k8s/03-postgres.yml
kubectl apply -f k8s/04-deployment.yml
kubectl apply -f k8s/05-prometheus.yml
kubectl apply -f k8s/06-grafana.yml
kubectl apply -f k8s/07-ingress.yml   # edit hostnames first

# 4. Watch it start
kubectl rollout status deployment/loadequilibrium -n loadequilibrium

# 5. Access before ingress is ready
kubectl port-forward svc/loadequilibrium-svc 8080:80 -n loadequilibrium

```

**Important**: Run exactly 1 replica. The engine keeps in-memory state that is not distributed across pods. If you need high availability, run active-passive with a shared PostgreSQL backend (bot[...]

---

## How to Send Metrics Manually (Without Docker)

If you are not using Docker (e.g. running services as systemd units, on bare metal, or in a different container runtime), use `collector.py` to push metrics from an existing Prometheus server:

```bash

pip install requests

PROMETHEUS_URL=http://your-prometheus:9090 \
INGEST_URL=http://loadequilibrium:8080/api/v1/ingest \
python3 collector.py

```

Or push directly to the ingest API from your application code:

```bash

curl -X POST http://localhost:8080/api/v1/ingest \
  -H "Content-Type: application/json" \
  -d '[{
    "service_id":    "my-api",
    "request_rate":  142.5,
    "error_rate":    0.002,
    "latency": { "p50": 12.1, "p95": 48.3, "p99": 91.2, "mean": 18.4 },
    "queue_depth":   23,
    "active_conns":  87
  }]'
```

```

---

## API Reference

| Method | Path | What it does |
|---|---|---|
| `GET` | `/` | Dashboard UI |
| `GET` | `/ws` | WebSocket: live tick stream |
| `GET` | `/api/v1/snapshot` | Last tick state as JSON (no WebSocket needed) |
| `GET` | `/health` | Liveness check, returns `{"status":"ok"}` |
| `GET` | `/metrics` | Prometheus metrics |
| `POST` | `/api/v1/ingest` | Push telemetry points |
| `POST` | `/api/v1/control/toggle` | Enable / freeze autopilot |
| `POST` | `/api/v1/policy/update` | Change policy preset |
| `POST` | `/api/v1/control/chaos-run` | Inject a load spike for testing |
| `POST` | `/api/v1/control/replay-burst` | Replay a traffic burst |
| `POST` | `/api/v1/runtime/step` | Force one engine tick manually |
| `POST` | `/api/v1/alerts/ack` | Acknowledge a reasoning event |

---

## How the Engine Works (For Those Who Want to Know)

Every 2 seconds, this sequence runs:

1. **Collect** — scrape `/metrics` from all labelled services

2. **Window** — compute EWMA fast/slow, variance, confidence score, signal quality per service

3. **Topology** — build a dependency graph from upstream call data (if services report it)

4. **Model** — queue physics per service: utilisation ρ, mean wait time, queue depth, burst amplification

5. **Reason** — rule engine fires events: `collapse_risk`, `cascade_risk`, `saturation_predicted`, `keystone_degraded`

6. **Simulate** — Monte Carlo forward projection: what happens in the next 60 seconds under current trend?

7. **Autopilot** — MPC (Model Predictive Control) + RL (policy gradient) computes target capacity

8. **Decide** — Control authority converts float target to integer replica count, enforces cooldowns

9. **Actuate** — Send scaling directive to your orchestrator (Kubernetes, Nomad, etc.)

10. **Broadcast** — Push full tick state to all WebSocket clients (your dashboard)
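
To give a feel for the "queue physics" in step 4, here is a minimal sketch that computes utilisation, mean wait, and queue depth for a single service. It assumes the textbook M/M/1 model (Poisson arrivals, exponential service times, one server); this README does not state which queueing model the engine actually uses, and `QueueStats` and `MM1` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// QueueStats holds steady-state M/M/1 results.
type QueueStats struct {
	Utilisation float64 // ρ = λ/μ
	MeanWait    float64 // Wq = ρ/(μ−λ), seconds spent waiting before service
	QueueDepth  float64 // Lq = ρ²/(1−ρ), mean number of requests waiting
}

// ErrUnstable is returned when arrivals meet or exceed service capacity,
// in which case the queue has no steady state.
var ErrUnstable = errors.New("arrival rate meets or exceeds service rate")

// MM1 computes steady-state statistics for arrival rate lambda and
// service rate mu, both in requests per second.
func MM1(lambda, mu float64) (QueueStats, error) {
	if lambda < 0 || mu <= 0 {
		return QueueStats{}, errors.New("rates must be non-negative, mu positive")
	}
	rho := lambda / mu
	if rho >= 1 {
		return QueueStats{Utilisation: rho}, ErrUnstable
	}
	return QueueStats{
		Utilisation: rho,
		MeanWait:    rho / (mu - lambda),
		QueueDepth:  rho * rho / (1 - rho),
	}, nil
}

func main() {
	// Same capacity (100 req/s) at increasing load.
	for _, lambda := range []float64{50, 70, 90} {
		s, err := MM1(lambda, 100)
		if err != nil {
			fmt.Println(lambda, err)
			continue
		}
		fmt.Printf("ρ=%.2f wait=%.4fs depth=%.2f\n", s.Utilisation, s.MeanWait, s.QueueDepth)
	}
}
```

Under this model queue depth grows non-linearly as ρ approaches 1, which is the usual motivation for holding utilisation well below saturation.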

The sandbox runs every 10 ticks: it takes the current service state, generates a synthetic load spike from real statistics, runs two competing control strategies against it, and uses the result t[...]

---

## Verifying Everything Works

Run these three checks. All three must pass before deploying to production.

```bash

# Compile check — catches any broken wiring

go build ./...

# Unit + integration tests with race detector

go test -race -count=1 -timeout=300s ./internal/...

# Full autopilot system test — 10 scenarios, must all pass

go build -o system_test_runner ./cmd/system_test_runner/

./system_test_runner 2>/dev/null

# Expected output line: "STABLE_PRODUCTION_GRADE — 10/10 — 0 SLA breaches"

```

---

## Files You Can Delete

The files marked ✅ below are safe to remove: they are developer tools that are never included in the Docker image. Keep the files marked ❌.

| File | What it is | Safe to delete? |
|---|---|---|
| `Dockerfile.collector` | Old sidecar build, replaced by embedded goroutine | ✅ Yes |
| `ui/Dockerfile` | Old standalone nginx UI build, replaced by Go binary | ✅ Yes |
| `run_physics_validation.sh` | Developer script to validate physics model output | ✅ Yes (keep if you develop the engine) |
| `collector.py` | Alternative Python collector for non-Docker environments | ❌ Keep: useful for Prometheus users |
| `Makefile` | Build convenience commands | ❌ Keep: useful for developers |
| `cmd/system_test_runner/` | Autopilot test suite | ❌ Keep: CI depends on it |

---

## Configuration Reference

### Core Engine

| Variable | Default | Description |
|---|---|---|
| `LISTEN_ADDR` | `:8080` | HTTP server bind address |
| `TICK_INTERVAL` | `2s` | Control tick frequency |
| `TICK_DEADLINE` | `1800ms` | Max time per tick before adaptive stretch |
| `MIN_TICK_INTERVAL` | `1s` | Minimum tick interval under stretch |
| `MAX_TICK_INTERVAL` | `10s` | Maximum tick interval under stretch |
| `WORKER_POOL_SIZE` | `8` | Parallel workers for window computation |

### Telemetry

| Variable | Default | Description |
|---|---|---|
| `RING_BUFFER_DEPTH` | `300` | Samples retained per service |
| `MAX_SERVICES` | `200` | Maximum tracked services |
| `STALE_SERVICE_AGE` | `5m` | Prune threshold for inactive services |
| `INGEST_TOKEN` | *(empty)* | Auth token. Set this in production |

### Control Policy

| Variable | Default | Description |
|---|---|---|
| `UTILISATION_SETPOINT` | `0.70` | Target utilisation: 70% leaves 30% headroom |
| `COLLAPSE_THRESHOLD` | `0.90` | Utilisation above which collapse risk is flagged |
| `EWMA_FAST_ALPHA` | `0.30` | Fast EWMA, responds in ~3 ticks |
| `EWMA_SLOW_ALPHA` | `0.10` | Slow EWMA, trend signal, responds in ~10 ticks |
| `PID_KP` | `-1.5` | PID proportional gain |
| `PID_KI` | `-0.3` | PID integral gain |
| `PID_KD` | `-0.1` | PID derivative gain |
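
The "~3 ticks" and "~10 ticks" figures for the two EWMA settings can be checked with a short sketch. It uses the standard update rule `s ← α·x + (1−α)·s` and feeds a step input from 0 to 1. The `EWMA` type and priming on the first sample are assumptions for illustration; the engine's own windowing code is not shown in this README.

```go
package main

import "fmt"

// EWMA is an exponentially weighted moving average.
type EWMA struct {
	Alpha  float64
	value  float64
	primed bool
}

// Update folds sample x into the average and returns the new value.
// The first sample initialises the average directly (an assumption).
func (e *EWMA) Update(x float64) float64 {
	if !e.primed {
		e.value, e.primed = x, true
		return e.value
	}
	e.value = e.Alpha*x + (1-e.Alpha)*e.value
	return e.value
}

func main() {
	fast := &EWMA{Alpha: 0.30} // EWMA_FAST_ALPHA default
	slow := &EWMA{Alpha: 0.10} // EWMA_SLOW_ALPHA default
	fast.Update(0)
	slow.Update(0)

	// Step input: the signal jumps from 0 to 1 and stays there.
	for tick := 1; tick <= 10; tick++ {
		f, s := fast.Update(1), slow.Update(1)
		fmt.Printf("tick %2d  fast=%.3f  slow=%.3f\n", tick, f, s)
	}
}
```

After n ticks the average has covered `1 − (1−α)ⁿ` of the step, so the fast filter reaches about 66% after 3 ticks and the slow filter about 65% after 10 ticks, roughly one time constant in each case.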

### Simulation

| Variable | Default | Description |
|---|---|---|
| `SIM_BUDGET` | `45ms` | Wall-clock budget per tick for Monte Carlo sim |
| `SIM_HORIZON_MS` | `60000` | Simulation lookahead: 60 seconds |
| `SIM_SHOCK_FACTOR` | `2.0` | Worst-case load multiplier in simulation |

### Persistence

| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | *(empty)* | Postgres DSN. If empty, runs in-memory |
| `PERSIST_INTERVAL` | `30s` | How often snapshots flush to DB |

---

## License

See `LICENSE` for terms.