https://github.com/kuldeep-poonia/loadequilibrium
Predictive infrastructure congestion control engine that models distributed systems as coupled queues to forecast saturation risk and generate proactive scaling or load-shedding signals.
autoscaling control-theory distributed-systems infrastructure observability queueing-theory
- Host: GitHub
- URL: https://github.com/kuldeep-poonia/loadequilibrium
- Owner: kuldeep-poonia
- Created: 2026-03-21T11:51:25.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-06-06T04:16:21.000Z (13 days ago)
- Last Synced: 2026-06-06T06:11:46.240Z (12 days ago)
- Topics: autoscaling, control-theory, distributed-systems, infrastructure, observability, queueing-theory
- Language: Go
- Homepage:
- Size: 1.13 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# LoadEquilibrium




> Predictive auto-scaling for Docker & Kubernetes. Watches your services, predicts failures 60 seconds ahead, and scales automatically using control theory (MPC + RL).
---
## What It Does
You add one label to each service you want monitored. LoadEquilibrium does the rest:
- Watches your services every 2 seconds
- Builds a live mathematical model of each service's queue, latency, and load
- Predicts when a service is about to fail — before it actually does
- Issues precise scaling decisions automatically
- Shows you everything in a live dashboard
It is not a threshold alarm. It does not wait until latency spikes to react. It uses the same class of control system used in aircraft autopilots — applied to your software.
---
## Getting Started in 3 Steps
### Step 1 — Add a label to each service you want monitored
```yaml
# your existing docker-compose.yml
services:
  my-api:
    image: your-app:latest
    labels:
      le.enable: "true"   # ← add this one line
```
That is the only change you make to your existing service. No agents to install. No config files to write.
### Step 2 — Add LoadEquilibrium to your compose file
```yaml
  loadequilibrium:
    image: ghcr.io/your-org/loadequilibrium:latest
    ports:
      - "8080:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
```
### Step 3 — Start it
```bash
docker compose up -d
```
Open your browser at `http://localhost:8080`. You will see your services appear automatically within about 5 seconds.
---
## What Your Services Need to Expose
Your services must expose a `/metrics` endpoint in Prometheus format. This is standard for any service built with:
- Go (`prometheus/client_golang`)
- Python (`prometheus_client`)
- Node.js (`prom-client`)
- Java (`micrometer`)
- Any language with a Prometheus client library
The collector auto-detects the port. If your service exposes metrics on a non-standard port, add one more label:
```yaml
labels:
  le.enable: "true"
  le.port: "9100"   # only needed for non-standard ports
```
---
## What You See in the Dashboard
**Monitor page** — read-only view of what is happening:
- Live metrics per service: requests per second, queue depth, wait time, capacity used
- Failure risk score — how likely each service is to stop responding in the next 60 seconds
- Incident timeline — every problem detected, what the engine predicted, what action it took
- Engine reasoning feed — plain-English explanation of what the autopilot is thinking right now
- Live event stream — everything happening across all services in chronological order
**Control page** — actions you can take:
- Enable or freeze the autopilot (freeze = it keeps watching but stops issuing commands)
- Switch operating policy: Safe Mode / Normal / Performance
- Run a stress test on any service to see how the autopilot responds
- Force the engine to step manually or retrain its model
---
## Architecture: One Image, One Port
The entire system runs in a single Docker container exposing port `8080`:
```
   Your Services (with le.enable=true label)
                     │
                     │  Auto-discovered via Docker socket
                     │  Scraped every 2 seconds
                     ▼
┌─────────────────────────────────────────┐
│         LoadEquilibrium  :8080          │
│                                         │
│  Collector ──► Telemetry Store          │
│                       │                 │
│               Tick Engine (2s)          │
│               │              │          │
│           Autopilot      Reasoning      │
│            (MPC+RL)       (Events)      │
│               │              │          │
│             WebSocket Broadcast         │
│                       │                 │
│                  UI (React)             │
└─────────────────────────────────────────┘
                     │
                     ▼
           http://localhost:8080
```
- Port `8080` — dashboard UI, WebSocket live feed, REST API, Prometheus metrics
- No separate collector container
- No separate nginx
- No Prometheus required (it is optional, for Grafana users)
- No database required (optional, for persistent history)
---
## Optional: Grafana + Prometheus
If you already use Grafana, LoadEquilibrium exposes a `/metrics` endpoint that Prometheus can scrape. A pre-built dashboard is included.
Add to your compose file:
```yaml
  prometheus:
    image: prom/prometheus:v2.52.0
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    volumes:
      - ./monitoring/grafana/datasource.yml:/etc/grafana/provisioning/datasources/datasource.yml:ro
      - ./monitoring/grafana/provider.yml:/etc/grafana/provisioning/dashboards/provider.yml:ro
      - ./monitoring/grafana/loadequilibrium-dashboard.json:/var/lib/grafana/dashboards/loadequilibrium-dashboard.json:ro
```
The Grafana dashboard shows traffic, latency, queue depth, autopilot decisions, signal quality, and Go runtime metrics.
---
## Optional: Persistent History (PostgreSQL)
Without a database, the engine runs entirely in memory. Data is lost on restart, but real-time monitoring does not depend on it.
To keep a history of engine snapshots:
```yaml
  loadequilibrium:
    environment:
      DATABASE_URL: "postgres://le:yourpassword@postgres:5432/le?sslmode=disable"
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: le
      POSTGRES_PASSWORD: yourpassword
      POSTGRES_DB: le
```
The schema is created automatically on first start. No migrations to run.
---
## Environment Variables
All settings have safe defaults. In production, the only one you need to set is `INGEST_TOKEN`.
| Variable | Default | What it does |
|---|---|---|
| `INGEST_TOKEN` | *(empty)* | Auth token for the ingest API. Set this in production. |
| `DATABASE_URL` | *(empty)* | Postgres DSN. Leave empty to run in-memory. |
| `LISTEN_ADDR` | `:8080` | Address and port the HTTP server binds to. |
| `TICK_INTERVAL` | `2s` | How often the engine runs. Do not change unless you have a reason. |
| `UTILISATION_SETPOINT` | `0.70` | Target capacity utilisation (70%). Leave 30% headroom. |
| `MAX_SERVICES` | `200` | Maximum number of services to track. |
| `LE_PORT` | `8080` | Host port (compose only). Change if 8080 is taken on your machine. |
Full list of all variables is in the [Configuration Reference](#configuration-reference) section below.
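
To make the "safe defaults" behaviour concrete, here is a minimal Go sketch of how a process could read a few of these variables and fall back to the documented defaults. It is an illustration only, not the project's actual loader: the `Config` type, `LoadConfig`, and the `lookup` parameter are hypothetical names. Passing the lookup function in (rather than calling `os.Getenv` directly) keeps the parsing easy to test.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config holds a subset of the documented settings.
// The type and its field names are hypothetical.
type Config struct {
	ListenAddr          string
	TickInterval        time.Duration
	UtilisationSetpoint float64
	MaxServices         int
}

// LoadConfig reads settings through lookup and falls back to the
// defaults listed in the table above when a variable is unset or empty.
func LoadConfig(lookup func(string) string) (Config, error) {
	get := func(key, def string) string {
		if v := lookup(key); v != "" {
			return v
		}
		return def
	}

	var c Config
	var err error
	c.ListenAddr = get("LISTEN_ADDR", ":8080")
	if c.TickInterval, err = time.ParseDuration(get("TICK_INTERVAL", "2s")); err != nil {
		return c, fmt.Errorf("TICK_INTERVAL: %w", err)
	}
	if c.UtilisationSetpoint, err = strconv.ParseFloat(get("UTILISATION_SETPOINT", "0.70"), 64); err != nil {
		return c, fmt.Errorf("UTILISATION_SETPOINT: %w", err)
	}
	if c.MaxServices, err = strconv.Atoi(get("MAX_SERVICES", "200")); err != nil {
		return c, fmt.Errorf("MAX_SERVICES: %w", err)
	}
	return c, nil
}

func main() {
	cfg, err := LoadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```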
---
## Kubernetes Deployment
For production on Kubernetes, use the manifests in `k8s/`. The image is the same — just one Deployment.
```bash
# 1. Create the namespace first so the secret has somewhere to live
kubectl apply -f k8s/00-namespace.yml
# 2. Create secrets (never put real values in the YAML files)
kubectl create secret generic loadequilibrium-secrets \
  --from-literal=database-url='postgres://le:YOURPASS@postgres-svc:5432/le?sslmode=require' \
  --from-literal=ingest-token='your-real-token' \
  -n loadequilibrium
# 3. Apply the remaining manifests
kubectl apply -f k8s/01-secrets.yml
kubectl apply -f k8s/02-configmap.yml
kubectl apply -f k8s/03-postgres.yml
kubectl apply -f k8s/04-deployment.yml
kubectl apply -f k8s/05-prometheus.yml
kubectl apply -f k8s/06-grafana.yml
kubectl apply -f k8s/07-ingress.yml   # edit hostnames first
# 4. Watch it start
kubectl rollout status deployment/loadequilibrium -n loadequilibrium
# 5. Access before ingress is ready
kubectl port-forward svc/loadequilibrium-svc 8080:80 -n loadequilibrium
```
**Important**: Run exactly 1 replica. The engine keeps in-memory state that is not distributed across pods. If you need high availability, run active-passive with a shared PostgreSQL backend (bot[...]
---
## How to Send Metrics Manually (Without Docker)
If you are not using Docker (e.g. running services as systemd units, on bare metal, or in a different container runtime), use `collector.py` to push metrics from an existing Prometheus server:
```bash
pip install requests
PROMETHEUS_URL=http://your-prometheus:9090 \
INGEST_URL=http://loadequilibrium:8080/api/v1/ingest \
python3 collector.py
```
Or push directly to the ingest API from your application code:
```bash
curl -X POST http://localhost:8080/api/v1/ingest \
  -H "Content-Type: application/json" \
  -d '[{
    "service_id": "my-api",
    "request_rate": 142.5,
    "error_rate": 0.002,
    "latency": { "p50": 12.1, "p95": 48.3, "p99": 91.2, "mean": 18.4 },
    "queue_depth": 23,
    "active_conns": 87
  }]'
```
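
The same push can be made from Go application code. The sketch below mirrors the JSON fields from the curl example; the `Point` and `Latency` types and the `EncodePoints` and `Push` helpers are hypothetical names introduced for illustration. This README does not say how `INGEST_TOKEN` is presented to the API, so the sketch sends no credentials.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// Latency mirrors the "latency" object in the curl example.
type Latency struct {
	P50  float64 `json:"p50"`
	P95  float64 `json:"p95"`
	P99  float64 `json:"p99"`
	Mean float64 `json:"mean"`
}

// Point mirrors one element of the JSON array in the curl example.
type Point struct {
	ServiceID   string  `json:"service_id"`
	RequestRate float64 `json:"request_rate"`
	ErrorRate   float64 `json:"error_rate"`
	Latency     Latency `json:"latency"`
	QueueDepth  int     `json:"queue_depth"`
	ActiveConns int     `json:"active_conns"`
}

// EncodePoints serialises a batch as a JSON array.
func EncodePoints(points []Point) ([]byte, error) {
	return json.Marshal(points)
}

// Push posts a batch to the ingest URL and treats any non-2xx status as an error.
func Push(client *http.Client, url string, points []Point) error {
	body, err := EncodePoints(points)
	if err != nil {
		return err
	}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("ingest returned %s", resp.Status)
	}
	return nil
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	points := []Point{{
		ServiceID: "my-api", RequestRate: 142.5, ErrorRate: 0.002,
		Latency:    Latency{P50: 12.1, P95: 48.3, P99: 91.2, Mean: 18.4},
		QueueDepth: 23, ActiveConns: 87,
	}}
	if err := Push(client, "http://localhost:8080/api/v1/ingest", points); err != nil {
		fmt.Fprintln(os.Stderr, "push failed:", err)
	}
}
```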
---
## API Reference
| Method | Path | What it does |
|---|---|---|
| `GET` | `/` | Dashboard UI |
| `GET` | `/ws` | WebSocket — live tick stream |
| `GET` | `/api/v1/snapshot` | Last tick state as JSON (no WebSocket needed) |
| `GET` | `/health` | Liveness check — returns `{"status":"ok"}` |
| `GET` | `/metrics` | Prometheus metrics |
| `POST` | `/api/v1/ingest` | Push telemetry points |
| `POST` | `/api/v1/control/toggle` | Enable / freeze autopilot |
| `POST` | `/api/v1/policy/update` | Change policy preset |
| `POST` | `/api/v1/control/chaos-run` | Inject a load spike for testing |
| `POST` | `/api/v1/control/replay-burst` | Replay a traffic burst |
| `POST` | `/api/v1/runtime/step` | Force one engine tick manually |
| `POST` | `/api/v1/alerts/ack` | Acknowledge a reasoning event |
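
As a small client-side example, the Go sketch below calls `GET /health` and checks for the documented `{"status":"ok"}` body. The helper names `ParseHealth` and `CheckHealth` are hypothetical, and the base URL assumes the default local setup described earlier.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// Health mirrors the documented /health body: {"status":"ok"}.
type Health struct {
	Status string `json:"status"`
}

// ParseHealth decodes a /health body and reports whether the status is "ok".
func ParseHealth(body []byte) (bool, error) {
	var h Health
	if err := json.Unmarshal(body, &h); err != nil {
		return false, err
	}
	return h.Status == "ok", nil
}

// CheckHealth calls GET <base>/health and returns true when the engine reports ok.
func CheckHealth(client *http.Client, base string) (bool, error) {
	resp, err := client.Get(strings.TrimRight(base, "/") + "/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return ParseHealth(body)
}

func main() {
	client := &http.Client{Timeout: 3 * time.Second}
	ok, err := CheckHealth(client, "http://localhost:8080")
	if err != nil {
		fmt.Fprintln(os.Stderr, "health check failed:", err)
		return
	}
	fmt.Println("healthy:", ok)
}
```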
---
## How the Engine Works (For Those Who Want to Know)
Every 2 seconds, this sequence runs:
1. **Collect** — scrape `/metrics` from all labelled services
2. **Window** — compute EWMA fast/slow, variance, confidence score, signal quality per service
3. **Topology** — build a dependency graph from upstream call data (if services report it)
4. **Model** — queue physics per service: utilisation ρ, mean wait time, queue depth, burst amplification
5. **Reason** — rule engine fires events: `collapse_risk`, `cascade_risk`, `saturation_predicted`, `keystone_degraded`
6. **Simulate** — Monte Carlo forward projection: what happens in the next 60 seconds under current trend?
7. **Autopilot** — MPC (Model Predictive Control) + RL (policy gradient) computes target capacity
8. **Decide** — Control authority converts float target to integer replica count, enforces cooldowns
9. **Actuate** — Send scaling directive to your orchestrator (Kubernetes, Nomad, etc.)
10. **Broadcast** — Push full tick state to all WebSocket clients (your dashboard)
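
To give a feel for the quantities named in step 4, the Go sketch below computes utilisation, mean wait, and queue depth for the textbook M/M/1 queue. This README does not state which queueing model the engine uses, so M/M/1 is an assumption made purely for illustration, and `QueueState` and `MM1` are hypothetical names. With a service rate of 100 req/s, the formulas give a mean wait of about 23 ms at ρ = 0.70 and 90 ms at ρ = 0.90, which shows why headroom below saturation matters.

```go
package main

import (
	"errors"
	"fmt"
)

// QueueState holds steady-state M/M/1 quantities for one service.
type QueueState struct {
	Utilisation float64 // rho = lambda / mu
	MeanWait    float64 // Wq = rho / (mu - lambda), seconds waiting in queue
	QueueDepth  float64 // Lq = rho^2 / (1 - rho), mean number of waiting requests
}

// ErrSaturated is returned when the queue has no steady state.
var ErrSaturated = errors.New("arrival rate meets or exceeds service rate")

// MM1 evaluates the model for arrival rate lambda and service rate mu,
// both in requests per second.
func MM1(lambda, mu float64) (QueueState, error) {
	if lambda < 0 || mu <= 0 {
		return QueueState{}, errors.New("lambda must be >= 0 and mu must be > 0")
	}
	rho := lambda / mu
	if rho >= 1 {
		return QueueState{Utilisation: rho}, ErrSaturated
	}
	return QueueState{
		Utilisation: rho,
		MeanWait:    rho / (mu - lambda),
		QueueDepth:  rho * rho / (1 - rho),
	}, nil
}

func main() {
	for _, lambda := range []float64{70, 90, 99, 100} {
		s, err := MM1(lambda, 100)
		if err != nil {
			fmt.Printf("lambda=%.0f  rho=%.2f  %v\n", lambda, s.Utilisation, err)
			continue
		}
		fmt.Printf("lambda=%.0f  rho=%.2f  wait=%.1fms  depth=%.2f\n",
			lambda, s.Utilisation, s.MeanWait*1000, s.QueueDepth)
	}
}
```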
The sandbox runs every 10 ticks: it takes the current service state, generates a synthetic load spike from real statistics, runs two competing control strategies against it, and uses the result t[...]
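
Step 8 of the tick sequence (float target to integer replica count, with cooldowns) can be pictured with the Go sketch below. Everything in it is an assumption for illustration: the `Decider` type, rounding up with `math.Ceil`, the min/max clamp, and a cooldown counted in ticks are plausible choices, not the project's documented rules.

```go
package main

import (
	"fmt"
	"math"
)

// Decider turns a continuous capacity target into a replica count.
// The fields and the cooldown rule are illustrative assumptions.
type Decider struct {
	Min, Max      int
	CooldownTicks int
	current       int
	lastChange    int
	changed       bool
}

// NewDecider starts at the given replica count with no change recorded yet.
func NewDecider(lo, hi, initial, cooldownTicks int) *Decider {
	return &Decider{Min: lo, Max: hi, CooldownTicks: cooldownTicks, current: initial}
}

// Decide rounds target up, clamps it to [Min, Max], and keeps the current
// count while fewer than CooldownTicks have passed since the last change.
func (d *Decider) Decide(target float64, tick int) int {
	if math.IsNaN(target) {
		return d.current // ignore an invalid target rather than act on it
	}
	clamped := math.Max(float64(d.Min), math.Min(float64(d.Max), math.Ceil(target)))
	want := int(clamped)
	if want == d.current {
		return d.current
	}
	if d.changed && tick-d.lastChange < d.CooldownTicks {
		return d.current
	}
	d.current, d.lastChange, d.changed = want, tick, true
	return d.current
}

func main() {
	d := NewDecider(1, 10, 2, 5)
	for i, target := range []float64{3.2, 6.0, 6.0, 6.0, 2.1} {
		tick := i * 2
		fmt.Printf("tick %d  target=%.1f  replicas=%d\n", tick, target, d.Decide(target, tick))
	}
}
```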
---
## Verifying Everything Works
Run these three checks. All three must pass before deploying to production.
```bash
# Compile check — catches any broken wiring
go build ./...
# Unit + integration tests with race detector
go test -race -count=1 -timeout=300s ./internal/...
# Full autopilot system test — 10 scenarios, must all pass
go build -o system_test_runner ./cmd/system_test_runner/
./system_test_runner 2>/dev/null
# Expected output line: "STABLE_PRODUCTION_GRADE — 10/10 — 0 SLA breaches"
```
---
## Files You Can Delete
Some files in the repository are leftovers or developer tools that are never included in the Docker image. The table shows which of them are safe to remove and which you should keep:
| File | What it is | Safe to delete? |
|---|---|---|
| `Dockerfile.collector` | Old sidecar build — replaced by embedded goroutine | ✅ Yes |
| `ui/Dockerfile` | Old standalone nginx UI build — replaced by Go binary | ✅ Yes |
| `run_physics_validation.sh` | Developer script to validate physics model output | ✅ Yes (keep if you develop the engine) |
| `collector.py` | Alternative Python collector for non-Docker environments | ❌ Keep — useful for Prometheus users |
| `Makefile` | Build convenience commands | ❌ Keep — useful for developers |
| `cmd/system_test_runner/` | Autopilot test suite | ❌ Keep — CI depends on it |
---
## Configuration Reference
### Core Engine
| Variable | Default | Description |
|---|---|---|
| `LISTEN_ADDR` | `:8080` | HTTP server bind address |
| `TICK_INTERVAL` | `2s` | Control tick frequency |
| `TICK_DEADLINE` | `1800ms` | Max time per tick before adaptive stretch |
| `MIN_TICK_INTERVAL` | `1s` | Minimum tick interval under stretch |
| `MAX_TICK_INTERVAL` | `10s` | Maximum tick interval under stretch |
| `WORKER_POOL_SIZE` | `8` | Parallel workers for window computation |
### Telemetry
| Variable | Default | Description |
|---|---|---|
| `RING_BUFFER_DEPTH` | `300` | Samples retained per service |
| `MAX_SERVICES` | `200` | Maximum tracked services |
| `STALE_SERVICE_AGE` | `5m` | Prune threshold for inactive services |
| `INGEST_TOKEN` | *(empty)* | Auth token; set this in production |
### Control Policy
| Variable | Default | Description |
|---|---|---|
| `UTILISATION_SETPOINT` | `0.70` | Target utilisation — 70% leaves 30% headroom |
| `COLLAPSE_THRESHOLD` | `0.90` | Utilisation above which collapse risk is flagged |
| `EWMA_FAST_ALPHA` | `0.30` | Fast EWMA — responds in ~3 ticks |
| `EWMA_SLOW_ALPHA` | `0.10` | Slow EWMA — trend signal, responds in ~10 ticks |
| `PID_KP` | `-1.5` | PID proportional gain |
| `PID_KI` | `-0.3` | PID integral gain |
| `PID_KD` | `-0.1` | PID derivative gain |
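
To show what the two smoothing factors do, here is a minimal Go sketch of an exponentially weighted moving average using the standard update rule `v = alpha*x + (1-alpha)*v`, with the documented defaults 0.30 and 0.10. The `EWMA` type is a hypothetical name, and the project's own windowing code may differ in detail (for example in how it seeds the first sample).

```go
package main

import "fmt"

// EWMA is an exponentially weighted moving average.
type EWMA struct {
	Alpha  float64
	value  float64
	primed bool
}

// Update folds in one sample and returns the new average.
// The first sample seeds the average directly.
func (e *EWMA) Update(x float64) float64 {
	if !e.primed {
		e.value, e.primed = x, true
		return e.value
	}
	e.value = e.Alpha*x + (1-e.Alpha)*e.value
	return e.value
}

func main() {
	fast := &EWMA{Alpha: 0.30} // EWMA_FAST_ALPHA default
	slow := &EWMA{Alpha: 0.10} // EWMA_SLOW_ALPHA default

	// Request rate steps from 100 to 200 req/s at tick 3.
	samples := []float64{100, 100, 100, 200, 200, 200, 200, 200}
	for tick, x := range samples {
		f, s := fast.Update(x), slow.Update(x)
		fmt.Printf("tick %d  x=%.0f  fast=%.1f  slow=%.1f  gap=%.1f\n", tick, x, f, s, f-s)
	}
}
```

After a step change the fast average moves first, so the gap between the two widens and then closes as the slow average catches up. A positive gap is one simple way to read "load is rising".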
### Simulation
| Variable | Default | Description |
|---|---|---|
| `SIM_BUDGET` | `45ms` | Wall-clock budget per tick for Monte Carlo sim |
| `SIM_HORIZON_MS` | `60000` | Simulation lookahead — 60 seconds |
| `SIM_SHOCK_FACTOR` | `2.0` | Worst-case load multiplier in simulation |
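
The idea of a Monte Carlo forward projection can be sketched in Go as follows. This is a toy random-walk model, not the project's simulator: `SaturationRisk`, its parameters, the fixed per-tick trend, and the Gaussian noise are all assumptions for illustration, and the sketch ignores `SIM_BUDGET` and `SIM_SHOCK_FACTOR`. A 60-second horizon at the default 2-second tick gives 30 steps.

```go
package main

import (
	"fmt"
	"math/rand"
)

// SaturationRisk projects utilisation forward `runs` times and returns the
// fraction of projections that reach threshold within `steps` ticks.
// Each tick adds a fixed trend plus Gaussian noise with standard deviation noise.
func SaturationRisk(seed int64, rho, trend, noise, threshold float64, steps, runs int) float64 {
	if runs <= 0 {
		return 0
	}
	rng := rand.New(rand.NewSource(seed)) // seeded so results are reproducible
	breaches := 0
	for r := 0; r < runs; r++ {
		u := rho
		for s := 0; s < steps; s++ {
			u += trend + rng.NormFloat64()*noise
			if u >= threshold {
				breaches++
				break
			}
		}
	}
	return float64(breaches) / float64(runs)
}

func main() {
	// 0.90 is the documented COLLAPSE_THRESHOLD default.
	const steps, runs = 30, 2000
	for _, rho := range []float64{0.50, 0.70, 0.85} {
		risk := SaturationRisk(1, rho, 0.002, 0.01, 0.90, steps, runs)
		fmt.Printf("rho=%.2f  estimated risk over the horizon=%.3f\n", rho, risk)
	}
}
```

A real implementation would also need to stop when its wall-clock budget runs out, which is what `SIM_BUDGET` suggests the engine does.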
### Persistence
| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | *(empty)* | Postgres DSN; if empty, runs in-memory |
| `PERSIST_INTERVAL` | `30s` | How often snapshots flush to DB |
---
## License
See `LICENSE` for terms.