{"id":26731614,"url":"https://github.com/souvik03-136/neurabalancer","last_synced_at":"2026-04-01T20:29:55.470Z","repository":{"id":276187182,"uuid":"927905782","full_name":"souvik03-136/NeuraBalancer","owner":"souvik03-136","description":"Self-Optimizing Load Balancer","archived":false,"fork":false,"pushed_at":"2025-03-26T17:56:05.000Z","size":36859,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-26T18:42:07.765Z","etag":null,"topics":["api-development","auto-recovery","automated-failover","ci-cd","containerization","distributed-tracing","fault-tolerance","github-actions","golang","grafana","kubernetes","load-balancing","metrics-collection","predicting-server-load","prometheus","rate-limiting","reinforcement-learning","scalability","timescaledb","traffic-distributions"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/souvik03-136.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-05T18:29:54.000Z","updated_at":"2025-03-26T17:56:11.000Z","dependencies_parsed_at":"2025-03-02T20:31:46.302Z","dependency_job_id":null,"html_url":"https://github.com/souvik03-136/NeuraBalancer","commit_stats":null,"previous_names":["souvik03-136/neurabalancer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/souvik03-136%2FNeuraBalancer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/souvik03-136%2FNeuraBalancer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/souvik03-136%2FNeuraBalancer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/souvik03-136%2FNeuraBalancer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/souvik03-136","download_url":"https://codeload.github.com/souvik03-136/NeuraBalancer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245946221,"owners_count":20698385,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api-development","auto-recovery","automated-failover","ci-cd","containerization","distributed-tracing","fault-tolerance","github-actions","golang","grafana","kubernetes","load-balancing","metrics-collection","predicting-server-load","prometheus","rate-limiting","reinforcement-learning","scalability","timescaledb","traffic-distributions"],"created_at":"2025-03-28T00:26:00.579Z","updated_at":"2026-04-01T20:29:55.462Z","avatar_url":"https://github.com/souvik03-136.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NeuraBalancer\n\n**An AI-driven, self-optimising HTTP load balancer** — routes requests using real-time server metrics and an ONNX reinforcement-learning model, with full observability (Prometheus · Grafana · Loki · Tempo).\n\n\u003e **Want to see it running?**  \n\u003e The complete end-to-end demo — terminal output, Prometheus graphs, Grafana dashboards, Loki logs, and database verification — is documented in **[DEMO_WALKTHROUGH.md](./DEMO_WALKTHROUGH.md)**.\n\n---\n\n## Table of Contents\n\n1. [Architecture](#architecture)\n2. [Features](#features)\n3. [Project Structure](#project-structure)\n4. [Prerequisites](#prerequisites)\n5. [Quick Start](#quick-start)\n6. [Configuration Reference](#configuration-reference)\n7. [Load Balancing Strategies](#load-balancing-strategies)\n8. [API Reference](#api-reference)\n9. [Observability Stack](#observability-stack)\n10. [ML Model](#ml-model)\n11. [Development Guide](#development-guide)\n12. [Testing](#testing)\n13. [Deployment](#deployment)\n14. [Contributing](#contributing)\n\n---\n\n## Architecture\n\n```\nClients\n  │\n  ▼\nLoad Balancer  :8080   (Go · Echo · Prometheus metrics)\n  ├─► Backend-1 :8001  ◄──┐\n  ├─► Backend-2 :8002     │  Health checks every 5 s\n  └─► Backend-3 :8003  ◄──┘\n  │\n  ├─► ML Service    :8081   (Go · ONNX Runtime)\n  ├─► TimescaleDB   :5432   (request + metrics storage)\n  └─► Observability\n        ├── Prometheus  :9090\n        ├── Grafana     :3000\n        ├── Loki        :3100  (log aggregation)\n        ├── Tempo       :3200  (distributed traces)\n        └── OTel Coll.  :4317  (OTLP receiver)\n```\n\nAll configuration is **environment-driven** — no hardcoded values anywhere in the codebase.\n\n---\n\n## Features\n\n| Category | Detail |\n|---|---|\n| **Strategies** | Round Robin, Weighted Round Robin, Least Connections, Random, ML |\n| **ML Routing** | ONNX inference, LRU prediction cache, circuit breaker, WRR fallback |\n| **Resilience** | Active health checks, automatic server recovery, graceful shutdown |\n| **Observability** | Prometheus metrics, structured JSON logs → Loki, OTLP traces → Tempo → Grafana |\n| **Scalability** | Stateless LB; add backend instances by extending `SERVERS` env var |\n| **Security** | Distroless images, non-root containers, no secrets in images |\n| **Dev Experience** | Single `task up` start, pre-commit hooks, golangci-lint, pytest, coverage |\n\n---\n\n## Project Structure\n\n```\nneurabalancer/\n│\n├── backend/\n│   ├── cmd/\n│   │   ├── api/           # Load balancer entrypoint\n│   │   └── server/        # Generic backend server (config-driven port)\n│   ├── internal/\n│   │   ├── api/           # Echo handlers, middleware, router\n│   │   ├── config/        # Config loader + zap logger factory\n│   │   ├── database/      # PostgreSQL connection + all queries\n│   │   ├── loadbalancer/  # Balancer core, strategies, ML strategy\n│   │   ├── metrics/       # Prometheus collector\n│   │   └── tracer/        # OpenTelemetry setup\n│   └── migrations/        # SQL schema migrations\n│\n├── ml/\n│   ├── model-server/      # ONNX inference HTTP server\n│   ├── models/            # .onnx + scaler.json (git-ignored, generated)\n│   ├── scripts/           # deploy_model.sh\n│   └── training/          # PyTorch training pipeline + tests\n│\n├── configs/\n│   ├── prometheus/        # prometheus.yml\n│   ├── loki/              # loki.yml + promtail.yml\n│   ├── tempo/             # tempo.yml\n│   ├── otel/              # otel-collector.yml\n│   └── grafana/           # Provisioned datasources + dashboards\n│\n├── deployments/\n│   ├── docker/            # Dockerfile.balancer / .backend / .ml\n│   ├── helm/              # Helm chart for Kubernetes\n│   └── k8s/               # Raw Kubernetes manifests\n│\n├── scripts/               # healthcheck.sh, wait-for.sh\n├── .github/workflows/     # CI (test + lint) and CD (build + deploy)\n├── docker-compose.yml     # Full local stack\n├── Taskfile.yml           # Developer task runner\n├── .env.example           # All available configuration keys\n├── .golangci.yml          # Linter configuration\n└── DEMO_WALKTHROUGH.md    # End-to-end demo with screenshots\n```\n\n---\n\n## Prerequisites\n\n| Tool | Version | Purpose |\n|---|---|---|\n| Docker + Compose | 24+ / v2+ | Running the full stack |\n| Go | 1.21+ | Building Go services |\n| Python | 3.11+ | ML training |\n| Task | 3+ | Task runner (`brew install go-task`) |\n| golangci-lint | 1.56+ | Go linting |\n\n---\n\n## Quick Start\n\n### 1 — Clone and configure\n\n```bash\ngit clone https://github.com/souvik03-136/neurabalancer.git\ncd neurabalancer\ncp .env.example .env\n```\n\nOpen `.env` and set **at minimum**:\n\n```dotenv\nDB_PASSWORD=your_secure_password\nGRAFANA_ADMIN_PASSWORD=your_grafana_password\n```\n\n### 2 — Start the full stack\n\n```bash\ntask up\n```\n\nThis builds all images and starts every service. First run takes ~3 minutes.\n\n### 3 — Verify everything is healthy\n\n```bash\ntask ps\n# All containers should show (healthy)\n\ncurl http://localhost:8080/health/live\n# {\"status\":\"ok\",\"ts\":\"...\"}\n\ncurl http://localhost:8080/health/ready\n# {\"status\":\"ready\",\"total\":3,\"healthy\":3}\n\ncurl http://localhost:8080/api/v1/servers\n# Lists all backend servers and their state\n```\n\n### 4 — Open the dashboards\n\n| Service | URL | Credentials |\n|---|---|---|\n| Grafana | http://localhost:3000 | `admin` / `$GRAFANA_ADMIN_PASSWORD` |\n| Prometheus | http://localhost:9090 | — |\n| Load Balancer metrics | http://localhost:8080/metrics | — |\n\nThe **NeuraBalancer — Overview** dashboard is pre-provisioned in Grafana.\n\n---\n\n## Demo Walkthrough\n\nIf you want to verify the system is working end-to-end — or show it to someone — the **[DEMO_WALKTHROUGH.md](./DEMO_WALKTHROUGH.md)** covers everything in order:\n\n| Phase | What it covers |\n|---|---|\n| [Phase 1](./DEMO_WALKTHROUGH.md#phase-1--stack-startup) | All 13 containers healthy, load balancer startup logs, backend server logs |\n| [Phase 2](./DEMO_WALKTHROUGH.md#phase-2--api-health-checks) | Liveness probe, readiness probe, server inventory |\n| [Phase 3](./DEMO_WALKTHROUGH.md#phase-3--sending-real-traffic) | Routing individual requests and 100-request burst |\n| [Phase 4](./DEMO_WALKTHROUGH.md#phase-4--raw-prometheus-metrics) | Raw `neurabalancer_*` metrics from `/metrics` endpoint |\n| [Phase 5](./DEMO_WALKTHROUGH.md#phase-5--database-verification) | TimescaleDB tables — servers registered, requests recorded |\n| [Phase 6](./DEMO_WALKTHROUGH.md#phase-6--prometheus-ui) | Prometheus UI — targets, request rate, P95 latency, CPU graphs |\n| [Phase 7](./DEMO_WALKTHROUGH.md#phase-7--grafana-dashboards--loki-logs) | Grafana overview dashboard, Loki structured log explorer |\n| [Phase 8](./DEMO_WALKTHROUGH.md#phase-8--ml-service-status) | ML service degraded mode — expected on fresh install |\n\n---\n\n## Configuration Reference\n\nAll configuration is loaded from environment variables. Copy `.env.example` to `.env` to get started. Every key has a documented default — see `.env.example` for the full reference.\n\n### Core variables\n\n| Variable | Default | Description |\n|---|---|---|\n| `APP_ENV` | `development` | Environment name (`development`, `staging`, `production`) |\n| `APP_PORT` | `8080` | Load balancer HTTP port |\n| `LOG_LEVEL` | `info` | `debug`, `info`, `warn`, `error` |\n| `LOG_FORMAT` | `json` | `json` for structured logs; `text` for human-readable |\n| `LB_STRATEGY` | `least_connections` | Load balancing algorithm |\n| `SERVERS` | _(required)_ | Comma-separated backend URLs |\n\n### Database\n\n| Variable | Default | Description |\n|---|---|---|\n| `DB_HOST` | `localhost` | PostgreSQL host |\n| `DB_PORT` | `5432` | PostgreSQL port |\n| `DB_NAME` | `neurabalancer` | Database name |\n| `DB_USER` | `postgres` | Database user |\n| `DB_PASSWORD` | _(required)_ | Database password |\n| `DB_SSLMODE` | `disable` | Use `require` in production |\n\n### ML Service\n\n| Variable | Default | Description |\n|---|---|---|\n| `ML_MODEL_ENDPOINT` | `http://ml-service:8081` | ONNX model server URL |\n| `ML_MODEL_TIMEOUT_MS` | `300` | Per-prediction timeout |\n| `ML_CIRCUIT_BREAKER_RESET_SECONDS` | `30` | Seconds before circuit breaker auto-resets |\n| `ML_CACHE_SIZE` | `1000` | LRU prediction cache entries |\n\n---\n\n## Load Balancing Strategies\n\nSelect a strategy by setting `LB_STRATEGY` in `.env`:\n\n| Strategy | Value | Description |\n|---|---|---|\n| Least Connections | `least_connections` | Routes to the server with the fewest active requests. Best general-purpose default. |\n| Round Robin | `round_robin` | Cycles through servers in sequence. Fastest algorithm. |\n| Weighted Round Robin | `weighted_round_robin` | Like round robin but honours per-server `weight` values set in the DB or via `SERVER_WEIGHT_*` env vars. |\n| Random | `random` | Selects a server uniformly at random. |\n| ML | `ml` | Uses ONNX model to predict the lowest-latency server. Falls back to Weighted Round Robin if the model service is unavailable or the circuit breaker is open. |\n\n### Changing strategy at runtime\n\nRestart only the load-balancer container — no data loss:\n\n```bash\n# Edit .env: LB_STRATEGY=ml\ndocker compose up -d --no-deps load-balancer\n```\n\n---\n\n## API Reference\n\n| Method | Path | Description |\n|---|---|---|\n| `GET` | `/health/live` | Liveness probe — always 200 if process is up |\n| `GET` | `/health/ready` | Readiness probe — 503 if no healthy backends |\n| `GET` | `/metrics` | Prometheus metrics scrape endpoint |\n| `GET` | `/api/v1/servers` | List all backend servers and their state |\n| `ANY` | `/api/v1/request` | Proxy a request to the best backend |\n| `ANY` | `/api/v1/request/*` | Proxy with arbitrary path suffix |\n\n### Example: Route a request\n\n```bash\ncurl -X POST http://localhost:8080/api/v1/request \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"key\": \"value\"}'\n```\n\n---\n\n## Observability Stack\n\n### Prometheus metrics\n\nAll metrics are prefixed `neurabalancer_`:\n\n| Metric | Type | Description |\n|---|---|---|\n| `http_requests_total` | Counter | Requests by method, path, status, server_id |\n| `http_request_duration_seconds` | Histogram | Request latency |\n| `server_cpu_usage_percent` | Gauge | Backend CPU % |\n| `server_memory_usage_percent` | Gauge | Backend memory % |\n| `server_active_connections` | Gauge | In-flight connections |\n| `server_error_rate` | Gauge | Rolling error rate |\n| `ml_predictions_total` | Counter | Successful ML predictions |\n| `ml_errors_total` | Counter | ML errors (fallback triggers) |\n| `ml_cache_hits_total` | Counter | LRU cache hits |\n| `ml_circuit_breaker_open` | Gauge | 1 = circuit open, 0 = closed |\n| `ml_inference_duration_seconds` | Histogram | ONNX inference latency |\n\n### Structured logs (Loki)\n\nEvery request produces a JSON log line containing:\n`request_id`, `method`, `path`, `status`, `duration`, `remote_ip`, `server_id`\n\nQuery in Grafana:\n```logql\n{service=\"load-balancer\"} | json | status \u003e= 500\n{service=\"load-balancer\"} | json | duration \u003e 1s\n```\n\n### Distributed traces (Tempo)\n\nEvery request carries an OpenTelemetry trace. View in Grafana → Explore → Tempo. Traces link to logs via `traceID` derived field in Loki.\n\n---\n\n## ML Model\n\n### How it works\n\n1. **Feature collection**: For each healthy server, the ML strategy gathers 6 features: `cpu_usage`, `memory_usage`, `active_conns`, `error_rate`, `response_p95`, `capacity`.\n2. **Inference**: Features are sent to the ONNX model server. The server normalises them using `scaler.json` and runs inference.\n3. **Selection**: The server with the lowest predicted score (= expected latency × load) is selected, subject to capacity constraints.\n4. **Fallback**: If inference fails or the circuit breaker is open, the strategy falls back to Weighted Round Robin automatically.\n\n### Training a new model\n\nThe system must have collected at least a few hours of request data before training.\n\n```bash\n# Ensure DB has data, then:\ntask ml-train\n# Outputs: ml/models/load_balancer.onnx, scaler.json, inference_features.json\n\n# Validate and hot-reload the model server:\nbash ml/scripts/deploy_model.sh\n```\n\n### ONNX input/output names\n\nThe model server reads `MODEL_INPUT_NAME` and `MODEL_OUTPUT_NAME` from environment:\n\n```dotenv\nMODEL_INPUT_NAME=features           # matches torch.onnx.export input_names\nMODEL_OUTPUT_NAME=predicted_score   # matches torch.onnx.export output_names\n```\n\nThese default to the values the training script uses. Only change them if you retrain with a different export configuration.\n\n---\n\n## Development Guide\n\n### Running services individually\n\n```bash\n# Start only infrastructure (DB, Redis, observability)\ndocker compose up -d postgres redis prometheus grafana loki tempo otel-collector\n\n# Run load balancer locally\ntask build-balancer\nDB_HOST=localhost SERVERS=http://localhost:8001 ./bin/neurabalancer\n\n# Run a backend server locally (port from env)\nBACKEND_PORT=8001 BACKEND_INSTANCE_ID=local-1 ./bin/backend-server\n```\n\n### Adding a new backend server\n\n1. Add a new service in `docker-compose.yml` using the same pattern as `backend-1`.\n2. Append its URL to the `SERVERS` env var in `docker-compose.yml`.\n3. Run `docker compose up -d`.\n\nNo code changes required.\n\n### Adding a new load-balancing strategy\n\n1. Create a struct implementing the `Strategy` interface in `backend/internal/loadbalancer/`.\n2. Register it in `backend/internal/loadbalancer/factory.go`.\n3. Add the strategy name to the config validation in `backend/internal/config/config.go`.\n4. Add tests in `strategies_test.go`.\n\n---\n\n## Testing\n\n```bash\n# All Go tests with race detector\ntask test\n\n# Tests + HTML coverage report\ntask test-coverage\n\n# Python feature-alignment tests\ntask test-python\n\n# Run a quick load test (requires 'hey')\ntask load-test\n\n# Lint\ntask lint\n```\n\n---\n\n## Deployment\n\n### Docker Compose (recommended for single-node)\n\n```bash\ncp .env.example .env   # Fill in production values\ntask up\n```\n\n### Kubernetes (Helm)\n\n```bash\n# Install\nhelm install neurabalancer deployments/helm/charts/neurabalancer \\\n  --set image.tag=v1.0.0 \\\n  --set db.password=\u003cpassword\u003e \\\n  --set grafana.adminPassword=\u003cpassword\u003e\n\n# Upgrade\nhelm upgrade neurabalancer deployments/helm/charts/neurabalancer \\\n  --set image.tag=v1.1.0\n\n# Status\nkubectl get pods -l app.kubernetes.io/name=neurabalancer\n```\n\n### CI/CD\n\n- **CI** runs on every pull request and push to `main`/`develop` — linting, tests, Docker build check.\n- **CD** runs on version tags (`v*.*.*`) — builds and pushes images to GHCR, then syncs ArgoCD.\n\nSet these repository secrets for CD:\n- `ARGOCD_SERVER` — your ArgoCD server URL\n- `ARGOCD_AUTH_TOKEN` — ArgoCD authentication token\n\n---\n\n## Contributing\n\n1. Fork the repository\n2. Run `task setup` to install hooks and dependencies\n3. Create a feature branch: `git checkout -b feat/my-feature`\n4. Commit using [Conventional Commits](https://www.conventionalcommits.org/): `feat:`, `fix:`, `chore:`, etc.\n5. Push and open a pull request against `main`\n\nPlease keep PRs focused and small. Include tests for new functionality.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsouvik03-136%2Fneurabalancer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsouvik03-136%2Fneurabalancer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsouvik03-136%2Fneurabalancer/lists"}