{"id":50766533,"url":"https://github.com/kuldeep-poonia/distributed-runtime-brain","last_synced_at":"2026-06-11T14:01:36.383Z","repository":{"id":340556450,"uuid":"1166387095","full_name":"kuldeep-poonia/distributed-runtime-brain","owner":"kuldeep-poonia","description":"Distributed infrastructure platform for deploying, scaling and observing autonomous agents with persistent memory, execution replay and control-plane orchestration.","archived":false,"fork":false,"pushed_at":"2026-03-04T05:13:32.000Z","size":14192,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-03-22T07:52:52.289Z","etag":null,"topics":["backend-architecture","distributed-systems","event-driven","fastapi","infrastructure","microservices","reliability-engineering"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kuldeep-poonia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-25T07:01:40.000Z","updated_at":"2026-03-21T05:14:57.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/kuldeep-poonia/distributed-runtime-brain","commit_stats":null,"previous_names":["poonia-98/ai-infra","kuldeep-poonia/distributed-runtime-brain"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/kuldeep-poonia/distributed-runtime-brain","repository_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/kuldeep-poonia%2Fdistributed-runtime-brain","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuldeep-poonia%2Fdistributed-runtime-brain/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuldeep-poonia%2Fdistributed-runtime-brain/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuldeep-poonia%2Fdistributed-runtime-brain/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kuldeep-poonia","download_url":"https://codeload.github.com/kuldeep-poonia/distributed-runtime-brain/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kuldeep-poonia%2Fdistributed-runtime-brain/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34201842,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backend-architecture","distributed-systems","event-driven","fastapi","infrastructure","microservices","reliability-engineering"],"created_at":"2026-06-11T14:01:35.230Z","updated_at":"2026-06-11T14:01:36.363Z","avatar_url":"https://github.com/kuldeep-poonia.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AgentPlane\n\nAgentPlane is an 
infrastructure platform for deploying and running autonomous AI agents at scale. Think of it the way you think about Kubernetes — you don't manually manage where your containers run, how many replicas spin up, or what happens when a node dies. AgentPlane does the same thing, but for agents.\n\nYou give it an agent. It handles the rest.\n\n---\n\n## What it actually does\n\nMost agent frameworks stop at the prompt. You write a system prompt, wire up some tools, and ship it. That works fine until you have 50 agents running in production, one of them starts misbehaving at 3am, you need to debug exactly what decision it made at step 14 of execution #847, and you're staring at logs wondering what happened.\n\nAgentPlane was built around the operational reality of running agents in production:\n\n- Agents need to be deployed somewhere, scaled when load increases, and restarted when they crash\n- Agents need memory that persists across sessions — not just a context window, but real retrieval\n- Agents talking to other agents across organizations need trust boundaries and billing\n- You need to be able to replay any execution and see every single step\n- Sometimes you want to mutate an agent's prompt, run it against a fitness function, and automatically promote the version that performs better\n\nThe architecture reflects these needs directly. There isn't a \"monitoring addon\" bolted on — observability is wired into every service from the start.\n\n---\n\n## Stack\n\n**Backend:** Python 3.11 / FastAPI — handles API, auth, billing, and orchestration logic  \n**Microservices:** Go 1.21 — the 28 Go services handle the actual runtime work  \n**Frontend:** Next.js 14 (App Router) — the control plane UI  \n**Database:** PostgreSQL 16 with pgvector extension  \n**Cache:** Redis 7  \n**Event bus:** NATS JetStream  \n**Observability:** Prometheus + Grafana + OpenTelemetry\n\n---\n\n## Getting started\n\nYou need Docker and Docker Compose v2. 
That's it.\n\n```bash\ngit clone https://github.com/your-org/agentplane\ncd agentplane\npip install -e cli/\nagentplane install\n```\n\nThe install command will generate your `.env` with random secrets, pull images, run all 10 migrations, seed a demo agent, and open the dashboard. First run takes a few minutes while images download. After that, `agentplane up` starts everything in under 30 seconds.\n\nOnce running:\n\n| URL | What's there |\n|-----|-------------|\n| http://localhost:3000 | Control plane dashboard |\n| http://localhost:8000/docs | API documentation |\n| http://localhost:8200 | Control Brain API |\n| http://localhost:3001 | Grafana (admin/admin) |\n| http://localhost:9090 | Prometheus |\n\n---\n\n## CLI reference\n\n```bash\nagentplane install       # First-time setup — generates secrets, runs migrations, seeds data\nagentplane up            # Start everything\nagentplane down          # Stop everything\nagentplane restart       # Stop then start\nagentplane status        # Live health table of all 35 services\nagentplane logs backend  # Tail logs for a specific service\nagentplane migrate       # Run pending SQL migrations\nagentplane deploy-agent  # Interactive wizard to deploy an agent\nagentplane brain         # Show Control Brain state and agent statuses\nagentplane services      # List all registered microservices and their health\nagentplane reset         # Nuclear option — wipes volumes, starts fresh\nagentplane dashboard     # Open the browser\n```\n\n---\n\n## Project structure\n\n```\nagentplane/\n│\n├── backend/                    # FastAPI application — the main API\n│   ├── main.py                 # App entrypoint, all routers registered here\n│   ├── config.py               # Settings via pydantic-settings\n│   ├── database.py             # Async SQLAlchemy setup\n│   ├── models.py               # ORM models\n│   ├── routers/\n│   │   ├── agents.py           # Core agent CRUD and lifecycle\n│   │   ├── data.py             # Logs, 
events, metrics, executions\n│   │   ├── infrastructure.py   # Schedules, nodes\n│   │   ├── platform_ops.py     # Config, secrets, audit, queue\n│   │   ├── time_travel.py      # Execution replay API\n│   │   ├── brain.py            # Proxy to Control Brain service\n│   │   ├── enterprise/\n│   │   │   ├── auth.py         # JWT auth, SSO\n│   │   │   ├── organisations.py\n│   │   │   └── platform.py     # Workflows, versions, memory, marketplace\n│   │   └── cloud/\n│   │       ├── billing.py\n│   │       ├── observability.py\n│   │       ├── sandbox.py\n│   │       ├── simulation.py\n│   │       ├── regions.py\n│   │       ├── federation.py\n│   │       ├── evolution.py\n│   │       ├── knowledge.py\n│   │       └── vault.py\n│   ├── services/\n│   │   ├── agent_service.py    # Core agent business logic\n│   │   ├── brain_service.py    # HTTP client to Control Brain\n│   │   ├── cache_service.py    # Redis cache layer (replaces DB polling)\n│   │   ├── memory_service.py   # pgvector semantic memory\n│   │   ├── nats_service.py     # NATS connection and subscriptions\n│   │   ├── redis_service.py    # Redis connection and helpers\n│   │   └── ...\n│   └── middleware/\n│       ├── audit.py            # Every write goes to audit_logs\n│       └── rate_limit.py       # Per-org rate limiting via Redis\n│\n├── control-brain/              # Go — Central orchestration layer\n├── executor/                   # Go — Docker-based agent execution\n├── k8s-executor/               # Go — Kubernetes-based execution\n├── memory-vector-engine/       # Go — pgvector semantic memory\n├── agent-autoscaler/           # Go — Scale agents based on queue depth\n├── agent-gateway/              # Go — Agent-to-agent routing\n├── agent-sandbox-manager/      # Go — Isolation policy enforcement\n├── agent-simulation-engine/    # Go — Synthetic workloads and chaos testing\n├── agent-evolution-engine/     # Go — Genetic algorithm optimization\n├── agent-federation-network/   # Go — Cross-org 
agent collaboration\n├── agent-observability/        # Go — Distributed tracing and lineage\n├── billing-engine/             # Go — Plans, invoices, usage\n├── subscription-manager/       # Go — Plan management\n├── usage-meter/                # Go — Real-time resource metering\n├── secrets-manager/            # Go — AES-256-GCM vault\n├── marketplace-service/        # Go — Agent marketplace\n├── marketplace-validator/      # Go — Publish validation\n├── global-scheduler/           # Go — Multi-region placement\n├── region-controller/          # Go — Regional agent management and failover\n├── cluster-controller/         # Go — Cluster lifecycle\n├── workflow-engine/            # Go — DAG workflow execution\n├── event-processor/            # Go — NATS event fan-out\n├── log-processor/              # Go — Batched log ingestion\n├── metrics-collector/          # Go — Container metrics scraping\n├── websocket-gateway/          # Go — Real-time push to UI\n├── reconciliation-service/     # Go — Drift detection and self-healing\n├── scheduler-service/          # Go — Cron and trigger scheduling\n├── node-manager/               # Go — Node heartbeat and capacity\n├── agent-runtime/              # Base agent container image\n│\n├── frontend/                   # Next.js 14 control plane UI\n│   └── src/\n│       ├── app/\n│       │   ├── (auth)/login/   # Login page with boot animation\n│       │   └── (dashboard)/\n│       │       ├── page.tsx        # Overview dashboard\n│       │       ├── agents/         # Agent list and detail\n│       │       ├── brain/          # Control Brain state viewer\n│       │       ├── services/       # Service registry\n│       │       ├── memory/         # Agent memory browser\n│       │       ├── time-travel/    # Execution replay debugger\n│       │       ├── federation/     # Federation peer management\n│       │       ├── evolution/      # Experiment viewer\n│       │       ├── simulation/     # Simulation environments\n│       │       
├── sandbox/        # Isolation policies\n│       │       ├── observability/  # Traces and performance\n│       │       ├── regions/        # Region and failover view\n│       │       ├── billing/        # Plans and invoices\n│       │       ├── vault/          # Secrets management\n│       │       └── knowledge/      # Knowledge base\n│       ├── components/\n│       │   └── Sidebar.tsx     # Navigation with 6 sections\n│       ├── contexts/\n│       │   └── AuthContext.tsx # JWT auth context\n│       ├── lib/\n│       │   ├── api.ts          # All API client methods\n│       │   └── auth.ts         # Token management (memory-only, no localStorage)\n│       └── middleware.ts       # Route protection\n│\n├── migrations/                 # SQL migration files\n│   ├── init.sql\n│   ├── 0004_production_infrastructure.sql\n│   ├── 0005_enterprise.sql\n│   ├── 0006_global_platform.sql\n│   ├── 0007_cloud_platform_extension.sql\n│   ├── 0008_complete_cloud_platform.sql\n│   ├── 0009_advanced_features.sql\n│   └── 0010_platform_completion.sql\n│\n├── cli/\n│   └── agentplane.py           # Single-file CLI tool\n│\n├── docs/\n│   └── ARCHITECTURE.md         # Full architecture documentation\n│\n├── helm/                       # Kubernetes Helm charts\n├── grafana/                    # Grafana dashboard definitions\n├── prometheus/                 # Prometheus config\n├── otel/                       # OpenTelemetry collector config\n├── docker-compose.yml\n└── .env.example\n```\n\n---\n\n## How the services talk to each other\n\nEverything runs inside a Docker network called `platform_network`. Services communicate over HTTP and via NATS for events.\n\nThe general flow:\n\n1. A request comes into the **backend** API on port 8000\n2. The backend either handles it directly or delegates to a Go service\n3. State changes get published to NATS as events\n4. Services that care about that event update their own state\n5. 
The backend subscribes to `agents.events` and keeps Redis in sync — nothing polls the database for agent status\n\nThe **Control Brain** on port 8200 sits above all of this. It runs a reconciliation loop every 10 seconds — comparing what the database says should be running against what's actually running — and issues corrective commands when they diverge. Placement decisions also go through the brain: when you deploy an agent, it picks which node and region gets it based on current capacity.\n\n```\nIncoming request\n    ↓\nBackend API (:8000)\n    ↓\nGo service (executor, billing, etc.)\n    ↓\nNATS: agents.events\n    ↓\nEvent processor fans out\n    ↓\nRedis cache updated — no DB polling\n    ↓\nWebSocket gateway pushes to UI\n```\n\n---\n\n## Key features\n\n### Control Brain\n\nEvery agent deployment, scale event, and failover goes through the Control Brain. It maintains in-memory state of the entire platform and reconciles against reality on a configurable interval. If a service registers and then stops heartbeating, the brain marks it degraded and can trigger failover. The brain's state is visible at `/api/v1/brain/state` and from the CLI with `agentplane brain`.\n\n### Vector Memory\n\nAgents can store memories that persist across executions. The memory engine uses pgvector's HNSW index for cosine similarity search over 1536-dimensional embeddings. Memories have types — episodic, semantic, procedural, working — and TTLs. Short-term memories expire after 24 hours by default. Semantic memories don't expire unless you set a TTL explicitly.\n\nYou need `OPENAI_API_KEY` set for embedding generation, or you can point `EMBEDDING_API_URL` at any OpenAI-compatible endpoint.\n\n### Time Travel Debugger\n\nEvery execution step gets stored — the full prompt, the completion, the tool call and its result, timing, cost. The time travel UI lets you browse the timeline of any execution and click into any step for full detail. 
You can replay from any step, which creates a new execution branch. You can also edit the prompt at a specific step and see what would have happened differently. Useful both for development debugging and production post-mortems.\n\n### Agent Evolution\n\nThe evolution engine runs a genetic algorithm over agent configurations. You create an experiment, define population size and mutation rate, and let it run. Each genome varies the base agent — different prompt phrasing, tool selection, workflow steps. Fitness is a weighted combination of task success rate, latency, and cost. Top performers survive to the next generation. After enough generations, promote the best genome as the new production agent version.\n\n### Federation\n\nOrganizations can connect their AgentPlane instances as federation peers. A research agent in your deployment can delegate a summarization task to an agent running in a partner organization's deployment. Tasks are HMAC-signed. Trust levels — full, partial, sandboxed — control what the remote agent can access. The requesting org pays for the remote execution through a billing settlement system.\n\n### Simulation Engine\n\nBefore pushing a new agent version to production, you can throw synthetic load at it. The simulation engine supports load tests, chaos scenarios that randomly kill agents mid-execution, tool latency injection, and region outage simulation. Results include p50/p95/p99 latency, failure rates, and estimated cost.\n\n### Sandbox Isolation\n\nAgents run inside one of three isolation modes depending on the configured policy: standard Docker namespaces, gVisor for syscall filtering, or Firecracker microVMs for maximum isolation. Sandbox policies define CPU and memory caps, network access rules, and permitted syscalls.\n\n---\n\n## Authentication\n\nThe frontend uses a two-token system. The access token lives in JavaScript memory only — never localStorage, never sessionStorage. 
The refresh token is stored as an HttpOnly cookie, so XSS attacks can't steal it. Route protection runs at two layers: Next.js middleware checks for the session cookie before the page loads, and the auth context does a secondary check client-side. Login has a 5-attempt lockout with a 30-second cooldown.\n\nService-to-service calls inside the platform use `X-Service-API-Key` headers, not user JWT tokens.\n\n---\n\n## Configuration\n\nCopy `.env.example` to `.env` before first run. `agentplane install` does this automatically and replaces placeholder values with real random secrets. Things you'll likely want to set manually:\n\n```bash\n# Required for vector memory\nOPENAI_API_KEY=sk-...\n\n# Required for Stripe billing\nSTRIPE_SECRET_KEY=sk_live_...\nSTRIPE_WEBHOOK_SECRET=whsec_...\n\n# Optional — SSO login\nGOOGLE_CLIENT_ID=...\nGITHUB_CLIENT_ID=...\n```\n\nEverything else has working defaults for local development.\n\n---\n\n## Running migrations manually\n\n```bash\n# Via CLI\nagentplane migrate\n\n# Directly via psql\ndocker compose exec -T postgres psql -U postgres -d agentdb \u003c migrations/0010_platform_completion.sql\n```\n\nMigrations are plain SQL files numbered sequentially. 
They use `CREATE TABLE IF NOT EXISTS` throughout so running them more than once is safe.\n\n---\n\n## Adding a new agent\n\nVia CLI:\n```bash\nagentplane deploy-agent\n```\n\nVia API:\n```bash\ncurl -X POST http://localhost:8000/api/v1/agents \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer \u003ctoken\u003e\" \\\n  -d '{\n    \"name\": \"research-agent\",\n    \"docker_image\": \"agent-runtime:latest\",\n    \"cpu_limit\": 1.0,\n    \"memory_limit\": \"512m\",\n    \"env_vars\": {\n      \"AGENT_ROLE\": \"research\",\n      \"OPENAI_API_KEY\": \"sk-...\"\n    }\n  }'\n```\n\nVia the dashboard: Agents → New Agent.\n\n---\n\n## Port reference\n\n| Port | Service |\n|------|---------|\n| 3000 | Frontend dashboard |\n| 3001 | Grafana |\n| 4222 | NATS |\n| 4317 | OTEL collector (gRPC) |\n| 5432 | PostgreSQL |\n| 6379 | Redis |\n| 8000 | Backend API |\n| 8081 | Executor |\n| 8082 | Event processor |\n| 8083 | Metrics collector |\n| 8084 | WebSocket gateway |\n| 8085 | Reconciliation service |\n| 8086 | Scheduler service |\n| 8087 | Node manager |\n| 8088 | Log processor |\n| 8092 | Workflow engine |\n| 8093 | Agent autoscaler |\n| 8094 | Agent gateway |\n| 8095 | Marketplace service |\n| 8096 | Agent sandbox manager |\n| 8097 | Agent simulation engine |\n| 8098 | Global scheduler |\n| 8099 | Region controller |\n| 8100 | Usage meter |\n| 8101 | Billing engine |\n| 8102 | Subscription manager |\n| 8103 | Memory vector engine |\n| 8104 | Secrets manager |\n| 8105 | Marketplace validator |\n| 8106 | Agent evolution engine |\n| 8107 | Agent federation network |\n| 8108 | Cluster controller |\n| 8109 | Agent observability |\n| 8200 | Control Brain |\n| 9090 | Prometheus |\n\n---\n\n## Deploying to production\n\nThe platform ships with a Helm chart in `helm/` for Kubernetes deployments.\n\n```bash\nhelm install agentplane ./helm/ai-platform \\\n  --namespace agentplane \\\n  --create-namespace \\\n  --set 
backend.secretKey=\"\u003cyour-secret\u003e\" \\\n  --set postgres.password=\"\u003cyour-password\u003e\" \\\n  --set openai.apiKey=\"\u003cyour-key\u003e\"\n```\n\nFor production you'll want external managed services for Postgres, Redis, and NATS rather than the containerized versions. Set the corresponding `DATABASE_URL`, `REDIS_URL`, and `NATS_URL` environment variables. Put the frontend behind a CDN. Set `CORS_ORIGINS` to your actual domain. Make sure all secrets are real random values at least 32 characters long.\n\n---\n\n## Common issues\n\n**Services failing to build with `missing go.sum entry`**\n\nThe Dockerfile needs to copy all source files before running `go mod tidy`, not after. Use this pattern:\n\n```dockerfile\nCOPY . .\nRUN GONOSUMDB=* GOFLAGS=-mod=mod go mod tidy\nRUN CGO_ENABLED=0 GOOS=linux GONOSUMDB=* GOFLAGS=-mod=mod go build -o service .\n```\n\n**Backend fails to start**\n\nCheck Postgres health first. The backend retries 10 times at 3-second intervals, but if Postgres takes more than 30 seconds to come up you'll see connection errors. `agentplane logs postgres` will show what's happening.\n\n**pgvector extension missing**\n\nThe migrations enable it, but if you're connecting to external Postgres the instance needs pgvector installed. Most managed cloud providers support it as an optional extension.\n\n**Control Brain shows no services**\n\nServices self-register with the Control Brain on startup. If the brain started after the other services, they haven't had a chance to register yet. Either restart the services (`docker compose restart backend`) or wait for the next reconciliation cycle. The brain reconciles every 10 seconds by default.\n\n**Memory search returns nothing**\n\nVector search requires `OPENAI_API_KEY` or a compatible embedding endpoint set in your `.env`. Without it, memories are stored without embeddings, so similarity search has nothing to compare against.\n\n---\n\n## License\n\nEnterprise. 
All rights reserved.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuldeep-poonia%2Fdistributed-runtime-brain","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkuldeep-poonia%2Fdistributed-runtime-brain","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkuldeep-poonia%2Fdistributed-runtime-brain/lists"}