{"id":50557180,"url":"https://github.com/hmintopia03/failure-playground","last_synced_at":"2026-06-04T08:02:09.856Z","repository":{"id":357580754,"uuid":"1237586337","full_name":"hmintopia03/failure-playground","owner":"hmintopia03","description":"A backend/platform engineering playground for failure handling, worker queues, retries, observability, and tracing.","archived":false,"fork":false,"pushed_at":"2026-05-25T05:41:30.000Z","size":2673,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-25T07:24:08.719Z","etag":null,"topics":["backend","docker","fastapi","grafana","jaeger","observability","opentelemetry","platform-engineering","postgresql","prometheus","redis","worker-queue"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hmintopia03.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-13T10:14:03.000Z","updated_at":"2026-05-25T05:41:34.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hmintopia03/failure-playground","commit_stats":null,"previous_names":["hmintopia03/failure-playground"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hmintopia03/failure-playground","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmintopia03%2Ffailure-playground","tags_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/hmintopia03%2Ffailure-playground/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmintopia03%2Ffailure-playground/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmintopia03%2Ffailure-playground/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hmintopia03","download_url":"https://codeload.github.com/hmintopia03/failure-playground/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmintopia03%2Ffailure-playground/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33895175,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-04T02:00:06.755Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backend","docker","fastapi","grafana","jaeger","observability","opentelemetry","platform-engineering","postgresql","prometheus","redis","worker-queue"],"created_at":"2026-06-04T08:02:09.093Z","updated_at":"2026-06-04T08:02:09.846Z","avatar_url":"https://github.com/hmintopia03.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"﻿# Failure 
Playground\n\n[![CI](https://github.com/hmintopia03/failure-playground/actions/workflows/ci.yml/badge.svg)](https://github.com/hmintopia03/failure-playground/actions/workflows/ci.yml)\n\n## Overview\n\n**Failure Playground** is a platform engineering project that simulates distributed task processing failures and recovery patterns.\n\nIt combines a FastAPI control plane, PostgreSQL persistence, Redis-backed worker queues, OpenTelemetry tracing, Prometheus/Grafana monitoring, Kubernetes deployment, and GitHub Actions CI/CD into a single local platform.\n\nThe goal is to explore how operators detect, investigate, and recover from queue pressure, worker failures, retry storms, stale workers, and other operational issues commonly found in production systems.\n\nIt is intentionally scoped for learning and portfolio review rather than production use.\n\nDeep implementation notes live outside the landing page:\n\n- Dashboard internals: see [docs/dashboard.md](docs/dashboard.md)\n- Reliability mechanics: see [docs/reliability.md](docs/reliability.md)\n- Observability details: see [docs/observability.md](docs/observability.md)\n\n---\n\n## Highlights\n\n- Multi-worker distributed task processing\n- Retry, backoff, poison-task, and stuck-task recovery flows\n- Real-time operations dashboard with replay-safe metrics\n- OpenTelemetry tracing with Jaeger correlation\n- Prometheus and Grafana observability stack\n- Kubernetes deployment with Horizontal Pod Autoscaling\n- GitHub Actions CI pipeline validating tests, Docker builds, and Kubernetes manifests\n\n---\n\n## Architecture\n\nFailure Playground models a small production-style platform focused on reliability, observability, and operational visibility.\n\nThe system consists of four main layers:\n\n- **Application Layer**: FastAPI provides the HTTP control plane, health endpoints, metrics endpoint, and WebSocket bridge. 
Workers consume tasks, perform retries, update task state, and emit operational events.\n    \n- **Operational Signals Layer**: Redis serves as the task queue, pub/sub event bus, and bounded replay history used for dashboard recovery after refresh or reconnect.\n    \n- **Observability Layer**: OpenTelemetry traces flow to Jaeger, while Prometheus scrapes application metrics and Grafana visualizes system behavior.\n    \n- **Platform Layer**: The stack runs through Docker Compose locally and Kubernetes manifests for deployment, persistent storage, observability services, and worker autoscaling.\n    \n\n![architecture](backend/images/architecture.png)\n\n```text\nObservability:\nFastAPI + Workers -\u003e OpenTelemetry -\u003e Jaeger\nFastAPI /prometheus -\u003e Prometheus -\u003e Grafana\n\nPlatform:\nDocker Compose locally\nKubernetes manifests + HPA\nGitHub Actions: tests + Docker builds + Kubernetes validation\n```\n\n### Runtime Flows\n\n**Task Processing**\n\nFastAPI writes task state to PostgreSQL and pushes task IDs into Redis. Workers consume queued tasks, process them, and update task state.\n\n**Realtime Operations**\n\nAPI and workers publish operational events through Redis Pub/Sub. The WebSocket dashboard receives both replayed and live events for operational visibility.\n\n**Observability**\n\nOpenTelemetry traces are exported to Jaeger, while Prometheus collects metrics that Grafana uses for dashboards and system monitoring.\n\nPostgreSQL remains the source of truth for task state. 
Redis is used as transport and short-term operational memory, creating realistic distributed-system scenarios such as duplicate delivery, stale workers, retry storms, and replay boundaries.\n\nFor the complete architecture notes, see [architecture.md](docs/architecture.md).\n\n---\n\n## Core Stack\n\n### Backend\n\n- FastAPI for the HTTP control plane and WebSocket event bridge\n- SQLAlchemy for data access\n- PostgreSQL for durable task, log, worker, and system state\n- Redis for task queueing, pub/sub, and bounded event replay history\n- Alembic for database migrations\n\n### Workers\n\n- Independent Python worker processes\n- Redis queue consumers\n- PostgreSQL task-state transitions\n- Retry, backoff, poison-task handling, and stuck-task recovery\n- OpenTelemetry spans around task lifecycle operations\n\n### Observability\n\n- Structured JSON operational events\n- Prometheus scrape endpoint\n- Grafana dashboard\n- OpenTelemetry tracing\n- Jaeger trace UI\n- Realtime browser dashboard\n\n### Dashboard\n\n- Vanilla JavaScript\n- Native WebSocket API\n- Chart.js for compact charts\n- No separate frontend build pipeline\n\n---\n\n## Dashboard Screenshot\n\nThe realtime operations dashboard combines live operational events,\nreplay-safe metrics, trace correlation, incident workflow tracking,\nand operator-facing system health interpretation.\n\n![Dashboard](backend/images/dashboard-v4.png)\n\nFor dashboard internals, see [docs/dashboard.md](docs/dashboard.md). For observability details, see [docs/observability.md](docs/observability.md).\n\n---\n\n## Kubernetes\n\nThe platform runs on Kubernetes using Deployments, Services, a PostgreSQL PersistentVolumeClaim, shared ConfigMap configuration, and a worker HorizontalPodAutoscaler. 
The manifests live under [k8s/](k8s/), and the full runbook is in [k8s/README.md](k8s/README.md).\n\n![Kubernetes deployment](backend/images/k8s-deployment.png)\n\n![Kubernetes Pods](backend/images/k8s-pods.png)\n\n![Kubernetes HPA](backend/images/k8s-hpa.png)\n\nKey Kubernetes mapping:\n\n- `api`, `worker`, `postgres`, `redis`, `jaeger`, `prometheus`, and `grafana` run as Deployments.\n- Services provide stable in-cluster networking for API, Redis, PostgreSQL, Jaeger, Prometheus, and Grafana.\n- `postgres-pvc` provides local PostgreSQL persistence.\n- `worker-hpa` keeps the worker Deployment between 2 and 5 pods with a 70% CPU utilization target.\n- Prometheus scrapes the API through `api-service:8000`.\n- API and worker traces export to Jaeger through `http://jaeger-service:4317`.\n\nBasic verification:\n\n```bash\nkubectl apply -f k8s/\nkubectl get pods\nkubectl get svc\nkubectl get hpa\n```\n\nFor reliability mechanics such as retries, poison tasks, and stuck-task recovery, see [docs/reliability.md](docs/reliability.md).\n\n---\n\n## CI/CD\n\nGitHub Actions runs a production-style validation pipeline on push and pull request events targeting `main`.\n\n```text\npush / pull_request to main\n        |\n        +--\u003e backend tests\n        |       |\n        |       +--\u003e set up Python 3.13\n        |       +--\u003e install backend dependencies\n        |       +--\u003e run pytest\n        |\n        +--\u003e Docker image builds\n        |       |\n        |       +--\u003e validate docker compose config\n        |       +--\u003e build API image\n        |       +--\u003e build worker image\n        |\n        +--\u003e Kubernetes validation\n                |\n                +--\u003e set up kubectl and kind\n                +--\u003e server-side dry-run apply k8s manifests\n```\n\nThe workflow checks:\n\n- Backend dependency installation from [backend/requirements.txt](backend/requirements.txt)\n- Backend test coverage with `pytest`\n- Existing lint 
or format checks when project tooling is configured\n- Dockerfile validity for API and worker images\n- Docker Compose configuration validity\n- Kubernetes manifest API compatibility with `kubectl apply --dry-run=server`\n\nThe CI pipeline is intentionally lightweight. It validates the current architecture without introducing a separate deployment platform, registry push, or heavyweight lint stack before the project needs one.\n\n---\n\n## Engineering Tradeoffs\n\nThis project intentionally favors clear operational mechanics over production-scale abstraction.\n\n- Redis replay history is bounded to keep the system simple and local.\n- Dashboard metrics are derived from operational events rather than a dedicated analytics backend.\n- WebSocket replay is designed for short-term operator context, not long-term audit.\n- Trace context is added to events so the dashboard can link symptoms to Jaeger without introducing a separate correlation service.\n- Worker polling is not heavily traced to keep Jaeger focused on meaningful task lifecycle spans.\n- The dashboard is vanilla JavaScript so the operational behavior is visible without a frontend build system.\n\nThese tradeoffs keep the playground small enough to understand while still surfacing realistic platform engineering problems.\n\n---\n\n## Known Limitations\n\nFailure Playground is designed for local platform engineering learning and portfolio demonstration, not production readiness.\n\n- Kubernetes manifests target single-node or local clusters.\n- Incident Workflow state is frontend-only.\n- Grafana dashboard provisioning may still be Docker Compose-first.\n- There is no authentication, authorization, or RBAC yet.\n- There is no persistent incident storage yet.\n- PostgreSQL and Redis are local containerized services, not production-grade managed database or cache setups.\n- Worker HPA requires `metrics-server` to be installed and healthy in the local cluster.\n- Redis replay history is intentionally limited to 
the latest 100 non-heartbeat events.\n- Realtime operational history is short-term operator context, not audit-grade retention.\n\n---\n\n## Running Locally\n\nFrom the project root:\n\n```bash\ndocker compose up --build\n```\n\nWhen the API container starts, it runs Alembic migrations before launching the FastAPI server.\n\nOpen:\n\n- Dashboard: \u003chttp://localhost:8001\u003e\n- API docs: \u003chttp://localhost:8001/docs\u003e\n- Prometheus: \u003chttp://localhost:9091\u003e\n- Grafana: \u003chttp://localhost:3000\u003e\n- Jaeger: \u003chttp://localhost:16686\u003e\n\nDefault Grafana login:\n\n```text\nUsername: admin\nPassword: admin\n```\n\nCreate a task:\n\n```bash\ncurl -X POST \"http://localhost:8001/tasks?priority=1\"\n```\n\nCreate a poison task:\n\n```bash\ncurl -X POST \"http://localhost:8001/tasks?priority=1\u0026is_poison=true\"\n```\n\nWatch the realtime event stream from the command line:\n\n```bash\n# Any WebSocket client works; example uses websocat\nwebsocat ws://localhost:8001/ws/operations\n```\n\n\nRun backend tests from the `backend` directory:\n\n```bash\ncd backend\npytest -v\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhmintopia03%2Ffailure-playground","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhmintopia03%2Ffailure-playground","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhmintopia03%2Ffailure-playground/lists"}