{"id":48879228,"url":"https://github.com/viftode4/dds26-17","last_synced_at":"2026-04-16T02:03:23.903Z","repository":{"id":346910798,"uuid":"1158645119","full_name":"viftode4/dds26-17","owner":"viftode4","description":"DDS25 - Distributed microservices with Redis-native durable saga orchestration","archived":false,"fork":false,"pushed_at":"2026-04-12T23:01:42.000Z","size":3516,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-13T00:35:51.234Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/viftode4.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-15T17:50:03.000Z","updated_at":"2026-04-12T22:39:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/viftode4/dds26-17","commit_stats":null,"previous_names":["viftode4/dds26-17"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/viftode4/dds26-17","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viftode4%2Fdds26-17","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viftode4%2Fdds26-17/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viftode4%2Fdds26-17/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viftode4%2Fdds26-17/manifests","owner_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/viftode4","download_url":"https://codeload.github.com/viftode4/dds26-17/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viftode4%2Fdds26-17/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31867716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-15T15:24:51.572Z","status":"online","status_checked_at":"2026-04-16T02:00:06.042Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-16T02:03:21.727Z","updated_at":"2026-04-16T02:03:23.895Z","avatar_url":"https://github.com/viftode4.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Distributed Checkout System\n\nA high-performance, fault-tolerant microservices checkout system built for the\nTU Delft Distributed Data Systems course (DDS26). Implements hybrid 2PC/Saga\ntransaction coordination over NATS JetStream with automatic protocol selection,\ncrash recovery, and horizontal scaling.\n\n# VERY IMPORTANT ALSO CHECK THE CLUSTER-FINAL BRANCH FOR THE REDIS CLUSTER IMPLEMENTATION!!!!!!! WE HAVE 2 VERSIONS. AND WE HAVE 2PC and SAGA IMPLEMENTED AS WELL!!!! 
LOOK AT WHATEVER FLOATS YOUR BOAT!!!!\n\n## Architecture\n\n```\n                     ┌──────────┐\n        HTTP :8000 → │ HAProxy  │ (path-prefix routing, leastconn)\n                     └────┬─────┘\n              ┌──────────┼──────────┐\n              ▼          ▼          ▼\n       ┌────────────┐ ┌──────┐ ┌──────────┐\n       │  Order ×N  │ │Stock │ │Payment×N │\n       │Orchestrator│ │  ×N  │ │          │\n       └──────┬─────┘ └──┬───┘ └────┬─────┘\n              │   NATS JetStream    │\n              └──────────┬──────────┘\n     ┌───────────────────┼───────────────────┐\n     ▼                   ▼                   ▼\n┌──────────┐      ┌───────────┐       ┌───────────┐\n│ Order DB │      │ Stock DB  │       │Payment DB │\n│ master   │      │ master    │       │ master    │\n│ +2 rep   │      │ +2 rep    │       │ +2 rep    │\n└──────────┘      └───────────┘       └───────────┘\n     ↑                                      ↑\n  Sentinel ×3 ─────────────────────────────┘\n  (failover for all 3 clusters)\n```\n\n**Default deployment (23 containers):** 2 order, 2 stock, 2 payment, 3 Dragonfly\nmasters, 6 Dragonfly replicas, 3 Sentinels, 1 NATS, 1 HAProxy, 1 Jaeger, 1\nPrometheus, 1 Grafana.\n\n### Key Features\n\n**Transaction Coordination:**\n- **Hybrid 2PC/Saga** with adaptive protocol selection (hysteresis on abort rate: 2PC→Saga at 10%, Saga→2PC at 5%)\n- **Parallel saga execution** — stock and payment steps run concurrently via `_broadcast()`\n- **Script-loaded atomic Lua operations** for order, stock, and payment — 2PC prepare/commit/abort + saga execute/compensate + direct ops, all with idempotency checks and poison pill guards\n- **Transport-agnostic orchestrator** — pluggable `Transport` protocol; zero application-specific code\n- **Idempotent checkout** with TTL differentiation — 60s for failed (allow retry), 86400s for success (prevent re-charge)\n- **Backpressure control** — async semaphore caps 500 concurrent in-flight checkouts\n- **Forward 
recovery** (retry confirms with exponential backoff before compensating)\n- **Reservation TTL** = 60s (prevents resource leaks)\n\n**Messaging \u0026 Serialization:**\n- **NATS JetStream** commands (durable, deduplicated, WorkQueue retention, memory storage ~28µs ack) + Core NATS inbox replies (lowest latency)\n- **Selective deduplication** — prepare/execute get deterministic `Msg-Id` (prevent double-delivery); commit/abort/compensate intentionally skip it (allow retries, they're idempotent at Lua level)\n- **msgpack serialization** for NATS messages (compact binary, faster than JSON)\n- **NATS reconnect resilience** — auto re-subscribe consumers and recreate memory-storage streams after disconnect\n\n**Crash Recovery \u0026 Consistency:**\n- **Dual-structure WAL** — Redis Stream (audit trail) + SET (active sagas) + HASH (per-saga state) for O(1) recovery instead of O(n) stream scan\n- **Recovery state machine** — per-state strategy: PREPARING→abort all, COMMITTING→must commit (irrevocable), EXECUTING→compensate, COMPENSATING→retry\n- **Bounded recovery retries** — recovery worker retries commit/abort/compensate deterministically from WAL state\n- **Reconciliation loop** — periodic (60s) orphan saga detection; aborts sagas idle \u003e120s\n- **Dead Letter Queue** for permanently failed sagas (audit trail + manual resolution)\n- **Active-active leader election** (atomic Lua SET NX + TTL with heartbeat) — only one order instance runs recovery; all instances execute checkouts\n- **In-process result delivery** — asyncio.Future on happy path (no stream hop); pub/sub + key polling fallback for cross-instance recovery\n\n**7-Layer Consistency Defense:**\n\n| # | Mechanism | What it prevents |\n|---|-----------|-----------------|\n| 1 | Force 2PC when circuit breakers open | Irrevocable saga mutations during suspected partitions |\n| 2 | Dual Sentinel failover detection (PubSub + 10s reconciler) | Stale connections reaching demoted old master |\n| 3 | Poison pill in Lua 
scripts | Late prepare/execute after abort/compensate decision |\n| 4 | Selective NATS retry (1 attempt for prepare/execute) | Double-deduction across Sentinel failover |\n| 5 | No redis-py `retry_on_error` | Late Lua execution after orchestrator moves on |\n| 6 | 2PC commit re-deduction | Lost prepare data after failover (re-applies deduction) |\n| 7 | Saga compensate-on-timeout | Uncompensated mutations from ambiguous timeouts |\n\nFive critical conservation bugs were found during chaos testing (2PC data loss on failover,\nNATS double-deduction, late redis-py retry after abort, timeout without compensation,\nabort/prepare race condition) — all fixed in production code with \u003c0.5ms latency overhead.\nSee [`docs/CONSISTENCY_REPORT.md`](docs/CONSISTENCY_REPORT.md) for the full root cause analysis.\n\n**High Availability \u0026 Performance:**\n- **Sentinel HA** with automatic failover (~5s detection, ~10s failover) and dual detection (fast PubSub event + slow 10s reconciler)\n- **Master-first reads with replica fallback** — GET endpoints prefer the current master and fall back to a replica if the master read fails\n- **Circuit breakers** per service (5 failures → open, 30s recovery → half-open probe)\n- **Connection prewarm** — pre-creates Redis pool connections at startup (256 order, 128 stock/payment)\n- **GC tuning** — gen0 threshold raised 70x (700→50000) to reduce pause frequency for stable throughput\n- **Concurrent JetStream publish + inbox wait** — stream ack (~28µs) runs in parallel with reply wait, no sequential penalty\n\n**Chaos Engineering:**\n- **Fault injection framework** — inject crashes (`os._exit(1)`), delays, errors at any saga phase via HTTP API\n- **Chaos test suite** — network partitions (DB + NATS), cascading multi-service failure, data loss detection, WAL recovery after failover, Sentinel TILT bypass\n\n**Logging:**\n- **Structured JSON logging** via structlog with OpenTelemetry trace ID + span ID auto-injection on every log line\n\n### 
Observability Stack\n\nAll three services expose Prometheus-format `/metrics` endpoints. The small and default\ncompose configs include a full observability stack:\n\n- **Prometheus** — scrapes all service instances every 5s (http://localhost:9090)\n- **Grafana** — pre-provisioned dashboard with saga latencies, throughput, abort rates (http://localhost:3000, admin/admin)\n- **Jaeger** — distributed traces via OpenTelemetry, spans across NATS + Redis (http://localhost:16686)\n\nMetrics include: saga success/failure counts, abort rate, current protocol (2pc/saga),\nleader status, per-protocol latency histograms (p50/p95/p99), and circuit breaker state.\n\n### Orchestrator Package\n\nThe orchestrator (`orchestrator/`) is a **standalone reusable package** with zero\napplication-specific code. It coordinates distributed transactions for any set of\nservices — adding a new service requires zero orchestrator changes. See\n[`orchestrator/README.md`](orchestrator/README.md) for package documentation.\n\n### Data Management \u0026 Consistency Model\n\n**Database-per-service** (Fowler's \"Decentralized Data Management\"): each service owns an\nisolated database instance. Order cannot access stock's data — all cross-service coordination\nflows through the orchestrator via NATS messaging. This enforces loose coupling and allows\nindependent scaling, deployment, and schema evolution per service.\n\n**Consistency guarantees depend on the active protocol:**\n\n| Protocol | Consistency model | When used |\n|----------|------------------|-----------|\n| **2PC** | Serializability — atomic all-or-nothing commit across services | Default; abort rate \u003c 10% |\n| **Saga** | Eventual consistency — execute then compensate on failure | Abort rate \u003e= 10% (high contention) |\n\nThe adaptive protocol selector dynamically adjusts the consistency level based on system\nhealth. Under normal operation, 2PC provides the strongest guarantee. 
When the system is\nunder stress (high abort rate from contention or timeouts), it switches to Saga to maintain\nthroughput at the cost of a weaker consistency window during compensation.\n\n**Replication strategy:**\n- **Async master-replica replication** for read scaling and fault tolerance (2 replicas per master)\n- GET endpoints prefer the current master and fall back to a replica on read error\n- Write operations always go to master\n- Sentinel monitors all 3 clusters and promotes replicas automatically on master failure\n\n**Data partitioning:**\n- **Functional partitioning** at service boundary — order, stock, payment each own their keyspace\n- **Vertical scaling** on this branch (single master per service); the `finishing-touches` branch\n  adds horizontal hash-slot sharding via Valkey Cluster (3 masters per service)\n\n### Stack\n\n| Component | Technology |\n|-----------|-----------|\n| HTTP framework | Starlette (ASGI) |\n| HTTP server | Granian with uvloop |\n| Language | Python 3.13 |\n| Messaging | NATS JetStream (commands) + Core NATS (replies) |\n| Serialization | msgpack |\n| Database | Dragonfly (Redis-compatible, multi-threaded) |\n| Client library | redis.asyncio with hiredis |\n| Failover | Valkey Sentinel ×3 |\n| Gateway | HAProxy (leastconn) |\n| Tracing | OpenTelemetry → Jaeger |\n| Metrics | Prometheus + Grafana |\n\n### Technology Evolution\n\nThe system went through multiple architectural iterations, each driven by benchmarking:\n\n| Layer | v1 | v2 | v3 (current) |\n|-------|----|----|---------------|\n| HTTP framework | Flask (sync) | Quart (async) | **Starlette** (ASGI) |\n| HTTP server | Gunicorn | Hypercorn | **Granian** + uvloop |\n| Python | 3.11 | 3.12 | **3.13** |\n| Gateway | Nginx (round-robin) | — | **HAProxy** (leastconn) |\n| Messaging | Redis Streams (poll) | NATS Core (request-reply) | **NATS JetStream** (durable) + Core inbox |\n| Serialization | JSON | JSON | **msgpack** |\n| Database | Redis/Valkey | — | **Dragonfly** 
(multi-threaded) |\n| Lua scripts | Individual `.lua` files | — | **SCRIPT LOAD + EVALSHA** |\n| Saga execution | Sequential | — | **Parallel** (`_broadcast()`) |\n\n## Prerequisites\n\n- [Docker](https://docs.docker.com/get-docker/) and Docker Compose v2\n- Python 3.11+ (for running tests locally)\n- ~4 GB RAM available for Docker\n\nFor Kubernetes deployment, see [Kubernetes Deployment (Minikube)](#kubernetes-deployment-minikube) below.\n\n## Quick Start\n\n### 1. Start the system\n\n```bash\ndocker compose up --build -d\n```\n\n\u003e **Low-resource machines / shared-host benchmarks:** use the small config. It now targets roughly a 15-CPU system budget so a 16-20 CPU host can keep some headroom for Locust and the OS:\n\u003e ```bash\n\u003e docker compose -f docker-compose-small.yml up --build -d\n\u003e ```\n\nWait for all containers to report healthy (~15-20 seconds):\n\n```bash\ndocker compose ps\n```\n\nAll application services should show `healthy`. The gateway is at **http://localhost:8000**.\n\n### 2. Seed test data\n\n```bash\ncurl -X POST http://localhost:8000/stock/batch_init/100/1000/10\ncurl -X POST http://localhost:8000/payment/batch_init/100/100000\ncurl -X POST http://localhost:8000/orders/batch_init/100/100/100/10\n```\n\n### 3. Smoke test\n\n```bash\ncurl -X POST http://localhost:8000/orders/checkout/0\n# → \"Checkout successful\"\n```\n\n### 4. Run correctness tests\n\nIntegration tests (requires running system):\n\n```bash\npip install requests\npython -m pytest test/test_microservices.py -v\n```\n\nUnit tests (no Docker required):\n\n```bash\npip install -r test/requirements-test.txt\npython -m pytest test/ -v -m \"not integration\"\n```\n\n### 5. Verify consistency\n\nClone the official benchmark:\n\n```bash\ngit clone https://github.com/delftdata/wdm-project-benchmark.git\ncd wdm-project-benchmark\npip install -r requirements.txt\npython run.py\n```\n\nUse the official benchmark as an external consistency check on your machine. 
Treat the exact outcome as an environment-dependent measurement, not a hardcoded guarantee in this README.\n\n### 6. Stop the system\n\n```bash\ndocker compose down\ndocker compose down -v  # also reset all data\n```\n\n## Deployment Configurations\n\nFour compose files target different hardware profiles:\n\n| Config | File | App Instances | CPU Target | Containers | Use Case |\n|--------|------|---------------|------------|------------|----------|\n| Small | `docker-compose-small.yml` | 1/1/1 | ~15 CPU system budget | 23 | 16-20 CPU host, or Docker Desktop / WSL2 with reduced load |\n| Default | `docker-compose.yml` | 2/2/2 | ~30 CPU | 23 | Development / CI |\n| Medium | `docker-compose-medium.yml` | 4/4/4 | ~34 CPU system budget | 26 | ~40 CPU host with headroom for Locust / OS |\n| Large | `docker-compose-large.yml` | 9/7/7 | ~80 CPU system budget | 36 | ~90 CPU Linux host with dedicated Locust headroom |\n\n\"App Instances\" = order / stock / payment service replicas. Small and default configs\ninclude the full observability stack (Jaeger, Prometheus, Grafana); medium and large\nomit it. Small uses 3 read replicas per DB cluster; all others use 2.\n\nUsage:\n\n```bash\ndocker compose -f docker-compose-medium.yml up --build -d\n```\n\nEach config has a matching HAProxy config (`haproxy-small.cfg`, `haproxy-medium.cfg`,\n`haproxy-large.cfg`) with `maxconn` limits tuned for the target concurrency.\n\n## Observability\n\n### Metrics\n\nAll services expose Prometheus-format metrics:\n\n```bash\ncurl http://localhost:8000/orders/metrics\ncurl http://localhost:8000/stock/metrics\ncurl http://localhost:8000/payment/metrics\n```\n\n### Grafana Dashboard\n\nOpen http://localhost:3000 (admin/admin). A pre-provisioned \"Checkout System\" dashboard\nshows real-time saga throughput, latency percentiles, abort rates, and protocol switches.\n\n### Distributed Tracing\n\nOpen http://localhost:16686 (Jaeger UI). 
Traces span across HTTP → orchestrator → NATS →\nservice handler → Redis, with full W3C context propagation.\n\n### DLQ Status\n\n```bash\ncurl http://localhost:8000/orders/dlq/status\n```\n\nShows count and recent entries from the Dead Letter Queue.\n\n## Fault Injection (Chaos Engineering)\n\nStock and payment services expose a fault injection API for testing failure scenarios:\n\n```bash\n# Inject a 500ms delay before the prepare phase\ncurl -X POST http://localhost:8000/stock/fault/set \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"point\":\"before_prepare\",\"action\":\"delay\",\"value\":500}'\n\n# View active fault rules\ncurl http://localhost:8000/stock/fault/rules\n\n# Clear all faults\ncurl -X POST http://localhost:8000/stock/fault/clear\n```\n\nSupported injection points: `before_prepare`, `after_prepare`, `before_execute`,\n`after_execute`, `before_compensate`, `after_compensate`.\n\nSupported actions: `crash` (hard kill), `delay:\u003cms\u003e` (simulate slow service),\n`error` (raise exception).\n\n## Kubernetes Deployment (Minikube)\n\n\u003e **Note:** The Kubernetes manifests were written for an earlier version of the stack\n\u003e (pre-Dragonfly, pre-JetStream, pre-observability) and have **not been updated or tested**\n\u003e with the current architecture. 
They may serve as a starting point but will need\n\u003e modifications to match the current docker-compose setup.\n\nA minikube-based deployment mirrors the same architecture but replaces HAProxy with\nnginx and uses the [Bitnami Valkey Helm chart](https://github.com/bitnami/charts/tree/main/bitnami/valkey)\nfor Redis with Sentinel.\n\n### Prerequisites\n\n- [minikube](https://minikube.sigs.k8s.io/docs/start/) v1.32+\n- [kubectl](https://kubernetes.io/docs/tasks/tools/)\n- [Helm](https://helm.sh/docs/intro/install/) v3\n\n### Deploy\n\n```bash\n./minikube-deploy.sh\n```\n\nThe script starts minikube, builds images inside its Docker daemon, deploys three\nValkey clusters with Sentinel via Helm, and deploys NATS + services + nginx gateway.\n\n```bash\nMINIKUBE_IP=$(minikube ip)\ncurl http://${MINIKUBE_IP}:30080/orders/health\n```\n\n### Teardown\n\n```bash\n./minikube-teardown.sh\n```\n\n### Key differences from docker-compose\n\n| | docker-compose | minikube |\n|---|---|---|\n| Gateway | HAProxy (leastconn) | nginx (round-robin via kube-proxy) |\n| Database | Dragonfly | Bitnami Valkey 9.x Helm chart |\n| Sentinel | 3 standalone containers | Sidecar per Valkey pod |\n| Shared code | Volume-mounted at runtime | Baked into image at build time |\n| Service discovery | Docker DNS | CoreDNS + kube-proxy |\n| Entry point | `localhost:8000` | `\u003cminikube-ip\u003e:30080` (NodePort) |\n\n## Performance Results\n\nBenchmark artifacts and scripts are included in `test/benchmark_results/` and `test/run_*benchmark*.sh`.\nRerun them on the target machine before citing concrete throughput numbers.\n\n\u003e **Note:** Docker Desktop / WSL2 add significant virtualization overhead. Use a native\n\u003e Linux machine for professor-facing measurements when possible. 
See\n\u003e [`KNOWN_ISSUES.md`](KNOWN_ISSUES.md) for environment-related performance caveats.\n\n## Stress Testing / Benchmarking\n\n### Using the course benchmark\n\n```bash\ncd wdm-project-benchmark\npython run.py\n```\n\nExercises concurrent checkouts and verifies no money or stock is lost.\n\n### Using Locust\n\n```bash\npip install locust\nlocust -f test/locustfile.py --host=http://localhost:8000 --users 200 --spawn-rate 20\n```\n\nOpen http://localhost:8089 for the Locust web UI with live throughput/latency charts.\n\n## Testing Fault Tolerance\n\n### Kill a service instance\n\n```bash\ndocker compose stop order-service-1\n# System continues via order-service-2\ndocker compose start order-service-1\n```\n\n### Kill a database master (triggers Sentinel failover)\n\n```bash\ndocker compose stop stock-db\n# Sentinel detects the failure in ~5s and promotes a replica (~10s failover)\n# Services reconnect automatically\ndocker compose start stock-db\n```\n\n### Kill during checkout\n\n```bash\n# Terminal 1: start checkout\ncurl -X POST http://localhost:8000/orders/checkout/{order_id}\n\n# Terminal 2: kill stock mid-transaction\ndocker compose stop stock-service\n```\n\nThe WAL ensures the saga is either completed or compensated on recovery.\n\n## Project Structure\n\n```\n├── order/                  # Order service (hosts orchestrator)\n│   ├── app.py              # HTTP endpoints + checkout_tx definition\n│   └── Dockerfile\n├── stock/                  # Stock service (NATS subscriber)\n│   └── app.py\n├── payment/                # Payment service (NATS subscriber)\n│   └── app.py\n├── orchestrator/           # Reusable 2PC/Saga orchestrator package\n│   ├── core.py             # Main orchestrator (adaptive protocol selection)\n│   ├── executor.py         # TwoPCExecutor + SagaExecutor (parallel broadcast)\n│   ├── recovery.py         # RecoveryWorker (WAL scan, reconciliation)\n│   ├── leader.py           # Leader election (SET NX + TTL)\n│   ├── wal.py              # Write-ahead log (Redis 
Streams)\n│   ├── transport.py        # Transport protocol abstraction\n│   ├── definition.py       # Step + TransactionDefinition\n│   ├── metrics.py          # Latency histograms + abort rate\n│   └── README.md           # Package documentation\n├── common/                 # Shared utilities\n│   ├── config.py           # Redis connection factory (Sentinel-aware)\n│   ├── db.py               # Database helpers\n│   ├── nats_transport.py   # NATS JetStream commands + Core inbox replies\n│   ├── result.py           # Pub/sub wait_for_result\n│   ├── dlq.py              # Dead Letter Queue (Redis Stream)\n│   ├── fault_injection.py  # Chaos engineering fault injector\n│   ├── tracing.py          # OpenTelemetry setup + W3C propagation\n│   └── logging.py          # structlog setup\n├── lua/                    # 16 atomic Lua functions across 3 libraries + individual scripts\n│   ├── order_lib.lua       # order_add_item, order_load_and_claim\n│   ├── stock_lib.lua       # 2PC prepare/commit/abort + saga execute/compensate + direct add/subtract\n│   ├── payment_lib.lua     # 2PC prepare/commit/abort + saga execute/compensate + direct add/subtract\n│   └── *.lua               # Individual script files (loaded via EVALSHA)\n├── observability/\n│   ├── prometheus.yml      # Prometheus scrape config\n│   └── grafana/            # Grafana provisioning + dashboards\n├── test/                   # ~113 tests (unit + integration + chaos)\n│   ├── test_microservices.py         # End-to-end API tests\n│   ├── test_circuit_breaker.py       # Circuit breaker behavior\n│   ├── test_crash_recovery.py        # WAL recovery scenarios\n│   ├── test_executor.py              # 2PC + Saga executor logic\n│   ├── test_orchestrator_core.py     # Orchestrator + leader election\n│   ├── test_recovery.py              # Recovery worker\n│   ├── test_sentinel_failover.py     # Sentinel failover integration\n│   ├── test_stress.py                # Load/stress tests\n│   ├── test_wal_metrics.py    
       # WAL + metrics\n│   ├── test_chaos.py                 # Network partition, cascading failure\n│   ├── test_chaos_framework.py       # Fault injection framework\n│   ├── test_new_unit_tests.py        # Conservation, multi-item, etc.\n│   ├── test_new_integration_tests.py # Integration coverage\n│   └── locustfile.py                 # Locust load test definition\n├── tla/                    # TLA+ formal specification\n│   └── ServicesConsistencyPlusCal.tla\n├── docs/\n│   ├── plans/2026-02-15-system-design.md     # System design document\n│   ├── architectural_compliance_report.md    # Assignment compliance analysis\n│   ├── CONSISTENCY_REPORT.md                 # Conservation bug audit + fixes\n│   ├── tla_consistency_fault_tolerance.md    # TLA+ consistency proofs\n│   └── stress_test_results.png               # Benchmark chart\n├── docker-compose.yml            # Default 23-container deployment (~30 CPU)\n├── docker-compose-small.yml      # Single-instance + observability (~15 CPU budget)\n├── docker-compose-medium.yml     # 4x instances (~34 CPU budget)\n├── docker-compose-large.yml      # 9/7/7 instances (~80 CPU budget)\n├── haproxy*.cfg                  # HAProxy configs per deployment size\n├── sentinel.conf                 # Sentinel configuration\n├── sentinel-entrypoint.sh        # Sentinel startup script\n├── k8s/                          # Kubernetes manifests\n├── helm-config/                  # Bitnami Valkey Helm values\n├── minikube-deploy.sh            # Minikube deployment script\n├── minikube-teardown.sh          # Minikube teardown script\n├── KNOWN_ISSUES.md               # Known issues + optimization history\n├── contributions.txt             # Team contributions\n└── README.md\n```\n\n## API Reference\n\nAll endpoints are available via the gateway at `http://localhost:8000`.\n\n### Order Service (`/orders/`)\n\n| Method | Endpoint | Description |\n|--------|----------|-------------|\n| POST | `/orders/create/{user_id}` | Create order, 
returns `{\"order_id\": \"...\"}` |\n| GET | `/orders/find/{order_id}` | Get order details (id, paid, items, user, cost) |\n| POST | `/orders/addItem/{order_id}/{item_id}/{quantity}` | Add item to order |\n| POST | `/orders/checkout/{order_id}` | Execute checkout (2PC or Saga) |\n| POST | `/orders/batch_init/{n}/{n_items}/{n_users}/{item_price}` | Seed test data |\n| GET | `/orders/metrics` | Prometheus-format metrics |\n| GET | `/orders/dlq/status` | Dead Letter Queue status |\n| GET | `/orders/health` | Health check |\n\n### Stock Service (`/stock/`)\n\n| Method | Endpoint | Description |\n|--------|----------|-------------|\n| POST | `/stock/item/create/{price}` | Create item, returns `{\"item_id\": \"...\"}` |\n| GET | `/stock/find/{item_id}` | Get item stock and price |\n| POST | `/stock/add/{item_id}/{amount}` | Add stock |\n| POST | `/stock/subtract/{item_id}/{amount}` | Subtract stock |\n| POST | `/stock/batch_init/{n}/{starting_stock}/{item_price}` | Seed test data |\n| GET | `/stock/metrics` | Prometheus-format metrics |\n| POST | `/stock/fault/set` | Set fault injection rule |\n| POST | `/stock/fault/clear` | Clear all fault rules |\n| GET | `/stock/fault/rules` | View active fault rules |\n| GET | `/stock/health` | Health check |\n\n### Payment Service (`/payment/`)\n\n| Method | Endpoint | Description |\n|--------|----------|-------------|\n| POST | `/payment/create_user` | Create user, returns `{\"user_id\": \"...\"}` |\n| GET | `/payment/find_user/{user_id}` | Get user credit |\n| POST | `/payment/add_funds/{user_id}/{amount}` | Add credit |\n| POST | `/payment/pay/{user_id}/{amount}` | Direct payment (deduct credit) |\n| POST | `/payment/batch_init/{n}/{starting_money}` | Seed test data |\n| GET | `/payment/metrics` | Prometheus-format metrics |\n| POST | `/payment/fault/set` | Set fault injection rule |\n| POST | `/payment/fault/clear` | Clear all fault rules |\n| GET | `/payment/fault/rules` | View active fault rules |\n| GET | `/payment/health` | 
Health check |\n\n## Formal Verification\n\nThe `tla/` directory contains a TLA+/PlusCal specification (`ServicesConsistencyPlusCal.tla`)\nthat formally verifies the consistency and fault tolerance properties of the checkout\nprotocol. See [`docs/tla_consistency_fault_tolerance.md`](docs/tla_consistency_fault_tolerance.md)\nfor the analysis.\n\n## Documentation\n\n| Document | Description |\n|----------|-------------|\n| [`docs/plans/2026-02-15-system-design.md`](docs/plans/2026-02-15-system-design.md) | Full system design, protocol descriptions, failure analysis |\n| [`docs/architectural_compliance_report.md`](docs/architectural_compliance_report.md) | How design decisions compare to production systems (Amazon, Uber, Stripe) |\n| [`docs/CONSISTENCY_REPORT.md`](docs/CONSISTENCY_REPORT.md) | Conservation bugs found during chaos testing, root causes, and fixes |\n| [`docs/tla_consistency_fault_tolerance.md`](docs/tla_consistency_fault_tolerance.md) | TLA+ formal verification of consistency properties |\n| [`orchestrator/README.md`](orchestrator/README.md) | Orchestrator package API and design |\n| [`KNOWN_ISSUES.md`](KNOWN_ISSUES.md) | Known issues, optimization history, and WSL2 performance notes |\n\n## Logs\n\n```bash\ndocker compose logs -f order-service-1 order-service-2\ndocker compose logs -f stock-service stock-service-2\ndocker compose logs -f payment-service payment-service-2\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fviftode4%2Fdds26-17","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fviftode4%2Fdds26-17","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fviftode4%2Fdds26-17/lists"}