{"id":29416352,"url":"https://github.com/roybidani/sre-lab-infra","last_synced_at":"2026-04-09T17:52:46.896Z","repository":{"id":301361694,"uuid":"1009010077","full_name":"RoyBidani/sre-lab-infra","owner":"RoyBidani","description":"🚀 Complete SRE Training Environment - Production-grade infrastructure with Kubernetes, Prometheus, Grafana, and advanced SRE practices for hands-on learning","archived":false,"fork":false,"pushed_at":"2025-06-26T12:46:52.000Z","size":0,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-26T13:42:03.324Z","etag":null,"topics":["aws","chaos-engineering","devops","grafana","kubernetes","monitoring","prometheus","sre","terraform","training"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RoyBidani.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-26T12:41:21.000Z","updated_at":"2025-06-26T12:46:56.000Z","dependencies_parsed_at":"2025-06-26T13:42:05.831Z","dependency_job_id":"ba43411a-831b-43b4-8474-fcec6f56dbe6","html_url":"https://github.com/RoyBidani/sre-lab-infra","commit_stats":null,"previous_names":["roybidani/sre-lab-infra"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/RoyBidani/sre-lab-infra","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoyBidani%2Fsre-lab-infra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoyBidani%2Fsre-lab-infra/tags","releases_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoyBidani%2Fsre-lab-infra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoyBidani%2Fsre-lab-infra/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RoyBidani","download_url":"https://codeload.github.com/RoyBidani/sre-lab-infra/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RoyBidani%2Fsre-lab-infra/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264878580,"owners_count":23677451,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","chaos-engineering","devops","grafana","kubernetes","monitoring","prometheus","sre","terraform","training"],"created_at":"2025-07-11T19:03:01.326Z","updated_at":"2025-12-30T22:10:57.957Z","avatar_url":"https://github.com/RoyBidani.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🚀 Complete SRE Training Environment\n\nA comprehensive, hands-on Site Reliability Engineering (SRE) training platform built on AWS EKS with Kubernetes, featuring real-world SRE practices including SLO monitoring, alerting, chaos engineering, and incident response.\n\n## 📋 Table of Contents\n\n- [Overview](#overview)\n- [What You'll Build](#what-youll-build)\n- [Prerequisites](#prerequisites)\n- [Architecture](#architecture)\n- [Step-by-Step Setup](#step-by-step-setup)\n- [Understanding Your Environment](#understanding-your-environment)\n- [Testing and Verification](#testing-and-verification)\n- [Learning 
Exercises](#learning-exercises)\n- [Troubleshooting](#troubleshooting)\n- [Advanced Topics](#advanced-topics)\n- [Cleanup](#cleanup)\n\n## 📚 **Complete Documentation Suite**\n\nThis project includes comprehensive documentation covering every aspect of SRE:\n\n- 🚀 **[MONITORING-SETUP.md](MONITORING-SETUP.md)** - Complete monitoring stack guide with dashboard access\n- 📊 **[DASHBOARD-EXPLANATION.md](DASHBOARD-EXPLANATION.md)** - Detailed explanation of every metric and chart\n- 🎯 **[SRE-FUNDAMENTALS.md](SRE-FUNDAMENTALS.md)** - Complete beginner's guide to SRE concepts and technologies\n- 🛠️ **[OPERATIONAL-PROCEDURES.md](OPERATIONAL-PROCEDURES.md)** - Day-2 operations, backup, security, and troubleshooting\n- 🏗️ **[TECHNOLOGY-GUIDE.md](TECHNOLOGY-GUIDE.md)** - Architecture decisions and technology comparisons\n- 🤝 **[CONTRIBUTING.md](CONTRIBUTING.md)** - Contribution guidelines and development workflow\n\n## 🎯 Overview\n\nThis project creates a production-grade SRE training environment where you'll learn:\n\n- **Infrastructure as Code** with Terraform\n- **Container orchestration** with Kubernetes (EKS)\n- **Observability** with Prometheus and Grafana\n- **SLO/SLI monitoring** and error budget management\n- **Incident response** and chaos engineering\n- **Real-world SRE practices** used by tech giants\n\n# \n\n### Why This Architecture?\n\nWe chose this specific technology stack because:\n\n1. **AWS EKS**: Managed Kubernetes reduces operational overhead while teaching K8s concepts\n2. **Terraform**: Industry-standard IaC tool with extensive AWS support\n3. **Prometheus**: De facto standard for Kubernetes monitoring with powerful query language\n4. **Grafana**: Best-in-class visualization with extensive community dashboards\n5. 
**Chaos Engineering**: Essential for building resilient systems\n\n## 🏗️ What You'll Build\n\n### Infrastructure Components\n\n- **AWS VPC** with public/private subnets across 2 AZs\n- **EKS Cluster** with managed node groups (2 t3.medium instances)\n- **Application Load Balancers** for external access\n- **NAT Gateways** for private subnet internet access\n- **Security Groups** and IAM roles for least privilege access\n\n### Application Stack\n\n- **3-tier SRE Shop application** (Frontend, Backend, Database)\n- **Nginx frontend** with custom SRE interface\n- **HTTP Echo backend** with health endpoints\n- **Redis database** for session storage\n\n### Monitoring \u0026 SRE Stack\n\n- **Prometheus** for metrics collection and alerting rules\n- **Grafana** with custom SLO dashboards\n- **AlertManager** for intelligent alert routing\n- **Chaos Monkey** for automated resilience testing\n- **SLO monitoring** with error budget tracking\n- **Incident response runbooks** for common scenarios\n\n## ✅ Prerequisites\n\n### Required Tools\n\n```bash\n# Install AWS CLI\ncurl \"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip\" -o \"awscliv2.zip\"\nunzip awscliv2.zip\nsudo ./aws/install\n\n# Install Terraform\nwget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg\necho \"deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main\" | sudo tee /etc/apt/sources.list.d/hashicorp.list\nsudo apt update \u0026\u0026 sudo apt install terraform\n\n# Install kubectl\ncurl -LO \"https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl\"\nsudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl\n```\n\n### AWS Account Setup\n\n1. **AWS Account** with administrative privileges\n2. **AWS CLI configured** with access keys\n3. 
**Sufficient limits** for:\n   - 2 t3.medium EC2 instances\n   - 2 Application Load Balancers\n   - 2 NAT Gateways\n   - 2 Elastic IPs\n\n### Verify Prerequisites\n\n```bash\n# Test AWS access\naws sts get-caller-identity\n\n# Test Terraform\nterraform version\n\n# Test kubectl\nkubectl version --client\n```\n\n## 🏛️ Architecture\n\n### Network Architecture\n\n```\nInternet Gateway\n       |\n   Public Subnets (10.0.1.0/24, 10.0.3.0/24)\n       |\n   Load Balancers \u0026 NAT Gateways\n       |\n   Private Subnets (10.0.2.0/24, 10.0.4.0/24)\n       |\n   EKS Worker Nodes\n```\n\n### Application Architecture\n\n```\nInternet → ALB → Frontend (Nginx) → Backend (HTTP Echo) → Redis\n                     ↓\n                Prometheus ← Metrics\n                     ↓\n                 Grafana ← Visualization\n                     ↓\n               AlertManager ← Notifications\n```\n\n### Why This Architecture?\n\n1. **Security**: Worker nodes in private subnets with no direct internet access\n2. **High Availability**: Resources spread across multiple AZs\n3. **Scalability**: Managed node groups can auto-scale based on demand\n4. **Observability**: Comprehensive monitoring from infrastructure to application\n5. 
**Resilience**: Chaos engineering tests failure scenarios\n\n## 📖 Step-by-Step Setup\n\n### Phase 1: Infrastructure Deployment\n\n#### Step 1: Clone and Prepare\n\n```bash\ngit clone \u003cyour-repo-url\u003e\ncd sre-lab-infra\n```\n\n#### Step 2: Configure Terraform Variables\n\nThe infrastructure uses sensible defaults, but you can customize:\n\n```bash\n# terraform/variables.tf contains:\n# - aws_region = \"eu-central-1\"\n# - vpc_cidr = \"10.0.0.0/16\" \n# - public_subnets = [\"10.0.1.0/24\", \"10.0.3.0/24\"]\n# - private_subnets = [\"10.0.2.0/24\", \"10.0.4.0/24\"]\n# - cluster_name = \"sre-lab-eks\"\n```\n\n**Why these defaults?**\n\n- **eu-central-1**: Cost-effective region with good availability\n- **10.0.0.0/16**: Provides 65,536 IPs for future expansion\n- **Separate AZs**: Ensures high availability\n- **Public/Private split**: Security best practice\n\n#### Step 3: Deploy Infrastructure\n\n```bash\ncd terraform\nterraform init\nterraform plan\nterraform apply\n```\n\n**What happens here:**\n\n1. **VPC Creation**: Isolated network environment\n2. **Subnet Creation**: Public for load balancers, private for applications\n3. **Gateway Setup**: Internet and NAT gateways for connectivity\n4. **EKS Cluster**: Managed Kubernetes control plane\n5. **Node Groups**: EC2 instances joined to the cluster\n\n**This takes 10-15 minutes** because:\n\n- EKS control plane provisioning: ~10 minutes\n- Node group creation: ~5 minutes\n- DNS propagation: ~2 minutes\n\n#### Step 4: Configure kubectl\n\n```bash\naws eks update-kubeconfig --region eu-central-1 --name sre-lab-eks\nkubectl get nodes\n```\n\n**Troubleshooting Access Issues:**\nIf you get authentication errors:\n\n1. **Check IAM permissions**: User needs `AmazonEKSClusterAdminPolicy`\n2. **Add to EKS access**: Go to AWS Console → EKS → Cluster → Access → Add IAM user\n3. 
**Verify AWS CLI**: `aws sts get-caller-identity`\n\n### Phase 2: Application Deployment\n\n#### Step 5: Deploy the SRE Shop Application\n\n```bash\n# Deploy application components\nkubectl apply -f k8s-manifests/app/namespace.yaml\nkubectl apply -f k8s-manifests/app/redis.yaml\nkubectl apply -f k8s-manifests/app/backend.yaml\nkubectl apply -f k8s-manifests/app/frontend.yaml\n```\n\n**Understanding the Application:**\n\n1. **Namespace**: Logical isolation within the cluster\n   \n   ```yaml\n   apiVersion: v1\n   kind: Namespace\n   metadata:\n     name: sre-shop\n   ```\n\n2. **Redis Database**: Key-value store for session data\n   \n   - **Why Redis?** Fast, reliable, commonly used in microservices\n   - **Configuration**: Single instance with persistent volume\n   - **Monitoring**: Health checks and resource limits\n\n3. **Backend API**: HTTP echo service\n   \n   - **Why HTTP Echo?** Simple, predictable responses for testing\n   - **Features**: Health endpoints, JSON responses, environment info\n   - **Scaling**: 2 replicas for redundancy\n\n4. 
**Frontend**: Nginx reverse proxy\n   \n   - **Why Nginx?** Industry standard, efficient, configurable\n   - **Role**: Serves static content and proxies API calls\n   - **Configuration**: Custom HTML with SRE interface\n\n#### Step 6: Verify Application Deployment\n\n```bash\n# Check pod status\nkubectl get pods -n sre-shop\n\n# Expected output:\n# NAME                           READY   STATUS    RESTARTS   AGE\n# backend-api-xxx                1/1     Running   0          2m\n# frontend-xxx                   1/1     Running   0          2m\n# redis-xxx                      1/1     Running   0          2m\n\n# Get application URL\nkubectl get services -n sre-shop\n```\n\n### Phase 3: Monitoring Stack\n\n#### Step 7: Deploy Monitoring Infrastructure\n\n```bash\n# Deploy monitoring namespace and RBAC\nkubectl apply -f k8s-manifests/monitoring/namespace.yaml\nkubectl apply -f k8s-manifests/monitoring/prometheus-rbac.yaml\n\n# Deploy Prometheus\nkubectl apply -f k8s-manifests/monitoring/prometheus-configmap.yaml\nkubectl apply -f k8s-manifests/monitoring/prometheus-deployment.yaml\n\n# Deploy Grafana\nkubectl apply -f k8s-manifests/monitoring/grafana-configmap.yaml\nkubectl apply -f k8s-manifests/monitoring/grafana-deployment.yaml\n```\n\n**Understanding Prometheus Configuration:**\n\n1. **Service Discovery**: Automatically finds Kubernetes services\n   \n   ```yaml\n   kubernetes_sd_configs:\n   - role: pod\n   ```\n\n2. **Scrape Configs**: Defines what metrics to collect\n   \n   ```yaml\n   - job_name: 'sre-shop-backend'\n     kubernetes_sd_configs:\n     - role: pod\n   ```\n\n3. 
**Alerting Rules**: Conditions that trigger alerts\n   \n   ```yaml\n   - alert: HighErrorRate\n     expr: rate(errors[5m]) \u003e 0.1\n   ```\n\n**Why This Monitoring Stack?**\n\n- **Prometheus**: Pull-based metrics, powerful query language (PromQL)\n- **Grafana**: Rich visualizations, templating, alerting\n- **Integration**: Purpose-built for Kubernetes environments\n\n### Phase 4: SRE Practices Implementation\n\n#### Step 8: Deploy SLO Monitoring\n\n```bash\n./scripts/deploy-sre-practices.sh\n```\n\nThis comprehensive script:\n\n1. **Deploys SLO definitions** and recording rules\n2. **Sets up alerting** based on SLO violations\n3. **Installs chaos engineering** tools\n4. **Configures dashboards** for SLO visualization\n\n**Understanding SLOs (Service Level Objectives):**\n\nSLOs define reliability targets for your service:\n\n1. **Availability SLO**: 99.9% uptime\n   \n   ```promql\n   sre_shop:availability_sli:rate5m = (\n     sum(avg_over_time(up{job=\"sre-shop-backend\"}[5m])) /\n     count(up{job=\"sre-shop-backend\"})\n   )\n   ```\n\n2. **Error Rate SLO**: \u003c 0.1% error rate\n   \n   ```promql\n   sre_shop:error_rate_sli:rate5m = (\n     1 - sum(rate(http_requests_total{status=~\"5..\"}[5m])) /\n     sum(rate(http_requests_total[5m]))\n   )\n   ```\n\n3. 
**Latency SLO**: \u003c 500ms P95 response time\n   \n   ```promql\n   sre_shop:latency_sli:p95_5m = \n     histogram_quantile(0.95, rate(response_time_bucket[5m]))\n   ```\n\n**Error Budget Calculation:**\n\n- **Error Budget** = (1 - SLO) × Time Window\n- **Example**: 99.9% SLO = 0.1% error budget = 43.2 minutes/month downtime\n\n## 🔍 Understanding Your Environment\n\n### What Runs Where?\n\n#### **EKS Control Plane** (AWS Managed)\n\n- **Location**: AWS-managed, multi-AZ\n- **Purpose**: Kubernetes API server, etcd, scheduler\n- **Access**: Via kubectl and AWS console\n- **Cost**: $0.10/hour for cluster management\n\n#### **Worker Nodes** (Your EC2 Instances)\n\n- **Instance Type**: t3.medium (2 vCPU, 4GB RAM)\n- **Count**: 2 instances across different AZs\n- **Location**: Private subnets (10.0.2.0/24, 10.0.4.0/24)\n- **Purpose**: Run your application pods\n\n#### **Load Balancers** (AWS Managed)\n\n- **Type**: Application Load Balancer (ALB)\n- **Purpose**: Distribute traffic to application services\n- **Location**: Public subnets\n- **DNS**: Auto-generated AWS hostnames\n\n### Kubernetes Components Explained\n\n#### **Namespaces**: Logical Separation\n\n```bash\nkubectl get namespaces\n\n# sre-shop: Your application\n# monitoring: Prometheus, Grafana\n# kube-system: Kubernetes core components\n# default: Default namespace (unused)\n```\n\n#### **Pods**: Smallest Deployable Units\n\n```bash\nkubectl get pods -n sre-shop\n\n# Each pod contains one or more containers\n# Pods are ephemeral - they come and go\n# Pod IP addresses change when recreated\n```\n\n#### **Services**: Stable Network Endpoints\n\n```bash\nkubectl get services -n sre-shop\n\n# ClusterIP: Internal cluster communication\n# LoadBalancer: External internet access\n# Services provide stable IPs and DNS names\n```\n\n#### **Deployments**: Manage Pod Replicas\n\n```bash\nkubectl get deployments -n sre-shop\n\n# Deployment manages ReplicaSets\n# ReplicaSets manage Pods\n# Provides rolling updates and 
rollbacks\n```\n\n### How to Identify and Check Components\n\n#### **Check Cluster Health**\n\n```bash\n# Overall cluster status\nkubectl cluster-info\n\n# Node health and capacity\nkubectl describe nodes\n\n# Resource usage\nkubectl top nodes\nkubectl top pods -n sre-shop\n```\n\n#### **Identify Node Roles**\n\n```bash\n# List nodes with labels\nkubectl get nodes --show-labels\n\n# Each node will show:\n# - kubernetes.io/arch=amd64\n# - kubernetes.io/instance-type=t3.medium\n# - topology.kubernetes.io/zone=eu-central-1a\n```\n\n#### **Monitor Application Health**\n\n```bash\n# Pod status and restarts\nkubectl get pods -n sre-shop -o wide\n\n# Pod logs\nkubectl logs \u003cpod-name\u003e -n sre-shop\n\n# Pod events\nkubectl describe pod \u003cpod-name\u003e -n sre-shop\n```\n\n#### **Check SLO Metrics**\n\n```bash\n# Port forward to Prometheus\nkubectl port-forward -n monitoring service/prometheus-service 9090:9090\n\n# In browser: http://localhost:9090\n# Query: sre_shop:availability_sli:rate5m\n```\n\n#### **Monitor Chaos Engineering**\n\n```bash\n# Check Chaos Monkey status\nkubectl get pods -n sre-shop -l app=chaos-monkey\n\n# Watch chaos events\nkubectl logs -f deployment/chaos-monkey -n sre-shop\n```\n\n### Understanding SLI/SLO in Practice\n\n#### **Service Level Indicators (SLIs)**\n\nQuantitative measures of service behavior:\n\n1. **Availability SLI**\n   \n   - **Definition**: Percentage of successful requests\n   - **Measurement**: HTTP 200 responses / Total HTTP requests\n   - **Why Important**: Directly impacts user experience\n\n2. **Latency SLI**\n   \n   - **Definition**: Time to respond to requests\n   - **Measurement**: 95th percentile response time\n   - **Why 95th**: Balances user experience with extreme outliers\n\n3. 
**Saturation SLI**\n   \n   - **Definition**: Resource utilization levels\n   - **Measurement**: CPU/Memory/Storage usage percentage\n   - **Why Important**: Predicts capacity issues\n\n#### **Service Level Objectives (SLOs)**\n\nTargets for SLI performance:\n\n1. **Setting SLOs**\n   \n   - **Too strict**: Expensive to maintain, limits innovation\n   - **Too loose**: Poor user experience\n   - **Best practice**: Start conservative, adjust based on data\n\n2. **Error Budget**\n   \n   - **Concept**: Amount of failures allowed while meeting SLO\n   - **Usage**: Balance between reliability and feature velocity\n   - **Policy**: When error budget is exhausted, focus on stability\n\n## 🧪 Testing and Verification\n\n### Phase 1: Infrastructure Verification\n\n```bash\n# Run comprehensive verification\n./scripts/verify-setup.sh\n\n# Expected output:\n# ✅ All pods running\n# ✅ Services accessible\n# ✅ Monitoring operational\n```\n\n### Phase 2: Generate Application Traffic\n\n```bash\n# Interactive traffic generation\n./scripts/generate-traffic.sh\n\n# Options available:\n# 1. Light traffic (baseline monitoring)\n# 2. Moderate traffic (realistic load)\n# 3. Heavy traffic (stress testing)\n# 4. Burst traffic (spike testing)\n```\n\n### Phase 3: Observe Monitoring Data\n\n\u003e 📖 **For complete monitoring setup details, dashboard explanations, and troubleshooting, see [MONITORING-SETUP.md](MONITORING-SETUP.md)**\n\n#### **Access Grafana Dashboards**\n\n1. Get Grafana URL: `kubectl get service grafana -n monitoring`\n2. Login: admin/admin123\n3. Navigate to \"SRE Shop - SLO Dashboard\"\n4. All dashboards show real data with clear explanations and color coding\n\n#### **Check Prometheus Metrics**\n\n1. Get Prometheus URL: `kubectl get service prometheus-service -n monitoring`\n\n2. Open Prometheus UI\n\n3. 
Try these queries:\n   \n   ```promql\n   # Service availability\n   up{job=\"sre-shop-backend\"}\n   \n   # SLO metrics\n   sre_shop:availability_sli:rate5m\n   \n   # Error budget consumption\n   (1 - avg_over_time(sre_shop:availability_sli:rate5m[7d])) / (1 - 0.999)\n   ```\n\n### Phase 4: Test Alerting\n\n#### **Trigger SLO Violation**\n\n```bash\n# Scale down backend to trigger availability alert\nkubectl scale deployment backend-api --replicas=0 -n sre-shop\n\n# Continue generating traffic to trigger alerts\n# Wait 5-10 minutes for alerts to fire\n\n# Check AlertManager\nkubectl get service alertmanager -n monitoring\n# Open AlertManager UI to see active alerts\n\n# Restore service\nkubectl scale deployment backend-api --replicas=2 -n sre-shop\n```\n\n### Phase 5: Chaos Engineering\n\n#### **Monitor Chaos Monkey**\n\n```bash\n# Watch chaos events in real-time\nkubectl logs -f deployment/chaos-monkey -n sre-shop\n\n# Expected output every 5 minutes:\n# 🐒 Chaos Monkey Pod Killer started\n# 🔍 Looking for victims...\n# 🎲 Pod backend-api-xxx: random=94, threshold=5\n# 🍀 Pod backend-api-xxx survives this round\n```\n\n#### **Verify Application Resilience**\n\nDuring chaos events:\n\n1. **Application remains accessible** (frontend still serves traffic)\n2. **Kubernetes recreates killed pods** automatically\n3. **Load balancer routes around failed instances**\n4. 
**Monitoring captures service degradation**\n\n## 📚 Learning Exercises\n\n\u003e 📊 **Dashboard Guide**: See [DASHBOARD-EXPLANATION.md](DASHBOARD-EXPLANATION.md) for detailed metric explanations\n\u003e \n\u003e 🚀 **Monitoring Setup**: See [MONITORING-SETUP.md](MONITORING-SETUP.md) for complete monitoring stack details\n\n### Exercise 1: Understanding Kubernetes Fundamentals\n\n#### **Pod Lifecycle**\n\n```bash\n# Create a test pod\nkubectl run test-pod --image=nginx -n sre-shop\n\n# Watch pod creation\nkubectl get pods -n sre-shop -w\n\n# Examine pod details\nkubectl describe pod test-pod -n sre-shop\n\n# Delete pod and observe recreation (if part of deployment)\nkubectl delete pod test-pod -n sre-shop\n```\n\n#### **Service Discovery**\n\n```bash\n# Connect to a running pod\nkubectl exec -it \u003cbackend-pod-name\u003e -n sre-shop -- /bin/sh\n\n# Test internal DNS resolution\nnslookup redis-service.sre-shop.svc.cluster.local\nnslookup backend-service.sre-shop.svc.cluster.local\n\n# Test connectivity\nwget -O- redis-service:6379\n```\n\n### Exercise 2: SLO Management\n\n#### **Adjust SLO Targets**\n\n1. Edit `k8s-manifests/sre-practices/slo-monitoring/slo-definitions.yaml`\n2. Change availability target from 99.9% to 99.5%\n3. Apply changes: `kubectl apply -f ...`\n4. Observe different alert thresholds\n\n#### **Create Custom SLI**\n\nAdd a new SLI for frontend response time:\n\n```yaml\n- record: sre_shop:frontend_latency_sli:p99_5m\n  expr: histogram_quantile(0.99, rate(nginx_http_request_duration_seconds_bucket[5m]))\n```\n\n### Exercise 3: Incident Response\n\n#### **Simulate Common Incidents**\n\n1. **Database Failure**\n   \n   ```bash\n   kubectl scale deployment redis --replicas=0 -n sre-shop\n   # Follow runbook: docs/runbooks/database-failure.md\n   ```\n\n2. 
**High Memory Usage**\n   \n   ```bash\n   # Patch the deployment with a tight 32Mi memory limit to simulate memory pressure\n   kubectl patch deployment backend-api -n sre-shop -p '{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"backend\",\"resources\":{\"limits\":{\"memory\":\"32Mi\"}}}]}}}}'\n   ```\n\n3. **Network Partition**\n   \n   ```bash\n   # Create network policy to isolate components\n   kubectl apply -f examples/network-partition.yaml\n   ```\n\n### Exercise 4: Chaos Engineering Experiments\n\n#### **Design Custom Chaos**\n\n1. **Modify chaos probability** in chaos-monkey.yaml\n2. **Add new failure types** (network latency, disk full)\n3. **Create chaos schedules** (business hours vs off-hours)\n4. **Measure blast radius** (how far failures propagate)\n\n#### **Chaos Experiment Process**\n\n1. **Hypothesis**: \"Application survives 50% pod failures\"\n2. **Blast Radius**: Limit to one service initially\n3. **Monitoring**: Watch SLO metrics during experiment\n4. **Analysis**: Document weaknesses discovered\n5. 
**Improvements**: Fix issues and repeat\n\n## 🔧 Troubleshooting\n\n### Common Infrastructure Issues\n\n#### **Terraform Errors**\n\n**Error**: \"Insufficient capacity\"\n\n```bash\n# Solution: Try different instance types or regions\n# Edit terraform/variables.tf:\n# instance_types = [\"t3.small\", \"t3.medium\", \"t2.medium\"]\n```\n\n**Error**: \"VPC limit exceeded\"\n\n```bash\n# Solution: Delete unused VPCs or request limit increase\naws ec2 describe-vpcs\naws support create-case ...\n```\n\n#### **EKS Access Issues**\n\n**Error**: \"User is not authorized\"\n\n```bash\n# Solution 1: Regenerate your kubeconfig for the cluster\naws eks update-kubeconfig --region eu-central-1 --name sre-lab-eks\n\n# Solution 2: Check IAM permissions\naws sts get-caller-identity\n# User needs AmazonEKSClusterAdminPolicy\n\n# Solution 3: Add via AWS Console\n# EKS → Cluster → Access → Add IAM user\n```\n\n**Error**: \"No nodes found\"\n\n```bash\n# Check node group status\naws eks describe-nodegroup --cluster-name sre-lab-eks --nodegroup-name default\n\n# If failed, recreate:\nterraform destroy -target=module.eks.eks_managed_node_groups\nterraform apply\n```\n\n### Common Kubernetes Issues\n\n#### **Pods Stuck in Pending**\n\n```bash\n# Check node resources\nkubectl describe nodes\n\n# Check events\nkubectl get events -n sre-shop --sort-by='.lastTimestamp'\n\n# Common causes:\n# - Insufficient CPU/memory\n# - Image pull failures\n# - Volume mount issues\n```\n\n#### **Services Not Accessible**\n\n```bash\n# Check service endpoints\nkubectl get endpoints -n sre-shop\n\n# Check pod labels match service selector\nkubectl get pods -n sre-shop --show-labels\nkubectl describe service frontend-service -n sre-shop\n\n# Test internal connectivity\nkubectl run debug --image=nicolaka/netshoot -it --rm -- nslookup frontend-service.sre-shop.svc.cluster.local\n```\n\n#### **LoadBalancer Pending**\n\n```bash\n# Check AWS load balancer controller\nkubectl get pods -n kube-system | grep aws-load-balancer\n\n# Verify 
subnet tags\naws ec2 describe-subnets --filters \"Name=vpc-id,Values=\u003cvpc-id\u003e\"\n# Should have kubernetes.io/role/elb=1 for public subnets\n```\n\n### Monitoring Issues\n\n#### **No Metrics in Prometheus**\n\n```bash\n# Check Prometheus targets\nkubectl port-forward -n monitoring service/prometheus-service 9090:9090\n# Open http://localhost:9090/targets\n\n# Check service discovery\nkubectl logs deployment/prometheus -n monitoring | grep discovery\n\n# Verify annotations on pods\nkubectl get pods -n sre-shop -o yaml | grep -A5 annotations\n```\n\n#### **Grafana Dashboard Empty**\n\n```bash\n# Check data source connection\nkubectl port-forward -n monitoring service/grafana 3000:3000\n# Login admin/admin → Configuration → Data Sources\n\n# Verify Prometheus URL: http://prometheus-service:9090\n\n# Check time range (set to last 1 hour)\n# Wait 10-15 minutes for data accumulation\n```\n\n#### **Alerts Not Firing**\n\n```bash\n# Check alerting rules syntax\nkubectl logs deployment/prometheus -n monitoring | grep -i alert\n\n# Verify AlertManager configuration\nkubectl get configmap alertmanager-config -n monitoring -o yaml\n\n# Test alert conditions manually in Prometheus\n# Query: ALERTS{alertstate=\"firing\"}\n```\n\n### Application Issues\n\n#### **Frontend Shows 502 Error**\n\n```bash\n# Check backend pod health\nkubectl get pods -n sre-shop\nkubectl logs deployment/backend-api -n sre-shop\n\n# Verify service configuration\nkubectl describe service backend-service -n sre-shop\n\n# Test backend directly\nkubectl port-forward service/backend-service 8080:8080 -n sre-shop\ncurl http://localhost:8080\n```\n\n#### **Database Connection Failures**\n\n```bash\n# Check Redis pod\nkubectl logs deployment/redis -n sre-shop\n\n# Test Redis connectivity\nkubectl exec -it deployment/backend-api -n sre-shop -- wget -qO- redis-service:6379\n\n# Check network policies\nkubectl get networkpolicies -n sre-shop\n```\n\n## 🎓 Advanced Topics\n\n### Scaling 
Considerations\n\n#### **Horizontal Pod Autoscaling**\n\n```yaml\napiVersion: autoscaling/v2\nkind: HorizontalPodAutoscaler\nmetadata:\n  name: backend-hpa\n  namespace: sre-shop\nspec:\n  scaleTargetRef:\n    apiVersion: apps/v1\n    kind: Deployment\n    name: backend-api\n  minReplicas: 2\n  maxReplicas: 10\n  metrics:\n  - type: Resource\n    resource:\n      name: cpu\n      target:\n        type: Utilization\n        averageUtilization: 70\n```\n\n#### **Cluster Autoscaling**\n\n```bash\n# Enable cluster autoscaler\nhelm repo add autoscaler https://kubernetes.github.io/autoscaler\nhelm install cluster-autoscaler autoscaler/cluster-autoscaler \\\n  --namespace kube-system \\\n  --set autoDiscovery.clusterName=sre-lab-eks\n```\n\n### Security Hardening\n\n#### **Network Policies**\n\n```yaml\napiVersion: networking.k8s.io/v1\nkind: NetworkPolicy\nmetadata:\n  name: sre-shop-network-policy\n  namespace: sre-shop\nspec:\n  podSelector: {}\n  policyTypes:\n  - Ingress\n  - Egress\n  ingress:\n  - from:\n    - namespaceSelector:\n        matchLabels:\n          name: sre-shop\n  egress:\n  - to:\n    - namespaceSelector:\n        matchLabels:\n          name: monitoring\n```\n\n#### **Pod Security Standards**\n\n```yaml\napiVersion: v1\nkind: Namespace\nmetadata:\n  name: sre-shop\n  labels:\n    pod-security.kubernetes.io/enforce: restricted\n    pod-security.kubernetes.io/audit: restricted\n    pod-security.kubernetes.io/warn: restricted\n```\n\n### Cost Optimization\n\n#### **Resource Requests vs Limits**\n\n```yaml\nresources:\n  requests:    # Guaranteed resources\n    memory: \"64Mi\"\n    cpu: \"50m\"\n  limits:      # Maximum allowed\n    memory: \"128Mi\"\n    cpu: \"100m\"\n```\n\n#### **Spot Instances**\n\n```terraform\n# In terraform/eks.tf\neks_managed_node_groups = {\n  spot = {\n    capacity_type  = \"SPOT\"\n    instance_types = [\"t3.medium\", \"t3.large\"]\n    desired_size   = 2\n    max_size       = 4\n    min_size       = 1\n  }\n}\n```\n\n## 🧹 
Cleanup\n\n### Full Environment Cleanup\n\n```bash\n# Delete Kubernetes resources\nkubectl delete namespace sre-shop\nkubectl delete namespace monitoring\n\n# Destroy infrastructure\ncd terraform\nterraform destroy\n```\n\n### Partial Cleanup\n\n```bash\n# Remove only applications (keep cluster)\nkubectl delete -f k8s-manifests/app/\nkubectl delete -f k8s-manifests/monitoring/\nkubectl delete -f k8s-manifests/sre-practices/\n\n# Remove only SRE practices (keep apps)\nkubectl delete -f k8s-manifests/sre-practices/\n```\n\n### Cost Monitoring\n\n```bash\n# Check current costs\naws ce get-cost-and-usage \\\n  --time-period Start=2024-01-01,End=2024-01-31 \\\n  --granularity MONTHLY \\\n  --metrics BlendedCost \\\n  --group-by Type=DIMENSION,Key=SERVICE\n```\n\n**Expected Monthly Costs:**\n\n- **EKS Cluster**: $72 (cluster management)\n- **EC2 Instances**: $60 (2 x t3.medium)\n- **Load Balancers**: $36 (2 x ALB)\n- **NAT Gateways**: $90 (2 x NAT + data transfer)\n- **Total**: ~$260/month\n\n## 📖 Further Reading\n\n### Essential SRE Resources\n\n- [Google SRE Book](https://sre.google/sre-book/table-of-contents/) - Foundational concepts\n- [Site Reliability Workbook](https://sre.google/workbook/table-of-contents/) - Practical implementation\n- [Prometheus Documentation](https://prometheus.io/docs/) - Monitoring best practices\n- [Kubernetes Documentation](https://kubernetes.io/docs/) - Container orchestration\n\n### Advanced Topics\n\n- [Chaos Engineering Principles](https://principlesofchaos.org/) - Failure testing methodology\n- [OpenTelemetry](https://opentelemetry.io/) - Observability standards\n- [GitOps with ArgoCD](https://argo-cd.readthedocs.io/) - Continuous deployment\n- [Service Mesh with Istio](https://istio.io/) - Advanced networking and security\n\n# \n\n---\n\n**🎯 You now have a complete, production-grade SRE training environment!**\n\nThis setup mirrors what you'd find at companies like Google, Netflix, and Spotify. 
Use it to practice SRE skills, experiment with new technologies, and build confidence with real-world reliability engineering.\n\n**Next Steps:**\n\n1. **Follow the step-by-step setup** to build your environment\n2. **Complete the learning exercises** to understand each component\n3. **Experiment with configurations** to see how changes affect behavior\n4. **Practice incident response** using the provided runbooks\n5. **Read the deep-dive documentation** to understand why we made each choice\n\nHappy learning! 🚀","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froybidani%2Fsre-lab-infra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Froybidani%2Fsre-lab-infra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froybidani%2Fsre-lab-infra/lists"}