https://github.com/suhasramanand/predictive-reliability-platform
End-to-end predictive reliability platform with anomaly detection, auto-remediation, and comprehensive observability for microservices
- Host: GitHub
- URL: https://github.com/suhasramanand/predictive-reliability-platform
- Owner: suhasramanand
- License: mit
- Created: 2025-10-17T19:02:13.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-10-18T20:38:02.000Z (6 months ago)
- Last Synced: 2025-10-18T21:29:30.905Z (6 months ago)
- Topics: anomaly-detection, auto-remediation, chaos-engineering, devops, docker, fastapi, grafana, kubernetes, microservices, monitoring, observability, predictive-maintenance, prometheus, python, react, site-reliability, sre, typescript
- Language: TypeScript
- Size: 9.02 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Predictive Reliability & Auto-Remediation Platform
A cloud-native system that monitors microservices, detects anomalies in real time, and automatically remediates issues through policy-driven actions.
## Overview
This platform demonstrates an end-to-end Site Reliability Engineering (SRE) solution featuring:
- **Instrumented Microservices**: 3 production-ready services with built-in observability
- **Real-time Anomaly Detection**: Statistical time-series analysis (moving averages and standard deviation) for proactive issue detection
- **Automated Remediation**: Policy-driven engine that executes recovery actions automatically
- **Complete Observability Stack**: Metrics (Prometheus), Logs (Loki), Traces (Jaeger)
- **Live Dashboard**: Modern React UI for monitoring and control
- **Chaos Engineering**: Built-in failure injection for testing resilience
- **AI-Powered Intelligence**: LLM-driven root cause analysis, incident summarization, and remediation advice
### Screenshots
**Dashboard Overview**

*Real-time system monitoring with service health, auto-remediation status, and quick access to observability tools*
**Anomaly Detection**

*Active anomalies detected with severity classification, confidence scores, and expected value ranges*
**Auto-Remediation Actions**

*Complete history of executed remediation actions with policy triggers and execution details*
**Policy Configuration**

*YAML-driven policy rules with configurable thresholds, actions, and cooldown periods*
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                        Dashboard (React)                        │
│                      http://localhost:3000                      │
└────────────────────┬────────────────────────────────────────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
         ▼                       ▼
┌─────────────────┐    ┌──────────────────┐
│ Anomaly Service │    │  Policy Engine   │
│    Port 8080    │───▶│    Port 8081     │
└────────┬────────┘    └────────┬─────────┘
         │                      │
         │                      ├──► Docker API (restart containers)
         │                      └──► Alerts & Actions
         │
         ▼
┌──────────────────────────────────────────┐
│             Prometheus :9090             │
│       (Scrapes metrics every 10s)        │
└────┬─────────┬──────────┬────────────────┘
     │         │          │
     ▼         ▼          ▼
┌─────────┐ ┌──────────┐ ┌──────────────┐
│ Orders  │ │  Users   │ │   Payments   │
│  :8001  │ │  :8002   │ │    :8003     │
└─────────┘ └──────────┘ └──────────────┘
     │         │          │
     └─────────┴──────────┴──► Jaeger :16686 (Traces)
               │
               └──► Loki :3100 (Logs)
```
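The arrow from the Anomaly Service down to Prometheus is a read over the Prometheus HTTP API. The sketch below shows one way to build an instant query and read its result. The `/api/v1/query` endpoint and response layout are standard Prometheus; the `service` label and any metric name you pass in are assumptions, since the services' actual metric names are not listed here.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # port taken from the diagram


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(response_body: str) -> dict[str, float]:
    """Map each series' `service` label (assumed label name) to its latest value."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        label = series["metric"].get("service", "unknown")
        # An instant vector carries one [timestamp, "value"] pair per series.
        values[label] = float(series["value"][1])
    return values
```

Fetching `build_query_url("up")` with any HTTP client and passing the body to `extract_values` should yield one entry per scraped target that carries the assumed label.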
## Components
### Microservices
- **Orders Service** (Port 8001): Order management with chaos injection
- **Users Service** (Port 8002): User account management
- **Payments Service** (Port 8003): Payment processing
Each service exposes:
- `/health` - Health check endpoint
- `/metrics` - Prometheus-format metrics
- `/docs` - FastAPI Swagger documentation
- Full OpenTelemetry instrumentation for distributed tracing
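As a rough illustration of how a script could probe the `/health` endpoints listed above, here is a standard-library sketch. The port map comes from this README; the helper names are hypothetical, and treating HTTP 200 as "healthy" is an assumption about the endpoint's contract.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Service ports as documented in this README.
SERVICES = {"orders": 8001, "users": 8002, "payments": 8003}


def health_url(name: str, host: str = "localhost") -> str:
    """Build the health-check URL for one of the three services."""
    return f"http://{host}:{SERVICES[name]}/health"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200 (assumed contract)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False
```

With the platform running, `{name: is_healthy(health_url(name)) for name in SERVICES}` gives a quick per-service status map.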
### Anomaly Detection Service (Port 8080)
- Pulls metrics from Prometheus every 30 seconds
- Statistical anomaly detection using moving averages and standard deviation
- Monitors: latency (p99), error rates, CPU usage
- Classifies anomalies by severity: normal, info, warning, critical
- REST API for predictions and health status
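The detection approach described above can be sketched as a rolling z-score check. This is an illustrative stand-in, not the service's actual code: the class name, the five-point warm-up, and the severity cut-offs are all assumptions; only the severity labels and the moving-average/standard-deviation idea come from this README.

```python
from collections import deque
from statistics import mean, pstdev


class MovingAverageDetector:
    """Flags values that stray too far from a rolling mean (illustrative only)."""

    def __init__(self, window_size: int = 20, sensitivity: float = 2.5):
        self.window = deque(maxlen=window_size)
        self.sensitivity = sensitivity

    def observe(self, value: float) -> str:
        """Score `value` against the current window, then add it to the window."""
        severity = "normal"
        if len(self.window) >= 5:  # assumed warm-up: too few points give noisy stats
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0:  # a perfectly flat window has no spread to compare against
                severity = self._classify(abs(value - mu) / sigma)
        self.window.append(value)
        return severity

    def _classify(self, z: float) -> str:
        # Assumed cut-offs, expressed as multiples of the configured sensitivity.
        if z >= 2 * self.sensitivity:
            return "critical"
        if z >= 1.5 * self.sensitivity:
            return "warning"
        if z >= self.sensitivity:
            return "info"
        return "normal"
```

Feeding the detector a steady stream of latencies around 0.10 s and then a single 0.90 s sample produces a `critical` result, while values close to the rolling mean stay `normal`.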
### Policy & Auto-Remediation Engine (Port 8081)
- YAML-based policy definitions
- Continuous evaluation against detected anomalies
- Actions: `restart_container`, `scale_up`, `alert`
- Cooldown periods to prevent action spam
- Complete action history tracking
- Toggle for enabling/disabling auto-remediation
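Cooldown handling like the one listed above can be sketched in a few lines. The class and method names are hypothetical; the engine's real bookkeeping may differ. The clock is injectable so the logic can be exercised without waiting.

```python
import time


class CooldownTracker:
    """Remembers when each policy last fired and blocks repeats inside its cooldown."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_fired: dict[str, float] = {}

    def try_fire(self, policy_name: str, cooldown_seconds: float) -> bool:
        """Return True and record the firing if the policy is out of cooldown."""
        now = self._clock()
        last = self._last_fired.get(policy_name)
        if last is not None and now - last < cooldown_seconds:
            return False  # still cooling down, skip the action
        self._last_fired[policy_name] = now
        return True
```

A policy engine loop would call `try_fire(policy.name, policy.cooldown)` before executing an action, so a flapping metric cannot restart the same container every evaluation cycle.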
### Dashboard (Port 3000)
- **Overview**: System health, auto-remediation status
- **Anomalies**: Real-time anomaly detection and predictions
- **Actions**: Remediation action history
- **Policies**: Active policy configurations
See [Dashboard Screenshots](#screenshots) above for visual examples.
### Observability Stack
- **Prometheus** (9090): Metrics collection and time-series database
- **Grafana** (3001): Visualization and dashboards (admin/admin)
- **Loki** (3100): Log aggregation
- **Jaeger** (16686): Distributed tracing
- **AI Service** (8090): LLM-powered intelligence (requires GROQ_API_KEY)
**Monitoring Tools**

*Grafana Explore interface with Prometheus data source for metrics visualization*

*Prometheus scrape targets showing all services health status*

*Prometheus metrics query interface with time-series visualization*

*Jaeger distributed tracing interface for trace analysis*
### Chaos Simulator
Python-based tool for injecting failures:
- Random failures and latency spikes
- Traffic generation and load testing
- Chaos engineering experiments
See the [chaos_simulator/README.md](chaos_simulator/README.md) for detailed usage.
### AI Service (Port 8090) - NEW
LLM-powered intelligence layer using Groq API:
- **Natural Language Queries**: Ask questions about your system in plain English
- **Incident Summarization**: Auto-generate incident reports from metrics, logs, and traces
- **Root Cause Analysis**: AI identifies likely failure subsystems from observability data
- **Remediation Advice**: LLM recommends best corrective actions with rationale
**Endpoints:**
- `POST /chat` - General SRE Q&A with context
- `POST /summarize` - Generate incident summary from observability data
- `POST /rca` - Root cause analysis from logs and metrics correlation
- `POST /advice` - Remediation action recommendation
**Configuration:**
Set the `GROQ_API_KEY` environment variable to enable AI features. See [AI Configuration](#ai-configuration) below.
## API Documentation
All services provide interactive OpenAPI (Swagger) documentation:
**Anomaly Detection Service**

*Anomaly detection REST API with endpoints for predictions, health checks, and manual detection*
**Policy Engine**

*Policy engine REST API for status, policy management, and remediation actions*
**Microservices APIs**
- Orders Service API: http://localhost:8001/docs
- Users Service API: http://localhost:8002/docs
- Payments Service API: http://localhost:8003/docs
## Quick Start
### Prerequisites
- Docker & Docker Compose
- Python 3.11+ (for chaos simulator)
- 8GB+ RAM recommended
- Ports available: 3000, 3001, 3100, 8001-8003, 8080-8081, 8090, 9090, 16686
### 1. Start the Platform
```bash
# Clone or navigate to the project
cd predictive-reliability-platform
# Start all services
make up
# This will start:
# - 3 Microservices
# - Anomaly Detection Service
# - Policy Engine
# - Dashboard
# - Prometheus, Grafana, Loki, Jaeger
```
Wait 30-60 seconds for all services to initialize.
### 2. Access the Interfaces
- **Dashboard**: http://localhost:3000
- **Grafana**: http://localhost:3001 (admin/admin)
- **Prometheus**: http://localhost:9090
- **Jaeger**: http://localhost:16686
- **Anomaly API Docs**: http://localhost:8080/docs
- **Policy Engine API Docs**: http://localhost:8081/docs
- **AI Service API Docs**: http://localhost:8090/docs
### 3. Generate Traffic & Trigger Anomalies
```bash
# Generate steady load
make chaos-load
# Or inject random chaos
make chaos
# Or create a traffic spike
make chaos-spike
```
### 4. Watch the Magic Happen
1. Go to the **Dashboard** (http://localhost:3000)
2. Navigate to **Anomalies** tab - watch real-time detections
3. Check **Actions** tab - see auto-remediation in action
4. View **Grafana** for detailed metrics visualization
See the [Screenshots](#screenshots) section above for visual examples of each interface.
## Detailed Usage
### Makefile Commands
```bash
make help # Show all commands
make up # Start all services
make down # Stop all services
make build # Build Docker images
make rebuild # Rebuild and restart
make logs # View logs
make status # Check service status
make health # Health check all services
make chaos # Inject random chaos
make chaos-load # Generate steady load
make chaos-spike # Generate traffic spike
make clean # Clean everything (including volumes)
make test # Run end-to-end test
make urls # Display all service URLs
```
### Chaos Simulator CLI
```bash
cd chaos_simulator
# Install dependencies
pip install -r requirements.txt
# Check health of all services
python chaos.py health
# Generate load on specific service
python chaos.py load --service orders --requests 100
# Traffic spike
python chaos.py spike --service payments --duration 60
# Random chaos for 2 minutes
python chaos.py chaos --duration 120
# Steady background load for 5 minutes
python chaos.py steady --duration 300
```
### Policy Configuration
Edit `policy_engine/policies.yml`:
```yaml
policies:
  - name: "orders_high_latency_restart"
    condition: "latency > 0.5"     # Trigger when latency > 500ms
    action: "restart_container"    # Action to execute
    service: "orders"              # Target service
    cooldown: 300                  # Wait 5 minutes before repeating
    enabled: true                  # Enable/disable policy
```
Available actions:
- `restart_container`: Restart the Docker container
- `scale_up`: Scale service replicas (K8s)
- `alert`: Send alert notification
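To make the `condition` field concrete, the sketch below evaluates a string such as `"latency > 0.5"` against a dictionary of current metric values. It assumes conditions always have the form `<metric> <operator> <number>`; the function name and the set of supported operators are hypothetical, and the real engine may parse conditions differently.

```python
import operator
import re

_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}
# Two-character operators come first so ">=" is not read as ">".
_CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|==|>|<)\s*([0-9]*\.?[0-9]+)\s*$")


def condition_matches(condition: str, metrics: dict[str, float]) -> bool:
    """Return True if the named metric satisfies the comparison in `condition`."""
    match = _CONDITION.match(condition)
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    metric, op, threshold = match.groups()
    if metric not in metrics:
        return False  # no data for this metric, so the policy cannot trigger
    return _OPS[op](metrics[metric], float(threshold))
```

For the policy shown above, `condition_matches("latency > 0.5", {"latency": 0.72})` is true, so the engine would proceed to the `restart_container` action if the policy is enabled and out of cooldown.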
### API Examples
**Get Anomalies:**
```bash
curl http://localhost:8080/predict | jq
```
**Get Services Health:**
```bash
curl http://localhost:8080/services/health | jq
```
**Get Policy Status:**
```bash
curl http://localhost:8081/status | jq
```
**Get Remediation Actions:**
```bash
curl http://localhost:8081/actions | jq
```
**Toggle Auto-Remediation:**
```bash
curl -X POST http://localhost:8081/toggle | jq
```
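The same calls can be made from Python with only the standard library. This is a minimal sketch: the base URLs and paths are the ones used in the `curl` examples, the helper names are hypothetical, and no assumption is made about the shape of the JSON that comes back.

```python
import json
from urllib.request import Request, urlopen

ANOMALY_API = "http://localhost:8080"
POLICY_API = "http://localhost:8081"


def build_request(base_url: str, path: str, method: str = "GET") -> Request:
    """Prepare a JSON request for one of the platform's REST endpoints."""
    return Request(
        f"{base_url}{path}",
        method=method,
        headers={"Accept": "application/json"},
    )


def call(request: Request, timeout: float = 5.0):
    """Send the request and decode the JSON body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the platform running, `call(build_request(ANOMALY_API, "/predict"))` returns the current anomalies, and `call(build_request(POLICY_API, "/toggle", method="POST"))` flips auto-remediation.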
## Testing End-to-End Flow
### Scenario 1: High Latency Detection & Recovery
```bash
# 1. Start the platform
make up
# 2. Generate traffic with latency spikes
make chaos-spike
# 3. Watch the dashboard
open http://localhost:3000
# Expected outcome:
# - Anomaly service detects high latency
# - Policy engine triggers restart action
# - Service recovers automatically
# - All actions logged in dashboard
```
### Scenario 2: High Error Rate
```bash
# 1. Enable chaos mode (already enabled in docker-compose)
# 2. Generate high load
cd chaos_simulator
python chaos.py load --service payments --requests 200
# 3. Monitor
# - Check Anomalies tab for error_rate anomalies
# - Check Actions tab for remediation history
# - View Grafana for error rate graphs
```
## Grafana Dashboards
Access Grafana at http://localhost:3001 (admin/admin)
Pre-configured dashboard includes:
- Service health overview
- Request latency (p99) per service
- Error rate trends
- CPU usage
- Request rate
To import additional dashboards:
1. Click "+" → "Import"
2. Upload `monitoring/grafana/dashboards/main-dashboard.json`
## Configuration
### Environment Variables
**Microservices:**
- `CHAOS_ENABLED`: Enable chaos injection (default: true)
- `FAILURE_RATE`: Probability of failures (default: 0.1)
- `LATENCY_SPIKE_RATE`: Probability of latency spikes (default: 0.15)
**Anomaly Service:**
- `PROMETHEUS_URL`: Prometheus endpoint
- `CHECK_INTERVAL`: Detection interval in seconds (default: 30)
**Policy Engine:**
- `AUTO_REMEDIATION_ENABLED`: Enable auto-remediation (default: true)
- `CHECK_INTERVAL`: Evaluation interval in seconds (default: 30)
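As an illustration of how the microservice chaos settings might be consumed, the sketch below reads the three variables with the defaults listed above and decides, per request, whether to inject a fault. The function names and the order in which faults are drawn are assumptions, not the services' actual implementation.

```python
import os
import random


def load_chaos_config(env=os.environ) -> dict:
    """Read chaos settings, falling back to the documented defaults."""
    return {
        "enabled": env.get("CHAOS_ENABLED", "true").lower() == "true",
        "failure_rate": float(env.get("FAILURE_RATE", "0.1")),
        "latency_spike_rate": float(env.get("LATENCY_SPIKE_RATE", "0.15")),
    }


def pick_fault(config: dict, rng: random.Random) -> str:
    """Return "failure", "latency_spike", or "none" for a single request."""
    if not config["enabled"]:
        return "none"
    # Assumed ordering: failures are drawn first, then latency spikes.
    if rng.random() < config["failure_rate"]:
        return "failure"
    if rng.random() < config["latency_spike_rate"]:
        return "latency_spike"
    return "none"
```

A request handler could call `pick_fault(config, random.Random())` and either raise an error, sleep before responding, or proceed normally depending on the result.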
### Adjusting Sensitivity
Edit `anomaly_service/main.py`:
```python
detector = SimpleAnomalyDetector(
    window_size=20,   # Number of historical data points
    sensitivity=2.5,  # Standard deviations for threshold
)
```
- Lower `sensitivity` value: a narrower threshold band, so more anomalies are detected
- Higher `sensitivity` value: a wider threshold band, so only severe anomalies are flagged
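A small worked example may help when choosing a value. The sketch below computes the expected range as mean ± `sensitivity` × standard deviation over a sample window and checks whether a 0.14 s latency reading would fall outside it. The helper name and sample data are made up, and it uses the population standard deviation, which may differ from what `SimpleAnomalyDetector` does internally.

```python
from statistics import mean, pstdev


def expected_range(history: list[float], sensitivity: float) -> tuple[float, float]:
    """Return the (low, high) band outside which a value counts as anomalous."""
    mu = mean(history)
    spread = sensitivity * pstdev(history)
    return (mu - spread, mu + spread)


# Hypothetical latency samples in seconds.
latencies = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.11, 0.10]

for sensitivity in (1.5, 2.5, 3.5):
    low, high = expected_range(latencies, sensitivity)
    flagged = not (low <= 0.14 <= high)
    print(f"sensitivity={sensitivity}: range=({low:.3f}, {high:.3f}) flags 0.14? {flagged}")
```

With this sample window, the 0.14 s reading is flagged at sensitivities 1.5 and 2.5 but falls inside the band at 3.5.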
### AI Configuration
The AI service requires a Groq API key to enable LLM-powered features.
**Option 1: Environment Variable (Recommended for Production)**
```bash
export GROQ_API_KEY="your-groq-api-key-here"
docker compose up -d
```
**Option 2: .env File (Local Development)**
```bash
# Create .env file in project root
echo "GROQ_API_KEY=your-groq-api-key-here" > .env
# Start with env file
docker compose --env-file .env up -d
```
**Option 3: GitHub Secrets (CI/CD)**
```bash
# Add secret to GitHub repository
gh secret set GROQ_API_KEY -b"your-groq-api-key-here" -R suhasramanand/predictive-reliability-platform
# Or via GitHub UI:
# Repository → Settings → Secrets and variables → Actions → New repository secret
```
**Verify AI Service:**
```bash
curl http://localhost:8090/health
# Expected: {"status":"healthy","service":"ai-service"}
# Test chat endpoint
curl -X POST http://localhost:8090/chat \
-H "Content-Type: application/json" \
-d '{"query":"What is SRE?"}'
```
**Without GROQ_API_KEY:**
- AI features will be disabled gracefully
- Dashboard will show "AI Unavailable" status
- All other platform features continue to work normally
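A client script can mirror this behavior by checking for the key before calling the AI service. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the `{"query": ...}` payload is taken from the `curl` example for `/chat` (the request shapes of the other AI endpoints are not documented here).

```python
import json
import os
from urllib.request import Request

AI_SERVICE_URL = "http://localhost:8090"


def ai_enabled(env=os.environ) -> bool:
    """Treat a missing or blank GROQ_API_KEY as "AI features off"."""
    return bool(env.get("GROQ_API_KEY", "").strip())


def build_chat_request(question: str, base_url: str = AI_SERVICE_URL) -> Request:
    """Prepare a POST /chat request with the payload used in the curl example."""
    body = json.dumps({"query": question}).encode("utf-8")
    return Request(
        f"{base_url}/chat",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

A caller would send the request with `urllib.request.urlopen` only when `ai_enabled()` is true, and otherwise skip the AI step while the rest of its workflow continues.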
**Getting a Groq API Key:**
1. Visit https://console.groq.com
2. Sign up for a free account
3. Navigate to API Keys
4. Create a new API key
5. Copy and set as environment variable
## Troubleshooting
### Services won't start
```bash
# Check Docker is running
docker ps
# Check port conflicts
lsof -i :3000,8001,8002,8003,8080,8081,9090
# View logs
make logs
```
### Anomalies not detected
```bash
# Verify Prometheus is scraping
open http://localhost:9090/targets
# Check anomaly service logs
docker logs anomaly-service
# Generate more traffic
make chaos-load
```
### Auto-remediation not working
```bash
# Check policy engine status
curl http://localhost:8081/status | jq
# Verify Docker socket is mounted
docker exec policy-engine ls -la /var/run/docker.sock
# Check policies are loaded
curl http://localhost:8081/policies | jq
```
### Dashboard not loading data
```bash
# Check service connectivity
docker exec dashboard ping anomaly-service
docker exec dashboard ping policy-engine
# Check nginx proxy config
docker logs dashboard
```
## Project Structure
```
predictive-reliability-platform/
├── services/
│   ├── orders_service/      # Orders microservice
│   ├── users_service/       # Users microservice
│   └── payments_service/    # Payments microservice
├── anomaly_service/         # Anomaly detection service
├── policy_engine/           # Auto-remediation engine
├── chaos_simulator/         # Chaos engineering tool
├── dashboard/               # React TypeScript dashboard
├── monitoring/              # Observability configs
│   ├── prometheus.yml
│   ├── loki-config.yml
│   └── grafana/
├── docker-compose.yml       # Orchestration
├── Makefile                 # Automation commands
└── README.md                # This file
```
## Production Deployment
### AWS EKS (Terraform)
```bash
cd terraform
terraform init
terraform plan
terraform apply
# Update kubeconfig
aws eks update-kubeconfig --name predictive-reliability-cluster
# Deploy
kubectl apply -f k8s/
```
### Key Considerations
1. **Security**: Use secrets management (AWS Secrets Manager, Vault)
2. **Scaling**: Configure HPA for microservices
3. **Persistence**: Use RDS for state, EBS for Prometheus
4. **Monitoring**: Send alerts to PagerDuty/Slack
5. **Networking**: Configure ALB/NLB for ingress
6. **Observability**: Consider managed solutions (Amazon Managed Prometheus, Grafana Cloud)
## Learning Outcomes
This project demonstrates:
- **Microservices Architecture**: Service isolation, API design
- **Observability**: Metrics, logs, traces (Prometheus, Loki, Jaeger)
- **SRE Practices**: SLO/SLI monitoring, error budgets, incident response
- **Machine Learning**: Time-series analysis, anomaly detection
- **Automation**: Policy-driven remediation, self-healing systems
- **DevOps**: Docker, Docker Compose, CI/CD concepts
- **Chaos Engineering**: Failure injection, resilience testing
- **Full-Stack Development**: React, TypeScript, Python, FastAPI
## Future Enhancements
- [ ] Kubernetes deployment manifests
- [ ] Terraform modules for AWS/GCP/Azure
- [ ] Advanced ML models (LSTM, Prophet)
- [ ] Slack/PagerDuty integration
- [ ] Custom Grafana dashboards with alerts
- [ ] Service mesh integration (Istio)
- [ ] Cost optimization recommendations
- [ ] Performance profiling
- [ ] Security scanning and compliance checks
## Contributing
This is a proof-of-concept project. Feel free to:
- Fork and extend functionality
- Add new microservices
- Improve anomaly detection algorithms
- Create additional policies
- Enhance the dashboard
## License
MIT License - Feel free to use this project for learning and demonstration purposes.
## Author
Built as a comprehensive SRE/DevOps demonstration project.
## Acknowledgments
- Prometheus Project
- Grafana Labs
- Jaeger/OpenTelemetry
- FastAPI Framework
- React Community
---
**Ready to see it in action?** Run `make up` and visit http://localhost:3000!