https://github.com/marckmorris/chaos-engineering-databases
https://github.com/marckmorris/chaos-engineering-databases
automation ci-cd cloud database devops disaster-recovery kubernetes monitoring reliability reliability-engineering sre terraform
Last synced: 3 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/marckmorris/chaos-engineering-databases
- Owner: MarckMorris
- License: mit
- Created: 2025-11-21T00:17:34.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-11-21T00:33:53.000Z (4 months ago)
- Last Synced: 2025-11-21T02:33:11.812Z (4 months ago)
- Topics: automation, ci-cd, cloud, database, devops, disaster-recovery, kubernetes, monitoring, reliability, reliability-engineering, sre, terraform
- Language: Python
- Size: 15.6 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Chaos Engineering Databases
   


## Overview
Chaos engineering framework for testing database resilience and failure scenarios. This project demonstrates enterprise-grade reliability engineering practices with a focus on automation, observability, and operational excellence.
## Features
- **High Availability**: Designed for 99.99% uptime with automated failover
- **Scalability**: Horizontal scaling capabilities with load-based auto-scaling
- **Security**: Industry-standard security practices and compliance
- **Monitoring**: Comprehensive observability with metrics, logs, and traces
- **Automation**: Infrastructure as Code and GitOps workflows
## Architecture
```
┌─────────────────┐
│ Application │
└────────┬────────┘
│
┌────────▼────────┐
│ Load Balancer │
└────────┬────────┘
│
┌────┴────┐
│ │
┌───▼──┐ ┌──▼───┐
│ DB 1 │ │ DB 2 │
└──────┘ └──────┘
```
## Tech Stack
- **LitmusChaos**
- **Python**
- **Terraform**
- **Kubernetes**
- **PostgreSQL**
## Prerequisites
- Docker 20.10+
- Kubernetes 1.24+ (if applicable)
- Terraform 1.5+
- Python 3.9+
- Cloud provider account (AWS/GCP/Azure)
## Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/yourusername/chaos-engineering-databases.git
cd chaos-engineering-databases
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your settings
# Deploy infrastructure
cd terraform
terraform init
terraform plan
terraform apply
```
### Configuration
Key configuration parameters in `configs/config.yaml`:
```yaml
database:
type: postgresql
version: "14"
instance_type: db.m5.large
monitoring:
prometheus_port: 9090
scrape_interval: 15s
scaling:
min_replicas: 2
max_replicas: 10
target_cpu: 70
```
## Usage
### Basic Operations
```bash
# Start the system
./scripts/start.sh
# Check health
./scripts/health-check.sh
# View metrics
open http://localhost:3000 # Grafana dashboard
# Run tests
pytest tests/
```
### Advanced Operations
```bash
# Trigger failover
./scripts/failover.sh --region us-west-2
# Scale up
./scripts/scale.sh --replicas 5
# Backup database
./scripts/backup.sh --type full
```
## Testing
```bash
# Unit tests
pytest tests/unit/
# Integration tests
pytest tests/integration/
# Load tests
locust -f tests/load/locustfile.py
# Chaos tests
./scripts/chaos-test.sh
```
## Monitoring & Observability
### Metrics
Key metrics tracked:
- Query latency (p50, p95, p99)
- Connection pool utilization
- Replication lag
- Error rates
- Resource utilization (CPU, memory, disk)
### Dashboards
Access Grafana dashboards at `http://localhost:3000`:
- Overview Dashboard
- Performance Metrics
- Replication Status
- Alert History
### Alerts
Configured alerts:
- High error rate (>1%)
- Replication lag (>30s)
- Disk usage (>80%)
- Connection saturation (>90%)
## Performance
Benchmark results on m5.xlarge instances:
| Metric | Value |
|--------|-------|
| Max QPS | 10,000 |
| P99 Latency | 25ms |
| Uptime | 99.99% |
| MTTR | <5 min |
## Security
- **Encryption**: At-rest and in-transit encryption enabled
- **Authentication**: mTLS for service communication
- **Secrets**: HashiCorp Vault integration
- **Compliance**: SOC2, HIPAA-ready configurations
- **Auditing**: Complete audit logs with retention
## Disaster Recovery
- **RTO**: 15 minutes
- **RPO**: 5 minutes
- **Backup Schedule**: Hourly incremental, daily full
- **Geo-redundancy**: Multi-region replication
- **Automated Failover**: Health-check based switching
## Troubleshooting
### Common Issues
**Issue**: High replication lag
```bash
# Check replication status
./scripts/check-replication.sh
# Force sync
./scripts/force-sync.sh
```
**Issue**: Connection pool exhausted
```bash
# Check active connections
./scripts/check-connections.sh
# Increase pool size
./scripts/scale-connections.sh --size 200
```
## Contributing
Contributions are welcome! Please follow these guidelines:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## Roadmap
- [ ] Multi-cloud support expansion
- [ ] Advanced ML-based auto-tuning
- [ ] Enhanced chaos engineering scenarios
- [ ] GraphQL API support
- [ ] Real-time analytics dashboard
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Built with industry best practices from Google SRE handbook
- Inspired by Netflix's reliability engineering
- Community contributions and feedback
## Contact
- **Issues**: [GitHub Issues](https://github.com/yourusername/chaos-engineering-databases/issues)
- **Discussions**: [GitHub Discussions](https://github.com/yourusername/chaos-engineering-databases/discussions)
---
**Note**: This is a production-grade implementation. Always test in staging before deploying to production.