An open API service indexing awesome lists of open source software.

https://github.com/marckmorris/chaos-engineering-databases


https://github.com/marckmorris/chaos-engineering-databases

automation ci-cd cloud database devops disaster-recovery kubernetes monitoring reliability reliability-engineering sre terraform

Last synced: 3 months ago
JSON representation

Awesome Lists containing this project

README

          

# Chaos Engineering Databases

![LitmusChaos](https://img.shields.io/badge/LitmusChaos-blue) ![Python](https://img.shields.io/badge/Python-blue) ![Terraform](https://img.shields.io/badge/Terraform-blue) ![Kubernetes](https://img.shields.io/badge/Kubernetes-blue)
![License](https://img.shields.io/badge/license-MIT-green)
![Status](https://img.shields.io/badge/status-production--ready-brightgreen)

## Overview

Chaos engineering framework for testing database resilience and failure scenarios. This project demonstrates enterprise-grade reliability engineering practices with a focus on automation, observability, and operational excellence.

## Features

- **High Availability**: Designed for 99.99% uptime with automated failover
- **Scalability**: Horizontal scaling capabilities with load-based auto-scaling
- **Security**: Industry-standard security practices and compliance
- **Monitoring**: Comprehensive observability with metrics, logs, and traces
- **Automation**: Infrastructure as Code and GitOps workflows

## Architecture

```
┌─────────────────┐
│ Application │
└────────┬────────┘

┌────────▼────────┐
│ Load Balancer │
└────────┬────────┘

┌────┴────┐
│ │
┌───▼──┐ ┌──▼───┐
│ DB 1 │ │ DB 2 │
└──────┘ └──────┘
```

## Tech Stack

- **LitmusChaos**
- **Python**
- **Terraform**
- **Kubernetes**
- **PostgreSQL**

## Prerequisites

- Docker 20.10+
- Kubernetes 1.24+ (if applicable)
- Terraform 1.5+
- Python 3.9+
- Cloud provider account (AWS/GCP/Azure)

## Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/chaos-engineering-databases.git
cd chaos-engineering-databases

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your settings

# Deploy infrastructure
cd terraform
terraform init
terraform plan
terraform apply
```

### Configuration

Key configuration parameters in `configs/config.yaml`:

```yaml
database:
type: postgresql
version: "14"
instance_type: db.m5.large

monitoring:
prometheus_port: 9090
scrape_interval: 15s

scaling:
min_replicas: 2
max_replicas: 10
target_cpu: 70
```

## Usage

### Basic Operations

```bash
# Start the system
./scripts/start.sh

# Check health
./scripts/health-check.sh

# View metrics
open http://localhost:3000 # Grafana dashboard

# Run tests
pytest tests/
```

### Advanced Operations

```bash
# Trigger failover
./scripts/failover.sh --region us-west-2

# Scale up
./scripts/scale.sh --replicas 5

# Backup database
./scripts/backup.sh --type full
```

## Testing

```bash
# Unit tests
pytest tests/unit/

# Integration tests
pytest tests/integration/

# Load tests
locust -f tests/load/locustfile.py

# Chaos tests
./scripts/chaos-test.sh
```

## Monitoring & Observability

### Metrics

Key metrics tracked:
- Query latency (p50, p95, p99)
- Connection pool utilization
- Replication lag
- Error rates
- Resource utilization (CPU, memory, disk)

### Dashboards

Access Grafana dashboards at `http://localhost:3000`:
- Overview Dashboard
- Performance Metrics
- Replication Status
- Alert History

### Alerts

Configured alerts:
- High error rate (>1%)
- Replication lag (>30s)
- Disk usage (>80%)
- Connection saturation (>90%)

## Performance

Benchmark results on m5.xlarge instances:

| Metric | Value |
|--------|-------|
| Max QPS | 10,000 |
| P99 Latency | 25ms |
| Uptime | 99.99% |
| MTTR | <5 min |

## Security

- **Encryption**: At-rest and in-transit encryption enabled
- **Authentication**: mTLS for service communication
- **Secrets**: HashiCorp Vault integration
- **Compliance**: SOC2, HIPAA-ready configurations
- **Auditing**: Complete audit logs with retention

## Disaster Recovery

- **RTO**: 15 minutes
- **RPO**: 5 minutes
- **Backup Schedule**: Hourly incremental, daily full
- **Geo-redundancy**: Multi-region replication
- **Automated Failover**: Health-check based switching

## Troubleshooting

### Common Issues

**Issue**: High replication lag
```bash
# Check replication status
./scripts/check-replication.sh

# Force sync
./scripts/force-sync.sh
```

**Issue**: Connection pool exhausted
```bash
# Check active connections
./scripts/check-connections.sh

# Increase pool size
./scripts/scale-connections.sh --size 200
```

## Contributing

Contributions are welcome! Please follow these guidelines:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## Roadmap

- [ ] Multi-cloud support expansion
- [ ] Advanced ML-based auto-tuning
- [ ] Enhanced chaos engineering scenarios
- [ ] GraphQL API support
- [ ] Real-time analytics dashboard

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built with industry best practices from Google SRE handbook
- Inspired by Netflix's reliability engineering
- Community contributions and feedback

## Contact

- **Issues**: [GitHub Issues](https://github.com/yourusername/chaos-engineering-databases/issues)
- **Discussions**: [GitHub Discussions](https://github.com/yourusername/chaos-engineering-databases/discussions)

---

**Note**: This is a production-grade implementation. Always test in staging before deploying to production.