https://github.com/mghaith/reusable-monitoring-stack
Reusable Monitoring Stack
https://github.com/mghaith/reusable-monitoring-stack
alertmanager ansible docker docker-compose github-actions grafana prometheus terraform
Last synced: 3 months ago
JSON representation
Reusable Monitoring Stack
- Host: GitHub
- URL: https://github.com/mghaith/reusable-monitoring-stack
- Owner: MGhaith
- Created: 2025-09-03T03:53:25.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-09-10T07:55:31.000Z (10 months ago)
- Last Synced: 2025-09-10T11:15:56.708Z (10 months ago)
- Topics: alertmanager, ansible, docker, docker-compose, github-actions, grafana, prometheus, terraform
- Language: HCL
- Homepage:
- Size: 64.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Reusable Monitoring Stack
## Overview
This repository contains a reusable monitoring stack featuring Prometheus, Grafana, and Alertmanager. It's designed to be deployed locally or in the cloud.
## Deployment
### Local Deployment (Docker Environment)
#### 🖼️ Architecture Diagram

#### Prerequisites
1. Install [Docker Desktop](https://docs.docker.com/get-started/get-docker/)
2. Make sure Docker is running on your machine
#### Setup Steps
1. Copy the environment file:
```bash
cp .env.example .env
```
2. Edit the `.env` file
3. To get alerts using Alert Manager: Edit `prometheus/alertmanager.yml` file to add your **Slack** and/or **Gmail** credentials.
4. Start the stack:
```bash
docker-compose up -d
```
#### Alerts
Alerts are configured to notify via email/Slack when certain thresholds are met. The following alerts are currently configured in `prometheus/alerts.yml`:
- **High CPU Usage**: Triggers when CPU usage exceeds 80% for 5 minutes.
- **High Memory Usage**: Alerts when memory usage is above 85% for 5 minutes.
- **Instance Down**: Fires when a monitored instance is unreachable for 2 minutes.
- **Disk Space Running Low**: Alerts when disk usage reaches 90% capacity.
To view and manage these alerts, you can access them through the **Prometheus UI** or directly modify the `prometheus/alerts.yml` file.
#### Access
* Prometheus: http://localhost:9090
* Alertmanager: http://localhost:9093
* Grafana: http://localhost:3000 (Username: ```admin```, Password: ```"your Grafana password"```)
### Cloud Deployment (AWS)
> ⚠️ **WARNING: Cost Implications** ⚠️
>
> Deploying this stack on AWS will incur costs for several resources, including but not limited to:
> - **NAT Gateway** (~$32/month per gateway)
> - **Application Load Balancer** (~$20/month)
> - **EFS Storage** (pricing varies based on usage, ~$0.30/GB-month)
>
> Please review AWS pricing carefully before deployment to avoid unexpected charges.
> **Note**: By default, this stack will be deployed in the `eu-central-1` region. To change the deployment region, modify the `provider.tf` file in the terraform directory.
#### 🖼️ Architecture Diagram

#### Prerequisites
1. [AWS CLI](https://aws.amazon.com/cli/) installed and configured
2. [Terraform](https://www.terraform.io/downloads.html) (v1.0.0+) installed
3. AWS account with appropriate permissions
#### Infrastructure Components
- **VPC**: Isolated network with public and private subnets
- **ECS Cluster**: For running containerized services
- **Application Load Balancer**: For routing traffic to services
- **EFS**: For persistent storage of monitoring data
- **IAM Roles**: For proper service permissions
#### Setup Steps
1. Configure AWS credentials:
```bash
aws configure
```
2. Initialize Terraform:
```bash
cd terraform
terraform init
```
3. Plan the deployment:
```bash
terraform plan -out=tfplan
```
4. Apply the configuration:
```bash
terraform apply tfplan
```
5. After successful deployment, Terraform will output the ALB DNS name.
#### Access
* Prometheus: http://[ALB-DNS-NAME]/prometheus
* Alertmanager: http://[ALB-DNS-NAME]/alertmanager
* Grafana: http://[ALB-DNS-NAME]/grafana (Username: ```admin```, Password: ```admin```)
#### Cleanup
To destroy all created resources:
```bash
terraform destroy
```
**Note**: This will remove all resources including persistent data stored in EFS.
## CI/CD Workflow
This project uses GitHub Actions for continuous integration. The workflow includes:
1. **Validation**:
- Validates Docker Compose configuration
- Validates Prometheus configuration using promtool
- Validates Alertmanager configuration using amtool
2. **Smoke Testing**:
- Builds and starts the complete monitoring stack
- Waits for services to initialize
- Verifies container health status
- Performs cleanup by tearing down the stack
The workflow runs automatically on:
- Push events to the main branch
- Pull request events targeting the main branch
This ensures that all configuration files are valid and the stack can be successfully deployed before changes are merged.
## Troubleshooting
### Common Issues
#### Local Deployment
- **Port Conflicts**: If services fail to start, check if ports 9090, 9093, or 3000 are already in use.
- **Container Startup Failures**: Check logs with `docker-compose logs [service_name]`.
- **Permission Issues**: Ensure proper permissions on mounted volumes.
#### Cloud Deployment
- **Health Check Failures**: Verify ALB target group settings and container health check endpoints.
- **ECS Task Failures**: Check CloudWatch Logs for container startup issues.
- **Networking Issues**: Verify security group rules allow necessary traffic.
## 🎯 Stretch Goals
- **Add Kubernetes Support**: Enable deployment on Kubernetes clusters for better scalability.
- **Add Loki for Log Aggregation**: Enhance observability by aggregating logs.
- **Implement Service Discovery**: Add automatic service discovery for dynamic environments.
## License
MIT
---
*Feel free to clone this repository and customize it as needed. If you have any questions or need further assistance, don't hesitate to ask!*