https://github.com/nexmonyx/health-controller

Health check aggregation and monitoring service

# Nexmonyx Health Controller

The Nexmonyx Health Controller is a standalone microservice responsible for monitoring the health of servers, services, APIs, databases, and external dependencies in the Nexmonyx platform.

## Features

### Health Monitoring Capabilities
- **Server Health**: Heartbeat monitoring, resource utilization
- **Service Health**: Systemd service status monitoring
- **API Health**: HTTP endpoint availability and response time monitoring
- **Database Health**: Database connectivity and query performance
- **External Health**: Third-party service monitoring (AWS, Stripe, Auth0, etc.)
- **Custom Health**: User-defined health checks with custom scripts

### Advanced Features
- **Health Scoring**: 0-100 health score calculation with configurable weights
- **Predictive Analysis**: Anomaly detection and failure prediction
- **Historical Tracking**: 30-day health history retention
- **Incident Management**: Automatic incident creation and resolution
- **Maintenance Windows**: Health check suspension during maintenance
- **Alerting Integration**: Integration with the alert controller for notifications

### Performance & Scalability
- **Concurrent Execution**: Configurable worker pool for parallel health checks
- **Batch Processing**: Efficient batch processing of health checks
- **Rate Limiting**: Built-in rate limiting to prevent API overload
- **Caching**: Local caching for improved performance
- **High Availability**: Leader election support for multi-instance deployments

## Configuration

The controller uses environment variables for configuration. Copy `.env.example` to `.env` and update the values as needed.

### Server Configuration
```bash
HEALTH_SERVER_HOST=0.0.0.0
HEALTH_SERVER_PORT=8080
HEALTH_SERVER_READ_TIMEOUT=30s
HEALTH_SERVER_WRITE_TIMEOUT=30s
HEALTH_SERVER_SHUTDOWN_TIMEOUT=10s
```

### Health Monitoring Configuration
```bash
HEALTH_CHECK_INTERVAL=30s # How often to schedule health checks
HEALTH_HEARTBEAT_THRESHOLD=2m # Threshold for heartbeat checks
HEALTH_WARNING_THRESHOLD=2m # Warning threshold
HEALTH_CRITICAL_THRESHOLD=5m # Critical threshold
HEALTH_MAX_CONCURRENT_CHECKS=100 # Max parallel health checks
HEALTH_HISTORY_RETENTION_DAYS=30 # Health history retention
HEALTH_SUMMARY_UPDATE_INTERVAL=1m # How often to update summaries
HEALTH_ANOMALY_DETECTION_ENABLED=true # Enable anomaly detection
HEALTH_PREDICTIVE_ANALYSIS_ENABLED=true # Enable predictive analysis
HEALTH_ALERTING_ENABLED=true # Enable alerting
HEALTH_CHECK_BATCH_SIZE=50 # Batch size for health checks
```
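
As an illustration of how these variables might be consumed, here is a minimal Go sketch that reads a subset of them, parses the duration strings, and validates the thresholds. The `MonitorConfig` type, the `Lookup` helper type, and the idea that the example values above act as defaults are all assumptions for this sketch, not details taken from the controller's source.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// MonitorConfig mirrors a subset of the variables above (hypothetical type).
type MonitorConfig struct {
	CheckInterval       time.Duration
	WarningThreshold    time.Duration
	CriticalThreshold   time.Duration
	MaxConcurrentChecks int
	BatchSize           int
}

// Lookup has the same shape as os.LookupEnv, so a map-backed stub can replace it.
type Lookup func(key string) (string, bool)

func duration(get Lookup, key string, fallback time.Duration) (time.Duration, error) {
	raw, ok := get(key)
	if !ok || raw == "" {
		return fallback, nil
	}
	d, err := time.ParseDuration(raw) // accepts values such as "30s" or "2m"
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return d, nil
}

func integer(get Lookup, key string, fallback int) (int, error) {
	raw, ok := get(key)
	if !ok || raw == "" {
		return fallback, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return n, nil
}

// LoadMonitorConfig uses the example values above as fallbacks (an assumption)
// and rejects a critical threshold that does not exceed the warning threshold.
func LoadMonitorConfig(get Lookup) (MonitorConfig, error) {
	var cfg MonitorConfig
	var err error
	if cfg.CheckInterval, err = duration(get, "HEALTH_CHECK_INTERVAL", 30*time.Second); err != nil {
		return cfg, err
	}
	if cfg.WarningThreshold, err = duration(get, "HEALTH_WARNING_THRESHOLD", 2*time.Minute); err != nil {
		return cfg, err
	}
	if cfg.CriticalThreshold, err = duration(get, "HEALTH_CRITICAL_THRESHOLD", 5*time.Minute); err != nil {
		return cfg, err
	}
	if cfg.MaxConcurrentChecks, err = integer(get, "HEALTH_MAX_CONCURRENT_CHECKS", 100); err != nil {
		return cfg, err
	}
	if cfg.BatchSize, err = integer(get, "HEALTH_CHECK_BATCH_SIZE", 50); err != nil {
		return cfg, err
	}
	if cfg.CriticalThreshold <= cfg.WarningThreshold {
		return cfg, fmt.Errorf("HEALTH_CRITICAL_THRESHOLD (%s) must exceed HEALTH_WARNING_THRESHOLD (%s)",
			cfg.CriticalThreshold, cfg.WarningThreshold)
	}
	if cfg.MaxConcurrentChecks < 1 || cfg.BatchSize < 1 {
		return cfg, fmt.Errorf("concurrency and batch size must be at least 1")
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadMonitorConfig(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid configuration:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the loader easy to exercise in tests without touching the process environment.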

### Health Score Weights
```bash
HEALTH_SCORE_CRITICAL_WEIGHT=0 # Score for critical status
HEALTH_SCORE_WARNING_WEIGHT=60 # Score for warning status
HEALTH_SCORE_HEALTHY_WEIGHT=100 # Score for healthy status
HEALTH_SCORE_UNKNOWN_WEIGHT=25 # Score for unknown status
```
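
One plausible way to turn these per-status scores into a single 0-100 value is to average them across all checks. The sketch below does exactly that. The averaging rule, the `Status` and `ScoreWeights` types, and the choice to score an empty set as "unknown" are assumptions made for illustration; the controller may aggregate differently.

```go
package main

import "fmt"

// Status is a hypothetical enumeration of check outcomes.
type Status string

const (
	StatusHealthy  Status = "healthy"
	StatusWarning  Status = "warning"
	StatusCritical Status = "critical"
	StatusUnknown  Status = "unknown"
)

// ScoreWeights holds the score assigned to each status.
type ScoreWeights struct {
	Critical, Warning, Healthy, Unknown float64
}

// DefaultWeights uses the example values from the configuration above.
var DefaultWeights = ScoreWeights{Critical: 0, Warning: 60, Healthy: 100, Unknown: 25}

func (w ScoreWeights) scoreFor(s Status) float64 {
	switch s {
	case StatusHealthy:
		return w.Healthy
	case StatusWarning:
		return w.Warning
	case StatusCritical:
		return w.Critical
	default:
		return w.Unknown // unrecognised statuses are treated as unknown
	}
}

// AggregateScore returns the mean per-check score.
// With no checks at all there is no evidence either way, so the
// sketch falls back to the "unknown" score (an assumption).
func AggregateScore(w ScoreWeights, statuses []Status) float64 {
	if len(statuses) == 0 {
		return w.Unknown
	}
	total := 0.0
	for _, s := range statuses {
		total += w.scoreFor(s)
	}
	return total / float64(len(statuses))
}

func main() {
	statuses := []Status{StatusHealthy, StatusHealthy, StatusWarning, StatusCritical}
	// (100 + 100 + 60 + 0) / 4 = 65
	fmt.Printf("score: %.1f\n", AggregateScore(DefaultWeights, statuses))
}
```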

### Resource Thresholds
```bash
HEALTH_CPU_WARNING_PERCENT=80.0
HEALTH_CPU_CRITICAL_PERCENT=95.0
HEALTH_MEMORY_WARNING_PERCENT=85.0
HEALTH_MEMORY_CRITICAL_PERCENT=95.0
HEALTH_DISK_WARNING_PERCENT=80.0
HEALTH_DISK_CRITICAL_PERCENT=90.0
HEALTH_NETWORK_WARNING_LATENCY_MS=100
HEALTH_NETWORK_CRITICAL_LATENCY_MS=500
HEALTH_NETWORK_WARNING_LOSS_PERCENT=1.0
HEALTH_NETWORK_CRITICAL_LOSS_PERCENT=5.0
```

### Nexmonyx API Configuration
```bash
NEXMONYX_BASE_URL=https://api.nexmonyx.com
NEXMONYX_ACCESS_KEY=your_access_key_here
NEXMONYX_ACCESS_SECRET=your_access_secret_here
NEXMONYX_TIMEOUT=30s
NEXMONYX_RETRY_COUNT=3
NEXMONYX_RETRY_DELAY=1s
NEXMONYX_RATE_LIMIT_RPS=100
```
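
To show how `NEXMONYX_RETRY_COUNT` and `NEXMONYX_RETRY_DELAY` could interact, here is a small retry loop in Go. It is a sketch under one assumption in particular: that the retry count means "additional attempts after the first failure". The README does not say whether the SDK counts it that way, or whether it uses a fixed delay or a backoff. The `withRetry` helper and `errTransient` value are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a retryable API failure in this sketch.
var errTransient = errors.New("transient failure")

// withRetry runs op once and then up to `retries` more times,
// sleeping a fixed `delay` between attempts. It reports how many
// attempts were made and the last error, if every attempt failed.
func withRetry(retries int, delay time.Duration, op func() error) (attempts int, err error) {
	for attempts = 1; ; attempts++ {
		if err = op(); err == nil {
			return attempts, nil
		}
		if attempts > retries {
			return attempts, fmt.Errorf("giving up after %d attempts: %w", attempts, err)
		}
		time.Sleep(delay)
	}
}

func main() {
	calls := 0
	// Values taken from the example configuration: 3 retries, 1s delay.
	attempts, err := withRetry(3, time.Second, func() error {
		calls++
		if calls < 3 {
			return errTransient // fail the first two calls
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "error:", err)
}
```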

### Leader Election Configuration
```bash
HEALTH_LEADER_ELECTION_ENABLED=true
HEALTH_LEADER_ELECTION_LOCK_NAME=nexmonyx-health-controller
HEALTH_LEADER_ELECTION_LOCK_NAMESPACE=nexmonyx-system
HEALTH_LEADER_ELECTION_LEASE_DURATION=15s
HEALTH_LEADER_ELECTION_RENEW_DEADLINE=10s
HEALTH_LEADER_ELECTION_RETRY_PERIOD=2s
```

## Health Check Types

### 1. Heartbeat Checks
Monitor server heartbeat and last seen time:
```json
{
  "check_type": "heartbeat",
  "threshold": {
    "warning_minutes": 2,
    "critical_minutes": 5
  }
}
```
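
The following Go sketch shows how a definition like this might be decoded and evaluated. The `HeartbeatCheck` type is hypothetical, and treating an age exactly equal to a threshold as having crossed it is an assumption; the README does not specify the boundary behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// HeartbeatCheck models the JSON definition above (hypothetical type).
type HeartbeatCheck struct {
	CheckType string `json:"check_type"`
	Threshold struct {
		WarningMinutes  float64 `json:"warning_minutes"`
		CriticalMinutes float64 `json:"critical_minutes"`
	} `json:"threshold"`
}

// ParseHeartbeatCheck decodes a definition and verifies its type.
func ParseHeartbeatCheck(raw []byte) (HeartbeatCheck, error) {
	var c HeartbeatCheck
	if err := json.Unmarshal(raw, &c); err != nil {
		return c, err
	}
	if c.CheckType != "heartbeat" {
		return c, fmt.Errorf("unexpected check_type %q", c.CheckType)
	}
	return c, nil
}

// Classify maps the minutes elapsed since the last heartbeat to a status.
// An age equal to a threshold counts as crossing it (assumed).
func (c HeartbeatCheck) Classify(minutesSinceLastSeen float64) string {
	switch {
	case minutesSinceLastSeen >= c.Threshold.CriticalMinutes:
		return "critical"
	case minutesSinceLastSeen >= c.Threshold.WarningMinutes:
		return "warning"
	default:
		return "healthy"
	}
}

func main() {
	raw := []byte(`{"check_type":"heartbeat","threshold":{"warning_minutes":2,"critical_minutes":5}}`)
	check, err := ParseHeartbeatCheck(raw)
	if err != nil {
		panic(err)
	}
	// Pretend the server last reported three minutes ago.
	lastSeen := time.Now().Add(-3 * time.Minute)
	fmt.Println(check.Classify(time.Since(lastSeen).Minutes()))
}
```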

### 2. Service Checks
Monitor systemd service status:
```json
{
  "check_type": "service",
  "config": {
    "service_name": "nginx"
  }
}
```

### 3. Resource Checks
Monitor CPU, memory, and disk usage:
```json
{
  "check_type": "resource",
  "config": {
    "resource_type": "cpu"
  },
  "threshold": {
    "warning_percent": 80,
    "critical_percent": 95
  }
}
```

### 4. API Checks
Monitor HTTP endpoint availability:
```json
{
  "check_type": "api",
  "config": {
    "url": "https://api.example.com/health",
    "method": "GET",
    "expected_status": 200,
    "headers": {
      "Authorization": "Bearer token"
    }
  }
}
```
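
A check of this shape can be executed with the standard `net/http` client. The sketch below is an assumption about how such an executor might look, not the controller's actual implementation: `APICheckConfig`, `APICheckResult`, `RunAPICheck`, and `runDemo` are hypothetical names. The demo targets a local stub server from `net/http/httptest` so that it runs without network access.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// APICheckConfig mirrors the "config" object above (hypothetical type).
type APICheckConfig struct {
	URL            string            `json:"url"`
	Method         string            `json:"method"`
	ExpectedStatus int               `json:"expected_status"`
	Headers        map[string]string `json:"headers"`
}

// APICheckResult records availability and response time.
type APICheckResult struct {
	Healthy      bool
	StatusCode   int
	ResponseTime time.Duration
}

// RunAPICheck sends one request and compares the status code with the expected one.
// A transport error (DNS failure, timeout, refused connection) is returned as err.
func RunAPICheck(ctx context.Context, client *http.Client, cfg APICheckConfig) (APICheckResult, error) {
	req, err := http.NewRequestWithContext(ctx, cfg.Method, cfg.URL, nil)
	if err != nil {
		return APICheckResult{}, err
	}
	for name, value := range cfg.Headers {
		req.Header.Set(name, value)
	}
	start := time.Now()
	resp, err := client.Do(req)
	elapsed := time.Since(start)
	if err != nil {
		return APICheckResult{ResponseTime: elapsed}, err
	}
	defer resp.Body.Close()
	return APICheckResult{
		Healthy:      resp.StatusCode == cfg.ExpectedStatus,
		StatusCode:   resp.StatusCode,
		ResponseTime: elapsed,
	}, nil
}

// runDemo checks a local stub endpoint that accepts only "Bearer token".
func runDemo(token string) (APICheckResult, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer token" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return RunAPICheck(ctx, srv.Client(), APICheckConfig{
		URL:            srv.URL + "/health",
		Method:         http.MethodGet,
		ExpectedStatus: http.StatusOK,
		Headers:        map[string]string{"Authorization": "Bearer " + token},
	})
}

func main() {
	res, err := runDemo("token")
	fmt.Println("healthy:", res.Healthy, "status:", res.StatusCode, "took:", res.ResponseTime, "err:", err)
}
```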

### 5. Database Checks
Monitor database connectivity:
```json
{
  "check_type": "database",
  "config": {
    "db_type": "postgresql",
    "host": "db.example.com",
    "port": 5432,
    "database": "myapp",
    "username": "monitor"
  }
}
```

### 6. External Service Checks
Monitor third-party services:
```json
{
  "check_type": "external",
  "config": {
    "service_type": "aws",
    "region": "us-east-1",
    "service": "ec2"
  }
}
```

### 7. Custom Checks
Execute custom scripts:
```json
{
  "check_type": "custom",
  "config": {
    "script": "#!/bin/bash\necho 'Health check passed'\nexit 0",
    "interpreter": "bash"
  }
}
```
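
A custom check like this could be run by piping the script into the named interpreter and inspecting the exit code. The sketch below assumes that exit code 0 means healthy and anything else means unhealthy, which matches the example script but is not confirmed by the README. `CustomCheckConfig`, `CustomCheckResult`, and `RunCustomCheck` are hypothetical names, and the timeout parameter is an addition for safety.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// CustomCheckConfig mirrors the "config" object above (hypothetical type).
type CustomCheckConfig struct {
	Script      string `json:"script"`
	Interpreter string `json:"interpreter"`
}

// CustomCheckResult captures the script outcome.
type CustomCheckResult struct {
	Healthy  bool
	ExitCode int
	Output   string
}

// RunCustomCheck feeds the script to the interpreter on stdin and waits
// at most timeoutSeconds. A non-zero exit marks the check unhealthy;
// failing to start the interpreter at all is returned as an error.
func RunCustomCheck(cfg CustomCheckConfig, timeoutSeconds int) (CustomCheckResult, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutSeconds)*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, cfg.Interpreter)
	cmd.Stdin = strings.NewReader(cfg.Script)
	var out bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &out

	err := cmd.Run()
	res := CustomCheckResult{Output: strings.TrimSpace(out.String())}

	var exitErr *exec.ExitError
	switch {
	case err == nil:
		res.Healthy = true
	case errors.As(err, &exitErr):
		// The process ran but exited non-zero (or was killed on timeout, reported as -1).
		res.ExitCode = exitErr.ExitCode()
	default:
		return res, err // for example, the interpreter was not found
	}
	return res, nil
}

func main() {
	res, err := RunCustomCheck(CustomCheckConfig{
		Script:      "#!/bin/bash\necho 'Health check passed'\nexit 0",
		Interpreter: "bash",
	}, 10)
	fmt.Printf("%+v err=%v\n", res, err)
}
```

Running user-supplied scripts is a significant trust decision; a real executor would likely also restrict the allowed interpreters and the environment passed to the child process.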

## API Endpoints

### Health and Status
- `GET /health` - Controller health status
- `GET /ready` - Readiness check with statistics
- `GET /metrics` - Prometheus metrics
- `GET /stats` - Detailed statistics

### Controller Management
- `GET /api/v1/status` - Controller status and statistics

## Deployment

### Docker
```bash
# Build the image
docker build -t nexmonyx-health-controller .

# Run the container
docker run -d \
  --name health-controller \
  -p 8080:8080 \
  -p 9090:9090 \
  -e NEXMONYX_ACCESS_KEY=your-access-key \
  -e NEXMONYX_ACCESS_SECRET=your-access-secret \
  nexmonyx-health-controller
```

### Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nexmonyx-health-controller
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nexmonyx-health-controller
  template:
    metadata:
      labels:
        app: nexmonyx-health-controller
    spec:
      containers:
        - name: health-controller
          image: nexmonyx-health-controller:latest
          ports:
            - containerPort: 8080
            - containerPort: 9090
          env:
            - name: NEXMONYX_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: nexmonyx-secrets
                  key: access-key
            - name: NEXMONYX_ACCESS_SECRET
              valueFrom:
                secretKeyRef:
                  name: nexmonyx-secrets
                  key: access-secret
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

## Monitoring and Observability

### Metrics
The controller exposes Prometheus metrics at `/metrics`:
- `health_checks_total` - Total health checks executed
- `health_checks_successful_total` - Successful health checks
- `health_checks_failed_total` - Failed health checks
- `health_check_average_duration_ms` - Average check duration
- `health_workers_active` - Active worker count
- `health_checks_queued` - Queued health checks

### Logging
Structured JSON logging with configurable levels:
- `trace` - Detailed execution flow
- `debug` - Debug information
- `info` - General information
- `warn` - Warning conditions
- `error` - Error conditions

### Health Checks
- Liveness probe: `GET /health`
- Readiness probe: `GET /ready`

## Architecture

### Components
1. **Health Service** - Core health monitoring logic
2. **Worker Pool** - Concurrent health check execution
3. **Check Executors** - Type-specific health check implementations
4. **Configuration** - Environment-based configuration management
5. **HTTP Server** - REST API and metrics endpoints

### Data Flow
1. Health Service schedules health checks based on intervals
2. Due checks are submitted to the Worker Pool
3. Workers execute health checks using appropriate executors
4. Results are stored via the Nexmonyx API
5. Health summaries are updated periodically
6. Incidents are automatically created/resolved based on health status
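
Steps 2 and 3 of this flow follow the familiar worker-pool pattern. The Go sketch below illustrates it with channels and a `sync.WaitGroup`. The `Check` and `Result` types, the `RunChecks` function, and the `ErrUnhealthy` value are hypothetical; the controller's real pool, scheduling, and result storage are not shown in this README.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrUnhealthy is a placeholder failure used by this sketch.
var ErrUnhealthy = errors.New("check reported unhealthy")

// Check is a unit of work; Run stands in for a type-specific executor.
type Check struct {
	ID  string
	Run func() error
}

// Result pairs a check with its outcome.
type Result struct {
	CheckID string
	Err     error
}

// RunChecks executes all checks using at most `workers` goroutines
// and returns one result per check, in completion order.
func RunChecks(checks []Check, workers int) []Result {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan Check)
	results := make(chan Result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range jobs {
				results <- Result{CheckID: c.ID, Err: c.Run()}
			}
		}()
	}

	// Feed the pool, then close the results channel once every worker is done.
	go func() {
		for _, c := range checks {
			jobs <- c
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	out := make([]Result, 0, len(checks))
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	checks := []Check{
		{ID: "web-01/heartbeat", Run: func() error { return nil }},
		{ID: "web-01/nginx", Run: func() error { return ErrUnhealthy }},
		{ID: "db-01/cpu", Run: func() error { return nil }},
	}
	for _, r := range RunChecks(checks, 2) {
		fmt.Printf("%-18s err=%v\n", r.CheckID, r.Err)
	}
}
```

The worker count plays the role of `HEALTH_MAX_CONCURRENT_CHECKS`: it bounds how many checks run at once regardless of how many are due.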

## Development

### Prerequisites
- Go 1.24+
- Docker
- Access to Nexmonyx API

### Building
```bash
# Install dependencies
go mod download

# Build the binary
go build -o health-controller .

# Run locally with environment file
cp .env.example .env
# Edit .env with your configuration
./health-controller
```

### Testing
```bash
# Run tests
go test ./...

# Run with coverage
go test -cover ./...
```

## Integration

### Nexmonyx SDK
The controller uses the official Nexmonyx Go SDK for all API operations:
- Health check CRUD operations
- Server information retrieval
- Health history management
- Integration with alerting system

### Alert Controller
Automatic integration with the alert controller for:
- Health status change notifications
- Incident creation and updates
- Escalation policies
- Communication channels

## Performance Considerations

### Scalability
- Supports monitoring 100,000+ servers
- 1M+ health checks per minute
- Horizontal scaling with leader election
- Efficient resource utilization

### Optimization
- Batch processing for reduced API calls
- Local caching for improved performance
- Configurable concurrency limits
- Rate limiting to prevent API overload
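
To make the batching idea concrete, the sketch below splits a list of check IDs into fixed-size groups so that each group could be sent in a single API call. The `Batches` helper is hypothetical, and whether the controller batches by ID or by some other unit is an assumption.

```go
package main

import "fmt"

// Batches splits ids into consecutive groups of at most `size` elements.
// A size below 1 yields a single batch containing everything.
// The returned slices share the backing array of ids (no copying).
func Batches(ids []string, size int) [][]string {
	if size < 1 {
		size = len(ids)
	}
	var out [][]string
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		out = append(out, ids[start:end])
	}
	return out
}

func main() {
	ids := make([]string, 120)
	for i := range ids {
		ids[i] = fmt.Sprintf("check-%03d", i)
	}
	// 50 matches the example HEALTH_CHECK_BATCH_SIZE value.
	for i, b := range Batches(ids, 50) {
		fmt.Printf("batch %d: %d checks (%s .. %s)\n", i, len(b), b[0], b[len(b)-1])
	}
}
```

With 120 checks and a batch size of 50, this produces three batches of 50, 50, and 20 checks, so three API calls instead of 120.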

### Resource Usage
- Memory: < 200MB under normal load
- CPU: Scales with concurrent check count
- Network: Configurable rate limiting
- Storage: Local SQLite for caching

## Security

### Authentication
- Access key/secret authentication with Nexmonyx API
- Secure credential management via environment variables
- RBAC integration through API permissions

### Network Security
- HTTPS communication with Nexmonyx API
- Configurable TLS settings
- Network isolation support

### Runtime Security
- Non-root container execution
- Read-only filesystem where possible
- Resource limits and quotas
- Secure secret handling