An open API service indexing awesome lists of open source software.

https://github.com/garryshtern/network-device-upgrade-system

Production-ready AWX-based network device upgrade management system for 1000+ heterogeneous network devices with comprehensive validation, security, and monitoring.
https://github.com/garryshtern/network-device-upgrade-system

ansible awx cisco devops firmware-upgrade fortinet grafana influxdb infrastructure-automation monitoring netbox network-automation network-devices network-engineering opengear validation

Last synced: about 1 month ago
JSON representation

Production-ready AWX-based network device upgrade management system for 1000+ heterogeneous network devices with comprehensive validation, security, and monitoring.

Awesome Lists containing this project

README

          

# Network Device Upgrade Management System

A complete AWX-based network device upgrade management system designed for managing firmware upgrades across 1000+ heterogeneous network devices with comprehensive validation, security, and monitoring.

## Overview

This system provides automated firmware upgrade capabilities for:
- **Cisco NX-OS** (Nexus Switches) with ISSU support
- **Cisco IOS-XE** (Enterprise Routers/Switches) with Install Mode
- **Opengear** (Console Servers/Smart PDUs) with multi-architecture support
- **FortiOS** (Fortinet Firewalls) with HA coordination

**Status**: Production ready for all platforms. See [Platform Implementation Status](docs/platform-implementation-status.md) for detailed status.

## Key Features

### โœ… **Phase-Separated Upgrade Process**
- **Phase 1**: Image Loading (business hours safe)
- **Phase 2**: Image Installation (maintenance window)
- Complete rollback capabilities

### ๐Ÿ”’ **Maximum Security Compliance**
- **Server-Initiated PUSH Transfers Only** - All firmware pushed from upgrade server to devices
- **Zero Device-Initiated Operations** - No device-to-server connections for firmware retrieval
- **SSH Key Authentication Priority** - SSH keys preferred over password authentication
- **SHA512 Hash Verification** - Complete integrity validation for all firmware images
- **Cryptographic Signature Verification** - Where supported by platform
- **Complete Security Audit Trail** - All operations logged and verified

### ๐Ÿ“Š **Advanced Validation**
- Pre/post upgrade network state comparison
- BGP, BFD, IGMP/multicast, routing validation
- IPSec tunnel and VPN connectivity validation
- Interface optics and transceiver health monitoring
- Protocol convergence timing with baseline comparison

### ๐Ÿš€ **Enterprise Integration**
- Native systemd service deployment (AWX and NetBox)
- Pre-existing NetBox integration
- InfluxDB v2 metrics integration
- โœ… **Complete Grafana dashboard automation** with multi-environment support
- โœ… **Real-time operational monitoring** with 15-second refresh dashboards
- Existing monitoring system integration

## Quick Start

### System Installation

```bash
# 1. Install base system
./install/setup-system.sh

# 2. Setup AWX with native services
./install/setup-awx.sh

# 3. Setup NetBox with native services
./install/setup-netbox.sh

# 4. Configure monitoring integration
./install/configure-telegraf.sh

# 5. Set up SSL certificates
./install/setup-ssl.sh

# 6. Start all services
./install/create-services.sh

# 7. Deploy Grafana dashboards
cd integration/grafana
export INFLUXDB_TOKEN="your_token_here"
./provision-dashboards.sh
```

### Workflow Execution

**Single Entry Point**: All upgrade operations use `main-upgrade-workflow.yml` with tag-based execution.

```bash
# Health check (connectivity validation) - STEP 1
ansible-playbook ansible-content/playbooks/main-upgrade-workflow.yml --tags step1 \
-e target_hosts=mydevice -e max_concurrent=5

# Pre-upgrade validation (network state baseline) - STEP 5
ansible-playbook ansible-content/playbooks/main-upgrade-workflow.yml --tags step5 \
-e target_hosts=mydevice -e target_firmware=fw.bin -e max_concurrent=5

# Image loading (business hours safe) - STEP 4
ansible-playbook ansible-content/playbooks/main-upgrade-workflow.yml --tags step4 \
-e target_hosts=mydevice -e target_firmware=fw.bin -e max_concurrent=5

# Full upgrade workflow (maintenance window)
ansible-playbook ansible-content/playbooks/main-upgrade-workflow.yml \
-e target_hosts=mydevice -e target_firmware=fw.bin \
-e max_concurrent=5 -e maintenance_window=true
```

**Container Usage** (Docker/Podman):
```bash
# Health check using container
docker run --rm -v $(pwd)/inventory:/inventory \
ghcr.io/garryshtern/network-device-upgrade-system:latest \
playbook main-upgrade-workflow.yml --tags step1 \
-e target_hosts=mydevice -e max_concurrent=5

# Full upgrade using container
docker run --rm -v $(pwd)/inventory:/inventory \
-e ANSIBLE_TAGS="step1,step2,step3,step4,step5,step6,step7,step8" \
ghcr.io/garryshtern/network-device-upgrade-system:latest \
playbook main-upgrade-workflow.yml \
-e target_hosts=mydevice -e target_firmware=fw.bin \
-e max_concurrent=5 -e maintenance_window=true
```

**Deprecated Playbooks**: Individual playbooks have been consolidated into the main workflow:
- `health-check.yml` โ†’ Use `--tags step1` instead
- `network-validation.yml` โ†’ Use `--tags step5` (pre-upgrade) or `--tags step7` (post-upgrade)
- `image-loading.yml` โ†’ Use `--tags step4` instead
- `image-installation.yml` โ†’ Use `--tags step6` instead
- `emergency-rollback.yml` โ†’ Use `--tags step8` instead

**Standalone Operational Playbooks** (still separate):
- `compliance-audit.yml` - Security and compliance auditing
- `config-backup.yml` - Configuration backup operations

## ๐Ÿงช Testing Framework

**Comprehensive testing capabilities for Mac/Linux development without physical devices:**

### ๐Ÿ“Š **Current Test Results** (Updated: November 5, 2025)
- **โœ… Syntax Validation: 100% CLEAN** - All 129+ Ansible files pass syntax checks
- **โœ… Comprehensive Test Suite: 100% PASS** - All 50 test suites passing โœ…
- **โœ… Critical Gap Test Suite: 100% PASS** - All 5 business-critical tests passing ($2.8M risk mitigation) โœ…
- **โœ… Security Validation: 100% COMPLIANT** - All secure transfer and security boundary tests passing
- **โœ… Container Integration: SUCCESS** - Multi-architecture images (amd64/arm64) available
- **โœ… End-to-End Testing: VERIFIED** - Complete workflow validation across all platforms

### ๐Ÿš€ **Quick Testing**
```bash
# Syntax validation (100% clean)
ansible-playbook --syntax-check ansible-content/playbooks/main-upgrade-workflow.yml \
-e target_hosts=localhost -e target_firmware=test.bin \
-e maintenance_window=true -e max_concurrent=1

# Mock device testing (all 5 platforms)
ansible-playbook -i tests/mock-inventories/all-platforms.yml --check \
ansible-content/playbooks/main-upgrade-workflow.yml \
-e target_hosts=all -e target_firmware=test.bin \
-e maintenance_window=true -e max_concurrent=5

# Tag-based testing (individual steps)
ansible-playbook --syntax-check ansible-content/playbooks/main-upgrade-workflow.yml \
--tags step1 -e target_hosts=localhost -e max_concurrent=1

# Complete test suite
./tests/run-all-tests.sh

# Molecule testing (requires Docker)
cd tests/molecule-tests && molecule test

# Container testing (production ready)
docker run --rm ghcr.io/garryshtern/network-device-upgrade-system:latest
podman run --rm ghcr.io/garryshtern/network-device-upgrade-system:latest
```

### โœ… **Testing Categories - FULLY IMPLEMENTED**
- **Mock Inventory Testing** - Simulated device testing for all platforms โœ…
- **Variable Validation** - Requirements and constraint validation โœ…
- **Template Rendering** - Jinja2 template testing without connections โœ…
- **Workflow Logic** - Decision path and conditional testing โœ…
- **Error Handling** - Error condition and recovery validation โœ…
- **Integration Testing** - Complete workflow with mock devices โœ…
- **Performance Testing** - Execution time and resource measurement โœ…
- **Molecule Testing** - Container-based advanced testing โœ…
- **Platform-Specific Testing** - Vendor-specific comprehensive testing โœ…
- **YAML/JSON Validation** - File syntax and structure validation โœ…
- **CI/CD Integration** - GitHub Actions automated testing โœ…

**See comprehensive guide**: [Documentation Hub](docs/README.md) - Complete testing and setup documentation

## ๐Ÿ“š Documentation

**Complete documentation with architectural diagrams and implementation guides:**

- **[๐Ÿ“– Documentation Hub](docs/README.md)** - Start here for comprehensive guides
- **[โš™๏ธ Installation & Configuration](CLAUDE.md)** - Complete system documentation including installation, parameters, and troubleshooting
- **[๐Ÿ”„ Upgrade Workflow Guide](docs/user-guides/upgrade-workflow-guide.md)** - Upgrade process and safety mechanisms
- **[๐Ÿณ Container Deployment Guide](docs/user-guides/container-deployment.md)** - Docker/Podman deployment
- **[๐Ÿ—๏ธ Platform Implementation Status](docs/platform-guides/platform-implementation-status.md)** - Technical implementation details and feature support
- **[๐Ÿงช Pre-Commit Setup Guide](docs/testing/pre-commit-setup.md)** - Quality gates and testing requirements
- **[๐Ÿ“ Internal Documentation Index](docs/internal/INDEX.md)** - Developer reference guides and analysis documents

## Architecture

### System Overview

```mermaid
graph TD
A[AWX Services
Job Control
systemd] --> B[Ansible Engine
Playbook Execution
Role-Based]
B --> C[Network Devices
1000+ Supported
Multi-Vendor]

D[NetBox
Inventory DB
Pre-existing] --> B
E[Telegraf
Metrics Agent
Collection] --> F[InfluxDB v2
Time Series
Existing]

C --> F
F --> H[Grafana
Dashboards
Existing]

C -.-> I[Cisco NX-OS]
C -.-> J[Cisco IOS-XE]
C -.-> K[FortiOS]
C -.-> L[Opengear]

style A fill:#e1f5fe
style C fill:#f3e5f5
style F fill:#e8f5e8
style H fill:#fff3e0
```

**Alternative System Flow:**

| Component | Function | Integration |
|-----------|----------|-------------|
| **AWX Services (systemd)** | Job orchestration and workflow control | โ†’ Ansible Engine |
| **Ansible Engine** | Playbook execution and device automation | โ†’ Network Devices |
| **NetBox (Pre-existing)** | Device inventory and IPAM management | โ†’ Ansible Engine |
| **Telegraf** | Metrics collection agent | โ†’ InfluxDB v2 |
| **Network Devices** | Target devices for upgrades | โ†’ Metrics Export |
| **InfluxDB v2** | Time-series metrics storage | โ†’ Grafana |
| **Grafana** | Monitoring dashboards and visualization | Final consumer |

### Component Interaction Flow

```mermaid
flowchart TD
U[User Request] --> A[AWX Web UI]
A --> B[Job Templates]
B --> C[Workflows]

B --> D[Dynamic Inventory]
D --> E[NetBox
Device Data
Variables]
C --> F[Ansible Execution]
D --> F

F --> G[Network Devices]
G --> H[Metrics Collection]
H --> I[InfluxDB]
I --> J[Grafana
Dashboards]

subgraph "Job Templates"
B1[Health Check]
B2[Image Load]
B3[Validation]
end

subgraph "Workflows"
C1[Phase 1: Load]
C2[Phase 2: Install]
C3[Phase 3: Verify]
end

style U fill:#ffeb3b
style G fill:#f3e5f5
style I fill:#e8f5e8
style J fill:#fff3e0
```

**Simplified Data Flow:**

1. **User Request** โ†’ AWX Web Interface
2. **AWX** โ†’ Executes Ansible playbooks
3. **Ansible** โ†’ Connects to network devices via SSH/API
4. **NetBox** โ†’ Provides device inventory to Ansible
5. **Network Devices** โ†’ Export metrics during operations
6. **Telegraf** โ†’ Collects metrics and sends to InfluxDB
7. **InfluxDB** โ†’ Stores time-series data for Grafana
8. **Grafana** โ†’ Displays dashboards and reports to users

## Resource Requirements

### Minimum System Requirements
- **OS**: RHEL/CentOS 8+ or Ubuntu 20.04+
- **CPU**: 4 cores minimum
- **RAM**: 8GB minimum
- **Storage**: 100GB+ for firmware and logs
- **Network**: Reliable connectivity to all managed devices

### Software Requirements
- **Python**: 3.14.0 with pip - *Latest stable version (released Oct 7, 2025)*
- **Ansible**: 11.0.0 (ansible-core 2.18.10) - *Latest stable version*
- **Git**: Latest stable version

### Supported Platforms
- **Single Server Deployment**: No clustering required
- **Container-based AWX**: Podman/Docker container deployment
- **Pre-existing NetBox**: Uses existing NetBox installation
- **SystemD User Services**: Native Linux user service management for base components

## Directory Structure

```
network-upgrade-system/
โ”œโ”€โ”€ deployment/ # Service-based deployment structure
โ”‚ โ”œโ”€โ”€ system/ # Base system setup (SSL, system config)
โ”‚ โ”œโ”€โ”€ services/ # Individual service deployments
โ”‚ โ”‚ โ”œโ”€โ”€ awx/ # AWX automation platform
โ”‚ โ”‚ โ”œโ”€โ”€ netbox/ # NetBox IPAM & device inventory
โ”‚ โ”‚ โ”œโ”€โ”€ grafana/ # โœ… Complete dashboard automation
โ”‚ โ”‚ โ”œโ”€โ”€ telegraf/ # Metrics collection
โ”‚ โ”‚ โ””โ”€โ”€ redis/ # Caching & job queue
โ”‚ โ””โ”€โ”€ scripts/ # General deployment scripts
โ”œโ”€โ”€ ansible-content/ # Ansible automation content
โ”‚ โ”œโ”€โ”€ playbooks/ # Main orchestration playbooks
โ”‚ โ”œโ”€โ”€ roles/ # Vendor-specific upgrade roles
โ”‚ โ””โ”€โ”€ collections/ # Ansible collection requirements
โ”œโ”€โ”€ tests/ # Comprehensive test suites
โ”œโ”€โ”€ docs/ # Complete documentation
โ”œโ”€โ”€ tools/ # Development and utility tools
โ””โ”€โ”€ .claude/ # Claude Code commands and workflows
```

## Workflow Execution Modes

The system uses **`main-upgrade-workflow.yml`** as the single entry point for all upgrade operations. Individual steps can be executed using Ansible tags, with automatic dependency resolution.

### Available Execution Tags

| Tag | Step Name | Description | Dependencies | Safe During Business Hours |
|-----|-----------|-------------|--------------|---------------------------|
| `step1` | Connectivity Check | Initial SSH/NETCONF connectivity validation | None | โœ… Yes |
| `step2` | Version Check | Collect current firmware version and verify file exists | step1 (direct); steps 1-2 (via tags) | โœ… Yes |
| `step3` | Space Check | Verify sufficient disk space, auto-clean if needed | step1 (direct); steps 1-3 (via tags) | โœ… Yes |
| `step4` | Image Upload | Upload firmware and verify SHA512 hash (PHASE 1) | step1 (direct); steps 1-4 (via tags) | โœ… Yes |
| `step5` | Config Backup & Pre-Validation | Backup config and capture network state baseline | step1 (direct); steps 1-5 (via tags) | โœ… Yes |
| `step6` | Installation & Reboot | Install firmware and reboot device (PHASE 2) | step1 (direct); steps 1-6 (via tags) | โš ๏ธ Maintenance Window |
| `step7` | Post-Upgrade Validation | Validate network state after upgrade (PHASE 3) | step1 (direct); steps 1-7 (via tags) | โš ๏ธ Maintenance Window |
| `step8` | Emergency Rollback | Restore previous firmware and configuration | step1 (direct); triggered by step7 or manual | โš ๏ธ Maintenance Window |

### Execution Examples

**Individual Step Execution:**
```bash
# Run only health check (STEP 1)
ansible-playbook ansible-content/playbooks/main-upgrade-workflow.yml \
--tags step1 \
-e target_hosts=mydevice \
-e max_concurrent=5

# Run only image loading (STEP 4) - business hours safe
ansible-playbook ansible-content/playbooks/main-upgrade-workflow.yml \
--tags step4 \
-e target_hosts=mydevice \
-e target_firmware=nxos-10.3.5.bin \
-e max_concurrent=5
```

**Multiple Step Execution:**
```bash
# Run PHASE 1: Health check + backup + image loading
ansible-playbook ansible-content/playbooks/main-upgrade-workflow.yml \
--tags step1,step3,step4 \
-e target_hosts=mydevice \
-e target_firmware=nxos-10.3.5.bin \
-e max_concurrent=5

# Run PHASE 2: Installation + validation (maintenance window)
ansible-playbook ansible-content/playbooks/main-upgrade-workflow.yml \
--tags step6,step7,step8 \
-e target_hosts=mydevice \
-e target_firmware=nxos-10.3.5.bin \
-e maintenance_window=true \
-e max_concurrent=5
```

**Full Workflow Execution:**
```bash
# Execute all steps (complete upgrade)
ansible-playbook ansible-content/playbooks/main-upgrade-workflow.yml \
-e target_hosts=mydevice \
-e target_firmware=nxos-10.3.5.bin \
-e maintenance_window=true \
-e max_concurrent=5
```

### Required Variables by Execution Mode

| Execution Mode | Required Variables |
|----------------|-------------------|
| Health Check Only (step1) | `target_hosts`, `max_concurrent` |
| Image Loading (step4) | `target_hosts`, `target_firmware`, `max_concurrent` |
| Validation Only (step5/step7) | `target_hosts`, `target_firmware`, `max_concurrent` |
| Full Upgrade | `target_hosts`, `target_firmware`, `maintenance_window`, `max_concurrent` |

### Automatic Dependency Resolution

**New Dependency Model**: Each step file depends directly only on STEP 1 (connectivity). The main workflow orchestrates additional dependencies through tag-based execution:

- **Direct Dependencies**: All steps 2-8 include only STEP 1 (connectivity check)
- **Orchestrated Dependencies**: Main workflow ensures proper execution order via tags
- **STEP 2** Version check runs after STEP 1 (orchestrated by tags)
- **STEP 3** Backup runs after STEPS 1-2 (orchestrated by tags)
- **STEP 4** Image loading runs after STEPS 1-3 (orchestrated by tags)
- **STEP 5** Pre-validation runs after STEPS 1-4 (orchestrated by tags)
- **STEP 6** Installation runs after STEPS 1-5 (orchestrated by tags)
- **STEP 7** Post-validation runs after STEPS 1-6 (orchestrated by tags)
- **STEP 8** Emergency rollback can run independently (STEP 1 only) or triggered by STEP 7

**Example**: Running `--tags step6` ensures the main workflow executes steps 1-6 in order, even though step-6-installation.yml only includes step-1-connectivity.yml directly.

### Playbook Migration Guide

For users migrating from legacy individual playbooks:

| Legacy Playbook | New Command |
|----------------|-------------|
| `health-check.yml` | `main-upgrade-workflow.yml --tags step1` |
| `network-validation.yml` (pre) | `main-upgrade-workflow.yml --tags step5` |
| `network-validation.yml` (post) | `main-upgrade-workflow.yml --tags step7` |
| `image-loading.yml` | `main-upgrade-workflow.yml --tags step4` |
| `image-installation.yml` | `main-upgrade-workflow.yml --tags step6` |
| `emergency-rollback.yml` | `main-upgrade-workflow.yml --tags step8` |

**Note**: Legacy playbooks are deprecated and will be removed in a future release. Migrate to tag-based execution.

## Support

For technical support and questions:
- Check the [CLAUDE.md](CLAUDE.md) for complete documentation and troubleshooting
- Review platform-specific procedures in [Platform Implementation Guide](docs/platform-guides/platform-implementation-status.md)
- Examine log files in `$HOME/.local/share/network-upgrade/logs/`
- Use the built-in health check: `./scripts/system-health.sh`

## License

This project is licensed under the MIT License - see the LICENSE file for details.