https://github.com/tysker/cloud_devops_lab

# DevOps Project

## Project Description

This repository is a complete end-to-end DevOps learning project built around a small Python Flask
application. All access follows a bastion-based, non-root security model.

The goal is to gradually build a realistic production-like environment that includes:

- containerization with Docker
- CI/CD pipelines (GitHub Actions)
- artifact registries (Docker Registry & GitHub Packages)
- infrastructure provisioning (Terraform)
- configuration management (Ansible roles)
- monitoring and visualization (Prometheus & Grafana)
- security best practices (jump host, SSH hardening, TLS certificates)

The project grows in clear stages. Each stage is documented with **what was done**, **why it matters**,
and **how it was implemented**, so it becomes both a learning journal and a portfolio project.

**Current status:** Stages 1–11 completed. The application is deployed, monitored with Prometheus and Grafana, and served over HTTPS using Caddy and Let’s Encrypt. SSH access is restricted to a bastion host and allowlisted source IPs.

## Structure

Current project layout:

```
cloud_devops_lab/
├── ansible
│   ├── ansible.cfg
│   ├── ansible.log
│   ├── group_vars
│   │   ├── all
│   │   │   └── vars.yml
│   │   ├── app
│   │   │   └── vars.yml
│   │   └── monitoring
│   │       ├── vars.yml
│   │       └── vault.yml
│   ├── hosts.ini
│   ├── playbooks
│   │   ├── bootstrap_1.yml
│   │   ├── bootstrap_2.yml
│   │   ├── caddy.yml
│   │   ├── deploy_app.yml
│   │   ├── monitoring_grafana.yml
│   │   ├── monitoring_node_exporter.yml
│   │   ├── monitoring_prometheus.yml
│   │   ├── security_fail2ban.yml
│   │   └── unattended_upgrades.yml
│   ├── README.md
│   └── roles
│       ├── bootstrap_user
│       ├── caddy
│       ├── common
│       ├── deploy_app
│       ├── docker
│       ├── fail2ban
│       ├── grafana
│       ├── node_exporter
│       ├── prometheus
│       ├── ssh_hardening
│       └── unattended_upgrades
├── app
│   ├── Dockerfile
│   ├── gunicorn.conf.py
│   ├── requirements.txt
│   ├── src
│   │   ├── app.py
│   │   ├── routes
│   │   │   ├── health.py
│   │   │   ├── metrics.py
│   │   │   └── root.py
│   │   └── utils
│   │       └── counters.py
│   └── venv
├── docs
│   └── project-checklist.md
├── IAAS.md
├── infrastructure
│   └── terraform
│       ├── main.tf
│       ├── modules
│       │   └── compute
│       ├── outputs.tf
│       ├── providers.tf
│       ├── terraform.tfstate
│       ├── terraform.tfstate.backup
│       ├── terraform.tfvars
│       └── variables.tf
├── LICENSE
└── README.md

```

## Requirements (current)

- Python 3.12+
- pip / venv
- Git
- Ansible
- Terraform
- Linode for server hosting
- Cloudflare (DNS)
- Domain registrar
- Grafana
- Prometheus

## Running the Application Locally

```
python -m venv venv
source venv/bin/activate
pip install -r app/requirements.txt
python -m app.src.app
```

The application runs at `http://localhost:5000/`.

## Stages

The project is built in incremental stages. Each stage adds a new DevOps capability on top of the existing system.

### Stages Overview

- Stage 1: Flask application
- Stage 2: Containerization with Docker
- Stage 3: CI/CD pipeline (GitHub Actions & GHCR)
- Stage 4: Infrastructure (Terraform – servers, networking, firewalls)
- Stage 5: DNS & domain management (Cloudflare)
- Stage 6: Ansible bootstrap & access control
- Stage 7: SSH hardening
- Stage 8: Docker installation (via Ansible)
- Stage 9: Application deployment
- Stage 10: Monitoring stack (Prometheus & Grafana)
- Stage 11: TLS certificates & reverse proxy (Caddy)

### Stage 1 — Flask Application

**What:** Implemented a minimal Flask API with initial routing.
**Why:** A simple application is required before adding Docker, CI/CD, infrastructure and monitoring.
**How:** Created project folder structure, used Blueprints, tested locally with Python.

- Basic Flask application runs locally.
- Endpoints:
  - `/` – root
  - `/health`
- Foundation for Dockerization, CI/CD, monitoring and future infrastructure work.

### Stage 2 — Containerization with Docker

**What:**
Created a production-ready Dockerfile for the Flask application using a multi-stage build.

**Why:**
Containerizing the application allows consistent deployment across environments and provides the
foundation for CI/CD pipelines, registries, deployment automation, and infrastructure scaling.

**How:**

- Implemented a two-stage Dockerfile (builder + runtime).
- Installed dependencies in an isolated build layer.
- Copied only necessary runtime dependencies into a slim final image.
- Added a non-root application user for security.
- Added a Docker HEALTHCHECK hitting `/health`.
- Exposed port 5000 and used Gunicorn as the production WSGI (Web Server Gateway Interface) server.
- Built and ran the image locally to verify functionality.

**How to build and run**

1. Build image: `docker build -t cloud-devops-app:0.1 .`
2. Run container: `docker run -p 5000:5000 cloud-devops-app:0.1`
3. Test health endpoint: `curl http://localhost:5000/health`

### Stage 3 — CI/CD Pipeline (GHCR Integration)

**What:**
Extended the GitHub Actions workflow to build Docker images with tags and push them to
GitHub Container Registry (GHCR).

**Why:**
A registry is required for deployment automation and ensures versioned, reproducible artifacts
that can be pulled by servers during deployment.

**How:**

- Added permissions for GitHub Actions to write to GHCR.
- Logged in to GHCR using `GITHUB_TOKEN`.
- Created two image tags (`latest` and short commit SHA).
- Pushed images automatically on changes to `develop` and `main`.
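
The tagging scheme above can be sketched in a few lines of Python. This is only an illustration, not the workflow's actual code: the function name `image_tags` and the 7-character short SHA length are assumptions (the length matches the pinned tag `77ecd38` used in Stage 9).

```python
def image_tags(image: str, commit_sha: str, short_len: int = 7) -> list[str]:
    """Return the two tags pushed for one build: `latest` and the short commit SHA."""
    if len(commit_sha) < short_len:
        raise ValueError("commit SHA is shorter than the requested short length")
    # Assumption: the short tag is the first `short_len` characters, lowercased.
    short_sha = commit_sha[:short_len].lower()
    return [f"{image}:latest", f"{image}:{short_sha}"]
```

Calling `image_tags("ghcr.io/tysker/cloud_devops_app", sha)` with a full commit SHA yields one moving tag (`latest`) and one immutable tag that a deployment can pin.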

### Stage 4 - Infrastructure (Terraform – servers, networking, firewalls)

Infrastructure is provisioned using Terraform on Linode (Akamai).

#### Architecture Overview

- **Jump Server**
  - Public + private IP
  - SSH entry point (bastion host)

- **Application Server**
  - Private network only
  - Runs application containers

- **Monitoring Server**
  - Private network only
  - Runs Prometheus and Grafana

All servers share a private network.
Only the jump server is reachable from the public internet.

#### Security Model

- Bastion (jump server) pattern
- SSH key authentication only
- No private keys stored on servers
- App and monitoring servers accessible only via private network
- Network access enforced using Linode Firewalls
- SSH agent forwarding used for hop-based access

#### Terraform Structure

```
infrastructure
└── terraform
    ├── main.tf
    ├── modules
    │   └── compute
    │       ├── main.tf
    │       ├── outputs.tf
    │       ├── providers.tf
    │       └── variables.tf
    ├── outputs.tf
    ├── providers.tf
    ├── terraform.tfstate
    ├── terraform.tfstate.backup
    ├── terraform.tfvars
    └── variables.tf
```

This stage establishes the baseline infrastructure but does not yet deploy applications.

### Stage 5 - DNS & Domain Management

The domain `clouddevopslab.eu` is registered at simply.com and delegated to Cloudflare
for DNS management and security features.

### Stage 6 — Ansible Bootstrap & Access Control

**What:**
Introduced Ansible to centrally manage all servers using a bastion (jump host) model.
Bootstrapped a non-root `devops` user with SSH key access and sudo privileges.

**Why:**
Manual server configuration does not scale and is error-prone.
Ansible provides reproducible, auditable configuration management and enforces
least-privilege access by avoiding root logins.

**How:**

- Configured Ansible inventory with a jump host (bastion pattern).
- Enabled SSH agent forwarding for secure multi-hop access.
- Created a reusable `common` role for connectivity checks.
- Added a `bootstrap_user` role to:
  - create a `devops` user
  - configure passwordless sudo
  - install SSH public keys
- Switched Ansible to run as `devops` with privilege escalation (`become`).

### Stage 7 — SSH Hardening

**What:**
Hardened SSH access across all servers by disabling insecure authentication
methods and enforcing least-privilege access.

**Why:**
SSH is the primary attack surface on servers. Hardening reduces the risk of
brute-force attacks, credential abuse, and privilege escalation.

**How:**

- Disabled password-based SSH authentication.
- Disabled challenge-response authentication.
- Restricted SSH access using an explicit `AllowUsers` list.
- Disabled root SSH login entirely.
- Enforced bastion-only access using a jump host.
- Ensured Ansible operates as a non-root user with controlled privilege escalation.
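
One way to verify these settings after a playbook run is to read the rendered `sshd_config` and check the relevant keywords. The sketch below is a hypothetical helper, not part of the `ssh_hardening` role: the names `parse_sshd_config` and `hardening_issues` are made up for illustration, and the check ignores `Match` blocks and `Include` files.

```python
# Keywords from the hardening list above, with the value each should have.
REQUIRED = {
    "passwordauthentication": "no",
    "challengeresponseauthentication": "no",
    "permitrootlogin": "no",
}


def parse_sshd_config(text: str) -> dict[str, str]:
    """Parse `Keyword value` lines; sshd uses the first value it sees per keyword."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        # Keywords are case-insensitive, so normalise them to lowercase.
        settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings


def hardening_issues(text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    settings = parse_sshd_config(text)
    issues = [
        f"{key} should be '{want}', found '{settings.get(key, 'unset')}'"
        for key, want in REQUIRED.items()
        if settings.get(key, "").lower() != want
    ]
    if "allowusers" not in settings:
        issues.append("allowusers is not set")
    return issues
```

Passing the contents of `/etc/ssh/sshd_config` to `hardening_issues` returns an empty list when password login, challenge-response and root login are all disabled and an `AllowUsers` list is present.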

### Stage 8 — Docker Installation

**What:**
Installed Docker Engine and Docker Compose plugin on application and monitoring servers.

**Why:**
Containers provide consistent, reproducible runtime environments and are the foundation
for application deployment and monitoring.

**How:**

- Installed Docker from the official Docker APT repository.
- Enabled and started the Docker service.
- Added the non-root `devops` user to the `docker` group.
- Verified operation using `docker run hello-world`.

Docker is intentionally not installed on the jump server.

### Stage 9 — Application Deployment (Docker + Ansible)

**What:**
Deployed the Flask application container to the application server using Ansible.

**Why:**
A repeatable deployment reduces manual steps and ensures consistent environments.

**How:**
- Pulled a pinned image tag from GHCR (`ghcr.io/tysker/cloud_devops_app:77ecd38`).
- Ran the container with `restart: unless-stopped`.
- Exposed HTTP on port 80 mapped to container port 5000.
- Added an Ansible health check against `/health`.
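
The health check itself runs in Ansible; the Python sketch below only illustrates the same idea (poll `/health` until it answers with HTTP 200, give up after a few tries). The function names, retry count and delay are assumptions chosen for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a 4xx/5xx response (HTTPError).
        return False


def wait_until(check: Callable[[], bool], attempts: int = 5, delay: float = 2.0) -> bool:
    """Call `check` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            time.sleep(delay)
    return False
```

On the application server this could be used as `wait_until(lambda: http_ok("http://localhost:80/health"))`, which tolerates the short window in which the container is still starting.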

### Stage 10 — Monitoring stack (Prometheus & Grafana)

This stage introduces full observability for both the infrastructure and the application.

#### Part 1 — Node Exporter

**What:**
Deployed Node Exporter on the application and monitoring servers.

**Why:**
Host-level metrics (CPU, memory, disk, network) are essential for understanding system health and capacity.

**How:**
- Installed Node Exporter via Docker using Ansible.
- Metrics exposed on port `9100`.
- Targets scraped via private IPs.

---

#### Part 2 — Prometheus

**What:**
Deployed Prometheus on the monitoring server.

**Why:**
Prometheus acts as the central metrics collection and storage system.

**How:**
- Prometheus deployed via Docker using Ansible.
- Configuration rendered from a template (`prometheus.yml`).
- Scrapes:
  - Node Exporter on app + monitoring servers
  - Flask application metrics
- Persistent data directory mounted on the host.
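
To make the scrape setup more concrete, here is a small Python sketch that renders a `prometheus.yml` with the two kinds of targets listed above. The project renders this file from an Ansible template, so this is only an approximation of the result: the job names, the scrape interval and the function name are assumptions.

```python
def render_prometheus_config(
    node_targets: list[str],
    app_targets: list[str],
    scrape_interval: str = "15s",
) -> str:
    """Render a minimal prometheus.yml with one job per target group."""

    def target_list(targets: list[str]) -> str:
        # YAML flow sequence of quoted "host:port" strings.
        return ", ".join(f'"{target}"' for target in targets)

    return "\n".join(
        [
            "global:",
            f"  scrape_interval: {scrape_interval}",
            "",
            "scrape_configs:",
            "  - job_name: node_exporter",  # hypothetical job name
            "    static_configs:",
            f"      - targets: [{target_list(node_targets)}]",
            "  - job_name: flask_app",  # hypothetical job name
            "    metrics_path: /metrics",
            "    static_configs:",
            f"      - targets: [{target_list(app_targets)}]",
            "",
        ]
    )
```

With hypothetical private addresses, `render_prometheus_config(["10.0.0.2:9100", "10.0.0.3:9100"], ["10.0.0.2:80"])` produces a config that scrapes Node Exporter on both servers and the Flask application on port 80.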

---

#### Part 3 — Grafana

**What:**
Deployed Grafana for metrics visualization.

**Why:**
Metrics are only useful if they can be explored and visualized effectively.

**How:**
- Grafana deployed via Docker using Ansible.
- Prometheus configured as a data source.
- Access restricted to SSH port forwarding (no public exposure).
- Imported **Node Exporter Full** dashboard (ID 1860).

---

#### Part 4 — Flask application metrics

**What:**
Exposed application metrics in Prometheus format.

**Why:**
Application-level observability enables insight into runtime behavior, performance, and stability.

**How:**
- Added `/metrics` endpoint using `prometheus_client`.
- Removed the earlier JSON-based metrics endpoint.
- Prometheus scrapes the app at:
  - `http://:80/metrics`
- Metrics verified in Prometheus and visualized in Grafana.

### Stage 11 — TLS certificates & reverse proxy (Caddy) + hardening

This stage secures the application with HTTPS and adds additional server hardening.

#### Part 1 — Reverse proxy + HTTPS (Caddy)

**What:**
Deployed Caddy on the application server to act as a reverse proxy and terminate TLS.

**Why:**
HTTPS is required for production-like deployments. A reverse proxy enables secure traffic, clean routing, and allows the application container to stay private (localhost only).

**How:**
- Opened inbound port 443 on the application firewall.
- Deployed Caddy via Ansible using Docker (`network_mode: host`).
- Configured Caddy to serve:
  - `clouddevopslab.eu` and `www.clouddevopslab.eu` via HTTPS (Let’s Encrypt)
  - private-IP HTTP access for Prometheus scraping
- Added basic security headers in the Caddyfile.
- Updated app deployment so the Flask container is bound to `127.0.0.1:5000` (not publicly reachable).

#### Part 2 — Stage 11 hardening (Option A)

**What:**
Implemented baseline security hardening for the environment.

**Why:**
Reduce attack surface and align with least-privilege and operational security practices.

**How:**
- Restricted SSH access to the jump server using a Terraform allowlist (`ssh_allowed_ips`).
- Installed and enabled Fail2ban on the jump server (`sshd` jail).
- Enabled automatic security updates (`unattended-upgrades`) on all servers.
- Moved Grafana admin password into **Ansible Vault** (no secrets stored in Git).
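
The effect of an allowlist such as `ssh_allowed_ips` can be illustrated with Python's `ipaddress` module. The real enforcement happens in the Linode firewall created by Terraform; the sketch below only mirrors the matching logic, and the CIDR ranges in the usage note are documentation addresses, not the project's values.

```python
import ipaddress


def is_allowed(source_ip: str, allowed_cidrs: list[str]) -> bool:
    """Return True if `source_ip` falls inside any of the allowlisted CIDR ranges."""
    address = ipaddress.ip_address(source_ip)
    # strict=False accepts entries such as "203.0.113.5/24" with host bits set.
    networks = (ipaddress.ip_network(cidr, strict=False) for cidr in allowed_cidrs)
    return any(address in network for network in networks)
```

For example, with `["203.0.113.0/24", "198.51.100.7/32"]` as the allowlist, a connection from `203.0.113.42` would be accepted and one from `198.51.100.8` would be dropped.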

### Access Model

- Direct SSH access is allowed only to the jump server.
- All internal servers are accessed via the jump server using SSH agent forwarding.
- Ansible connects as a non-root `devops` user and escalates privileges only when required.
- Root SSH login is fully disabled.
- All access is performed via the non-root `devops` user with sudo escalation.
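
As a rough illustration of the hop through the jump server, the sketch below builds an `ssh` command line using OpenSSH's `-J` (jump host) and `-A` (agent forwarding) options. The README does not show the exact invocation, so the use of `-J`, the function name and the host names in the usage note are all assumptions.

```python
import shlex


def ssh_via_bastion(
    bastion_host: str,
    target_host: str,
    user: str = "devops",
    forward_agent: bool = True,
) -> list[str]:
    """Build an ssh argv that reaches `target_host` through `bastion_host`."""
    command = ["ssh"]
    if forward_agent:
        command.append("-A")  # forward the local SSH agent; no private keys on servers
    command += ["-J", f"{user}@{bastion_host}", f"{user}@{target_host}"]
    return command


def as_shell(command: list[str]) -> str:
    """Render the argv as a copy-pasteable shell command."""
    return shlex.join(command)
```

`as_shell(ssh_via_bastion("jump.example.com", "10.0.0.2"))` gives `ssh -A -J devops@jump.example.com devops@10.0.0.2`, where both host values are placeholders.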

#### DNS Flow

- Domain registered at simply.com
- Nameservers delegated to Cloudflare
- DNS records managed in Cloudflare
- Application traffic will later be proxied via Cloudflare

#### Current Records

- `clouddevopslab.eu` → A record → application server
- `www.clouddevopslab.eu` → A record → application server

At this stage, DNS records exist and the application is reachable via HTTPS through Caddy. Cloudflare proxy is still disabled (DNS-only).

Note: During early stages, application IP addresses may change when infrastructure
is recreated. A reserved IPv4 address will be introduced later to provide a stable
DNS target.

## Learning Log

A chronological log describing the work done in each stage.

## Next Step

- Proceeded to Stage 2: Containerization with Docker, where the application is packaged into a production-ready container image.
- Proceeded to Stage 3: CI/CD pipeline (GitHub Actions & GHCR integration)
- Proceeded to Stage 4: Infrastructure (Terraform – servers, networking, firewalls)
- Proceeded to Stage 5: DNS & domain management (Cloudflare)
- Proceeded to Stage 6: Ansible bootstrap & access control
- Proceeded to Stage 7: SSH hardening
- Proceeded to Stage 8: Docker installation (via Ansible)
- Proceeded to Stage 9: Application deployment using Docker and GHCR
- Proceeded to Stage 10: Monitoring stack (Prometheus & Grafana)
- Proceeded to Stage 11: TLS certificates & reverse proxy (Caddy)
- Next: Stage 12: Cloudflare proxy + restrict origin access to Cloudflare IP ranges

Stage 11 introduced HTTPS, automatic TLS certificates, and a reverse proxy
in front of the application. This enables secure traffic, prepares the setup
for Cloudflare proxying, and allows stricter firewall rules on the application server.

## Git Workflow & Conventions

This repository uses a simple branching and commit strategy to keep the history clean and understandable.

### Branches

- `main`
  Always deployable and represents the most stable state. Release tags will be created from this branch.

- `develop`
  Integration branch for day-to-day work. Features, fixes and infrastructure changes are merged here before going to `main`.

- Short-lived branches
  All work is done on short-lived branches and merged via pull requests:
  - `feature/` – new functionality
  - `fix/` – bug fixes
  - `infra/` – infrastructure (Terraform, Ansible, etc.)
  - `docs/` – documentation updates
  - `ci/` – CI/CD pipeline changes

Examples:
- `feature/add-metrics-endpoint`
- `infra/add-terraform-app-server`
- `docs/update-readme-git-strategy`
- `ci/add-docker-build-workflow`

### Commit Messages

Commit messages follow a light version of Conventional Commits:

`<type>(<scope>): <description>`

Types used in this project:

- `feat` – new features (app or infra)
- `fix` – bug fixes
- `docs` – documentation changes
- `infra` – infrastructure code changes
- `ci` – CI/CD configuration
- `refactor` – code changes that don’t change behaviour
- `chore` – maintenance tasks, formatting, small cleanups

Examples:

- `feat(api): add /metrics endpoint`
- `docs(readme): document phase 1 (Flask app)`
- `infra(terraform): create linode instances for app and monitoring`
- `ci(docker): add image build and push workflow`
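
A convention like this can be checked automatically, for example in a commit hook. The sketch below is one possible check, not something the repository ships: it assumes the `type(scope): subject` shape seen in the examples above, with a mandatory lowercase scope, and the names `COMMIT_PATTERN` and `is_valid_commit` are made up for illustration.

```python
import re

# Types listed in this README.
COMMIT_TYPES = ("feat", "fix", "docs", "infra", "ci", "refactor", "chore")

# Assumption: the scope is required and limited to lowercase letters, digits, "_" and "-".
COMMIT_PATTERN = re.compile(
    rf"^(?P<type>{'|'.join(COMMIT_TYPES)})\((?P<scope>[a-z0-9_-]+)\): (?P<subject>\S.*)$"
)


def is_valid_commit(message: str) -> bool:
    """Check only the first line of the commit message against the convention."""
    lines = message.splitlines()
    if not lines:
        return False
    return COMMIT_PATTERN.match(lines[0]) is not None
```

All four example messages above pass this check, while something like `feature(api): add endpoint` is rejected because `feature` is not one of the listed types.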

## Infrastructure Changes

All infrastructure and configuration changes are performed via:

- Terraform (provisioning)
- Ansible (configuration)

Manual changes on servers are avoided to ensure reproducibility.

## License

TBD – Will be added later in the project.