https://github.com/vsingh55/homelab-ops
A production-grade Hybrid Cloud Platform spanning On-Prem (Proxmox) and GCP. Engineered with Terraform, Ansible, K3s, and WireGuard Mesh to demonstrate Zero-Trust networking, FinOps, and SRE principles.
- Host: GitHub
- URL: https://github.com/vsingh55/homelab-ops
- Owner: vsingh55
- License: mit
- Created: 2025-11-25T07:47:40.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-01-27T17:03:40.000Z (3 months ago)
- Last Synced: 2026-01-28T03:42:06.980Z (3 months ago)
- Topics: ansible, automation, devops, finops, gcp, gitops, grafana, hybrid-cloud, infrastructure-as-code, kubernetes, observability, platform-engineering, proxmox, self-hosted, sre, terraform, wireguard, zero-trust
- Language: HCL
- Homepage: https://vijaysingh.cloud/
- Size: 8.69 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 21
Metadata Files:
- Readme: README.md
- License: LICENSE
# Homelab-Ops: The Evolution of a Sovereign Cloud

> **"This is not just a server in a closet. It is an enterprise-standard R&D platform simulating real-world constraints—Data Sovereignty, CGNAT traversal, Zero-Trust Networking, and Automated Disaster Recovery."**
---
## 📖 The Engineering Journey (From Bare Metal to Hybrid Cloud)
This repository documents the complete lifecycle of building a production infrastructure from scratch. It is organized by **Engineering Phases**, showcasing how the architecture evolved to solve increasingly complex problems.
### Phase 1: The Bare Metal Foundation (v1.0.0)
**Goal:** Establish a virtualization platform and experiment with Kubernetes orchestration.
* **The Hardware:** Sourced a Mini PC (Intel i5, 16GB RAM) and configured storage tiering (NVMe for OS, HDD for Backups).
* **The Stack:** Installed **Proxmox VE** as the Type-1 Hypervisor.
* **The Logic:** Created "Logical Zones" to separate Management (Ansible Control Node) from Production (K3s Cluster) and Lab (Ephemeral K8s nodes).
* **The Limitation:** The cluster was isolated behind a home router with no public access (CGNAT).

---
### Phase 2: The Hybrid Bridge (v2.0.0)
**Goal:** Break the CGNAT barrier and establish a public presence without exposing the home network.
* **The Solution:** Architected a **Site-to-Site WireGuard Mesh**.
* **Cloud Gateway:** Provisioned a Google Cloud Platform (GCP) instance in Mumbai to act as the public "Front Door."
* **Traffic Flow:** Public traffic hits GCP -> Encrypted Tunnel -> On-Prem Traefik Ingress.
* **Infrastructure as Code:** Migrated manual setups to **Terraform** (GCP) and **Ansible** (On-Prem).
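
The tunnel between the cloud gateway and the CGNAT-bound home node can be sketched as a pair of WireGuard configs. The Python below is a minimal illustration, not the repository's Ansible templates: the helper name, keys, tunnel addresses, LAN range, and endpoint IP are all hypothetical placeholders. It shows the one asymmetry that matters behind CGNAT: only the home side sets `Endpoint` and `PersistentKeepalive`, because it must dial out and keep the NAT mapping open.

```python
from typing import List, Optional


def render_wg_conf(private_key: str, address: str, peer_public_key: str,
                   allowed_ips: List[str], listen_port: Optional[int] = None,
                   endpoint: Optional[str] = None,
                   keepalive: Optional[int] = None) -> str:
    """Render a single-peer WireGuard config in wg-quick format."""
    lines = ["[Interface]",
             f"PrivateKey = {private_key}",
             f"Address = {address}"]
    if listen_port is not None:
        lines.append(f"ListenPort = {listen_port}")
    lines += ["", "[Peer]",
              f"PublicKey = {peer_public_key}",
              f"AllowedIPs = {', '.join(allowed_ips)}"]
    if endpoint is not None:
        lines.append(f"Endpoint = {endpoint}")
    if keepalive is not None:
        lines.append(f"PersistentKeepalive = {keepalive}")
    return "\n".join(lines) + "\n"


# Cloud gateway: has the public IP, so it listens and never dials out.
# All values below are placeholders chosen for illustration.
gateway_conf = render_wg_conf(
    private_key="<gateway-private-key>",
    address="10.8.0.1/24",
    peer_public_key="<home-public-key>",
    allowed_ips=["10.8.0.2/32", "192.168.1.0/24"],  # tunnel IP + home LAN
    listen_port=51820,
)

# Home node behind CGNAT: initiates the tunnel and sends keepalives so the
# carrier NAT mapping stays open for return traffic.
home_conf = render_wg_conf(
    private_key="<home-private-key>",
    address="10.8.0.2/24",
    peer_public_key="<gateway-public-key>",
    allowed_ips=["10.8.0.1/32"],
    endpoint="203.0.113.10:51820",  # documentation-range IP, not the real one
    keepalive=25,
)
```

Printing `home_conf` gives a file that could be dropped into `/etc/wireguard/` once real keys and addresses are substituted.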
#### **⚡ Featured Implementation: Hybrid Cloud Automation (n8n)**
*A real-world stress test of the hybrid architecture: securely hosting a webhook-driven automation platform.*
**The Mission:** Build a "Sovereign Cloud" alternative to Zapier. The system must run workloads On-Premises (to save costs) but accept traffic securely from the Public Internet (GitHub Webhooks).
**Architecture Evolution (The path to stability):**
1. **Stage 1: The "Direct Connect" Attempt (Naive)**
* *Design:* Port Forwarding on Home Router.
* *Why Rejected:* **Security Risk.** Exposing the home network directly violated "Zero Trust" principles.
2. **Stage 2: The "Split-Brain" Dynamic Cloud (Experimental)**
* *Design:* GCP **Spot VM** as a Gateway + "Watchdog" scripts to auto-heal the tunnel when preempted.
* *Failure Mode:* **"Zombie States."** When IP addresses changed, DNS propagation lag caused Webhook failures. It also created significant **Ansible Drift**, as the inventory file was constantly outdated.
3. **Stage 3: The "Stable Mesh" (Production Grade)**
* *Design:* Migrated to **GCP Standard VM + Static IP** in Mumbai (`asia-south1`).
* *Result:* Eliminated the "Circuit Breaker" issue. If the home lab goes down, the Gateway now serves a clean 502 error instead of a connection timeout. Latency dropped to <30ms.
> **Technical Challenges & Solutions:** You can read the full deep-dive [here](docs/journal/project-hybrid-cloud-automation.md).
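
The Stage 3 behaviour, a fast 502 instead of a hanging connection, comes down to the gateway bounding how long it waits for the on-prem upstream. The following Python sketch illustrates that idea with a plain TCP probe. It is an assumption-laden illustration rather than the repository's gateway configuration: the function name, the status mapping, and the default timeout are hypothetical.

```python
import socket


def upstream_status(host: str, port: int, timeout_s: float = 2.0) -> int:
    """Return 200 if the upstream accepts a TCP connection, else 502.

    The short timeout is the important part: when the home lab is offline,
    the caller gets an answer within timeout_s instead of hanging.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return 200
    except OSError:  # covers refused connections, unreachable hosts, timeouts
        return 502
```

A caller such as a webhook sender then sees an explicit upstream failure it can retry on, rather than waiting for its own client-side timeout.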


> The K3s Cluster architecture incorporates several planned future upgrades.

---
### Phase 3: The Platform Era (Roadmap & Active Dev)
**Goal:** Shift from "Building Infrastructure" to "Platform Engineering"—focusing on Supply Chain Security, Event-Driven Architectures, and GreenOps using GCP services.
#### 🚧 Upcoming Implementation Specs
| Feature | Architecture / Implementation Plan | GCP Services / Cloud Tech |
| :--- | :--- | :--- |
| **1. Serverless "Burst" Worker** | **Event-Driven Hybrid Pattern:**<br>Instead of running heavy OCR tasks locally, MinIO upload events will trigger a container in the cloud.<br>_Why? Offloads compute-heavy tasks to Google Cloud Free Tier._ | **Cloud Run**, **Eventarc**, **Pub/Sub** |
| **2. Supply Chain Security** | **Secure Registry Pipeline:**<br>Implementing image signing and vulnerability scanning before any container reaches the Production cluster. | **Artifact Registry** (Optional), Trivy, Cosign, Kyverno |
| **3. GreenOps Automation** | **"Eco-Mode" Lab Manager:**<br>An **n8n** workflow that interacts with the Proxmox API to automatically freeze/thaw the 16GB "Lab Zone" based on study schedules. | Proxmox API, n8n |
| **4. GitOps Transformation** | **Pull-Based State Management:**<br>Migrating from Ansible-push to **ArgoCD**. The cluster will sync itself with this repo, ensuring "Configuration Drift" is impossible. | ArgoCD, Kustomize |
| **5. Hybrid Identity (IAM)** | **Single Sign-On (SSO):**<br>Centralizing access for service, Traefik, and SSH under one identity provider with MFA enforcement. | Keycloak, OIDC |
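
The schedule logic behind the planned "Eco-Mode" Lab Manager can be sketched in a few lines of Python. This is only an illustration of the decision step, assuming a per-weekday study window. The schedule values and function name are hypothetical, and the actual freeze/thaw calls to the Proxmox API (planned to run from n8n) are left out.

```python
from datetime import datetime, time
from typing import Dict, Tuple

# Hypothetical study schedule: weekday number (Monday = 0) -> (start, end).
STUDY_WINDOWS: Dict[int, Tuple[time, time]] = {
    0: (time(19, 0), time(23, 0)),
    2: (time(19, 0), time(23, 0)),
    5: (time(10, 0), time(18, 0)),
}


def desired_lab_state(now: datetime,
                      windows: Dict[int, Tuple[time, time]]) -> str:
    """Return "thaw" inside a study window for today, otherwise "freeze"."""
    window = windows.get(now.weekday())
    if window is not None and window[0] <= now.time() < window[1]:
        return "thaw"
    return "freeze"
```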
---
#### 📉 Architecture Evolution Plan
> *Current Focus: Moving stateful workloads (Postgres) to High-Availability Operators.*
* **Now:** Static Postgres Pods (Hard to scale, manual failover).
* **Next:** **CloudNativePG Operator** with automatic failover, Point-in-Time Recovery (PITR) to S3, and replica pooling.
---
## Technical Deep Dive
### 1. Infrastructure as Code (IaC)
I strictly adhere to the **DRY (Don't Repeat Yourself)** principle using modular design.
* **Terraform:** Split into `infrastructure/gcp` (Cloud Edge) and `infrastructure/on-prem` (Proxmox Resources).
* **Ansible:** Uses a "Control Node" pattern. The `ops-center` node bootstraps the entire fleet using Roles for Hardening, Docker, K3s, and Monitoring.
### 2. The "Hydration" Pattern (Security)
To maintain **Zero Trust** and keep secrets out of Git, I developed a "Hydration" workflow:
1. Secrets are encrypted with AES-256 in Ansible Vault (`vault.yml`).
2. A specialized playbook (`hydrate_infra.yml`) decrypts these values in memory.
3. It generates ephemeral `terraform.tfvars` files strictly on the deployment machine.
4. **Result:** Terraform plans run with full context, while plaintext secrets never enter Git and exist only in short-lived files on the deployment machine.
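
The ephemeral-file step of this workflow can be illustrated with a short Python sketch. It is a swapped-in illustration, not the `hydrate_infra.yml` playbook itself: the function name and variable names are hypothetical, and it assumes the secrets have already been decrypted into a dictionary. It writes a `.tfvars.json` file (a format Terraform accepts via `-var-file`) with owner-only permissions and deletes it as soon as the block exits.

```python
import contextlib
import json
import os
import tempfile
from typing import Iterator, Optional


@contextlib.contextmanager
def hydrated_tfvars(secrets: dict,
                    directory: Optional[str] = None) -> Iterator[str]:
    """Yield the path of a temporary tfvars file that is removed on exit."""
    # mkstemp creates the file readable and writable by the owner only.
    fd, path = tempfile.mkstemp(suffix=".tfvars.json", dir=directory)
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(secrets, fh)
        yield path
    finally:
        # Runs even if the Terraform step raises, so no plaintext lingers.
        os.remove(path)
```

Usage would look like `with hydrated_tfvars({"gcp_project": "..."}) as path:` followed by a Terraform invocation that passes `-var-file` with that path. Once the `with` block ends, the file is gone.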
### 3. Observability & FinOps
* **Monitoring:** Full Prometheus/Grafana stack monitoring Kubernetes metrics and Hardware thermals.
* **Cost Control:** The entire cloud footprint is engineered to cost roughly $5-$10/month through reserved instances and efficient resource sizing.
---
### Google Cloud Implementation Details
| Service | Usage in Homelab-Ops |
| :--- | :--- |
| **Compute Engine (GCE)** | Hosts the WireGuard Gateway acting as the public "Front Door" to the private lab. |
| **VPC & Static IP** | A reserved external IP gives incoming Webhooks (GitHub -> n8n) a stable endpoint that does not change when the VM restarts. |
| **Cloud NAT** | Provides secure outbound internet access for private cloud subnets (without exposing them). |
| **Cloud Run** *(Planned)* | Serverless compute target for sporadic, high-intensity tasks (OCR/PDF Processing). |
| **Eventarc** *(Planned)* | Event bus routing storage events (MinIO) to Cloud Run functions. |
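
For the planned MinIO to Cloud Run path, the first step on the receiving side is turning a storage event into a job description. The Python sketch below is a hedged illustration of that step. It assumes MinIO's S3-style notification layout (`Records[].s3.bucket.name` and a URL-encoded `Records[].s3.object.key`), and the function name and job shape are hypothetical. Publishing to Pub/Sub or invoking Cloud Run is deliberately left out.

```python
from typing import Dict, List
from urllib.parse import unquote_plus


def ocr_jobs_from_event(event: dict) -> List[Dict[str, str]]:
    """Extract one OCR job per newly created object in a storage event."""
    jobs = []
    for record in event.get("Records", []):
        # Ignore deletions and other non-upload events.
        if "ObjectCreated" not in record.get("eventName", ""):
            continue
        s3 = record["s3"]
        jobs.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3-style notifications.
            "key": unquote_plus(s3["object"]["key"]),
        })
    return jobs
```

Filtering on the event name keeps the burst worker from being triggered by deletions, which matters when each invocation is billed.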