https://github.com/meroxdotdev/infrastructure
🏠 Personal Homelab Infrastructure Production-ready Kubernetes homelab with Talos Linux and GitOps automation. Multi-node setup with automated deployments via Flux and comprehensive infrastructure management.
https://github.com/meroxdotdev/infrastructure
blog homelab k8s kubesearch linux server
Last synced: 24 days ago
JSON representation
🏠 Personal Homelab Infrastructure Production-ready Kubernetes homelab with Talos Linux and GitOps automation. Multi-node setup with automated deployments via Flux and comprehensive infrastructure management.
- Host: GitHub
- URL: https://github.com/meroxdotdev/infrastructure
- Owner: meroxdotdev
- License: mit
- Created: 2025-06-13T10:15:00.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2026-04-26T20:08:25.000Z (about 1 month ago)
- Last Synced: 2026-04-26T20:29:20.989Z (about 1 month ago)
- Topics: blog, homelab, k8s, kubesearch, linux, server
- Language: YAML
- Homepage: https://merox.dev/
- Size: 1.85 MB
- Stars: 13
- Watchers: 1
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# merox.dev Infrastructure
GitOps-managed homelab: Kubernetes cluster (Talos + Flux) + VPS services + Infrastructure agent.
---
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│ VPS (Oracle/Hetzner) cloudlab-infrastructure/ │
│ ├── Traefik (reverse proxy + SSL) │
│ ├── Pi-hole (DNS) │
│ ├── Portainer (container management) │
│ ├── Homepage (dashboard) │
│ ├── Netdata (monitoring) │
│ └── Garage S3 (Longhorn backup target) │
└────────────────────┬────────────────────────────────────┘
│ Tailscale mesh VPN
┌────────────────────▼────────────────────────────────────┐
│ Kubernetes Cluster (Talos Linux + Flux) │
│ ├── Cilium (CNI + Gateway API) │
│ ├── Longhorn (storage → backs up to Garage S3) │
│ ├── cert-manager, external-dns, k8s-gateway │
│ └── Apps: see kubernetes/apps/ │
└─────────────────────────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ OpenClaw — openclaw.ai │
│ Telegram bot → Claude API → kubectl/docker │
│ Config: agent/openclaw.json Skill: agent/skills/infra │
└─────────────────────────────────────────────────────────┘
```
---
## Hardware
| Device | Role | Specs |
|--------|------|-------|
| Dell OptiPlex 3050 #1 | K8s node (Proxmox VM) | i5-6500T, 16GB, 128GB NVMe |
| Dell OptiPlex 3050 #2 | K8s node (Proxmox VM) | i5-6500T, 16GB, 128GB NVMe |
| Beelink GTi 13 Pro | K8s node (Proxmox VM) | i9-13900H, 64GB, 2x2TB NVMe |
| Dell PowerEdge R720 | Proxmox Backup Server | 2x Xeon E5-2697v2, 192GB |
| Synology DS223+ | NAS / NFS + Backup | 2x2TB HDD RAID1 |
| XCY X44 | pfSense Firewall | N100, 8GB |
| Oracle/Hetzner VPS | Off-site services | 4 vCPU, 8GB |
---
## Repository Layout
```
infrastructure/
├── cloudlab-infrastructure/ # Ansible — VPS provisioning
├── kubernetes/
│ ├── apps/ # Flux app manifests (namespaced)
│ ├── flux/ # Flux bootstrap + HelmRepositories
│ └── components/ # Shared Kustomize components (common, repos)
├── talos/ # Talos node configs + patches
├── bootstrap/ # Cluster bootstrap vars
├── agent/ # OpenClaw config template + infra skill
│ ├── openclaw.json # Gateway config (no secrets — use ~/.openclaw/.env)
│ └── skills/infra/ # kubectl/docker skill context
├── DEPLOY.md # Full rebuild + DR guide
└── Taskfile.yaml # Task runner (talosctl, flux, etc.)
```
---
## Disaster Recovery
> Full step-by-step rebuild guide: **[DEPLOY.md](DEPLOY.md)**
| Scenario | Where to look |
|----------|--------------|
| Full rebuild (new server + new cluster) | [DEPLOY.md — Phase 1 (VPS)](DEPLOY.md#phase-1--vps) → [Phase 2 (K8s)](DEPLOY.md#phase-2--kubernetes-cluster) → [Phase 3 (Agent)](DEPLOY.md#phase-3--agent-openclaw) |
| Restore Longhorn volumes from S3 backup | [DEPLOY.md — Phase 2, step 7](DEPLOY.md#phase-2--kubernetes-cluster): `task restore:longhorn` |
| New hardware (different IPs/disks) | [DEPLOY.md — Phase 2, step 3](DEPLOY.md#phase-2--kubernetes-cluster): update `talos/talconfig.yaml`, `cluster-vars.yaml`, `cilium/networks.yaml` |
| Intel iGPU absent on new hardware | Remove `gpu.intel.com/i915` from `kubernetes/apps/default/jellyfin/app/helmrelease.yaml` and disable `intel-device-plugin-operator` |
| Jellyfin restored but streaming slow / Tailscale broken | See **[docs/jellyfin-post-restore.md](docs/jellyfin-post-restore.md)** — manual UI steps required after every restore |
| Re-install OpenClaw agent only | [DEPLOY.md — Phase 3](DEPLOY.md#phase-3--agent-openclaw) |
**The two things to back up before decommissioning a server:**
1. `age.key` — losing this = losing all SOPS-encrypted secrets
2. `~/.openclaw/.env` — Anthropic API key, Telegram tokens
---
## Day-to-Day Operations
### Cluster
```bash
# Check overall health
kubectl get nodes
kubectl get kustomizations -A
kubectl get helmreleases -A
cilium status
# Force Flux sync
task reconcile
# Regenerate Talos config (after editing talconfig.yaml)
task talos:generate-config
# Apply updated config to a node
task talos:apply-node IP=10.57.57.80
# Upgrade Talos on a node (update talenv.yaml version first)
task talos:upgrade-node IP=10.57.57.80
task talos:upgrade-k8s
# Reset entire cluster (destructive)
task talos:reset
```
### VPS
```bash
cd cloudlab-infrastructure/
make health-check # Verify all services running
make setup # Full redeploy (idempotent)
make update # OS package updates only
make check # Dry-run (--check --diff)
make check-resources # Disk, memory, Docker usage
make cleanup # Remove unused Docker images/volumes
```
---
## Troubleshooting
### Flux not reconciling
```bash
flux get sources git -A # Check git source is reachable
flux get kustomizations -A # Find which ks is failing
flux logs --level=error # See error messages
flux reconcile kustomization cluster-apps --with-source # Force sync
```
### HelmRelease stuck / failed
```bash
kubectl get helmreleases -A | grep -v True
flux logs --kind HelmRelease --name -n
flux reconcile helmrelease -n --with-source
# If values changed and Helm refuses — suspend + resume:
flux suspend helmrelease -n
flux resume helmrelease -n
```
### Pod issues
```bash
kubectl -n get pods -o wide
kubectl -n describe pod
kubectl -n logs -f
kubectl -n logs --previous # crashed container
kubectl -n get events --sort-by='.metadata.creationTimestamp'
```
### Longhorn storage
```bash
# Volume / replica status
kubectl -n longhorn-system get volumes
kubectl -n longhorn-system get nodes.longhorn.io
# Orphaned replicas (safe to delete)
kubectl get orphan -n longhorn-system -o name | \
xargs kubectl delete -n longhorn-system
# Trigger a backup manually
# Longhorn UI → Volume → Create Backup
# Old snapshots cleanup
kubectl get snapshots -n longhorn-system -o json | \
jq -r '.items[] | select(.status.creationTime < "2025-01-01") | .metadata.name' | \
xargs kubectl delete snapshot -n longhorn-system
```
### Replacing a disk on a K8s node
```bash
# 1. Drain node
kubectl drain --ignore-daemonsets --delete-emptydir-data
# 2. In Proxmox: shutdown VM, swap physical disk, boot VM
# 3. Regenerate and re-apply Talos config
task talos:generate-config
talosctl apply-config --insecure --nodes \
--file talos/clusterconfig/.yaml
# 4. Uncordon
kubectl uncordon
# 5. If Longhorn disk UUID changed — evict replicas then re-add disk:
kubectl -n longhorn-system patch node.longhorn.io \
--type merge -p '{"spec":{"evictionRequested":true}}'
# Wait for replicas to evacuate (~20-60 min), then remove old disk
# and add new disk via Longhorn UI
```
> Wait 1-2 hours between disk swaps to allow replica rebuild.
### Node unreachable
```bash
talosctl -n health
talosctl -n dmesg
talosctl -n services
kubectl describe node
```
### Garage S3 (Longhorn backup target)
```bash
ssh root@ "docker exec garage /garage status"
ssh root@ "docker exec garage /garage bucket list"
# Verify Longhorn can reach it:
kubectl -n longhorn-system get secret minio-secret
```
---
## Maintenance
### Adding a node
> Keep an odd number of control plane nodes (1, 3, 5) for quorum.
```bash
# 1. Boot new node from Talos ISO — same schematic ID as existing nodes
# Get disk and MAC from the node in maintenance mode:
talosctl get disks -n --insecure
talosctl get links -n --insecure
# 2. Add node entry to talos/talconfig.yaml with the disk and MAC above
# 3. Regenerate config and apply to new node
task talos:generate-config
task talos:apply-node IP=
# 4. Node joins automatically — watch it become Ready:
kubectl get nodes --watch
```
### Talos config changes
```bash
# After editing talos/talconfig.yaml or any patch:
task talos:generate-config
task talos:apply-node IP= MODE=auto
# MODE=auto applies without reboot if possible, reboots if required
```
### Dependency updates
Renovate runs every weekend and opens PRs automatically for:
- Helm chart versions (all HelmReleases)
- Container image tags (annotated with `# renovate:`)
- Talos / Kubernetes versions (`.mise.toml`)
Config: `.renovaterc.json5`
### SOPS secret rotation
```bash
# Edit any encrypted secret
sops kubernetes/apps///app/secret.sops.yaml
# Re-encrypt all secrets after AGE key rotation
find . -name "*.sops.*" -exec sops updatekeys {} \;
```
### Security
- Kubernetes secrets encrypted with SOPS (AGE key — back up manually)
- Ansible secrets in encrypted Vault (`cloudlab-infrastructure/`)
- All traffic via Tailscale mesh or Cloudflare Tunnel (no open ports)