{"id":18579246,"url":"https://github.com/faust64/kube-magic","last_synced_at":"2026-03-19T05:08:24.022Z","repository":{"id":76091516,"uuid":"268522280","full_name":"faust64/kube-magic","owner":"faust64","description":"Kubernetes \u0026 Ceph \u0026 OpenShift","archived":false,"fork":false,"pushed_at":"2023-02-19T19:08:30.000Z","size":835,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-05-16T02:38:58.322Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jinja","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/faust64.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-06-01T12:57:08.000Z","updated_at":"2022-02-27T11:16:56.000Z","dependencies_parsed_at":"2023-05-22T11:00:32.951Z","dependency_job_id":null,"html_url":"https://github.com/faust64/kube-magic","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/faust64/kube-magic","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/faust64%2Fkube-magic","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/faust64%2Fkube-magic/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/faust64%2Fkube-magic/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/faust64%2Fkube-magic/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/faust64","download_url":"https://codeload.github.com/faust64/k
ube-magic/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/faust64%2Fkube-magic/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28741627,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T01:40:51.112Z","status":"online","status_checked_at":"2026-01-25T02:00:06.841Z","response_time":113,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T23:39:36.994Z","updated_at":"2026-01-25T02:01:48.494Z","avatar_url":"https://github.com/faust64.png","language":"Jinja","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Kube Magic\n\nBringing together [KubeSpray](https://github.com/kubernetes-sigs/kubespray),\n[Ceph-Ansible](https://github.com/ceph/ceph-ansible) and\n[OpenShift-Ansible](https://github.com/openshift/openshift-ansible)\ndeploying Kubernetes.\n\nRefer to the KubeSpray, OpenShift-Ansible and Ceph-Ansible respective\ndocumentation when preparing to deploy your clusters - the main purpose\nof this repository being to centralize all the inventories involved,\nas well as adding third party components such as the EFK stack,\nintegrating with Nagios based monitoring, ... or some\nHAProxy / Keepalived external Load Balancer serving Kubernetes API.\n\nAs an FYI, `./terraform-aws` would allow you to easily bootstrap\nEC2 instances / VPC / SG / IAM / ELB / ... 
Based on Kubespray\nterraform samples.\n\n## Inventories Setup\n\nWe would start by pulling upstream repositories:\n\n```\n$ make init\n```\n\nWe now have the following Ansible inventories:\n\n * `custom/hosts`, variables in `custom/group_vars`, with hosts and\n   settings specific to this repository\n * `kubespray/inventory/utgb/hosts.yml` and\n   `kubespray/inventory/utgb/group_vars`, for everything specific to\n   KubeSpray\n * Ceph-Ansible specifics would be in `ceph-ansible/hosts` and\n   `ceph-ansible/group_vars`\n * OpenShift would be deployed using `openshift-ansible/hosts` and\n   `openshift-ansible/group_vars`.\n\nDepending on what you would want to set up, we would patch the corresponding\ninventories, listing the servers we would deploy to, and Ansible\nvariables setting the specifics for our cluster, such as domain names,\nproxies, DNS, NTP configuration, ... Or in the case of OpenShift: LDAP\nauthentication.\n\nNote: the `kubespray.rpi` folder includes inventories deploying\nKubernetes on Raspberry 3 \u0026 4 (armv7). Requires 64b on masters, raspbian\nworks perfectly.\n`kubespray.aws` comes from another deployment (1.20), testing/lab on AWS.\n\n## Usage\n\n### Ceph\n\nIf you intend to use Ceph, we would first deploy that cluster.\n\nHaving customized your inventories, we would run:\n\n```\n$ make deploy-ceph\n```\n\nNote: running on ARM, we should pull different images, ... see\n`examples/ceph-csi-arm` for a sample working on Raspberry PI. Also note that\narm64 is mandatory for rbdplugin mapper to work. And that Raspbian does not\nprovide rbd kernel modules: rbd-nbd should be used instead.\n\n### KubeSpray\n\nThen prepare for KubeSpray deployment, setting up External LoadBalancers,\nproper Ceph repository, preparing a RBD pool, initializing Ansible\nvariables for Kubernetes to authenticate against Ceph, ... 
using:\n\n```\n$ make prep-hosts\n```\n\nThen proceed with Kubespray deployment:\n\n```\n$ make deploy-kube\n```\n\n### OpenShift\n\nDealing with OpenShift, we would deploy our cluster:\n\n```\n$ make deploy-openshift\n```\n\n### Post Kubernetes/OpenShift Deployment\n\nSetting up Nagios monitoring, backups, or OpenShift persistent storage\nand LDAP Groups sync, we would then use:\n\n```\n$ make deploy-post\n```\n\nEventually, we may deploy additional components, such as:\n\n * Logging stack: `make deploy-logging` (kubespray/openshift)\n * Monitoring stack: `make deploy-prometheus` (kubespray/openshift)\n * Tekton: `make deploy-tekton` (kubespray -- WARNING: some manual fix required\n   afterwards)\n\nAlternatively, deploying Prometheus on Kubernetes can be done using:\nhttps://github.com/Worteks/k8s-prometheus - which includes ARM support,\nmissing from kube-state-metrics, as shipped by roles in the playbooks we\nhave here.\n\nDeploying Kubespray, if you chose not to deploy one of their supported\ningress controllers, you may go with Traefik - see samples in `examples/traefik`.\n\n#### EFK on Kubernetes\n\nNote that deploying the logging stack on Kubernetes, you will then have to\nconnect to Kibana, go to Settings, Kibana / Index Patterns, close the div on\nthe right side, as it hides the Create Index Pattern button.\n\nAs a pattern, we would enter `logstash-*`, confirm. The longer we would wait\nbefore doing so, the more fields we would discover. A few to look for would\nbe `kubernetes.container.*`, `kubernetes.labels.*` or `SYSLOG_FACILITY`.\n\n## Feedback\n\n### Buildah\n\nThe first inconsistency I would find, comparing with OpenShift, is that I can\nnot run Buildah on unprivileged containers, building my images. Using either\nthe default or VFS drivers (the latter did help in OCP), which gives different\nerror messages.\n\nAs far as I understand, this would have to do with AppArmor being enabled on\nmy (Debian buster) nodes. 
Build fails with a permission denied, writing on\nsome emptyDir.\n\nAllowing my tekton ServiceAccount to run privileged pods, buildah builds do\nwork as intended. And to be fair, even openshift tektoncd samples would run\nthose as root - as reported, openshift/pipelines-catalog#17 . Quite a shame,\nconsidering that unprivileged capability is one of the main arguments buildah\nhas -- and it's either poorly performing, or not at all ...\n\nAnother issue we would encounter with Buildah alongside Containerd, is due\nto some wrong metadata dealing with compressed images. Though that bug was\nallegedly fixed, latest images still randomly reproduce the issue (not all\nof them affected). This could be fixed disabling compression pushing\nimages - which is definitely not a good solution, though at least it works.\n\nhttps://github.com/containers/buildah/issues/1589#issuecomment-504504999\nhttps://github.com/containers/buildah/issues/1589#issuecomment-542509369\nhttps://www.mankier.com/1/buildah-push#--disable-compression\n\nIn some cases, disabling compression did not fix it. Noticing the remaining\ndeployments all used images built on top of other custom images\n(eg: apache -\u003e php -\u003e whitepages), I tried to manually pull base images,\nwhich fixed ... A better solution would then be to add some extra option to\nbuildah (not required on OpenShift, unclear wtf)\n\nhttps://github.com/containers/image/issues/733#issuecomment-625867772\n\nOverall: quite disappointed by Buildah on Kubernetes - while, to be frank,\nit was not that great on OpenShift either. Doc's accurate. Does what's\nadvertised. Builder ships with arm64 images: same confs work on RPI, and x86.\n\nUPDATE: ... While those comments are quite old: as a follow-up, today using\nKaniko, which is great, despite requiring root.\n\n### CephFS Provisioner\n\nCurrently running version 1.2.2 of the cephfs-provisioner, I noticed that\nPVC deletion may never complete. 
While the PVC itself is indeed deleted,\nthe Secret that was created accessing our Volume would remain in our\nNamespace. The Persistent Volume object would remain in a Released state,\ndespite my StorageClass reclaimPolicy being set to Delete. Both of which\nwould prevent re-creating a PVC you would have deleted.\n\n### Cert-Manager\n\nThe cert-manager operator does not work, as deployed by kube-spray (master\nbranch, I should have picked a release first, ...). Looks like we're in\nbetween two API versions, the operator is lacking permissions over the\nobjects it wants to use. If I fix RBAC, then the API server refuses the\nobjects created by the operator. The operator image is most definitely wrong,\nin relation to the RBAC/CRD configuration loaded by kube-spray.\n\nMore recently, I did apply an official release of cert-manager: I have\neach CRD twice (in `cert-manager.io` \u0026 `certmanager.k8s.io`). Though it\nworks perfectly.\n\n### ETCd Quotas\n\nWhile first deploying kubernetes using those playbooks, I made a mistake in\nsetting `etcd_quota_backend_bytes` to 20Mi, instead of 20Gi. Realizing this,\nI corrected the `/etc/etcd.env` on two out of three masters, leaving the last\none intact, to see what would happen.\n\nAbout 5 days after initial deployment, while porting an OCP operator to work\nin vanilla k8s (lots of objects creation/updates/deletion and do-overs, ...),\nI eventually reached a point where any kubectl command that might have written\nsomething into etcd would fail. In most cases, the associated error message\nwould be clear enough, including something like `mvcc: database space exceeded`.\n\nEmpirically, we can see that having all three members healthy, yet only one\nof them reaching its quota, would result in an alarm being set for the whole\ncluster.\n\nQuerying etcd cluster status, we can see all members are here. A column mentions\nsome `20MB`, matching my initial quota.\n\n```\n. 
/etc/etcd.env\nENDPOINTS=$(echo $ETCD_INITIAL_CLUSTER | sed  -e 's|etcd[0-9]=||g' -e 's|2380|2379|g')\nETCDCTL_API=3 etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE \\\n    --cacert=$ETCDCTL_CA_FILE --endpoints=$ENDPOINTS endpoint status\n```\n\nWe could try compacting our database, keeping in mind this process should already run in\nthe background, so don't expect too much from this:\n\n```\nETCDCTL_API=3 etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE \\\n    --cacert=$ETCDCTL_CA_FILE --endpoints=$ENDPOINTS endpoint status \\\n    --write-out=\"json\" | egrep -o '\"revision\":[0-9]*' | egrep -o '[0-9].*'\n\u003creturns-a-revision-number\u003e\nETCDCTL_API=3 etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE \\\n    --cacert=$ETCDCTL_CA_FILE --endpoints=$ENDPOINTS compact \u003crevision-number\u003e\n```\n\nWhile compaction did not help, I eventually went with defrag, which did:\n\n```\nETCDCTL_API=3 etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE \\\n    --cacert=$ETCDCTL_CA_FILE --endpoints=$ENDPOINTS defrag\n```\n\nThen, do not forget to clear the alarm:\n\n```\nETCDCTL_API=3 etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE \\\n    --cacert=$ETCDCTL_CA_FILE --endpoints=$ENDPOINTS alarm disable\n```\n\nNote that while, at that stage, etcd logs would confirm there is no more\nquota issue, we may want to restart the whole cluster, forcing all components\nto acknowledge this -- I could not kubectl exec, logs, ... restarting a pod\nended up in one being stuck in \"terminating\" while the other in\n\"containercreating\", ... situation was quite fucked up.\n\nAt which point, I would argue there's no \"right\" value to pass that\n`etcd_quota_backend_bytes` variable. 
It would probably be safer to leave it\nundefined and use a dedicated logical volume hosting etcd data - maybe with\nasymmetrical capacities, making sure they would not get full all at once, and\nkeeping some space available in the parent volume group, which could speed up\nrecovery, especially if not everyone in your team knows about etcd operations.\n\n### Pulling Container Images from Insecure Registries\n\nUsing containerd runtime, it is not yet possible to pull images from an\ninsecure registry, unless we have some CA to trust on Kubernetes hosts.\nSuch registry would remain usable with Tekton and Buildah, though we would\nnot be able to run containers out of it.\n\nFeature has been added to containerd 1.4.0-beta.0, kube-spray currently ships\nwith 1.2.13-2 on Debian buster.\n\nIn the meantime, when self-signing certificates without a CA, or with no way to\neasily trust new CAs into Kubernetes hosts (eg: operator), then http registries\ncould still be used.\n\nUPDATE: fixed in package, and can be configured in kubespray inventory\n\n### Containerd Snapshots\n\nSeen disk usage grow over time on a given node. After 9 months,\n`/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs` was using \u003e30G.\nTo clean it up:\n\n```\n# systemctl stop kubelet\n# crictl pods | awk 'NR\u003e1{print $1}' | xargs crictl stopp\n# crictl pods | awk 'NR\u003e1{print $1}' | xargs crictl rmp\n# crictl ps -a | awk 'NR\u003e1{print $1}' | xargs crictl rm -f\n# systemctl stop containerd\n# rm -fr /var/lib/containerd/io.containerd.metadata.v1.bolt/*\n# rm -fr /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/*\n# reboot\n```\n\n### Re-Deploying Nodes\n\nAfter suffering a disk loss, I had to redeploy a master node. 
This can be done\nwith the following playbook, having edited your inventory, so that the faulty\nnodes are part of some `broken_xxx` hostgroup:\n\n```\n$ ansible-playbook -i inventory/my/hosts.yml -l etcd,kube-master \\\n    -e etcd_retries=300 recover-control-plane.yml\n```\n\nPlaybook crashed, while dealing with add-ons. I decided to comment them all out\nfrom Ansible group vars, then re-applied the cluster deployment playbook:\n\n```\n$ ansible-playbook -i inventory/my/hosts.yml -l etcd,kube-master cluster.yml\n```\n\nEverything went fine, though I still had a regular worker to re-deploy, and\ndecided to use the scale-out playbook, while re-using the same node name:\n\n```\n$ ansible-playbook -i inventory/my/hosts.yml scale.yml\n```\n\nOutage did not affect API services and cluster in general, the rest of the nodes\nkept running perfectly fine.\n\n### Upgrading Cluster\n\nThere are two kinds of upgrades we could apply to a cluster deployed with\nKubespray:\n\n- upgrading Kubernetes\n- upgrading Kubespray\n\nEither way, there is an upgrade path to follow. For Kubernetes, starting with\n1.x going to the last 1.z, we would have to go through some 1.y. In my case,\nstarting from 1.18.3, going to 1.20, I would have to apply a 1.19 on the way.\nRe-apply the Kubespray cluster upgrade playbook going from one `kube_version`\nto the next.\n\nAnd upgrading Kubernetes would usually imply updating the Kubespray playbooks\nmanaging your cluster - at the very least, getting the right default image\nversions, checksums or deployment configurations for calico, containerd, ...\ndepending on the Kubernetes version we're upgrading to.\nAs for Kubernetes, Kubespray has an upgrade path we should follow: each tag\nshould be applied one after the other (v2.14.0, v2.14.1, v2.14.2, ...). 
And if,\nlike me, you started deploying from the master branch: first, we have to guess\nwhich tag to start with ...\n\nKeeping it simple, stick to the default `kube_version` shipping with Kubespray,\nand apply the upgrade playbook for each tag until you reach the right Kubernetes\nversion. Otherwise, make sure your `kube_version` is listed in the download\ndefaults: `roles/download/defaults/main.yml`.\n\nUpgrading Kubespray, we also have to check the diffs between current and target\nversion copies of `inventory/sample`, look into the new variables that were\nintroduced, some that could have changed or others that may have gone away.\nHaving updated your inventory, make sure your nodes are all healthy. Make sure\nno PodDisruptionBudget could prevent a node from being drained. Then apply\nthe upgrade playbook:\n\n```\n$ git checkout v2.14.0\n$ git diff \u003coldtag\u003e..v2.14.0 inventory/sample\n$ vi inventory/\u003cmine\u003e/xxx [ update your inventory ]\n$ ansible-playbook -i inventory/my/hosts.yml upgrade-cluster.yml\n$ git checkout v2.14.1\n$ git diff v2.14.0..v2.14.1 inventory/sample\n$ vi inventory/\u003cmine\u003e/yyy\n$ ansible-playbook -i inventory/my/hosts.yml upgrade-cluster.yml\n$ git checkout v2.14.2\n[ repeat ]\n```\n\nThe upgrade process would run some checks, pre-download images and some\nassets, then eventually start upgrading etcd, then drain and upgrade your\nmasters one after the other. After the first master was upgraded, parts of\nKubernetes Apps also are (CSI \u0026 RBAC, ...). After all masters were upgraded,\nthe SDN would be upgraded on all nodes (two by two). Then, the remaining\nnodes would be upgraded as well. Goes through the apps again, updating\nCoreDNS, the metrics server, ...\n\nIn the end, I could not see any API failure. 
Which is kind of amazing, knowing\nhow painful OpenShift 4 upgrades can be, in terms of SDN components restarting,\nAPI being unavailable, unless that's the OAuth operator that's being redeployed\nor the nodes rebooting, ... The process took a little under two hours applying\na new version on 10 nodes. Draining and restarting Pods being quite slow,\ndon't hesitate to shut down useless deployments or lower the amount of replicas\nwhenever possible, before going through an upgrade. Pro-tip: marking a node\nunschedulable before it gets processed by Ansible would skip the draining\nsteps. One may also want to disable a few apps deployment in Kubespray, in\norder to skip those tasks as well. Eg: only re-deploy the registry and ingress\ncontrollers during your last upgrade (or manually, later on).\n\nUPDATE: we don't actually need to apply each kubespray tag. Instead, we should\nmake sure to go through each minor: we can skip patches.\n\nUPDATE: kubespray 2.18.2 misses some arm64 assets (fixed in 2.19.0). While\n2.18.2 also introduces a broken apparmor detection, which is still an issue\nin last masters (post 2.21), PR submitted. 2.20.0 upgrade with calico\nrequired defining a new default/not documented. nobody's prefect. kubespray\nremains quite reliable.\n\n### Recovering from Expired API Certificates\n\nKubespray playbooks were unable to rotate certificates on a cluster whose\ncurrent certificates were expired. In such case, we may regenerate some of\nthose certs manually, to recover API. 
Connecting to one of your master nodes:\n\n```\n$ ssh root@master1\n# cd /etc/kubernetes/ssl\n# kubeadm certs renew apiserver\n# kubeadm certs renew apiserver-kubelet-client\n# kubeadm certs renew front-proxy-client\n# scp -p apiserver.* apiserver-kubelet-client.* front-proxy-client.* master2:`pwd`/\n# scp -p apiserver.* apiserver-kubelet-client.* front-proxy-client.* master3:`pwd`/\n```\n\nYou may also have to generate a new admin kubeconfig:\n\n```\n# kubeadm kubeconfig user --client-name kubernetes-admin \\\n    --config=/etc/kubernetes/kubeadm-config.yaml \\\n    --org system:masters \u003e/etc/kubernetes/admin.conf\n# scp -p /etc/kubernetes/admin.conf master2:/etc/kubernetes/\n# scp -p /etc/kubernetes/admin.conf master3:/etc/kubernetes/\n```\n\nThen, on all master nodes, we'll restart the Kubernetes apiserver:\n\n```\n# crictl ps | grep apiserver\n# crictl stop \u003capiserver-container-id\u003e\n# crictl rm \u003capiserver-container-id\u003e\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffaust64%2Fkube-magic","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffaust64%2Fkube-magic","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffaust64%2Fkube-magic/lists"}