{"id":22475480,"url":"https://github.com/philips/kubernetes-day-2","last_synced_at":"2025-04-14T03:35:34.566Z","repository":{"id":66550498,"uuid":"86627755","full_name":"philips/kubernetes-day-2","owner":"philips","description":"These are notes to accompany my KubeCon EU 2017 talk. The slides are available as well.","archived":false,"fork":false,"pushed_at":"2017-04-12T23:26:35.000Z","size":6,"stargazers_count":19,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-27T17:26:49.031Z","etag":null,"topics":["backup","cluster","devops","kubernetes-cluster","monitoring","operations","prometheus"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/philips.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-29T20:44:50.000Z","updated_at":"2019-10-12T04:39:43.000Z","dependencies_parsed_at":"2023-02-28T13:16:16.701Z","dependency_job_id":null,"html_url":"https://github.com/philips/kubernetes-day-2","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philips%2Fkubernetes-day-2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philips%2Fkubernetes-day-2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philips%2Fkubernetes-day-2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philips%2Fkubernetes-day-2/manifests","owner_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/philips","download_url":"https://codeload.github.com/philips/kubernetes-day-2/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248817272,"owners_count":21166213,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backup","cluster","devops","kubernetes-cluster","monitoring","operations","prometheus"],"created_at":"2024-12-06T13:17:32.160Z","updated_at":"2025-04-14T03:35:34.533Z","avatar_url":"https://github.com/philips.png","language":null,"readme":"# Kubernetes Day 2\n\nThese are notes to accompany my [KubeCon EU 2017 talk](https://cloudnativeeu2017.sched.com/event/9Tcw/kubernetes-day-2-cluster-operations-i-brandon-philips-coreos). The slides [are available as well](https://docs.google.com/presentation/d/1LpiWAGbK77Ha8mOxyw01VlZ18zdMSNimhsrb35sUFv8/edit?usp=sharing). The video is [available from Youtube](https://www.youtube.com/watch?v=U1zR0eDQRYQ).\n\nHow do you keep a Kubernetes cluster running long term? Just like any other service, you need a combination of monitoring, alerting, backup, upgrade, and infrastructure management strategies to make it happen. This talk will walk through and demonstrate the best practices for each of these questions and show off the latest tooling that makes it possible. The takeaway will be lessons and considerations that will influence the way you operate your own Kubernetes clusters.\n\n## WARNING\n\nThese are notes for a conference talk. Much of this may become out of date very quickly. 
My goal is to turn much of this into docs over time.\n\n## Cluster Setup\n\nAll of the demos in this talk were done with a self-hosted cluster deployed with the [Tectonic Installer](https://github.com/coreos/tectonic-installer#tectonic-installer) on AWS.\n\nThis cluster was also deployed using the self-hosted etcd option which, at the time of this writing, [isn't yet merged into the Tectonic Installer](https://github.com/coreos/tectonic-installer/pull/135).\n\n## Failing a Scheduler\n\nScale the deployment down to remove all schedulers\n\n```\nkubectl scale -n kube-system deployment kube-scheduler --replicas=0\n```\n\nOH NO, scale it back up\n\n```\nkubectl scale -n kube-system deployment kube-scheduler --replicas=1\n```\n\nUnfortunately, it is too late. Everything is ruined?!?! The new scheduler pod stays Pending because there is no running scheduler left to place it.\n\n```\nkubectl get pods -l k8s-app=kube-scheduler -n kube-system\nNAME                              READY     STATUS    RESTARTS   AGE\nkube-scheduler-3027616201-53jfh   0/1       Pending   0          52s\n```\n\nGet the current scheduler deployment\n\n```\nkubectl get -n kube-system deployment -o yaml kube-scheduler \u003e sched.yaml\n```\n\nPick a node name from this list at random\n\n```\nkubectl get nodes -l master=true\n```\n\nEdit sched.yaml down to just the pod spec and set the spec.nodeName field to the node selected above. 
Something like this:\n\n```\nkind: Pod\nmetadata:\n  labels:\n    k8s-app: kube-scheduler\n  name: kube-scheduler\n  namespace: kube-system\nspec:\n  nodeName: ip-10-0-37-115.us-west-2.compute.internal\n  containers:\n  - command:\n    - ./hyperkube\n    - scheduler\n    - --leader-elect=true\n    image: quay.io/coreos/hyperkube:v1.5.5_coreos.0\n    imagePullPolicy: IfNotPresent\n    name: kube-scheduler\n    resources: {}\n    terminationMessagePath: /dev/termination-log\n  dnsPolicy: ClusterFirst\n  nodeSelector:\n    master: \"true\"\n  restartPolicy: Always\n  securityContext: {}\n  terminationGracePeriodSeconds: 30\n```\n\nCreate this temporary pod with `kubectl create -f sched.yaml`. Because nodeName is set, the kubelet runs it directly without needing a scheduler, and it can then schedule the deployment's pending pod. At this point the deployment's scheduler pod should be ready and can take over\n\n```\nkubectl get pods -l k8s-app=kube-scheduler -n kube-system\n```\n\nDelete the temporary pod\n\n```\nkubectl delete pod -n kube-system kube-scheduler\n```\n\n## Downgrade/Upgrade Scheduler\n\nEdit the scheduler deployment and downgrade the image by a patch release.\n\n```\nkubectl edit -n kube-system deployment kube-scheduler\n```\n\nNow edit the scheduler again and upgrade the image by a patch release.\n\n```\nkubectl edit -n kube-system deployment kube-scheduler\n```\n\nBoom!\n\n## kubectl drain\n\n```\n$ kubectl get nodes\nNAME                                        STATUS    AGE\nip-10-0-13-248.us-west-2.compute.internal   Ready     19h\n```\n\nTo make a node unschedulable and evict all of its pods, run the following\n\n```\nkubectl drain ip-10-0-13-248.us-west-2.compute.internal\n```\n\n## kubectl cordon and uncordon\n\nTo ensure a node doesn't get additional workloads you can cordon/uncordon a node. 
This is very useful when investigating an issue, ensuring the node doesn't change while you debug.\n\n```\n$ kubectl cordon ip-10-0-84-104.us-west-2.compute.internal\nnode \"ip-10-0-84-104.us-west-2.compute.internal\" cordoned\n```\n\nTo undo, run uncordon\n\n```\n$ kubectl uncordon ip-10-0-84-104.us-west-2.compute.internal\nnode \"ip-10-0-84-104.us-west-2.compute.internal\" uncordoned\n```\n\n## Monitoring\n\nUsing [contrib/kube-prometheus](https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus) deployed in the self-hosted configuration.\n\nPort-forward to run queries against Prometheus (the loop restarts the forward if it drops)\n\n```\nwhile true; do kubectl port-forward -n monitoring prometheus-k8s-0 9090; done\n```\n\nNOTE: a few bugs were [found and filed](https://github.com/coreos/prometheus-operator/issues/created_by/philips) against this configuration\n\n## Configure etcd backup\n\nNote: S3 backup isn't working in the etcd Operator on self-hosted yet; hunting this down.\n\nSet up AWS upload creds:\n\n```\nkubectl create secret generic aws-credential --from-file=$HOME/.aws/credentials -n kube-system\nkubectl create configmap aws-config --from-file=$HOME/.aws/config-us-west-1 -n kube-system\n```\n\nEdit the etcd-operator deployment\n\n```\nkubectl edit deployment etcd-operator -n kube-system\n```\n\nand add the backup flags to its command:\n\n```\n      - command:\n        - /usr/local/bin/etcd-operator\n        - --backup-aws-secret\n        - aws-credential\n        - --backup-aws-config\n        - aws-config\n        - --backup-s3-bucket\n        - tectonic-eo-etcd-backups\n```\n\nThen re-apply the etcd cluster object:\n\n```\nkubectl get cluster.etcd -n kube-system kube-etcd -o yaml \u003e etcd\nkubectl replace -f etcd -n kube-system\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilips%2Fkubernetes-day-2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphilips%2Fkubernetes-day-2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilips%2Fkubernetes-day-2/lists"}