- Host: GitHub
- URL: https://github.com/philips/kubernetes-day-2
- Owner: philips
- Created: 2017-03-29T20:44:50.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-04-12T23:26:35.000Z (about 9 years ago)
- Last Synced: 2025-03-27T17:26:49.031Z (about 1 year ago)
- Topics: backup, cluster, devops, kubernetes-cluster, monitoring, operations, prometheus
- Homepage:
- Size: 5.86 KB
- Stars: 19
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Kubernetes Day 2
These are notes to accompany my [KubeCon EU 2017 talk](https://cloudnativeeu2017.sched.com/event/9Tcw/kubernetes-day-2-cluster-operations-i-brandon-philips-coreos). The slides [are available as well](https://docs.google.com/presentation/d/1LpiWAGbK77Ha8mOxyw01VlZ18zdMSNimhsrb35sUFv8/edit?usp=sharing). The video is [available on YouTube](https://www.youtube.com/watch?v=U1zR0eDQRYQ).
How do you keep a Kubernetes cluster running long term? Just like any other service, you need a combination of monitoring, alerting, backup, upgrade, and infrastructure management strategies to make it happen. This talk will walk through and demonstrate the best practices for each of these questions and show off the latest tooling that makes it possible. The takeaway will be lessons and considerations that will influence the way you operate your own Kubernetes clusters.
## WARNING
These are notes for a conference talk. Much of this may become out of date very quickly. My goal is to turn much of this into docs over time.
## Cluster Setup
All of the demos in this talk were done with a self-hosted cluster deployed with the [Tectonic Installer](https://github.com/coreos/tectonic-installer#tectonic-installer) on AWS.
This cluster was also deployed using the self-hosted etcd option, which at the time of this writing [hasn't been merged into the Tectonic Installer](https://github.com/coreos/tectonic-installer/pull/135) quite yet.
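Because the cluster is self-hosted, the control plane itself runs as pods in the `kube-system` namespace. A quick sanity check (nothing Tectonic-specific, just plain kubectl) shows those components and which nodes they landed on:
```
kubectl get pods -n kube-system -o wide
```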
## Failing a Scheduler
Scale the kube-scheduler deployment down to remove all schedulers:
```
kubectl scale -n kube-system deployment kube-scheduler --replicas=0
```
OH NO, scale it back up
```
kubectl scale -n kube-system deployment kube-scheduler --replicas=1
```
Unfortunately, it is too late. Everything is ruined?!?! The new scheduler pod stays Pending because there is no scheduler left to place it:
```
kubectl get pods -l k8s-app=kube-scheduler -n kube-system
NAME READY STATUS RESTARTS AGE
kube-scheduler-3027616201-53jfh 0/1 Pending 0 52s
```
Get the current kube-scheduler deployment:
```
kubectl get -n kube-system deployment -o yaml kube-scheduler > sched.yaml
```
Pick a node name from this list at random
```
kubectl get nodes -l master=true
```
Edit sched.yaml down to just a pod spec and set the spec.nodeName field to the node selected above. Something like this:
```
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  nodeName: ip-10-0-37-115.us-west-2.compute.internal
  containers:
  - command:
    - ./hyperkube
    - scheduler
    - --leader-elect=true
    image: quay.io/coreos/hyperkube:v1.5.5_coreos.0
    imagePullPolicy: IfNotPresent
    name: kube-scheduler
    resources: {}
    terminationMessagePath: /dev/termination-log
  dnsPolicy: ClusterFirst
  nodeSelector:
    master: "true"
  restartPolicy: Always
  securityContext: {}
  terminationGracePeriodSeconds: 30
```
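The notes don't spell out the next command, but the edited manifest then needs to be created by hand. Because spec.nodeName is already set, the kubelet on that node runs the pod directly, no scheduler required:
```
kubectl create -f sched.yaml
```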
At this point the manually placed pod can schedule the pending pod, so the deployment's scheduler should become Ready and take over:
```
kubectl get pods -l k8s-app=kube-scheduler -n kube-system
```
Delete the temporary pod
```
kubectl delete pod -n kube-system kube-scheduler
```
## Downgrade/Upgrade Scheduler
Edit the scheduler deployment and downgrade the image by one patch release.
```
kubectl edit -n kube-system deployment kube-scheduler
```
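The edit boils down to changing the container image tag. A non-interactive equivalent, assuming the hyperkube image from the pod spec above (the target tag here is only illustrative), would be:
```
kubectl set image -n kube-system deployment/kube-scheduler kube-scheduler=quay.io/coreos/hyperkube:v1.5.4_coreos.0
```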
Now edit the scheduler deployment again and upgrade the image back up a patch release.
```
kubectl edit -n kube-system deployment kube-scheduler
```
Boom!
## kubectl drain and cordon
```
$ kubectl get nodes
NAME STATUS AGE
ip-10-0-13-248.us-west-2.compute.internal Ready 19h
```
To make a node unschedulable and evict all of its pods, run the following:
```
kubectl drain ip-10-0-13-248.us-west-2.compute.internal
```
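Note that drain refuses to evict DaemonSet-managed pods and pods that aren't backed by a controller unless told otherwise; if the node runs either, flags like these (use with care) are needed:
```
kubectl drain ip-10-0-13-248.us-west-2.compute.internal --ignore-daemonsets --force
```
When maintenance is done, `kubectl uncordon` (next section) makes the node schedulable again.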
## kubectl cordon and uncordon
To ensure a node doesn't get additional workloads, you can cordon/uncordon it. This is very useful when investigating an issue, since it ensures the node's workload doesn't change while you debug.
```
$ kubectl cordon ip-10-0-84-104.us-west-2.compute.internal
node "ip-10-0-84-104.us-west-2.compute.internal" cordoned
```
To undo it, run uncordon:
```
$ kubectl uncordon ip-10-0-84-104.us-west-2.compute.internal
node "ip-10-0-84-104.us-west-2.compute.internal" uncordoned
```
## Monitoring
Monitoring uses [contrib/kube-prometheus](https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus), deployed in the self-hosted configuration.
Port-forward to run queries against Prometheus:
```
while true; do kubectl port-forward -n monitoring prometheus-k8s-0 9090; done
```
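With the port-forward running, queries can also be issued against the Prometheus HTTP API on localhost, e.g. to check that every scrape target is up:
```
curl 'http://localhost:9090/api/v1/query?query=up'
```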
NOTE: a few bugs were [found and filed](https://github.com/coreos/prometheus-operator/issues/created_by/philips) against this configuration
## Configure etcd backup
Note: S3 backup isn't working in the etcd Operator on self-hosted yet; hunting this down.
Set up the AWS upload credentials:
```
kubectl create secret generic aws-credential --from-file=$HOME/.aws/credentials -n kube-system
kubectl create configmap aws-config --from-file=$HOME/.aws/config-us-west-1 -n kube-system
```
Edit the etcd-operator deployment and add the backup flags to its command:
```
kubectl edit deployment etcd-operator -n kube-system
```
```
- command:
  - /usr/local/bin/etcd-operator
  - --backup-aws-secret
  - aws-credential
  - --backup-aws-config
  - aws-config
  - --backup-s3-bucket
  - tectonic-eo-etcd-backups
```
Finally, get the etcd cluster object, add a backup policy to its spec, and replace it:
```
kubectl get cluster.etcd -n kube-system kube-etcd -o yaml > etcd
kubectl replace -f etcd -n kube-system
```