https://github.com/mhausenblas/troubleshooting-k8s-apps

Troubleshooting Kubernetes Applications
https://github.com/mhausenblas/troubleshooting-k8s-apps
chaos-engineering debugging distributed-tracing kubernetes network observability security storage troubleshooting
Last synced: 3 months ago
JSON representation
Troubleshooting Kubernetes Applications
Host: GitHub
URL: https://github.com/mhausenblas/troubleshooting-k8s-apps
Owner: mhausenblas
License: apache-2.0
Created: 2018-09-15T04:00:35.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2020-07-09T12:59:48.000Z (about 5 years ago)
Last Synced: 2025-04-12T18:05:31.221Z (3 months ago)
Topics: chaos-engineering, debugging, distributed-tracing, kubernetes, network, observability, security, storage, troubleshooting
Language: HTML
Homepage: https://troubleshooting.kubernetes.sh
Size: 2 MB
Stars: 35
Watchers: 4
Forks: 13
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        # Troubleshooting Kubernetes Applications

_Table of contents:_

Preparation | Intro | Poking pods

--- | --- | ---

Storage | Network | Security

Observability | Vaccination | References

To demonstrate the different issues and failures as well as how to fix them, I've been using the commands and resources as shown below. 

_NOTE_: whenever you see a 📄 icon, it means this is a reference to the official Kubernetes [docs](https://kubernetes.io/docs/).

## Prerequisits

- Kubernetes 1.16 or higher

## Preparation

Before starting, set up:

```

# create the namespace we'll be operating in:

kubectl create ns vnyc

# in different tmux pane keep an eye on the resources:

watch kubectl -n vnyc get all

```

## Intro

Using [00_intro.yaml](00_intro.yaml):

```

kubectl -n vnyc apply -f 00_intro.yaml

kubectl -n vnyc describe deploy/unhappy-camper

THEPOD=$(kubectl -n vnyc get po -l=app=whatever --output=jsonpath={.items[*].metadata.name})

kubectl -n vnyc describe po/$THEPOD

kubectl -n vnyc logs $THEPOD

kubectl -n vnyc exec -it $THEPOD -- sh

kubectl -n vnyc delete deploy/unhappy-camper

```

## Poking pods

### Pod lifecycle

![pod lifecycle](img/pod-lifecycle-inline.png)

_Download [in original resolution](https://github.com/mhausenblas/troubleshooting-k8s-apps/raw/master/img/pod-lifecycle.png)._

### Image issue

Using [01_pp_image.yaml](01_pp_image.yaml):

```

# let's deploy a confused image and look for the error:

kubectl -n vnyc apply -f 01_pp_image.yaml

kubectl -n vnyc get events | grep confused | grep Error

# fix it by specifying the correct image:

kubectl -n vnyc patch deployment confused-imager  \

                --patch '{ "spec" : { "template" : { "spec" : { "containers" : [ { "name" : "something" , "image" : "mhausenblas/simpleservice:0.5.0" } ] } } } }'

kubectl -n vnyc delete deploy/confused-imager

```

Relevant real-world examples on StackOverflow:

- [Kubernetes how to debug CrashLoopBackoff](https://stackoverflow.com/questions/44673957/kubernetes-how-to-debug-crashloopbackoff)

- [Kubernetes imagePullSecrets not working; getting “image not found”](https://stackoverflow.com/questions/32510310/kubernetes-imagepullsecrets-not-working-getting-image-not-found)

- [Trying to create a Kubernetes deployment but it shows 0 pods available](https://stackoverflow.com/questions/51139988/trying-to-create-a-kubernetes-deployment-but-it-shows-0-pods-available)

### Keeps crashing

Using [02_pp_oomer.yaml](02_pp_oomer.yaml) and [02_pp_oomer-fixed.yaml](02_pp_oomer-fixed.yaml):

```

# prepare a greedy fellow that will OOM:

kubectl -n vnyc apply -f 02_pp_oomer.yaml

# wait > 5s and then check mem in container:

kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=oomer --output=jsonpath={.items[*].metadata.name}) -c greedymuch -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes /sys/fs/cgroup/memory/memory.usage_in_bytes

kubectl -n vnyc describe po $(kubectl -n vnyc get po -l=app=oomer --output=jsonpath={.items[*].metadata.name})

# fix the issue:

kubectl -n vnyc apply -f 02_pp_oomer-fixed.yaml

# wait > 20s

kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=oomer --output=jsonpath={.items[*].metadata.name}) -c greedymuch -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes /sys/fs/cgroup/memory/memory.usage_in_bytes

kubectl -n vnyc delete deploy wegotan-oomer

```

Relevant real-world examples on StackOverflow: 

- [InfluxDB container dies over time, and can't restart](https://stackoverflow.com/questions/37877432/influxdb-container-dies-over-time-and-cant-restart)

- [AWS deployment with kubernetes 1.7.2 continuously running in pod getting killed and restarted](https://stackoverflow.com/questions/47849502/aws-deployment-with-kubernetes-1-7-2-continuously-running-in-pod-getting-killed)

### Something's wrong with the app

Using [03_pp_logs.yaml](03_pp_logs.yaml): 

```

kubectl -n vnyc apply -f 03_pp_logs.yaml

# nothing to see here:

kubectl -n vnyc describe deploy/hiccup

# but I see it in the logs:

kubectl -n vnyc logs --follow $(kubectl -n vnyc get po -l=app=hiccup --output=jsonpath={.items[*].metadata.name})

kubectl -n vnyc delete deploy hiccup

```

Relevant real-world examples on StackOverflow: 

- [My kubernetes pods keep crashing with “CrashLoopBackOff” but I can't find any log](https://stackoverflow.com/questions/41604499/my-kubernetes-pods-keep-crashing-with-crashloopbackoff-but-i-cant-find-any-lo)

- [Kubernetes Readiness probe failed error](https://stackoverflow.com/questions/48540929/kubernetes-readiness-probe-failed-error)

- [Kubernetes error: validating data found invalid field env for v1 PodSpec](https://stackoverflow.com/questions/43532990/kubernetes-error-validating-data-found-invalid-field-env-for-v1-podspec)

References:

- [Debugging microservices - Squash vs. Telepresence](https://www.weave.works/blog/debugging-microservices-squash-vs-telepresence)

- [Debugging and Troubleshooting Microservices in Kubernetes with Ray Tsang (Google)](https://www.weave.works/blog/debugging-and-troubleshooting-microservices-in-kubernetes)

- [Troubleshooting Kubernetes Using Logs](https://blog.papertrailapp.com/troubleshoot-kubernetes-using-logs/)

## Storage

Using [04_storage-failedmount.yaml](04_storage-failedmount.yaml) and [04_storage-failedmount-fixed.yaml](04_storage-failedmount-fixed.yaml): 

```

kubectl -n vnyc apply -f 04_storage-failedmount.yaml

# has the data been written?

kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=wheresmyvolume --output=jsonpath={.items[*].metadata.name}) -c writer -- cat /tmp/out/data

# has the data been read in?

kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=wheresmyvolume --output=jsonpath={.items[*].metadata.name}) -c reader -- cat /tmp/in/data

kubectl -n vnyc describe po $(kubectl -n vnyc get po -l=app=wheresmyvolume --output=jsonpath={.items[*].metadata.name})

kubectl -n vnyc apply -f 04_storage-failedmount-fixed.yaml

kubectl -n vnyc delete deploy wheresmyvolume

```

Relevant real-world examples on StackOverflow: 

- [How to find out why mounting an emptyDir volume fails in Kubernetes?](https://stackoverflow.com/questions/51206154/how-to-find-out-why-mounting-an-emptydir-volume-fails-in-kubernetes)

- [Kubernetes NFS volume mount fail with exit status 32](https://stackoverflow.com/questions/34113569/kubernetes-nfs-volume-mount-fail-with-exit-status-32)

References:

- [Debugging Kubernetes PVCs](https://itnext.io/debugging-kubernetes-pvcs-a150f5efbe95)

- Further references see [Stateful Kubernetes](https://stateful.kubernetes.sh/) 

## Network

Using [05_network-wrongsel.yaml](05_network-wrongsel.yaml) and [05_network-wrongsel-fixed.yaml](05_network-wrongsel-fixed.yaml): 

```

kubectl -n vnyc run webserver --image nginx --port 80

kubectl -n vnyc apply -f 05_network-wrongsel.yaml 

kubectl -n vnyc run -it --rm debugpod --restart=Never --image=centos:7 -- curl webserver.vnyc

kubectl -n vnyc run -it --rm debugpod --restart=Never --image=centos:7 -- ping webserver.vnyc

kubectl -n vnyc run -it --rm debugpod --restart=Never --image=centos:7 -- ping $(kubectl -n vnyc get po -l=run=webserver --output=jsonpath={.items[*].status.podIP})

kubectl -n vnyc apply -f 05_network-wrongsel-fixed.yaml 

kubectl -n vnyc delete deploy webserver 

```

Other scenarios often found:

- See an error message that says something like `connection refused`? You could be hitting the `127.0.0.1` issue with the solution to make the app listen on `0.0.0.0` rather than on localhost. Further, see also some discussion [here](https://superuser.com/questions/949428/whats-the-difference-between-127-0-0-1-and-0-0-0-0).

- Missing firewall rules, from cluster-internal open ports to communication between clusters can cause all kinds of issues. It very much depends on the environment (AWS, Azure, GCP, on-premises, etc.) how exactly you go about it and most certainly is an infra admin task rather than an appops task.

- Taking a pod offline for debugging: on the pod, simply remove the relevant label(s) the service uses in its `selector` and that removes the pod from the pool of endpoints the service has to serve traffic to while leaving the pod running, ready for you to `kubectl exec -it` in.

Relevant real-world examples on StackOverflow: 

- [Connection Refused error when connecting to Kubernetes Redis Service](https://stackoverflow.com/questions/48597726/connection-refused-error-when-connecting-to-kubernetes-redis-service/)

- [“kubectl get pods” showing STATUS - ImagePullbackOff](https://stackoverflow.com/questions/51164795/kubectl-get-pods-showing-status-imagepullbackoff)

- [Service not exposing in kubernetes](https://stackoverflow.com/questions/51662015/service-not-exposing-in-kubernetes)

- [Kubernetes: Can not curl minikube pod](https://stackoverflow.com/questions/52289583/kubernetes-can-not-curl-minikube-pod/52289956)

References:

- [Debug Services](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/) 📄

- [Troubleshooting Kubernetes Networking Issues](https://gravitational.com/blog/troubleshooting-kubernetes-networking/)

- Further references see [Container Networking](https://mhausenblas.info/cn-ref/)

## Security

```

kubectl -n vnyc create sa prober

kubectl -n vnyc run -it --rm probepod --serviceaccount=prober --restart=Never --image=centos:7 -- sh

# in the container; will result in an 403, b/c we don't have the permissions necessary:

export CURL_CA_BUNDLE=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

APISERVERTOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

curl -H "Authorization: Bearer $APISERVERTOKEN"  https://kubernetes.default/api/v1/namespaces/vnyc/pods

# different tmux pane, verify if the SA actually is allowed to:

kubectl -n vnyc auth can-i list pods --as=system:serviceaccount:vnyc:prober

# … seems not to be the case, so give sufficient permissions:

kubectl create clusterrole podreader \

        --verb=get --verb=list \

        --resource=pods

kubectl -n vnyc create rolebinding allowpodprobes \

        --clusterrole=podreader \

        --serviceaccount=vnyc:prober \

        --namespace=vnyc

# clean up

kubectl delete clusterrole podreader && kubectl delete ns vnyc

```

Relevant real-world examples on StackOverflow: 

- [Accessing Kubernetes API from pod fails although roles are configured is configured](https://stackoverflow.com/questions/52095161/accessing-kubernetes-api-from-pod-fails-although-roles-are-configured-is-configu)

- [How to deploy a deployment in another namespace in Kubernetes?](https://stackoverflow.com/questions/52297676/how-to-deploy-a-deployment-in-another-namespace-in-kubernetes/52298101#52298101)

- [Unable to read my newly created Kubernetes secret](https://stackoverflow.com/questions/51418711/unable-to-read-my-newly-created-kubernetes-secret/51419033#51419033)

References see [kubernetes-security.info](https://kubernetes-security.info/).

## Observability

From metrics ([Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/)) to logs (EFK/ELK stack) to tracing ([OpenCensus](https://opencensus.io/) and [OpenTracing](http://opentracing.io/)).

### Service ops in practice

Show [Linkerd 2.0 in action](https://medium.com/@mhausenblas/linkerd-2-0-service-ops-for-you-and-me-281cc5bd6424) using [this Katacoda scenario](https://www.katacoda.com/mhausenblas/scenarios/linkerd2) as a starting point.

![Screen shot of Linkerd 2.0 dashboard](img/linkerd2-overview.png)

### Distributed tracing and debugging

Show [Jaeger 1.6 in action](https://www.jaegertracing.io/docs/1.6/getting-started/) using [this Katacoda scenario](https://katacoda.com/opentracing/scenarios/golang-hotrod-demo).

![Screen shot of Jaeger 1.6](img/jaeger-overview.png)

References:

- [Logs and Metrics](https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38)

- [Evolution of Monitoring and Prometheus](https://www.slideshare.net/brianbrazil/evolution-of-monitoring-and-prometheus-dublin-2018)

- [The life of a span](https://medium.com/jaegertracing/the-life-of-a-span-ee508410200b)

- [Distributed Tracing with Jaeger & Prometheus on Kubernetes](https://blog.openshift.com/openshift-commons-briefing-82-distributed-tracing-with-jaeger-prometheus-on-kubernetes/)

- Debugging: [KubeSquash](https://github.com/solo-io/kubesquash)

## Vaccination

Show [chaoskube](https://github.com/linki/chaoskube) in action, killing off random pods in the `vnyc` namespace.

We have the following setup:

``` 
 
 
 
 
+----------------+ 
|                | 
| appserver/pod1 +--------+ 
|                | 
+----------------+ 
 
                          |   X 
                          v  X 
 
                          ^  X 
+----------------+        |   X 
|                |        |    X 
| appserver/pod2 +--------+     X 
|                | 
+----------------+ 
 
 
 
 
 
 
```

+----------------+ |                | +-----> | webserver/pod1 | |       |                | |       +----------------+ |       +----------------+ +---------+  |       |                | |      +--+         |  +-----> | webserver/pod2 | |     X             |  |       |                | |    X              |  |       +----------------+ |  |       +----------------+ |  |       |                | X   svc/webserver +--------> | webserver/pod3 | |  |       |                | |  |       +----------------+ |  |       +----------------+ |  |       |                | +--+          |  +-----> | webserver/pod4 | +----------+  |       |                | |       +----------------+ |       +----------------+ |       |                | +-----> | webserver/pod5 | |                | +----------------+

That is, a `webserver` running with five replicas along with a service as well as an `appserver` running with two replicas that queries said service.

```

# let's create our victims, that is webservers and appservers:

kubectl create ns vnyc

kubectl -n vnyc run webserver --image nginx --port 80 --replicas 5

kubectl -n vnyc expose deploy/webserver

kubectl -n vnyc run appserver --image centos:7 --replicas 2 -- sh -c "while true; do curl webserver ; sleep 10 ; done"

kubectl -n vnyc logs deploy/appserver --follow

# also keep on the events generated: 

kubectl -n vnyc get events --watch

# now release the chaos monkey:

chaoskube \

    --interval 30s \

    --namespaces 'vnyc' \

    --no-dry-run

kubectl delete ns vnyc

```

And here's a screen shot of `chaoskube` in action, with all the above commands applied:

![screen shot of chaoskube in action](img/chaoskube-in-action.png)

References:

- [Kubernetes: five steps to well-behaved apps](https://medium.com/@betz.mark/kubernetes-five-steps-to-well-behaved-apps-a7cbeb99471a)

- [Kubernetes Best Practices](https://medium.com/google-cloud/kubernetes-best-practices-8d5cd03446e2)

- [Developing on Kubernetes](https://kubernetes.io/blog/2018/05/01/developing-on-kubernetes/)

- [Kubernetes Application Operator Basics](https://blog.openshift.com/kubernetes-application-operator-basics/) 

- [Using chaoskube with OpenEBS](https://blog.openebs.io/chaos-engineering-on-openebs-7d4e0f995545)

- Tooling:

  - [linki/chaoskube](https://github.com/linki/chaoskube)

  - [asobti/kube-monkey](https://github.com/asobti/kube-monkey)

  - [bloomberg/powerfulseal](https://github.com/bloomberg/powerfulseal)

  - [AlexsJones/k8aos](https://github.com/AlexsJones/k8aos)

  - [jnewland/kubernetes-pod-chaos-monkey](https://github.com/jnewland/kubernetes-pod-chaos-monkey)

## References

### General

- [Troubleshoot Applications](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/) 📄

- [Troubleshoot Clusters](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/) 📄

- A site dedicated to [Kubernetes Troubleshooting](https://kubernetes.feisky.xyz/en/troubleshooting/) 

- [Debug a Go Application in Kubernetes from IDE](https://itnext.io/debug-a-go-application-in-kubernetes-from-ide-c45ad26d8785)

- CrashLoopBackoff, Pending, FailedMount and Friends: Debugging Common Kubernetes Cluster (KubeCon NA 2017): [video](https://www.youtube.com/watch?v=7FOCG5kua1w) and [slide deck](https://afontofuseless.info/debugging-kubernetes-app-deploys-kc2017/)

- 10 Most Common Reasons Kubernetes Deployments Fail: [Part 1](https://kukulinski.com/10-most-common-reasons-kubernetes-deployments-fail-part-1/) and [Part 2](https://kukulinski.com/10-most-common-reasons-kubernetes-deployments-fail-part-1/)

### Language or platform specific

- [Debugging Microservices: How Google SREs Resolve Outages](https://www.infoq.com/presentations/google-debug-microservices)

- [Debugging Microservices: Lessons from Google, Facebook, Lyft](https://thenewstack.io/debugging-microservices-lessons-from-google-facebook-lyft/)

- [Troubleshooting Java applications on OpenShift](https://developers.redhat.com/blog/2017/08/16/troubleshooting-java-applications-on-openshift/)

- Google Kubernetes Engine [Troubleshooting](https://cloud.google.com/kubernetes-engine/docs/troubleshooting) docs
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mhausenblas/troubleshooting-k8s-apps

Awesome Lists containing this project

README