Thursday, January 31, 2019

Troubleshooting Kubernetes: "zombie pods"

I recently ran into a mysterious problem as I was developing and testing a lab exercise to teach about Kubernetes resiliency. I sort of caused the problem myself: I had run through several scenarios with the example application, and I wanted to blow it all away and start over, so I just started deleting things. That, my friends, is a sure-fire way to break something. If you are dealing with Deployments and ReplicaSets, merely deleting a pod is just going to cause K8s to try to redeploy it.

I ended up with a handful of pods that were stuck in a state of "Terminating," and they would not die. For days. So I asked around and tried researching the problem. A Google search for "pods stuck in terminating" gave many hits, with many different possible causes and solutions. Some issues mentioned kubelet and a hostname mismatch - that was not it.
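For reference, kubectl shows a pod as "Terminating" once its metadata carries a deletionTimestamp, and the pod stays that way until the kubelet on its node confirms the containers are gone. Here is a minimal sketch of how one might list such pods, together with the node each one is assigned to, from the output of `kubectl get pods --all-namespaces -o json`. The function name and the sample data are hypothetical, made up for illustration.

```python
def terminating_pods(pod_list):
    """Return (namespace, name, node) for every pod marked for deletion.

    pod_list is the parsed JSON from `kubectl get pods --all-namespaces -o json`.
    A pod with metadata.deletionTimestamp set is what kubectl reports as
    "Terminating".
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        if meta.get("deletionTimestamp"):
            node = pod.get("spec", {}).get("nodeName")
            stuck.append((meta.get("namespace"), meta.get("name"), node))
    return stuck


# Hypothetical sample data, trimmed to the fields the function reads.
sample_pods = {
    "items": [
        {
            "metadata": {
                "namespace": "default",
                "name": "web-1",
                "deletionTimestamp": "2019-01-28T10:00:00Z",
            },
            "spec": {"nodeName": "worker-1"},
        },
        {
            "metadata": {"namespace": "default", "name": "web-2"},
            "spec": {"nodeName": "worker-2"},
        },
    ]
}

for namespace, name, node in terminating_pods(sample_pods):
    print(f"{namespace}/{name} is terminating on node {node}")
```

Against a real cluster, the same function could be fed with `json.load(sys.stdin)` after piping the kubectl output into the script. The node column is the useful part: it tells you which machine has to be alive for the deletion to finish.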

I tried doing a drain, cordon, and shutdown of the node. When I started it back up, the pods were still there, still terminating. 

I tried deleting the Helm release, but the helm command would not work. I got errors, which I googled, and that pointed to a problem with Tiller. I tried deleting the helm cache. I tried reinstalling Tiller, but Tiller would not come up. The Tiller pod was stuck in a state of "Pending."

Some of the issues came from people trying to deploy pods on the master node, which is something you typically don't do, because the whole point of K8s is to let it schedule pods on the worker nodes. However, if you want a single-node environment, or something like that, you must remove the taint on the master node that prevents pods from being scheduled on it. That sounded a lot like my situation, because I was actually only running a master node...
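To see whether a taint is what keeps pods off a node, the taints can be read from `kubectl get nodes -o json`. Below is a minimal sketch, assuming a kubeadm-style cluster of that era, where the master carries the taint key `node-role.kubernetes.io/master` with effect NoSchedule; other setups may use a different key. The function name and the sample data are hypothetical.

```python
def noschedule_taints(node_list):
    """Map node name -> list of taint keys with effect NoSchedule.

    node_list is the parsed JSON from `kubectl get nodes -o json`.
    Nodes without any NoSchedule taint are left out of the result.
    """
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        # spec.taints is absent (or null) on untainted nodes.
        taints = node.get("spec", {}).get("taints") or []
        keys = [t["key"] for t in taints if t.get("effect") == "NoSchedule"]
        if keys:
            result[name] = keys
    return result


# Hypothetical sample data: one tainted master, one untainted worker.
sample_taint_nodes = {
    "items": [
        {
            "metadata": {"name": "master"},
            "spec": {
                "taints": [
                    {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}
                ]
            },
        },
        {"metadata": {"name": "worker-1"}, "spec": {}},
    ]
}

for name, keys in noschedule_taints(sample_taint_nodes).items():
    print(f"{name} will not accept ordinary pods because of: {', '.join(keys)}")
```

If every node that is up shows a NoSchedule taint, new pods that do not tolerate it have nowhere to go and sit in "Pending."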

...and that was the problem. I started with a cluster with two worker nodes, but they were offline. I had not restarted them since encountering the problem. After starting the workers, the status of all the hung pods resolved.
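A check like the following would have surfaced the offline workers right away. It is only a sketch: it reads the parsed output of `kubectl get nodes -o json` and reports every node whose "Ready" condition is not "True" (an unreachable node usually shows "Unknown"). The function name and the sample data are hypothetical.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True".

    node_list is the parsed JSON from `kubectl get nodes -o json`.
    A node with no Ready condition at all is also reported.
    """
    down = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            down.append(node["metadata"]["name"])
    return down


# Hypothetical sample data: the master is up, both workers are unreachable.
sample_nodes = {
    "items": [
        {
            "metadata": {"name": "master"},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"name": "worker-1"},
            "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
        },
        {
            "metadata": {"name": "worker-2"},
            "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
        },
    ]
}

for name in not_ready_nodes(sample_nodes):
    print(f"node {name} is not Ready")
```

Pods assigned to a node in this list cannot finish terminating until its kubelet comes back, which matches what I saw once the workers were started again.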

The moral of this story is: sometimes the answer is just too obvious to find through an internet search.