Node NotReady because of unresponsive Kubelet

Problem

When a Node's status is NotReady, its Pods may remain in the Terminating state indefinitely.

Background

See Nodes.

Symptoms

The problem can usually be observed as follows:

  1. The Node enters the NotReady state.
  2. The Node controller marks all Pods running on that Node for deletion.
  3. The Pods are never actually removed and stay in the Terminating status.
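
The symptoms above can be checked from the command line. The following is a minimal sketch, assuming kubectl is configured for the affected cluster and prints its default column layout; the function names not_ready_nodes and terminating_pods_on_node are hypothetical helpers, not part of kubectl.

```shell
# Print the names of all Nodes whose STATUS column contains NotReady
# (this also matches combined states such as "NotReady,SchedulingDisabled").
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 ~ /NotReady/ { print $1 }'
}

# Print "namespace/name" for every Pod on the given Node that is shown
# as Terminating. Assumes the default columns of
# "kubectl get pods --all-namespaces": NAMESPACE NAME READY STATUS ...
terminating_pods_on_node() {
  kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=$1" |
    awk '$4 == "Terminating" { print $1 "/" $2 }'
}
```

Running not_ready_nodes first and then terminating_pods_on_node with one of the returned names shows whether a Node matches the pattern described here.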

Causes

Among the possible causes are:

  • High rate of Pod creation
  • Many sidecar containers

Mitigation

Pods that are stuck in Terminating won't be deleted automatically while the Kubelet isn't working, because the Kubelet has to confirm that their containers have stopped.
To force-delete all Pods on a Node, run:

kubectl delete pods --all-namespaces --force --grace-period=0 --field-selector spec.nodeName="${NODE_NAME}"

This removes the Pods from the Kubernetes storage (etcd) only.
Containers belonging to these Pods will remain running on the broken Node.
Since the underlying machine is going to be deleted anyway, this usually doesn't matter.
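
Because a force deletion cannot be undone, it can help to preview the affected Pods first. The following is a minimal sketch of such a guard, assuming kubectl is configured for the affected cluster; the function name force_delete_node_pods and its "yes" confirmation argument are hypothetical conventions chosen for illustration.

```shell
# Hypothetical helper: force-delete all Pods bound to one Node.
# Usage: force_delete_node_pods <node-name> [yes]
force_delete_node_pods() {
  node="$1"
  confirm="${2:-no}"

  # Refuse to run with an empty name: an empty selector value would not
  # target the intended Node.
  if [ -z "$node" ]; then
    echo "usage: force_delete_node_pods <node-name> [yes]" >&2
    return 2
  fi

  if [ "$confirm" != "yes" ]; then
    # Preview only: list the Pods the field selector matches.
    kubectl get pods --all-namespaces -o wide \
      --field-selector "spec.nodeName=$node"
    return 0
  fi

  # Remove the Pod objects without waiting for the Kubelet to confirm
  # that the containers have stopped.
  kubectl delete pods --all-namespaces --force --grace-period=0 \
    --field-selector "spec.nodeName=$node"
}
```

Calling the function with only the Node name lists the Pods; repeating the call with "yes" as the second argument performs the deletion.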

Prevention & Best Practices

Make sure to adhere to the following best practices:

  • Update and replace Nodes frequently (treat Nodes as cattle)
  • Don't overprovision your Nodes
  • Define resource limits on your Pods
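
As an illustration of the last point, resource requests and limits can be set on an existing Deployment with kubectl set resources. The following is a minimal sketch; the function name set_limits is hypothetical, and the CPU and memory values are placeholders that have to be sized for the actual workload.

```shell
# Hypothetical helper: apply requests and limits to every container
# of a Deployment.
# Usage: set_limits <deployment-name> <namespace>
set_limits() {
  # The values below are placeholders, not recommendations.
  kubectl set resources deployment "$1" --namespace "$2" \
    --requests=cpu=100m,memory=128Mi \
    --limits=cpu=500m,memory=256Mi
}
```

Setting the same values in the Deployment manifest itself is usually preferable, since a later apply of a manifest without them would drop the change again.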