Unresponsive Kubelet with runc 1.0.0-rc93

Newer containerd versions that ship runc 1.0.0-rc93 introduced a serious bug that can lead to broken nodes.
The issue is mitigated on MetaKube nodes created after 04/26/2021.

Background

Kubernetes relies on Kubelet (which runs on each node) to 1. report the node's status and 2. sync pod state (creation, deletion, etc.).
To create or delete pods, Kubelet needs to communicate with the container runtime that's installed on the node.
The Kubernetes API shows a pod's status, but Kubelet is the component that is actually responsible for the lifecycle of everything that makes up a pod.
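The status that Kubelet reports for a node surfaces as conditions on the Node object; a quick way to inspect them (where <node name> is a placeholder for one of your nodes):

# The Conditions section shows the Ready condition and the last heartbeat received from Kubelet
$ kubectl describe node <node name>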

Runc 1.0.0-rc93 (and in turn containerd) may hang indefinitely while checking a container's status.
This causes Kubelet's PLEG (Pod Lifecycle Event Generator) to become unhealthy and its cache to become stale.
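
When this happens, the Kubelet logs on the affected node typically contain PLEG health check failures. One way to look for them, assuming Kubelet runs as a systemd service:

# Run on the affected node: search the Kubelet logs for PLEG health check failures
$ journalctl -u kubelet | grep -i 'PLEG is not healthy'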

Symptoms

The problem may manifest in multiple ways:

  1. Kubernetes node(s) are marked as NotReady
  2. Pods are stuck in the Terminating phase
  3. New pods can't be created
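
A quick way to check for the first two symptoms across the cluster:

# Affected nodes show NotReady in the STATUS column
$ kubectl get nodes

# List pods that are stuck terminating
$ kubectl get pods --all-namespaces | grep Terminating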

IMPORTANT:
Even if your cluster isn't experiencing any issues at the moment, you may still be affected.


Diagnosis

To check whether your nodes have the faulty version installed, run the following on each node:

$ docker version
Client: Docker Engine - Community
 Version:           20.10.6
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        370c289
 Built:             Fri Apr  9 22:47:17 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd2
  Built:            Fri Apr  9 22:45:28 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

If the runc version is 1.0.0-rc93, the node should be replaced.
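
To show only the relevant component, you can filter the output, e.g.:

$ docker version | grep -A 2 'runc:'
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec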

Mitigation

On nodes created after 04/26/2021, MetaKube installs the older runc version 1.0.0-rc92, which doesn't contain the bug.
It is therefore sufficient to replace all nodes in your cluster with new ones.

To replace all nodes:

  1. Change the node deployment (e.g. an updated image)


    Note:

    If you want to keep everything as is, you can instead add an annotation to the MachineDeployment as follows:

    $ kubectl -n kube-system patch machinedeployments.cluster.k8s.io ${MACHINE_DEPLOYMENT_NAME} -p '{"spec": {"template": {"metadata": {"annotations": {"kubectl.kubernetes.io/restartedAt": "'$(date +"%Y-%m-%dT%T.%3N%z")'"}}}}}' --type merge

    This will create a new, otherwise identical, MachineSet.


  2. MetaKube will start creating new nodes

  3. Old nodes will be drained and deleted once the new ones become Ready; you can watch the progress with the commands shown below
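
To follow the rollover, you can watch the machine objects and the nodes. A sketch using the same cluster.k8s.io resources as the patch command above:

# Watch the new MachineSet scale up and the old one scale down
$ kubectl -n kube-system get machinesets.cluster.k8s.io

# Watch nodes being replaced
$ kubectl get nodes -w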


Note (for big clusters):

You may want to set MaxSurge in the node deployment to a value greater than 1.
This reduces node rollover time, since multiple new nodes are created at once.
It does require enough quota for CPU, RAM, and possibly Floating IPs.
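
One way to set it, assuming your MachineDeployment exposes the usual strategy.rollingUpdate.maxSurge field (the value 3 is just an example):

# Example only: allow up to 3 surge nodes during the rollover
$ kubectl -n kube-system patch machinedeployments.cluster.k8s.io ${MACHINE_DEPLOYMENT_NAME} --type merge -p '{"spec": {"strategy": {"rollingUpdate": {"maxSurge": 3}}}}'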


Possible issues

  1. If Kubelet on a node is already unresponsive, it won't be able to remove pods marked for deletion.
    In this case, the pods need to be deleted with --force, e.g.:

    $ kubectl delete pods --field-selector=spec.nodeName=<node name> --force
  2. If you have a PodDisruptionBudget that requires one replica but only one matching pod exists, the node running that pod can't be drained.
    You'll have to either manually delete the pod or increase the replica count.
    In any case, this is considered bad practice and we strongly advise against using PDBs this way; the check below helps you spot such budgets.
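
To spot PodDisruptionBudgets that currently allow no disruptions (and will therefore block a drain), list them across all namespaces:

# PDBs with ALLOWED DISRUPTIONS 0 will block draining the nodes their pods run on
$ kubectl get poddisruptionbudgets --all-namespaces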
