Unresponsive Kubelet with runc 1.0.0-rc93

Newer containerd versions shipping runc 1.0.0-rc93 introduced a serious bug that can lead to broken nodes.
The issue is mitigated on MetaKube nodes created after 04/26/2021.


Kubernetes relies on the Kubelet running on each node to (1) report the node's status and (2) sync pod state (creation, deletion, etc.).
To create or delete pods, Kubelet needs to communicate with the container runtime that's installed on the node.
The Kubernetes API shows a pod's status, but Kubelet is the component that is actually responsible for the lifecycle of everything that makes up a pod.

Runc 1.0.0-rc93 (and in turn containerd) may hang indefinitely while checking a container's status.
This causes Kubelet's PLEG (Pod Lifecycle Event Generator) component to become unhealthy and its pod cache to go stale.


The problem may manifest in multiple ways:

  1. Kubernetes node(s) are marked as NotReady
  2. Pods are stuck in the Terminating phase
  3. New pods can't be created

Even if your cluster isn't experiencing any issues at the moment, you may still be affected.
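
To spot these symptoms quickly, you can filter standard `kubectl get` table output. A minimal sketch (the column positions assume the default table output without `-o` flags; the helper function names are made up for illustration):

```shell
# Helper filters for default `kubectl get` table output.
# STATUS is column 2 for `kubectl get nodes` and
# column 4 for `kubectl get pods --all-namespaces`.
not_ready_nodes() { awk 'NR > 1 && $2 == "NotReady" { print $1 }'; }
terminating_pods() { awk 'NR > 1 && $4 == "Terminating" { print $1 "/" $2 }'; }

# Usage (requires cluster access):
#   kubectl get nodes | not_ready_nodes
#   kubectl get pods --all-namespaces | terminating_pods
```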


To check whether your nodes have the faulty version installed, run the following on each node:

$ docker version
Client: Docker Engine - Community
 Version:           20.10.6
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        370c289
 Built:             Fri Apr  9 22:47:17 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd2
  Built:            Fri Apr  9 22:45:28 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

If the runc version is 1.0.0-rc93, the node should be replaced.
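
To avoid eyeballing the full output on every node, the runc version can be extracted with a small awk filter. A sketch, assuming the `docker version` output layout shown above (the function name is illustrative):

```shell
# Print the runc version from `docker version` output:
# take the first "Version:" line after the "runc:" section header.
runc_version() {
  awk '/^ *runc:/ { in_runc = 1; next } in_runc && /Version:/ { print $2; exit }'
}

# Usage on a node:
#   docker version | runc_version
```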


On nodes created after 04/26/2021, MetaKube installs the earlier runc version 1.0.0-rc92, which does not contain the bug.
It is therefore sufficient to replace all nodes in your cluster with new ones.

To replace all nodes:

  1. Change the node deployment (e.g. an updated image)


    If you want to keep everything as is, you can instead add an annotation to the MachineDeployment as follows:

    $ kubectl -n kube-system patch machinedeployments.cluster.k8s.io ${MACHINE_DEPLOYMENT_NAME} -p '{"spec": {"template": {"metadata": {"annotations": {"kubectl.kubernetes.io/restartedAt": "'$(date +"%Y-%m-%dT%T.%3N%z")'"}}}}}' --type merge

    This will create a new, otherwise identical, MachineSet.

  2. MetaKube will start creating new nodes

  3. Old nodes will be drained and deleted once new ones become Ready

Note (for big clusters):

You may want to set MaxSurge in the node deployment to a value greater than 1.
This reduces node rollover time, as multiple new nodes are created at once.
It does, however, require sufficient quota for CPU, RAM, and possibly floating IPs.
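
The MaxSurge value can be changed with a patch similar to the annotation one above. A sketch, assuming the field path `spec.strategy.rollingUpdate.maxSurge` from the cluster-api v1alpha1 MachineDeployment schema (verify it against your cluster, e.g. with `kubectl explain`):

```shell
# Patch body for raising MaxSurge to 3 (field path assumed from the
# cluster-api v1alpha1 MachineDeployment schema).
PATCH='{"spec":{"strategy":{"rollingUpdate":{"maxSurge":3}}}}'

# With cluster access you would apply it like:
#   kubectl -n kube-system patch machinedeployments.cluster.k8s.io \
#     ${MACHINE_DEPLOYMENT_NAME} -p "$PATCH" --type merge
echo "$PATCH"
```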

Possible issues

  1. If Kubelet on a node is already unresponsive, it won't be able to remove pods marked for deletion.
    In this case, the pods need to be force-deleted, e.g.:

    $ kubectl delete pods --field-selector=spec.nodeName=<node name> --grace-period=0 --force

    Note that this only removes the pods from the API server; it does not clean anything up on the broken node.
  2. If you have a PodDisruptionBudget that requires at least one available replica and only one matching pod exists, the node running that pod can't be drained.
    You'll have to either manually delete the pod or temporarily increase the replica count.
    In any case, using PDBs this way is considered bad practice, and we strongly advise against it.
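
Budgets that block draining can usually be spotted in `kubectl get pdb` output: any PDB whose ALLOWED DISRUPTIONS column is 0 will prevent its pod from being evicted. A sketch (the column position assumes the default table output; the helper name is illustrative):

```shell
# Print PDBs that currently allow zero disruptions (these block draining).
# Data columns of `kubectl get pdb --all-namespaces`:
# NAMESPACE NAME MIN-AVAILABLE MAX-UNAVAILABLE ALLOWED-DISRUPTIONS AGE
blocking_pdbs() { awk 'NR > 1 && $5 == "0" { print $1 "/" $2 }'; }

# Usage:
#   kubectl get pdb --all-namespaces | blocking_pdbs
```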