Troubleshooting

General inspection

Machine lifecycle

To get an immediate view of the state of MachineDeployments and Nodes, run:

kubectl -n kube-system get machinedeployment,machineset,machine,no -o wide
NAME                                      AGE   DELETED   REPLICAS   AVAILABLEREPLICAS   PROVIDER    OS       VERSION
machinedeployment.cluster.k8s.io/worker   59d             2          2                   openstack   ubuntu   1.30.1

NAME                                          AGE   DELETED   REPLICAS   AVAILABLEREPLICAS   MACHINEDEPLOYMENT   PROVIDER    OS       VERSION
machineset.cluster.k8s.io/worker-677cf94d4d   8d              0                              worker              openstack   ubuntu   1.30.1
machineset.cluster.k8s.io/worker-77c8c559d6   8d              2          2                   worker              openstack   ubuntu   1.30.1

NAME                                             AGE   DELETED   MACHINESET          ADDRESS        NODE                      PROVIDER    OS       VERSION
machine.cluster.k8s.io/worker-77c8c559d6-mknks   8d              worker-77c8c559d6   192.168.1.20   worker-77c8c559d6-mknks   openstack   ubuntu   1.30.1
machine.cluster.k8s.io/worker-77c8c559d6-rfd8r   8d              worker-77c8c559d6   192.168.1.9    worker-77c8c559d6-rfd8r   openstack   ubuntu   1.30.1

NAME                           STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP       OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
node/worker-77c8c559d6-mknks   Ready    <none>   8d    v1.30.1   192.168.1.20                     Ubuntu 22.04.4 LTS   5.15.0-107-generic   containerd://1.6.33
node/worker-77c8c559d6-rfd8r   Ready    <none>   8d    v1.30.1   192.168.1.9                      Ubuntu 22.04.4 LTS   5.15.0-107-generic   containerd://1.6.33

Note:

  • When the MachineDeployment has converged, all but one MachineSet have 0 replicas
  • All Machines of a given MachineSet share the same name prefix
  • Node names are the same as Machine names
  • When the Machines have been fully provisioned, there's a Node object for each Machine

Machine Events

To inspect a specific Machine, describe it or check its Events:

kubectl -n kube-system events --for machine/worker-77c8c559d6-wvlcg
LAST SEEN               TYPE     REASON                     OBJECT                            MESSAGE
9m58s                   Normal   Created                    Machine/worker-77c8c559d6-wvlcg   Successfully created instance
7m26s (x5 over 9m53s)   Normal   InstanceFound              Machine/worker-77c8c559d6-wvlcg   Found instance at cloud provider, status: running
6m57s (x2 over 6m59s)   Normal   LabelsAnnotationsUpdated   Machine/worker-77c8c559d6-wvlcg   Successfully updated labels/annotations

Get Node conditions

To see the status of different Node conditions, run:

kubectl describe node $node
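
To print just the conditions, you can also use a jsonpath query (a sketch, assuming $node is set to the Node's name):

# Print each condition's type and status, one pair per line.
kubectl get node $node -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'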

List Pods running on Node

To list all Pods scheduled on a particular Node, run:

kubectl get pods --all-namespaces --field-selector spec.nodeName=$node

Delete Pods running on Node

In case the Node isn't responsive, you may choose to force the immediate deletion of all Pods on a Node:

When using the --force flag, Kubernetes does not wait until the Pods and their containers are terminated.
The applications may continue to run!
Under normal circumstances you should rely on Kubelet to gracefully tear down the Pods.

kubectl delete pods --all-namespaces --field-selector spec.nodeName=$node --force

Get remote access using SSH

Prerequisites

  • SSH key agent enabled in the cluster
  • SSH public key added to the cluster
  • SSH client is configured to use the SSH key
  • Either of:
    • Floating IP enabled for MachineDeployment
    • Other host available as SSH jump host
  • Port 22 accessible (default, see node networking)

Adding an SSH key after a Machine is provisioned is only possible if Kubelet on the Node is running and healthy.

  1. Get public IP of Node

    IP=$(kubectl get node $node -o jsonpath='{.status.addresses[?(@.type == "ExternalIP")].address}')
    echo $IP
  2. Establish SSH session

    ssh ubuntu@$IP
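
If the MachineDeployment has no floating IP, you can reach the Node through an SSH jump host instead. A minimal sketch, assuming $JUMP_HOST (an illustrative name) points to a host that can reach the Node network:

# Resolve the Node's internal IP and hop via the jump host (requires OpenSSH >= 7.3).
IP=$(kubectl get node $node -o jsonpath='{.status.addresses[?(@.type == "InternalIP")].address}')
ssh -J ubuntu@$JUMP_HOST ubuntu@$IP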

Remote access by escaping from a privileged Pod

Use at your own risk
The Pod has full access to the Node!
Make sure to verify the integrity of the tooling and container image that is used!

Prerequisites

  • Kubelet running and healthy
  • Cluster network functioning (for forwarding stdin & stdout)

By creating a Pod with a privileged container that shares the host's PID namespace, you can switch into the kernel namespaces of the init process.
To do this, you can use a tool like node-shell.
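
For illustration, a minimal sketch of what such a tool does under the hood. The Pod name and image are assumptions; verify the integrity of any image before using it:

# Run a privileged Pod on the target Node, sharing the host PID namespace,
# then enter the namespaces of the init process (PID 1).
# Assumes $node is set; debian ships nsenter via util-linux.
kubectl -n kube-system run node-shell --rm -it --image=debian \
  --overrides='{"spec":{"nodeName":"'$node'","hostPID":true,"containers":[{"name":"node-shell","image":"debian","stdin":true,"tty":true,"securityContext":{"privileged":true},"command":["nsenter","--target","1","--mount","--uts","--ipc","--net","--pid","--","/bin/bash"]}]}}'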

Inspect Kubelet logs

Prerequisites

  • Remote access to the Node (see above)

To tail the logs of Kubelet, run:

journalctl -exu kubelet -f
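
To narrow the output down to recent errors (a sketch; adjust the time window and pattern as needed):

# Show the last hour of Kubelet logs and filter for common error markers.
journalctl -u kubelet --since "-1h" --no-pager | grep -iE 'error|fail'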

Inspect Node bootstrapping with OpenStack

It's possible to get output from the initialization process of a Node.
These logs may contain valuable information on how far the initialization has progressed, or surface potential errors (e.g. DNS issues).

To show the logs of a Node using the OpenStack CLI:

openstack console log show $node
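
The full console log can be long; to show only the most recent lines (assuming a reasonably current OpenStack CLI):

# Limit the output to the last 100 lines of the console log.
openstack console log show --lines 100 $node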

Node Provisioning

The three most important phases during Node provisioning are:

  1. Server creation

    To verify this, check the Machine Events (see above) for a Created event, or list the servers at the cloud provider (see the sketch after this list).

  2. Node initialization

    Issues during this phase can be investigated by inspecting Kubelet logs or the OpenStack console output.

  3. Node daemon initialization

    By this point the Node has already been registered with the Kubernetes cluster.

    Check:

    1. Node conditions
    2. Pods running on Node
    3. Get their logs

      To get the logs of e.g. the Canal Pod running on the particular Node, run:

      kubectl -n kube-system logs -l k8s-app=canal --field-selector spec.nodeName="$node"
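
For phase 1 (referenced above), you can verify server creation directly at the cloud provider. A sketch, assuming the OpenStack CLI is configured for the project and the server name matches the Machine name:

# List servers whose name matches the Machine; the status column should
# progress from BUILD to ACTIVE.
openstack server list --name "$node"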

Node deletion

The following steps help to determine why a Node isn't being deleted:

  1. Check if the corresponding Machine has a deletion timestamp (DELETED column):

    kubectl -n kube-system get machine worker-77c8c559d6-rfd8r
    NAME                      AGE   DELETED   MACHINESET          ADDRESS       NODE                      PROVIDER    OS       VERSION                                                                             
    worker-77c8c559d6-rfd8r   8d    91s       worker-77c8c559d6   192.168.1.9   worker-77c8c559d6-rfd8r   openstack   ubuntu   1.30.1
  2. Check if the Node cannot be drained

    MetaKube will drain the Node, meaning it evicts the Pods running on the Node (with exceptions such as DaemonSet Pods).
    The eviction API attempts to delete Pods gracefully.
    A Pod's eviction may be blocked, e.g. if a matching PodDisruptionBudget doesn't allow any further disruptions (see the listing after this list).

    You can safely try draining the Node yourself:

    kubectl drain --ignore-daemonsets --delete-emptydir-data $node

    It may tell you that some Pods are not safe to evict and why.

    To get a list of Pods running on a Node, see above.

    Another reason a Node cannot be drained is that Pods never leave the Terminating state, for example because of an unresponsive Kubelet.

  3. Server can't be deleted

    If the Node is fully drained, but it still remains in the cluster, there may be issues with deleting the cloud server.
    In that case, check the Machine Events for errors.
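
To check whether a PodDisruptionBudget is blocking eviction (see step 2 above), list the budgets and their currently allowed disruptions:

# The ALLOWED DISRUPTIONS column shows how many Pods may still be evicted.
kubectl get pdb --all-namespaces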

Autoscaling

For issues related to autoscaling, see the autoscaling documentation.