Resources

Allocatable resources

Not all of a system's resources are available for Kubernetes Pods.
Kubernetes reserves a certain amount of CPU and memory for the system (kernel & daemons), for the Kubelet & container runtime, and as an eviction threshold (so it can react to memory pressure).

Profiles

The resource reservation for Kubelet & container runtime depends on the Machine's cloud flavor and the configured "profile".
MachineDeployments that don't set a profile explicitly use the default profile.

To address node stability issues, the default profile will change from metakube-legacy to metakube-2025-01 on 1 January 2025.
The new profile also leaves more CPU allocatable on most node sizes.

| Profile                            | CPU formula            | Memory formula                |
|------------------------------------|------------------------|-------------------------------|
| metakube-2025-01, metakube-latest  | 20m + MaxPods * 2m/Pod | 190MiB + MaxPods * 6.5MiB/Pod |
| metakube-legacy (deprecated)       | 200m                   | 300MiB                        |

Additionally, MetaKube reserves 200m CPU and 500MiB of memory for the system regardless of flavor.

The eviction threshold is 100MiB.
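
For example, a node with a Pod limit of 70 (see the Flavors table below) using the metakube-2025-01 profile ends up with the following reservations:

  CPU:    20m + 70 * 2m/Pod = 160m kube-reserved + 200m system-reserved = 360m
  Memory: 190MiB + 70 * 6.5MiB/Pod = 645MiB kube-reserved + 500MiB system-reserved + 100MiB eviction threshold = 1245MiB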

Flavors

MetaKube scales the Pod limit by the available memory.
That in turn determines the amount of reserved resources.

| Available memory | Example flavors                               | Pod limit | Reserved CPU | Reserved memory          |
|------------------|-----------------------------------------------|-----------|--------------|--------------------------|
| 8GiB             | m2.small, l1c.medium, SCS-2V-8-50n            | 50        | 120m + 200m  | 515MiB + 500MiB + 100MiB |
| 16GiB            | m2.medium, l1r.small, m2c.large, SCS-4V-16-*  | 70        | 160m + 200m  | 645MiB + 500MiB + 100MiB |
| 32GiB            | m2.large, l1r.medium, m2c.xlarge, SCS-8V-32-* | 90        | 200m + 200m  | 775MiB + 500MiB + 100MiB |
| > 32GiB          | l1r.large                                     | 110       | 240m + 200m  | 905MiB + 500MiB + 100MiB |
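
To see the result on a running node, you can compare the node's capacity with what remains allocatable for Pods; the difference is the sum of the reservations and the eviction threshold. A quick check using kubectl (the node name is a placeholder):

$ kubectl get node <node name> -o jsonpath='{.status.capacity}{"\n"}'
$ kubectl get node <node name> -o jsonpath='{.status.allocatable}{"\n"}'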

Tuning

The formulas used for the metakube-latest profile are derived from experiments.
They are intended to strike a balance between stability and practicality.

The experiments were conducted with the following assumptions:

  • Container to Pod ratio of 1.25

    This accounts for typical use of sidecar containers.
    At higher ratios, the container runtime may require additional memory.

  • Full nodes

    The reserved resources should be sufficient when packing the node with as many Pods as the limit allows.

  • Minimal Pod churn

    Kubelet and the container runtime are particularly busy (CPU-wise) during Pod create/delete events.
    Since they are "idle" most of the time and may also use spare CPU time, we decided not to reserve more CPU than necessary.
    Frequent Pod churn may stretch the time Pods take to become running or to be fully deleted.

These assumptions hold true for most use cases because usually not all of these thresholds are crossed at once.

If your use case differs drastically from these assumptions, you may need to adjust the reserved resources to keep your nodes stable.

Change MaxPods

You may achieve more economical node utilization by setting higher or lower Pod limits.

A higher Pod limit means you can pack more Pods onto a single node.

A lower Pod limit means fewer resources are reserved, leaving more allocatable for Pods.

To change the Pod limit set the following Machine annotation in your MachineDeployment:

kind: MachineDeployment
spec:
  template:
    metadata:
      annotations:
        kubelet-config.machines.metakube.syseleven.de/MaxPods: "30"
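
After the new Pod limit has been applied to the Machines, it is reflected in each node's Pod capacity, which you can verify for example with (the node name is a placeholder):

$ kubectl get node <node name> -o jsonpath='{.status.capacity.pods}{"\n"}'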

Change profiles

You should only change the profile to opt in or out of the new default while migrating from the legacy profile.

To configure a different profile set the following Machine annotation in your MachineDeployment:

kind: MachineDeployment
spec:
  template:
    metadata:
      annotations:
        kubelet-config.machines.metakube.syseleven.de/KubeReservedProfile: "metakube-latest"

Change reserved resources directly

Do not change these settings unless absolutely necessary!
Reserving too few resources may lead to node instability.

To change the reservation for individual resources set the following Machine annotation in your MachineDeployment:

kind: MachineDeployment
spec:
  template:
    metadata:
      annotations:
        kubelet-config.machines.metakube.syseleven.de/KubeReserved: "cpu=500m,memory=1Gi"
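
If you want to double-check the values the Kubelet actually runs with, one option (assuming your user may access the nodes/proxy subresource and jq is installed) is to query the Kubelet's configz endpoint through the API server:

$ kubectl get --raw "/api/v1/nodes/<node name>/proxy/configz" | jq '.kubeletconfig.kubeReserved'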

Cgroups

System resources, such as CPU and memory, are distributed by means of a Linux Kernel feature called "cgroups".

A Pod requests resources through its spec.containers[*].resources fields.

MetaKube nodes use systemd as the cgroup manager. Cgroups are organized in a hierarchy through "slices".

Depending on the Pod's Quality of Service (QoS) class, its cgroup is placed either directly under the kubepods.slice (for Guaranteed Pods) or one level lower under the kubepods-burstable.slice or kubepods-besteffort.slice:

$ systemctl status
...
├─init.scope 
├─user.slice 
│ └─user-<uid>.slice
├─system.slice 
│ ├─containerd.service
│ ├─kubelet.service 
│ └─<other system services>.service
└─kubepods.slice 
  ├─kubepods-pod<id>.slice 
  │ ├─cri-containerd-<container id>.scope
  │ │ └─<pid> <command>
  │ └─cri-containerd-<container id>.scope
  │   └─<pid> /pause
  ├─kubepods-burstable.slice 
  │ └─kubepods-burstable-pod<pod id>.slice 
  └─kubepods-besteffort.slice 
    └─kubepods-besteffort-pod<pod id>.slice 
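
Which of these slices a Pod lands in follows from its QoS class, which you can read from the Pod's status, for example:

$ kubectl get pod <pod name> -o jsonpath='{.status.qosClass}{"\n"}'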

Container resources

CPU

CPU time allocation is implemented by the Linux Kernel's Completely Fair Scheduler (CFS).

A container's resources.requests.cpu is taken into account during scheduling to ensure there are enough CPUs available on the node.
It also corresponds to the container's cgroup's cpu.weight value.

The weight determines the place of a process's threads in the CFS's weighted queue and thus how often they are scheduled to run on a CPU core.

By dividing up the available resources proportionately, Kubelet ensures that a container process will get at least the requested CPU time (from resources.requests.cpu).
Any spare CPU (either not requested or freed when processes yield) is available to waiting processes.
It is again distributed proportionately based on the processes' cgroup CPU weights, up to each cgroup's cpu.max value (from resources.limits.cpu).
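
On a node you can inspect these values directly in the cgroup filesystem. A sketch for a Burstable Pod, assuming cgroup v2 is mounted at /sys/fs/cgroup and following the slice layout shown above (the pod id is a placeholder):

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/cpu.weight
$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/cpu.max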

Kubernetes doesn't prevent you from over-committing a node (limits > available resources).
Containers with low CPU requests may be throttled and may not be able to answer requests, including liveness or readiness probes.

Memory

A container's resources.requests.memory is only considered during scheduling to ensure the node has enough space available.

A container's resources.limits.memory corresponds to the cgroup's memory.max value.
Once the limit is reached, the process is OOM-killed.
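
To see this in practice, you can read the limit from the Pod's cgroup on the node and check whether a container was restarted after hitting it (the names and ids are placeholders; the cgroup path again assumes cgroup v2 and a Burstable Pod):

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/memory.max
$ kubectl get pod <pod name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

The second command prints OOMKilled if the first container was restarted after exceeding its memory limit.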

Kubernetes doesn't prevent you from over-committing a node (limits > available resources).
If the memory of a higher-level slice or of the entire system is exhausted, the kernel chooses and kills a process within that slice or anywhere on the system, respectively.

References