Not all of a system's resources are available for Kubernetes Pods.
Kubernetes reserves a certain amount of CPU and memory for the system (kernel & daemons), for Kubelet & the container runtime, and as an eviction threshold (to be able to react to memory pressure).
The resource reservation for Kubelet & container runtime depends on the Machine's cloud flavor and the configured "profile".
MachineDeployments that don't set a profile explicitly use the default profile.
To address node stability issues, the default profile will change from `metakube-legacy` to `metakube-2025-01` on January 1st, 2025.
The new profile also leaves more CPU allocatable on most node sizes.
| Profile | CPU formula | Memory formula |
|---|---|---|
| `metakube-2025-01`, `metakube-latest` | 20m + MaxPods * 2m/Pod | 190MiB + MaxPods * 6.5MiB/Pod |
| `metakube-legacy` (deprecated) | 200m | 300MiB |
Additionally, MetaKube reserves `200m` CPU and `500MiB` of memory for the system regardless of flavor.
The eviction threshold is `100MiB`.
MetaKube scales the Pod limit with a node's available memory, which in turn determines the amount of reserved resources.
| Available memory | Example Flavors | Pod limit | Reserved CPU | Reserved Memory |
|---|---|---|---|---|
| 8GiB | `m2.small`, `l1c.medium`, `SCS-2V-8-50n` | 50 | 120m + 200m | 515MiB + 500MiB + 100MiB |
| 16GiB | `m2.medium`, `l1r.small`, `m2c.large`, `SCS-4V-16-*` | 70 | 160m + 200m | 645MiB + 500MiB + 100MiB |
| 32GiB | `m2.large`, `l1r.medium`, `m2c.xlarge`, `SCS-8V-32-*` | 90 | 200m + 200m | 775MiB + 500MiB + 100MiB |
| > 32GiB | `l1r.large` | 110 | 240m + 200m | 905MiB + 500MiB + 100MiB |
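For example, a node with 16GiB of memory gets a Pod limit of 70. Applying the `metakube-2025-01` formula yields `20m + 70 * 2m = 160m` of reserved CPU and `190MiB + 70 * 6.5MiB = 645MiB` of reserved memory for Kubelet & the container runtime. Adding the flavor-independent system reservation (`200m` CPU, `500MiB` memory) and the `100MiB` eviction threshold gives the totals listed in the table above.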
The formulas used for the `metakube-latest` profile are derived from experiments.
They are intended to strike a balance between stability and practicality.
The experiments were conducted with the following assumptions:
- **Container to Pod ratio of 1.25**: This accounts for typical use of sidecar containers. At higher ratios, the container runtime may require additional memory.
- **Full nodes**: The reserved resources should be sufficient when packing the node with as many Pods as the limit allows.
- **Minimal Pod churn**: Kubelet and the container runtime are particularly busy (CPU-wise) during Pod create/delete events. Since they are "idle" most of the time and may also use spare CPU time, we decided not to reserve more CPU than necessary. Frequent Pod churn may stretch the time Pods take to become running or to be fully deleted.
These assumptions hold true for most use cases because usually not all of these thresholds are crossed at once.
If your use case differs drastically from these assumptions, you may need to adjust the reserved resources to ensure your nodes are stable.
You may achieve more economical node utilization by setting higher or lower Pod limits.
A higher Pod limit means you can pack more Pods onto a single node.
A lower Pod limit means fewer resources are reserved, leaving more allocatable for Pods.
To change the Pod limit, set the following Machine annotation in your MachineDeployment:

```yaml
kind: MachineDeployment
spec:
  template:
    metadata:
      annotations:
        kubelet-config.machines.metakube.syseleven.de/MaxPods: "30"
```
You should only change the profile to opt in/out when migrating from legacy profiles.
To configure a different profile, set the following Machine annotation in your MachineDeployment:

```yaml
kind: MachineDeployment
spec:
  template:
    metadata:
      annotations:
        kubelet-config.machines.metakube.syseleven.de/KubeReservedProfile: "metakube-latest"
```
Do not change these settings unless absolutely necessary!
Reserving too few resources may lead to node instability.
To change the reservation for individual resources, set the following Machine annotation in your MachineDeployment:

```yaml
kind: MachineDeployment
spec:
  template:
    metadata:
      annotations:
        kubelet-config.machines.metakube.syseleven.de/KubeReserved: "cpu=500m,memory=1Gi"
```
System resources, such as CPU and memory, are distributed by means of a Linux Kernel feature called "cgroups".
A Pod requests resources through its `spec.containers[*].resources` fields.
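For illustration, a minimal Pod manifest that sets these fields could look like this (the name and image are placeholders, not MetaKube defaults):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo        # placeholder name
spec:
  containers:
    - name: app
      image: nginx:1.27      # placeholder image
      resources:
        requests:            # considered during scheduling
          cpu: 250m
          memory: 256Mi
        limits:              # enforced on the node via cgroups
          cpu: 500m
          memory: 512Mi
```

Because requests are set but lower than the limits, this Pod gets the Burstable Quality of Service class, which determines its place in the slice hierarchy described below.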
MetaKube nodes use systemd as the cgroup manager. Cgroups are organized in a hierarchy through "slices".
Depending on a Pod's determined Quality of Service, its cgroup is placed either directly under `kubepods.slice` or one level lower under `kubepods-burstable.slice` or `kubepods-besteffort.slice`:
```
$ systemctl status
...
├─init.scope
├─user.slice
│ └─user-<uid>.slice
├─system.slice
│ ├─containerd.service
│ ├─kubelet.service
│ └─<other system services>.service
└─kubepods.slice
  ├─kubepods-pod<id>.slice
  │ ├─cri-containerd-<container id>.scope
  │ │ └─<pid> <command>
  │ └─cri-containerd-<container id>.scope
  │   └─<pid> /pause
  ├─kubepods-burstable.slice
  │ └─kubepods-burstable-pod<pod id>.slice
  └─kubepods-besteffort.slice
    └─kubepods-besteffort-pod<pod id>.slice
```
CPU time allocation is implemented by the Linux Kernel's Completely Fair Scheduler (CFS).
A container's `resources.requests.cpu` is regarded during scheduling to ensure there are enough CPUs available on the node.
It also corresponds to the container's cgroup's `cpu.weight` value.
The weight determines the place of a process's threads in the CFS's weighted queue and thus how often they are scheduled to run on a CPU core.
By dividing up the available resources proportionately, Kubelet ensures that a container process will get at least the requested CPU time (from `resources.requests.cpu`).
Any spare CPU time (either not requested or yielded by other processes) is free for waiting processes to use.
It is again distributed proportionately based on the processes' cgroup CPU weights, up to their `cpu.max` value (from `resources.limits.cpu`).
Kubernetes doesn't prevent you from over-committing a node (limits > available resources).
Containers with low CPU requests may be throttled and may fail to answer requests in time, including liveness or readiness probes.
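As a rough sketch of how the CPU fields map to cgroup v2 settings (the exact values depend on the cgroup driver and Kubernetes version, so treat the numbers below as approximations):

```yaml
resources:
  requests:
    cpu: 250m   # -> cpu.weight of roughly 10: the container's proportional share of CPU time
  limits:
    cpu: 500m   # -> cpu.max of roughly "50000 100000": at most 50ms of CPU time per 100ms period
```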
A container's `resources.requests.memory` is only regarded during scheduling to ensure the node has enough space available.
A container's `resources.limits.memory` corresponds to the cgroup's `memory.max` value.
Once that limit is reached, the container's process will be "OOM" killed.
Kubernetes doesn't prevent you from over-committing a node (limits > available resources).
If a higher-level slice's memory, or the entire system's memory, is exhausted, the kernel will choose and kill a process in that slice, or anywhere on the system, respectively.
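As a sketch under the same caveat as above, the memory limit translates directly into the cgroup's `memory.max` value in bytes, while the request only informs scheduling:

```yaml
resources:
  requests:
    memory: 256Mi   # only considered during scheduling; no cgroup limit is derived from it
  limits:
    memory: 512Mi   # -> memory.max of 536870912 bytes; exceeding it gets the container OOM-killed
```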