System resources such as CPU and memory are distributed by means of a Linux kernel feature called "cgroups" (control groups).
A Pod requests resources through its spec.containers[*].resources fields.
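For example, a minimal Pod manifest that requests half a CPU core and 256Mi of memory, with limits of one core and 512Mi, could look like this (a sketch; the Pod name, container name, and image are placeholders):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
EOF

Since its requests and limits differ, this Pod gets the Burstable Quality of Service class.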
MetaKube nodes use systemd as the cgroup manager, which organizes cgroups in a hierarchy of "slices". Depending on a Pod's Quality of Service class, its cgroup is placed either directly under kubepods.slice (for Guaranteed Pods) or one level lower under kubepods-burstable.slice or kubepods-besteffort.slice:
$ systemctl status
...
├─init.scope
├─user.slice
│ └─user-<uid>.slice
├─system.slice
│ ├─containerd.service
│ ├─kubelet.service
│ └─<other system services>.service
└─kubepods.slice
  ├─kubepods-pod<id>.slice
  │ ├─cri-containerd-<container id>.scope
  │ │ └─<pid> <command>
  │ └─cri-containerd-<container id>.scope
  │   └─<pid> /pause
  ├─kubepods-burstable.slice
  │ └─kubepods-burstable-pod<pod id>.slice
  └─kubepods-besteffort.slice
    └─kubepods-besteffort-pod<pod id>.slice
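If you have shell access to a node, you can inspect just this subtree with systemd-cgls (a sketch):

$ systemd-cgls /kubepods.slice

The Pod slices contain the Pod's UID (with dashes replaced by underscores), which you can correlate with kubectl get pod <pod name> -o jsonpath='{.metadata.uid}'.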
Not all of a system's resources are available for Kubernetes Pods.
MetaKube reserves a certain amount of CPU and memory for the system and Kubelet.
We're aware of potential stability problems caused by not reserving enough resources for the system on bigger flavors. We will address this very soon.
Currently, MetaKube reserves 200m CPU and 500Mi memory for the system, including 200m CPU and 300Mi memory for the Kubelet.
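The effect of these reservations shows up as the difference between a node's Capacity and its Allocatable resources, which is what the scheduler works with:

$ kubectl describe node <node name>
...
Capacity:
  cpu:     ...
  memory:  ...
Allocatable:
  cpu:     ...
  memory:  ...

Allocatable is roughly Capacity minus the reservations above and any eviction thresholds.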
CPU time allocation is implemented by the Linux kernel's Completely Fair Scheduler (CFS).
A container's resources.requests.cpu is taken into account during scheduling to ensure there are enough CPUs available on the node. It also corresponds to the container's cgroup's cpu.weight value.
The weight determines the position of a process's threads in the CFS's weighted queue and thus how often they are scheduled to run on a CPU core.
By dividing up the available CPU time proportionally, the Kubelet ensures that a container process gets at least the requested CPU time (from resources.requests.cpu).
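Assuming cgroup v2 (which uses cpu.weight instead of the older cpu.shares), the resulting weight can be read directly on the node. For the example Pod above with its 500m request, the Kubelet's conversion typically yields a weight of about 20:

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/cpu.weight
20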
Any spare CPU time (either not requested, or given up when processes yield) is free for waiting processes to use. It is again distributed proportionally, based on the processes' cgroups' CPU weights, up to each cgroup's cpu.max value (from resources.limits.cpu).
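The limit is visible in the same cgroup's cpu.max file as a quota and period in microseconds. For the example Pod's limit of one core (limits.cpu: "1"), the quota equals the full period; an unlimited container would show max instead:

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/cpu.max
100000 100000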
Kubernetes doesn't prevent you from over-committing a node (limits > available resources).
Containers with low CPU requests may be throttled and may fail to answer requests, including liveness or readiness probes.
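Throttling can be observed in the cgroup's cpu.stat file, where nr_throttled counts CFS periods in which the cgroup ran out of quota (the numbers below are illustrative):

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/cpu.stat
usage_usec 8123456
user_usec 6123456
system_usec 2000000
nr_periods 4200
nr_throttled 313
throttled_usec 9876543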
A container's resources.requests.memory is only taken into account during scheduling to ensure the node has enough memory available. A container's resources.limits.memory corresponds to the cgroup's memory.max value.
Once that limit is reached, the container process is OOM-killed (out of memory).
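One way to confirm that a container was OOM killed is to check its last termination state (the Pod name is a placeholder):

$ kubectl get pod <pod name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
OOMKilled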
Kubernetes doesn't prevent you from over-committing a node (limits > available resources).
If a higher-level slice's memory or the entire system's memory is exhausted, the kernel's OOM killer will choose and kill a process within that slice, or from anywhere on the system, respectively.
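Such kernel-level OOM kills don't always surface in a Pod's status, but they appear in the node's kernel log (a sketch, assuming journald is available on the node):

$ journalctl -k | grep -i 'out of memory'
... Out of memory: Killed process <pid> (<command>) ...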