Resources

Cgroups

System resources, such as CPU and memory, are distributed among processes by means of a Linux kernel feature called "cgroups".

A Pod requests resources through its spec.containers[*].resources fields.
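
For example, a minimal Pod setting both requests and limits could look like this (a sketch; the Pod name, image, and values are just placeholders):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 250m       # considered during scheduling, mapped to cpu.weight
        memory: 64Mi    # considered during scheduling only
      limits:
        cpu: 500m       # enforced via cpu.max
        memory: 128Mi   # enforced via memory.max
EOF

Because its requests are lower than its limits, such a Pod is assigned the Burstable Quality of Service class described below.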

MetaKube nodes use systemd as the cgroup manager. Cgroups are organized into a hierarchy through "slices".

Depending on the Pod's assigned Quality of Service (QoS) class, its cgroup is placed directly under kubepods.slice (Guaranteed) or one level lower under kubepods-burstable.slice (Burstable) or kubepods-besteffort.slice (BestEffort):

$ systemctl status
...
├─init.scope 
├─user.slice 
│ └─user-<uid>.slice
├─system.slice 
│ ├─containerd.service
│ ├─kubelet.service 
│ └─<other system services>.service
└─kubepods.slice 
  ├─kubepods-pod<pod id>.slice
  │ ├─cri-containerd-<container id>.scope
  │ │ └─<pid> <command>
  │ └─cri-containerd-<container id>.scope
  │   └─<pid> /pause
  ├─kubepods-burstable.slice 
  │ └─kubepods-burstable-pod<pod id>.slice 
  └─kubepods-besteffort.slice 
    └─kubepods-besteffort-pod<pod id>.slice 
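
On a node, this subtree can be inspected directly with systemd's tools, for example (a sketch; requires shell access to the node):

$ systemd-cgls --no-pager /kubepods.slice

This prints the per-Pod slices and container scopes shown above.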

Resources available for Kubernetes Pods

Not all of a node's resources are available to Kubernetes Pods.
MetaKube reserves a certain amount of CPU and memory for the operating system and the Kubelet.

We're aware of potential stability problems on bigger flavors caused by not reserving enough resources for the system.
We will address this soon.

Currently, MetaKube reserves 200m CPU and 500Mi memory for the system, including 200m CPU and 300Mi memory for the Kubelet.
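
The effect of these reservations shows up as the difference between a node's capacity and its allocatable resources (a sketch; replace <node name> with one of your nodes):

$ kubectl describe node <node name>
...
Capacity:
  cpu:     <total CPU>
  memory:  <total memory>
Allocatable:
  cpu:     <total CPU minus reservations>
  memory:  <total memory minus reservations>
...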

Container resources

CPU

CPU time allocation is implemented by the Linux kernel's Completely Fair Scheduler (CFS).

A container's resources.requests.cpu is taken into account during scheduling to ensure there is enough CPU available on the node.
It also determines the container's cgroup's cpu.weight value.

The weight determines the place of a process's threads in the CFS's weighted queue and thus how often they are scheduled to run on a CPU core.

By dividing the available CPU time up proportionally, the Kubelet ensures that a container's processes get at least the requested CPU time (from resources.requests.cpu).
Any spare CPU time (either not requested, or freed up when processes yield) is available to waiting processes.
It is again distributed proportionally, based on the cgroups' CPU weights and up to each container's cpu.max value (from resources.limits.cpu).
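
On a node using cgroup v2 (mounted at /sys/fs/cgroup), both values can be read from the Pod's slice; the path below is a sketch for a Burstable Pod and has to be adjusted to the actual QoS class and Pod id:

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/cpu.weight
<weight derived from resources.requests.cpu>
$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/cpu.max
<quota> <period>

cpu.max reads "max <period>" when no CPU limit is set.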

Kubernetes doesn't prevent you from over-committing a node (limits > available resources).
On an over-committed node, containers with low CPU requests may be throttled and unable to answer requests, including liveness or readiness probes.
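
Whether a container is actually being throttled against its limit can be checked in the same cgroup's cpu.stat file (again assuming cgroup v2 and shell access to the node):

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/cpu.stat
usage_usec <total CPU time consumed>
...
nr_throttled <number of periods in which the limit was hit>
throttled_usec <total time spent throttled>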

Memory

A container's resources.requests.memory is only taken into account during scheduling, to ensure the node has enough memory available.

A container's resources.limits.memory corresponds to the cgroup's memory.max value.
Once that limit is reached, a process in the container is killed by the kernel's out-of-memory (OOM) killer.
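
Both the limit and the current consumption can be inspected in the Pod's cgroup (same assumptions as in the CPU section above):

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/memory.max
<bytes derived from resources.limits.memory>
$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<pod id>.slice/memory.current
<bytes currently in use>

The memory.events file in the same directory contains an oom_kill counter that increments whenever a process in the cgroup was OOM-killed.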

Kubernetes doesn't prevent you from over-committing a node (limits > available resources).
If the memory of a higher-level slice, or of the entire system, is exhausted, the kernel chooses and kills a process within that slice, or anywhere on the system, respectively.
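
From the Kubernetes side, a container that was OOM-killed reports the reason OOMKilled in its last state (replace <pod name> accordingly):

$ kubectl get pod <pod name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
OOMKilled

System-wide OOM kills also show up in the node's kernel log, e.g. via journalctl -k.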