This document describes cgroups management on Kubernetes nodes. It is an overview that describes the approach; it does not go into the details of all possible customizations.
The document concentrates on cgroupv2 and the systemd cgroup driver.
The root directory for all cgroups in a cgroupv2 system is /sys/fs/cgroup/. When using the systemd cgroup driver, Kubernetes organizes pods into a hierarchy of systemd slices. The root for all Kubernetes-managed containers is kubepods.slice.
The top-level hierarchy looks like the following:
/sys/fs/cgroup/:
- kubepods.slice: Contains all Kubernetes pods.
- system.slice: Contains system services (e.g., sshd, containerd, kubelet).
- user.slice: Contains user sessions.

Kubelet manages the cgroup settings of kubepods.slice to define its importance relative to the system workloads. These settings are designed to ensure node stability.
The following settings were observed on a standard GKE node (e2-medium) within /sys/fs/cgroup/kubepods.slice/ and are given as an example.
Calculated from Node Allocatable and Reserved Resources:
- cpu.weight: 37. Relative CPU weight for all pods.
- memory.max: 3040550912. Hard memory limit for the entire kubepods.slice.
- pids.max: 4194304. Max number of processes allowed across all pods.

The cpu.weight defines the amount of CPU shares Kubernetes-managed pods receive compared with the system processes. The default cpu.weight for system.slice is 100. See the January 2026 Kubernetes Blog to understand why the shares of system.slice are so much higher than those of kubepods.slice in this example.
The memory.max and pids.max settings ensure that resources reserved for system daemons will not be consumed by Kubernetes workloads.
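The relationship between node capacity, reservations, and the kubepods.slice memory limit can be sketched as below. This is a minimal sketch: the reservation values are illustrative assumptions, not real GKE defaults; in practice they come from the kubelet's --kube-reserved and --system-reserved settings.

```python
# Sketch: how memory.max of kubepods.slice relates to node capacity.
# The reservation values below are assumptions for illustration only;
# real values come from --kube-reserved and --system-reserved.

def kubepods_memory_max(capacity: int, kube_reserved: int, system_reserved: int) -> int:
    """memory.max for kubepods.slice = node capacity minus reservations."""
    return capacity - kube_reserved - system_reserved

GIB = 1024 ** 3
MIB = 1024 ** 2

capacity = 4 * GIB          # e2-medium has 4 GiB of RAM
kube_reserved = 1 * GIB     # assumed value for illustration
system_reserved = 100 * MIB # assumed value for illustration

print(kubepods_memory_max(capacity, kube_reserved, system_reserved))
```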
CPU reservation for system daemons:
- cpuset.cpus: 0-1. Available CPU cores for pods (calculated based on the ReservedCPUs or --reserved-cpus Kubelet setting).

Default value reinforcement:
- cpu.max: max 100000. No hard limit (allows using all node CPU).

See "Why not set cpu.max on root cgroup?" to learn why it is set to max.
Kubelet can also set other fields, for example hugepages limits.
kubepods.slice

Under kubepods.slice, Kubernetes creates sub-slices based on the Quality of Service (QoS) class of the pods:
- Guaranteed pods are placed directly under kubepods.slice:
  /sys/fs/cgroup/kubepods.slice/kubepods-pod<UID>.slice/
- Burstable pods are placed under kubepods-burstable.slice:
  /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/
- BestEffort pods are placed under kubepods-besteffort.slice:
  /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<UID>.slice/

Within each pod slice, there are further sub-groups for each container in the pod.
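The slice-naming pattern above can be sketched as a small helper. The dash-to-underscore escaping of the pod UID is an assumption based on systemd unit-name conventions; verify against your node before relying on it.

```python
# Sketch: build the cgroup path for a pod from its QoS class and UID,
# following the naming pattern shown above. The dash-to-underscore
# escaping of the UID is an assumption based on systemd unit-name rules.

ROOT = "/sys/fs/cgroup/kubepods.slice"

def pod_cgroup_path(qos: str, uid: str) -> str:
    escaped = uid.replace("-", "_")
    if qos == "guaranteed":
        return f"{ROOT}/kubepods-pod{escaped}.slice"
    if qos == "burstable":
        return f"{ROOT}/kubepods-burstable.slice/kubepods-burstable-pod{escaped}.slice"
    if qos == "besteffort":
        return f"{ROOT}/kubepods-besteffort.slice/kubepods-besteffort-pod{escaped}.slice"
    raise ValueError(f"unknown QoS class: {qos}")

print(pod_cgroup_path("burstable", "123e4567-e89b-42d3-a456-556642440000"))
```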
The main goal of this hierarchy is to ensure that pods get the resources they requested while still allowing overcommit of resources.
Kubernetes uses two primary mechanisms in cgroupv2 to manage CPU resources:
- cpu.weight: Controls the proportional share of CPU time. This replaces cpu.shares from cgroupv1. It ensures that groups receive their requested share of CPU when the node is under load.
- cpu.max: Sets a hard limit on CPU usage. This consolidates the legacy cpu.cfs_quota_us and cpu.cfs_period_us from cgroupv1 into a single file. It prevents a group from using more than a specific amount of CPU time, even if the node is idle.

These mechanisms work together to provide both resource guarantees and workload isolation.
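Proportional sharing under contention is simple arithmetic: each sibling gets its weight divided by the sum of all sibling weights. A minimal sketch, using the weights observed earlier in this document:

```python
# Sketch: how cpu.weight divides CPU among sibling cgroups under contention.
# Each sibling receives weight / sum(sibling weights) of the contended CPU.

def cpu_share(weights):
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Example: kubepods.slice (weight 37, as observed above) competing with
# system.slice and user.slice (systemd default weight 100 each).
shares = cpu_share({"kubepods.slice": 37, "system.slice": 100, "user.slice": 100})
for name, frac in shares.items():
    print(f"{name}: {frac:.1%}")
```

Note that cpu.weight only matters while the CPU is contended; an idle node lets any cgroup run freely.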
In cgroupv2, resource control is handled through different files than in cgroupv1. Originally, cpu.shares was mapped to cpu.weight using a linear formula.
This formula has several issues. For example, under it the standard 1024 shares (1 CPU) map to approximately 40 weight. However, systemd's default weight for system.slice is 100. This means that, by default, system services are prioritized ~2.5x higher than a pod requesting 1 CPU.
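The linear mapping in question can be sketched as follows. The exact constants are an assumption based on the kubelet's historical conversion of the cgroupv1 shares range [2, 262144] onto the cgroupv2 weight range [1, 10000]; verify against the kubelet source before relying on them.

```python
# Sketch of the linear cpu.shares -> cpu.weight mapping. The constants are
# an assumption: cgroupv1 shares [2, 262144] mapped linearly onto the
# cgroupv2 weight range [1, 10000].

def shares_to_weight(shares: int) -> int:
    return (((shares - 2) * 9999) // 262142) + 1

print(shares_to_weight(2))       # minimum shares -> weight 1
print(shares_to_weight(1024))    # 1 CPU of request -> weight ~39
print(shares_to_weight(262144))  # maximum shares -> weight 10000
```

With this mapping, a 1-CPU pod lands near weight 39 while system.slice defaults to 100, which is the ~2.5x gap described above.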
A new non-linear formula is therefore being rolled out, as discussed in the January 2026 Kubernetes Blog.
Why not set cpu.max on root cgroup?

We do not set cpu.max on the root kubepods.slice or the intermediate QoS slices (kubepods-burstable.slice, etc.) because doing so would introduce collective throttling that can negatively impact high-priority pods.
Consider a scenario with two pods: a high-priority Guaranteed pod and a low-priority BestEffort pod running on the same node.
If a hard limit (cpu.max) were set on the root kubepods.slice, the low-priority pod could exhaust the shared quota, causing the kernel to throttle the entire slice, including the high-priority pod, even while the node still has idle CPU.
By using only proportional weights (cpu.weight) at the root and QoS levels, Kubernetes ensures that pods can freely consume idle CPU cycles, and that under contention CPU time is divided proportionally to weights rather than cut off by collective throttling.
While Kubernetes allows omitting CPU limits (which sets cpu.max to max), the decision to set them should depend on the desired QoS behavior and workload isolation requirements.
Summary of Advice:
For a deeper dive into how these resources are managed under the hood, see Kubernetes Resources Under the Hood - Part 3.
It is important to note that cpu.weight is considered only among siblings at the same level of the hierarchy. The hierarchy is not "flattened" when calculating priority.
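Because weights compete only among siblings, the effective CPU fraction of a leaf cgroup is the product of its share at each level of the hierarchy. A minimal sketch, with all numbers chosen purely for illustration:

```python
# Sketch: effective CPU fraction under full contention is the product of
# weight / (sum of sibling weights) at every level of the hierarchy.
# Weights are NOT compared across different levels ("not flattened").

def effective_fraction(path):
    """path: list of (own_weight, total_sibling_weight_including_self) per level."""
    frac = 1.0
    for own, total in path:
        frac *= own / total
    return frac

# Example (illustrative numbers): a Burstable pod with weight 50 inside
# kubepods-burstable.slice (weight 200), which itself competes inside
# kubepods.slice (weight 37) against system.slice and user.slice.
pod = effective_fraction([(37, 237),    # kubepods.slice among top-level slices
                          (200, 300),   # burstable slice among kubepods children
                          (50, 150)])   # this pod among burstable pods
print(f"{pod:.3f}")
```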
By placing Guaranteed pods as direct children of kubepods.slice (alongside the Burstable and BestEffort slices), Kubernetes ensures:

- A Guaranteed pod's cpu.weight competes at the same level as the aggregate weight of the entire kubepods-burstable.slice. This prevents a large number of low-priority pods from "washing out" the priority of a high-priority Guaranteed pod, which would happen if they all lived at the same level.

The cpu.max setting enforces the hard CPU limit (cpu.limits) in cgroupv2. It uses the format "quota period", where quota is the allowed CPU time within a given period (defaulting to 100,000 microseconds).
For pods with CPU limits defined:
- cpu.max is set at the pod's cgroup slice. It is the sum of the CPU limits of all containers in the pod.
- Each container's cgroup has cpu.max set to its specific limit.
- For example, a 200m limit results in 20000 100000 (20ms per 100ms period).

BestEffort pods have no CPU limits:

- cpu.max is set to max, meaning no hard limit is enforced.

Kubelet does not typically enforce a hard CPU limit at the QoS tier level:
- kubepods-burstable.slice: cpu.max is set to max.
- kubepods-besteffort.slice: cpu.max is set to max.

This design allows pods within these slices to burst into any idle CPU cycles on the node, provided they haven't hit their own pod-specific or container-specific limits.
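The quota arithmetic behind the cpu.max file can be sketched as:

```python
# Sketch: derive the cpu.max file content from a CPU limit in millicores.
# With the default 100000us period, quota = millicores * period / 1000.

from typing import Optional

PERIOD_US = 100_000  # default CFS period in microseconds

def cpu_max(limit_millicores: Optional[int], period: int = PERIOD_US) -> str:
    if limit_millicores is None:  # no limit (e.g. a BestEffort pod)
        return f"max {period}"
    quota = limit_millicores * period // 1000
    return f"{quota} {period}"

print(cpu_max(200))   # 200m limit -> "20000 100000" (20ms per 100ms)
print(cpu_max(None))  # no limit   -> "max 100000"
```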
Root cgroup (kubepods.slice)

As noted in the responsibilities section, the root kubepods.slice also has cpu.max set to max. This ensures the collective set of all pods can utilize the entire node's CPU capacity if it is not being used by the system.
While the root's CPU weight is fixed and based on Node Allocatable, the weights of its children are recalculated dynamically to slice the available "pie" internally.
- Guaranteed pods: Placed directly under kubepods.slice. Their cpu.weight is set individually based on their specific CPU requests.
- Burstable tier (kubepods-burstable.slice): Its cpu.weight is the sum of all CPU requests of all active Burstable pods on the node. This ensures the Burstable tier as a whole has enough priority to satisfy its pods' requests when competing with other tiers.
- BestEffort tier (kubepods-besteffort.slice): Its cpu.weight is fixed at the minimum value (1 weight / 2 shares). This ensures that BestEffort pods only receive "slack" CPU cycles that are not being used by the Guaranteed or Burstable tiers.

Kubernetes manages memory resources in cgroupv2 primarily through the memory.max setting, which enforces hard limits. If a cgroup exceeds this limit, the kernel will attempt to reclaim memory; if it fails, the processes within that cgroup are subject to being killed by the OOM (Out of Memory) killer.
This may seem counter-intuitive, as it allows Burstable pods collectively to consume all memory available to pods. However, Guaranteed pods are protected through a multi-layered defense strategy:
- Hard limits: memory.max enforced at the pod cgroup level.
- OOM prioritization: Kubelet sets oom_score_adj for processes so that the kernel targets lower-priority pods first (BestEffort/Burstable) and protects Guaranteed pods.

memory.max calculation

For pods with memory limits defined:
- memory.max is set at the pod's cgroup slice. It is the sum of the memory limits of all containers in the pod.
- Each container has memory.max set to its specific limit.
- For example, a 256Mi limit results in 268435456 bytes.

BestEffort pods have no memory limits:
- memory.max is set to max, meaning the pod can theoretically consume all available memory on the node.

Kubelet does not enforce hard memory limits at the intermediate QoS tier level:
- kubepods-burstable.slice: memory.max is set to max.
- kubepods-besteffort.slice: memory.max is set to max.
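The memory.max arithmetic and the OOM-prioritization heuristic described above can be sketched as below. The Burstable oom_score_adj formula and the constants (-997, 1000, clamping to [2, 999]) are assumptions based on the commonly documented kubelet policy; verify against the kubelet source before relying on them.

```python
# Sketch: pod-level memory.max (sum of container limits, in bytes) and an
# assumed version of the kubelet's oom_score_adj heuristic. The Burstable
# formula and constants (-997, 1000, clamp to [2, 999]) are assumptions;
# check the kubelet source for the authoritative policy.

MI = 2**20

def pod_memory_max(container_limits_bytes):
    """Sum of container limits; None for any container means no pod-level limit."""
    if any(limit is None for limit in container_limits_bytes):
        return None  # written as "max" in the cgroup file
    return sum(container_limits_bytes)

def oom_score_adj(qos, memory_request=0, node_capacity=1):
    if qos == "guaranteed":
        return -997   # protected: killed last
    if qos == "besteffort":
        return 1000   # killed first
    # Burstable: the larger the request relative to the node, the lower the score.
    score = 1000 - (1000 * memory_request) // node_capacity
    return min(max(score, 2), 999)

print(pod_memory_max([256 * MI]))                         # 268435456
print(oom_score_adj("burstable", 128 * MI, 4 * 1024**3))  # 969
```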