×

To provide more reliable scheduling and minimize node resource overcommitment, reserve a portion of the CPU and memory resources for use by the underlying node components, such as kubelet and kube-proxy, and the remaining system components, such as sshd and NetworkManager. By specifying the resources to reserve, you provide the scheduler with more information about the remaining CPU and memory resources that a node has available for use by pods. You can allow OKD to automatically determine the optimal system-reserved CPU and memory resources for your nodes or you can manually determine and set the best resources for your nodes.

To manually set resource values, you must use a kubelet config CR. You cannot use a machine config CR.

Understanding how to allocate resources for nodes

CPU and memory resources reserved for node components in OKD are based on two node settings:

Setting Description

kube-reserved

This setting is not used with OKD. Add the CPU and memory resources that you planned to reserve to the system-reserved setting.

system-reserved

This setting identifies the resources to reserve for the node components and system components, such as CRI-O and Kubelet. The default settings depend on the OKD and Machine Config Operator versions. Confirm the default systemReserved parameter on the machine-config-operator repository.

If a flag is not set, the defaults are used. If none of the flags are set, the allocated resource is set to the node’s capacity as it was before the introduction of allocatable resources.

Any CPUs specifically reserved using the reservedSystemCPUs parameter are not available for allocation using kube-reserved or system-reserved.

How OKD computes allocated resources

An allocated amount of a resource is computed based on the following formula:

[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]

The withholding of Hard-Eviction-Thresholds from Allocatable improves system reliability because the value for Allocatable is enforced for pods at the node level.

If Allocatable is negative, it is set to 0.

Each node reports the system resources that are used by the container runtime and kubelet. To simplify configuring the system-reserved parameter, view the resource use for the node by using the node summary API. The node summary is available at /api/v1/nodes/<node>/proxy/stats/summary.

How nodes enforce resource constraints

The node is able to limit the total amount of resources that pods can consume based on the configured allocatable value. This feature significantly improves the reliability of the node by preventing pods from using CPU and memory resources that are needed by system services such as the container runtime and node agent. To improve node reliability, administrators should reserve resources based on a target for resource use.

The node enforces resource constraints by using a new cgroup hierarchy that enforces quality of service. All pods are launched in a dedicated cgroup hierarchy that is separate from system daemons.

Administrators should treat system daemons similar to pods that have a guaranteed quality of service. System daemons can burst within their bounding control groups and this behavior must be managed as part of cluster deployments. Reserve CPU and memory resources for system daemons by specifying the amount of CPU and memory resources in system-reserved.

Enforcing system-reserved limits can prevent critical system services from receiving CPU and memory resources. As a result, a critical system service can be ended by the out-of-memory killer. The recommendation is to enforce system-reserved only if you have profiled the nodes exhaustively to determine precise estimates and you are confident that critical system services can recover if any process in that group is ended by the out-of-memory killer.

Understanding Eviction Thresholds

If a node is under memory pressure, it can impact the entire node and all pods running on the node. For example, a system daemon that uses more than its reserved amount of memory can trigger an out-of-memory event. To avoid or reduce the probability of system out-of-memory events, the node provides out-of-resource handling.

You can reserve some memory using the --eviction-hard flag. The node attempts to evict pods whenever memory availability on the node drops below the absolute value or percentage. If system daemons do not exist on a node, pods are limited to the memory capacity - eviction-hard. For this reason, resources set aside as a buffer for eviction before reaching out of memory conditions are not available for pods.

The following is an example to illustrate the impact of node allocatable for memory:

  • Node capacity is 32Gi

  • --system-reserved is 3Gi

  • --eviction-hard is set to 100Mi.

For this node, the effective node allocatable value is 28.9Gi. If the node and system components use all their reservation, the memory available for pods is 28.9Gi, and kubelet evicts pods when it exceeds this threshold.

If you enforce node allocatable, 28.9Gi, with top-level cgroups, then pods can never exceed 28.9Gi. Evictions are not performed unless system daemons consume more than 3.1Gi of memory.

If system daemons do not use up all their reservation, with the above example, pods would face memcg OOM kills from their bounding cgroup before node evictions kick in. To better enforce QoS under this situation, the node applies the hard eviction thresholds to the top-level cgroup for all pods to be Node Allocatable + Eviction Hard Thresholds.

If system daemons do not use up all their reservation, the node will evict pods whenever they consume more than 28.9Gi of memory. If eviction does not occur in time, a pod will be OOM killed if pods consume 29Gi of memory.

How the scheduler determines resource availability

The scheduler uses the value of node.Status.Allocatable instead of node.Status.Capacity to decide if a node will become a candidate for pod scheduling.

By default, the node will report its machine capacity as fully schedulable by the cluster.

Automatically allocating resources for nodes

OKD can automatically determine the optimal system-reserved CPU and memory resources for nodes associated with a specific machine config pool and update the nodes with those values when the nodes start. By default, the system-reserved CPU is 500m and system-reserved memory is 1Gi.

To automatically determine and allocate the system-reserved resources on nodes, create a KubeletConfig custom resource (CR) to set the autoSizingReserved: true parameter. A script on each node calculates the optimal values for the respective reserved resources based on the installed CPU and memory capacity on each node. The script takes into account that increased capacity requires a corresponding increase in the reserved resources.

Automatically determining the optimal system-reserved settings ensures that your cluster is running efficiently and prevents node failure due to resource starvation of system components, such as CRI-O and kubelet, without your needing to manually calculate and update the values.

This feature is disabled by default.

Prerequisites
  1. Obtain the label associated with the static MachineConfigPool object for the type of node you want to configure by entering the following command:

    $ oc edit machineconfigpool <name>

    For example:

    $ oc edit machineconfigpool worker
    Example output
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      creationTimestamp: "2022-11-16T15:34:25Z"
      generation: 4
      labels:
        pools.operator.machineconfiguration.openshift.io/worker: "" (1)
      name: worker
    #...
    1 The label appears under Labels.

    If an appropriate label is not present, add a key/value pair such as:

    $ oc label machineconfigpool worker custom-kubelet=small-pods
Procedure
  1. Create a custom resource (CR) for your configuration change:

    Sample configuration for a resource allocation CR
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: dynamic-node (1)
    spec:
      autoSizingReserved: true (2)
      machineConfigPoolSelector:
        matchLabels:
          pools.operator.machineconfiguration.openshift.io/worker: "" (3)
    #...
    1 Assign a name to CR.
    2 Add the autoSizingReserved parameter set to true to allow OKD to automatically determine and allocate the system-reserved resources on the nodes associated with the specified label. To disable automatic allocation on those nodes, set this parameter to false.
    3 Specify the label from the machine config pool that you configured in the "Prerequisites" section. You can choose any desired labels for the machine config pool, such as custom-kubelet: small-pods, or the default label, pools.operator.machineconfiguration.openshift.io/worker: "".

    The previous example enables automatic resource allocation on all worker nodes. OKD drains the nodes, applies the kubelet config, and restarts the nodes.

  2. Create the CR by entering the following command:

    $ oc create -f <file_name>.yaml
Verification
  1. Log in to a node you configured by entering the following command:

    $ oc debug node/<node_name>
  2. Set /host as the root directory within the debug shell:

    # chroot /host
  3. View the /etc/node-sizing.env file:

    Example output
    SYSTEM_RESERVED_MEMORY=3Gi
    SYSTEM_RESERVED_CPU=0.08

    The kubelet uses the system-reserved values in the /etc/node-sizing.env file. In the previous example, the worker nodes are allocated 0.08 CPU and 3 Gi of memory. It can take several minutes for the optimal values to appear.

Manually allocating resources for nodes

OKD supports the CPU and memory resource types for allocation. The ephemeral-resource resource type is also supported. For the cpu type, you specify the resource quantity in units of cores, such as 200m, 0.5, or 1. For memory and ephemeral-storage, you specify the resource quantity in units of bytes, such as 200Ki, 50Mi, or 5Gi. By default, the system-reserved CPU is 500m and system-reserved memory is 1Gi.

As an administrator, you can set these values by using a kubelet config custom resource (CR) through a set of <resource_type>=<resource_quantity> pairs (e.g., cpu=200m,memory=512Mi).

You must use a kubelet config CR to manually set resource values. You cannot use a machine config CR.

For details on the recommended system-reserved values, refer to the recommended system-reserved values.

Prerequisites
  1. Obtain the label associated with the static MachineConfigPool CRD for the type of node you want to configure by entering the following command:

    $ oc edit machineconfigpool <name>

    For example:

    $ oc edit machineconfigpool worker
    Example output
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      creationTimestamp: "2022-11-16T15:34:25Z"
      generation: 4
      labels:
        pools.operator.machineconfiguration.openshift.io/worker: "" (1)
      name: worker
    #...
    1 The label appears under Labels.

    If the label is not present, add a key/value pair such as:

    $ oc label machineconfigpool worker custom-kubelet=small-pods
Procedure
  1. Create a custom resource (CR) for your configuration change.

    Sample configuration for a resource allocation CR
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: set-allocatable (1)
    spec:
      machineConfigPoolSelector:
        matchLabels:
          pools.operator.machineconfiguration.openshift.io/worker: "" (2)
      kubeletConfig:
        systemReserved: (3)
          cpu: 1000m
          memory: 1Gi
    #...
    1 Assign a name to CR.
    2 Specify the label from the machine config pool.
    3 Specify the resources to reserve for the node components and system components.
  2. Run the following command to create the CR:

    $ oc create -f <file_name>.yaml