Purpose for Allocating Node Resources

To provide more reliable scheduling and minimize node resource overcommitment, reserve a portion of the CPU and memory resources for use by the underlying node components such as kubelet, kube-proxy, and the container engine. The resources that you reserve are also used by the remaining system components such as sshd, NetworkManager, and so on. Specifying the resources to reserve provides the scheduler with more information about the remaining memory and CPU resources that a node has available for use by pods.

Configuring Nodes for Allocated Resources

Resources are reserved for node components and system components in OKD by configuring the system-reserved node setting.

OKD does not use the kube-reserved setting. Documentation for Kubernetes and some cloud vendors that provide a Kubernetes environment might suggest configuring kube-reserved. That information does not apply to an OKD cluster.

Use caution when you tune your cluster with resource limits and enforce those limits with evictions. Enforcing system-reserved limits can prevent critical system services from receiving CPU time, or can cause those services to be ended when memory resources run low.

In most cases, tuning resource allocation is performed by making an adjustment and then monitoring the cluster performance with a production-like workload. That process is repeated until the cluster is stable and meets service-level agreements.

For more information on the effects of these settings, see Computing Allocated Resources.

Setting Description

kube-reserved

This setting is not used with OKD. Add the CPU and memory resources that you planned to reserve to the system-reserved setting.

system-reserved

Resources that are reserved for the node components and system components. Default is none.

View the services that are controlled by system-reserved with a tool such as lscgroup by running the following commands:

# yum install libcgroup-tools
$ lscgroup memory:/system.slice

Reserve resources in the kubeletArguments section of the node configuration map by adding a set of <resource_type>=<resource_quantity> pairs. For example, cpu=500m,memory=1Gi reserves 500 millicores of CPU and one gibibyte (1 GiB) of memory.

Example 1. Node-Allocatable Resources Settings
kubeletArguments:
  system-reserved:
    - "cpu=500m,memory=1Gi"

Add the system-reserved field if it does not exist.

Do not edit the node-config.yaml file directly.
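
The <resource_type>=<resource_quantity> format can be checked before it is added to the node configuration map. The following Python sketch is only an illustration: the parse_quantity and parse_system_reserved helpers are hypothetical names, not part of OKD, and the sketch assumes that quantities use either the millicore suffix (m) or the common binary and decimal suffixes.

```python
from decimal import Decimal

# Suffixes assumed for this sketch; other quantity forms are not handled.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(text):
    """Return a quantity as a Decimal: cores for CPU, bytes for memory."""
    if text.endswith("m"):  # millicores, for example 500m
        return Decimal(text[:-1]) / 1000
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def parse_system_reserved(value):
    """Split a string such as cpu=500m,memory=1Gi into a dictionary."""
    reserved = {}
    for pair in value.split(","):
        name, _, quantity = pair.partition("=")
        reserved[name.strip()] = parse_quantity(quantity.strip())
    return reserved


reserved = parse_system_reserved("cpu=500m,memory=1Gi")
print(reserved["cpu"], reserved["memory"])  # 0.5 cores, 1073741824 bytes
```

A malformed quantity raises an exception from Decimal, which makes a typing mistake visible before the value reaches a node.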

To determine appropriate values for these settings, view the resource usage of a node by using the node summary API. For more information, see System Resources Reported by Node.

After you set system-reserved:

  • Monitor the memory usage of a node for high-water marks:

    $ ps aux | grep <service-name>

    For example:

    $ ps aux | grep atomic-openshift-node
    
    USER       PID   %CPU  %MEM  VSZ     RSS  TTY    STAT  START  TIME  COMMAND
    root       11089 11.5  0.3   112712  996  pts/1  R+    16:23  0:00  grep --color=auto atomic-openshift-node

    If this value is close to your system-reserved mark, you can increase the system-reserved value.

  • Monitor the memory usage of system services with a tool such as cgget by running the following commands:

    # yum install libcgroup-tools
    $ cgget -g memory /system.slice | grep memory.usage_in_bytes

    If this value is close to your system-reserved mark, you can increase the system-reserved value.

  • Use the OKD cluster loader to measure performance metrics of your deployment at various cluster states.
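
The comparison between observed memory usage and the system-reserved mark can be scripted. The following Python sketch is an assumption-laden illustration: the function names are hypothetical, the 90% threshold is an arbitrary example of "close", and the default path assumes a cgroup v1 memory controller mounted under /sys/fs/cgroup.

```python
def reserve_utilization(usage_bytes, reserved_bytes):
    """Return usage as a fraction of the reserved amount."""
    if reserved_bytes <= 0:
        raise ValueError("reserved_bytes must be positive")
    return usage_bytes / reserved_bytes


def needs_more_reserve(usage_bytes, reserved_bytes, threshold=0.9):
    """Report whether usage is at or above the chosen fraction of the reserve.

    The 0.9 default is a hypothetical value for this sketch, not an OKD setting.
    """
    return reserve_utilization(usage_bytes, reserved_bytes) >= threshold


def read_cgroup_usage(
    path="/sys/fs/cgroup/memory/system.slice/memory.usage_in_bytes",
):
    """Read the value that cgget reports (assumes a cgroup v1 layout)."""
    with open(path) as handle:
        return int(handle.read().strip())


one_gib = 2**30
print(needs_more_reserve(950 * 2**20, one_gib))  # usage of 950 MiB against 1 GiB
print(needs_more_reserve(512 * 2**20, one_gib))  # usage of 512 MiB against 1 GiB
```

On a node, the first argument could come from read_cgroup_usage(). The function is defined but not called here because the path exists only on hosts that match the stated assumption.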

Computing Allocated Resources

The allocatable amount of a resource is computed with the following formula:

[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]

The withholding of Hard-Eviction-Thresholds from allocatable improves system reliability because the value for allocatable is enforced for pods at the node level. The experimental-allocatable-ignore-eviction setting is available to preserve legacy behavior, but it will be deprecated in a future release.

If [Allocatable] is negative, it is set to 0.
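
The formula and the clamp to 0 can be written as a short function. This Python sketch is illustrative only: the function name is hypothetical, all arguments must use the same unit, and the 100 MiB eviction threshold in the memory example is an assumed value, not one taken from a node.

```python
def allocatable(capacity, system_reserved, hard_eviction_threshold):
    """Apply [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds].

    A negative result is set to 0, as described above.
    """
    return max(0, capacity - system_reserved - hard_eviction_threshold)


# CPU in millicores: 4 cores with 500m reserved and no eviction threshold.
print(allocatable(4000, 500, 0))

# Memory in MiB: 8 GiB capacity, 1 GiB reserved, assumed 100 MiB threshold.
print(allocatable(8192, 1024, 100))

# Reserving more than the node has yields 0 rather than a negative value.
print(allocatable(1024, 2048, 100))
```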

Viewing Node-Allocatable Resources and Capacity

To view the current capacity and allocatable resources for a node, run the following command:

$ oc get node/<node_name> -o yaml

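In the following partial output, the allocatable values are less than the capacity. The difference is expected: the CPU difference of 500m matches a cpu=500m,memory=1Gi resource allocation for system-reserved, and the memory difference of 1152996Ki is larger than 1Gi (1048576Ki) because the hard eviction threshold is also withheld from allocatable.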

status:
...
  allocatable:
    cpu: "3500m"
    memory: 6857952Ki
    pods: "110"
  capacity:
    cpu: "4"
    memory: 8010948Ki
    pods: "110"
...

The scheduler uses the values for allocatable to decide if a node is a candidate for pod scheduling.
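
To check how much of each resource is withheld on a node, subtract allocatable from capacity. The following Python sketch uses the values from the partial output above. The helper names are hypothetical, and the sketch assumes that CPU is reported in cores or millicores and memory in Ki, as in that output.

```python
def to_millicores(cpu):
    """Convert a CPU value such as "4" or "3500m" to millicores."""
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return int(float(cpu) * 1000)


def to_kib(memory):
    """Convert a memory value such as "8010948Ki" to an integer count of Ki."""
    if not memory.endswith("Ki"):
        raise ValueError("this sketch only handles the Ki suffix")
    return int(memory[:-2])


# Values copied from the status section shown above.
status = {
    "allocatable": {"cpu": "3500m", "memory": "6857952Ki"},
    "capacity": {"cpu": "4", "memory": "8010948Ki"},
}

cpu_withheld = to_millicores(status["capacity"]["cpu"]) - to_millicores(
    status["allocatable"]["cpu"]
)
memory_withheld = to_kib(status["capacity"]["memory"]) - to_kib(
    status["allocatable"]["memory"]
)

print(cpu_withheld)     # millicores withheld from pods
print(memory_withheld)  # Ki withheld from pods
```

For these values the script reports 500 millicores and 1152996 Ki. The memory figure covers both the 1Gi system-reserved amount and the hard eviction threshold from the formula in Computing Allocated Resources.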

System Resources Reported by Node

Each node reports the system resources that are used by the container runtime and kubelet. To simplify configuring system-reserved, view the resource usage for the node by using the node summary API. The node summary is available at <master>/api/v1/nodes/<node>/proxy/stats/summary.

For example, to view the resources for the cluster.node22 node, run the following command:

$ curl <certificate details> https://<master>/api/v1/nodes/cluster.node22/proxy/stats/summary

The response includes information that is similar to the following:

{
    "node": {
        "nodeName": "cluster.node22",
        "systemContainers": [
            {
                "cpu": {
                    "usageCoreNanoSeconds": 929684480915,
                    "usageNanoCores": 190998084
                },
                "memory": {
                    "rssBytes": 176726016,
                    "usageBytes": 1397895168,
                    "workingSetBytes": 1050509312
                },
                "name": "kubelet"
            },
            {
                "cpu": {
                    "usageCoreNanoSeconds": 128521955903,
                    "usageNanoCores": 5928600
                },
                "memory": {
                    "rssBytes": 35958784,
                    "usageBytes": 129671168,
                    "workingSetBytes": 102416384
                },
                "name": "runtime"
            }
        ]
    }
}

See REST API Overview for more information about the certificate details.
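
The node summary response can be reduced to a pair of totals that serve as a starting point for system-reserved. The following Python sketch is one possible approach, not a recommended sizing method: the function name is hypothetical, the embedded JSON is a trimmed copy of the example response, and summing workingSetBytes and usageNanoCores across systemContainers is an assumption about which fields are most useful.

```python
import json
import math

# Trimmed copy of the example response; only the fields used below are kept.
SUMMARY_JSON = """
{
  "node": {
    "nodeName": "cluster.node22",
    "systemContainers": [
      {"name": "kubelet",
       "cpu": {"usageNanoCores": 190998084},
       "memory": {"workingSetBytes": 1050509312}},
      {"name": "runtime",
       "cpu": {"usageNanoCores": 5928600},
       "memory": {"workingSetBytes": 102416384}}
    ]
  }
}
"""


def system_container_totals(summary):
    """Return (millicores, bytes) used by all reported system containers."""
    containers = summary["node"]["systemContainers"]
    nano_cores = sum(c["cpu"]["usageNanoCores"] for c in containers)
    working_set = sum(c["memory"]["workingSetBytes"] for c in containers)
    # 1 millicore equals 1,000,000 nanocores; round up to be conservative.
    return math.ceil(nano_cores / 1_000_000), working_set


millicores, memory_bytes = system_container_totals(json.loads(SUMMARY_JSON))
print(millicores, memory_bytes)
```

For the example response, the totals are 197 millicores and 1152925696 bytes. These figures describe a single point in time and cover only the kubelet and the container runtime, so they do not account for other system services such as sshd.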

Node Enforcement

The node is able to limit the total amount of resources that pods can consume based on the configured allocatable value. This feature significantly improves the reliability of the node by preventing pods from using CPU and memory resources that are needed by system services such as the container runtime and node agent. To improve node reliability, administrators should reserve resources based on a target for resource use.

The node enforces resource constraints using a new cgroup hierarchy that enforces quality of service. All pods are launched in a dedicated cgroup hierarchy that is separate from system daemons.

To configure node enforcement, use the following parameters in the appropriate node configuration map.

Example 2. Node Cgroup Settings
kubeletArguments:
  cgroups-per-qos:
    - "true" (1)
  cgroup-driver:
    - "systemd" (2)
  enforce-node-allocatable:
    - "pods" (3)
1 Enable or disable a cgroup hierarchy for each quality of service. The cgroups are managed by the node. Any change of this setting requires a full drain of the node. This flag must be true to enable the node to enforce the node-allocatable resource constraints. The default value is true and Red Hat does not recommend that customers change this value.
2 The cgroup driver that is used by the node to manage the cgroup hierarchies. This value must match the driver that is associated with the container runtime. Valid values are systemd and cgroupfs, but Red Hat supports systemd only.
3 A comma-delimited list of scopes for where the node should enforce node resource constraints. The default value is pods and Red Hat supports pods only.
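
The supported values described in the callouts can be checked mechanically before a configuration change is rolled out. The following Python sketch is a hypothetical helper, not an OKD tool. It assumes that the kubeletArguments section has already been loaded into a dictionary whose values are lists of strings, and it skips any setting that is absent.

```python
# Values that the callouts above describe as supported.
SUPPORTED = {
    "cgroups-per-qos": ["true"],
    "cgroup-driver": ["systemd"],
    "enforce-node-allocatable": ["pods"],
}


def unsupported_settings(kubelet_arguments):
    """Return a message for each setting that differs from the supported value."""
    problems = []
    for key, expected in SUPPORTED.items():
        actual = kubelet_arguments.get(key)
        if actual is not None and actual != expected:
            problems.append(f"{key}: found {actual}, supported value is {expected}")
    return problems


# The settings from Example 2 pass the check.
example = {
    "cgroups-per-qos": ["true"],
    "cgroup-driver": ["systemd"],
    "enforce-node-allocatable": ["pods"],
}
print(unsupported_settings(example))

# A cgroupfs driver is valid for the node but is reported as unsupported.
print(unsupported_settings({"cgroup-driver": ["cgroupfs"]}))
```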

Administrators should treat system daemons similarly to pods that have a guaranteed quality of service. System daemons can burst within their bounding control groups, and this behavior must be managed as part of cluster deployments. Reserve CPU and memory resources for system daemons by specifying the resources in system-reserved, as shown in the Configuring Nodes for Allocated Resources section.

To view the cgroup driver that is set, run the following command:

$ systemctl status atomic-openshift-node -l | grep cgroup-driver=

The output includes a response that is similar to the following:

--cgroup-driver=systemd
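
When this check is scripted across many nodes, the driver name can be extracted from the captured command output. The following Python sketch is a minimal illustration: the function name and the sample status line are hypothetical, and only the --cgroup-driver= flag shown above is assumed to appear in the text.

```python
import re


def cgroup_driver(status_text):
    """Return the value of --cgroup-driver= in the text, or None if absent."""
    match = re.search(r"--cgroup-driver=(\S+)", status_text)
    return match.group(1) if match else None


# Hypothetical fragment of captured output, used only for illustration.
sample = "/usr/bin/hyperkube kubelet --cgroup-driver=systemd --v=2"
print(cgroup_driver(sample))
print(cgroup_driver("no driver flag here"))
```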

For more information on managing and troubleshooting cgroup drivers, see Introduction to Control Groups (Cgroups).

Eviction Thresholds

Memory pressure on a node can impact the entire node and all pods running on it. If a system daemon uses more than its reserved amount of memory, an out-of-memory event can occur that impacts the entire node and all pods running on the node. To avoid or reduce the probability of system out-of-memory events, the node provides out-of-resource handling.