Out of Resource Handling | Cluster Administration

Overview

This topic discusses best-effort attempts to prevent OKD from experiencing out-of-memory (OOM) and out-of-disk-space conditions.

A node must maintain stability when available compute resources are low. This is especially important when dealing with incompressible resources such as memory or disk. If either resource is exhausted, the node becomes unstable.

Using configurable eviction policies, administrators can proactively monitor nodes and prevent them from running out of compute and memory resources.

This topic also provides information on how OKD handles out-of-resource conditions and provides an example scenario and recommended practices:

Resource reclaiming
Pod eviction
Pod scheduling
Out of Resource and Out of Memory Killer

If swap memory is enabled for a node, that node cannot detect that it is under memory pressure.

To take advantage of memory-based evictions, operators must disable swap.

Configuring Eviction Policies

An eviction policy allows a node to fail one or more pods when the node is running low on available resources. Failing a pod allows the node to reclaim needed resources.

An eviction policy is a combination of an eviction trigger signal with a specific eviction threshold value that is set in the node configuration file or through the command line. Evictions can be either hard, where a node takes immediate action on a pod that exceeds a threshold, or soft, where a node allows a grace period before taking action.

To modify a node in your cluster, update the node configuration maps as needed. Do not manually edit the node-config.yaml file.

With well-configured eviction policies, a node can proactively monitor for and prevent total consumption of a compute resource.

When the node fails a pod, the node ends all the containers in the pod, and the PodPhase is transitioned to Failed.

When detecting disk pressure, the node supports the nodefs and imagefs file system partitions.

The nodefs, or rootfs, is the file system that the node uses for local disk volumes, daemon logs, emptyDir, and other local storage. For example, rootfs is the file system that provides /. The rootfs contains openshift.local.volumes, by default /var/lib/origin/openshift.local.volumes.

The imagefs is the file system that the container runtime uses for storing images and individual container-writable layers. By default, the eviction threshold for imagefs is met when the file system is 85% full. Which file system serves as the imagefs depends on the runtime and, in the case of Docker, on the storage driver that the container uses.

For Docker:
- If you use the devicemapper storage driver, the imagefs is the thin pool.
  
  You can limit the read and write layer for the container by setting the --storage-opt dm.basesize flag in the Docker daemon.
  $ sudo dockerd --storage-opt dm.basesize=50G
- If you use the overlay2 storage driver, the imagefs is the file system that contains /var/lib/docker/overlay2.
For CRI-O, which uses the overlay driver, the imagefs is /var/lib/containers/storage by default.

If you do not use local storage isolation (ephemeral storage) and you do not use XFS quota (volumeConfig), you cannot limit local disk usage by the pod.

Using the Node Configuration to Create a Policy

To configure an eviction policy, edit the appropriate node configuration map to specify the eviction thresholds under the eviction-hard or eviction-soft parameters.

The following samples show eviction thresholds:

Sample Node Configuration File for a Hard Eviction

kubeletArguments:
  eviction-hard: (1)
  - memory.available<100Mi (2)
  - nodefs.available<10%
  - nodefs.inodesFree<5%
  - imagefs.available<15%
  - imagefs.inodesFree<10%

1	The type of eviction: Use this parameter for a hard eviction.
2	Eviction thresholds based on a specific eviction trigger signal.

You must provide percentage values for the inodesFree parameters. You can provide a percentage or a numerical value for the other parameters.

Sample Node Configuration File for a Soft Eviction

kubeletArguments:
  eviction-soft: (1)
  - memory.available<100Mi (2)
  - nodefs.available<10%
  - nodefs.inodesFree<5%
  - imagefs.available<15%
  - imagefs.inodesFree<10%
  eviction-soft-grace-period: (3)
  - memory.available=1m30s
  - nodefs.available=1m30s
  - nodefs.inodesFree=1m30s
  - imagefs.available=1m30s
  - imagefs.inodesFree=1m30s

1	The type of eviction: Use this parameter for a soft eviction.
2	An eviction threshold based on a specific eviction trigger signal.
3	The grace period for the soft eviction. Leave the default values for optimal performance.

Restart the OKD service for the changes to take effect:

# systemctl restart origin-node

Understanding Eviction Signals

You can configure a node to trigger eviction decisions on any of the signals described in the table below. You add an eviction signal to an eviction threshold along with a threshold value.

To view the signals:

curl <certificate details> \
  https://<master>/api/v1/nodes/<node>/proxy/stats/summary

Table 1. Supported Eviction Signals
Node Condition	Eviction Signal	Value	Description
`MemoryPressure`	`memory.available`	`memory.available` = `node.status.capacity[memory]` - `node.stats.memory.workingSet`	Available memory on the node has exceeded an eviction threshold.
`DiskPressure`	`nodefs.available`	`nodefs.available` = `node.stats.fs.available`	Available disk space on either the node root file system or image file system has exceeded an eviction threshold.
	`nodefs.inodesFree`	`nodefs.inodesFree` = `node.stats.fs.inodesFree`
	`imagefs.available`	`imagefs.available` = `node.stats.runtime.imagefs.available`
	`imagefs.inodesFree`	`imagefs.inodesFree` = `node.stats.runtime.imagefs.inodesFree`

Each of the signals in the preceding table supports either a literal or percentage-based value, except inodesFree. The inodesFree signal must be specified as a percentage. The percentage-based value is calculated relative to the total capacity associated with each signal.

A script derives the value for memory.available from your cgroup driver using the same set of steps that the kubelet performs. The script excludes inactive file memory (that is, the number of bytes of file-backed memory on inactive LRU list) from its calculation as it assumes that inactive file memory is reclaimable under pressure.
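The calculation described above can be sketched as follows. This is a minimal illustration rather than the script itself: the function and parameter names are hypothetical, and it assumes that the working set is total memory usage minus inactive file memory, which is one reading of the preceding paragraph.

```python
GI = 1024 ** 3  # bytes in one gibibyte


def memory_available(capacity_bytes: int, usage_bytes: int,
                     inactive_file_bytes: int) -> int:
    """Return memory.available = capacity - working set, in bytes.

    Assumption: the working set is usage minus inactive file-backed
    memory, because that memory is treated as reclaimable under pressure.
    """
    working_set = max(usage_bytes - inactive_file_bytes, 0)
    return capacity_bytes - working_set


# A node with 10Gi capacity, 7Gi in use, 2Gi of it inactive file memory.
print(memory_available(10 * GI, 7 * GI, 2 * GI) // GI)
```

With these inputs the working set is 5Gi, so the sketch reports 5Gi as available.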

Do not use tools like free -m, because free -m does not work in a container.

OKD monitors these file systems every 10 seconds.

If you store volumes and logs in a dedicated file system, the node does not monitor that file system.

The node supports the ability to trigger eviction decisions based on disk pressure. Before evicting pods because of disk pressure, the node also performs container and image garbage collection.

Understanding Eviction Thresholds

You can configure a node to specify eviction thresholds. Reaching a threshold triggers the node to reclaim resources. You can configure a threshold in the node configuration file.

If an eviction threshold is met, independent of its associated grace period, the node reports a condition to indicate that the node is under memory or disk pressure. Reporting the pressure prevents the scheduler from scheduling any additional pods on the node while attempts to reclaim resources are made.

The node continues to report node status updates at the frequency specified by the node-status-update-frequency argument. The default frequency is 10s (ten seconds).

Eviction thresholds can be hard, for when the node takes immediate action when a threshold is met, or soft, for when you allow a grace period before reclaiming resources.

Soft eviction is more common when you target a certain level of utilization but can tolerate temporary spikes. We recommend setting the soft eviction threshold so that it is met before the hard eviction threshold, for example memory.available<1Gi for soft eviction and memory.available<.5Gi for hard eviction. The grace period can be operator-specific. The system reservation should also cover the soft eviction threshold.

The soft eviction threshold is an advanced feature. You should configure a hard eviction threshold before attempting to use soft eviction thresholds.

Thresholds are configured in the following form:

<eviction_signal><operator><quantity>

The eviction_signal value can be any supported eviction signal.
The operator value is <.
The quantity value must match the quantity representation used by Kubernetes and can be expressed as a percentage if it ends with the % token.

For example, if an operator has a node with 10Gi of memory, and that operator wants to induce eviction if available memory falls below 1Gi, an eviction threshold for memory can be specified as either of the following:

memory.available<1Gi
memory.available<10%
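To make the equivalence of the two forms concrete, the following sketch parses a threshold and resolves a percentage against capacity. It is an illustration only: parse_threshold and threshold_met are hypothetical helpers, not part of OKD, and the suffix table covers only a subset of the Kubernetes quantity notation.

```python
import re
from typing import Tuple

GI = 1024 ** 3

# Subset of Kubernetes quantity suffixes (an assumption made for brevity).
_SUFFIXES = {"": 1, "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3,
             "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3}

# <eviction_signal><operator><quantity>, where the only operator is "<".
_THRESHOLD = re.compile(r"^([A-Za-z.]+)<(\d*\.?\d+)(%|[A-Za-z]*)$")


def parse_threshold(expr: str, capacity: float) -> Tuple[str, float]:
    """Return (signal, absolute limit), resolving % against capacity."""
    match = _THRESHOLD.match(expr)
    if match is None:
        raise ValueError(f"not a valid threshold: {expr!r}")
    signal, number, unit = match.groups()
    if unit == "%":
        return signal, capacity * float(number) / 100
    if unit not in _SUFFIXES:
        raise ValueError(f"unsupported suffix: {unit!r}")
    return signal, float(number) * _SUFFIXES[unit]


def threshold_met(expr: str, observed: float, capacity: float) -> bool:
    """A threshold is met when the observed value is below the limit."""
    _signal, limit = parse_threshold(expr, capacity)
    return observed < limit


# On a 10Gi node, both forms resolve to the same 1Gi limit.
print(parse_threshold("memory.available<1Gi", 10 * GI))
print(parse_threshold("memory.available<10%", 10 * GI))
```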

The node evaluates and monitors eviction thresholds every 10 seconds. This value is the housekeeping interval and cannot be modified.

Understanding Hard Eviction Thresholds

A hard eviction threshold has no grace period. When a hard eviction threshold is met, the node takes immediate action to reclaim the associated resource. For example, the node can end one or more pods immediately with no graceful termination.

To configure hard eviction thresholds, add eviction thresholds to the node configuration file under eviction-hard, as shown in Using the Node Configuration to Create a Policy.

Sample Node Configuration File with Hard Eviction Thresholds

kubeletArguments:
  eviction-hard:
  - memory.available<500Mi
  - nodefs.available<500Mi
  - nodefs.inodesFree<5%
  - imagefs.available<100Mi
  - imagefs.inodesFree<10%

This example is a general guideline and not recommended settings.

Default Hard Eviction Thresholds

OKD uses the following default configuration for eviction-hard.

...
kubeletArguments:
  eviction-hard:
  - memory.available<100Mi
  - nodefs.available<10%
  - nodefs.inodesFree<5%
  - imagefs.available<15%
...

Understanding Soft Eviction Thresholds

A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The node does not reclaim resources associated with the eviction signal until that grace period is exceeded. If no grace period is provided in the node configuration, the node produces an error on startup.

In addition, if a soft eviction threshold is met, an operator can specify a maximum-allowed pod termination grace period to use when evicting pods from the node. If eviction-max-pod-grace-period is specified, the node uses the lesser of pod.Spec.TerminationGracePeriodSeconds and the maximum-allowed grace period. If it is not specified, the node ends pods immediately with no graceful termination.

For soft eviction thresholds the following flags are supported:

eviction-soft: a set of eviction thresholds, such as memory.available<1.5Gi. If the threshold is met over a corresponding grace period, the threshold triggers a pod eviction.
eviction-soft-grace-period: a set of eviction grace periods, such as memory.available=1m30s. The grace period corresponds to how long a soft eviction threshold must hold before triggering a pod eviction.
eviction-max-pod-grace-period: the maximum-allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
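The interaction of these flags can be sketched as a small state tracker. This is an illustration under stated assumptions, not the node's implementation: the class and function names are hypothetical, and the sketch assumes that the threshold must be met continuously for the whole grace period and that the timer resets whenever the signal recovers.

```python
from typing import Optional


class SoftThreshold:
    """Tracks how long a soft eviction threshold has been met."""

    def __init__(self, limit: float, grace_period_s: float) -> None:
        self.limit = limit
        self.grace_period_s = grace_period_s
        self._met_since: Optional[float] = None

    def observe(self, now_s: float, available: float) -> bool:
        """Return True once the threshold has held for the grace period."""
        if available >= self.limit:
            self._met_since = None  # signal recovered, so reset the timer
            return False
        if self._met_since is None:
            self._met_since = now_s
        return now_s - self._met_since >= self.grace_period_s


def effective_pod_grace_period(pod_grace_s: int,
                               max_pod_grace_s: Optional[int]) -> int:
    """Lesser of the pod's own grace period and the configured maximum."""
    if max_pod_grace_s is None:
        return 0  # per the text: pods end with no graceful termination
    return min(pod_grace_s, max_pod_grace_s)


# memory.available<1.5Gi with a grace period of 1m30s (values in Gi).
tracker = SoftThreshold(limit=1.5, grace_period_s=90)
for t, avail in [(0, 1.0), (60, 1.0), (70, 2.0), (80, 1.0), (170, 1.0)]:
    print(t, tracker.observe(t, avail))
```

In this run the recovery at 70 seconds resets the timer, so eviction is only signalled at 170 seconds, 90 seconds after the threshold was met again.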

To configure soft eviction thresholds, add eviction thresholds to the node configuration file under eviction-soft, as shown in Using the Node Configuration to Create a Policy.

Sample Node Configuration Files with Soft Eviction Thresholds

kubeletArguments:
  eviction-soft:
  - memory.available<500Mi
  - nodefs.available<500Mi
  - nodefs.inodesFree<5%
  - imagefs.available<100Mi
  - imagefs.inodesFree<10%
  eviction-soft-grace-period:
  - memory.available=1m30s
  - nodefs.available=1m30s
  - nodefs.inodesFree=1m30s
  - imagefs.available=1m30s
  - imagefs.inodesFree=1m30s

This example is a general guideline and not recommended settings.

Configuring the Amount of Resource for Scheduling

You can control how much of a node resource is made available for scheduling in order to allow the scheduler to fully allocate a node and to prevent evictions.

Set system-reserved equal to the amount of resource that you want to reserve for operating system daemons such as sshd and NetworkManager. The node subtracts this amount from its capacity, and the remainder is available to the scheduler for deploying pods. Evictions should only occur if pods use more than their requested amount of an allocatable resource.

A node reports two values:

Capacity: How much resource is on the machine.
Allocatable: How much resource is made available for scheduling.

To configure the amount of allocatable resources, edit the appropriate node configuration map to add or modify the system-reserved parameter alongside the eviction-hard or eviction-soft parameter.

kubeletArguments:
  eviction-hard: (1)
    - "memory.available<500Mi"
  system-reserved:
    - "memory=1.5Gi"

1	This threshold can either be `eviction-hard` or `eviction-soft`.

To determine appropriate values for the system-reserved setting, determine a node’s resource usage using the node summary API. For more information, see Configuring Nodes for Allocated Resources.

Restart the OKD service for the changes to take effect:

# systemctl restart origin-node

Controlling Node Condition Oscillation

If a node oscillates above and below a soft eviction threshold, but does not exceed an associated grace period, the oscillation can cause problems for the scheduler.

To prevent the oscillation, set the eviction-pressure-transition-period parameter to control how long the node must wait before transitioning out of a pressure condition.

Edit or add the parameter to the kubeletArguments section of the appropriate node configuration map, specifying a duration such as 5m.

kubeletArguments:
  eviction-pressure-transition-period:
  - 5m

The node toggles the condition back to false when the node has not met an eviction threshold for the specified pressure condition during the specified period.

Use the default value, 5 minutes, before making any adjustments. The default value is intended to enable the system to stabilize and to prevent the scheduler from scheduling new pods to the node before it has settled.
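The effect of the transition period can be sketched as follows. This is a simplified model with hypothetical names, assuming that the condition turns true as soon as a threshold is met and turns false only after the threshold has gone unmet for the full period.

```python
from typing import Optional


class PressureCondition:
    """Holds a pressure condition true until a quiet period has elapsed."""

    def __init__(self, transition_period_s: float = 300.0) -> None:
        self.transition_period_s = transition_period_s
        self.active = False
        self._last_met: Optional[float] = None

    def update(self, now_s: float, threshold_met: bool) -> bool:
        """Record one observation and return the reported condition."""
        if threshold_met:
            self.active = True
            self._last_met = now_s
        elif self.active and self._last_met is not None:
            # Clear only after the threshold has stayed unmet long enough.
            if now_s - self._last_met >= self.transition_period_s:
                self.active = False
        return self.active


condition = PressureCondition(transition_period_s=300)
for t, met in [(0, True), (100, False), (200, True), (450, False), (500, False)]:
    print(t, condition.update(t, met))
```

Because the threshold is met again at 200 seconds, the condition stays true until 500 seconds, so the brief dip at 100 seconds never reaches the scheduler as a change in node condition.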

Restart the OKD service for the changes to take effect:
```
# systemctl restart origin-node
```

Reclaiming Node-level Resources

If an eviction criterion is satisfied, the node initiates the process of reclaiming the pressured resource until the eviction threshold is no longer met. During this time, the node does not support scheduling any new pods.

The node attempts to reclaim node-level resources before the node evicts end-user pods, based on whether the host system has a dedicated imagefs configured for the container runtime.

With Imagefs

If the host system has imagefs:

If the nodefs file system meets eviction thresholds, the node frees disk space in the following order:
- Delete dead pods and containers.
If the imagefs file system meets eviction thresholds, the node frees disk space in the following order:
- Delete all unused images.

Without Imagefs

If the host system does not have imagefs:

If the nodefs file system meets eviction thresholds, the node frees disk space in the following order:
- Delete dead pods and containers.
- Delete all unused images.

Understanding Pod Eviction

If an eviction threshold is met and the grace period is passed, the node initiates the process of evicting pods until the signal is below the defined threshold.

The node ranks pods for eviction by their quality of service. Among pods with the same quality of service, the node ranks the pods by the consumption of the compute resource relative to the pod’s scheduling request.

Each quality of service level has an out-of-memory score. The Linux out-of-memory killer (OOM killer) uses the score to determine which pods to end. For more information, see Understanding Quality of Service and Out of Memory Killer.

The following table lists each quality of service level and describes which pods the node fails first within that level.

Table 2. Quality of Service Levels
Quality of Service	Description
`Guaranteed`	Pods that consume the highest amount of the resource relative to their request are failed first. If no pod exceeds its request, the strategy targets the largest consumer of the resource.
`Burstable`	Pods that consume the highest amount of the resource relative to their request for that resource are failed first. If no pod exceeds its request, the strategy targets the largest consumer of the resource.
`BestEffort`	Pods that consume the highest amount of the resource are failed first.
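The ranking can be sketched as a sort. This is a simplified model with hypothetical names, not the node's actual algorithm: it assumes that lower quality of service levels are considered first (BestEffort, then Burstable, then Guaranteed) and that, within a level, pods are ordered by how far their usage exceeds their request.

```python
from dataclasses import dataclass
from typing import List

# Assumed order of consideration: lowest quality of service first.
_QOS_ORDER = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


@dataclass
class Pod:
    name: str
    qos: str
    usage: int        # current consumption of the starved resource
    request: int = 0  # BestEffort pods have no request


def eviction_order(pods: List[Pod]) -> List[Pod]:
    """Sort by quality of service, then by usage above request, descending."""
    return sorted(pods,
                  key=lambda p: (_QOS_ORDER[p.qos], -(p.usage - p.request)))


pods = [
    Pod("a", "Guaranteed", usage=900, request=1000),
    Pod("b", "Burstable", usage=700, request=500),
    Pod("c", "Burstable", usage=300, request=500),
    Pod("d", "BestEffort", usage=100),
]
print([p.name for p in eviction_order(pods)])
```

Here the BestEffort pod is ranked first, followed by the Burstable pod that exceeds its request, then the Burstable pod that stays within it, and the Guaranteed pod last.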

A guaranteed quality of service pod is never evicted due to resource consumption by another pod unless a system daemon, such as node or the container engine, consumes more resources than were reserved using the system-reserved allocations or if the node has only guaranteed quality of service pods remaining.

If the node has only guaranteed quality of service pods remaining, the node evicts a pod that least impacts node stability and limits the impact of the unexpected consumption to the other guaranteed quality of service pods.

Local disk is a best-effort quality of service resource. If necessary, the node evicts pods one at a time to reclaim disk space when disk pressure is encountered. The node ranks pods by quality of service. If the node is responding to a lack of free inodes, the node reclaims inodes by evicting pods with the lowest quality of service first. If the node is responding to a lack of available disk, the node ranks pods within a quality of service level by the amount of local disk they consume and evicts the largest consumers first.

Understanding Quality of Service and Out of Memory Killer

If the node experiences a system out-of-memory (OOM) event before it is able to reclaim memory, the node depends on the OOM killer to respond.

The node sets an oom_score_adj value for each container that is based on the quality of service for the pod.

Table 3. Quality of Service Levels
Quality of Service	`oom_score_adj` Value
`Guaranteed`	-998
`Burstable`	min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
`BestEffort`	1000
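The values in the table can be computed with a short function. This is a sketch with a hypothetical function name; it assumes integer division in the Burstable formula, which the table does not specify.

```python
GI = 1024 ** 3


def oom_score_adj(qos: str, memory_request_bytes: int = 0,
                  machine_memory_capacity_bytes: int = 1) -> int:
    """Return the oom_score_adj value for a container, following the table."""
    if qos == "Guaranteed":
        return -998
    if qos == "BestEffort":
        return 1000
    if qos == "Burstable":
        # Integer division is an assumption made for this sketch.
        raw = 1000 - (1000 * memory_request_bytes) // machine_memory_capacity_bytes
        return min(max(2, raw), 999)
    raise ValueError(f"unknown quality of service: {qos!r}")


# A Burstable container requesting 4Gi on a node with 10Gi of memory.
print(oom_score_adj("Burstable", 4 * GI, 10 * GI))
```

The clamp keeps every Burstable value between 2 and 999, so a Burstable container always scores above a Guaranteed one and below a BestEffort one.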

If the node is unable to reclaim memory before the node experiences a system OOM event, the OOM killer process calculates an OOM score:

% of node memory a container is using + oom_score_adj = oom_score

The node then ends the container with the highest score.

Containers with the lowest quality of service and that consume the largest amount of memory, relative to the scheduling request, are ended first.

Unlike pod eviction, if a pod container is ended due to OOM, the node can restart the container according to the pod's restart policy.

Understanding the Pod Scheduler and OOR Conditions

The scheduler views node conditions when the scheduler places additional pods on the node. For example, if the node has an eviction threshold like the following:

eviction-hard is "memory.available<500Mi"

If available memory falls below 500Mi, the node reports the MemoryPressure condition as true in Node.Status.Conditions.

Table 4. Node Conditions and Scheduler Behavior
Node Condition	Scheduler Behavior
`MemoryPressure`	If a node reports this condition, the scheduler does not place `BestEffort` pods on that node.
`DiskPressure`	If a node reports this condition, the scheduler does not place any additional pods on that node.
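The table can be read as a simple placement filter. The function below is a hypothetical illustration of those two rows only and ignores every other factor that the scheduler considers.

```python
from typing import Set


def scheduler_allows(node_conditions: Set[str], pod_qos: str) -> bool:
    """Return False when a reported pressure condition blocks placement."""
    if "DiskPressure" in node_conditions:
        return False  # no additional pods of any kind
    if "MemoryPressure" in node_conditions and pod_qos == "BestEffort":
        return False  # only BestEffort pods are kept off the node
    return True


print(scheduler_allows({"MemoryPressure"}, "BestEffort"))
print(scheduler_allows({"MemoryPressure"}, "Burstable"))
print(scheduler_allows({"DiskPressure"}, "Guaranteed"))
```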

Example Scenario

An operator:

Has a node with a memory capacity of 10Gi.
Wants to reserve 10% of memory capacity for system daemons such as kernel, node, and other daemons.
Wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.

Implicit in this configuration is the understanding that system-reserved should include the amount of memory covered by the eviction threshold.

To reach that capacity, either some pod is using more than its request, or the system is using more than 1Gi.

If a node has 10 Gi of capacity and you want to reserve 10% of that capacity for the system daemons with the system-reserved setting, perform the following calculation:

capacity = 10 Gi
system-reserved = 10 Gi * .1 = 1 Gi

The amount of allocatable resources becomes:

allocatable = capacity - system-reserved = 9 Gi

This means that, by default, the scheduler schedules pods that together request up to 9 Gi of memory to that node.

If you want to enable eviction so that eviction is triggered when the node observes that available memory is below 10% of capacity for 30 seconds, or immediately when it falls below 5% of capacity, you need the scheduler to evaluate allocatable as 8Gi. Therefore, ensure your system reservation covers the greater of your eviction thresholds.

capacity = 10 Gi
eviction-threshold = 10 Gi * .1 = 1 Gi
system-reserved = (10Gi * .1) + eviction-threshold = 2 Gi
allocatable = capacity - system-reserved = 8 Gi
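The same arithmetic can be checked with a few lines of code. The function and parameter names are hypothetical, and percentages are passed as whole numbers so that the calculation stays in integer bytes.

```python
from typing import Tuple

GI = 1024 ** 3


def reservation(capacity_bytes: int, daemon_percent: int,
                eviction_percent: int) -> Tuple[int, int]:
    """Return (system_reserved, allocatable) in bytes.

    system-reserved covers both the share set aside for system daemons
    and the greater of the eviction thresholds.
    """
    daemons = capacity_bytes * daemon_percent // 100
    eviction_threshold = capacity_bytes * eviction_percent // 100
    system_reserved = daemons + eviction_threshold
    return system_reserved, capacity_bytes - system_reserved


reserved, allocatable = reservation(10 * GI, daemon_percent=10,
                                    eviction_percent=10)
print(reserved // GI, allocatable // GI)
```

For the 10 Gi node in this scenario, the sketch yields 2 Gi reserved and 8 Gi allocatable, matching the calculation above.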

Add the following to the appropriate node configuration map:

kubeletArguments:
  system-reserved:
  - "memory=2Gi"
  eviction-hard:
  - "memory.available<.5Gi"
  eviction-soft:
  - "memory.available<1Gi"
  eviction-soft-grace-period:
  - "memory.available=30s"

This configuration ensures that the scheduler does not place pods on a node and immediately induce memory pressure and trigger an eviction. This configuration assumes those pods use less than their configured request.

Recommended Practice

Daemon Sets and Out of Resource Handling

If a node evicts a pod that was created by a daemon set, the pod is immediately recreated and rescheduled to the same node. The scheduler operates this way because the node cannot distinguish a pod that is created by a daemon set from a pod that is created by any other object.

In general, daemon sets should not create best-effort pods, so that their pods are not identified as candidates for eviction. Instead, daemon sets should launch pods that are configured with a guaranteed quality of service.
