This topic discusses best-effort attempts to prevent OKD from experiencing out-of-memory (OOM) and out-of-disk-space conditions.
A node must maintain stability when available compute resources are low. This is especially important when dealing with incompressible resources such as memory or disk. If either resource is exhausted, the node becomes unstable.
Using configurable eviction policies, administrators can proactively monitor for and prevent situations where a node runs out of compute and memory resources.
This topic also provides information on how OKD handles out-of-resource conditions, and provides an example scenario and recommended practices.
If swap memory is enabled for a node, that node cannot detect that it is under MemoryPressure. To take advantage of memory-based evictions, operators must disable swap.
An eviction policy allows a node to fail one or more pods when the node is running low on available resources. Failing a pod allows the node to reclaim needed resources.
An eviction policy is a combination of an eviction trigger signal with a specific eviction threshold value that is set in the node configuration file or through the command line. Evictions can be either hard, where a node takes immediate action on a pod that exceeds a threshold, or soft, where a node allows a grace period before taking action.
By using well-configured eviction policies, a node can proactively monitor for and prevent total starvation of a compute resource.
When the node fails a pod, it terminates all containers in the pod, and the PodPhase is transitioned to Failed.
When detecting disk pressure, the node supports the nodefs and imagefs file system partitions.

The nodefs, or rootfs, is the file system that the node uses for local disk volumes, daemon logs, emptyDir, and so on (for example, the file system that provides /). The rootfs contains openshift.local.volumes, by default /var/lib/origin/openshift.local.volumes.
The imagefs is the file system that the container runtime uses for storing images and individual container-writable layers. Eviction thresholds are at 85% full for imagefs. The imagefs file system depends on the runtime and, in the case of Docker, on which storage driver you are using.
For Docker:

- If you are using the devicemapper storage driver, the imagefs is the thin pool. You can limit the read/write layer for the container by setting the --storage-opt dm.basesize flag in the Docker daemon:

  $ sudo dockerd --storage-opt dm.basesize=50G

- If you are using the overlay2 storage driver, the imagefs is the file system that contains /var/lib/docker/overlay2.
For CRI-O, which uses the overlay driver, the imagefs is /var/lib/containers/storage by default.
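If you are not sure which storage driver a node's container runtime is using, or where the imagefs lives, you can check on the host. This is a quick sketch that assumes the default storage locations described above:

$ docker info --format '{{.Driver}}'
$ df -h /var/lib/docker/overlay2
$ df -h /var/lib/containers/storage

The first command prints the Docker storage driver (for example, overlay2); the df commands show how full the Docker and CRI-O image file systems are, respectively.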
If you do not use local storage isolation (ephemeral storage) and are not using XFS quota (volumeConfig), you cannot limit local disk usage by the pod.
To configure an eviction policy, edit the node configuration file (the /etc/origin/node/node-config.yaml file) to specify the eviction thresholds under the eviction-hard or eviction-soft parameters.
For example:
kubeletArguments:
  eviction-hard: (1)
  - memory.available<100Mi (2)
  - nodefs.available<10%
  - nodefs.inodesFree<5%
  - imagefs.available<15%
  - imagefs.inodesFree<10%
(1) The type of eviction: Use this parameter for a hard eviction.
(2) Eviction thresholds based on a specific eviction trigger signal.
You must provide percentage values for the inodesFree parameters.
kubeletArguments:
  eviction-soft: (1)
  - memory.available<100Mi (2)
  - nodefs.available<10%
  - nodefs.inodesFree<5%
  - imagefs.available<15%
  - imagefs.inodesFree<10%
  eviction-soft-grace-period: (3)
  - memory.available=1m30s
  - nodefs.available=1m30s
  - nodefs.inodesFree=1m30s
  - imagefs.available=1m30s
  - imagefs.inodesFree=1m30s
(1) The type of eviction: Use this parameter for a soft eviction.
(2) An eviction threshold based on a specific eviction trigger signal.
(3) The grace period for the soft eviction. Leave the default values for optimal performance.
Restart the OKD service for the changes to take effect:
# systemctl restart origin-node
You can configure a node to trigger eviction decisions on any of the signals described in the table below. You add an eviction signal to an eviction threshold along with a threshold value.
The value of each signal is described in the Description column based on the node summary API.
To view the signals:
$ curl <certificate details> \
  https://<master>/api/v1/nodes/<node>/proxy/stats/summary
| Node Condition | Eviction Signal | Value | Description |
|---|---|---|---|
| MemoryPressure | memory.available | memory.available = node.status.capacity[memory] - node.stats.memory.workingSet | Available memory on the node has exceeded an eviction threshold. |
| DiskPressure | nodefs.available | nodefs.available = node.stats.fs.available | Available disk space on either the node root file system or image file system has exceeded an eviction threshold. |
| DiskPressure | nodefs.inodesFree | nodefs.inodesFree = node.stats.fs.inodesFree | |
| DiskPressure | imagefs.available | imagefs.available = node.stats.runtime.imagefs.available | |
| DiskPressure | imagefs.inodesFree | imagefs.inodesFree = node.stats.runtime.imagefs.inodesFree | |
Each of the above signals supports either a literal or percentage-based value. The percentage-based value is calculated relative to the total capacity associated with each signal.
A script derives the value for memory.available from your cgroup driver using the same set of steps that the kubelet performs. The script excludes inactive file memory (that is, the number of bytes of file-backed memory on the inactive LRU list) from its calculation, as it assumes that inactive file memory is reclaimable under pressure.
Do not use tools like free -m to determine available memory, because free -m does not work in a container.
OKD monitors these file systems every 10 seconds.
If you store volumes and logs in a dedicated file system, the node will not monitor that file system.
As of OKD 3.4, the node supports the ability to trigger eviction decisions based on disk pressure. Operators must opt in to enable disk-based evictions. Prior to evicting pods due to disk pressure, the node also performs container and image garbage collection. In future releases, garbage collection will be deprecated in favor of a pure disk-eviction-based configuration.
You can configure a node to specify eviction thresholds, which triggers the node to reclaim resources, by adding a threshold to the node configuration file.
If an eviction threshold is met, independent of its associated grace period, the node reports a condition indicating that the node is under memory or disk pressure. This prevents the scheduler from scheduling any additional pods on the node while attempts to reclaim resources are made.
The node continues to report node status updates at the frequency specified by the node-status-update-frequency argument, which defaults to 10s (ten seconds).
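If you need the node to report status more or less often, one option is to pass the argument through kubeletArguments in the node configuration file, like the other kubelet arguments shown in this topic. The following is a minimal sketch of that assumption; the value shown simply repeats the 10s default for illustration:

kubeletArguments:
  node-status-update-frequency:
  - "10s"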
Eviction thresholds can be hard, where the node takes immediate action when a threshold is met, or soft, where you allow a grace period before reclaiming resources.
Soft eviction usage is more common when you are targeting a certain level of utilization but can tolerate temporary spikes. We recommend setting the soft eviction threshold lower than the hard eviction threshold, but the time period can be operator-specific. The system reservation should also cover the soft eviction threshold. The soft eviction threshold is an advanced feature. You should configure a hard eviction threshold before attempting to use soft eviction thresholds.
Thresholds are configured in the following form:
<eviction_signal><operator><quantity>
- The eviction_signal value can be any supported eviction signal.
- The operator value is <.
- The quantity value must match the quantity representation used by Kubernetes and can be expressed as a percentage if it ends with the % token.
For example, if an operator has a node with 10Gi of memory, and that operator wants to induce eviction if available memory falls below 1Gi, an eviction threshold for memory can be specified as either of the following:
memory.available<1Gi
memory.available<10%
The node evaluates and monitors eviction thresholds every 10 seconds, and this interval cannot be modified. This is the housekeeping interval.
A hard eviction threshold has no grace period and, if observed, the node takes immediate action to reclaim the associated starved resource. If a hard eviction threshold is met, the node kills the pod immediately with no graceful termination.
To configure hard eviction thresholds, add eviction thresholds to the node configuration file under eviction-hard, as shown in Using the Node Configuration to Create a Policy.
kubeletArguments:
  eviction-hard:
  - memory.available<500Mi
  - nodefs.available<500Mi
  - nodefs.inodesFree<100Mi
  - imagefs.available<100Mi
  - imagefs.inodesFree<100Mi
This example is a general guideline, not a set of recommended settings.
A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The node does not reclaim resources associated with the eviction signal until that grace period is exceeded. If no grace period is provided in the node configuration, the node errors on startup.
In addition, if a soft eviction threshold is met, an operator can specify a maximum allowed pod termination grace period to use when evicting pods from the node. If eviction-max-pod-grace-period is specified, the node uses the lesser value among pod.Spec.TerminationGracePeriodSeconds and the maximum-allowed grace period. For example, if a pod specifies a 120-second termination grace period but eviction-max-pod-grace-period is 30, the node gives the pod 30 seconds to terminate. If eviction-max-pod-grace-period is not specified, the node kills pods immediately with no graceful termination.
For soft eviction thresholds, the following flags are supported:

- eviction-soft: a set of eviction thresholds (for example, memory.available<1.5Gi) that, if met over a corresponding grace period, triggers a pod eviction.
- eviction-soft-grace-period: a set of eviction grace periods (for example, memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
- eviction-max-pod-grace-period: the maximum-allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
To configure soft eviction thresholds, add eviction thresholds to the node configuration file under eviction-soft, as shown in Using the Node Configuration to Create a Policy.
kubeletArguments:
  eviction-soft:
  - memory.available<500Mi
  - nodefs.available<500Mi
  - nodefs.inodesFree<100Mi
  - imagefs.available<100Mi
  - imagefs.inodesFree<100Mi
  eviction-soft-grace-period:
  - memory.available=1m30s
  - nodefs.available=1m30s
  - nodefs.inodesFree=1m30s
  - imagefs.available=1m30s
  - imagefs.inodesFree=1m30s
This example is a general guideline, not a set of recommended settings.
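None of the examples above set eviction-max-pod-grace-period. The following sketch extends the soft eviction configuration with that flag; the 30-second value is illustrative only, not a recommended setting:

kubeletArguments:
  eviction-soft:
  - memory.available<500Mi
  eviction-soft-grace-period:
  - memory.available=1m30s
  eviction-max-pod-grace-period:
  - "30"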
You can control how much of a node resource is made available for scheduling in order to allow the scheduler to fully allocate a node and to prevent evictions.
Set system-reserved equal to the amount of resource you want available to the scheduler for deploying pods and for system daemons.
Evictions should only occur if pods use more than their requested amount of an allocatable resource.
A node reports two values:

- Capacity: how much resource is on the machine.
- Allocatable: how much resource is made available for scheduling.
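To see the two values a given node reports, you can describe the node; the Capacity and Allocatable sections appear in the output (replace <node> with an actual node name):

$ oc describe node <node>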
To configure the amount of allocatable resources:
Edit the node configuration file (the /etc/origin/node/node-config.yaml file) to add or modify the system-reserved parameter for eviction-hard or eviction-soft.
kubeletArguments:
  eviction-hard: (1)
  - "memory.available<500Mi"
  system-reserved:
  - "memory=1.5Gi"
(1) This threshold can either be eviction-hard or eviction-soft.
Restart the OKD service for the changes to take effect:
# systemctl restart origin-node
If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node condition oscillates between true and false, which can cause problems for the scheduler.
To prevent this oscillation, set the eviction-pressure-transition-period parameter to control how long the node must wait before transitioning out of a pressure condition.
Edit or add the parameter to the kubeletArguments section of the node configuration file (the /etc/origin/node/node-config.yaml) using a set of <resource_type>=<resource_quantity> pairs.
kubeletArguments:
  eviction-pressure-transition-period:
  - "5m"
The node toggles the condition back to false when the node has not observed an eviction threshold being met for the specified pressure condition for the specified period.
Use the default value (5 minutes) before making any adjustments. The default choice is intended to allow the system to stabilize, and to prevent the scheduler from assigning new pods to the node before it has settled.
Restart the OKD services for the changes to take effect:
# systemctl restart origin-node
If an eviction criterion is satisfied, the node initiates the process of reclaiming the pressured resource until the signal goes below the defined threshold. During this time, the node does not support scheduling any new pods.
The node attempts to reclaim node-level resources prior to evicting end-user pods, based on whether the host system has a dedicated imagefs configured for the container runtime.

If the host system has imagefs:

- If the nodefs file system meets eviction thresholds, the node frees up disk space in the following order:
  1. Delete dead pods/containers.
- If the imagefs file system meets eviction thresholds, the node frees up disk space in the following order:
  1. Delete all unused images.

If the host system does not have imagefs:

- If the nodefs file system meets eviction thresholds, the node frees up disk space in the following order:
  1. Delete dead pods/containers.
  2. Delete all unused images.
If an eviction threshold is met and the grace period is passed, the node initiates the process of evicting pods until the signal goes below the defined threshold.
The node ranks pods for eviction by their quality of service, and, among those with the same quality of service, by the consumption of the starved compute resource relative to the pod’s scheduling request.
Each QOS level has an OOM score, which the Linux out-of-memory tool (OOM killer) uses to determine which pods to kill. See Understanding Quality of Service and Out of Memory Killer below.
The following table lists each QOS level and the associated OOM score.
| Quality of Service | Description |
|---|---|
| Guaranteed | Pods that consume the highest amount of the starved resource relative to their request are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource. |
| Burstable | Pods that consume the highest amount of the starved resource relative to their request for that resource are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource. |
| BestEffort | Pods that consume the highest amount of the starved resource are failed first. |
A Guaranteed pod will never be evicted because of another pod's resource consumption unless a system daemon (such as node, docker, or journald) is consuming more resources than were reserved using system-reserved or kube-reserved allocations, or the node has only Guaranteed pods remaining.

If the node has only Guaranteed pods remaining, the node evicts a Guaranteed pod that least impacts node stability and limits the impact of the unexpected consumption to other Guaranteed pods.
Local disk is a BestEffort resource. If necessary, the node evicts pods one at a time to reclaim disk when DiskPressure is encountered. The node ranks pods by quality of service. If the node is responding to inode starvation, it reclaims inodes by evicting pods with the lowest quality of service first. If the node is responding to a lack of available disk, it ranks pods within a quality of service class by how much local disk they consume and evicts the largest consumers first.
If the node experiences a system out of memory (OOM) event before it is able to reclaim memory, the node depends on the OOM killer to respond.

The node sets an oom_score_adj value for each container based on the quality of service for the pod.
| Quality of Service | oom_score_adj Value |
|---|---|
| Guaranteed | -998 |
| Burstable | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
| BestEffort | 1000 |
If the node is unable to reclaim memory prior to experiencing a system OOM event, the oom_killer calculates an oom_score:

% of node memory a container is using + oom_score_adj = oom_score
The node then kills the container with the highest score.
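To see how these values combine, consider an illustrative example (the figures below are hypothetical, not drawn from the source): a Burstable container that requests 1Gi of memory on a node with 10Gi of capacity and is currently using 30% of the node's memory:

oom_score_adj = min(max(2, 1000 - (1000 * 1Gi) / 10Gi), 999) = min(max(2, 900), 999) = 900
oom_score = 30 + 900 = 930

A BestEffort container using the same 30% of node memory would score 30 + 1000 = 1030, so it would be killed before the Burstable container.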
Containers with the lowest quality of service that are consuming the largest amount of memory relative to the scheduling request are failed first.
Unlike pod eviction, if a pod's container is OOM killed, the node can restart it based on the pod's restart policy.
The scheduler views node conditions when placing additional pods on the node. For example, if the node has an eviction threshold like the following:
eviction-hard is "memory.available<500Mi"
and available memory falls below 500Mi, the node reports MemoryPressure as true in Node.Status.Conditions.
| Node Condition | Scheduler Behavior |
|---|---|
| MemoryPressure | If a node reports this condition, the scheduler will not place BestEffort pods on that node. |
| DiskPressure | If a node reports this condition, the scheduler will not place any additional pods on that node. |
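To check whether a node is currently reporting one of these conditions, you can query the node status directly. The following is a sketch using a jsonpath expression; adjust the condition type as needed:

$ oc get node <node> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'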
Consider the following scenario.
An operator:

- has a node with a memory capacity of 10Gi;
- wants to reserve 10% of memory capacity for system daemons (kernel, node, etc.);
- wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.
Implicit in this configuration is the understanding that system-reserved should include the amount of memory covered by the eviction threshold. To reach that capacity, either some pod is using more than its request, or the system is using more than 1Gi.
If a node has 10 Gi of capacity and you want to reserve 10% of that capacity for the system daemons (system-reserved), perform the following calculation:

capacity = 10 Gi
system-reserved = 10 Gi * .1 = 1 Gi
The amount of allocatable resources becomes:
allocatable = capacity - system-reserved = 9 Gi
This means by default, the scheduler will schedule pods that request 9 Gi of memory to that node.
If you want to turn on eviction so that eviction is triggered when the node observes that available memory falls below 10% of capacity for 30 seconds, or immediately when it falls below 5% of capacity, you need the scheduler to see allocatable as 8Gi. Therefore, ensure your system reservation covers the greater of your eviction thresholds.
capacity = 10 Gi
eviction-threshold = 10 Gi * .1 = 1 Gi
system-reserved = (10Gi * .1) + eviction-threshold = 2 Gi
allocatable = capacity - system-reserved = 8 Gi
Enter the following in the node-config.yaml:
kubeletArguments:
  system-reserved:
  - "memory=2Gi"
  eviction-hard:
  - "memory.available<.5Gi"
  eviction-soft:
  - "memory.available<1Gi"
  eviction-soft-grace-period:
  - "memory.available=30s"
This configuration ensures that the scheduler does not place pods on a node that would immediately induce memory pressure and trigger eviction, assuming those pods use less than their configured request.
If a node evicts a pod that was created by a DaemonSet, the pod will immediately be recreated and rescheduled back to the same node, because the node has no ability to distinguish a pod created from a DaemonSet versus any other object.
In general, DaemonSets should not create BestEffort pods to avoid being identified as candidates for eviction. Instead, DaemonSets should ideally launch Guaranteed pods.
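For reference, a pod receives the Guaranteed quality of service when every container's resource requests equal its limits. The following DaemonSet is a hypothetical sketch of that pattern; the name, image, resource values, and apiVersion are illustrative and may need to be adapted to your cluster version:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-agent
spec:
  selector:
    matchLabels:
      app: example-agent
  template:
    metadata:
      labels:
        app: example-agent
    spec:
      containers:
      - name: agent
        image: example/agent:latest
        resources:
          # requests equal to limits, so the pod is Guaranteed
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 100m
            memory: 128Mi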