Applying pod priority and preemption

You can enable pod priority and preemption in your cluster. Pod priority indicates the importance of a pod relative to other pods and queues the pods based on that priority. Pod preemption allows the cluster to evict, or preempt, lower-priority pods so that higher-priority pods can be scheduled if there is no available space on a suitable node Pod priority also affects the scheduling order of pods and out-of-resource eviction ordering on the node.

To use priority and preemption, you create priority classes that define the relative weight of your pods. Then, reference a priority class in the pod specification to apply that weight for scheduling.

Preemption is controlled by the disablePreemption parameter in the scheduler configuration file, which is set to false by default.

About pod priority

When the Pod Priority and Preemption feature is enabled, the scheduler orders pending pods by their priority, and a pending pod is placed ahead of other pending pods with lower priority in the scheduling queue. As a result, the higher priority pod might be scheduled sooner than pods with lower priority if its scheduling requirements are met. If a pod cannot be scheduled, scheduler continues to schedule other lower priority pods.

Pod priority classes

You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.

A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Reserve numbers larger than one billion for critical pods that should not be preempted or evicted. By default, OKD has two reserved priority classes for critical system pods to have guaranteed scheduling.

  • System-node-critical - This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node. Examples of pods that have this priority class are sdn-ovs, sdn, and so forth.

  • System-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth.

If you upgrade your existing cluster, the priority of your existing pods is effectively zero. However, existing pods with the scheduler.alpha.kubernetes.io/critical-pod annotation are automatically converted to system-cluster-critical class.

Pod priority names

After you have one or more priority classes, you can create pods that specify a priority class name in a pod specification. The priority admission controller uses the priority class name field to populate the integer value of the priority. If the named priority class is not found, the pod is rejected.

The following YAML is an example of a pod configuration that uses the priority class created in the preceding example. The priority admission controller checks the specification and resolves the priority of the pod to 1000000.

About pod preemption

When a developer creates a pod, the pod goes into a queue. When the Pod Priority and Preemption feature is enabled, the scheduler picks a pod from the queue and tries to schedule the pod on a node. If the scheduler cannot find space on an appropriate node that satisfies all the specified requirements of the pod, preemption logic is triggered for the pending pod.

When the scheduler preempts one or more pods on a node, the nominatedNodeName field of higher-priority pod specification is set to the name of the node, along with the nodename field. The scheduler uses the nominatedNodeName field to keep track of the resources reserved for pods and also provides information to the user about preemptions in the clusters.

After the scheduler preempts a lower-priority pod, the scheduler honors the graceful termination period of the pod. If another node becomes available while scheduler is waiting for the lower-priority pod to terminate, the scheduler can schedule the higher-priority pod on that node. As a result, the nominatedNodeName field and nodeName field of the pod specification might be different.

Also, if the scheduler preempts pods on a node and is waiting for termination, and a pod with a higher-priority pod than the pending pod needs to be scheduled, the scheduler can schedule the higher-priority pod instead. In such a case, the scheduler clears the nominatedNodeName of the pending pod, making the pod eligible for another node.

Preemption does not necessarily remove all lower-priority pods from a node. The scheduler can schedule a pending pod by removing a portion of the lower-priority pods.

The scheduler considers a node for pod preemption only if the pending pod can be scheduled on the node.

Pod preemption and other scheduler settings

If you enable pod priority and preemption, consider your other scheduler settings:

Pod priority and pod disruption budget

A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, OKD respects them when preempting pods at a best effort level. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.

Pod priority and pod affinity

Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.

If a pending pod has inter-pod affinity with one or more of the lower-priority pods on a node, the scheduler cannot preempt the lower-priority pods without violating the affinity requirements. In this case, the scheduler looks for another node to schedule the pending pod. However, there is no guarantee that the scheduler can find an appropriate node and pending pod might not be scheduled.

To prevent this situation, carefully configure pod affinity with equal-priority pods.

Graceful termination of preempted pods

When preempting a pod, the scheduler waits for the pod graceful termination period to expire, allowing the pod to finish working and exit. If the pod does not exit after the period, the scheduler kills the pod. This graceful termination period creates a time gap between the point that the scheduler preempts the pod and the time when the pending pod can be scheduled on the node.

To minimize this gap, configure a small graceful termination period for lower-priority pods.

Pod priority example scenarios

Pod priority and preemption assigns a priority to pods for scheduling. The scheduler will preempt (evict) lower-priority pods in order to schedule higher-priority pods.

Typical preemption scenario

Pod P is a pending pod.

  1. The scheduler locates Node N, where the removal of one or more pods would enable Pod P to be scheduled on that node.

  2. The scheduler deletes the lower-priority pods from the Node N and schedules Pod P on the node.

  3. The nominatedNodeName field of Pod P is set to the name of Node N.

Pod P is not necessarily scheduled to the nominated node.

Preemption and termination periods

The preempted pod has a long termination period.

  1. The scheduler preempts a lower-priority pod on Node N.

  2. The scheduler waits for the pod to gracefully terminate.

  3. For other scheduling reasons, Node M becomes available.

  4. The scheduler can then schedule Pod P on Node M.

Configuring priority and preemption

You apply pod priority and preemption by creating a priority class objects and associating pods to the priority using the priorityClassName in your pod specifications.

Sample priority class object
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority (1)
value: 1000000 (2)
globalDefault: false (3)
description: "This priority class should be used for XYZ service pods only." (4)
1 The name of the priority class object.
2 The priority value of the object.
3 Optional field that indicates whether this priority class should be used for pods without a priority class name specified. This field is false by default. Only one priority class with globalDefault set to true can exist in the cluster. If there is no priority class with globalDefault:true, the priority of pods with no priority class name is zero. Adding a priority class with globalDefault:true affects only pods created after the priority class is added and does not change the priorities of existing pods.
4 Optional arbitrary text string that describes which pods developers should use with this priority class.
Sample pod specification with priority class name
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority (1)
1 Specify the priority class to use with this pod.

To configure your cluster to use priority and preemption:

  1. Create one or more priority classes:

    1. Specify a name and value for the priority.

    2. Optionally specify the globalDefault field in the priority class and a description.

  2. Create pods or edit existing pods to include the name of a priority class. You can add the priority name directly to the pod configuration or to a pod template:

Disabling priority and preemption

You can disable the pod priority and preemption feature.

After the feature is disabled, the existing pods keep their priority fields, but preemption is disabled, and priority fields are ignored. If the feature is disabled, you cannot set a priority class name in new pods.

Critical pods rely on scheduler preemption to be scheduled when a cluster is under resource pressure. For this reason, Red Hat recommends not disabling preemption. DaemonSet pods are scheduled by the DaemonSet controller and not affected by disabling preemption.

To disable the preemption for the cluster:

  1. Modify the master-config.yaml to set the disablePreemption parameter in the schedulerArgs section to false.

    disablePreemption=false
  2. Restart the OKD master service and scheduler to apply the changes.

    # master-restart api
    # master-restart scheduler