$ oc get priorityclasses
You can enable pod priority and preemption in your cluster. pod priority indicates the importance of a pod relative to other pods and queues the pods based on that priority. pod preemption allows the cluster to evict, or preempt, lower-priority pods so that higher-priority pods can be scheduled if there is no available space on a suitable node pod priority also affects the scheduling order of pods and out-of-resource eviction ordering on the node.
To use priority and preemption, you create priority classes that define the relative weight of your pods. Then, reference a priority class in the pod specification to apply that weight for scheduling.
When you use the Pod Priority and Preemption feature, the scheduler orders pending pods by their priority, and a pending pod is placed ahead of other pending pods with lower priority in the scheduling queue. As a result, the higher priority pod might be scheduled sooner than pods with lower priority if its scheduling requirements are met. If a pod cannot be scheduled, scheduler continues to schedule other lower priority pods.
You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.
A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Reserve numbers larger than or equal to one billion for critical pods that must not be preempted or evicted. By default, OKD has two reserved priority classes for critical system pods to have guaranteed scheduling.
$ oc get priorityclasses
NAME VALUE GLOBAL-DEFAULT AGE system-node-critical 2000001000 false 72m system-cluster-critical 2000000000 false 72m openshift-user-critical 1000000000 false 3d13h cluster-logging 1000000 false 29s
system-node-critical - This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node. Examples of pods that have this priority class are
sdn, and so forth. A number of critical components include the
system-node-critical priority class by default, for example:
system-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the
system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth.
A number of critical components include the
system-cluster-critical priority class by default, for example:
openshift-user-critical - You can use the
priorityClassName field with important pods that cannot bind their resource consumption and do not have predictable resource consumption behavior. Prometheus pods under the
openshift-user-workload-monitoring namespaces use the
priorityClassName. Monitoring workloads use
system-critical as their first
priorityClass, but this causes problems when monitoring uses excessive memory and the nodes cannot evict them. As a result, monitoring drops priority to give the scheduler flexibility, moving heavy workloads around to keep critical nodes operating.
cluster-logging - This priority is used by Fluentd to make sure Fluentd pods are scheduled to nodes over other apps.
After you have one or more priority classes, you can create pods that specify a priority class name in a
Pod spec. The priority admission controller uses the priority class name field to populate the integer value of the priority. If the named priority class is not found, the pod is rejected.
When a developer creates a pod, the pod goes into a queue. If the developer configured the pod for pod priority or preemption, the scheduler picks a pod from the queue and tries to schedule the pod on a node. If the scheduler cannot find space on an appropriate node that satisfies all the specified requirements of the pod, preemption logic is triggered for the pending pod.
When the scheduler preempts one or more pods on a node, the
nominatedNodeName field of higher-priority
Pod spec is set to the name of the node, along with the
nodename field. The scheduler uses the
nominatedNodeName field to keep track of the resources reserved for pods and also provides information to the user about preemptions in the clusters.
After the scheduler preempts a lower-priority pod, the scheduler honors the graceful termination period of the pod. If another node becomes available while scheduler is waiting for the lower-priority pod to terminate, the scheduler can schedule the higher-priority pod on that node. As a result, the
nominatedNodeName field and
nodeName field of the
Pod spec might be different.
Also, if the scheduler preempts pods on a node and is waiting for termination, and a pod with a higher-priority pod than the pending pod needs to be scheduled, the scheduler can schedule the higher-priority pod instead. In such a case, the scheduler clears the
nominatedNodeName of the pending pod, making the pod eligible for another node.
Preemption does not necessarily remove all lower-priority pods from a node. The scheduler can schedule a pending pod by removing a portion of the lower-priority pods.
The scheduler considers a node for pod preemption only if the pending pod can be scheduled on the node.
Pods with the preemption policy set to
Never are placed in the scheduling queue ahead of lower-priority pods, but they cannot preempt other pods. A non-preempting pod waiting to be scheduled stays in the scheduling queue until sufficient resources are free and it can be scheduled. Non-preempting pods, like other pods, are subject to scheduler back-off. This means that if the scheduler tries unsuccessfully to schedule these pods, they are retried with lower frequency, allowing other pods with lower priority to be scheduled before them.
Non-preempting pods can still be preempted by other, high-priority pods.
If you enable pod priority and preemption, consider your other scheduler settings:
A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, OKD respects them when preempting pods at a best effort level. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.
Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.
If a pending pod has inter-pod affinity with one or more of the lower-priority pods on a node, the scheduler cannot preempt the lower-priority pods without violating the affinity requirements. In this case, the scheduler looks for another node to schedule the pending pod. However, there is no guarantee that the scheduler can find an appropriate node and pending pod might not be scheduled.
To prevent this situation, carefully configure pod affinity with equal-priority pods.
When preempting a pod, the scheduler waits for the pod graceful termination period to expire, allowing the pod to finish working and exit. If the pod does not exit after the period, the scheduler kills the pod. This graceful termination period creates a time gap between the point that the scheduler preempts the pod and the time when the pending pod can be scheduled on the node.
To minimize this gap, configure a small graceful termination period for lower-priority pods.
You apply pod priority and preemption by creating a priority class object and associating pods to the priority using the
priorityClassName in your
apiVersion: scheduling.k8s.io/v1 kind: PriorityClass metadata: name: high-priority (1) value: 1000000 (2) preemptionPolicy: PreemptLowerPriority (3) globalDefault: false (4) description: "This priority class should be used for XYZ service pods only." (5)
|1||The name of the priority class object.|
|2||The priority value of the object.|
|3||Optional field that indicates whether this priority class is preempting or non-preempting. The preemption policy defaults to
|4||Optional field that indicates whether this priority class should be used for pods without a priority class name specified. This field is
|5||Optional arbitrary text string that describes which pods developers should use with this priority class.|
To configure your cluster to use priority and preemption:
Create one or more priority classes:
Specify a name and value for the priority.
Optionally specify the
globalDefault field in the priority class and a description.
Pod spec or edit existing pods to include the name of a priority class, similar to the following:
Podspec with priority class name
apiVersion: v1 kind: Pod metadata: name: nginx labels: env: test spec: containers: - name: nginx image: nginx imagePullPolicy: IfNotPresent priorityClassName: high-priority (1)
|1||Specify the priority class to use with this pod.|
Create the pod:
$ oc create -f <file-name>.yaml
You can add the priority name directly to the pod configuration or to a pod template.