$ oc get priorityclasses
You can enable pod priority and preemption in your cluster. Pod priority indicates the importance of a pod relative to other pods and queues the pods based on that priority. Pod preemption allows the cluster to evict, or preempt, lower-priority pods so that higher-priority pods can be scheduled if there is no available space on a suitable node Pod priority also affects the scheduling order of pods and out-of-resource eviction ordering on the node.
To use priority and preemption, you create priority classes that define the relative weight of your pods. Then, reference a priority class in the pod specification to apply that weight for scheduling.
Preemption is controlled by the
disablePreemption parameter in the scheduler configuration file, which is set to
false by default.
When you use the Pod Priority and Preemption feature, the scheduler orders pending pods by their priority, and a pending pod is placed ahead of other pending pods with lower priority in the scheduling queue. As a result, the higher priority pod might be scheduled sooner than pods with lower priority if its scheduling requirements are met. If a pod cannot be scheduled, scheduler continues to schedule other lower priority pods.
You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.
A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Reserve numbers larger than one billion for critical pods that should not be preempted or evicted. By default, OKD has two reserved priority classes for critical system pods to have guaranteed scheduling.
$ oc get priorityclasses
NAME CREATED AT cluster-logging 2019-03-13T14:45:12Z system-cluster-critical 2019-03-13T14:01:10Z system-node-critical 2019-03-13T14:01:10Z
system-node-critical - This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node. Examples of pods that have this priority class are
sdn, and so forth. A number of critical components include the
system-node-critical priority class by default, for example:
system-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the
system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth.
A number of critical components include the
system-cluster-critical priority class by default, for example:
cluster-logging - This priority is used by Fluentd to make sure Fluentd pods are scheduled to nodes over other apps.
If you upgrade your existing cluster, the priority of your existing pods is effectively zero. However, existing pods with
After you have one or more priority classes, you can create pods that specify a priority class name in a pod specification. The priority admission controller uses the priority class name field to populate the integer value of the priority. If the named priority class is not found, the pod is rejected.
When a developer creates a pod, the pod goes into a queue. If the developer configured the pod for pod priority or preemption, the scheduler picks a pod from the queue and tries to schedule the pod on a node. If the scheduler cannot find space on an appropriate node that satisfies all the specified requirements of the pod, preemption logic is triggered for the pending pod.
When the scheduler preempts one or more pods on a node, the
nominatedNodeName field of higher-priority pod specification is set to the name of the node, along with the
nodename field. The scheduler uses the
nominatedNodeName field to keep track of the resources reserved for pods and also provides information to the user about preemptions in the clusters.
After the scheduler preempts a lower-priority pod, the scheduler honors the graceful termination period of the pod. If another node becomes available while scheduler is waiting for the lower-priority pod to terminate, the scheduler can schedule the higher-priority pod on that node. As a result, the
nominatedNodeName field and
nodeName field of the pod specification might be different.
Also, if the scheduler preempts pods on a node and is waiting for termination, and a pod with a higher-priority pod than the pending pod needs to be scheduled, the scheduler can schedule the higher-priority pod instead. In such a case, the scheduler clears the
nominatedNodeName of the pending pod, making the pod eligible for another node.
Preemption does not necessarily remove all lower-priority pods from a node. The scheduler can schedule a pending pod by removing a portion of the lower-priority pods.
The scheduler considers a node for pod preemption only if the pending pod can be scheduled on the node.
If you enable pod priority and preemption, consider your other scheduler settings:
A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, OKD respects them when preempting pods at a best effort level. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.
Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.
If a pending pod has inter-pod affinity with one or more of the lower-priority pods on a node, the scheduler cannot preempt the lower-priority pods without violating the affinity requirements. In this case, the scheduler looks for another node to schedule the pending pod. However, there is no guarantee that the scheduler can find an appropriate node and pending pod might not be scheduled.
To prevent this situation, carefully configure pod affinity with equal-priority pods.
When preempting a pod, the scheduler waits for the pod graceful termination period to expire, allowing the pod to finish working and exit. If the pod does not exit after the period, the scheduler kills the pod. This graceful termination period creates a time gap between the point that the scheduler preempts the pod and the time when the pending pod can be scheduled on the node.
To minimize this gap, configure a small graceful termination period for lower-priority pods.
You apply pod priority and preemption by creating a priority class object and associating pods to the priority using the
priorityClassName in your pod specifications.
apiVersion: scheduling.k8s.io/v1beta1 kind: PriorityClass metadata: name: high-priority (1) value: 1000000 (2) globalDefault: false (3) description: "This priority class should be used for XYZ service pods only." (4)
|1||The name of the priority class object.|
|2||The priority value of the object.|
|3||Optional field that indicates whether this priority class should be used for pods without a priority class name specified. This field is
|4||Optional arbitrary text string that describes which pods developers should use with this priority class.|
To configure your cluster to use priority and preemption:
Create one or more priority classes:
Specify a name and value for the priority.
Optionally specify the
globalDefault field in the priority class and a description.
Create a pod specification or edit existing pods to include the name of a priority class, similar to the following:
apiVersion: v1 kind: Pod metadata: name: nginx labels: env: test spec: containers: - name: nginx image: nginx imagePullPolicy: IfNotPresent priorityClassName: high-priority (1)
|1||Specify the priority class to use with this pod.|
Create the pod:
$ oc create -f <file-name>.yaml
You can add the priority name directly to the pod configuration or to a pod template.
You can disable the pod priority and preemption feature.
After the feature is disabled, the existing pods keep their priority fields, but preemption is disabled, and priority fields are ignored. If the feature is disabled, you cannot set a priority class name in new pods.
Critical pods rely on scheduler preemption to be scheduled when a cluster is under resource pressure. For this reason, Red Hat recommends not disabling preemption. DaemonSet pods are scheduled by the DaemonSet controller and not affected by disabling preemption.
To disable the preemption for the cluster:
Edit the Scheduler Operator Custom Resource to add the
disablePreemption: true parameter:
$ oc edit scheduler cluster
+ .Example output
apiVersion: config.openshift.io/v1 kind: Scheduler metadata: creationTimestamp: '2019-03-12T01:45:02Z' generation: 1 name: example resourceVersion: '1882034' selfLink: /apis/config.openshift.io/v1/schedulers/example uid: 743701e9-4468-11e9-bd34-02a7fe1bf828 spec: disablePreemption: true