Descheduling - Scheduling | Cluster Administration

Overview
Creating a Cluster Role
Creating Descheduler Policies
Create a Configuration Map for the Descheduler Policy
Create the Job Specification
Run the Descheduler

Overview

Descheduling involves evicting pods based on specific policies so that the pods can be rescheduled onto more appropriate nodes.

Your cluster can benefit from descheduling and rescheduling already-running pods for various reasons:

Nodes are under- or over-utilized.
Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
Node failure requires pods to be moved.
New nodes are added to clusters.

The descheduler does not schedule replacement of evicted pods. The scheduler automatically performs this task for the evicted pods.

A number of core components, such as Heapster and DNS, are critical to a fully functional cluster but run on a regular cluster node rather than on the master. A cluster might stop working properly if such a component is evicted. To prevent the descheduler from removing these pods, configure each pod as a critical pod by adding the scheduler.alpha.kubernetes.io/critical-pod annotation to the pod specification.

The descheduler job is considered a critical pod, which prevents the descheduler pod from being evicted by the descheduler.

The descheduler job and descheduler pod are created in the kube-system project, which is created by default.

The descheduler is a Technology Preview feature only.

The descheduler does not evict the following types of pods:

Critical pods (with the scheduler.alpha.kubernetes.io/critical-pod annotation).
Pods (static and mirror pods or pods in standalone mode) not associated with a Replica Set, Replication Controller, Deployment, or Job (because these pods are not recreated).
Pods associated with DaemonSets.
Pods with local storage.
Pods subject to a Pod Disruption Budget (PDB) are not evicted if descheduling would violate the PDB. The descheduler evicts pods through the pod eviction subresource, which enforces the PDB.

BestEffort pods are evicted before Burstable and Guaranteed pods.
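The eviction rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this section, not the descheduler's actual code: the `Pod` record, its field names, and the `eviction_order` helper are hypothetical, and the Pod Disruption Budget check is left out because it is enforced by the cluster at eviction time.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

CRITICAL_ANNOTATION = "scheduler.alpha.kubernetes.io/critical-pod"

# Owners that recreate an evicted pod, as listed in this section.
RECREATING_OWNERS = {"ReplicaSet", "ReplicationController", "Deployment", "Job"}

# Lower rank is evicted first: BestEffort, then Burstable, then Guaranteed.
QOS_EVICTION_RANK = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


@dataclass
class Pod:
    """Hypothetical, simplified pod record (not the Kubernetes API object)."""
    name: str
    qos_class: str
    owner_kind: Optional[str] = None  # None models static, mirror, or standalone pods
    annotations: Dict[str, str] = field(default_factory=dict)
    has_local_storage: bool = False


def is_evictable(pod: Pod) -> bool:
    """Apply the exclusions listed above (the PDB check is not modeled)."""
    if CRITICAL_ANNOTATION in pod.annotations:
        return False  # critical pod
    if pod.owner_kind == "DaemonSet":
        return False  # DaemonSet pods are never evicted
    if pod.owner_kind not in RECREATING_OWNERS:
        return False  # nothing would recreate this pod
    if pod.has_local_storage:
        return False
    return True


def eviction_order(pods: List[Pod]) -> List[Pod]:
    """Return evictable pods, BestEffort first, Guaranteed last."""
    candidates = [p for p in pods if is_evictable(p)]
    return sorted(candidates, key=lambda p: QOS_EVICTION_RANK[p.qos_class])
```

For a mix of critical, DaemonSet, standalone, and ordinary pods, `eviction_order` returns only the ordinary pods, sorted by QoS class.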

The following sections describe the process to configure and run the descheduler:

Create a role.
Define the descheduling behavior in a policy file.
Create a configuration map to reference the policy file.
Create the descheduler job configuration.
Run the descheduler job.

Creating a Cluster Role

To configure the necessary permissions for the descheduler to work in a pod:

Create a cluster role with the following rules:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: descheduler-cluster-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"] (1)
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"] (2)
- apiGroups: [""]
  resources: ["pods/eviction"] (3)
  verbs: ["create"]

1	Configures the role to allow viewing nodes.
2	Configures the role to allow viewing and deleting pods.
3	Allows the descheduler to create eviction requests for pods.

Create the service account that is used to run the job:

# oc create sa <service-account-name> -n kube-system

For example:

# oc create sa descheduler-sa -n kube-system

Bind the cluster role to the service account:

# oc create clusterrolebinding descheduler-cluster-role-binding \
    --clusterrole=<cluster-role-name> \
    --serviceaccount=kube-system:<service-account-name>

For example:

# oc create clusterrolebinding descheduler-cluster-role-binding \
    --clusterrole=descheduler-cluster-role \
    --serviceaccount=kube-system:descheduler-sa

Creating Descheduler Policies

You can configure the descheduler to remove pods from nodes that violate rules defined by strategies in a YAML policy file. Include a path to the policy file in the job specification to apply the specific descheduling strategy.

Sample descheduler policy file

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
     enabled: false
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           "cpu" : 20
           "memory": 20
           "pods": 20
         targetThresholds:
           "cpu" : 50
           "memory": 50
           "pods": 50
         numberOfNodes: 3
  "RemovePodsViolatingInterPodAntiAffinity":
     enabled: true

There are three default strategies that can be used with the descheduler:

Remove duplicate pods (RemoveDuplicates)
Move pods to underutilized nodes (LowNodeUtilization)
Remove pods that violate anti-affinity rules (RemovePodsViolatingInterPodAntiAffinity)

You can configure and disable parameters associated with strategies as needed.

Removing Duplicate Pods

The RemoveDuplicates strategy ensures that only one pod associated with a Replica Set, Replication Controller, Deployment Configuration, or Job runs on the same node. If other pods associated with those objects run on that node, the duplicate pods are evicted. Removing duplicate pods results in better spreading of pods in a cluster.

For example, duplicate pods can occur if a node fails and its pods are moved to another node, leaving more than one pod associated with a Replica Set or Replication Controller running on the same node. After the failed node is ready again, this strategy can be used to evict those duplicate pods.

There are no parameters associated with this strategy.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
     enabled: false (1)

1	Set this value to `enabled: true` to use this policy. Set to `false` to disable this policy.
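As a rough illustration of what "duplicate" means here, the following Python sketch groups pods by node and owner and reports every pod after the first in each group. The `Pod` tuple, the `owner` string format, and the choice to keep the first pod encountered are assumptions made for the example; the descheduler itself may select differently.

```python
from typing import List, NamedTuple


class Pod(NamedTuple):
    """Hypothetical, simplified pod record."""
    name: str
    node: str
    owner: str  # for example "ReplicaSet/web"


def find_duplicates(pods: List[Pod]) -> List[Pod]:
    """Return every pod beyond the first for each (node, owner) pair."""
    seen = set()
    duplicates = []
    for pod in pods:
        key = (pod.node, pod.owner)
        if key in seen:
            duplicates.append(pod)  # same owner already has a pod on this node
        else:
            seen.add(key)
    return duplicates


pods = [
    Pod("web-1", "node1", "ReplicaSet/web"),
    Pod("web-2", "node1", "ReplicaSet/web"),  # duplicate of web-1 on node1
    Pod("web-3", "node2", "ReplicaSet/web"),
    Pod("db-1", "node1", "ReplicaSet/db"),
]
duplicate_names = [p.name for p in find_duplicates(pods)]
```

Pods from the same owner on different nodes, such as `web-3` here, are not duplicates; only co-located replicas count.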

Creating a Low Node Utilization Policy

The LowNodeUtilization strategy finds nodes that are underutilized and evicts pods from other nodes so that the evicted pods can be scheduled on these underutilized nodes.

Node underutilization is determined by a configurable set of thresholds, thresholds, for CPU, memory, and number of pods, each expressed as a percentage. If a node's usage is below all of these thresholds, the node is considered underutilized and the descheduler can evict pods from other nodes so that they can be rescheduled onto it. Pod resource requests are considered when computing node resource utilization.

A higher set of thresholds, targetThresholds, is used to determine properly utilized nodes. Any node whose usage is between thresholds and targetThresholds is considered properly utilized and is not considered for eviction. Like thresholds, targetThresholds can be configured for CPU, memory, and number of pods, each expressed as a percentage.

Tune these thresholds to your cluster requirements.

The numberOfNodes parameter can be configured to activate the strategy only when the number of underutilized nodes is above the configured value. Set this parameter if it is acceptable for a few nodes to remain underutilized. By default, numberOfNodes is set to zero.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds: (1)
           "cpu" : 20
           "memory": 20
           "pods": 20
         targetThresholds: (2)
           "cpu" : 50
           "memory": 50
           "pods": 50
         numberOfNodes: 3 (3)

1	Set the low-end threshold. If the node is below all three values, the descheduler considers the node underutilized.
2	Set the high-end threshold. If the node is below these values and above the `thresholds` values, the descheduler considers the node properly utilized.
3	Set the number of nodes that can be underutilized before the descheduler evicts pods from other nodes.
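The threshold logic described in this section can be sketched as follows. The function names are hypothetical, boundary values are treated as "properly utilized", and the rule that a node counts as overutilized when any single resource exceeds its target is an assumption of this sketch rather than something this section states.

```python
from typing import Dict

Usage = Dict[str, float]  # percentages keyed by "cpu", "memory", "pods"


def classify_node(usage: Usage, thresholds: Usage, target_thresholds: Usage) -> str:
    """Classify one node from its usage percentages."""
    # Underutilized only if usage is below ALL low-end thresholds.
    if all(usage[r] < thresholds[r] for r in thresholds):
        return "underutilized"
    # Assumption: above ANY target threshold means pods may be evicted from it.
    if any(usage[r] > target_thresholds[r] for r in target_thresholds):
        return "overutilized"
    return "properly utilized"


def strategy_is_active(nodes: Dict[str, Usage], thresholds: Usage,
                       target_thresholds: Usage, number_of_nodes: int = 0) -> bool:
    """The strategy runs only if more than number_of_nodes nodes are underutilized."""
    underutilized = sum(
        1 for usage in nodes.values()
        if classify_node(usage, thresholds, target_thresholds) == "underutilized"
    )
    return underutilized > number_of_nodes


# Values taken from the sample policy above.
thresholds = {"cpu": 20, "memory": 20, "pods": 20}
target_thresholds = {"cpu": 50, "memory": 50, "pods": 50}
nodes = {
    "node1": {"cpu": 10, "memory": 15, "pods": 5},   # below all low thresholds
    "node2": {"cpu": 10, "memory": 30, "pods": 5},   # memory is above its low threshold
    "node3": {"cpu": 70, "memory": 30, "pods": 40},  # cpu is above its target
}
```

With the sample policy's `numberOfNodes: 3`, a cluster with a single underutilized node, as in this example, would not trigger the strategy; with the default of zero it would.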

Remove Pods Violating Inter-Pod Anti-Affinity

The RemovePodsViolatingInterPodAntiAffinity strategy ensures that pods violating inter-pod anti-affinity are removed from nodes.

For example, Node1 has podA, podB, and podC. podB and podC have anti-affinity rules that prohibit them from running on the same node as podA. podA is evicted from the node so that podB and podC can run on that node. This situation can occur if the anti-affinity rules were created while podB and podC were already running on the node.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
     enabled: true (1)

1	Set this value to `enabled: true` to use this policy. Set to `false` to disable this policy.
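The podA, podB, podC example can be traced with a small Python sketch. It models anti-affinity as a plain label match and follows the narrative of the example in this section (the pod that other pods refuse to run beside is the one evicted); the `Pod` record, the `anti_affinity` field, and both helper functions are hypothetical and are not taken from the descheduler implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pod:
    """Hypothetical, simplified pod record."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    # Labels this pod refuses to share a node with (simplified anti-affinity rule).
    anti_affinity: Dict[str, str] = field(default_factory=dict)


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True if the selector is non-empty and every key/value is present in labels."""
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def pods_to_evict(pods_on_node: List[Pod]) -> List[Pod]:
    """Return pods that some other pod on the same node has a rule against."""
    evict = []
    for pod in pods_on_node:
        if any(other is not pod and matches(other.anti_affinity, pod.labels)
               for other in pods_on_node):
            evict.append(pod)
    return evict


node1 = [
    Pod("podA", labels={"app": "a"}),
    Pod("podB", labels={"app": "b"}, anti_affinity={"app": "a"}),
    Pod("podC", labels={"app": "c"}, anti_affinity={"app": "a"}),
]
evicted_names = [p.name for p in pods_to_evict(node1)]
```

Real anti-affinity rules also carry a topology key and namespace scope, which this sketch ignores.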

Create a Configuration Map for the Descheduler Policy

Create a configuration map for the descheduler policy file in the kube-system project, so that it can be referenced by the descheduler job.

# oc create configmap descheduler-policy-configmap \
     -n kube-system --from-file=<path-to-policy-dir/policy.yaml> (1)

1	The path to the policy file you created.

Create the Job Specification

Create a job configuration for the descheduler.

apiVersion: batch/v1
kind: Job
metadata:
  name: descheduler-job
  namespace: kube-system
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: descheduler-pod (1)
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: "true" (2)
    spec:
        containers:
        - name: descheduler
          image: descheduler
          volumeMounts: (3)
          - mountPath: /policy-dir
            name: policy-volume
          command:
          - "/bin/sh"
          - "-ec"
          - |
            /bin/descheduler --policy-config-file /policy-dir/policy.yaml (4)
        restartPolicy: "Never"
        serviceAccountName: descheduler-sa (5)
        volumes:
        - name: policy-volume
          configMap:
            name: descheduler-policy-configmap

1	Specify a name for the descheduler pod.
2	Configures the pod so that it will not be descheduled.
3	The volume name and the mount path in the container where the policy volume is mounted.
4	Path in the container where the policy file you created is stored.
5	Specify the name of the service account you created.

The policy file is mounted as a volume from the configuration map.
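Because each key in a configuration map becomes a file name under the mount path, the file passed to --from-file must produce the same file name that the job's --policy-config-file argument expects. The Python sketch below shows that mapping; `mounted_policy_path` is a hypothetical helper and the example paths are illustrative only.

```python
import posixpath


def mounted_policy_path(from_file: str, mount_path: str) -> str:
    """Return the in-container path for a --from-file value.

    Accepts either "path" (the key defaults to the file's base name)
    or "key=path" (the key is given explicitly).
    """
    if "=" in from_file:
        key, _path = from_file.split("=", 1)
    else:
        key = posixpath.basename(from_file)
    return posixpath.join(mount_path, key)


# Illustrative paths only; substitute the location of your own policy file.
default_key = mounted_policy_path("/home/admin/policies/policy.yaml", "/policy-dir")
explicit_key = mounted_policy_path("policy.yaml=/tmp/my-policy.yaml", "/policy-dir")
```

Both forms resolve to /policy-dir/policy.yaml, which matches the path used in the job specification above. A policy file with a different base name would be mounted under that name instead, and the job command would need to be adjusted to match.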

Run the Descheduler

To run the descheduler as a job in a pod:

# oc create -f <file-name>.yaml

For example:

# oc create -f descheduler-job.yaml