The cluster autoscaler adjusts the size of an OKD cluster to meet its current deployment needs. It uses declarative, Kubernetes-style arguments to provide infrastructure management that does not rely on objects of a specific cloud provider. The cluster autoscaler has a cluster scope, and is not associated with a particular namespace.
The cluster autoscaler increases the size of the cluster when there are pods that fail to schedule on any of the current worker nodes due to insufficient resources or when another node is necessary to meet deployment needs. The cluster autoscaler does not increase the cluster resources beyond the limits that you specify.
The cluster autoscaler computes the total memory, CPU, and GPU on all nodes in the cluster, even though it does not manage the control plane nodes. These values are not single-machine oriented. They are an aggregation of all the resources in the entire cluster. For example, if you set the maximum memory resource limit, the cluster autoscaler includes all the nodes in the cluster when calculating the current memory usage. That calculation is then used to determine if the cluster autoscaler has the capacity to add more worker resources.
|
Ensure that the maxNodesTotal value in the ClusterAutoscaler resource definition that you create is large enough to account for the total possible number of machines in your cluster. This value must encompass the number of control plane machines and the possible number of compute machines that you might scale to.
|
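As a rough illustration of sizing maxNodesTotal, the following sketch adds up every machine the cluster could contain. The function name and the split between fixed and autoscaled compute machines are assumptions made for this example; they are not part of the ClusterAutoscaler API.

```python
def required_max_nodes_total(control_plane, fixed_compute, autoscaled_max_replicas):
    """Return the smallest maxNodesTotal that covers every possible machine.

    control_plane: number of control plane machines
    fixed_compute: compute machines that are not autoscaled (hypothetical split)
    autoscaled_max_replicas: maximum replicas of each autoscaled machine set
    """
    return control_plane + fixed_compute + sum(autoscaled_max_replicas)


# Three control plane machines, two fixed compute machines, and three
# autoscaled machine sets that can grow to 6, 6, and 4 replicas.
print(required_max_nodes_total(3, 2, [6, 6, 4]))
```

Any maxNodesTotal value below the printed number would stop the cluster autoscaler before the machine sets reach their configured maximums.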
Every 10 seconds, the cluster autoscaler checks which nodes are unnecessary in the cluster and removes them. The cluster autoscaler considers a node for removal if the following conditions apply:
-
The node utilization is less than the node utilization level threshold for the cluster. The node utilization level is the sum of the requested resources divided by the allocated resources for the node. If you do not specify a value in the ClusterAutoscaler custom resource, the cluster autoscaler uses a default value of 0.5, which corresponds to 50% utilization.
-
The cluster autoscaler can move all pods running on the node to the other nodes. The Kubernetes scheduler is responsible for scheduling pods on the nodes.
-
The node does not have the scale down disabled annotation.
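The three conditions above can be sketched as a single check. This is a simplified model, not the autoscaler's implementation: it looks at one resource only (CPU in millicores), and the function and parameter names are assumptions made for illustration.

```python
def is_removal_candidate(pod_cpu_requests_m, allocatable_cpu_m, threshold=0.5,
                         pods_can_move=True, scale_down_disabled=False):
    """Return True if a node meets all three removal conditions.

    Utilization is the sum of requested CPU divided by the CPU that the
    node can allocate, which mirrors the definition in the text.
    """
    utilization = sum(pod_cpu_requests_m) / allocatable_cpu_m
    return (utilization < threshold
            and pods_can_move
            and not scale_down_disabled)


# 1500m requested on a 4000m node is 37.5% utilization, below the 0.5 default.
print(is_removal_candidate([500, 1000], 4000))
# 2500m requested is 62.5% utilization, so the node is kept.
print(is_removal_candidate([1500, 1000], 4000))
```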
If the following types of pods are present on a node, the cluster autoscaler will not remove the node:
-
Pods with restrictive pod disruption budgets (PDBs).
-
Kube-system pods that do not run on the node by default.
-
Kube-system pods that do not have a PDB or have a PDB that is too restrictive.
-
Pods that are not backed by a controller object such as a deployment, replica set, or stateful set.
-
Pods with local storage.
-
Pods that cannot be moved elsewhere because of a lack of resources, incompatible node selectors or affinity, matching anti-affinity, and so on.
-
Pods that have a "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation, unless they also have a "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation.
For example, you set the maximum CPU limit to 64 cores and configure the cluster autoscaler to only create machines that have 8 cores each. If your cluster starts with 30 cores, the cluster autoscaler can add up to 4 more nodes, which contribute 32 cores, for a total of 62 cores. A fifth node would raise the total to 70 cores, which exceeds the limit.
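The arithmetic in this example can be checked with a few lines of Python. The function name is hypothetical, and the sketch assumes that every new node has the same core count.

```python
def max_additional_nodes(current_cores, cores_per_node, max_cores):
    """Return how many whole nodes fit under the cluster-wide core limit."""
    headroom = max_cores - current_cores
    # Integer division discards any partial node; never return a negative count.
    return max(0, headroom // cores_per_node)


added = max_additional_nodes(current_cores=30, cores_per_node=8, max_cores=64)
print(added)            # number of nodes that can be added
print(30 + added * 8)   # resulting core total
```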
If you configure the cluster autoscaler, additional usage restrictions apply:
-
Do not modify the nodes that are in autoscaled node groups directly. All nodes within the same node group have the same capacity and labels and run the same system pods.
-
Specify requests for your pods.
-
If you have to prevent pods from being deleted too quickly, configure appropriate PDBs.
-
Confirm that your cloud provider quota is large enough to support the maximum node pools that you configure.
-
Do not run additional node group autoscalers, especially the ones offered by your cloud provider.
The horizontal pod autoscaler (HPA) and the cluster autoscaler modify cluster resources in different ways. The HPA changes the deployment’s or replica set’s number of replicas based on the current CPU load. If the load increases, the HPA creates new replicas, regardless of the amount of resources available to the cluster. If there are not enough resources, the cluster autoscaler adds resources so that the HPA-created pods can run. If the load decreases, the HPA stops some replicas. If this action causes some nodes to be underutilized or completely empty, the cluster autoscaler deletes the unnecessary nodes.
The cluster autoscaler takes pod priorities into account. The Pod Priority and Preemption feature enables scheduling pods based on priorities if the cluster does not have enough resources, but the cluster autoscaler ensures that the cluster has resources to run all pods. To honor the intention of both features, the cluster autoscaler includes a priority cutoff function. You can use this cutoff to schedule "best-effort" pods, which do not cause the cluster autoscaler to increase resources but instead run only when spare resources are available.
Pods with priority lower than the cutoff value do not cause the cluster to scale up or prevent the cluster from scaling down. No new nodes are added to run the pods, and nodes running these pods might be deleted to free resources.
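A minimal sketch of the cutoff behavior follows. The function name and pod names are hypothetical, and treating a priority exactly equal to the cutoff as eligible is an assumption of this sketch.

```python
def triggers_scale_up(pod_priority, cutoff):
    """Return True if a pending pod may cause the cluster to add nodes.

    Pods with a priority lower than the cutoff are treated as best-effort.
    Assumption: a priority equal to the cutoff is treated as eligible.
    """
    return pod_priority >= cutoff


cutoff = -10
# Hypothetical pending pods mapped to their priority values.
pending = {"web": 0, "batch-best-effort": -100}
for name, priority in pending.items():
    print(name, triggers_scale_up(priority, cutoff))
```

With a cutoff of -10, only the pod named web can cause a scale up. The best-effort pod waits for spare capacity.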
Cluster autoscaling is supported on platforms where the Machine API is available.
Configuring the cluster autoscaler
First, deploy the cluster autoscaler to manage automatic resource scaling in your OKD cluster.
|
Because the cluster autoscaler is scoped to the entire cluster, you can create only one cluster autoscaler for the cluster.
|
Cluster autoscaler resource definition
This ClusterAutoscaler resource definition shows the parameters and sample values for the cluster autoscaler.
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10 (1)
  resourceLimits:
    maxNodesTotal: 24 (2)
    cores:
      min: 8 (3)
      max: 128 (4)
    memory:
      min: 4 (5)
      max: 256 (6)
    gpus:
    - type: nvidia.com/gpu (7)
      min: 0 (8)
      max: 16 (9)
    - type: amd.com/gpu
      min: 0
      max: 4
  logVerbosity: 4 (10)
  scaleDown: (11)
    enabled: true (12)
    delayAfterAdd: 10m (13)
    delayAfterDelete: 5m (14)
    delayAfterFailure: 30s (15)
    unneededTime: 5m (16)
    utilizationThreshold: "0.4" (17)
  expanders: ["Random"] (18)
1 |
Specify the priority that a pod must exceed to cause the cluster autoscaler to deploy additional nodes. Enter a 32-bit integer value. The podPriorityThreshold value is compared to the value of the PriorityClass that you assign to each pod. |
2 |
Specify the maximum number of nodes to deploy. This value is the total number of machines that are deployed in your cluster, not just the ones that the autoscaler controls. Ensure that this value is large enough to account for all of your control plane and compute machines and the total number of replicas that you specify in your MachineAutoscaler resources. |
3 |
Specify the minimum number of cores to deploy in the cluster. |
4 |
Specify the maximum number of cores to deploy in the cluster. |
5 |
Specify the minimum amount of memory, in GiB, in the cluster. |
6 |
Specify the maximum amount of memory, in GiB, in the cluster. |
7 |
Optional: Specify the type of GPU node to deploy. Only nvidia.com/gpu and amd.com/gpu are valid types. |
8 |
Specify the minimum number of GPUs to deploy in the cluster. |
9 |
Specify the maximum number of GPUs to deploy in the cluster. |
10 |
Specify the logging verbosity level between 0 and 10. The following log level thresholds are provided for guidance:
-
1: (Default) Basic information about changes.
-
4: Debug-level verbosity for troubleshooting typical issues.
-
9: Extensive, protocol-level debugging information.
If you do not specify a value, the default value of 1 is used.
|
11 |
In this section, you can specify the period to wait for each action by using any valid ParseDuration interval, including ns, us, ms, s, m, and h. |
12 |
Specify whether the cluster autoscaler can remove unnecessary nodes. |
13 |
Optional: Specify the period to wait before deleting a node after a node has recently been added. If you do not specify a value, the default value of 10m is used. |
14 |
Optional: Specify the period to wait before deleting a node after a node has recently been deleted. If you do not specify a value, the default value of 0s is used. |
15 |
Optional: Specify the period to wait before deleting a node after a scale down failure occurred. If you do not specify a value, the default value of 3m is used. |
16 |
Optional: Specify a period of time before an unnecessary node is eligible for deletion. If you do not specify a value, the default value of 10m is used. |
17 |
Optional: Specify the node utilization level. Nodes below this utilization level are eligible for deletion.
The node utilization level is the sum of the requested resources divided by the allocated resources for the node, and must be a value greater than "0" but less than "1". If you do not specify a value, the cluster autoscaler uses a default value of "0.5", which corresponds to 50% utilization. You must express this value as a string.
|
18 |
Optional: Specify any expanders that you want the cluster autoscaler to use.
The following values are valid:
-
LeastWaste: Selects the machine set that minimizes the idle CPU after scaling.
If multiple machine sets would yield the same amount of idle CPU, the selection minimizes unused memory.
-
Priority: Selects the machine set with the highest user-assigned priority.
To use this expander, you must create a config map that defines the priority of your machine sets.
For more information, see "Configuring a priority expander for the cluster autoscaler."
-
Random: (Default) Selects the machine set randomly.
If you do not specify a value, the default value of Random is used.
You can specify multiple expanders by using the [LeastWaste, Priority] format.
The cluster autoscaler applies each expander according to the specified order.
In the [LeastWaste, Priority] example, the cluster autoscaler first evaluates according to the LeastWaste criteria.
If more than one machine set satisfies the LeastWaste criteria equally well, the cluster autoscaler then evaluates according to the Priority criteria.
If more than one machine set satisfies all of the specified expanders equally well, the cluster autoscaler selects one to use at random.
|
|
When performing a scaling operation, the cluster autoscaler remains within the ranges set in the ClusterAutoscaler resource definition, such as the minimum and maximum number of cores to deploy or the amount of memory in the cluster. However, the cluster autoscaler does not correct the current values in your cluster to be within those ranges.
The minimum and maximum CPUs, memory, and GPU values are determined by calculating those resources on all nodes in the cluster, even if the cluster autoscaler does not manage the nodes. For example, the control plane nodes are considered in the total memory in the cluster, even though the cluster autoscaler does not manage the control plane nodes.
|
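The expander ordering described for the expanders parameter can be sketched as a chain of filters. This is an illustrative model only: the candidate fields (idle_cpu, idle_mem, priority) and all function names are assumptions, not the autoscaler's internal data structures.

```python
import random


def least_waste(candidates):
    """Keep the candidates with the least idle CPU, then the least unused memory."""
    best = min((c["idle_cpu"], c["idle_mem"]) for c in candidates)
    return [c for c in candidates if (c["idle_cpu"], c["idle_mem"]) == best]


def priority(candidates):
    """Keep the candidates with the highest user-assigned priority."""
    best = max(c["priority"] for c in candidates)
    return [c for c in candidates if c["priority"] == best]


EXPANDERS = {"LeastWaste": least_waste, "Priority": priority}


def choose_machine_set(candidates, expanders, rng=random):
    """Apply each expander in order, then pick at random among any ties."""
    remaining = list(candidates)
    for name in expanders:
        if name == "Random" or len(remaining) == 1:
            break
        remaining = EXPANDERS[name](remaining)
    return rng.choice(remaining)["name"]


# Hypothetical machine sets with made-up post-scaling figures.
candidates = [
    {"name": "a", "idle_cpu": 2, "idle_mem": 4, "priority": 10},
    {"name": "b", "idle_cpu": 2, "idle_mem": 4, "priority": 40},
    {"name": "c", "idle_cpu": 3, "idle_mem": 1, "priority": 40},
]
print(choose_machine_set(candidates, ["LeastWaste", "Priority"]))
```

In this data, LeastWaste leaves a and b tied, and Priority then selects b. With only the Priority expander, b and c remain tied and the final choice is random.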
Configuring a priority expander for the cluster autoscaler
When the cluster autoscaler uses the priority expander, it scales up by using the machine set with the highest user-assigned priority.
To use this expander, you must create a config map that defines the priority of your machine sets.
For each specified priority level, you must create regular expressions to identify machine sets that you want to use when prioritizing a machine set for selection.
The regular expressions must match the name of any compute machine set that you want the cluster autoscaler to consider for selection.
Prerequisites
-
You have deployed an OKD cluster that uses the Machine API.
-
You have access to the cluster using an account with cluster-admin permissions.
-
You have installed the OpenShift CLI (oc).
Procedure
-
List the compute machine sets on your cluster by running the following command:
$ oc get machinesets.machine.openshift.io
Example output
NAME DESIRED CURRENT READY AVAILABLE AGE
archive-agl030519-vplxk-worker-us-east-1c 1 1 1 1 25m
fast-01-agl030519-vplxk-worker-us-east-1a 1 1 1 1 55m
fast-02-agl030519-vplxk-worker-us-east-1a 1 1 1 1 55m
fast-03-agl030519-vplxk-worker-us-east-1b 1 1 1 1 55m
fast-04-agl030519-vplxk-worker-us-east-1b 1 1 1 1 55m
prod-01-agl030519-vplxk-worker-us-east-1a 1 1 1 1 33m
prod-02-agl030519-vplxk-worker-us-east-1c 1 1 1 1 33m
-
Using regular expressions, construct one or more patterns that match the name of any compute machine set that you want to set a priority level for.
For example, use the regular expression pattern .*fast.* to match any compute machine set that includes the string fast in its name.
-
Create a cluster-autoscaler-priority-expander.yml YAML file that defines a config map similar to the following:
Example priority expander config map
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander (1)
  namespace: openshift-machine-api (2)
data:
  priorities: |- (3)
    10:
      - .*fast.*
      - .*archive.*
    40:
      - .*prod.*
1 |
You must name the config map cluster-autoscaler-priority-expander. |
2 |
You must create the config map in the same namespace as the cluster autoscaler pod, which is the openshift-machine-api namespace. |
3 |
Define the priority of your machine sets.
The priorities values must be positive integers.
The cluster autoscaler uses higher-value priorities before lower-value priorities.
For each priority level, specify the regular expressions that correspond to the machine sets you want to use.
|
-
Create the config map by running the following command:
$ oc create -f <location_of_config_map_file>/cluster-autoscaler-priority-expander.yml
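To see which machine sets the example priorities would select, the following sketch applies the same regular expressions to the machine set names from the example output. The function name is hypothetical, and the use of re.search is an assumption of this sketch, because the text does not state how the autoscaler anchors patterns.

```python
import re

# Priorities from the example config map: higher values are used first.
priorities = {
    10: [r".*fast.*", r".*archive.*"],
    40: [r".*prod.*"],
}

machine_sets = [
    "archive-agl030519-vplxk-worker-us-east-1c",
    "fast-01-agl030519-vplxk-worker-us-east-1a",
    "fast-02-agl030519-vplxk-worker-us-east-1a",
    "fast-03-agl030519-vplxk-worker-us-east-1b",
    "fast-04-agl030519-vplxk-worker-us-east-1b",
    "prod-01-agl030519-vplxk-worker-us-east-1a",
    "prod-02-agl030519-vplxk-worker-us-east-1c",
]


def highest_priority_matches(priorities, names):
    """Return the highest priority level with a match and the matching names."""
    for level in sorted(priorities, reverse=True):
        matched = [n for n in names
                   if any(re.search(p, n) for p in priorities[level])]
        if matched:
            return level, matched
    return None, []


level, matched = highest_priority_matches(priorities, machine_sets)
print(level, matched)
```

With these inputs, the two prod machine sets are selected at priority 40. The fast and archive machine sets are considered only when no prod machine set is a candidate.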
Deploying a cluster autoscaler
To deploy a cluster autoscaler, you create an instance of the ClusterAutoscaler resource.
Procedure
-
Create a YAML file for a ClusterAutoscaler resource that contains the custom resource definition.
-
Create the custom resource in the cluster by running the following command:
$ oc create -f <filename>.yaml (1)
1 |
<filename> is the name of the custom resource file. |