After installing OKD, you can further expand and customize your cluster to your requirements through certain node tasks.

Adding RHEL compute machines to an OKD cluster

Understand and work with RHEL compute nodes.

About adding RHEL compute nodes to a cluster

In OKD Latest, you have the option of using Red Hat Enterprise Linux (RHEL) machines as compute machines, which are also known as worker machines, in your cluster if you use a user-provisioned infrastructure installation. You must use Fedora CoreOS (FCOS) machines for the control plane, or master, machines in your cluster.

As with all installations that use user-provisioned infrastructure, if you choose to use RHEL compute machines in your cluster, you take responsibility for all operating system life cycle management and maintenance, including performing system updates, applying patches, and completing all other required tasks.

Because removing OKD from a machine in the cluster requires destroying the operating system, you must use dedicated hardware for any RHEL machines that you add to the cluster.

Swap memory is disabled on all RHEL machines that you add to your OKD cluster. You cannot enable swap memory on these machines.

You must add any RHEL compute machines to the cluster after you initialize the control plane.

System requirements for RHEL compute nodes

The Red Hat Enterprise Linux (RHEL) compute machine hosts, which are also known as worker machine hosts, in your OKD environment must meet the following minimum hardware specifications and system-level requirements.

  • You must have an active OKD subscription on your Red Hat account. If you do not, contact your sales representative for more information.

  • Production environments must provide compute machines to support your expected workloads. As a cluster administrator, you must calculate the expected workload and add about 10 percent for overhead. For production environments, allocate enough resources so that a node host failure does not affect your maximum capacity.

  • Each system must meet the following hardware requirements:

    • Physical or virtual system, or an instance running on a public or private IaaS.

    • Base OS: Fedora 21, CentOS 7.4, or RHEL 7.7-7.8 with "Minimal" installation option.

    • NetworkManager 1.0 or later.

    • 1 vCPU.

    • Minimum 8 GB RAM.

    • Minimum 15 GB hard disk space for the file system containing /var/.

    • Minimum 1 GB hard disk space for the file system containing /usr/local/bin/.

    • Minimum 1 GB hard disk space for the file system containing the system’s temporary directory. The system’s temporary directory is determined according to the rules defined in the tempfile module in Python’s standard library.

  • Each system must meet any additional requirements for your system provider. For example, if you installed your cluster on VMware vSphere, your disks must be configured according to its storage guidelines and the disk.enableUUID=true attribute must be set.

  • Each system must be able to access the cluster’s API endpoints by using DNS-resolvable host names. Any network security access control that is in place must allow the system access to the cluster’s API service endpoints.

Certificate signing requests management

Because your cluster has limited access to automatic machine management when you use infrastructure that you provision, you must provide a mechanism for approving cluster certificate signing requests (CSRs) after installation. The kube-controller-manager only approves the kubelet client CSRs. The machine-approver cannot guarantee the validity of a serving certificate that is requested by using kubelet credentials because it cannot confirm that the correct machine issued the request. You must determine and implement a method of verifying the validity of the kubelet serving certificate requests and approving them.

Preparing the machine to run the playbook

Before you can add compute machines that use Red Hat Enterprise Linux as the operating system to an OKD Latest cluster, you must prepare a machine to run the playbook from. This machine is not part of the cluster but must be able to access it.

Prerequisites
  • Install the OpenShift CLI (oc) on the machine that you run the playbook on.

  • Log in as a user with cluster-admin permission.

Procedure
  1. Ensure that the kubeconfig file for the cluster and the installation program that you used to install the cluster are on the machine. One way to accomplish this is to use the same machine that you used to install the cluster.

  2. Configure the machine to access all of the RHEL hosts that you plan to use as compute machines. You can use any method that your company allows, including a bastion with an SSH proxy or a VPN.

  3. Configure a user on the machine that you run the playbook on that has SSH access to all of the RHEL hosts.

    If you use SSH key-based authentication, you must manage the key with an SSH agent.

  4. If you have not already done so, register the machine with RHSM and attach a pool with an OpenShift subscription to it:

    1. Register the machine with RHSM:

      # subscription-manager register --username=<user_name> --password=<password>
    2. Pull the latest subscription data from RHSM:

      # subscription-manager refresh
    3. List the available subscriptions:

      # subscription-manager list --available --matches '*OpenShift*'
    4. In the output for the previous command, find the pool ID for an OKD subscription and attach it:

      # subscription-manager attach --pool=<pool_id>
  5. Enable the repositories required by OKD Latest:

    # subscription-manager repos \
        --enable="rhel-7-server-rpms" \
        --enable="rhel-7-server-extras-rpms" \
        --enable="rhel-7-server-ansible-2.9-rpms" \
        --enable="rhel-7-server-ose-4.5-rpms"
  6. Install the required packages, including Openshift-Ansible:

    # yum install openshift-ansible openshift-clients jq

    The openshift-ansible package provides installation program utilities and pulls in other packages that you require to add a RHEL compute node to your cluster, such as Ansible, playbooks, and related configuration files. The openshift-clients provides the oc CLI, and the jq package improves the display of JSON output on your command line.

Preparing a RHEL compute node

Before you add a Red Hat Enterprise Linux (RHEL) machine to your OKD cluster, you must register each host with Red Hat Subscription Manager (RHSM), attach an active OKD subscription, and enable the required repositories.

  1. On each host, register with RHSM:

    # subscription-manager register --username=<user_name> --password=<password>
  2. Pull the latest subscription data from RHSM:

    # subscription-manager refresh
  3. List the available subscriptions:

    # subscription-manager list --available --matches '*OpenShift*'
  4. In the output for the previous command, find the pool ID for an OKD subscription and attach it:

    # subscription-manager attach --pool=<pool_id>
  5. Disable all yum repositories:

    1. Disable all the enabled RHSM repositories:

      # subscription-manager repos --disable="*"
    2. List the remaining yum repositories and note their names under repo id, if any:

      # yum repolist
    3. Use yum-config-manager to disable the remaining yum repositories:

      # yum-config-manager --disable <repo_id>

      Alternatively, disable all repositories:

      # yum-config-manager --disable \*

      Note that this might take a few minutes if you have a large number of available repositories

  6. Enable only the repositories required by OKD Latest:

    # subscription-manager repos \
        --enable="rhel-7-server-rpms" \
        --enable="rhel-7-server-extras-rpms" \
        --enable="rhel-7-server-ose-4.6-rpms"
  7. Stop and disable firewalld on the host:

    # systemctl disable --now firewalld.service

    You must not enable firewalld later. If you do, you cannot access OKD logs on the worker.

Adding a RHEL compute machine to your cluster

You can add compute machines that use Red Hat Enterprise Linux as the operating system to an OKD Latest cluster.

Prerequisites
  • You installed the required packages and performed the necessary configuration on the machine that you run the playbook on.

  • You prepared the RHEL hosts for installation.

Procedure

Perform the following steps on the machine that you prepared to run the playbook:

  1. Create an Ansible inventory file that is named /<path>/inventory/hosts that defines your compute machine hosts and required variables:

    [all:vars]
    ansible_user=root (1)
    #ansible_become=True (2)
    
    openshift_kubeconfig_path="~/.kube/config" (3)
    
    [new_workers] (4)
    mycluster-rhel7-0.example.com
    mycluster-rhel7-1.example.com
    1 Specify the user name that runs the Ansible tasks on the remote compute machines.
    2 If you do not specify root for the ansible_user, you must set ansible_become to True and assign the user sudo permissions.
    3 Specify the path and file name of the kubeconfig file for your cluster.
    4 List each RHEL machine to add to your cluster. You must provide the fully-qualified domain name for each host. This name is the host name that the cluster uses to access the machine, so set the correct public or private name to access the machine.
  2. Navigate to the Ansible playbook directory:

    $ cd /usr/share/ansible/openshift-ansible
  3. Run the playbook:

    $ ansible-playbook -i /<path>/inventory/hosts playbooks/scaleup.yml (1)
    1 For <path>, specify the path to the Ansible inventory file that you created.

Required parameters for the Ansible hosts file

You must define the following parameters in the Ansible hosts file before you add Red Hat Enterprise Linux (RHEL) compute machines to your cluster.

Paramter Description Values

ansible_user

The SSH user that allows SSH-based authentication without requiring a password. If you use SSH key-based authentication, then you must manage the key with an SSH agent.

A user name on the system. The default value is root.

ansible_become

If the values of ansible_user is not root, you must set ansible_become to True, and the user that you specify as the ansible_user must be configured for passwordless sudo access.

True. If the value is not True, do not specify and define this parameter.

openshift_kubeconfig_path

Specifies a path and file name to a local directory that contains the kubeconfig file for your cluster.

The path and name of the configuration file.

Removing RHCOS compute machines from a cluster

After you add the Red Hat Enterprise Linux (RHEL) compute machines to your cluster, you can remove the Fedora CoreOS (FCOS) compute machines.

Prerequisites
  • You have added RHEL compute machines to your cluster.

Procedure
  1. View the list of machines and record the node names of the FCOS compute machines:

    $ oc get nodes -o wide
  2. For each FCOS compute machine, delete the node:

    1. Mark the node as unschedulable by running the oc adm cordon command:

      $ oc adm cordon <node_name> (1)
      1 Specify the node name of one of the FCOS compute machines.
    2. Drain all the pods from the node:

      $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets (1)
      1 Specify the node name of the FCOS compute machine that you isolated.
    3. Delete the node:

      $ oc delete nodes <node_name> (1)
      1 Specify the node name of the FCOS compute machine that you drained.
  3. Review the list of compute machines to ensure that only the RHEL nodes remain:

    $ oc get nodes -o wide
  4. Remove the FCOS machines from the load balancer for your cluster’s compute machines. You can delete the Virtual Machines or reimage the physical hardware for the FCOS compute machines.

Deploying MachineHealthChecks

Understand and deploy MachineHealthChecks :leveloffset: +2

This process is not applicable to clusters where you manually provisioned the machines yourself. You can use the advanced machine management and scaling capabilities only in clusters where the machine API is operational.

About MachineHealthChecks

MachineHealthChecks automatically repairs unhealthy Machines in a particular MachinePool.

To monitor machine health, you create a resource to define the configuration for a controller. You set a condition to check for, such as staying in the NotReady status for 15 minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.

You cannot apply a MachineHealthCheck to a machine with the master role.

The controller that observes a MachineHealthCheck resource checks for the status that you defined. If a machine fails the health check, it is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a machine deleted event. To limit disruptive impact of the machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the maxUnhealthy threshold allows for in the targeted pool of machines, remediation stops so that manual intervention can take place.

To stop the check, you remove the resource.

Sample MachineHealthCheck resource

The MachineHealthCheck resource resembles the following YAML file:

MachineHealthCheck
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example (1)
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role> (2)
      machine.openshift.io/cluster-api-machine-type: <role> (2)
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> (3)
  unhealthyConditions:
  - type:    "Ready"
    timeout: "300s" (4)
    status: "False"
  - type:    "Ready"
    timeout: "300s" (4)
    status: "Unknown"
  maxUnhealthy: "40%" (5)
1 Specify the name of the MachineHealthCheck to deploy.
2 Specify a label for the machine pool that you want to check.
3 Specify the MachineSet to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a.
4 Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the Machine will be remediated. Long timeouts can result in long periods of downtime for the workload on the unhealthy Machine.
5 Specify the amount of unhealthy machines allowed in the targeted pool of machines. This can be set as a percentage or an integer.

The matchLabels are examples only; you must map your machine groups based on your specific needs.

Creating a MachineHealthCheck resource

You can create a MachineHealthCheck resource for all MachinePools in your cluster except the master pool.

Prerequisites
  • Install the oc command line interface.

Procedure
  1. Create a healthcheck.yml file that contains the definition of your MachineHealthCheck.

  2. Apply the healthcheck.yml file to your cluster:

    $ oc apply -f healthcheck.yml

Scaling a MachineSet manually

If you must add or remove an instance of a machine in a MachineSet, you can manually scale the MachineSet.

This guidance is relevant to fully automated, installer provisioned infrastructure installations. Customized, user provisioned infrastructure installations does not have MachineSets.

Prerequisites
  • Install an OKD cluster and the oc command line.

  • Log in to oc as a user with cluster-admin permission.

Procedure
  1. View the MachineSets that are in the cluster:

    $ oc get machinesets -n openshift-machine-api

    The MachineSets are listed in the form of <clusterid>-worker-<aws-region-az>.

  2. Scale the MachineSet:

    $ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api

    Or:

    $ oc edit machineset <machineset> -n openshift-machine-api

    You can scale the MachineSet up or down. It takes several minutes for the new machines to be available.

The MachineSet delete policy

Random, Newest, and Oldest are the three supported options. The default is Random, meaning that random machines are chosen and deleted when scaling MachineSets down. The delete policy can be set according to the use case by modifying the particular MachineSet:

spec:
  deletePolicy: <delete policy>
  replicas: <desired replica count>

Specific machines can also be prioritized for deletion by adding the annotation machine.openshift.io/cluster-api-delete-machine to the machine of interest, regardless of the delete policy.

By default, the OKD router pods are deployed on workers. Because the router is required to access some cluster resources, including the web console, do not scale the worker MachineSet to 0 unless you first relocate the router pods.

Custom MachineSets can be used for use cases requiring that services run on specific nodes and that those services are ignored by the controller when the worker MachineSets are scaling down. This prevents service disruption.

Understanding the difference between MachineSets and the MachineConfigPool

MachineSets describe OKD nodes with respect to the cloud or machine provider.

The MachineConfigPool allows MachineConfigController components to define and provide the status of machines in the context of upgrades.

The MachineConfigPool allows users to configure how upgrades are rolled out to the OKD nodes in the MachineConfigPool.

NodeSelector can be replaced with a reference to MachineSets.

Recommended node host practices

The OKD node configuration file contains important options. For example, two parameters control the maximum number of pods that can be scheduled to a node: podsPerCore and maxPods.

When both options are in use, the lower of the two values limits the number of pods on a node. Exceeding these values can result in:

  • Increased CPU utilization.

  • Slow pod scheduling.

  • Potential out-of-memory scenarios, depending on the amount of memory in the node.

  • Exhausting the pool of IP addresses.

  • Resource overcommitting, leading to poor user application performance.

In Kubernetes, a pod that is holding a single container actually uses two containers. The second container is used to set up networking prior to the actual container starting. Therefore, a system running 10 pods will actually have 20 containers running.

podsPerCore sets the number of pods the node can run based on the number of processor cores on the node. For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node will be 40.

kubeletConfig:
  podsPerCore: 10

Setting podsPerCore to 0 disables this limit. The default is 0. podsPerCore cannot exceed maxPods.

maxPods sets the number of pods the node can run to a fixed value, regardless of the properties of the node.

 kubeletConfig:
    maxPods: 250

Creating a KubeletConfig CRD to edit kubelet parameters

The kubelet configuration is currently serialized as an ignition configuration, so it can be directly edited. However, there is also a new kubelet-config-controller added to the Machine Config Controller (MCC). This allows you to create a KubeletConfig custom resource (CR) to edit the kubelet parameters.

Procedure
  1. Run:

    $ oc get machineconfig

    This provides a list of the available machine configuration objects you can select. By default, the two kubelet-related configs are 01-master-kubelet and 01-worker-kubelet.

  2. To check the current value of max Pods per node, run:

    # oc describe node <node-ip> | grep Allocatable -A6

    Look for value: pods: <value>.

    For example:

    # oc describe node ip-172-31-128-158.us-east-2.compute.internal | grep Allocatable -A6
    Example output
    Allocatable:
     attachable-volumes-aws-ebs:  25
     cpu:                         3500m
     hugepages-1Gi:               0
     hugepages-2Mi:               0
     memory:                      15341844Ki
     pods:                        250
  3. To set the max Pods per node on the worker nodes, create a custom resource file that contains the kubelet configuration. For example, change-maxPods-cr.yaml:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: set-max-pods
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: large-pods
      kubeletConfig:
        maxPods: 500

    The rate at which the kubelet talks to the API server depends on queries per second (QPS) and burst values. The default values, 50 for kubeAPIQPS and 100 for kubeAPIBurst, are good enough if there are limited pods running on each node. Updating the kubelet QPS and burst rates is recommended if there are enough CPU and memory resources on the node:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: set-max-pods
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: large-pods
      kubeletConfig:
        maxPods: <pod_count>
        kubeAPIBurst: <burst_rate>
        kubeAPIQPS: <QPS>
    1. Run:

      $ oc label machineconfigpool worker custom-kubelet=large-pods
    2. Run:

      $ oc create -f change-maxPods-cr.yaml
    3. Run:

      $ oc get kubeletconfig

      This should return set-max-pods.

      Depending on the number of worker nodes in the cluster, wait for the worker nodes to be rebooted one by one. For a cluster with 3 worker nodes, this could take about 10 to 15 minutes.

  4. Check for maxPods changing for the worker nodes:

    $ oc describe node
    1. Verify the change by running:

      $ oc get kubeletconfigs set-max-pods -o yaml

      This should show a status of True and type:Success

Modifying the number of unavailable worker nodes

By default, only one machine is allowed to be unavailable when applying the kubelet-related configuration to the available worker nodes. For a large cluster, it can take a long time for the configuration change to be reflected. At any time, you can adjust the number of machines that are updating to speed up the process.

Procedure
  1. Run:

    $ oc edit machineconfigpool worker
  2. Set maxUnavailable to the value that you want:

    spec:
      maxUnavailable: <node_count>

    When setting the value, consider the number of worker nodes that can be unavailable without affecting the applications running on the cluster.

Control plane node sizing

The control plane node resource requirements depend on the number of nodes in the cluster. The following control plane node size recommendations are based on the results of control plane density focused testing.

Number of worker nodes CPU cores Memory (GB)

25

4

16

100

8

32

250

16

96

Because you cannot modify the control plane node size in a running OKD Latest cluster, you must estimate your total node count and use the suggested control plane node size during installation.

In OKD Latest, half of a CPU core (500 millicore) is now reserved by the system by default compared to OKD 3.11 and previous versions. The sizes are determined taking that into consideration.

Recommended etcd practices

For large and dense clusters, etcd can suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd, including defragmentation, must be performed to free up space in the data store. It is highly recommended that you monitor Prometheus for etcd metrics and defragment it when required before etcd raises a cluster-wide alarm that puts the cluster into a maintenance mode, which only accepts key reads and deletes. Some of the key metrics to monitor are etcd_server_quota_backend_bytes which is the current quota limit, etcd_mvcc_db_total_size_in_use_in_bytes which indicates the actual database usage after a history compaction, and etcd_debugging_mvcc_db_total_size_in_bytes which shows the database size including free space waiting for defragmentation.

Setting up CPU Manager

Procedure
  1. Optional: Label a node:

    # oc label node perf-node.example.com cpumanager=true
  2. Edit the MachineConfigPool of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:

    # oc edit machineconfigpool worker
  3. Add a label to the worker MachineConfigPool:

    metadata:
      creationTimestamp: 2019-xx-xxx
      generation: 3
      labels:
        custom-kubelet: cpumanager-enabled
  4. Create a KubeletConfig, cpumanager-kubeletconfig.yaml, custom resource (CR). Refer to the label created in the previous step to have the correct nodes updated with the new KubeletConfig. See the machineConfigPoolSelector section:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: cpumanager-enabled
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: cpumanager-enabled
      kubeletConfig:
         cpuManagerPolicy: static (1)
         cpuManagerReconcilePeriod: 5s (2)
    1 Specify a policy:
    • none. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically.

    • static. This policy allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node.

    2 Optional. Specify the CPU Manager reconcile frequency. The default is 5s.
  5. Create the dynamic KubeletConfig:

    # oc create -f cpumanager-kubeletconfig.yaml

    This adds the CPU Manager feature to the KubeletConfig and, if needed, the Machine Config Operator (MCO) reboots the node. To enable CPU Manager, a reboot is not needed.

  6. Check for the merged KubeletConfig:

    # oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
    Example output
           "ownerReferences": [
                {
                    "apiVersion": "machineconfiguration.openshift.io/v1",
                    "kind": "KubeletConfig",
                    "name": "cpumanager-enabled",
                    "uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878"
                }
            ],
  7. Check the worker for the updated kubelet.conf:

    # oc debug node/perf-node.example.com
    sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
    Example output
    cpuManagerPolicy: static        (1)
    cpuManagerReconcilePeriod: 5s   (1)
    
    1 These settings were defined when you created the KubeletConfig CR.
  8. Create a Pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this Pod:

    # cat cpumanager-pod.yaml
    Example output
    apiVersion: v1
    kind: Pod
    metadata:
      generateName: cpumanager-
    spec:
      containers:
      - name: cpumanager
        image: gcr.io/google_containers/pause-amd64:3.0
        resources:
          requests:
            cpu: 1
            memory: "1G"
          limits:
            cpu: 1
            memory: "1G"
      nodeSelector:
        cpumanager: "true"
  9. Create the Pod:

    # oc create -f cpumanager-pod.yaml
  10. Verify that the Pod is scheduled to the node that you labeled:

    # oc describe pod cpumanager
    Example output
    Name:               cpumanager-6cqz7
    Namespace:          default
    Priority:           0
    PriorityClassName:  <none>
    Node:  perf-node.example.com/xxx.xx.xx.xxx
    ...
     Limits:
          cpu:     1
          memory:  1G
        Requests:
          cpu:        1
          memory:     1G
    ...
    QoS Class:       Guaranteed
    Node-Selectors:  cpumanager=true
  11. Verify that the cgroups are set up correctly. Get the process ID (PID) of the pause process:

    # ├─init.scope
    │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17
    └─kubepods.slice
      ├─kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice
      │ ├─crio-b5437308f1a574c542bdf08563b865c0345c8f8c0b0a655612c.scope
      │ └─32706 /pause

    Pods of quality of service (QoS) tier Guaranteed are placed within the kubepods.slice. Pods of other QoS tiers end up in child cgroups of kubepods:

    # cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
    # for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done
    Example output
    cpuset.cpus 1
    tasks 32706
  12. Check the allowed CPU list for the task:

    # grep ^Cpus_allowed_list /proc/32706/status
    Example output
     Cpus_allowed_list:    1
  13. Verify that another Pod (in this case, the Pod in the burstable QoS tier) on the system cannot run on the core allocated for the Guaranteed Pod:

    # cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
    0
    # oc describe node perf-node.example.com
    Example output
    ...
    Capacity:
     attachable-volumes-aws-ebs:  39
     cpu:                         2
     ephemeral-storage:           124768236Ki
     hugepages-1Gi:               0
     hugepages-2Mi:               0
     memory:                      8162900Ki
     pods:                        250
    Allocatable:
     attachable-volumes-aws-ebs:  39
     cpu:                         1500m
     ephemeral-storage:           124768236Ki
     hugepages-1Gi:               0
     hugepages-2Mi:               0
     memory:                      7548500Ki
     pods:                        250
    -------                               ----                           ------------  ----------  ---------------  -------------  ---
      default                                 cpumanager-6cqz7               1 (66%)       1 (66%)     1G (12%)         1G (12%)       29m
    
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource                    Requests          Limits
      --------                    --------          ------
      cpu                         1440m (96%)       1 (66%)

    This VM has two CPU cores. You set kube-reserved to 500 millicores, meaning half of one core is subtracted from the total capacity of the node to arrive at the Node Allocatable amount. You can see that Allocatable CPU is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second Pod, the system will accept the Pod, but it will never be scheduled:

    NAME                    READY   STATUS    RESTARTS   AGE
    cpumanager-6cqz7        1/1     Running   0          33m
    cpumanager-7qc2t        0/1     Pending   0          11s

Huge pages

Understand and configure huge pages.

What huge pages do

Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi of memory is equal to 256 pages; 1Gi of memory is 256,000 pages, and so on. CPUs have a built-in memory management unit that manages a list of these pages in hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of virtual-to-physical page mappings. If the virtual address passed in a hardware instruction can be found in the TLB, the mapping can be determined quickly. If not, a TLB miss occurs, and the system falls back to slower, software-based address translation, resulting in performance issues. Since the size of the TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the page size.

A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other architectures. In order to use huge pages, code must be written so that applications are aware of them. Transparent Huge Pages (THP) attempt to automate the management of huge pages without application knowledge, but they have limitations. In particular, they are limited to 2Mi page sizes. THP can lead to performance degradation on nodes with high memory utilization or fragmentation due to defragmenting efforts of THP, which can lock memory pages. For this reason, some applications may be designed to (or recommend) usage of pre-allocated huge pages instead of THP.

How huge pages are consumed by apps

Nodes must pre-allocate huge pages in order for the node to report its huge page capacity. A node can only pre-allocate huge pages for a single size.

Huge pages can be consumed through container-level resource requirements using the resource name hugepages-<size>, where size is the most compact binary notation using integer values supported on a particular node. For example, if a node supports 2048KiB page sizes, it exposes a schedulable resource hugepages-2Mi. Unlike CPU or memory, huge pages do not support over-commitment.

apiVersion: v1
kind: Pod
metadata:
  generateName: hugepages-volume-
spec:
  containers:
  - securityContext:
      privileged: true
    image: rhel7:latest
    command:
    - sleep
    - inf
    name: example
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        hugepages-2Mi: 100Mi (1)
        memory: "1Gi"
        cpu: "1"
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
1 Specify the amount of memory for hugepages as the exact amount to be allocated. Do not specify this value as the amount of memory for hugepages multiplied by the size of the page. For example, given a huge page size of 2MB, if you want to use 100MB of huge-page-backed RAM for your application, then you would allocate 50 huge pages. OKD handles the math for you. As in the above example, you can specify 100MB directly.

Allocating huge pages of a specific size

Some platforms support multiple huge page sizes. To allocate huge pages of a specific size, precede the huge pages boot command parameters with a huge page size selection parameter hugepagesz=<size>. The <size> value must be specified in bytes with an optional scale suffix [kKmMgG]. The default huge page size can be defined with the default_hugepagesz=<size> boot parameter.

Huge page requirements

  • Huge page requests must equal the limits. This is the default if limits are specified, but requests are not.

  • Huge pages are isolated at a pod scope. Container isolation is planned in a future iteration.

  • EmptyDir volumes backed by huge pages must not consume more huge page memory than the pod request.

  • Applications that consume huge pages via shmget() with SHM_HUGETLB must run with a supplemental group that matches proc/sys/vm/hugetlb_shm_group.

Additional resources

Configuring huge pages

Nodes must pre-allocate huge pages used in an OKD cluster. There are two ways of reserving huge pages: at boot time and at run time. Reserving at boot time increases the possibility of success because the memory has not yet been significantly fragmented. The Node Tuning Operator currently supports boot time allocation of huge pages on specific nodes.

At boot time

Procedure

To minimize node reboots, the order of the steps below needs to be followed:

  1. Label all nodes that need the same hugepages setting by a label.

    $ oc label node <node_using_hugepages> node-role.kubernetes.io/worker-hp=
  2. Create a file with the following content and name it hugepages-tuned-boottime.yaml:

    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
      name: hugepages (1)
      namespace: openshift-cluster-node-tuning-operator
    spec:
      profile: (2)
      - data: |
          [main]
          summary=Boot time configuration for hugepages
          include=openshift-node
          [bootloader]
          cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50 (3)
        name: openshift-node-hugepages
    
      recommend:
      - machineConfigLabels: (4)
          machineconfiguration.openshift.io/role: "worker-hp"
        priority: 30
        profile: openshift-node-hugepages
    1 Set the name of the Tuned resource to hugepages.
    2 Set the profile section to allocate huge pages.
    3 Note the order of parameters is important as some platforms support hugepages of various sizes.
    4 Enable MachineConfigPool based matching.
  3. Create the Tuned hugepages profile

    $ oc create -f hugepages-tuned-boottime.yaml
  4. Create a file with the following content and name it hugepages-mcp.yaml:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: worker-hp
      labels:
        worker-hp: ""
    spec:
      machineConfigSelector:
        matchExpressions:
          - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-hp]}
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker-hp: ""
  5. Create the MachineConfigPool:

    $ oc create -f hugepages-mcp.yaml

Given enough non-fragmented memory, all the nodes in the worker-hp MachineConfigPool should now have 50 2Mi hugepages allocated.

$ oc get node <node_using_hugepages> -o jsonpath="{.status.allocatable.hugepages-2Mi}"
100Mi

This functionality is currently only supported on Fedora CoreOS (FCOS) 8.x worker nodes. On Fedora 7.x worker nodes the tuned [bootloader] plug-in is currently not supported.

Understanding device plug-ins

The device plug-in provides a consistent and portable solution to consume hardware devices across clusters. The device plug-in provides support for these devices through an extension mechanism, which makes these devices available to Containers, provides health checks of these devices, and securely shares them.

OKD supports the device plug-in API, but the device plug-in Containers are supported by individual vendors.

A device plug-in is a gRPC service running on the nodes (external to the kubelet) that is responsible for managing specific hardware resources. Any device plug-in must support following remote procedure calls (RPCs):

service DevicePlugin {
      // GetDevicePluginOptions returns options to be communicated with Device
      // Manager
      rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}

      // ListAndWatch returns a stream of List of Devices
      // Whenever a Device state change or a Device disappears, ListAndWatch
      // returns the new list
      rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

      // Allocate is called during container creation so that the Device
      // Plug-in can run device specific operations and instruct Kubelet
      // of the steps to make the Device available in the container
      rpc Allocate(AllocateRequest) returns (AllocateResponse) {}

      // PreStartcontainer is called, if indicated by Device Plug-in during
      // registration phase, before each container start. Device plug-in
      // can run device specific operations such as reseting the device
      // before making devices available to the container
      rpc PreStartcontainer(PreStartcontainerRequest) returns (PreStartcontainerResponse) {}
}

Example device plug-ins

For easy device plug-in reference implementation, there is a stub device plug-in in the Device Manager code: vendor/k8s.io/kubernetes/pkg/kubelet/cm/deviceplugin/device_plugin_stub.go.

Methods for deploying a device plug-in

  • Daemonsets are the recommended approach for device plug-in deployments.

  • Upon start, the device plug-in will try to create a UNIX domain socket at /var/lib/kubelet/device-plugin/ on the node to serve RPCs from Device Manager.

  • Since device plug-ins must manage hardware resources, access to the host file system, as well as socket creation, they must be run in a privileged security context.

  • More specific details regarding deployment steps can be found with each device plug-in implementation.

Understanding the Device Manager

Device Manager provides a mechanism for advertising specialized node hardware resources with the help of plug-ins known as device plug-ins.

You can advertise specialized hardware without requiring any upstream code changes.

OKD supports the device plug-in API, but the device plug-in Containers are supported by individual vendors.

Device Manager advertises devices as Extended Resources. User pods can consume devices, advertised by Device Manager, using the same Limit/Request mechanism, which is used for requesting any other Extended Resource.

Upon start, the device plug-in registers itself with Device Manager invoking Register on the /var/lib/kubelet/device-plugins/kubelet.sock and starts a gRPC service at /var/lib/kubelet/device-plugins/<plugin>.sock for serving Device Manager requests.

Device Manager, while processing a new registration request, invokes ListAndWatch remote procedure call (RPC) at the device plug-in service. In response, Device Manager gets a list of Device objects from the plug-in over a gRPC stream. Device Manager will keep watching on the stream for new updates from the plug-in. On the plug-in side, the plug-in will also keep the stream open and whenever there is a change in the state of any of the devices, a new device list is sent to the Device Manager over the same streaming connection.

While handling a new pod admission request, Kubelet passes requested Extended Resources to the Device Manager for device allocation. Device Manager checks in its database to verify if a corresponding plug-in exists or not. If the plug-in exists and there are free allocatable devices as well as per local cache, Allocate RPC is invoked at that particular device plug-in.

Additionally, device plug-ins can also perform several other device-specific operations, such as driver installation, device initialization, and device resets. These functionalities vary from implementation to implementation.

Enabling Device Manager

Enable Device Manager to implement a device plug-in to advertise specialized hardware without any upstream code changes.

Device Manager provides a mechanism for advertising specialized node hardware resources with the help of plug-ins known as device plug-ins.

  1. Obtain the label associated with the static Machine Config Pool CRD for the type of node you want to configure. Perform one of the following steps:

    1. View the Machine Config:

      # oc describe machineconfig <name>

      For example:

      # oc describe machineconfig 00-worker
      Example output
      Name:         00-worker
      Namespace:
      Labels:       machineconfiguration.openshift.io/role=worker (1)
      
      1 Label required for the device manager.
Procedure
  1. Create a Custom Resource (CR) for your configuration change.

    Sample configuration for a Device Manager CR
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: devicemgr (1)
    spec:
      machineConfigPoolSelector:
        matchLabels:
           machineconfiguration.openshift.io: devicemgr (2)
      kubeletConfig:
        feature-gates:
          - DevicePlugins=true (3)
    1 Assign a name to CR.
    2 Enter the label from the Machine Config Pool.
    3 Set DevicePlugins to 'true`.
  2. Create the device manager:

    $ oc create -f devicemgr.yaml
    Example output
    kubeletconfig.machineconfiguration.openshift.io/devicemgr created
  3. Ensure that Device Manager was actually enabled by confirming that /var/lib/kubelet/device-plugins/kubelet.sock is created on the node. This is the UNIX domain socket on which the Device Manager gRPC server listens for new plug-in registrations. This sock file is created when the Kubelet is started only if Device Manager is enabled.

Taints and tolerations

Understand and work with taints and tolerations.

Understanding taints and tolerations

A taint allows a node to refuse Pod to be scheduled unless that Pod has a matching toleration.

You apply taints to a node through the node specification (NodeSpec) and apply tolerations to a Pod through the Pod specification (PodSpec). A taint on a node instructs the node to repel all Pods that do not tolerate the taint.

Taints and tolerations consist of a key, value, and effect. An operator allows you to leave one of these parameters empty.

Table 1. Taint and toleration components
Parameter Description

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

NoSchedule

  • New Pods that do not match the taint are not scheduled onto that node.

  • Existing Pods on the node remain.

PreferNoSchedule

  • New Pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.

  • Existing Pods on the node remain.

NoExecute

  • New Pods that do not match the taint cannot be scheduled onto that node.

  • Existing Pods on the node that do not have a matching toleration are removed.

operator

Equal

The key/value/effect parameters must match. This is the default.

Exists

The key/effect parameters must match. You must leave a blank value parameter, which matches any.

A toleration matches a taint:

  • If the operator parameter is set to Equal:

    • the key parameters are the same;

    • the value parameters are the same;

    • the effect parameters are the same.

  • If the operator parameter is set to Exists:

    • the key parameters are the same;

    • the effect parameters are the same.

The following taints are built into kubernetes:

  • node.kubernetes.io/not-ready: The node is not ready. This corresponds to the node condition Ready=False.

  • node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the node condition Ready=Unknown.

  • node.kubernetes.io/out-of-disk: The node has insufficient free space on the node for adding new Pods. This corresponds to the node condition OutOfDisk=True.

  • node.kubernetes.io/memory-pressure: The node has memory pressure issues. This corresponds to the node condition MemoryPressure=True.

  • node.kubernetes.io/disk-pressure: The node has disk pressure issues. This corresponds to the node condition DiskPressure=True.

  • node.kubernetes.io/network-unavailable: The node network is unavailable.

  • node.kubernetes.io/unschedulable: The node is unschedulable.

  • node.cloudprovider.kubernetes.io/uninitialized: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.

Understanding how to use toleration seconds to delay Pod evictions

You can specify how long a Pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the Pod specification. If a taint with the NoExecute effect is added to a node, any Pods that do not tolerate the taint are evicted immediately. Pods that do tolerate the taint are not evicted. However, if a Pod that does tolerate the taint has the tolerationSeconds parameter, the Pod is not evicted until that time period expires.

Example output
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600

Here, if this Pod is running but does not have a matching taint, the Pod stays bound to the node for 3,600 seconds and then be evicted. If the taint is removed before that time, the Pod is not evicted.

Understanding how to use multiple taints

You can put multiple taints on the same node and multiple tolerations on the same Pod. OKD processes multiple taints and tolerations as follows:

  1. Process the taints for which the Pod has a matching toleration.

  2. The remaining unmatched taints have the indicated effects on the Pod:

    • If there is at least one unmatched taint with effect NoSchedule, OKD cannot schedule a Pod onto that node.

    • If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OKD tries to not schedule the Pod onto the node.

    • If there is at least one unmatched taint with effect NoExecute, OKD evicts the Pod from the node (if it is already running on the node), or the Pod is not scheduled onto the node (if it is not yet running on the node).

      • Pods that do not tolerate the taint are evicted immediately.

      • Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.

      • Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

For example:

  • Add the following taints to the node:

    $ oc adm taint nodes node1 key1=value1:NoSchedule
    $ oc adm taint nodes node1 key1=value1:NoExecute
    $ oc adm taint nodes node1 key2=value2:NoSchedule
  • The Pod has the following tolerations:

    tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoSchedule"
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoExecute"

In this case, the Pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The Pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the Pod.

Preventing Pod eviction for node problems

The Taint-Based Evictions feature, enabled by default, adds a taint with the NoExecute effect to nodes that are not ready or are unreachable. This allows you to specify how long a Pod should remain bound to a node that becomes unreachable or not ready, rather than using the default of five minutes. For example, you might want to allow a Pod on an unreachable node if the workload is safe to remain running while a networking issue resolves.

If a node enters a not ready state, the node controller adds the node.kubernetes.io/not-ready:NoExecute taint to the node. If a node enters an unreachable state, the node controller adds the node.kubernetes.io/unreachable:NoExecute taint to the node.

The NoExecute taint affects Pods that are already running on the node in the following ways:

  • Pods that do not tolerate the taint are evicted immediately.

  • Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.

  • Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

Understanding Pod scheduling and node conditions (Taint Node by Condition)

OKD automatically taints nodes that report conditions such as memory pressure and disk pressure. If a node reports a condition, a taint is added until the condition clears. The taints have the NoSchedule effect, which means no Pod can be scheduled on the node unless the Pod has a matching toleration. This feature, Taint Nodes By Condition, is enabled by default.

The scheduler checks for these taints on nodes before scheduling Pods. If the taint is present, the Pod is scheduled on a different node. Because the scheduler checks for taints and not the actual node conditions, you configure the scheduler to ignore some of these node conditions by adding appropriate Pod tolerations.

To ensure backward compatibility, the DaemonSet controller automatically adds the following tolerations to all daemons:

  • node.kubernetes.io/memory-pressure

  • node.kubernetes.io/disk-pressure

  • node.kubernetes.io/out-of-disk (only for critical Pods)

  • node.kubernetes.io/unschedulable (1.10 or later)

  • node.kubernetes.io/network-unavailable (host network only)

You can also add arbitrary tolerations to DaemonSets.

Understanding evicting pods by condition (Taint-Based Evictions)

The Taint-Based Evictions feature, enabled by default, evicts Pods from a node that experiences specific conditions, such as not-ready and unreachable. When a node experiences one of these conditions, OKD automatically adds taints to the node, and starts evicting and rescheduling the Pods on different nodes.

Taint Based Evictions has a NoExecute effect, where any Pod that does not tolerate the taint will be evicted immediately and any Pod that does tolerate the taint will never be evicted.

OKD evicts Pods in a rate-limited way to prevent massive Pod evictions in scenarios such as the master becoming partitioned from the nodes.

This feature, in combination with tolerationSeconds, allows you to specify how long a Pod stays bound to a node that has a node condition. If the condition still exists after the tolerationSections period, the taint remains on the node and the Pods are evicted in a rate-limited manner. If the condition clears before the tolerationSeconds period, Pods are not removed.

OKD automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless the Pod configuration specifies either toleration.

spec
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300

These tolerations ensure that the default Pod behavior is to remain bound for five minutes after one of these node conditions problems is detected.

You can configure these tolerations as needed. For example, if you have an application with a lot of local state, you might want to keep the Pods bound to node for a longer time in the event of network partition, allowing for the partition to recover and avoiding Pod eviction.

DaemonSet Pods are created with NoExecute tolerations for the following taints with no tolerationSeconds:

  • node.kubernetes.io/unreachable

  • node.kubernetes.io/not-ready

This ensures that DaemonSet Pods are never evicted due to these node conditions, even if the DefaultTolerationSeconds admission controller is disabled.

Adding taints and tolerations

You add taints to nodes and tolerations to pods allow the node to control which pods should (or should not) be scheduled on them.

Procedure
  1. Use the following command using the parameters described in the taint and toleration components table:

    $ oc adm taint nodes <node-name> <key>=<value>:<effect>

    For example:

    $ oc adm taint nodes node1 key1=value1:NoExecute

    This example places a taint on node1 that has key key1, value value1, and taint effect NoExecute.

  2. Add a toleration to a pod by editing the pod specification to include a tolerations section:

    Sample pod configuration file with Equal operator
    tolerations:
    - key: "key1" (1)
      operator: "Equal" (1)
      value: "value1" (1)
      effect: "NoExecute" (1)
      tolerationSeconds: 3600 (2)
    1 The toleration parameters, as described in the taint and toleration components table.
    2 The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted.

    For example:

    Sample pod configuration file with Exists operator
    tolerations:
    - key: "key1"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 3600

    Both of these tolerations match the taint created by the oc adm taint command above. A pod with either toleration would be able to schedule onto node1.

Dedicating a Node for a User using taints and tolerations

You can specify a set of nodes for exclusive use by a particular set of users.

Procedure

To specify dedicated nodes:

  1. Add a taint to those nodes:

    For example:

    $ oc adm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a corresponding toleration to the pods by writing a custom admission controller.

    Only the pods with the tolerations are allowed to use the dedicated nodes.

Binding a user to a Node using taints and tolerations

You can configure a node so that particular users can use only the dedicated nodes.

Procedure

To configure a node so that users can use only that node:

  1. Add a taint to those nodes:

    For example:

    $ oc adm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a corresponding toleration to the pods by writing a custom admission controller.

    The admission controller should add a node affinity to require that the pods can only schedule onto nodes labeled with the key:value label (dedicated=groupName).

  3. Add a label similar to the taint (such as the key:value label) to the dedicated nodes.

Controlling Nodes with special hardware using taints and tolerations

In a cluster where a small subset of nodes have specialized hardware (for example GPUs), you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.

Procedure

To ensure pods are blocked from the specialized hardware:

  1. Taint the nodes that have the specialized hardware using one of the following commands:

    $ oc adm taint nodes <node-name> disktype=ssd:NoSchedule

    Or:

    $ oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule
  2. Adding a corresponding toleration to pods that use the special hardware using an admission controller.

For example, the admission controller could use some characteristic(s) of the pod to determine that the pod should be allowed to use the special nodes by adding a toleration.

To ensure pods can only use the specialized hardware, you need some additional mechanism. For example, you could label the nodes that have the special hardware and use node affinity on the pods that need the hardware.

Removing taints and tolerations

You can remove taints from nodes and tolerations from pods as needed.

Procedure

To remove taints and tolerations:

  1. To remove a taint from a node:

    $ oc adm taint nodes <node-name> <key>-

    For example:

    $ oc adm taint nodes ip-10-0-132-248.ec2.internal key1-
    Example output
    node/ip-10-0-132-248.ec2.internal untainted
  2. To remove a toleration from a pod, edit the pod specification to remove the toleration:

    tolerations:
    - key: "key2"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 3600

Topology Manager

Understand and work with Topology Manager.

Topology Manager policies

Topology Manager aligns Pod resources of all Quality of Service (QoS) classes by collecting topology hints from Hint Providers, such as CPU Manager and Device Manager, and using the collected hints to align the Pod resources.

To align CPU resources with other requested resources in a Pod spec, the CPU Manager must be enabled with the static CPU Manager policy.

Topology Manager supports four allocation policies, which you assign in the cpumanager-enabled custom resource (CR):

none policy

This is the default policy and does not perform any topology alignment.

best-effort policy

For each container in a Pod with the best-effort topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this and admits the Pod to the node.

restricted policy

For each container in a Pod with the restricted topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager rejects this pod from the node, resulting in a pod in a Terminated state with a pod admission failure.

single-numa-node policy

For each container in a Pod with the single-numa-node topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.

Setting up Topology Manager

To use Topology Manager, you must enable the LatencySensitive Feature Gate and configure the Topology Manager policy in the cpumanager-enabled custom resource (CR). This file might exist if you have set up CPU Manager. If the file does not exist, you can create the file.

Prequisites
  • Configure the CPU Manager policy to be static. Refer to Using CPU Manager in the Scalability and Performance section.

Procedure

To activate Topololgy Manager:

  1. Edit the FeatureGate object to add the LatencySensitive feature set:

    $ oc edit featuregate/cluster
    apiVersion: config.openshift.io/v1
    kind: FeatureGate
    metadata:
      annotations:
        release.openshift.io/create-only: "true"
      creationTimestamp: 2020-06-05T14:41:09Z
      generation: 2
      managedFields:
      - apiVersion: config.openshift.io/v1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .: {}
              f:release.openshift.io/create-only: {}
          f:spec: {}
        manager: cluster-version-operator
        operation: Update
        time: 2020-06-05T14:41:09Z
      - apiVersion: config.openshift.io/v1
        fieldsType: FieldsV1
        fieldsV1:
          f:spec:
            f:featureSet: {}
        manager: oc
        operation: Update
        time: 2020-06-05T15:21:44Z
      name: cluster
      resourceVersion: "28457"
      selfLink: /apis/config.openshift.io/v1/featuregates/cluster
      uid: e802e840-89ee-4137-a7e5-ca15fd2806f8
    spec:
      featureSet: LatencySensitive (1)
    ...
    1 Add the LatencySensitive feature set in a comma-separated list.
  2. Configure the Topology Manager policy in the cpumanager-enabled custom resource (CR).

    $ oc edit KubeletConfig cpumanager-enabled
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: cpumanager-enabled
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: cpumanager-enabled
      kubeletConfig:
         cpuManagerPolicy: static (1)
         cpuManagerReconcilePeriod: 5s
         topologyManagerPolicy: single-numa-node (2)
    1 This parameter must be static.
    2 Specify your selected Topology Manager policy. Here, the policy is single-numa-node. Acceptable values are: default, best-effort, restricted, single-numa-node.

Pod interactions with Topology Manager policies

The example Pod specs below help illustrate Pod interactions with Topology Manager.

The following Pod runs in the BestEffort QoS class because no resource requests or limits are specified.

spec:
  containers:
  - name: nginx
    image: nginx

The next Pod runs in the Burstable QoS class because requests are less than limits.

spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"

If the selected policy is anything other than none, Topology Manager would not consider either of these Pod specifications.

The last example Pod below runs in the Guaranteed QoS class because requests are equal to limits.

spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"

Topology Manager would consider this Pod. The Topology Manager consults the CPU Manager static policy, which returns the topology of available CPUs. Topology Manager also consults Device Manager to discover the topology of available devices for example.com/device.

Topology Manager will use this information to store the best Topology for this container. In the case of this Pod, CPU Manager and Device Manager will use this stored information at the resource allocation stage.

Resource requests and overcommitment

For each compute resource, a container may specify a resource request and limit. Scheduling decisions are made based on the request to ensure that a node has enough capacity available to meet the requested value. If a container specifies limits, but omits requests, the requests are defaulted to the limits. A container is not able to exceed the specified limit on the node.

The enforcement of limits is dependent upon the compute resource type. If a container makes no request or limit, the container is scheduled to a node with no resource guarantees. In practice, the container is able to consume as much of the specified resource as is available with the lowest local priority. In low resource situations, containers that specify no resource requests are given the lowest quality of service.

Scheduling is based on resources requested, while quota and hard limits refer to resource limits, which can be set higher than requested resources. The difference between request and limit determines the level of overcommit; for instance, if a container is given a memory request of 1Gi and a memory limit of 2Gi, it is scheduled based on the 1Gi request being available on the node, but could use up to 2Gi; so it is 200% overcommitted.

Cluster-level overcommit using the Cluster Resource Override Operator

The Cluster Resource Override Operator is an admission webhook that allows you to control the level of overcommit and manage container density across all the nodes in your cluster. The Operator controls how nodes in specific projects can exceed defined memory and CPU limits.

You must install the Cluster Resource Override Operator using the OKD console or CLI as shown in the following sections. During the installation, you create a ClusterResourceOverride custom resource (CR), where you set the level of overcommit, as shown in the following example:

apiVersion: operator.autoscaling.openshift.io/v1
kind: ClusterResourceOverride
metadata:
-   name: cluster (1)
spec:
   memoryRequestToLimitPercent: 50 (2)
   cpuRequestToLimitPercent: 25 (3)
   limitCPUToMemoryPercent: 200 (4)
1 The name must be cluster.
2 Optional. If a container memory limit has been specified or defaulted, the memory request is overridden to this percentage of the limit, between 1-100. The default is 50.
3 Optional. If a container CPU limit has been specified or defaulted, the CPU request is overridden to this percentage of the limit, between 1-100. The default is 25.
4 Optional. If a container memory limit has been specified or defaulted, the CPU limit is overridden to a percentage of the memory limit, if specified. Scaling 1Gi of RAM at 100 percent is equal to 1 CPU core. This is processed prior to overriding the CPU request (if configured). The default is 200.

The Cluster Resource Override Operator overrides have no effect if limits have not been set on containers. Create a LimitRange object with default limits per individual project or configure limits in Pod specs in order for the overrides to apply.

When configured, overrides can be enabled per-project by applying the following label to the Namespace object for each project:

apiVersion: v1
kind: Namespace
metadata:

....

  labels:
    clusterresourceoverrides.admission.autoscaling.openshift.io/enabled: "true"

....

The Operator watches for the ClusterResourceOverride CR and ensures that the ClusterResourceOverride admission webhook is installed into the same namespace as the Operator.

Installing the Cluster Resource Override Operator using the web console

You can use the OKD web console to install the Cluster Resource Override Operator to help control overcommit in your cluster.

Prerequisites
  • The Cluster Resource Override Operator has no effect if limits have not been set on containers. You must specify default limits for a project using a LimitRange object or configure limits in Pod specs in order for the overrides to apply.

Procedure

To install the Cluster Resource Override Operator using the OKD web console:

  1. In the OKD web console, navigate to HomeProjects

    1. Click Create Project.

    2. Specify clusterresourceoverride-operator as the name of the project.

    3. Click Create.

  2. Navigate to OperatorsOperatorHub.

    1. Choose ClusterResourceOverride Operator from the list of available Operators and click Install.

    2. On the Install Operator page, make sure A specific Namespace on the cluster is selected for Installation Mode.

    3. Make sure clusterresourceoverride-operator is selected for Installed Namespace.

    4. Select an Update Channel and Approval Strategy.

    5. Click Install.

  3. On the Installed Operators page, click ClusterResourceOverride.

    1. On the ClusterResourceOverride Operator details page, click Create Instance.

    2. On the Create ClusterResourceOverride page, edit the YAML template to set the overcommit values as needed:

      apiVersion: operator.autoscaling.openshift.io/v1
      kind: ClusterResourceOverride
      metadata:
        name: cluster (1)
      spec:
        podResourceOverride:
          spec:
            memoryRequestToLimitPercent: 50 (2)
            cpuRequestToLimitPercent: 25 (3)
            limitCPUToMemoryPercent: 200 (4)
      1 The name must be cluster.
      2 Optional. Specify the percentage to override the container memory limit, if used, between 1-100. The default is 50.
      3 Optional. Specify the percentage to override the container CPU limit, if used, between 1-100. The default is 25.
      4 Optional. Specify the percentage to override the container memory limit, if used. Scaling 1Gi of RAM at 100 percent is equal to 1 CPU core. This is processed prior to overriding the CPU request, if configured. The default is 200.
    3. Click Create.

  4. Check the current state of the admission webhook by checking the status of the cluster custom resource:

    1. On the ClusterResourceOverride Operator page, click cluster.

    2. On the ClusterResourceOverride Details age, click YAML. The mutatingWebhookConfigurationRef section appears when the webhook is called.

      apiVersion: operator.autoscaling.openshift.io/v1
      kind: ClusterResourceOverride
      metadata:
        annotations:
          kubectl.kubernetes.io/last-applied-configuration: |
            {"apiVersion":"operator.autoscaling.openshift.io/v1","kind":"ClusterResourceOverride","metadata":{"annotations":{},"name":"cluster"},"spec":{"podResourceOverride":{"spec":{"cpuRequestToLimitPercent":25,"limitCPUToMemoryPercent":200,"memoryRequestToLimitPercent":50}}}}
        creationTimestamp: "2019-12-18T22:35:02Z"
        generation: 1
        name: cluster
        resourceVersion: "127622"
        selfLink: /apis/operator.autoscaling.openshift.io/v1/clusterresourceoverrides/cluster
        uid: 978fc959-1717-4bd1-97d0-ae00ee111e8d
      spec:
        podResourceOverride:
          spec:
            cpuRequestToLimitPercent: 25
            limitCPUToMemoryPercent: 200
            memoryRequestToLimitPercent: 50
      status:
      
      ....
      
          mutatingWebhookConfigurationRef: (1)
            apiVersion: admissionregistration.k8s.io/v1beta1
            kind: MutatingWebhookConfiguration
            name: clusterresourceoverrides.admission.autoscaling.openshift.io
            resourceVersion: "127621"
            uid: 98b3b8ae-d5ce-462b-8ab5-a729ea8f38f3
      
      ....
      1 Reference to the ClusterResourceOverride admission webhook.

Installing the Cluster Resource Override Operator using the CLI

You can use the OKD CLI to install the Cluster Resource Override Operator to help control overcommit in your cluster.

Prerequisites
  • The Cluster Resource Override Operator has no effect if limits have not been set on containers. You must specify default limits for a project using a LimitRange object or configure limits in Pod specs in order for the overrides to apply.

Procedure

To install the Cluster Resource Override Operator using the CLI:

  1. Create a Namespace for the Cluster Resource Override Operator:

    1. Create a Namespace object YAML file (for example, cro-namespace.yaml) for the Cluster Resource Override Operator:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: clusterresourceoverride-operator
    2. Create the Namespace:

      $ oc create -f <file-name>.yaml

      For example:

      $ oc create -f cro-namespace.yaml
  2. Create an Operator Group:

    1. Create an OperatorGroup object YAML file (for example, cro-og.yaml) for the Cluster Resource Override Operator:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: clusterresourceoverride-operator
        namespace: clusterresourceoverride-operator
      spec:
        targetNamespaces:
          - clusterresourceoverride-operator
    2. Create the Operator Group:

      $ oc create -f <file-name>.yaml

      For example:

      $ oc create -f cro-og.yaml
  3. Create a Subscription:

    1. Create a Subscription object YAML file (for example, cro-sub.yaml) for the Cluster Resource Override Operator:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: clusterresourceoverride
        namespace: clusterresourceoverride-operator
      spec:
        channel: "4.6"
        name: clusterresourceoverride
        source: redhat-operators
        sourceNamespace: openshift-marketplace
    2. Create the Subscription:

      $ oc create -f <file-name>.yaml

      For example:

      $ oc create -f cro-sub.yaml
  4. Create a ClusterResourceOverride custom resource (CR) object in the clusterresourceoverride-operator Namespace:

    1. Change to the clusterresourceoverride-operator Namespace.

      $ oc project clusterresourceoverride-operator
    2. Create a ClusterResourceOverride object YAML file (for example, cro-cr.yaml) for the Cluster Resource Override Operator:

      apiVersion: operator.autoscaling.openshift.io/v1
      kind: ClusterResourceOverride
      metadata:
          name: cluster (1)
      spec:
        podResourceOverride:
          spec:
            memoryRequestToLimitPercent: 50 (2)
            cpuRequestToLimitPercent: 25 (3)
            limitCPUToMemoryPercent: 200 (4)
      1 The name must be cluster.
      2 Optional. Specify the percentage to override the container memory limit, if used, between 1-100. The default is 50.
      3 Optional. Specify the percentage to override the container CPU limit, if used, between 1-100. The default is 25.
      4 Optional. Specify the percentage to override the container memory limit, if used. Scaling 1Gi of RAM at 100 percent is equal to 1 CPU core. This is processed prior to overriding the CPU request, if configured. The default is 200.
    3. Create the ClusterResourceOverride object:

      $ oc create -f <file-name>.yaml

      For example:

      $ oc create -f cro-cr.yaml
  5. Verify the current state of the admission webhook by checking the status of the cluster custom resource.

    $ oc get clusterresourceoverride cluster -n clusterresourceoverride-operator -o yaml

    The mutatingWebhookConfigurationRef section appears when the webhook is called.

    Example output
    apiVersion: operator.autoscaling.openshift.io/v1
    kind: ClusterResourceOverride
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"operator.autoscaling.openshift.io/v1","kind":"ClusterResourceOverride","metadata":{"annotations":{},"name":"cluster"},"spec":{"podResourceOverride":{"spec":{"cpuRequestToLimitPercent":25,"limitCPUToMemoryPercent":200,"memoryRequestToLimitPercent":50}}}}
      creationTimestamp: "2019-12-18T22:35:02Z"
      generation: 1
      name: cluster
      resourceVersion: "127622"
      selfLink: /apis/operator.autoscaling.openshift.io/v1/clusterresourceoverrides/cluster
      uid: 978fc959-1717-4bd1-97d0-ae00ee111e8d
    spec:
      podResourceOverride:
        spec:
          cpuRequestToLimitPercent: 25
          limitCPUToMemoryPercent: 200
          memoryRequestToLimitPercent: 50
    status:
    
    ....
    
        mutatingWebhookConfigurationRef: (1)
          apiVersion: admissionregistration.k8s.io/v1beta1
          kind: MutatingWebhookConfiguration
          name: clusterresourceoverrides.admission.autoscaling.openshift.io
          resourceVersion: "127621"
          uid: 98b3b8ae-d5ce-462b-8ab5-a729ea8f38f3
    
    ....
    1 Reference to the ClusterResourceOverride admission webhook.

Configuring cluster-level overcommit

The Cluster Resource Override Operator requires a ClusterResourceOverride custom resource (CR) and a label for each project where you want the Operator to control overcommit.

Prerequisites
  • The Cluster Resource Override Operator has no effect if limits have not been set on containers. You must specify default limits for a project using a LimitRange object or configure limits in Pod specs in order for the overrides to apply.

Procedure

To modify cluster-level overcommit:

  1. Edit the ClusterResourceOverride CR:

    apiVersion: operator.autoscaling.openshift.io/v1
    kind: ClusterResourceOverride
    metadata:
    -   name: cluster
    spec:
       memoryRequestToLimitPercent: 50 (1)
       cpuRequestToLimitPercent: 25 (2)
       limitCPUToMemoryPercent: 200 (3)
    1 Optional. Specify the percentage to override the container memory limit, if used, between 1-100. The default is 50.
    2 Optional. Specify the percentage to override the container CPU limit, if used, between 1-100. The default is 25.
    3 Optional. Specify the percentage to override the container memory limit, if used. Scaling 1Gi of RAM at 100 percent is equal to 1 CPU core. This is processed prior to overriding the CPU request, if configured. The default is 200.
  2. Ensure the following label has been added to the Namespace object for each project where you want the Cluster Resource Override Operator to control overcommit:

    apiVersion: v1
    kind: Namespace
    metadata:
    
    ....
    
      labels:
        clusterresourceoverrides.admission.autoscaling.openshift.io/enabled: "true" (1)
    
    ....
    1 Add this label to each project.

Node-level overcommit

You can use various ways to control overcommit on specific nodes, such as quality of service (QOS) guarantees, CPU limits, or reserve resources. You can also disable overcommit for specific nodes and specific projects.

Understanding compute resources and containers

The node-enforced behavior for compute resources is specific to the resource type.

Understanding container CPU requests

A container is guaranteed the amount of CPU it requests and is additionally able to consume excess CPU available on the node, up to any limit specified by the container. If multiple containers are attempting to use excess CPU, CPU time is distributed based on the amount of CPU requested by each container.

For example, if one container requested 500m of CPU time and another container requested 250m of CPU time, then any extra CPU time available on the node is distributed among the containers in a 2:1 ratio. If a container specified a limit, it will be throttled not to use more CPU than the specified limit. CPU requests are enforced using the CFS shares support in the Linux kernel. By default, CPU limits are enforced using the CFS quota support in the Linux kernel over a 100ms measuring interval, though this can be disabled.

Understanding container memory requests

A container is guaranteed the amount of memory it requests. A container can use more memory than requested, but once it exceeds its requested amount, it could be terminated in a low memory situation on the node. If a container uses less memory than requested, it will not be terminated unless system tasks or daemons need more memory than was accounted for in the node’s resource reservation. If a container specifies a limit on memory, it is immediately terminated if it exceeds the limit amount.

Understanding overcomitment and quality of service classes

A node is overcommitted when it has a pod scheduled that makes no request, or when the sum of limits across all pods on that node exceeds available machine capacity.

In an overcommitted environment, it is possible that the pods on the node will attempt to use more compute resource than is available at any given point in time. When this occurs, the node must give priority to one pod over another. The facility used to make this decision is referred to as a Quality of Service (QoS) Class.

For each compute resource, a container is divided into one of three QoS classes with decreasing order of priority:

Table 2. Quality of Service Classes
Priority Class Name Description

1 (highest)

Guaranteed

If limits and optionally requests are set (not equal to 0) for all resources and they are equal, then the container is classified as Guaranteed.

2

Burstable

If requests and optionally limits are set (not equal to 0) for all resources, and they are not equal, then the container is classified as Burstable.

3 (lowest)

BestEffort

If requests and limits are not set for any of the resources, then the container is classified as BestEffort.

Memory is an incompressible resource, so in low memory situations, containers that have the lowest priority are terminated first:

  • Guaranteed containers are considered top priority, and are guaranteed to only be terminated if they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.

  • Burstable containers under system memory pressure are more likely to be terminated once they exceed their requests and no other BestEffort containers exist.

  • BestEffort containers are treated with the lowest priority. Processes in these containers are first to be terminated if the system runs out of memory.

Understanding how to reserve memory across quality of service tiers

You can use the qos-reserved parameter to specify a percentage of memory to be reserved by a pod in a particular QoS level. This feature attempts to reserve requested resources to exclude pods from lower OoS classes from using resources requested by pods in higher QoS classes.

OKD uses the qos-reserved parameter as follows:

  • A value of qos-reserved=memory=100% will prevent the Burstable and BestEffort QOS classes from consuming memory that was requested by a higher QoS class. This increases the risk of inducing OOM on BestEffort and Burstable workloads in favor of increasing memory resource guarantees for Guaranteed and Burstable workloads.

  • A value of qos-reserved=memory=50% will allow the Burstable and BestEffort QOS classes to consume half of the memory requested by a higher QoS class.

  • A value of qos-reserved=memory=0% will allow a Burstable and BestEffort QoS classes to consume up to the full node allocatable amount if available, but increases the risk that a Guaranteed workload will not have access to requested memory. This condition effectively disables this feature.

Understanding swap memory and QOS

You can disable swap by default on your nodes in order to preserve quality of service (QOS) guarantees. Otherwise, physical resources on a node can oversubscribe, affecting the resource guarantees the Kubernetes scheduler makes during pod placement.

For example, if two guaranteed pods have reached their memory limit, each container could start using swap memory. Eventually, if there is not enough swap space, processes in the pods can be terminated due to the system being oversubscribed.

Failing to disable swap results in nodes not recognizing that they are experiencing MemoryPressure, resulting in pods not receiving the memory they made in their scheduling request. As a result, additional pods are placed on the node to further increase memory pressure, ultimately increasing your risk of experiencing a system out of memory (OOM) event.

If swap is enabled, any out-of-resource handling eviction thresholds for available memory will not work as expected. Take advantage of out-of-resource handling to allow pods to be evicted from a node when it is under memory pressure, and rescheduled on an alternative node that has no such pressure.

Understanding nodes overcommitment

In an overcommitted environment, it is important to properly configure your node to provide best system behavior.

When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.

In an overcommitted environment, it is important to properly configure your node to provide best system behavior.

When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.

To ensure this behavior, OKD configures the kernel to always overcommit memory by setting the vm.overcommit_memory parameter to 1, overriding the default operating system setting.

OKD also configures the kernel not to panic when it runs out of memory by setting the vm.panic_on_oom parameter to 0. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority

You can view the current setting by running the following commands on your nodes:

$ sysctl -a |grep commit
Example output
vm.overcommit_memory = 1
$ sysctl -a |grep panic
Example output
vm.panic_on_oom = 0

The above flags should already be set on nodes, and no further action is required.

You can also perform the following configurations for each node:

  • Disable or enforce CPU limits using CPU CFS quotas

  • Reserve resources for system processes

  • Reserve memory across quality of service tiers

Disabling or enforcing CPU limits using CPU CFS quotas

Nodes by default enforce specified CPU limits using the Completely Fair Scheduler (CFS) quota support in the Linux kernel.

Prerequisites
  1. Obtain the label associated with the static Machine Config Pool CRD for the type of node you want to configure. Perform one of the following steps:

    1. View the Machine Config Pool:

      $ oc describe machineconfigpool <name>

      For example:

      $ oc describe machineconfigpool worker
      Example output
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        creationTimestamp: 2019-02-08T14:52:39Z
        generation: 1
        labels:
          custom-kubelet: small-pods (1)
      
      1 If a label has been added it appears under labels.
    2. If the label is not present, add a key/value pair:

      $ oc label machineconfigpool worker custom-kubelet=small-pods
Procedure
  1. Create a Custom Resource (CR) for your configuration change.

    Sample configuration for a disabling CPU limits
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: disable-cpu-units (1)
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: small-pods (2)
      kubeletConfig:
        cpu-cfs-quota: (3)
          - "false"
    1 Assign a name to CR.
    2 Specify the label to apply the configuration change.
    3 Set the cpu-cfs-quota parameter to false.

If CPU limit enforcement is disabled, it is important to understand the impact that will have on your node:

  • If a container makes a request for CPU, it will continue to be enforced by CFS shares in the Linux kernel.

  • If a container makes no explicit request for CPU, but it does specify a limit, the request will default to the specified limit, and be enforced by CFS shares in the Linux kernel.

  • If a container specifies both a request and a limit for CPU, the request will be enforced by CFS shares in the Linux kernel, and the limit will have no impact on the node.

Reserving resources for system processes

To provide more reliable scheduling and minimize node resource overcommitment, each node can reserve a portion of its resources for use by system daemons that are required to run on your node for your cluster to function. In particular, it is recommended that you reserve resources for incompressible resources such as memory.

Procedure

To explicitly reserve resources for non-pod processes, allocate node resources by specifying resources available for scheduling. For more details, see Allocating Resources for Nodes.

Disabling overcommitment for a node

When enabled, overcommitment can be disabled on each node.

Procedure

To disable overcommitment in a node run the following command on that node:

$ sysctl -w vm.overcommit_memory=0

Project-level limits

To help control overcommit, you can set per-project resource limit ranges, specifying memory and CPU limits and defaults for a project that overcommit cannot exceed.

For information on project-level resource limits, see Additional Resources.

Alternatively, you can disable overcommitment for specific projects.

Disabling overcommitment for a project

When enabled, overcommitment can be disabled per-project. For example, you can allow infrastructure components to be configured independently of overcommitment.

Procedure

To disable overcommitment in a project:

  1. Edit the project object file

  2. Add the following annotation:

    quota.openshift.io/cluster-resource-override-enabled: "false"
  3. Create the project object:

    $ oc create -f <file-name>.yaml

Freeing node resources using garbage collection

Understand and use garbage collection.

Understanding how terminated containers are removed though garbage collection

Container garbage collection can be performed using eviction thresholds.

When eviction thresholds are set for garbage collection, the node tries to keep any container for any pod accessible from the API. If the pod has been deleted, the containers will be as well. Containers are preserved as long the pod is not deleted and the eviction threshold is not reached. If the node is under disk pressure, it will remove containers and their logs will no longer be accessible using oc logs.

  • eviction-soft - A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period.

  • eviction-hard - A hard eviction threshold has no grace period, and if observed, OKD takes immediate action.

If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node would constantly oscillate between true and false. As a consequence, the scheduler could make poor scheduling decisions.

To protect against this oscillation, use the eviction-pressure-transition-period flag to control how long OKD must wait before transitioning out of a pressure condition. OKD will not set an eviction threshold as being met for the specified pressure condition for the period specified before toggling the condition back to false.

Understanding how images are removed though garbage collection

Image garbage collection relies on disk usage as reported by cAdvisor on the node to decide which images to remove from the node.

The policy for image garbage collection is based on two conditions:

  • The percent of disk usage (expressed as an integer) which triggers image garbage collection. The default is 85.

  • The percent of disk usage (expressed as an integer) to which image garbage collection attempts to free. Default is 80.

For image garbage collection, you can modify any of the following variables using a Custom Resource.

Table 3. Variables for configuring image garbage collection
Setting Description

imageMinimumGCAge

The minimum age for an unused image before the image is removed by garbage collection. The default is 2m.

imageGCHighThresholdPercent

The percent of disk usage, expressed as an integer, which triggers image garbage collection. The default is 85.

imageGCLowThresholdPercent

The percent of disk usage, expressed as an integer, to which image garbage collection attempts to free. The default is 80.

Two lists of images are retrieved in each garbage collector run:

  1. A list of images currently running in at least one pod.

  2. A list of images available on a host.

As new containers are run, new images appear. All images are marked with a time stamp. If the image is running (the first list above) or is newly detected (the second list above), it is marked with the current time. The remaining images are already marked from the previous spins. All images are then sorted by the time stamp.

Once the collection starts, the oldest images get deleted first until the stopping criterion is met.

Configuring garbage collection for containers and images

As an administrator, you can configure how OKD performs garbage collection by creating a kubeletConfig object for each Machine Config Pool.

OKD supports only one kubeletConfig object for each Machine Config Pool.

You can configure any combination of the following:

  • soft eviction for containers

  • hard eviction for containers

  • eviction for images

For soft container eviction you can also configure a grace period before eviction.

Prerequisites
  1. Obtain the label associated with the static Machine Config Pool CRD for the type of node you want to configure. Perform one of the following steps:

    1. View the Machine Config Pool:

      $ oc describe machineconfigpool <name>

      For example:

      $ oc describe machineconfigpool worker
      Example output
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        creationTimestamp: 2019-02-08T14:52:39Z
        generation: 1
        labels:
          custom-kubelet: small-pods (1)
      1 If a label has been added it appears under labels.
    2. If the label is not present, add a key/value pair:

      $ oc label machineconfigpool worker custom-kubelet=small-pods
Procedure
  1. Create a Custom Resource (CR) for your configuration change.

    Sample configuration for a container garbage collection CR:
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: worker-kubeconfig (1)
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: small-pods (2)
      kubeletConfig:
        evictionSoft: (3)
          memory.available: "500Mi" (4)
          nodefs.available: "10%"
          nodefs.inodesFree: "5%"
          imagefs.available: "15%"
          imagefs.inodesFree: "10%"
        evictionSoftGracePeriod:  (5)
          memory.available: "1m30s"
          nodefs.available: "1m30s"
          nodefs.inodesFree: "1m30s"
          imagefs.available: "1m30s"
          imagefs.inodesFree: "1m30s"
        evictionHard:
          memory.available: "200Mi"
          nodefs.available: "5%"
          nodefs.inodesFree: "4%"
          imagefs.available: "10%"
          imagefs.inodesFree: "5%"
        evictionPressureTransitionPeriod: 0s (6)
        imageMinimumGCAge: 5m (7)
        imageGCHighThresholdPercent: 80 (8)
        imageGCLowThresholdPercent: 75 (9)
    1 Name for the object.
    2 Selector label.
    3 Type of eviction: EvictionSoft and EvictionHard.
    4 Eviction thresholds based on a specific eviction trigger signal.
    5 Grace periods for the soft eviction. This parameter does not apply to eviction-hard.
    6 The duration to wait before transitioning out of an eviction pressure condition
    7 The minimum age for an unused image before the image is removed by garbage collection.
    8 The percent of disk usage (expressed as an integer) which triggers image garbage collection.
    9 The percent of disk usage (expressed as an integer) to which image garbage collection attempts to free.
  2. Create the object:

    $ oc create -f <file-name>.yaml

    For example:

    $ oc create -f gc-container.yaml
    Example output
    kubeletconfig.machineconfiguration.openshift.io/gc-container created
  3. Verify that garbage collection is active. The Machine Config Pool you specified in the custom resource appears with UPDATING as 'true` until the change is fully implemented:

    $ oc get machineconfigpool
    Example output
    NAME     CONFIG                                   UPDATED   UPDATING
    master   rendered-master-546383f80705bd5aeaba93   True      False
    worker   rendered-worker-b4c51bb33ccaae6fc4a6a5   False     True

Using the Node Tuning Operator

Understand and use the Node Tuning Operator.

The Node Tuning Operator helps you manage node-level tuning by orchestrating the Tuned daemon. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs. The Operator manages the containerized Tuned daemon for OKD as a Kubernetes DaemonSet. It ensures the custom tuning specification is passed to all containerized Tuned daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.

Node-level settings applied by the containerized Tuned daemon are rolled back on an event that triggers a profile change or when the containerized Tuned daemon is terminated gracefully by receiving and handling a termination signal.

The Node Tuning Operator is part of a standard OKD installation in version 4.1 and later.

Accessing an example Node Tuning Operator specification

Use this process to access an example Node Tuning Operator specification.

Procedure
  1. Run:

    $ oc get Tuned/default -o yaml -n openshift-cluster-node-tuning-operator

The default CR is meant for delivering standard node-level tuning for the OKD platform and it can only be modified to set the Operator Management state. Any other custom changes to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs will be combined with the default CR and custom tuning applied to OKD nodes based on node or Pod labels and profile priorities.

While in certain situations the support for Pod labels can be a convenient way of automatically delivering required tuning, this practice is discouraged and strongly advised against, especially in large-scale clusters. The default Tuned CR ships without Pod label matching. If a custom profile is created with Pod label matching, then the functionality will be enabled at that time. The Pod label functionality might be deprecated in future versions of the Node Tuning Operator.

Custom tuning specification

The custom resource (CR) for the operator has two major sections. The first section, profile:, is a list of Tuned profiles and their names. The second, recommend:, defines the profile selection logic.

Multiple custom tuning specifications can co-exist as multiple CRs in the operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized Tuned daemons are updated.

Management state

The Operator Management state is set by adjusting the default Tuned CR. By default, the Operator is in the Managed state and the spec.managementState field is not present in the default Tuned CR. Valid values for the Operator Management state are as follows:

  • Managed: the Operator will update its operands as configuration resources are updated

  • Unmanaged: the Operator will ignore changes to the configuration resources

  • Removed: the Operator will remove its operands and resources the Operator provisioned

Profile data

The profile: section lists Tuned profiles and their names.

profile:
- name: tuned_profile_1
  data: |
    # Tuned profile specification
    [main]
    summary=Description of tuned_profile_1 profile

    [sysctl]
    net.ipv4.ip_forward=1
    # ... other sysctl's or other Tuned daemon plugins supported by the containerized Tuned

# ...

- name: tuned_profile_n
  data: |
    # Tuned profile specification
    [main]
    summary=Description of tuned_profile_n profile

    # tuned_profile_n profile settings

Recommended profiles

The profile: selection logic is defined by the recommend: section of the CR. The recommend: section is a list of items to recommend the profiles based on a selection criteria.

recommend:
<recommend-item-1>
# ...
<recommend-item-n>

The individual items of the list:

- machineConfigLabels: (1)
    <mcLabels> (2)
  match: (3)
  <match> (4)
  priority: <priority> (5)
  profile: <tuned_profile_name> (6)
1 Optional.
2 A dictionary of key/value MachineConfig labels. The keys must be unique.
3 If omitted, profile match is assumed unless a profile with a higher priority matches first or machineConfigLabels is set.
4 An optional list.
5 Profile ordering priority. Lower numbers mean higher priority (0 is the highest priority).
6 A Tuned profile to apply on a match. For example tuned_profile_1.

<match> is an optional list recursively defined as follows:

- label: <label_name> (1)
  value: <label_value> (2)
  type: <label_type> (3)
  <match> (4)
1 Node or Pod label name.
2 Optional node or Pod label value. If omitted, the presence of <label_name> is enough to match.
3 Optional object type ("node" or "pod"). If omitted, "node" is assumed.
4 An optional <match> list.

If <match> is not omitted, all nested <match> sections must also evaluate to true. Otherwise, false is assumed and the profile with the respective <match> section will not be applied or recommended. Therefore, the nesting (child <match> sections) works as logical AND operator. Conversely, if any item of the <match> list matches, the entire <match> list evaluates to true. Therefore, the list acts as logical OR operator.

If machineConfigLabels is defined, MachineConfigPool based matching is turned on for the given recommend: list item. <mcLabels> specifies the labels for a MachineConfig. The MachineConfig is created automatically to apply host settings, such as kernel boot parameters, for the profile <tuned_profile_name>. This involves finding all MachineConfigPools with machineConfigSelector matching <mcLabels> and setting the profile <tuned_profile_name> on all nodes that match the MachineConfigPools' nodeSelectors.

The list items match and machineConfigLabels are connected by the logical OR operator. The match item is evaluated first in a short-circuit manner. Therefore, if it evaluates to true, machineConfigLabels item is not considered.

When using MachineConfigPool based matching, it is advised to group nodes with the same hardware configuration into the same MachineConfigPool. Not following this practice might result in Tuned operands calculating conflicting kernel parameters for two or more nodes sharing the same MachineConfigPool.

Example: Node/Pod label based matching
- match:
  - label: tuned.openshift.io/elasticsearch
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
    type: pod
  priority: 10
  profile: openshift-control-plane-es
- match:
  - label: node-role.kubernetes.io/master
  - label: node-role.kubernetes.io/infra
  priority: 20
  profile: openshift-control-plane
- priority: 30
  profile: openshift-node

The CR above is translated for the containerized Tuned daemon into its recommend.conf file based on the profile priorities. The profile with the highest priority (10) is openshift-control-plane-es and, therefore, it is considered first. The containerized Tuned daemon running on a given node looks to see if there is a Pod running on the same node with the tuned.openshift.io/elasticsearch label set. If not, the entire <match> section evaluates as false. If there is such a Pod with the label, in order for the <match> section to evaluate to true, the node label also needs to be node-role.kubernetes.io/master or node-role.kubernetes.io/infra.

If the labels for the profile with priority 10 matched, openshift-control-plane-es profile is applied and no other profile is considered. If the node/Pod label combination did not match, the second highest priority profile (openshift-control-plane) is considered. This profile is applied if the containerized Tuned Pod runs on a node with labels node-role.kubernetes.io/master or node-role.kubernetes.io/infra.

Finally, the profile openshift-node has the lowest priority of 30. It lacks the <match> section and, therefore, will always match. It acts as a profile catch-all to set openshift-node profile, if no other profile with higher priority matches on a given node.

Decision workflow
Example: MachineConfigPool based matching
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-custom
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile with an additional kernel parameter
      include=openshift-node
      [bootloader]
      cmdline_openshift_node_custom=+skew_tick=1
    name: openshift-node-custom

  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-custom"
    priority: 20
    profile: openshift-node-custom

To minimize node reboots, label the target nodes with a label the MachineConfigPool’s nodeSelector will match, then create the Tuned CR above and finally create the custom MachineConfigPool itself.

Default profiles set on a cluster

The following are the default profiles set on a cluster.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: default
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: "openshift"
    data: |
      [main]
      summary=Optimize systems running OpenShift (parent profile)
      include=${f:virt_check:virtual-guest:throughput-performance}

      [selinux]
      avc_cache_threshold=8192

      [net]
      nf_conntrack_hashsize=131072

      [sysctl]
      net.ipv4.ip_forward=1
      kernel.pid_max=>4194304
      net.netfilter.nf_conntrack_max=1048576
      net.ipv4.conf.all.arp_announce=2
      net.ipv4.neigh.default.gc_thresh1=8192
      net.ipv4.neigh.default.gc_thresh2=32768
      net.ipv4.neigh.default.gc_thresh3=65536
      net.ipv6.neigh.default.gc_thresh1=8192
      net.ipv6.neigh.default.gc_thresh2=32768
      net.ipv6.neigh.default.gc_thresh3=65536
      vm.max_map_count=262144

      [sysfs]
      /sys/module/nvme_core/parameters/io_timeout=4294967295
      /sys/module/nvme_core/parameters/max_retries=10

  - name: "openshift-control-plane"
    data: |
      [main]
      summary=Optimize systems running OpenShift control plane
      include=openshift

      [sysctl]
      # ktune sysctl settings, maximizing i/o throughput
      #
      # Minimal preemption granularity for CPU-bound tasks:
      # (default: 1 msec#  (1 + ilog(ncpus)), units: nanoseconds)
      kernel.sched_min_granularity_ns=10000000
      # The total time the scheduler will consider a migrated process
      # "cache hot" and thus less likely to be re-migrated
      # (system default is 500000, i.e. 0.5 ms)
      kernel.sched_migration_cost_ns=5000000
      # SCHED_OTHER wake-up granularity.
      #
      # Preemption granularity when tasks wake up.  Lower the value to
      # improve wake-up latency and throughput for latency critical tasks.
      kernel.sched_wakeup_granularity_ns=4000000

  - name: "openshift-node"
    data: |
      [main]
      summary=Optimize systems running OpenShift nodes
      include=openshift

      [sysctl]
      net.ipv4.tcp_fastopen=3
      fs.inotify.max_user_watches=65536
      fs.inotify.max_user_instances=8192

  recommend:
  - profile: "openshift-control-plane"
    priority: 30
    match:
    - label: "node-role.kubernetes.io/master"
    - label: "node-role.kubernetes.io/infra"

  - profile: "openshift-node"
    priority: 40

Supported Tuned daemon plug-ins

Excluding the [main] section, the following Tuned plug-ins are supported when using custom profiles defined in the profile: section of the Tuned CR:

  • audio

  • cpu

  • disk

  • eeepc_she

  • modules

  • mounts

  • net

  • scheduler

  • scsi_host

  • selinux

  • sysctl

  • sysfs

  • usb

  • video

  • vm

There is some dynamic tuning functionality provided by some of these plug-ins that is not supported. The following Tuned plug-ins are currently not supported:

  • bootloader

  • script

  • systemd

Configuring the maximum number of Pods per Node

Two parameters control the maximum number of pods that can be scheduled to a node: podsPerCore and maxPods. If you use both options, the lower of the two limits the number of pods on a node.

For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node will be 40.

Prerequisite
  1. Obtain the label associated with the static Machine Config Pool CRD for the type of node you want to configure. Perform one of the following steps:

    1. View the Machine Config Pool:

      $ oc describe machineconfigpool <name>

      For example:

      $ oc describe machineconfigpool worker
      Example output
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        creationTimestamp: 2019-02-08T14:52:39Z
        generation: 1
        labels:
          custom-kubelet: small-pods (1)
      1 If a label has been added it appears under labels.
    2. If the label is not present, add a key/value pair:

      $ oc label machineconfigpool worker custom-kubelet=small-pods
Procedure
  1. Create a Custom Resource (CR) for your configuration change.

    Sample configuration for a max-pods CR
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: set-max-pods (1)
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: small-pods (2)
      kubeletConfig:
        podsPerCore: 10 (3)
        maxPods: 250 (4)
    1 Assign a name to CR.
    2 Specify the label to apply the configuration change.
    3 Specify the number of pods the node can run based on the number of processor cores on the node.
    4 Specify the number of pods the node can run to a fixed value, regardless of the properties of the node.

    Setting podsPerCore to 0 disables this limit.

    In the above example, the default value for podsPerCore is 10 and the default value for maxPods is 250. This means that unless the node has 25 cores or more, by default, podsPerCore will be the limiting factor.

  2. List the Machine Config Pool CRDs to see if the change is applied. The UPDATING column reports True if the change is picked up by the Machine Config Controller:

    $ oc get machineconfigpools
    Example output
    NAME     CONFIG                        UPDATED   UPDATING   DEGRADED
    master   master-9cc2c72f205e103bb534   False     False      False
    worker   worker-8cecd1236b33ee3f8a5e   False     True       False

    Once the change is complete, the UPDATED column reports True.

    $ oc get machineconfigpools
    Example output
    NAME     CONFIG                        UPDATED   UPDATING   DEGRADED
    master   master-9cc2c72f205e103bb534   False     True       False
    worker   worker-8cecd1236b33ee3f8a5e   True      False      False