×

Workload partitioning separates compute node CPU resources into distinct CPU sets. The primary objective is to keep platform pods on the specified cores to avoid interrupting the CPUs the customer workloads are running on.

Workload partitioning isolates OKD services, cluster management workloads, and infrastructure pods to run on a reserved set of CPUs. This ensures that the remaining CPUs in the cluster deployment are untouched and available exclusively for non-platform workloads. The minimum number of reserved CPUs required for the cluster management is four CPU Hyper-Threads (HTs).

In the context of enabling workload partitioning and managing CPU resources effectively, nodes that are not configured correctly will not be permitted to join the cluster through a node admission webhook. When the workload partitioning feature is enabled, the machine config pools for control plane and worker will be supplied with configurations for nodes to use. Adding new nodes to these pools will make sure they are correctly configured before joining the cluster.

Currently, nodes must have uniform configurations per machine config pool to ensure that correct CPU affinity is set across all nodes within that pool. After admission, nodes within the cluster identify themselves as supporting a new resource type called management.workload.openshift.io/cores and accurately report their CPU capacity. Workload partitioning can be enabled during cluster installation only by adding the additional field cpuPartitioningMode to the install-config.yaml file.

When workload partitioning is enabled, the management.workload.openshift.io/cores resource allows the scheduler to correctly assign pods based on the cpushares capacity of the host, not just the default cpuset. This ensures more precise allocation of resources for workload partitioning scenarios.

Workload partitioning ensures that CPU requests and limits specified in the pod’s configuration are respected. In OKD 4.16 or later, accurate CPU usage limits are set for platform pods through CPU partitioning. As workload partitioning uses the custom resource type of management.workload.openshift.io/cores, the values for requests and limits are the same due to a requirement by Kubernetes for extended resources. However, the annotations modified by workload partitioning correctly reflect the desired limits.

Extended resources cannot be overcommitted, so request and limit must be equal if both are present in a container spec.

Enabling workload partitioning

With workload partitioning, cluster management pods are annotated to correctly partition them into a specified CPU affinity. These pods operate normally within the minimum size CPU configuration specified by the reserved value in the Performance Profile. Additional Day 2 Operators that make use of workload partitioning should be taken into account when calculating how many reserved CPU cores should be set aside for the platform.

Workload partitioning isolates user workloads from platform workloads using standard Kubernetes scheduling capabilities.

Workload partitioning can only be enabled during cluster installation. You cannot disable workload partitioning postinstallation.

Use this procedure to enable workload partitioning cluster wide:

Procedure
  • In the install-config.yaml file, add the additional field cpuPartitioningMode and set it to AllNodes.

    apiVersion: v1
    baseDomain: devcluster.openshift.com
    cpuPartitioningMode: AllNodes (1)
    compute:
      - architecture: amd64
        hyperthreading: Enabled
        name: worker
        platform: {}
        replicas: 3
    controlPlane:
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform: {}
      replicas: 3
    1 Sets up a cluster for CPU partitioning at install time. The default value is None.

Performance profiles and workload partitioning

Applying a performance profile allows you to make use of the workload partitioning feature. An appropriately configured performance profile specifies the isolated and reserved CPUs. The recommended way to create a performance profile is to use the Performance Profile Creator (PPC) tool to create the performance profile.

Sample performance profile configuration

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  # if you change this name make sure the 'include' line in TunedPerformancePatch.yaml
  # matches this name: include=openshift-node-performance-${PerformanceProfile.metadata.name}
  # Also in file 'validatorCRs/informDuValidator.yaml':
  # name: 50-performance-${PerformanceProfile.metadata.name}
  name: openshift-node-performance-profile
  annotations:
    ran.openshift.io/reference-configuration: "ran-du.redhat.com"
spec:
  additionalKernelArgs:
    - "rcupdate.rcu_normal_after_boot=0"
    - "efi=runtime"
    - "vfio_pci.enable_sriov=1"
    - "vfio_pci.disable_idle_d3=1"
    - "module_blacklist=irdma"
  cpu:
    isolated: $isolated
    reserved: $reserved
  hugepages:
    defaultHugepagesSize: $defaultHugepagesSize
    pages:
      - size: $size
        count: $count
        node: $node
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/$mcp: ""
  nodeSelector:
    node-role.kubernetes.io/$mcp: ''
  numa:
    topologyPolicy: "restricted"
  # To use the standard (non-realtime) kernel, set enabled to false
  realTimeKernel:
    enabled: true
  workloadHints:
    # WorkloadHints defines the set of upper level flags for different type of workloads.
    # See https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md#workloadhints
    # for detailed descriptions of each item.
    # The configuration below is set for a low latency, performance mode.
    realTime: true
    highPowerConsumption: false
    perPodPowerManagement: false
Table 1. PerformanceProfile CR options for single-node OpenShift clusters
PerformanceProfile CR field Description

metadata.name

Ensure that name matches the following fields set in related GitOps ZTP custom resources (CRs):

  • include=openshift-node-performance-${PerformanceProfile.metadata.name} in TunedPerformancePatch.yaml

  • name: 50-performance-${PerformanceProfile.metadata.name} in validatorCRs/informDuValidator.yaml

spec.additionalKernelArgs

"efi=runtime" Configures UEFI secure boot for the cluster host.

spec.cpu.isolated

Set the isolated CPUs. Ensure all of the Hyper-Threading pairs match.

The reserved and isolated CPU pools must not overlap and together must span all available cores. CPU cores that are not accounted for cause an undefined behaviour in the system.

spec.cpu.reserved

Set the reserved CPUs. When workload partitioning is enabled, system processes, kernel threads, and system container threads are restricted to these CPUs. All CPUs that are not isolated should be reserved.

spec.hugepages.pages

  • Set the number of huge pages (count)

  • Set the huge pages size (size).

  • Set node to the NUMA node where the hugepages are allocated (node)

spec.realTimeKernel

Set enabled to true to use the realtime kernel.

spec.workloadHints

Use workloadHints to define the set of top level flags for different type of workloads. The example configuration configures the cluster for low latency and high performance.