Non-uniform memory access (NUMA) architecture is a multiprocessor architecture model where CPUs do not access all memory in all locations at the same speed. Instead, CPUs can gain faster access to memory that is in closer proximity to them, or local to them, but slower access to memory that is further away.
A CPU with multiple memory controllers can use any available memory across CPU complexes, regardless of where the memory is located. However, this increased flexibility comes at the expense of performance.
NUMA resource topology refers to the physical locations of CPUs, memory, and PCI devices relative to each other in a NUMA zone. In a NUMA architecture, a NUMA zone is a group of CPUs that has its own processors and memory. Colocated resources are said to be in the same NUMA zone, and CPUs in a zone have faster access to the same local memory than CPUs outside of that zone. A workload that a CPU processes using memory outside its NUMA zone runs slower than a workload processed entirely within a single NUMA zone. For I/O-constrained workloads, the network interface on a distant NUMA zone slows down how quickly information can reach the application.
Applications can achieve better performance by containing data and processing within the same NUMA zone. For high-performance workloads and applications, such as telecommunications workloads, the cluster must process pod workloads in a single NUMA zone so that the workload can operate to specification.
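To see how many NUMA zones a worker node exposes, and which CPUs and how much memory belong to each zone, you can inspect the node directly. The following commands are a minimal sketch: <node_name> is a placeholder, and the numactl command is available only if the package is installed on the host.
$ oc debug node/<node_name> -- chroot /host lscpu | grep -i numa
$ oc debug node/<node_name> -- chroot /host numactl --hardware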
You must enable the NUMA functionality for OKD Virtualization VMs to prevent performance degradation on nodes with multiple NUMA zones. This feature is vital for high-performance and latency-sensitive workloads.
Without NUMA awareness, a VM’s virtual CPUs might run on one physical NUMA zone, while its memory is allocated on another. This "cross-node" communication significantly increases latency and reduces memory bandwidth, and can cause the interconnect buses which link the NUMA zones to become a bottleneck.
When you enable the NUMA functionality for OKD Virtualization VMs, you allow the host to pass its physical topology directly to the VM’s guest operating system (OS). The guest OS can then make intelligent, NUMA-aware decisions about scheduling and memory allocation. This ensures that process threads and memory are kept on the same physical NUMA node. By aligning the virtual topology with the physical one, you minimize latency and maximize performance.
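After you enable NUMA passthrough for a VM, you can check the topology that the guest OS was given. As a minimal sketch for a Linux guest, assuming that you can log in to the VM, for example with virtctl console <vm_name>, run the following command inside the guest:
$ lscpu | grep -i numa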
Before you can enable NUMA functionality with OKD Virtualization VMs, you must ensure that your environment meets the following prerequisites.
Worker nodes must have huge pages enabled. A verification sketch for both prerequisites follows the KubeletConfig example below.
The KubeletConfig object on worker nodes must be configured with the cpuManagerPolicy: static spec to guarantee dedicated CPU allocation, which is a prerequisite for NUMA pinning.
Example KubeletConfig object with the cpuManagerPolicy: static spec
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpu-numa-static-config
spec:
  kubeletConfig:
    cpuManagerPolicy: static
# ...
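You can confirm that both prerequisites are met on a worker node. The following commands are a minimal sketch: <node_name> is a placeholder, and the KubeletConfig name matches the preceding example.
$ oc describe node/<node_name> | grep -i hugepages
$ oc get kubeletconfig cpu-numa-static-config -o jsonpath='{.spec.kubeletConfig.cpuManagerPolicy}'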
VM owners can enable NUMA with ComputeExclusive (CX) instance types, which are specifically designed for high-performance, compute-intensive workloads, and are configured to use NUMA features.
For information about creating VMs using a CX instance type, see Creating virtual machines from instance types.
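If the common instance types are installed in your cluster, you can list the available CX instance types before you choose one. This is a minimal sketch; the cx1 series name reflects the upstream common instance types and might differ in your environment.
$ oc get virtualmachineclusterinstancetype | grep cx1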
Hot plugging is the ability to add resources like memory or CPU dynamically to a VM while it is running.
Default OKD Virtualization hot plug multipliers can cause VMs to request an excessive number of sockets. For example, if your VM requests 10 sockets, the default hot plug behavior multiplies this by 4, which means that the total request is 40 sockets. This can exceed the recommended number of CPUs supported by the Kernel-based Virtual Machine (KVM), which can cause deployment failures.
You can keep VM resource requests aligned with NUMA and optimize performance for resource-intensive workloads by disabling the VM’s default hot plug capability.
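You can check the effective hot plug headroom of a running VM by reading the maxSockets value from the VirtualMachineInstance spec. This is a minimal sketch; <vm_name> is a placeholder.
$ oc get vmi <vm_name> -o jsonpath='{.spec.domain.cpu.maxSockets}'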
As a cluster administrator, you can disable the CPU hot plug by instance type. This is the recommended approach to standardize VM configurations and ensure NUMA-aware CPU allocation without hot plugs for specific instance types.
When a VM is created by using an instance type where the CPU hot plug is disabled, the VM inherits these settings and the CPU hot plug is disabled for that VM.
You have installed the OpenShift CLI (oc).
Create a YAML file for a VirtualMachineClusterInstancetype custom resource (CR). Add a maxSockets spec to the instance type that you want to configure:
Example VirtualMachineClusterInstancetype CR
apiVersion: instancetype.kubevirt.io/v1beta1
kind: VirtualMachineClusterInstancetype
metadata:
  name: cx1.mycustom-numa-instance
spec:
  cpu:
    dedicatedCPUPlacement: true
    isolateEmulatorThread: true
    numa:
      guestMappingPassthrough: {}
    guest: 8
    maxSockets: 8
  memory:
    guest: 16Gi
    hugepages:
      pageSize: 1Gi
where:
dedicatedCPUPlacement: Specifies whether dedicated resources are allocated to the VM instance. If this is set to true, the VM’s vCPUs are pinned to physical host CPUs. This is often used for high-performance workloads to minimize scheduling jitter.
isolateEmulatorThread: Specifies whether the QEMU emulator thread should be isolated and run on a dedicated physical CPU core. This is a performance optimization that is typically used alongside the dedicatedCPUPlacement spec.
numa: Specifies the NUMA topology configuration for the VM.
guestMappingPassthrough: Specifies that the VM’s NUMA topology should directly pass through the NUMA topology of the underlying host machine. This is critical for applications that are NUMA-aware and require optimal performance.
guest (under cpu): Specifies the total number of vCPUs to be allocated to the VM.
maxSockets: Specifies the maximum number of CPU sockets that the VM is allowed to have.
memory: Specifies the memory configuration for the VM.
guest (under memory): Specifies the total amount of memory to be allocated to the VM.
hugepages: Specifies the configuration related to huge pages.
pageSize: Specifies the size of the huge pages to be used for the VM’s memory.
Create the VirtualMachineClusterInstancetype CR by running the following command:
$ oc create -f <filename>.yaml
Create a VM that uses the updated VirtualMachineClusterInstancetype configuration.
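For example, the VM can reference the instance type by name in its spec.instancetype stanza. The following is a minimal sketch rather than a complete VM definition; the VM name is a placeholder and the remaining fields are omitted.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-numa-vm
spec:
  instancetype:
    kind: VirtualMachineClusterInstancetype
    name: cx1.mycustom-numa-instance
# ...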
Check the configuration of the created VM by running the following command and inspecting the output:
$ oc get vmi <vm_name> -o yaml
Example output
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: example-vmi
  labels:
    instancetype.kubevirt.io/cluster-instancetype: cx1.mycustom-numa-instance
spec:
  domain:
    cpu:
      dedicatedCPUPlacement: true
      isolateEmulatorThread: true
      sockets: 8
      cores: 1
      threads: 1
      numa:
        guestMappingPassthrough: {}
      guest: 8
      maxSockets: 8
# ...
The update has been applied successfully if, in the spec.domain.cpu section of the output:
The sockets value matches the maxSockets and guest values from the instance type, which ensures that no extra hot plug slots are configured.
The dedicatedCPUPlacement and isolateEmulatorThread fields are present and set to true.
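As a quicker spot check, you can extract only the CPU section of the VMI spec. This sketch uses the same <vm_name> placeholder as the previous command.
$ oc get vmi <vm_name> -o jsonpath='{.spec.domain.cpu}'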
As a VM owner, you can adjust or disable the CPU hot plug for individual VMs. This is the simplest solution for large, performance-critical VMs where you want to ensure a fixed CPU allocation from the start.
You have installed the OpenShift CLI (oc).
Modify the VirtualMachine custom resource (CR) of the VM that you want to configure by adding the maxSockets and sockets specs:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: large-numa-vm
spec:
  template:
    spec:
      domain:
        cpu:
          maxSockets: 10
          sockets: 10
          cores: 1
          threads: 1
By explicitly setting maxSockets to the same value as sockets, you specify that no additional capacity is reserved for hot plugging, which ensures that the cores the VM requests are the cores that are actually allocated.
Apply the changes to the VirtualMachine CR by running the following command:
$ oc apply -f <filename>.yaml
Check that you have configured the maxSockets and sockets values correctly by running the following commands:
$ oc get vmi <vm_name> -o jsonpath='{.spec.domain.cpu.maxSockets}'
$ oc get vmi <vm_name> -o jsonpath='{.spec.domain.cpu.sockets}'
If the configuration was successful, the outputs are the maxSockets and sockets values that you set in the previous step:
Example output
10
If you are a cluster administrator and want to disable hot plugging for an entire cluster, you must modify the maxHotplugRatio setting in the HyperConverged custom resource (CR).
You have installed the OpenShift CLI (oc).
You have installed the KubeVirt HyperConverged Cluster Operator.
Modify the HyperConverged CR and set the maxHotplugRatio value to 1.0:
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: kubevirt-hyperconverged
spec:
  # ...
  kubevirtConfiguration:
    developerConfiguration:
      maxHotplugRatio: 1.0
# ...
Apply the changes to the HyperConverged CR by running the following command:
$ oc apply -f <filename>.yaml
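Alternatively, you can make the same change in place with a merge patch instead of applying a file. This is a minimal sketch that assumes the field path shown in the preceding example:
$ oc patch hyperconverged kubevirt-hyperconverged -n kubevirt-hyperconverged --type=merge -p '{"spec":{"kubevirtConfiguration":{"developerConfiguration":{"maxHotplugRatio":1.0}}}}'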
Check that you have configured the maxHotplugRatio value correctly by running the following command:
$ oc get hyperconverged kubevirt-hyperconverged -n kubevirt-hyperconverged -o jsonpath='{.spec.liveUpdateConfiguration.maxHotplugRatio}'
If the configuration was successful, the output is the maxHotplugRatio value that you set in the previous step:
Example output
1.0
When you use NUMA topology with OKD Virtualization VMs, certain limitations can impact performance and VM management.
The host scheduler cannot guarantee assigning specific NUMA nodes to a VM. For example, if a VM is rescheduled to a different host machine because of a restart or maintenance, the new host might have a different physical NUMA layout. This means that the VM could be presented with an asymmetrical NUMA topology that reflects the new host’s configuration, rather than its original or desired layout. This change can have a negative impact on the VM’s performance.
Migrating a NUMA-enabled VM to a different host node can be challenging if the destination node’s NUMA topology differs significantly from the source node’s. A mismatch between the NUMA layouts of the source and destination can lead to a degradation of the VM’s performance after the migration is complete.
There is no explicit support for passing GPU NUMA zone information to the VM. This means that the VM’s guest operating system is not aware of the NUMA locality of PCI devices such as GPUs. For workloads that heavily rely on these devices, this lack of awareness could potentially lead to reduced performance if the GPU’s memory is not local to the accessing CPU within the NUMA architecture.
Migration outcomes for VMs are dependent on the configured Topology Manager policies.
These policies determine how CPU and memory resources are allocated with respect to the physical NUMA nodes of the host.
There are four available policies: None, single-numa-node, best-effort, and restricted.
The following table outlines which policies are supported for different VM configurations, and their effect on live migration. A sketch that shows how to set the Topology Manager policy on worker nodes follows the table.
A small VM is defined as a VM with fewer total cores than half of the cores in a NUMA node.
A large VM is defined as a VM with more total cores than half of the cores in a NUMA node.
An extra large VM is defined as a VM with more cores than a single NUMA node provides.
| VM size | Topology Manager policy | Tested support status |
|---|---|---|
| Any | single-numa-node | The VM fails to start because the pod requests more CPUs than a single NUMA node on the host can provide. This triggers a topology affinity error during scheduling, which is expected behavior given the node’s hardware limits. |
| Any | None | Live migration does not work. This is a known issue. The process ends with an incorrect memnode allocation error, and libvirt rejects the XML manifest generated by KubeVirt. See the release notes for additional information. |
| Small | None | Live migration works, as expected. |
| Small | single-numa-node | Live migration works, as expected. |
| Small | best-effort | Live migration works, as expected. |
| Small | restricted | Live migration works, as expected. |
| Large | single-numa-node | Live migration works, as expected. |
| Large | best-effort | Live migration works, as expected. |
| Large | restricted | Live migration works, as expected. |
| Extra large | None | Live migration works, as expected. |
| Extra large | best-effort | Live migration works, as expected. |
| Extra large | restricted | VMs do not work, as expected. |
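The Topology Manager policy in the preceding table is configured on the kubelet of the worker nodes. The following KubeletConfig CR is a minimal sketch; the machineConfigPoolSelector label is an assumed example and must match a label on your worker machine config pool.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: topology-manager-policy
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: "" # assumed example label
  kubeletConfig:
    cpuManagerPolicy: static
    topologyManagerPolicy: single-numa-node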