The Single Root I/O Virtualization (SR-IOV) specification is a standard for a type of PCI device assignment that can share a single device with multiple pods.
You can configure a Single Root I/O Virtualization (SR-IOV) device in your cluster by using the SR-IOV Operator.
SR-IOV can segment a compliant network device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device. The SR-IOV network device driver for the device determines how the VF is exposed in the container:
netdevice driver: A regular kernel network device in the netns of the container
vfio-pci driver: A character device mounted in the container
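The driver type is chosen with the deviceType field of the SriovNetworkNodePolicy custom resource. The following is a minimal sketch that assumes a physical function named ens785f0 and a resource name of netdevicevfs; adjust the resourceName, nicSelector, and numVfs values for your hardware:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-netdevice
  namespace: openshift-sriov-network-operator
spec:
  resourceName: netdevicevfs
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  nicSelector:
    pfNames: ["ens785f0"]    # illustrative PF name
  deviceType: netdevice      # use vfio-pci to expose the VFs as character devices for DPDK
Setting deviceType: vfio-pci instead binds the VFs to the vfio-pci driver so that they are exposed to the container as character devices.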
You can use SR-IOV network devices with additional networks on your OKD cluster installed on bare metal or OpenStack infrastructure for applications that require high bandwidth or low latency.
You can configure multi-network policies for SR-IOV networks. This support is a Technology Preview feature, and SR-IOV additional networks are only supported with kernel NICs. SR-IOV additional networks are not supported for Data Plane Development Kit (DPDK) applications.
Creating multi-network policies on SR-IOV networks might not deliver the same performance to applications compared to SR-IOV networks without a multi-network policy configured.
Multi-network policies for SR-IOV networks are a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
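A multi-network policy selects the SR-IOV additional network through the k8s.v1.cni.cncf.io/policy-for annotation. The following is a minimal sketch that assumes an SR-IOV network named sriov-network in the default namespace, a pod label of app: sriov-workload, and an allowed client CIDR of 192.168.1.0/24; all of these values are illustrative:
apiVersion: k8s.cni.cncf.io/v1beta1
kind: MultiNetworkPolicy
metadata:
  name: allow-sriov-clients
  annotations:
    k8s.v1.cni.cncf.io/policy-for: default/sriov-network   # the SR-IOV network the policy applies to
spec:
  podSelector:
    matchLabels:
      app: sriov-workload
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 192.168.1.0/24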
You can enable SR-IOV on a node by using the following command:
$ oc label node <node_name> feature.node.kubernetes.io/network-sriov.capable="true"
The SR-IOV Network Operator creates and manages the components of the SR-IOV stack. The Operator performs the following functions:
Orchestrates discovery and management of SR-IOV network devices
Generates NetworkAttachmentDefinition custom resources for the SR-IOV Container Network Interface (CNI), as shown in the SriovNetwork sketch after this list
Creates and updates the configuration of the SR-IOV network device plugin
Creates node-specific SriovNetworkNodeState custom resources
Updates the spec.interfaces field in each SriovNetworkNodeState custom resource
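The NetworkAttachmentDefinition objects are rendered from SriovNetwork custom resources that you create. The following is a minimal SriovNetwork sketch; the network name, resourceName (which must match an SriovNetworkNodePolicy), target namespace, and IPAM configuration are illustrative:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  resourceName: netdevicevfs      # must match the resourceName of an SriovNetworkNodePolicy
  networkNamespace: default       # namespace where the NetworkAttachmentDefinition is created
  ipam: |
    {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "routes": [{"dst": "0.0.0.0/0"}],
      "gateway": "10.56.217.1"
    }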
The Operator provisions the following components:
A daemon set that is deployed on worker nodes when the SR-IOV Network Operator starts. The daemon is responsible for discovering and initializing SR-IOV network devices in the cluster.
A dynamic admission controller webhook that validates the Operator custom resource and sets appropriate default values for unset fields.
A dynamic admission controller webhook that provides functionality for patching Kubernetes pod specifications with requests and limits for custom network resources such as SR-IOV VFs. The SR-IOV network resources injector automatically adds the resource field to only the first container in a pod, as illustrated in the sketch after this list.
A device plugin that discovers, advertises, and allocates SR-IOV network virtual function (VF) resources. Device plugins are used in Kubernetes to enable the use of limited resources, typically in physical devices. Device plugins give the Kubernetes scheduler awareness of resource availability, so that the scheduler can schedule pods on nodes with sufficient resources.
A CNI plugin that attaches VF interfaces allocated from the SR-IOV network device plugin directly into a pod.
A CNI plugin that attaches InfiniBand (IB) VF interfaces allocated from the SR-IOV network device plugin directly into a pod.
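As an illustration of the resources injector behavior, a pod that references an SR-IOV network through the k8s.v1.cni.cncf.io/networks annotation has its first container patched with a request and limit for the corresponding resource. The following sketch assumes an SR-IOV network named sriov-network backed by a resource named openshift.io/netdevicevfs; both names are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: sample-container
    image: <image>
    resources:                         # added by the SR-IOV network resources injector
      requests:
        openshift.io/netdevicevfs: "1"
      limits:
        openshift.io/netdevicevfs: "1"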
The SR-IOV Network resources injector and SR-IOV Network Operator webhook are enabled by default and can be disabled by editing the default SriovOperatorConfig custom resource.
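For example, the following minimal sketch of the default SriovOperatorConfig custom resource disables both features; setting the fields back to true re-enables them:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: false          # disables the SR-IOV network resources injector
  enableOperatorWebhook: false   # disables the SR-IOV Network Operator admission webhook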
The SR-IOV Network Operator is supported on the following platforms:
Bare metal
OpenStack
OKD supports the following network interface controllers:
Manufacturer | Model | Vendor ID | Device ID
---|---|---|---
Broadcom | BCM57414 | 14e4 | 16d7
Broadcom | BCM57508 | 14e4 | 1750
Broadcom | BCM57504 | 14e4 | 1751
Intel | X710 | 8086 | 1572
Intel | X710 Backplane | 8086 | 1581
Intel | X710 Base T | 8086 | 15ff
Intel | XL710 | 8086 | 1583
Intel | XXV710 | 8086 | 158b
Intel | E810-CQDA2 | 8086 | 1592
Intel | E810-2CQDA2 | 8086 | 1592
Intel | E810-XXVDA2 | 8086 | 159b
Intel | E810-XXVDA4 | 8086 | 1593
Intel | E810-XXVDA4T | 8086 | 1593
Mellanox | MT27700 Family [ConnectX‑4] | 15b3 | 1013
Mellanox | MT27710 Family [ConnectX‑4 Lx] | 15b3 | 1015
Mellanox | MT27800 Family [ConnectX‑5] | 15b3 | 1017
Mellanox | MT28880 Family [ConnectX‑5 Ex] | 15b3 | 1019
Mellanox | MT28908 Family [ConnectX‑6] | 15b3 | 101b
Mellanox | MT2892 Family [ConnectX‑6 Dx] | 15b3 | 101d
Mellanox | MT2894 Family [ConnectX‑6 Lx] | 15b3 | 101f
Mellanox | MT2910 Family [ConnectX‑7] | 15b3 | 1021
Mellanox | MT42822 BlueField‑2 in ConnectX‑6 NIC mode | 15b3 | a2d6
Pensando [1] | DSC-25 dual-port 25G distributed services card for ionic driver | 0x1dd8 | 0x1002
Pensando [1] | DSC-100 dual-port 100G distributed services card for ionic driver | 0x1dd8 | 0x1003
Silicom | STS Family | 8086 | 1591
[1] OpenShift SR-IOV is supported, but you must set a static virtual function (VF) media access control (MAC) address using the SR-IOV CNI config file when using SR-IOV.
For the most up-to-date list of supported cards and compatible OKD versions, see the OpenShift Single Root I/O Virtualization (SR-IOV) and PTP hardware networks Support Matrix.
The SR-IOV Network Operator searches your cluster for SR-IOV capable network devices on worker nodes. The Operator creates and updates a SriovNetworkNodeState custom resource (CR) for each worker node that provides a compatible SR-IOV network device.
The CR is assigned the same name as the worker node.
The status.interfaces list provides information about the network devices on a node.
Do not modify a SriovNetworkNodeState object. The Operator creates and manages these resources automatically.
The following YAML is an example of a SriovNetworkNodeState object created by the SR-IOV Network Operator:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
name: node-25 (1)
namespace: openshift-sriov-network-operator
ownerReferences:
- apiVersion: sriovnetwork.openshift.io/v1
blockOwnerDeletion: true
controller: true
kind: SriovNetworkNodePolicy
name: default
spec:
dpConfigVersion: "39824"
status:
interfaces: (2)
- deviceID: "1017"
driver: mlx5_core
mtu: 1500
name: ens785f0
pciAddress: "0000:18:00.0"
totalvfs: 8
vendor: 15b3
- deviceID: "1017"
driver: mlx5_core
mtu: 1500
name: ens785f1
pciAddress: "0000:18:00.1"
totalvfs: 8
vendor: 15b3
- deviceID: 158b
driver: i40e
mtu: 1500
name: ens817f0
pciAddress: 0000:81:00.0
totalvfs: 64
vendor: "8086"
- deviceID: 158b
driver: i40e
mtu: 1500
name: ens817f1
pciAddress: 0000:81:00.1
totalvfs: 64
vendor: "8086"
- deviceID: 158b
driver: i40e
mtu: 1500
name: ens803f0
pciAddress: 0000:86:00.0
totalvfs: 64
vendor: "8086"
syncStatus: Succeeded
(1) The value of the name field is the same as the name of the worker node.
(2) The interfaces stanza includes a list of all of the SR-IOV devices discovered by the Operator on the worker node.
You can run a remote direct memory access (RDMA) or a Data Plane Development Kit (DPDK) application in a pod with an SR-IOV VF attached.
This example shows a pod using a virtual function (VF) in RDMA mode:
Pod spec that uses RDMA mode
apiVersion: v1
kind: Pod
metadata:
name: rdma-app
annotations:
k8s.v1.cni.cncf.io/networks: sriov-rdma-mlnx
spec:
containers:
- name: testpmd
image: <RDMA_image>
imagePullPolicy: IfNotPresent
securityContext:
runAsUser: 0
capabilities:
add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
command: ["sleep", "infinity"]
The following example shows a pod with a VF in DPDK mode:
Pod spec that uses DPDK mode
apiVersion: v1
kind: Pod
metadata:
name: dpdk-app
annotations:
k8s.v1.cni.cncf.io/networks: sriov-dpdk-net
spec:
containers:
- name: testpmd
image: <DPDK_image>
securityContext:
runAsUser: 0
capabilities:
add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
volumeMounts:
- mountPath: /dev/hugepages
name: hugepage
resources:
limits:
memory: "1Gi"
cpu: "2"
hugepages-1Gi: "4Gi"
requests:
memory: "1Gi"
cpu: "2"
hugepages-1Gi: "4Gi"
command: ["sleep", "infinity"]
volumes:
- name: hugepage
emptyDir:
medium: HugePages
An optional library, app-netutil, provides several API methods for gathering network information about a pod from within a container running within that pod.
This library can assist with integrating SR-IOV virtual functions (VFs) in Data Plane Development Kit (DPDK) mode into the container. The library provides both a Golang API and a C API.
Currently there are three API methods implemented:
GetCPUInfo()
This function determines which CPUs are available to the container and returns the list.
GetHugepages()
This function determines the amount of huge page memory requested in the Pod spec for each container and returns the values.
GetInterfaces()
This function determines the set of interfaces in the container and returns the list. The return value includes the interface type and type-specific data for each interface.
The repository for the library includes a sample Dockerfile to build a container image, dpdk-app-centos. The container image can run one of the following DPDK sample applications, depending on an environment variable in the pod specification: l2fwd, l3fwd, or testpmd. The container image provides an example of integrating the app-netutil library into the container image itself. The library can also integrate into an init container. The init container can collect the required data and pass the data to an existing DPDK workload, as in the sketch after this paragraph.
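A minimal sketch of the init container pattern follows. It assumes a hypothetical image, <app_netutil_init_image>, that runs the app-netutil library and writes the collected data to a shared volume; the image name, volume name, and output path are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-dpdk-net
spec:
  initContainers:
  - name: netutil-init
    image: <app_netutil_init_image>    # hypothetical image that runs the app-netutil library
    volumeMounts:
    - mountPath: /netinfo              # illustrative path for the collected network information
      name: shared-netinfo
  containers:
  - name: testpmd
    image: <DPDK_image>
    volumeMounts:
    - mountPath: /netinfo              # the DPDK workload reads the data collected by the init container
      name: shared-netinfo
  volumes:
  - name: shared-netinfo
    emptyDir: {}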
When a pod specification includes a resource request or limit for huge pages, the Network Resources Injector automatically adds Downward API fields to the pod specification to provide the huge pages information to the container.
The Network Resources Injector adds a volume that is named podnetinfo and is mounted at /etc/podnetinfo for each container in the pod. The volume uses the Downward API and includes a file for huge pages requests and limits. The file naming convention is as follows:
/etc/podnetinfo/hugepages_1G_request_<container-name>
/etc/podnetinfo/hugepages_1G_limit_<container-name>
/etc/podnetinfo/hugepages_2M_request_<container-name>
/etc/podnetinfo/hugepages_2M_limit_<container-name>
The paths specified in the previous list are compatible with the app-netutil library. By default, the library is configured to search for resource information in the /etc/podnetinfo directory. If you choose to specify the Downward API path items manually, the app-netutil library searches for the following paths in addition to the paths in the previous list:
/etc/podnetinfo/hugepages_request
/etc/podnetinfo/hugepages_limit
/etc/podnetinfo/hugepages_1G_request
/etc/podnetinfo/hugepages_1G_limit
/etc/podnetinfo/hugepages_2M_request
/etc/podnetinfo/hugepages_2M_limit
As with the paths that the Network Resources Injector can create, the paths in the preceding list can optionally end with a _<container-name> suffix.
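The following is a minimal sketch of the kind of Downward API volume that the Network Resources Injector adds for a container named testpmd that requests 1Gi huge pages; the container name is illustrative, and the exact patch produced by the injector may differ:
volumes:
- name: podnetinfo
  downwardAPI:
    items:
    - path: hugepages_1G_request_testpmd
      resourceFieldRef:
        containerName: testpmd
        resource: requests.hugepages-1Gi
    - path: hugepages_1G_limit_testpmd
      resourceFieldRef:
        containerName: testpmd
        resource: limits.hugepages-1Gi
Each container in the pod also receives a corresponding volumeMounts entry that mounts this volume at /etc/podnetinfo.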