About the Node Tuning Operator

The Node Tuning Operator helps you manage node-level tuning by orchestrating the Tuned daemon. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs. The Operator manages the containerized Tuned daemon for OKD as a Kubernetes daemon set. It ensures the custom tuning specification is passed to all containerized Tuned daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.

Node-level settings applied by the containerized Tuned daemon are rolled back on an event that triggers a profile change or when the containerized Tuned daemon is terminated gracefully by receiving and handling a termination signal.

The Node Tuning Operator is part of a standard OKD installation in version 4.1 and later.

Accessing an example Node Tuning Operator specification

Use this process to access an example Node Tuning Operator specification.

Procedure
  1. Run:

    $ oc get Tuned/default -o yaml -n openshift-cluster-node-tuning-operator

The default CR is meant for delivering standard node-level tuning for the OKD platform and it can only be modified to set the Operator Management state. Any other custom changes to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs will be combined with the default CR and custom tuning applied to OKD nodes based on node or pod labels and profile priorities.

While in certain situations the support for pod labels can be a convenient way of automatically delivering required tuning, this practice is discouraged and strongly advised against, especially in large-scale clusters. The default Tuned CR ships without pod label matching. If a custom profile is created with pod label matching, then the functionality will be enabled at that time. The pod label functionality might be deprecated in future versions of the Node Tuning Operator.

Default profiles set on a cluster

The following are the default profiles set on a cluster.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: default
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: "openshift"
    data: |
      [main]
      summary=Optimize systems running OpenShift (parent profile)
      include=${f:virt_check:virtual-guest:throughput-performance}

      [selinux]
      avc_cache_threshold=8192

      [net]
      nf_conntrack_hashsize=131072

      [sysctl]
      net.ipv4.ip_forward=1
      kernel.pid_max=>4194304
      net.netfilter.nf_conntrack_max=1048576
      net.ipv4.conf.all.arp_announce=2
      net.ipv4.neigh.default.gc_thresh1=8192
      net.ipv4.neigh.default.gc_thresh2=32768
      net.ipv4.neigh.default.gc_thresh3=65536
      net.ipv6.neigh.default.gc_thresh1=8192
      net.ipv6.neigh.default.gc_thresh2=32768
      net.ipv6.neigh.default.gc_thresh3=65536
      vm.max_map_count=262144

      [sysfs]
      /sys/module/nvme_core/parameters/io_timeout=4294967295
      /sys/module/nvme_core/parameters/max_retries=10

  - name: "openshift-control-plane"
    data: |
      [main]
      summary=Optimize systems running OpenShift control plane
      include=openshift

      [sysctl]
      # ktune sysctl settings, maximizing i/o throughput
      #
      # Minimal preemption granularity for CPU-bound tasks:
      # (default: 1 msec#  (1 + ilog(ncpus)), units: nanoseconds)
      kernel.sched_min_granularity_ns=10000000
      # The total time the scheduler will consider a migrated process
      # "cache hot" and thus less likely to be re-migrated
      # (system default is 500000, i.e. 0.5 ms)
      kernel.sched_migration_cost_ns=5000000
      # SCHED_OTHER wake-up granularity.
      #
      # Preemption granularity when tasks wake up.  Lower the value to
      # improve wake-up latency and throughput for latency critical tasks.
      kernel.sched_wakeup_granularity_ns=4000000

  - name: "openshift-node"
    data: |
      [main]
      summary=Optimize systems running OpenShift nodes
      include=openshift

      [sysctl]
      net.ipv4.tcp_fastopen=3
      fs.inotify.max_user_watches=65536
      fs.inotify.max_user_instances=8192

  recommend:
  - profile: "openshift-control-plane"
    priority: 30
    match:
    - label: "node-role.kubernetes.io/master"
    - label: "node-role.kubernetes.io/infra"

  - profile: "openshift-node"
    priority: 40

Verifying that the Tuned profiles are applied

Use this procedure to check which Tuned profiles are applied on every node.

Procedure
  1. Check which Tuned pods are running on each node:

    $ oc get pods -n openshift-cluster-node-tuning-operator -o wide
    Example output
    NAME                                            READY   STATUS    RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
    cluster-node-tuning-operator-599489d4f7-k4hw4   1/1     Running   0          6d2h   10.129.0.76    ip-10-0-145-113.eu-west-3.compute.internal   <none>           <none>
    tuned-2jkzp                                     1/1     Running   1          6d3h   10.0.145.113   ip-10-0-145-113.eu-west-3.compute.internal   <none>           <none>
    tuned-g9mkx                                     1/1     Running   1          6d3h   10.0.147.108   ip-10-0-147-108.eu-west-3.compute.internal   <none>           <none>
    tuned-kbxsh                                     1/1     Running   1          6d3h   10.0.132.143   ip-10-0-132-143.eu-west-3.compute.internal   <none>           <none>
    tuned-kn9x6                                     1/1     Running   1          6d3h   10.0.163.177   ip-10-0-163-177.eu-west-3.compute.internal   <none>           <none>
    tuned-vvxwx                                     1/1     Running   1          6d3h   10.0.131.87    ip-10-0-131-87.eu-west-3.compute.internal    <none>           <none>
    tuned-zqrwq                                     1/1     Running   1          6d3h   10.0.161.51    ip-10-0-161-51.eu-west-3.compute.internal    <none>           <none>
  2. Extract the profile applied from each pod and match them against the previous list:

    $ for p in `oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned -o=jsonpath='{range .items[*]}{.metadata.name} {end}'`; do printf "\n*** $p ***\n" ; oc logs pod/$p -n openshift-cluster-node-tuning-operator | grep applied; done
    Example output
    *** tuned-2jkzp ***
    2020-07-10 13:53:35,368 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-control-plane' applied
    
    *** tuned-g9mkx ***
    2020-07-10 14:07:17,089 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
    2020-07-10 15:56:29,005 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied
    2020-07-10 16:00:19,006 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
    2020-07-10 16:00:48,989 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied
    
    *** tuned-kbxsh ***
    2020-07-10 13:53:30,565 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
    2020-07-10 15:56:30,199 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied
    
    *** tuned-kn9x6 ***
    2020-07-10 14:10:57,123 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
    2020-07-10 15:56:28,757 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied
    
    *** tuned-vvxwx ***
    2020-07-10 14:11:44,932 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-control-plane' applied
    
    *** tuned-zqrwq ***
    2020-07-10 14:07:40,246 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-control-plane' applied

Custom tuning specification

The custom resource (CR) for the Operator has two major sections. The first section, profile:, is a list of Tuned profiles and their names. The second, recommend:, defines the profile selection logic.

Multiple custom tuning specifications can co-exist as multiple CRs in the Operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized Tuned daemons are updated.

Management state

The Operator Management state is set by adjusting the default Tuned CR. By default, the Operator is in the Managed state and the spec.managementState field is not present in the default Tuned CR. Valid values for the Operator Management state are as follows:

  • Managed: the Operator will update its operands as configuration resources are updated

  • Unmanaged: the Operator will ignore changes to the configuration resources

  • Removed: the Operator will remove its operands and resources the Operator provisioned

Profile data

The profile: section lists Tuned profiles and their names.

profile:
- name: tuned_profile_1
  data: |
    # Tuned profile specification
    [main]
    summary=Description of tuned_profile_1 profile

    [sysctl]
    net.ipv4.ip_forward=1
    # ... other sysctl's or other Tuned daemon plugins supported by the containerized Tuned

# ...

- name: tuned_profile_n
  data: |
    # Tuned profile specification
    [main]
    summary=Description of tuned_profile_n profile

    # tuned_profile_n profile settings

Recommended profiles

The profile: selection logic is defined by the recommend: section of the CR. The recommend: section is a list of items to recommend the profiles based on a selection criteria.

recommend:
<recommend-item-1>
# ...
<recommend-item-n>

The individual items of the list:

- machineConfigLabels: (1)
    <mcLabels> (2)
  match: (3)
  <match> (4)
  priority: <priority> (5)
  profile: <tuned_profile_name> (6)
1 Optional.
2 A dictionary of key/value MachineConfig labels. The keys must be unique.
3 If omitted, profile match is assumed unless a profile with a higher priority matches first or machineConfigLabels is set.
4 An optional list.
5 Profile ordering priority. Lower numbers mean higher priority (0 is the highest priority).
6 A Tuned profile to apply on a match. For example tuned_profile_1.

<match> is an optional list recursively defined as follows:

- label: <label_name> (1)
  value: <label_value> (2)
  type: <label_type> (3)
  <match> (4)
1 Node or pod label name.
2 Optional node or pod label value. If omitted, the presence of <label_name> is enough to match.
3 Optional object type (node or pod). If omitted, node is assumed.
4 An optional <match> list.

If <match> is not omitted, all nested <match> sections must also evaluate to true. Otherwise, false is assumed and the profile with the respective <match> section will not be applied or recommended. Therefore, the nesting (child <match> sections) works as logical AND operator. Conversely, if any item of the <match> list matches, the entire <match> list evaluates to true. Therefore, the list acts as logical OR operator.

If machineConfigLabels is defined, machine config pool based matching is turned on for the given recommend: list item. <mcLabels> specifies the labels for a machine config. The machine config is created automatically to apply host settings, such as kernel boot parameters, for the profile <tuned_profile_name>. This involves finding all machine config pools with machine config selector matching <mcLabels> and setting the profile <tuned_profile_name> on all nodes that match the machine config pools' node selectors.

The list items match and machineConfigLabels are connected by the logical OR operator. The match item is evaluated first in a short-circuit manner. Therefore, if it evaluates to true, the machineConfigLabels item is not considered.

When using machine config pool based matching, it is advised to group nodes with the same hardware configuration into the same machine config pool. Not following this practice might result in Tuned operands calculating conflicting kernel parameters for two or more nodes sharing the same machine config pool.

Example: node or pod label based matching
- match:
  - label: tuned.openshift.io/elasticsearch
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
    type: pod
  priority: 10
  profile: openshift-control-plane-es
- match:
  - label: node-role.kubernetes.io/master
  - label: node-role.kubernetes.io/infra
  priority: 20
  profile: openshift-control-plane
- priority: 30
  profile: openshift-node

The CR above is translated for the containerized Tuned daemon into its recommend.conf file based on the profile priorities. The profile with the highest priority (10) is openshift-control-plane-es and, therefore, it is considered first. The containerized Tuned daemon running on a given node looks to see if there is a pod running on the same node with the tuned.openshift.io/elasticsearch label set. If not, the entire <match> section evaluates as false. If there is such a pod with the label, in order for the <match> section to evaluate to true, the node label also needs to be node-role.kubernetes.io/master or node-role.kubernetes.io/infra.

If the labels for the profile with priority 10 matched, openshift-control-plane-es profile is applied and no other profile is considered. If the node/pod label combination did not match, the second highest priority profile (openshift-control-plane) is considered. This profile is applied if the containerized Tuned pod runs on a node with labels node-role.kubernetes.io/master or node-role.kubernetes.io/infra.

Finally, the profile openshift-node has the lowest priority of 30. It lacks the <match> section and, therefore, will always match. It acts as a profile catch-all to set openshift-node profile, if no other profile with higher priority matches on a given node.

Decision workflow
Example: MachineConfigPool based matching
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-custom
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile with an additional kernel parameter
      include=openshift-node
      [bootloader]
      cmdline_openshift_node_custom=+skew_tick=1
    name: openshift-node-custom

  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-custom"
    priority: 20
    profile: openshift-node-custom

To minimize node reboots, label the target nodes with a label the machine config pool’s node selector will match, then create the Tuned CR above and finally create the custom machine config pool itself.

Custom tuning example

The following CR applies custom node-level tuning for OKD nodes with label tuned.openshift.io/ingress-node-label set to any value. As an administrator, use the following command to create a custom Tuned CR.

Custom tuning example
$ oc create -f- <<_EOF_
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: ingress
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=A custom OpenShift ingress profile
      include=openshift-control-plane
      [sysctl]
      net.ipv4.ip_local_port_range="1024 65535"
      net.ipv4.tcp_tw_reuse=1
    name: openshift-ingress
  recommend:
  - match:
    - label: tuned.openshift.io/ingress-node-label
    priority: 10
    profile: openshift-ingress
_EOF_

Custom profile writers are strongly encouraged to include the default Tuned daemon profiles shipped within the default Tuned CR. The example above uses the default openshift-control-plane profile to accomplish this.

Supported Tuned daemon plug-ins

Excluding the [main] section, the following Tuned plug-ins are supported when using custom profiles defined in the profile: section of the Tuned CR:

  • audio

  • cpu

  • disk

  • eeepc_she

  • modules

  • mounts

  • net

  • scheduler

  • scsi_host

  • selinux

  • sysctl

  • sysfs

  • usb

  • video

  • vm

There is some dynamic tuning functionality provided by some of these plug-ins that is not supported. The following Tuned plug-ins are currently not supported:

  • bootloader

  • script

  • systemd