You can configure OKD clusters with nodes located at your network edge. In this topic, they are called remote worker nodes. A typical cluster with remote worker nodes combines on-premise master and worker nodes with worker nodes in other locations that connect to the cluster. This topic is intended to provide guidance on best practices for using remote worker nodes and does not contain specific configuration details.

There are multiple use cases across different industries, such as telecommunications, retail, manufacturing, and government, for using a deployment pattern with remote worker nodes. For example, you can separate and isolate your projects and workloads by combining the remote worker nodes into Kubernetes zones.

However, having remote worker nodes can introduce higher latency, intermittent loss of network connectivity, and other issues. Among the challenges in a cluster with remote worker nodes are:

  • Network separation: The OKD control plane and the remote worker nodes must be able to communicate with each other. Because of the distance between the control plane and the remote worker nodes, network issues could prevent this communication. See Network separation with remote worker nodes for information on how OKD responds to network separation and for methods to diminish the impact to your cluster.

  • Power outage: Because the control plane and remote worker nodes are in separate locations, a power outage at the remote location or at any point between the two can negatively impact your cluster. See Power loss on remote worker nodes for information on how OKD responds to a node losing power and for methods to diminish the impact to your cluster.

  • Latency spikes or temporary reduction in throughput: As with any network, any changes in network conditions between your cluster and the remote worker nodes can negatively impact your cluster. These types of situations are beyond the scope of this documentation.

Note the following limitations when planning a cluster with remote worker nodes:

  • Remote worker nodes are supported only on bare metal clusters with user-provisioned infrastructure.

  • OKD does not support remote worker nodes that use a different cloud provider than the on-premise cluster uses.

  • Moving workloads from one Kubernetes zone to a different Kubernetes zone can be problematic due to system and environment issues, such as a specific type of memory not being available in a different zone.

  • Proxies and firewalls can present additional limitations that are beyond the scope of this document. Refer to the relevant OKD documentation for how to address such limitations, such as Configuring your firewall.

  • You are responsible for configuring and maintaining L2/L3-level network connectivity between the control plane and the network-edge nodes.

Network separation with remote worker nodes

All nodes send heartbeats to the Kubernetes Controller Manager Operator (kube controller) in the OKD cluster every 10 seconds. If the controller manager cannot reach a remote node because of network issues, OKD responds using several default mechanisms.

OKD is designed to be resilient to network partitions and other disruptions. You can mitigate some of the more common disruptions, such as interruptions from software upgrades, network splits, and routing issues. Mitigation strategies include ensuring that pods on remote worker nodes request the correct amount of CPU and memory resources, configuring an appropriate replication policy, using redundancy across zones, and using Pod Disruption Budgets on workloads.
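For example, a pod disruption budget can limit how many replicas of a workload are taken down at once during voluntary disruptions, such as node drains during an upgrade. The following is a minimal sketch; the workload label and object name are illustrative:

Example PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: edge-app-pdb
spec:
  minAvailable: 2        # block voluntary disruptions that would leave fewer than 2 pods
  selector:
    matchLabels:
      app: edge-app      # illustrative label on the protected workload

With minAvailable: 2, a node drain that would leave fewer than two matching pods running is blocked until replacement pods are ready on other nodes.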

If the kube controller cannot reach a node after a configured period, the node controller on the control plane updates the node health to Unhealthy and marks the node Ready condition as Unknown. In response, the scheduler stops scheduling pods to that node. The on-premise node controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to the node and schedules pods on the node for eviction after five minutes, by default.

On the node itself, the kubelet cannot communicate with the control plane while the node is unreachable, so it continues to run the pods in their last known state until the connection is restored.

If a workload controller, such as a Deployment or StatefulSet, is directing traffic to pods on the unhealthy node and other nodes can reach the cluster, OKD routes the traffic away from the pods on the node. Nodes that cannot reach the cluster do not get updated with the new traffic routing. As a result, the workloads on those nodes might continue to attempt to reach the unhealthy node.

You can mitigate the effects of connection loss by:

  • using DaemonSets to create pods that tolerate the taints

  • using static pods that automatically restart if a node goes down

  • using Kubernetes zones to control pod eviction

  • configuring pod tolerations to delay or avoid pod eviction

  • configuring the kubelet to control the timing of when it marks nodes as unhealthy

For more information on using these objects in a cluster with remote worker nodes, see About remote worker node strategies.

Power loss on remote worker nodes

If a remote worker node loses power or restarts ungracefully, OKD responds using several default mechanisms.

If the Kubernetes Controller Manager Operator (kube controller) cannot reach a node after a configured period, the control plane updates the node health to Unhealthy and marks the node Ready condition as Unknown. In response, the scheduler stops scheduling pods to that node. The on-premise node controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to the node and schedules pods on the node for eviction after five minutes, by default.

On the node, the pods must be restarted when the node recovers power and reconnects with the control plane.

If you want the pods to restart immediately when the node restarts, use static pods.

After the node restarts, the kubelet also restarts and attempts to restart the pods that were scheduled on the node. If the connection to the control plane takes longer than the default five minutes, the control plane cannot update the node health and remove the node.kubernetes.io/unreachable taint. On the node, the kubelet terminates any running pods. When these conditions are cleared, the scheduler can start scheduling pods to that node.

You can mitigate the effects of power loss by:

  • using DaemonSets to create pods that tolerate the taints

  • using static pods that automatically restart when a node restarts

  • configuring pod tolerations to delay or avoid pod eviction

  • configuring the kubelet to control the timing of when the node controller marks nodes as unhealthy

For more information on using these objects in a cluster with remote worker nodes, see About remote worker node strategies.

Remote worker node strategies

If you use remote worker nodes, consider which objects to use to run your applications.

It is recommended that you use DaemonSets or static pods, depending on the behavior you want in the event of network issues or power loss. In addition, you can use Kubernetes zones and tolerations to control or avoid pod evictions if the control plane cannot reach remote worker nodes.

DaemonSets

DaemonSets are the best approach to managing pods on remote worker nodes for the following reasons:

  • DaemonSets do not typically need rescheduling behavior. If a node disconnects from the cluster, pods on the node can continue to run. OKD does not change the state of DaemonSet pods and leaves the pods in the state they last reported. For example, if a DaemonSet pod is in the Running state when a node stops communicating, the pod keeps running and OKD assumes it is still running.

  • DaemonSet pods, by default, are created with NoExecute tolerations for the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints with no tolerationSeconds value. These default values ensure that DaemonSet pods are never evicted if the control plane cannot reach a node. For example:

    Tolerations added to DaemonSet pods by default
      tolerations:
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
        - key: node.kubernetes.io/disk-pressure
          operator: Exists
          effect: NoSchedule
        - key: node.kubernetes.io/memory-pressure
          operator: Exists
          effect: NoSchedule
        - key: node.kubernetes.io/pid-pressure
          operator: Exists
          effect: NoSchedule
        - key: node.kubernetes.io/unschedulable
          operator: Exists
          effect: NoSchedule
  • DaemonSets can use labels to ensure that a workload runs on a matching worker node.

  • You can use an OKD service endpoint to load balance DaemonSet pods.

DaemonSets do not schedule pods after a reboot of the node if OKD cannot reach the node.
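For example, a DaemonSet can use a node selector so that its pods run only on nodes carrying a particular label, and a service can load balance traffic across those pods. The following is a minimal sketch; it assumes you have applied an illustrative edge label to the remote worker nodes, and the image and object names are placeholders:

Example DaemonSet with a node selector and a load-balancing service
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: edge-agent
spec:
  selector:
    matchLabels:
      app: edge-agent
  template:
    metadata:
      labels:
        app: edge-agent
    spec:
      nodeSelector:
        example.com/edge: "true"   # illustrative label applied to remote worker nodes
      containers:
      - name: edge-agent
        image: registry.example.com/edge-agent:latest   # placeholder image
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: edge-agent
spec:
  selector:
    app: edge-agent       # load balances across the DaemonSet pods
  ports:
  - port: 80
    targetPort: 8080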

Static pods

If you want pods to restart when a node reboots, after a power loss for example, consider static pods. The kubelet on a node automatically restarts static pods as the node restarts.

Static pods cannot use secrets and ConfigMaps.
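A static pod is defined by a manifest file that you place in the kubelet's static pod directory on the node, commonly /etc/kubernetes/manifests; the kubelet starts and restarts the pod directly, without involving the API server. A minimal sketch, with placeholder names:

Example static pod manifest
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: registry.example.com/web:latest   # placeholder image
    ports:
    - containerPort: 8080

Because the kubelet creates the pod without contacting the API server, the manifest cannot reference secrets or config maps.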

Kubernetes zones

Kubernetes zones can slow the rate of pod evictions or, in some cases, stop them completely.

When the control plane cannot reach a node, the node controller, by default, applies node.kubernetes.io/unreachable taints and evicts pods at a rate of 0.1 nodes per second. However, in a cluster that uses Kubernetes zones, pod eviction behavior is altered.

If a zone is fully disrupted, where all nodes in the zone have a Ready condition that is False or Unknown, the control plane does not apply the taint to the nodes in that zone.

For partially disrupted zones, where more than 55% of the nodes have a False or Unknown condition, the pod eviction rate is reduced to 0.01 nodes per second. Nodes in smaller clusters, with fewer than 50 nodes, are not tainted. Your cluster must have more than three zones for this behavior to take effect.

You assign a node to a specific zone by applying the topology.kubernetes.io/region label in the node specification.

Sample node labels for Kubernetes zones
kind: Node
apiVersion: v1
metadata:
  labels:
    topology.kubernetes.io/region: east
KubeletConfig objects

You can adjust the frequency at which the Kubernetes Controller Manager Operator (controller manager) checks the state of each node.

To set the interval that affects the timing of when the on-premise node controller marks nodes with the Unhealthy or Unreachable condition, create a KubeletConfig object that contains the node-status-update-frequency parameter:

Example KubeletConfig
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: node-status-update
spec:
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  kubeletConfig:
    nodeStatusUpdateFrequency: "5s" (1)
1 Specify the interval at which the kubelet reports the state of each node associated with this MachineConfig to the controller manager. The default is 5s.

This parameter works with the node-monitor-grace-period and the pod-eviction-timeout parameters, which are not configurable.

  • The node-monitor-grace-period parameter specifies how long OKD waits before marking a node associated with this MachineConfig as Unhealthy when the controller manager cannot reach the node. Workloads on the node continue to run after the node is marked Unhealthy. If the remote worker node rejoins the cluster after the node-monitor-grace-period expires, pods continue to run. New pods can be scheduled to that node. The node-monitor-grace-period interval is 40s.

  • The pod-eviction-timeout parameter specifies the amount of time OKD waits after marking a node that is associated with this MachineConfig as Unreachable before it starts marking pods for eviction. Evicted pods are rescheduled on other nodes. If the remote worker node rejoins the cluster after pod-eviction-timeout expires, the pods running on the remote worker node are terminated because the on-premise node controller has already evicted them. Pods can then be rescheduled to that node. The pod-eviction-timeout period is 5m0s.

Tolerations

You can use pod tolerations to mitigate the effects if the on-premise node controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to a node it cannot reach.

A taint with the NoExecute effect affects pods that are running on the node in the following ways:

  • Pods that do not tolerate the taint are queued for eviction.

  • Pods that tolerate the taint without specifying a tolerationSeconds value in their toleration specification remain bound forever.

  • Pods that tolerate the taint with a specified tolerationSeconds value remain bound for the specified amount of time. After the time elapses, the pods are queued for eviction.

You can delay or avoid pod eviction by configuring pods tolerations with the NoExecute effect for the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints.

Example toleration in a pod spec
...
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute" (1)
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute" (2)
...
1 The NoExecute toleration without a tolerationSeconds value allows pods to remain bound indefinitely if the control plane cannot reach the node.
2 The NoExecute toleration without a tolerationSeconds value allows pods to remain bound indefinitely if the control plane marks the node as Unhealthy.

OKD uses the tolerationSeconds value after the pod-eviction-timeout value elapses.
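To delay eviction rather than avoid it entirely, set a tolerationSeconds value; the pod then remains bound for that long after the taint is applied. A sketch with an illustrative 10-minute delay:

Example toleration that delays eviction
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 600   # pod is queued for eviction 600 seconds after the taint is applied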

Other types of OKD objects

You can use ReplicaSets, Deployments, and ReplicationControllers. The scheduler can reschedule these pods onto other nodes after the node is disconnected for five minutes. Rescheduling onto other nodes can be beneficial for some workloads, such as REST APIs, where an administrator can guarantee a specific number of pods are running and accessible.

When working with remote worker nodes, rescheduling pods on different nodes might not be acceptable if remote worker nodes are intended to be reserved for specific functions.
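One way to reserve remote worker nodes for specific functions is to taint those nodes and add a matching toleration only to the workloads that belong there. A sketch with an illustrative taint key:

Example taint in a node specification
kind: Node
apiVersion: v1
spec:
  taints:
  - key: example.com/remote-only   # illustrative key chosen by the administrator
    value: "true"
    effect: NoSchedule

Pods without a matching toleration are not scheduled to the tainted nodes; pods intended for the remote location carry a toleration such as:

tolerations:
- key: "example.com/remote-only"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"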

StatefulSets do not get restarted when there is an outage. The pods remain in the terminating state until the control plane can acknowledge that the pods are terminated.

Because OKD must avoid scheduling a pod to a node that does not have access to the same type of persistent storage, it cannot migrate pods that require persistent volumes to other zones in the case of network separation.

Additional resources