If you use remote worker nodes, consider which objects to use to run your applications.
It is recommended that you use DaemonSets or static pods, based on the behavior you want in the event of network issues or power loss. In addition, you can use Kubernetes zones and tolerations to control or avoid pod evictions if the control plane cannot reach remote worker nodes.
- DaemonSets
DaemonSets are the best approach to managing pods on remote worker nodes for the following reasons:
- DaemonSets do not typically need rescheduling behavior. If a node disconnects from the cluster, pods on the node can continue to run. OKD does not change the state of DaemonSet pods, and leaves the pods in the state they last reported. For example, if a DaemonSet pod is in the Running state when a node stops communicating, the pod keeps running and is assumed to be running by OKD.
- DaemonSet pods, by default, are created with NoExecute tolerations for the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints with no tolerationSeconds value. These default values ensure that DaemonSet pods are never evicted if the control plane cannot reach a node. For example:
Tolerations added to DaemonSet pods by default
tolerations:
- {key: node.kubernetes.io/not-ready, operator: Exists, effect: NoExecute}
- {key: node.kubernetes.io/unreachable, operator: Exists, effect: NoExecute}
- {key: node.kubernetes.io/disk-pressure, operator: Exists, effect: NoSchedule}
- {key: node.kubernetes.io/memory-pressure, operator: Exists, effect: NoSchedule}
- {key: node.kubernetes.io/pid-pressure, operator: Exists, effect: NoSchedule}
- {key: node.kubernetes.io/unschedulable, operator: Exists, effect: NoSchedule}
- DaemonSets can use labels to ensure that a workload runs on a matching worker node.
- You can use an OKD service endpoint to load balance DaemonSet pods, as in the example below.
However, DaemonSets do not schedule pods after a node reboot if OKD cannot reach the node.
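The following is a minimal sketch of this approach, assuming a node label such as node-role.kubernetes.io/remote-worker that you apply to your remote worker nodes yourself; the namespace, object names, and image are placeholders:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: remote-worker-app            # placeholder name
  namespace: example                 # placeholder namespace
spec:
  selector:
    matchLabels:
      app: remote-worker-app
  template:
    metadata:
      labels:
        app: remote-worker-app
    spec:
      nodeSelector:
        node-role.kubernetes.io/remote-worker: ""   # assumed label; apply it to your remote worker nodes
      containers:
      - name: app
        image: registry.example.com/app:latest      # placeholder image
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: remote-worker-app
  namespace: example
spec:
  selector:
    app: remote-worker-app           # load balances across the DaemonSet pods
  ports:
  - port: 80
    targetPort: 8080
The service endpoint distributes requests across whichever DaemonSet pods are reachable.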
- Static pods
If you want pods to restart when a node reboots, after a power loss for example, consider static pods. The kubelet on a node automatically restarts static pods as the node restarts.
However, static pods cannot use secrets and config maps.
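As a sketch, a static pod is an ordinary pod manifest that you place on the node itself, in the directory configured as the kubelet's staticPodPath (commonly /etc/kubernetes/manifests); the name, namespace, and image below are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: static-web                          # placeholder name
  namespace: example                        # placeholder namespace
spec:
  containers:
  - name: web
    image: registry.example.com/web:latest  # placeholder image
    ports:
    - containerPort: 8080
Because the manifest lives on the node, the kubelet recreates the pod after a reboot without any involvement from the control plane.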
- Kubernetes zones
Kubernetes zones can slow the rate of pod evictions or, in some cases, stop them completely.
When the control plane cannot reach a node, the node controller, by default, applies the node.kubernetes.io/unreachable taint and evicts pods at a rate of 0.1 nodes per second. However, in a cluster that uses Kubernetes zones, pod eviction behavior is altered.
If a zone is fully disrupted, where all nodes in the zone have a Ready condition that is False or Unknown, the control plane does not apply the node.kubernetes.io/unreachable taint to the nodes in that zone.
For partially disrupted zones, where more than 55% of the nodes have an Unknown condition, the pod eviction rate is reduced to 0.01 nodes per second. Nodes in smaller clusters, with fewer than 50 nodes, are not tainted. Your cluster must have more than three zones for this behavior to take effect.
You assign a node to a specific zone by applying the topology.kubernetes.io/region label in the node specification.
Sample node labels for Kubernetes zones
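A minimal sketch of how the label appears on a Node object; the value east is a placeholder for your own region name:
apiVersion: v1
kind: Node
metadata:
  name: <node_name>
  labels:
    topology.kubernetes.io/region: east    # placeholder value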
- KubeletConfig objects
You can adjust the interval at which the Kubernetes Controller Manager Operator (controller manager) checks the state of each node.
To set the interval that affects the timing of when the on-premise node controller marks nodes with the Unreachable condition, create a KubeletConfig object that contains the node-status-update-frequency parameter. This parameter specifies the interval at which the state of each node that is associated with this MachineConfig is checked. The default is 10 seconds.
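The following is a minimal sketch of such a KubeletConfig object, assuming the default worker machine config pool; the object name is a placeholder, and the interval is expressed through the kubelet's nodeStatusUpdateFrequency field:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: remote-worker-node-status            # placeholder name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""   # assumes the default worker pool label; adjust to match your pool
  kubeletConfig:
    nodeStatusUpdateFrequency: "10s"         # how often the kubelet reports node status; 10s is the upstream default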
This parameter works with the node-monitor-grace-period and the pod-eviction-timeout parameters, which are not configurable:
- The node-monitor-grace-period parameter specifies how long OKD waits before marking a node that is associated with this MachineConfig as Unhealthy when the controller manager cannot reach the node. Workloads on the node continue to run during this time. If the remote worker node rejoins the cluster after the node-monitor-grace-period expires, pods continue to run and new pods can be scheduled to that node. The node-monitor-grace-period interval is 40 seconds.
- The pod-eviction-timeout parameter specifies the amount of time OKD waits after marking a node that is associated with this MachineConfig as Unreachable before it starts marking pods for eviction. Evicted pods are rescheduled on other nodes. If the remote worker node rejoins the cluster after the pod-eviction-timeout expires, the pods that were running on the remote worker node are terminated, because the node controller has evicted them on-premise. Pods can then be rescheduled to that node. The pod-eviction-timeout period is 5 minutes.
You can use pod tolerations to mitigate the effects if the on-premise node controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to a node it cannot reach.
A taint with the NoExecute effect affects pods that are running on the node in the following ways:
- Pods that do not tolerate the taint are queued for eviction.
- Pods that tolerate the taint without specifying a tolerationSeconds value in their toleration specification remain bound forever.
- Pods that tolerate the taint with a specified tolerationSeconds value remain bound for the specified amount of time. After the time elapses, the pods are queued for eviction.
You can delay or avoid pod eviction by configuring pod tolerations with the NoExecute effect for the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints.
Example toleration in a pod spec
- key: "node.kubernetes.io/unreachable"
effect: "NoExecute" (1)
- key: "node.kubernetes.io/not-ready"
effect: "NoExecute" (2)
NoSchedule effect with
tolerationSeconds: 0 allows pods to remain if the control plane cannot reach the node.
NoSchedule effect with
tolerationSeconds: 0 allows pods to remain if the control plane marks the node as
OKD uses the tolerationSeconds value after the pod-eviction-timeout value elapses.
- Other types of OKD objects
You can use ReplicaSets, Deployments, and ReplicationControllers. The scheduler can reschedule these pods onto other nodes after the node is disconnected for five minutes. Rescheduling onto other nodes can be beneficial for some workloads, such as REST APIs, where an administrator can guarantee a specific number of pods are running and accessible.
When working with remote worker nodes, rescheduling pods on different nodes might not be acceptable if remote worker nodes are intended to be reserved for specific functions.
StatefulSets do not get restarted when there is an outage. The pods remain in the terminating state until the control plane can acknowledge that the pods are terminated.
To avoid scheduling pods onto nodes that do not have access to the same type of persistent storage, OKD does not migrate pods that require persistent volumes to other zones in the case of network separation.