If you use remote worker nodes, consider which objects to use to run your applications.
It is recommend to use daemon sets or static pods based on the behavior you want in the event of network issues or power loss. In addition, you can use Kubernetes zones and tolerations to control or avoid pod evictions if the control plane cannot reach remote worker nodes.
- Daemon sets
Daemon sets are the best approach to managing pods on remote worker nodes for the following reasons:
Daemon sets do not typically need rescheduling behavior. If a node disconnects from the cluster, pods on the node can continue to run. OKD does not change the state of daemon set pods, and leaves the pods in the state they last reported. For example, if a daemon set pod is in the
Running state, when a node stops communicating, the pod keeps running and is assumed to be running by OKD.
Daemon set pods, by default, are created with
NoExecute tolerations for the
node.kubernetes.io/not-ready taints with no
tolerationSeconds value. These default values ensure that daemon set pods are never evicted if the control plane cannot reach a node. For example:
Tolerations added to daemon set pods by default
- key: node.kubernetes.io/not-ready
- key: node.kubernetes.io/unreachable
- key: node.kubernetes.io/disk-pressure
- key: node.kubernetes.io/memory-pressure
- key: node.kubernetes.io/pid-pressure
- key: node.kubernetes.io/unschedulable
Daemon sets can use labels to ensure that a workload runs on a matching worker node.
You can use an OKD service endpoint to load balance daemon set pods.
Daemon sets do not schedule pods after a reboot of the node if OKD cannot reach the node.
- Static pods
If you want pods restart if a node reboots, after a power loss for example, consider static pods. The kubelet on a node automatically restarts static pods as node restarts.
Static pods cannot use secrets and config maps.
- Kubernetes zones
Kubernetes zones can slow down the rate or, in some cases, completely stop pod evictions.
When the control plane cannot reach a node, the node controller, by default, applies
node.kubernetes.io/unreachable taints and evicts pods at a rate of 0.1 nodes per second. However, in a cluster that uses Kubernetes zones, pod eviction behavior is altered.
If a zone is fully disrupted, where all nodes in the zone have a
Ready condition that is
Unknown, the control plane does not apply the
node.kubernetes.io/unreachable taint to the nodes in that zone.
For partially disrupted zones, where more than 55% of the nodes have a
Unknown condition, the pod eviction rate is reduced to 0.01 nodes per second. Nodes in smaller clusters, with fewer than 50 nodes, are not tainted. Your cluster must have more than three zones for these behavior to take effect.
You assign a node to a specific zone by applying the
topology.kubernetes.io/region label in the node specification.
Sample node labels for Kubernetes zones
You can adjust the amount of time that the kubelet checks the state of each node.
To set the interval that affects the timing of when the on-premise node controller marks nodes with the
Unreachable condition, create a
KubeletConfig object that contains the
The kubelet on each node determines the node status as defined by the
node-status-update-frequency setting and reports that status to the cluster based on the
node-status-report-frequency setting. By default, the kubelet determines the pod status every 10 seconds and reports the status every minute. However, if the node state changes, the kubelet reports the change to the cluster immediately. OKD uses the
node-status-report-frequency setting only when the Node Lease feature gate is enabled, which is the default state in OKD clusters. If the Node Lease feature gate is disabled, the node reports its status based on the
Example kubelet config
machineconfiguration.openshift.io/role: worker (1)
||Specify the type of node type to which this
KubeletConfig object applies using the label from the
||Specify the frequency that the kubelet checks the status of a node associated with this
MachineConfig object. The default value is
10s. If you change this default, the
node-status-report-frequency value is changed to the same value.
||Specify the frequency that the kubelet reports the status of a node associated with this
MachineConfig object. The default value is
This parameter works with the
node-monitor-grace-period and the
pod-eviction-timeout parameters, which are not configurable.
node-monitor-grace-period parameter specifies how long the OKD waits after a node associated with a
MachineConfig object is marked
Unhealthy if the controller manager does not receive the node heartbeat. Workloads on the node continue to run after this time. If the remote worker node rejoins the cluster after the
node-monitor-grace-period expires, pods continue to run. New pods can be scheduled to that node. The
node-monitor-grace-period interval is
node-status-update-frequency value must be lower than the
pod-eviction-timeout parameter specifies the amount of time OKD waits after marking a node that is associated with a
MachineConfig object as
Unreachable to start marking pods for eviction. Evicted pods are rescheduled on other nodes. If the remote worker node rejoins the cluster after
pod-eviction-timeout expires, the pods running on the remote worker node are terminated because the node controller has evicted the pods on-premise. Pods can then be rescheduled to that node. The
pod-eviction-timeout period is
You can use pod tolerations to mitigate the effects if the on-premise node controller adds a
node.kubernetes.io/unreachable taint with a
NoExecute effect to a node it cannot reach.
A taint with the
NoExecute effect affects pods that are running on the node in the following ways:
Pods that do not tolerate the taint are queued for eviction.
Pods that tolerate the taint without specifying a
tolerationSeconds value in their toleration specification remain bound forever.
Pods that tolerate the taint with a specified
tolerationSeconds value remain bound for the specified amount of time. After the time elapses, the pods are queued for eviction.
You can delay or avoid pod eviction by configuring pods tolerations with the
NoExecute effect for the
Example toleration in a pod spec
- key: "node.kubernetes.io/unreachable"
effect: "NoExecute" (1)
- key: "node.kubernetes.io/not-ready"
effect: "NoExecute" (2)
NoExecute effect with
tolerationSeconds: 0 allows pods to remain if the control plane cannot reach the node.
NoExecute effect with
tolerationSeconds: 0 allows pods to remain if the control plane marks the node as
OKD uses the
tolerationSeconds value after the
pod-eviction-timeout value elapses.
- Other types of OKD objects
You can use replica sets, deployments, and replication controllers. The scheduler can reschedule these pods onto other nodes after the node is disconnected for five minutes. Rescheduling onto other nodes can be beneficial for some workloads, such as REST APIs, where an administrator can guarantee a specific number of pods are running and accessible.
When working with remote worker nodes, rescheduling pods on different nodes might not be acceptable if remote worker nodes are intended to be reserved for specific functions.
stateful sets do not get restarted when there is an outage. The pods remain in the
terminating state until the control plane can acknowledge that the pods are terminated.
To avoid scheduling a to a node that does not have access to the same type of persistent storage, OKD cannot migrate pods that require persistent volumes to other zones in the case of network separation.