Remediating nodes with the Poison Pill Operator - Working with nodes | Nodes

About the Poison Pill Operator
Installing the Poison Pill Operator by using the web console
Installing the Poison Pill Operator by using the CLI
Configuring machine health checks to use the Poison Pill Operator
Troubleshooting the Poison Pill Operator
Gathering data about the Poison Pill Operator
Additional resources

You can use the Poison Pill Operator to automatically reboot unhealthy nodes. This remediation strategy minimizes downtime for stateful applications and ReadWriteOnce (RWO) volumes, and restores compute capacity if transient failures occur.

About the Poison Pill Operator

The Poison Pill Operator runs on the cluster nodes and reboots nodes that are identified as unhealthy. The Operator uses the MachineHealthCheck controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the MachineHealthCheck resource creates the PoisonPillRemediation custom resource (CR), which triggers the Poison Pill Operator.

The Poison Pill Operator minimizes downtime for stateful applications and restores compute capacity if transient failures occur. You can use this Operator regardless of the management interface, such as IPMI or an API to provision a node, and regardless of the cluster installation type, such as installer-provisioned infrastructure or user-provisioned infrastructure.

Understanding the Poison Pill Operator configuration

The Poison Pill Operator creates the PoisonPillConfig CR with the name poison-pill-config in the Poison Pill Operator’s namespace. You can edit this CR. However, you cannot create a new CR for the Poison Pill Operator.

A change in the PoisonPillConfig CR re-creates the Poison Pill daemon set.

The PoisonPillConfig CR resembles the following YAML file:

apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillConfig
metadata:
  name: poison-pill-config
  namespace: openshift-operators
spec:
  safeTimeToAssumeNodeRebootedSeconds: 180 (1)
  watchdogFilePath: /test/watchdog1 (2)
  isSoftwareRebootEnabled: true (3)
  apiServerTimeout: 15s (4)
  apiCheckInterval: 5s (5)
  maxApiErrorThreshold: 3 (6)
  peerApiServerTimeout: 5s (7)
  peerDialTimeout: 5s (8)
  peerRequestTimeout: 5s (9)
  peerUpdateInterval: 15m (10)

1	Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value.
2	Specify the file path of the watchdog device in the nodes. If you enter an incorrect path to the watchdog device, the Poison Pill Operator automatically detects the softdog device path. If a watchdog device is unavailable, the `PoisonPillConfig` CR uses a software reboot.
3	Specify if you want to enable software reboot of the unhealthy nodes. By default, the value of `isSoftwareRebootEnabled` is set to `true`. To disable the software reboot, set the parameter value to `false`.
4	Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation.
5	Specify the frequency to check connectivity with each API server.
6	Specify a threshold value. After reaching this threshold, the node starts contacting its peers.
7	Specify the timeout duration for the peer to connect the API server.
8	Specify the timeout duration for establishing connection with the peer.
9	Specify the timeout duration to get a response from the peer.
10	Specify the frequency to update peer information, such as IP address.

Understanding the Poison Pill Remediation Template configuration

The Poison Pill Operator also creates the PoisonPillRemediationTemplate CR with the name poison-pill-default-template in the Poison Pill Operator’s namespace. This CR defines the remediation strategy for the nodes.

The default remediation strategy is NodeDeletion that removes the node object. In OKD 4.10, the Poison Pill Operator introduces a new remediation strategy called ResourceDeletion. The ResourceDeletion remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster.

The PoisonPillRemediationTemplate CR resembles the following YAML file:

apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillRemediationTemplate
metadata:
  creationTimestamp: "2022-03-02T08:02:40Z"
  generation: 1
  name: poison-pill-default-template
  namespace: openshift-operators
  resourceVersion: "596469"
  uid: 5d29e437-c485-48fa-ba9e-0354649afd31
spec:
  template:
    spec:
      remediationStrategy: NodeDeletion (1)

1	Specifies the remediation strategy. The default remediation strategy is `NodeDeletion`.

About watchdog devices

Watchdog devices can be any of the following:

Independently powered hardware devices
Hardware devices that share power with the hosts they control
Virtual devices implemented in software, or softdog

Hardware watchdog and softdog devices have electronic or software timers, respectively. These watchdog devices are used to ensure that the machine enters a safe state when an error condition is detected. The cluster is required to repeatedly reset the watchdog timer to prove that it is in a healthy state. This timer might elapse due to fault conditions, such as deadlocks, CPU starvation, and loss of network or disk access. If the timer expires, the watchdog device assumes that a fault has occurred and the device triggers a forced reset of the node.

Hardware watchdog devices are more reliable than softdog devices.

Understanding Poison Pill Operator behavior with watchdog devices

The Poison Pill Operator determines the remediation strategy based on the watchdog devices that are present.

If a hardware watchdog device is configured and available, the Operator uses it for remediation. If a hardware watchdog device is not configured, the Operator enables and uses a softdog device for remediation.

If neither watchdog devices are supported, either by the system or by the configuration, the Operator remediates nodes by using software reboot.

Additional resources

Configuring a watchdog

Installing the Poison Pill Operator by using the web console

You can use the OKD web console to install the Poison Pill Operator.

Prerequisites

Procedure

In the OKD web console, navigate to Operators → OperatorHub.
Search for the Poison Pill Operator from the list of available Operators, and then click Install.
Keep the default selection of Installation mode and namespace to ensure that the Operator is installed to the openshift-operators namespace.
Click Install.

Verification

To confirm that the installation is successful:

Navigate to the Operators → Installed Operators page.
Check that the Operator is installed in the openshift-operators namespace and its status is Succeeded.

If the Operator is not installed successfully:

Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
Navigate to the Workloads → Pods page and check the logs in any pods in the poison-pill-controller-manager project that are reporting issues.

Installing the Poison Pill Operator by using the CLI

You can use the OpenShift CLI (oc) to install the Poison Pill Operator.

You can install the Poison Pill Operator in your own namespace or in the openshift-operators namespace.

To install the Operator in your own namespace, follow the steps in the procedure.

To install the Operator in the openshift-operators namespace, skip to step 3 of the procedure because the steps to create a new Namespace custom resource (CR) and an OperatorGroup CR are not required.

Prerequisites

Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.

Procedure

Create a Namespace custom resource (CR) for the Poison Pill Operator:
1. Define the Namespace CR and save the YAML file, for example, poison-pill-namespace.yaml:
  apiVersion: v1 kind: Namespace metadata: name: poison-pill
2. To create the Namespace CR, run the following command:
  $ oc create -f poison-pill-namespace.yaml

Create an OperatorGroup CR:

Define the OperatorGroup CR and save the YAML file, for example, poison-pill-operator-group.yaml:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: poison-pill-manager
  namespace: poison-pill

To create the OperatorGroup CR, run the following command:
```
$ oc create -f poison-pill-operator-group.yaml
```

Create a Subscription CR:

Define the Subscription CR and save the YAML file, for example, poison-pill-subscription.yaml:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
    name: poison-pill-manager
    namespace: poison-pill (1)
spec:
    channel: stable
    installPlanApproval: Manual (2)
    name: poison-pill-manager
    source: redhat-operators
    sourceNamespace: openshift-marketplace
    package: poison-pill-manager

1	Specify the `Namespace` where you want to install the Poison Pill Operator. To install the Poison Pill Operator in the `openshift-operators` namespace, specify `openshift-operators` in the `Subscription` CR.
2	Set the approval strategy to Manual in case your specified version is superseded by a later version in the catalog. This plan prevents an automatic upgrade to a later version and requires manual approval before the starting CSV can complete the installation.

To create the Subscription CR, run the following command:
```
$ oc create -f poison-pill-subscription.yaml
```

Verification

Verify that the installation succeeded by inspecting the CSV resource:

$ oc get csv -n poison-pill

Example output

NAME                   DISPLAY                 VERSION   REPLACES    PHASE
poison-pill.v.0.2.0     Poison Pill Operator    0.2.0                 Succeeded

Verify that the Poison Pill Operator is up and running:

$ oc get deploy -n poison-pill

Example output

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
poison-pill-controller-manager       1/1     1            1           10d

Verify that the Poison Pill Operator created the PoisonPillConfig CR:

$ oc get PoisonPillConfig -n poison-pill

Example output

NAME                 AGE
poison-pill-config   10d

Verify that each poison pill pod is scheduled and running on each worker node:

$ oc get daemonset -n poison-pill

Example output

NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
poison-pill-ds   2         2         2       2            2           <none>          10d

This command is unsupported for the control plane nodes.

Configuring machine health checks to use the Poison Pill Operator

Use the following procedure to configure the machine health checks to use the Poison Pill Operator as a remediation provider.

Prerequisites

Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.

Procedure

Create a PoisonPillRemediationTemplate CR:

Define the PoisonPillRemediationTemplate CR:

apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillRemediationTemplate
metadata:
  namespace: openshift-machine-api
  name: poisonpillremediationtemplate-sample
spec:
  template:
    spec: {}

To create the PoisonPillRemediationTemplate CR, run the following command:
```
$ oc create -f <ppr-name>.yaml
```

Create or update the MachineHealthCheck CR to point to the PoisonPillRemediationTemplate CR:

Define or update the MachineHealthCheck CR:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: machine-health-check
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: "worker"
      machine.openshift.io/cluster-api-machine-type: "worker"
  unhealthyConditions:
  - type:    "Ready"
    timeout: "300s"
    status: "False"
  - type:    "Ready"
    timeout: "300s"
    status: "Unknown"
  maxUnhealthy: "40%"
  nodeStartupTimeout: "10m"
  remediationTemplate: (1)
    kind: PoisonPillRemediationTemplate
    apiVersion: poison-pill.medik8s.io/v1alpha1
    name: poisonpillremediationtemplate-sample

1	Specify the details for the remediation template.

To create a MachineHealthCheck CR, run the following command:
```
$ oc create -f <file-name>.yaml
```
To update a MachineHealthCheck CR, run the following command:
```
$ oc apply -f <file-name>.yaml
```

Troubleshooting the Poison Pill Operator

General troubleshooting

Issue: You want to troubleshoot issues with the Poison Pill Operator.
Resolution: Check the Operator logs.

Checking the daemon set

Issue: The Poison Pill Operator is installed but the daemon set is not available.
Resolution: Check the Operator logs for errors or warnings.

Unsuccessful remediation

Issue

An unhealthy node was not remediated.

Resolution

Verify that the PoisonPillRemediation CR was created by running the following command:

$ oc get ppr -A

If the MachineHealthCheck controller did not create the PoisonPillRemediation CR when the node turned unhealthy, check the logs of the MachineHealthCheck controller. Additionally, ensure that the MachineHealthCheck CR includes the required specification to use the remediation template.

If the PoisonPillRemediation CR was created, ensure that its name matches the unhealthy node or the machine object.

Daemon set and other Poison Pill Operator resources exist even after uninstalling the Poison Pill Operator

Issue

The Poison Pill Operator resources, such as the daemon set, configuration CR, and the remediation template CR, exist even after after uninstalling the Operator.

Resolution

To remove the Poison Pill Operator resources, delete the resources by running the following commands for each resource type:

$ oc delete ds <poison-pill-ds> -n <namespace>

$ oc delete ppc <poison-pill-config> -n <namespace>

$ oc delete pprt <poison-pill-remediation-template> -n <namespace>

Gathering data about the Poison Pill Operator

To collect debugging information about the Poison Pill Operator, use the must-gather tool. For information about the must-gather image for the Poison Pill Operator, see Gathering data about specific features.

Additional resources

The Poison Pill Operator is supported in a restricted network environment. For more information, see Using Operator Lifecycle Manager on restricted networks.
Deleting Operators from a cluster