You can use the Poison Pill Operator to automatically reboot unhealthy nodes. This remediation strategy minimizes downtime for stateful applications and ReadWriteOnce (RWO) volumes, and restores compute capacity if transient failures occur.
The Poison Pill Operator runs on the cluster nodes and reboots nodes that are identified as unhealthy. The Operator uses the MachineHealthCheck controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the MachineHealthCheck resource creates the PoisonPillRemediation custom resource (CR), which triggers the Poison Pill Operator.
The Poison Pill Operator provides the following capabilities:
Minimizes downtime for stateful applications and restores compute capacity if transient failures occur.
Independent of any management interface, such as IPMI or an API to provision a node.
The Poison Pill Operator creates the PoisonPillConfig CR with the name poison-pill-config in the Poison Pill Operator’s namespace. You can edit this CR. However, you cannot create a new CR for the Poison Pill Operator.
A change in the PoisonPillConfig CR re-creates the Poison Pill daemon set.
The PoisonPillConfig CR resembles the following YAML file:
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillConfig
metadata:
  name: poison-pill-config
  namespace: openshift-operators
spec:
  safeTimeToAssumeNodeRebootedSeconds: 180 # (1)
  watchdogFilePath: /test/watchdog1 # (2)
(1) Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value.
(2) Specify the file path of the watchdog device in the nodes. If a watchdog device is unavailable, the PoisonPillConfig CR uses a software reboot.
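Because the CR can be edited but not re-created, one way to raise the timeout is to patch the existing object. The following is a minimal sketch, not an official procedure: build_ppc_patch is a hypothetical helper, the value 240 is an arbitrary example, and the namespace is left as a placeholder. It only builds the JSON merge patch; the oc patch invocation is shown as a comment.

```shell
#!/bin/sh
# build_ppc_patch SECONDS
# Prints a JSON merge patch that sets safeTimeToAssumeNodeRebootedSeconds.
# Hypothetical helper for illustration; not part of oc or the Operator.
build_ppc_patch() {
  seconds=$1
  # Reject empty input, leading zeros (invalid JSON numbers), and non-digits.
  case $seconds in
    ''|0*|*[!0-9]*)
      echo "expected a positive whole number of seconds, got: '$seconds'" >&2
      return 1
      ;;
  esac
  printf '{"spec":{"safeTimeToAssumeNodeRebootedSeconds":%s}}\n' "$seconds"
}

# Example against a cluster (240 is an assumed value, not a recommendation):
#   oc patch poisonpillconfig poison-pill-config -n <namespace> \
#     --type merge -p "$(build_ppc_patch 240)"
```

As noted above, a change to the CR re-creates the Poison Pill daemon set, so expect the daemon set pods to restart after the patch is applied.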
You can use the OKD web console to install the Poison Pill Operator.
Log in as a user with cluster-admin privileges.
In the OKD web console, navigate to Operators → OperatorHub.
Search for the Poison Pill Operator from the list of available Operators, and then click Install.
Keep the default selection of Installation mode and namespace to ensure that the Operator is installed to the poison-pill namespace.
Click Install.
To confirm that the installation is successful:
Navigate to the Operators → Installed Operators page.
Check that the Operator is installed in the poison-pill namespace and its status is Succeeded.
If the Operator is not installed successfully:
Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
Navigate to the Workloads → Pods page and check the logs in any pods in the poison-pill-controller-manager project that are reporting issues.
You can use the OpenShift CLI (oc) to install the Poison Pill Operator.
Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.
Create a Namespace custom resource (CR) for the Poison Pill Operator:
Define the Namespace CR and save the YAML file, for example, poison-pill-namespace.yaml:
apiVersion: v1
kind: Namespace
metadata:
  name: poison-pill
To create the Namespace CR, run the following command:
$ oc create -f poison-pill-namespace.yaml
Create an OperatorGroup CR:
Define the OperatorGroup CR and save the YAML file, for example, poison-pill-operator-group.yaml:
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: poison-pill-manager
  namespace: poison-pill
spec:
  targetNamespaces:
  - poison-pill
To create the OperatorGroup CR, run the following command:
$ oc create -f poison-pill-operator-group.yaml
Create a Subscription CR:
Define the Subscription CR and save the YAML file, for example, poison-pill-subscription.yaml:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: poison-pill-manager
  namespace: poison-pill
spec:
  channel: alpha
  name: poison-pill-manager
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  package: poison-pill-manager
To create the Subscription CR, run the following command:
$ oc create -f poison-pill-subscription.yaml
Verify that the installation succeeded by inspecting the CSV resource:
$ oc get csv -n poison-pill
NAME                 DISPLAY                VERSION   REPLACES   PHASE
poison-pill.v0.1.4   Poison Pill Operator   0.1.4                Succeeded
Verify that the Poison Pill Operator is up and running:
$ oc get deploy -n poison-pill
NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
poison-pill-controller-manager   1/1     1            1           10d
Verify that the Poison Pill Operator created the PoisonPillConfig CR:
$ oc get PoisonPillConfig -n poison-pill
NAME                 AGE
poison-pill-config   10d
Verify that each poison pill pod is scheduled and running on each worker node:
$ oc get daemonset -n poison-pill
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
poison-pill-ds   2         2         2       2            2           <none>          10d
This command is unsupported for the control plane nodes.
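The CSV and daemon set checks in this procedure can also be scripted. The following is a minimal sketch under stated assumptions: csv_all_succeeded and daemonsets_ready are hypothetical helpers, and they assume the default table layout shown in the sample output above (PHASE as the last column of oc get csv; DESIRED and READY as the second and fourth columns of oc get daemonset).

```shell
#!/bin/sh
# Both helpers read the default table output of `oc get` on standard input.
# They are hypothetical helpers for illustration, not part of oc.

# Succeeds only if there is at least one CSV row and every row ends in "Succeeded".
csv_all_succeeded() {
  awk 'NR > 1 { rows++; if ($NF != "Succeeded") bad++ }
       END { exit !(rows > 0 && bad == 0) }'
}

# Succeeds only if there is at least one daemon set row and, in every row,
# DESIRED (column 2) equals READY (column 4).
daemonsets_ready() {
  awk 'NR > 1 { rows++; if ($2 != $4) bad++ }
       END { exit !(rows > 0 && bad == 0) }'
}

# Example against a cluster:
#   oc get csv -n poison-pill | csv_all_succeeded && echo "CSV succeeded"
#   oc get daemonset -n poison-pill | daemonsets_ready && echo "daemon set ready"
```

Parsing table output is fragile if the column layout changes between versions; treat this as a convenience check rather than a replacement for inspecting the resources directly.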
Use the following procedure to configure the machine health checks to use the Poison Pill Operator as a remediation provider.
Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.
Create a PoisonPillRemediationTemplate CR:
Define the PoisonPillRemediationTemplate CR:
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillRemediationTemplate
metadata:
  namespace: openshift-machine-api
  name: poisonpillremediationtemplate-sample
spec:
  template:
    spec: {}
To create the PoisonPillRemediationTemplate CR, run the following command:
$ oc create -f <ppr-name>.yaml
Create or update the MachineHealthCheck CR to point to the PoisonPillRemediationTemplate CR:
Define or update the MachineHealthCheck CR:
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: machine-health-check
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: "worker"
      machine.openshift.io/cluster-api-machine-type: "worker"
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  maxUnhealthy: "40%"
  nodeStartupTimeout: "10m"
  remediationTemplate: # (1)
    kind: PoisonPillRemediationTemplate
    apiVersion: poison-pill.medik8s.io/v1alpha1
    name: <poison-pill-remediation-template-sample>
(1) Specify the details for the remediation template.
To create a MachineHealthCheck CR, run the following command:
$ oc create -f <file-name>.yaml
To update a MachineHealthCheck CR, run the following command:
$ oc apply -f <file-name>.yaml
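After the MachineHealthCheck CR is applied, it can be useful to confirm that the template name it references actually exists. The following is a minimal sketch, assuming the sample names used in this procedure; name_in_list is a hypothetical helper, and the oc queries are shown as comments because they require a live cluster.

```shell
#!/bin/sh
# name_in_list WANTED NAME...
# Succeeds if WANTED is exactly equal to one of the remaining arguments.
# Hypothetical helper for illustration; not part of oc or the Operator.
name_in_list() {
  wanted=$1
  shift
  for name in "$@"; do
    [ "$name" = "$wanted" ] && return 0
  done
  return 1
}

# Example against a cluster, using the sample names from this procedure.
# $existing is left unquoted on purpose so that it splits into one argument per name.
#   wanted=$(oc get machinehealthcheck machine-health-check -n openshift-machine-api \
#              -o jsonpath='{.spec.remediationTemplate.name}')
#   existing=$(oc get pprt -n openshift-machine-api -o jsonpath='{.items[*].metadata.name}')
#   name_in_list "$wanted" $existing || echo "template '$wanted' not found" >&2
```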
You want to troubleshoot issues with the Poison Pill Operator.
Check the Operator logs.
The Poison Pill Operator is installed but the daemon set is not available.
Check the Operator logs for errors or warnings.
An unhealthy node was not remediated.
Verify that the PoisonPillRemediation CR was created by running the following command:
$ oc get ppr -A
If the MachineHealthCheck controller did not create the PoisonPillRemediation CR when the node turned unhealthy, check the logs of the MachineHealthCheck controller. Additionally, ensure that the MachineHealthCheck CR includes the required specification to use the remediation template.
If the PoisonPillRemediation CR was created, ensure that its name matches the unhealthy node or the machine object.
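The name comparison can be done by eye, but on a larger cluster a small script helps. The following is a minimal sketch: pprs_without_node is a hypothetical helper that compares two plain-text files of names, and the file names pprs.txt and nodes.txt are assumptions. It covers node names only; to also accept machine object names, append them to the second file.

```shell
#!/bin/sh
# pprs_without_node PPR_FILE NODE_FILE
# Prints every line of PPR_FILE that is not an exact whole-line match for
# some line of NODE_FILE. Exits non-zero when nothing is printed, that is,
# when every PPR name has a matching node name.
# Hypothetical helper for illustration; not part of oc or the Operator.
pprs_without_node() {
  grep -Fxv -f "$2" "$1"
}

# Example against a cluster:
#   oc get ppr -A -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' > pprs.txt
#   oc get nodes -o name | sed 's|^node/||' > nodes.txt
#   pprs_without_node pprs.txt nodes.txt
```

Any name this prints belongs to a PoisonPillRemediation CR that does not correspond to a known node and is worth inspecting further.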
The Poison Pill Operator resources, such as the daemon set, configuration CR, and the remediation template CR, exist even after uninstalling the Operator.
To remove the Poison Pill Operator resources, delete the resources by running the following commands for each resource type:
$ oc delete ds <poison-pill-ds> -n <namespace>
$ oc delete ppc <poison-pill-config> -n <namespace>
$ oc delete pprt <poison-pill-remediation-template> -n <namespace>
The Poison Pill Operator is supported in a restricted network environment. For more information, see Using Operator Lifecycle Manager on restricted networks.