Virtual machine health checks - Monitoring | Virtualization

About readiness and liveness probes
Defining a watchdog
- Configuring a watchdog device for the virtual machine
- Installing the watchdog agent on the guest
Defining a guest agent ping probe
Additional resources

You can configure virtual machine (VM) health checks by defining readiness and liveness probes in the VirtualMachine resource.

About readiness and liveness probes

Use readiness and liveness probes to detect and handle unhealthy virtual machines (VMs). You can include one or more probes in the specification of the VM to ensure that traffic does not reach a VM that is not ready for it and that a new VM is created when a VM becomes unresponsive.

A readiness probe determines whether a VM is ready to accept service requests. If the probe fails, the VM is removed from the list of available endpoints until the VM is ready.

A liveness probe determines whether a VM is responsive. If the probe fails, the VM is deleted and a new VM is created to restore responsiveness.

You can configure readiness and liveness probes by setting the spec.readinessProbe and the spec.livenessProbe fields of the VirtualMachine object. These fields support the following tests:

HTTP GET: The probe determines the health of the VM by using a web hook. The test is successful if the HTTP response code is between 200 and 399. You can use an HTTP GET test with applications that return HTTP status codes when they are completely initialized.
TCP socket: The probe attempts to open a socket to the VM. The VM is only considered healthy if the probe can establish a connection. You can use a TCP socket test with applications that do not start listening until initialization is complete.
Guest agent ping: The probe uses the guest-ping command to determine if the QEMU guest agent is running on the virtual machine.

Defining an HTTP readiness probe

You can define an HTTP readiness probe by setting the spec.readinessProbe.httpGet field of the virtual machine (VM) configuration.

Prerequisites

You have installed the OpenShift CLI (oc).

Procedure

Include details of the readiness probe in the VM configuration file.

Sample readiness probe with an HTTP GET test:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
  name: fedora-vm
  namespace: example-namespace
# ...
spec:
  template:
    spec:
      readinessProbe:
        httpGet: (1)
          port: 1500 (2)
          path: /healthz (3)
          httpHeaders:
          - name: Custom-Header
            value: Awesome
        initialDelaySeconds: 120 (4)
        periodSeconds: 20 (5)
        timeoutSeconds: 10 (6)
        failureThreshold: 3 (7)
        successThreshold: 3 (8)
# ...

1	The HTTP GET request to perform to connect to the VM.
2	The port of the VM that the probe queries. In the above example, the probe queries port 1500.
3	The path to access on the HTTP server. In the above example, if the handler for the server’s /healthz path returns a success code, the VM is considered to be healthy. If the handler returns a failure code, the VM is removed from the list of available endpoints.
4	The time, in seconds, after the VM starts before the readiness probe is initiated.
5	The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than `timeoutSeconds`.
6	The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than `periodSeconds`.
7	The number of times that the probe is allowed to fail. The default is 3. After the specified number of attempts, the pod is marked `Unready`.
8	The number of times that the probe must report success, after a failure, to be considered successful. The default is 1.

Create the VM by running the following command:
```
$ oc create -f <file_name>.yaml
```

Defining a TCP readiness probe

You can define a TCP readiness probe by setting the spec.readinessProbe.tcpSocket field of the virtual machine (VM) configuration.

Prerequisites

You have installed the OpenShift CLI (oc).

Procedure

Include details of the TCP readiness probe in the VM configuration file.

Sample readiness probe with a TCP socket test:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
  name: fedora-vm
  namespace: example-namespace
# ...
spec:
  template:
    spec:
      readinessProbe:
        initialDelaySeconds: 120 (1)
        periodSeconds: 20 (2)
        tcpSocket: (3)
          port: 1500 (4)
        timeoutSeconds: 10 (5)
# ...

1	The time, in seconds, after the VM starts before the readiness probe is initiated.
2	The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than `timeoutSeconds`.
3	The TCP action to perform.
4	The port of the VM that the probe queries.
5	The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than `periodSeconds`.

Create the VM by running the following command:
```
$ oc create -f <file_name>.yaml
```

Defining an HTTP liveness probe

Define an HTTP liveness probe by setting the spec.livenessProbe.httpGet field of the virtual machine (VM) configuration. You can define both HTTP and TCP tests for liveness probes in the same way as readiness probes. This procedure configures a sample liveness probe with an HTTP GET test.

Prerequisites

You have installed the OpenShift CLI (oc).

Procedure

Include details of the HTTP liveness probe in the VM configuration file.

Sample liveness probe with an HTTP GET test:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
  name: fedora-vm
  namespace: example-namespace
# ...
spec:
  template:
    spec:
      livenessProbe:
        initialDelaySeconds: 120 (1)
        periodSeconds: 20 (2)
        httpGet: (3)
          port: 1500 (4)
          path: /healthz (5)
          httpHeaders:
          - name: Custom-Header
            value: Awesome
        timeoutSeconds: 10 (6)
# ...

1	The time, in seconds, after the VM starts before the liveness probe is initiated.
2	The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than `timeoutSeconds`.
3	The HTTP GET request to perform to connect to the VM.
4	The port of the VM that the probe queries. In the above example, the probe queries port 1500. The VM installs and runs a minimal HTTP server on port 1500 via cloud-init.
5	The path to access on the HTTP server. In the above example, if the handler for the server’s `/healthz` path returns a success code, the VM is considered to be healthy. If the handler returns a failure code, the VM is deleted and a new VM is created.
6	The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than `periodSeconds`.

Create the VM by running the following command:
```
$ oc create -f <file_name>.yaml
```

Defining a watchdog

You can define a watchdog to monitor the health of the guest operating system by performing the following steps:

Configure a watchdog device for the virtual machine (VM).
Install the watchdog agent on the guest.

The watchdog device monitors the agent and performs one of the following actions if the guest operating system is unresponsive:

poweroff: The VM powers down immediately. If spec.runStrategy is not set to manual, the VM reboots.
reset: The VM reboots in place and the guest operating system cannot react.

The reboot time might cause liveness probes to time out. If cluster-level protections detect a failed liveness probe, the VM might be forcibly rescheduled, increasing the reboot time.
shutdown: The VM gracefully powers down by stopping all services.

Watchdog is not available for Windows VMs.

Configuring a watchdog device for the virtual machine

You configure a watchdog device for the virtual machine (VM).

Prerequisites

For x86 systems, the VM must use a kernel that works with the i6300esb watchdog device. If you use s390x architecture, the kernel must be enabled for diag288. Fedora images support i6300esb and diag288.
You have installed the OpenShift CLI (oc).

Procedure

Create a YAML file with the following contents:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: <vm-label>
  name: <vm-name>
spec:
  runStrategy: Halted
  template:
    metadata:
      labels:
        kubevirt.io/vm: <vm-label>
    spec:
      domain:
        devices:
          watchdog:
            name: <watchdog>
            <watchdog-device-model>: (1)
              action: "poweroff" (2)
# ...

1	The watchdog device model to use. For `x86` specify `i6300esb`. For `s390x` specify `diag288`.
2	Specify `poweroff`, `reset`, or `shutdown`. The `shutdown` action requires that the guest virtual machine is responsive to ACPI signals. Therefore, using `shutdown` is not recommended.

The example above configures the watchdog device on a VM with the poweroff action and exposes the device as /dev/watchdog.

This device can now be used by the watchdog binary.

Apply the YAML file to your cluster by running the following command:
```
$ oc apply -f <file_name>.yaml
```

Verification

This procedure is provided for testing watchdog functionality only and must not be run on production machines.

Run the following command to verify that the VM is connected to the watchdog device:
```
$ lspci | grep watchdog -i
```
Run one of the following commands to confirm the watchdog is active:
- Trigger a kernel panic:
  # echo c > /proc/sysrq-trigger
- Stop the watchdog service:
  # pkill -9 watchdog

Installing the watchdog agent on the guest

You can install the watchdog agent on the guest and start the watchdog service.

Procedure

Log in to the virtual machine as root user.
This step is only required when installing on IBM Z® (s390x). Enable watchdog by running the following command:
```
# modprobe diag288_wdt
```
Verify that the /dev/watchdog file path is present in the VM by running the following command:
```
# ls /dev/watchdog
```
Install the watchdog package and its dependencies:
```
# yum install watchdog
```
Uncomment the following line in the /etc/watchdog.conf file and save the changes:
```
#watchdog-device = /dev/watchdog
```

Enable the watchdog service to start on boot:

# systemctl enable --now watchdog.service

Defining a guest agent ping probe

You can define a guest agent ping probe by setting the spec.readinessProbe.guestAgentPing field of the virtual machine (VM) configuration.

Prerequisites

The QEMU guest agent must be installed and enabled on the virtual machine.
You have installed the OpenShift CLI (oc).

Procedure

Include details of the guest agent ping probe in the VM configuration file. For example:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
  name: fedora-vm
  namespace: example-namespace
# ...
spec:
  template:
    spec:
      readinessProbe:
        guestAgentPing: {} (1)
        initialDelaySeconds: 120 (2)
        periodSeconds: 20 (3)
        timeoutSeconds: 10 (4)
        failureThreshold: 3 (5)
        successThreshold: 3 (6)
# ...

1	The guest agent ping probe to connect to the VM.
2	Optional: The time, in seconds, after the VM starts before the guest agent probe is initiated.
3	Optional: The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than `timeoutSeconds`.
4	Optional: The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than `periodSeconds`.
5	Optional: The number of times that the probe is allowed to fail. The default is 3. After the specified number of attempts, the pod is marked `Unready`.
6	Optional: The number of times that the probe must report success, after a failure, to be considered successful. The default is 1.

Create the VM by running the following command:
```
$ oc create -f <file_name>.yaml
```

Additional resources

Monitoring application health by using health checks