Configuring IP failover

IP failover environment variables
Configuring IP failover
About virtual IP addresses
Configuring check and notify scripts
Configuring VRRP preemption
About VRRP ID offset
Configuring IP failover for more than 254 addresses
High availability For ingressIP
Removing IP failover

This topic describes configuring IP failover for pods and services on your OKD cluster.

IP failover manages a pool of Virtual IP (VIP) addresses on a set of nodes. Every VIP in the set is serviced by a node selected from the set. As long a single node is available, the VIPs are served. There is no way to explicitly distribute the VIPs over the nodes, so there can be nodes with no VIPs and other nodes with many VIPs. If there is only one node, all VIPs are on it.

The VIPs must be routable from outside the cluster.

IP failover monitors a port on each VIP to determine whether the port is reachable on the node. If the port is not reachable, the VIP is not assigned to the node. If the port is set to 0, this check is suppressed. The check script does the needed testing.

IP failover uses Keepalived to host a set of externally accessible VIP addresses on a set of hosts. Each VIP is only serviced by a single host at a time. Keepalived uses the Virtual Router Redundancy Protocol (VRRP) to determine which host, from the set of hosts, services which VIP. If a host becomes unavailable, or if the service that Keepalived is watching does not respond, the VIP is switched to another host from the set. This means a VIP is always serviced as long as a host is available.

When a node running Keepalived passes the check script, the VIP on that node can enter the master state based on its priority and the priority of the current master and as determined by the preemption strategy.

A cluster administrator can provide a script through the OPENSHIFT_HA_NOTIFY_SCRIPT variable, and this script is called whenever the state of the VIP on the node changes. Keepalived uses the master state when it is servicing the VIP, the backup state when another node is servicing the VIP, or in the fault state when the check script fails. The notify script is called with the new state whenever the state changes.

You can create an IP failover deployment configuration on OKD. The IP failover deployment configuration specifies the set of VIP addresses, and the set of nodes on which to service them. A cluster can have multiple IP failover deployment configurations, with each managing its own set of unique VIP addresses. Each node in the IP failover configuration runs an IP failover pod, and this pod runs Keepalived.

When using VIPs to access a pod with host networking, the application pod runs on all nodes that are running the IP failover pods. This enables any of the IP failover nodes to become the master and service the VIPs when needed. If application pods are not running on all nodes with IP failover, either some IP failover nodes never service the VIPs or some application pods never receive any traffic. Use the same selector and replication count, for both IP failover and the application pods, to avoid this mismatch.

While using VIPs to access a service, any of the nodes can be in the IP failover set of nodes, since the service is reachable on all nodes, no matter where the application pod is running. Any of the IP failover nodes can become master at any time. The service can either use external IPs and a service port or it can use a NodePort.

When using external IPs in the service definition, the VIPs are set to the external IPs, and the IP failover monitoring port is set to the service port. When using a node port, the port is open on every node in the cluster, and the service load-balances traffic from whatever node currently services the VIP. In this case, the IP failover monitoring port is set to the NodePort in the service definition.

Setting up a NodePort is a privileged operation.

Even though a service VIP is highly available, performance can still be affected. Keepalived makes sure that each of the VIPs is serviced by some node in the configuration, and several VIPs can end up on the same node even when other nodes have none. Strategies that externally load-balance across a set of VIPs can be thwarted when IP failover puts multiple VIPs on the same node.

When you use ingressIP, you can set up IP failover to have the same VIP range as the ingressIP range. You can also disable the monitoring port. In this case, all the VIPs appear on same node in the cluster. Any user can set up a service with an ingressIP and have it highly available.

There are a maximum of 254 VIPs in the cluster.

IP failover environment variables

The following table contains the variables used to configure IP failover.

Table 1. IP failover environment variables
Variable Name	Default	Description
`OPENSHIFT_HA_MONITOR_PORT`	`80`	The IP failover pod tries to open a TCP connection to this port on each Virtual IP (VIP). If connection is established, the service is considered to be running. If this port is set to `0`, the test always passes.
`OPENSHIFT_HA_NETWORK_INTERFACE`		The interface name that IP failover uses to send Virtual Router Redundancy Protocol (VRRP) traffic. The default value is `eth0`.
`OPENSHIFT_HA_REPLICA_COUNT`	`2`	The number of replicas to create. This must match `spec.replicas` value in IP failover deployment configuration.
`OPENSHIFT_HA_VIRTUAL_IPS`		The list of IP address ranges to replicate. This must be provided. For example, `1.2.3.4-6,1.2.3.9`.
`OPENSHIFT_HA_VRRP_ID_OFFSET`	`0`	The offset value used to set the virtual router IDs. Using different offset values allows multiple IP failover configurations to exist within the same cluster. The default offset is `0`, and the allowed range is `0` through `255`.
`OPENSHIFT_HA_VIP_GROUPS`		The number of groups to create for VRRP. If not set, a group is created for each virtual IP range specified with the `OPENSHIFT_HA_VIP_GROUPS` variable.
`OPENSHIFT_HA_IPTABLES_CHAIN`	INPUT	The name of the iptables chain, to automatically add an `iptables` rule to allow the VRRP traffic on. If the value is not set, an `iptables` rule is not added. If the chain does not exist, it is not created.
`OPENSHIFT_HA_CHECK_SCRIPT`		The full path name in the pod file system of a script that is periodically run to verify the application is operating.
`OPENSHIFT_HA_CHECK_INTERVAL`	`2`	The period, in seconds, that the check script is run.
`OPENSHIFT_HA_NOTIFY_SCRIPT`		The full path name in the pod file system of a script that is run whenever the state changes.
`OPENSHIFT_HA_PREEMPTION`	`preempt_nodelay 300`	The strategy for handling a new higher priority host. The `nopreempt` strategy does not move master from the lower priority host to the higher priority host.

As a cluster administrator, you can configure IP failover on an entire cluster, or on a subset of nodes, as defined by the label selector. You can also configure multiple IP failover deployment configurations in your cluster, where each one is independent of the others.

The IP failover deployment configuration ensures that a failover pod runs on each of the nodes matching the constraints or the label used.

This pod runs Keepalived, which can monitor an endpoint and use Virtual Router Redundancy Protocol (VRRP) to fail over the virtual IP (VIP) from one node to another if the first node cannot reach the service or endpoint.

For production use, set a selector that selects at least two nodes, and set replicas equal to the number of selected nodes.

Prerequisites

You are logged in to the cluster with a user with cluster-admin privileges.
You created a pull secret.

Procedure

Create an IP failover service account:
```
$ oc create sa ipfailover
```

Update security context constraints (SCC) for hostNetwork:

$ oc adm policy add-scc-to-user privileged -z ipfailover
$ oc adm policy add-scc-to-user hostnetwork -z ipfailover

Create a deployment YAML file to configure IP failover:

Example deployment YAML for IP failover configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ipfailover-keepalived (1)
  labels:
    ipfailover: hello-openshift
spec:
  strategy:
    type: Recreate
  replicas: 2
  selector:
    matchLabels:
      ipfailover: hello-openshift
  template:
    metadata:
      labels:
        ipfailover: hello-openshift
    spec:
      serviceAccountName: ipfailover
      privileged: true
      hostNetwork: true
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      containers:
      - name: openshift-ipfailover
        image: quay.io/openshift/origin-keepalived-ipfailover
        ports:
        - containerPort: 63000
          hostPort: 63000
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        volumeMounts:
        - name: lib-modules
          mountPath: /lib/modules
          readOnly: true
        - name: host-slash
          mountPath: /host
          readOnly: true
          mountPropagation: HostToContainer
        - name: etc-sysconfig
          mountPath: /etc/sysconfig
          readOnly: true
        - name: config-volume
          mountPath: /etc/keepalive
        env:
        - name: OPENSHIFT_HA_CONFIG_NAME
          value: "ipfailover"
        - name: OPENSHIFT_HA_VIRTUAL_IPS (2)
          value: "1.1.1.1-2"
        - name: OPENSHIFT_HA_VIP_GROUPS (3)
          value: "10"
        - name: OPENSHIFT_HA_NETWORK_INTERFACE (4)
          value: "ens3" #The host interface to assign the VIPs
        - name: OPENSHIFT_HA_MONITOR_PORT (5)
          value: "30060"
        - name: OPENSHIFT_HA_VRRP_ID_OFFSET (6)
          value: "0"
        - name: OPENSHIFT_HA_REPLICA_COUNT (7)
          value: "2" #Must match the number of replicas in the deployment
        - name: OPENSHIFT_HA_USE_UNICAST
          value: "false"
        #- name: OPENSHIFT_HA_UNICAST_PEERS
          #value: "10.0.148.40,10.0.160.234,10.0.199.110"
        - name: OPENSHIFT_HA_IPTABLES_CHAIN (8)
          value: "INPUT"
        #- name: OPENSHIFT_HA_NOTIFY_SCRIPT (9)
        #  value: /etc/keepalive/mynotifyscript.sh
        - name: OPENSHIFT_HA_CHECK_SCRIPT (10)
          value: "/etc/keepalive/mycheckscript.sh"
        - name: OPENSHIFT_HA_PREEMPTION (11)
          value: "preempt_delay 300"
        - name: OPENSHIFT_HA_CHECK_INTERVAL (12)
          value: "2"
        livenessProbe:
          initialDelaySeconds: 10
          exec:
            command:
            - pgrep
            - keepalived
      volumes:
      - name: lib-modules
        hostPath:
          path: /lib/modules
      - name: host-slash
        hostPath:
          path: /
      - name: etc-sysconfig
        hostPath:
          path: /etc/sysconfig
      # config-volume contains the check script
      # created with `oc create configmap keepalived-checkscript --from-file=mycheckscript.sh`
      - configMap:
          defaultMode: 0755
          name: keepalived-checkscript
        name: config-volume
      imagePullSecrets:
        - name: openshift-pull-secret (13)

1	The name of the IP failover deployment.
2	The list of IP address ranges to replicate. This must be provided. For example, `1.2.3.4-6,1.2.3.9`.
3	The number of groups to create for VRRP. If not set, a group is created for each virtual IP range specified with the `OPENSHIFT_HA_VIP_GROUPS` variable.
4	The interface name that IP failover uses to send VRRP traffic. By default, `eth0` is used.
5	The IP failover pod tries to open a TCP connection to this port on each VIP. If connection is established, the service is considered to be running. If this port is set to `0`, the test always passes. The default value is `80`.
6	The offset value used to set the virtual router IDs. Using different offset values allows multiple IP failover configurations to exist within the same cluster. The default offset is `0`, and the allowed range is `0` through `255`.
7	The number of replicas to create. This must match `spec.replicas` value in IP failover deployment configuration. The default value is `2`.
8	The name of the `iptables` chain to automatically add an `iptables` rule to allow the VRRP traffic on. If the value is not set, an `iptables` rule is not added. If the chain does not exist, it is not created, and Keepalived operates in unicast mode. The default is `INPUT`.
9	The full path name in the pod file system of a script that is run whenever the state changes.
10	The full path name in the pod file system of a script that is periodically run to verify the application is operating.
11	The strategy for handling a new higher priority host. The default value is `preempt_delay 300`, which causes a Keepalived instance to take over a VIP after 5 minutes if a lower-priority master is holding the VIP.
12	The period, in seconds, that the check script is run. The default value is `2`.
13	Create the pull secret before creating the deployment, otherwise you will get an error when creating the deployment.

About virtual IP addresses

Keepalived manages a set of virtual IP addresses (VIP). The administrator must make sure that all of these addresses:

Are accessible on the configured hosts from outside the cluster.
Are not used for any other purpose within the cluster.

Keepalived on each node determines whether the needed service is running. If it is, VIPs are supported and Keepalived participates in the negotiation to determine which node serves the VIP. For a node to participate, the service must be listening on the watch port on a VIP or the check must be disabled.

Each VIP in the set may end up being served by a different node.

Configuring check and notify scripts

Keepalived monitors the health of the application by periodically running an optional user supplied check script. For example, the script can test a web server by issuing a request and verifying the response.

When a check script is not provided, a simple default script is run that tests the TCP connection. This default test is suppressed when the monitor port is 0.

Each IP failover pod manages a Keepalived daemon that manages one or more virtual IPs (VIP) on the node where the pod is running. The Keepalived daemon keeps the state of each VIP for that node. A particular VIP on a particular node may be in master, backup, or fault state.

When the check script for that VIP on the node that is in master state fails, the VIP on that node enters the fault state, which triggers a renegotiation. During renegotiation, all VIPs on a node that are not in the fault state participate in deciding which node takes over the VIP. Ultimately, the VIP enters the master state on some node, and the VIP stays in the backup state on the other nodes.

When a node with a VIP in backup state fails, the VIP on that node enters the fault state. When the check script passes again for a VIP on a node in the fault state, the VIP on that node exits the fault state and negotiates to enter the master state. The VIP on that node may then enter either the master or the backup state.

As cluster administrator, you can provide an optional notify script, which is called whenever the state changes. Keepalived passes the following three parameters to the script:

$1 - group or instance
$2 - Name of the group or instance
$3 - The new state: master, backup, or fault

The check and notify scripts run in the IP failover pod and use the pod file system, not the host file system. However, the IP failover pod makes the host file system available under the /hosts mount path. When configuring a check or notify script, you must provide the full path to the script. The recommended approach for providing the scripts is to use a config map.

The full path names of the check and notify scripts are added to the Keepalived configuration file, _/etc/keepalived/keepalived.conf, which is loaded every time Keepalived starts. The scripts can be added to the pod with a config map as follows.

Prerequisites

You installed the OpenShift CLI (oc).
You are logged in to the cluster with a user with cluster-admin privileges.

Procedure

Create the desired script and create a config map to hold it. The script has no input arguments and must return 0 for OK and 1 for fail.

The check script, mycheckscript.sh:
```
#!/bin/bash
    # Whatever tests are needed
    # E.g., send request and verify response
exit 0
```

Create the config map:

$ oc create configmap mycustomcheck --from-file=mycheckscript.sh

Add the script to the pod. The defaultMode for the mounted config map files must able to run by using oc commands or by editing the deployment configuration. A value of 0755, 493 decimal, is typical:

$ oc set env deploy/ipfailover-keepalived \
    OPENSHIFT_HA_CHECK_SCRIPT=/etc/keepalive/mycheckscript.sh

$ oc set volume deploy/ipfailover-keepalived --add --overwrite \
    --name=config-volume \
    --mount-path=/etc/keepalive \
    --source='{"configMap": { "name": "mycustomcheck", "defaultMode": 493}}'

The oc set env command is whitespace sensitive. There must be no whitespace on either side of the = sign.

You can alternatively edit the ipfailover-keepalived deployment configuration:

$ oc edit deploy ipfailover-keepalived

    spec:
      containers:
      - env:
        - name: OPENSHIFT_HA_CHECK_SCRIPT  (1)
          value: /etc/keepalive/mycheckscript.sh
...
        volumeMounts: (2)
        - mountPath: /etc/keepalive
          name: config-volume
      dnsPolicy: ClusterFirst
...
      volumes: (3)
      - configMap:
          defaultMode: 0755 (4)
          name: customrouter
        name: config-volume
...

1	In the `spec.container.env` field, add the `OPENSHIFT_HA_CHECK_SCRIPT` environment variable to point to the mounted script file.
2	Add the `spec.container.volumeMounts` field to create the mount point.
3	Add a new `spec.volumes` field to mention the config map.
4	This sets run permission on the files. When read back, it is displayed in decimal, `493`.

Save the changes and exit the editor. This restarts ipfailover-keepalived.

Configuring VRRP preemption

When a Virtual IP (VIP) on a node leaves the fault state by passing the check script, the VIP on the node enters the backup state if it has lower priority than the VIP on the node that is currently in the master state. However, if the VIP on the node that is leaving fault state has a higher priority, the preemption strategy determines its role in the cluster.

The nopreempt strategy does not move master from the lower priority VIP on the host to the higher priority VIP on the host. With preempt_delay 300, the default, Keepalived waits the specified 300 seconds and moves master to the higher priority VIP on the host.

Prerequisites

You installed the OpenShift CLI (oc).

Procedure

To specify preemption enter oc edit deploy ipfailover-keepalived to edit the router deployment configuration:

$ oc edit deploy ipfailover-keepalived

...
    spec:
      containers:
      - env:
        - name: OPENSHIFT_HA_PREEMPTION  (1)
          value: preempt_delay 300
...

Set the OPENSHIFT_HA_PREEMPTION value:

preempt_delay 300: Keepalived waits the specified 300 seconds and moves master to the higher priority VIP on the host. This is the default value.
nopreempt: does not move master from the lower priority VIP on the host to the higher priority VIP on the host.

About VRRP ID offset

Each IP failover pod managed by the IP failover deployment configuration, 1 pod per node or replica, runs a Keepalived daemon. As more IP failover deployment configurations are configured, more pods are created and more daemons join into the common Virtual Router Redundancy Protocol (VRRP) negotiation. This negotiation is done by all the Keepalived daemons and it determines which nodes service which virtual IPs (VIP).

Internally, Keepalived assigns a unique vrrp-id to each VIP. The negotiation uses this set of vrrp-ids, when a decision is made, the VIP corresponding to the winning vrrp-id is serviced on the winning node.

Therefore, for every VIP defined in the IP failover deployment configuration, the IP failover pod must assign a corresponding vrrp-id. This is done by starting at OPENSHIFT_HA_VRRP_ID_OFFSET and sequentially assigning the vrrp-ids to the list of VIPs. The vrrp-ids can have values in the range 1..255.

When there are multiple IP failover deployment configurations, you must specify OPENSHIFT_HA_VRRP_ID_OFFSET so that there is room to increase the number of VIPs in the deployment configuration and none of the vrrp-id ranges overlap.

Configuring IP failover for more than 254 addresses

IP failover management is limited to 254 groups of Virtual IP (VIP) addresses. By default OKD assigns one IP address to each group. You can use the OPENSHIFT_HA_VIP_GROUPS variable to change this so multiple IP addresses are in each group and define the number of VIP groups available for each Virtual Router Redundancy Protocol (VRRP) instance when configuring IP failover.

Grouping VIPs creates a wider range of allocation of VIPs per VRRP in the case of VRRP failover events, and is useful when all hosts in the cluster have access to a service locally. For example, when a service is being exposed with an ExternalIP.

As a rule for failover, do not limit services, such as the router, to one specific host. Instead, services should be replicated to each host so that in the case of IP failover, the services do not have to be recreated on the new host.

If you are using OKD health checks, the nature of IP failover and groups means that all instances in the group are not checked. For that reason, the Kubernetes health checks must be used to ensure that services are live.

Prerequisites

You are logged in to the cluster with a user with cluster-admin privileges.

Procedure

To change the number of IP addresses assigned to each group, change the value for the OPENSHIFT_HA_VIP_GROUPS variable, for example:
Example Deployment YAML for IP failover configuration
```
...
    spec:
        env:
        - name: OPENSHIFT_HA_VIP_GROUPS (1)
          value: "3"
...
```
1 If OPENSHIFT_HA_VIP_GROUPS is set to 3 in an environment with seven VIPs, it creates three groups, assigning three VIPs to the first group, and two VIPs to the two remaining groups.

If the number of groups set by OPENSHIFT_HA_VIP_GROUPS is fewer than the number of IP addresses set to fail over, the group contains more than one IP address, and all of the addresses move as a single unit.

High availability For ingressIP

In non-cloud clusters, IP failover and ingressIP to a service can be combined. The result is high availability services for users that create services using ingressIP.

The approach is to specify an ingressIPNetworkCIDR range and then use the same range in creating the ipfailover configuration.

Because IP failover can support up to a maximum of 255 VIPs for the entire cluster, the ingressIPNetworkCIDR needs to be /24 or smaller.

Removing IP failover

When IP failover is initially configured, the worker nodes in the cluster are modified with an iptables rule that explicitly allows multicast packets on 224.0.0.18 for Keepalived. Because of the change to the nodes, removing IP failover requires running a job to remove the iptables rule and removing the virtual IP addresses used by Keepalived.

Procedure

Optional: Identify and delete any check and notify scripts that are stored as config maps:

Identify whether any pods for IP failover use a config map as a volume:

$ oc get pod -l ipfailover \
  -o jsonpath="\
{range .items[?(@.spec.volumes[*].configMap)]}
{'Namespace: '}{.metadata.namespace}
{'Pod:       '}{.metadata.name}
{'Volumes that use config maps:'}
{range .spec.volumes[?(@.configMap)]}  {'volume:    '}{.name}
  {'configMap: '}{.configMap.name}{'\n'}{end}
{end}"

Example output

Namespace: default
Pod:       keepalived-worker-59df45db9c-2x9mn
Volumes that use config maps:
  volume:    config-volume
  configMap: mycustomcheck

If the preceding step provided the names of config maps that are used as volumes, delete the config maps:
```
$ oc delete configmap <configmap_name>
```

Identify an existing deployment for IP failover:

$ oc get deployment -l ipfailover

Example output

NAMESPACE   NAME         READY   UP-TO-DATE   AVAILABLE   AGE
default     ipfailover   2/2     2            2           105d

Delete the deployment:

$ oc delete deployment <ipfailover_deployment_name>

Remove the ipfailover service account:
```
$ oc delete sa ipfailover
```

Run a job that removes the IP tables rule that was added when IP failover was initially configured:

Create a file such as remove-ipfailover-job.yaml with contents that are similar to the following example:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: remove-ipfailover-
  labels:
    app: remove-ipfailover
spec:
  template:
    metadata:
      name: remove-ipfailover
    spec:
      containers:
      - name: remove-ipfailover
        image: quay.io/openshift/origin-keepalived-ipfailover:4.9
        command: ["/var/lib/ipfailover/keepalived/remove-failover.sh"]
      nodeSelector:
        kubernetes.io/hostname: <host_name>  (1)
      restartPolicy: Never

1	Run the job for each node in your cluster that was configured for IP failover and replace the hostname each time.

Run the job:

$ oc create -f remove-ipfailover-job.yaml

Example output

job.batch/remove-ipfailover-2h8dm created

Verification

Confirm that the job removed the initial configuration for IP failover.

$ oc logs job/remove-ipfailover-2h8dm

Example output

remove-failover.sh: OpenShift IP Failover service terminating.
  - Removing ip_vs module ...
  - Cleaning up ...
  - Releasing VIPs  (interface eth0) ...