High Availability | Cluster Administration

Overview
Configuring IP Failover
Configuring Service ExternalIP and NodePort
High Availability For IngressIP

Overview

This topic describes setting up high availability for pods and services on your OKD cluster.

IP failover manages a pool of Virtual IP (VIP) addresses on a set of nodes. Every VIP in the set will be serviced by a node selected from the set. As long a single node is available, the VIPs will be served. There is no way to explicitly distribute the VIPs over the nodes. so there may be nodes with no VIPs and other nodes with many VIPs. If there is only one node, all VIPs will be on it.

The VIPs must be routable from outside the cluster.

IP failover monitors a port on each VIP to determine whether the port is reachable on the node. If the port is not reachable, the VIP will not be assigned to the node. If the port is set to 0, this check is suppressed. The check script does the needed testing.

IP failover uses Keepalived to host a set of externally accessible VIP addresses on a set of hosts. Each VIP is only serviced by a single host at a time. Keepalived uses the VRRP protocol to determine which host (from the set of hosts) will service which VIP. If a host becomes unavailable or if the service that Keepalived is watching does not respond, the VIP is switched to another host from the set. Thus, a VIP is always serviced as long as a host is available.

When a host running Keepalived passes the check script, the host can become in the MASTER state based on its priority and the priority of the current MASTER, as determined by the preemption strategy.

The administrator can provide a script via the --notify-script= option, which is called whenever the state changes. Keepalived is in MASTER state when it is servicing the VIP, in BACKUP state when another node is servicing the VIP, or in FAULT` state when the check script fails. The notify script is called with the new state whenever the state changes.

OKD supports creation of IP failover deployment configuration, by running the oc adm ipfailover command. The IP failover deployment configuration specifies the set of VIP addresses, and the set of nodes on which to service them. A cluster can have multiple IP failover deployment configurations, with each managing its own set of unique VIP addresses. Each node in the IP failover configuration runs an IP failover pod, and this pod runs Keepalived.

When using VIPs to access a pod with host networking (e.g. a router), the application pod should be running on all nodes that are running the ipfailover pods. This enables any of the ipfailover nodes to become the master and service the VIPs when needed. If application pods are not running on all nodes with ipfailover, either some ipfailover nodes will never service the VIPs or some application pods will never receive any traffic. Use the same selector and replication count, for both ipfailover and the application pods, to avoid this mismatch.

While using VIPs to access a service, any of the nodes can be in the ipfailover set of nodes, since the service is reachable on all nodes (no matter where the application pod is running). Any of the ipfailover nodes can become master at any time. The service can either use external IPs and a service port or it can use a nodePort.

When using external IPs in the service definition the VIPs are set to the external IPs and the ipfailover monitoring port is set to the service port. A nodePort is open on every node in the cluster and the service will load balance traffic from whatever node currently supports the VIP. In this case, the ipfailover monitoring port is set to the nodePort in the service definition.

Setting up a nodePort is a privileged operation.

Even though a service VIP is highly available, performance can still be affected. keepalived makes sure that each of the VIPs is serviced by some node in the configuration, and several VIPs may end up on the same node even when other nodes have none. Strategies that externally load balance across a set of VIPs may be thwarted when ipfailover puts multiple VIPs on the same node.

When you use ingressIP, you can set up ipfailover to have the same VIP range as the ingressIP range. You can also disable the monitoring port. In this case, all the VIPs will appear on same node in the cluster. Any user can set up a service with an ingressIP and have it highly available.

There are a maximum of 255 VIPs in the cluster.

Configuring IP Failover

Use the oc adm ipfailover command with suitable options, to create ipfailover deployment configuration.

Currently, ipfailover is not compatible with cloud infrastructures. For AWS, an Elastic Load Balancer (ELB) can be used to make OKD highly available, using the AWS console.

As an administrator, you can configure ipfailover on an entire cluster, or on a subset of nodes, as defined by the label selector. You can also configure multiple IP failover deployment configurations in your cluster, where each one is independent of the others. The oc adm ipfailover command creates an ipfailover deployment configuration which ensures that a failover pod runs on each of the nodes matching the constraints or the label used. This pod runs Keepalived which uses VRRP (Virtual Router Redundancy Protocol) among all the Keepalived daemons to ensure that the service on the watched port is available, and if it is not, Keepalived will automatically float the virtual IPs (VIPs).

For production use, make sure to use a --selector=<label> with at least two nodes to select the nodes. Also, set a --replicas=<n> value that matches the number of nodes for the given labeled selector.

The oc adm ipfailover command includes command line options that set environment variables that control Keepalived. The environment variables start with OPENSHIFT_HA_* and they can be changed as needed.

For example, the command below will create an IP failover configuration on a selection of nodes labeled router=us-west-ha (on 4 nodes with 7 virtual IPs monitoring a service listening on port 80, such as the router process).

$ oc adm ipfailover --selector="router=us-west-ha" \
    --virtual-ips="1.2.3.4,10.1.1.100-104,5.6.7.8" \
    --watch-port=80 --replicas=4 --create

Virtual IP Addresses

Keepalived manages a set of virtual IP addresses (VIPs). The administrator must make sure that all these addresses:

Are accessible on the configured hosts from outside the cluster.
Are not used for any other purpose within the cluster.

Keepalived on each node determines whether the needed service is running. If it is, VIPs are supported and Keepalived participates in the negotiation to determine which node will serve the VIP. For a node to participate, the service must be listening on the watch port on a VIP or the check must be disabled.

Each VIP in the set may end up being served by a different node.

Check and Notify Scripts

Keepalived monitors the health of the application by periodically running an optional user supplied check script. For example, the script can test a web server by issuing a request and verifying the response.

The script is provided through the --check-script=<script> option to the oc adm ipfailover command. The script must exit with 0 for PASS or 1 for FAIL.

By default, the check is done every two seconds, but can be changed using the --check-interval=<seconds> option.

When a check script is not provided, a simple default script is run that tests the TCP connection. This default test is suppressed when the monitor port is 0.

For each virtual IPs (VIPs), keepalived keeps the state of the node. The VIP on the node may be in MASTER, BACKUP, or FAULT state. All VIPs on the node that are not in the FAULT state participate in the negotiation to decide which will be MASTER for the VIP. All of the losers enter the BACKUP state. When the check script on the MASTER fails, the VIP enters the FAULT state and triggers a renegotiation. When the BACKUP fails, the VIP enters the FAULT state. When the check script passes again on a VIP in the FAULT state, it exits FAULT and negotiates for MASTER. The resulting state is either MASTER or BACKUP.

The administrator can provide an optional notify script, which is called whenever the state changes. Keepalived passes the following three parameters to the script:

$1 - "GROUP"|"INSTANCE"
$2 - Name of the group or instance
$3 - The new state ("MASTER"|"BACKUP"|"FAULT")

These scripts run in the IP failover pod and use the pod’s file system, not the host file system. The options require the full path to the script. The administrator must make the script available in the pod to extract the results from running the notify script. The recommended approach for providing the scripts is to use a ConfigMap.

The full path names of the check and notify scripts are added to the keepalived configuration file, /etc/keepalived/keepalived.conf, which is loaded every time keepalived starts. The scripts can be added to the pod with a ConfigMap as follows.

Create the desired script and create a ConfigMap to hold it. The script has no input arguments and must return 0 for OK and 1 for FAIL.

The check script, mycheckscript.sh:
```
#!/bin/bash
    # Whatever tests are needed
    # E.g., send request and verify response
exit 0
```

Create the ConfigMap:

$ oc create configmap mycustomcheck --from-file=mycheckscript.sh

There are two approaches to adding the script to the pod: use oc commands or edit the deployment configuration. In both cases, the defaultMode for the mounted configMap files must allow execution. A value of 0755 (493 decimal) is typical.

Using oc commands:

$ oc env dc/ipf-ha-router \
    OPENSHIFT_HA_CHECK_SCRIPT=/etc/keepalive/mycheckscript.sh
$ oc set volume dc/ipf-ha-router --add --overwrite \
    --name=config-volume \
    --mount-path=/etc/keepalive \
    --source='{"configMap": { "name": "mycustomcheck", "defaultMode": 493}}'

Editing the ipf-ha-router deployment configuration:

Use oc edit dc ipf-ha-router to edit the router deployment configuration with a text editor.

...
    spec:
      containers:
      - env:
        - name: OPENSHIFT_HA_CHECK_SCRIPT  (1)
          value: /etc/keepalive/mycheckscript.sh
...
        volumeMounts: (2)
        - mountPath: /etc/keepalive
          name: config-volume
      dnsPolicy: ClusterFirst
...
      volumes: (3)
      - configMap:
          defaultMode: 0755 (4)
          name: customrouter
        name: config-volume
...

1	In the `spec.container.env` field, add the `OPENSHIFT_HA_CHECK_SCRIPT` environment variable to point to the mounted script file.
2	Add the `spec.container.volumeMounts` field to create the mount point.
3	Add a new `spec.volumes` field to mention the ConfigMap.
4	This sets execute permission on the files. When read back, it will be displayed in decimal (`493`).

Save the changes and exit the editor. This restarts ipf-ha-router.

VRRP Preemption

When a host leaves the FAULT state by passing the check script, the host becomes a BACKUP if the new host has lower priority than the host currently in the MASTER state. However, if it has a higher priority, the preemption strategy determines it’s role in the cluster.

The nopreempt strategy does not move MASTER from the lower priority host to the higher priority host. With preempt 300, the default, keepalived waits the specified 300 seconds and moves MASTER to the higher priority host.

To specify preemption:

When creating ipfailover using the preemption-strategy:

$ oc adm ipfailover --preempt-strategy=nopreempt \
  ...

Setting the variable using the oc set env command:

$ oc set env dc/ipf-ha-router \
    --overwrite=true \
    OPENSHIFT_HA_PREEMPTION=nopreempt

Using oc edit dc ipf-ha-router to edit the router deployment configuration:

...
    spec:
      containers:
      - env:
        - name: OPENSHIFT_HA_PREEMPTION  (1)
          value: nopreempt
...

Keepalived Multicast

OKD’s IP failover internally uses keepalived.

Ensure that multicast is enabled on the nodes labeled above and they can accept network traffic for 224.0.0.18 (the VRRP multicast IP address).

Before starting the keepalived daemon, the startup script verifies the iptables rule that allows multicast traffic to flow. If there is no such rule, the startup script creates a new rule and adds it to the IP tables configuration. Where this new rule gets added to the IP tables configuration depends on the --iptables-chain= option. If there is an --iptables-chain= option specified, the rule gets added to the specified chain in the option. Otherwise, the rule is added to the INPUT chain.

The iptables rule must be present whenever there is one or more keepalived daemon running on the node.

The iptables rule can be removed after the last keepalived daemon terminates. The rule is not automatically removed.

You can manually manage the iptables rule on each of the nodes. It only gets created when none is present (as long as ipfailover is not created with the --iptable-chain="" option).

You must ensure that the manually added rules persist after a system restart.

Be careful since every keepalived daemon uses the VRRP protocol over multicast 224.0.0.18 to negotiate with its peers. There must be a different VRRP-id (in the range 0..255) for each VIP.

$ for node in openshift-node-{5,6,7,8,9}; do   ssh $node <<EOF

export interface=${interface:-"eth0"}
echo "Check multicast enabled ... ";
ip addr show $interface | grep -i MULTICAST

echo "Check multicast groups ... "
ip maddr show $interface | grep 224.0.0

EOF
done;

Command Line Options and Environment Variables

Table 1. Command Line Options and Environment Variables
Option	Variable Name	Default	Notes
`--watch-port`	`OPENSHIFT_HA_MONITOR_PORT`	80	The ipfailover pod tries to open a TCP connection to this port on each VIP. If connection is established, the service is considered to be running. If this port is set to 0, the test always passes.
`--interface`	`OPENSHIFT_HA_NETWORK_INTERFACE`		The interface name for ipfailover to use, to send VRRP traffic. By default, `eth0` is used.
`--replicas`	`OPENSHIFT_HA_REPLICA_COUNT`	2	Number of replicas to create. This must match `spec.replicas` value in ipfailover deployment configuration.
`--virtual-ips`	`OPENSHIFT_HA_VIRTUAL_IPS`		The list of IP address ranges to replicate. This must be provided. (For example, 1.2.3.4-6,1.2.3.9.) See this discussion for more details.
`--vrrp-id-offset`	`OPENSHIFT_HA_VRRP_ID_OFFSET`	0	See VRRP ID Offset discussion for more details.
`--virtual-ip-groups`	`OPENSHIFT_HA_VIP_GROUPS`		The number of groups to create for VRRP. If not set, a group is created for each virtual IP range specified with the `--virtual-ips` option. See Configuring IP failover for more than 254 addresses for more information.
`--iptables-chain`	`OPENSHIFT_HA_IPTABLES_CHAIN`	INPUT	The name of the iptables chain, to automatically add an `iptables` rule to allow the VRRP traffic on. If the value is not set, an `iptables` rule will not be added. If the chain does not exist, it is not created.
`--check-script`	`OPENSHIFT_HA_CHECK_SCRIPT`		Full path name in the pod file system of a script that is periodically run to verify the application is operating. See this discussion for more details.
`--check-interval`	`OPENSHIFT_HA_CHECK_INTERVAL`	2	The period, in seconds, that the check script is run.
`--notify-script`	`OPENSHIFT_HA_NOTIFY_SCRIPT`		Full path name in the pod file system of a script that is run whenever the state changes. See this discussion for more details.
`--preemption-strategy`	`OPENSHIFT_HA_PREEMPTION`	preempt 300	Strategy for handling a new higher priority host. See the VRRP Preemption section for more details.

VRRP ID Offset

Each ipfailover pod managed by the ipfailover deployment configuration (1 pod per node/replica) runs a keepalived daemon. As more ipfailover deployment configurations are configured, more pods are created and more daemons join into the common VRRP negotiation. This negotiation is done by all the keepalived daemons and it determines which nodes will service which virtual IPs (VIPs).

Internally, keepalived assigns a unique vrrp-id to each VIP. The negotiation uses this set of vrrp-ids, when a decision is made, the VIP corresponding to the winning vrrp-id is serviced on the winning node.

Therefore, for every VIP defined in the ipfailover deployment configuration, the ipfailover pod must assign a corresponding vrrp-id. This is done by starting at --vrrp-id-offset and sequentially assigning the vrrp-ids to the list of VIPs. The vrrp-ids may have values in the range 1..255.

When there are multiple ipfailover deployment configuration care must be taken to specify --vrrp-id-offset so that there is room to increase the number of VIPS in the deployment configuration and none of the vrrp-id ranges overlap.

Configuring IP failover for more than 254 addresses

IP failover management is limited to 254 groups of VIP addresses. By default OKD assigns one IP address to each group. You can use the virtual-ip-groups option to change this so multiple IP addresses are in each group and define the number of VIP groups available for each VRRP instance when configuring IP failover.

Grouping VIPs creates a wider range of allocation of VIPs per VRRP in the case of VRRP failover events, and is useful when all hosts in the cluster have access to a service locally. For example, when a service is being exposed with an ExternalIP.

As a rule for failover, do not limit services, such as the router, to one specific host. Instead, services should be replicated to each host so that in the case of IP failover, the services do not have to be recreated on the new host.

If you are using OKD health checks, the nature of IP failover and groups means that all instances in the group will not be checked. For that reason, the Kubernetes health checks must be used to ensure that services are live.

# oc adm ipfailover <ipfailover_name> --create \
    ...
    --virtual-ip-groups=<number_of_ipfailover_groups>

For example, if --virtual-ip-groups is set to 3 in an environment with seven VIPs, it creates three groups, assigning three VIPs to the first group, and two VIPs to the two remaining groups.

If the number of groups set by the --virtual-ip-groups option is fewer than the number of IP addresses set to fail over, the group will contain more than one IP address, and all of the addresses will move as a single unit.

Configuring a Highly-available Service

The following example describes how to set up highly-available router and geo-cache network services with IP failover on a set of nodes.

Label the nodes that will be used for the services. This step can be optional if you run the services on all the nodes in your OKD cluster and will use VIPs that can float within all nodes in the cluster.

The following example defines a label for nodes that are servicing traffic in the US west geography ha-svc-nodes=geo-us-west:
$ oc label nodes openshift-node-{5,6,7,8,9} "ha-svc-nodes=geo-us-west"
Create the service account. You can use ipfailover or when using a router (depending on your environment policies), you can either reuse the router service account created previously or a new ipfailover service account.

The following example creates a new service account with the name ipfailover in the default namespace:
$ oc create serviceaccount ipfailover -n default

Add the ipfailover service account in the default namespace to the privileged SCC:

$ oc adm policy add-scc-to-user privileged system:serviceaccount:default:ipfailover

Start the router and the geo-cache services.

Since the ipfailover runs on all nodes from step 1, it is recommended to also run the router/service on all the step 1 nodes.

Start the router with the nodes matching the labels used in the first step. The following example runs five instances using the ipfailover service account:

$ oc adm router ha-router-us-west --replicas=5 \
    --selector="ha-svc-nodes=geo-us-west" \
    --labels="ha-svc-nodes=geo-us-west" \
    --service-account=ipfailover

Run the geo-cache service with a replica on each of the nodes. See an example configuration for running a geo-cache service.

Make sure that you replace the myimages/geo-cache container image referenced in the file with your intended image. Change the number of replicas to the number of nodes in the geo-cache label. Check that the label matches the one used in the first step.

$ oc create -n <namespace> -f ./examples/geo-cache.json

Configure ipfailover for the router and geo-cache services. Each has its own VIPs and both use the same nodes labeled with ha-svc-nodes=geo-us-west in the first step. Ensure that the number of replicas match the number of nodes listed in the label setup, in the first step.

The router, geo-cache, and ipfailover all create deployment configuration and all must have different names.
Specify the VIPs and the port number that ipfailover should monitor on the desired instances.

The ipfailover command for the router:
$ oc adm ipfailover ipf-ha-router-us-west \ --replicas=5 --watch-port=80 \ --selector="ha-svc-nodes=geo-us-west" \ --virtual-ips="10.245.2.101-105" \ --iptables-chain="INPUT" \ --service-account=ipfailover --create
The following is the oc adm ipfailover command for the geo-cache service that is listening on port 9736. Since there are two ipfailover deployment configurations, the --vrrp-id-offset must be set so that each VIP gets its own offset. In this case, setting a value of 10 means that the ipf-ha-router-us-west can have a maximum of 10 VIPs (0-9) since ipf-ha-geo-cache is starting at 10.
$ oc adm ipfailover ipf-ha-geo-cache \ --replicas=5 --watch-port=9736 \ --selector="ha-svc-nodes=geo-us-west" \ --virtual-ips=10.245.3.101-105 \ --vrrp-id-offset=10 \ --service-account=ipfailover --create
In the commands above, there are ipfailover, router, and geo-cache pods on each node. The set of VIPs for each ipfailover configuration must not overlap and they must not be used elsewhere in the external or cloud environments. The five VIP addresses in each example, 10.245.{2,3}.101-105 are served by the two ipfailover deployment configurations. IP failover dynamically selects which address is served on which node.

The administrator sets up external DNS to point to the VIP addresses knowing that all the router VIPs point to the same router, and all the geo-cache VIPs point to the same geo-cache service. As long as one node remains running, all the VIP addresses are served.

Deploy IP Failover Pod

Deploy the ipfailover router to monitor postgresql listening on node port 32439 and the external IP address, as defined in the postgresql-ingress service:

$ oc adm ipfailover ipf-ha-postgresql \
    --replicas=1 \ (1)
    --selector="app-type=postgresql" \ (2)
    --virtual-ips=10.9.54.100 \ (3)
    --watch-port=32439 \ (4)
    --service-account=ipfailover --create

1	Specifies the number of instances to deploy.
2	Restricts where the ipfailover is deployed.
3	Virtual IP address to monitor.
4	Port on which ipfailover will monitor on each node.

Dynamically Updating Virtual IPs for a Highly-available Service

The default deployment strategy for the IP failover service is to recreate the deployment. In order to dynamically update the VIPs for a highly available routing service with minimal or no downtime, you must:

Update the IP failover service deployment configuration to use a rolling update strategy, and
Update the OPENSHIFT_HA_VIRTUAL_IPS environment variable with the updated list or sets of virtual IP addresses.

The following example shows how to dynamically update the deployment strategy and the virtual IP addresses:

Consider an IP failover configuration that was created using the following:

$ oc adm ipfailover ipf-ha-router-us-west \
    --replicas=5 --watch-port=80 \
    --selector="ha-svc-nodes=geo-us-west" \
    --virtual-ips="10.245.2.101-105" \
    --service-account=ipfailover --create

Edit the deployment configuration:
$ oc edit dc/ipf-ha-router-us-west

Update the spec.strategy.type field from Recreate to Rolling:

spec:
  replicas: 5
  selector:
    ha-svc-nodes: geo-us-west
  strategy:
    resources: {}
    rollingParams:
      maxSurge: 0
    type: Rolling (1)

1	Set to `Rolling`.

Update the OPENSHIFT_HA_VIRTUAL_IPS environment variable to contain the additional virtual IP addresses:
- name: OPENSHIFT_HA_VIRTUAL_IPS value: 10.245.2.101-105,10.245.2.110,10.245.2.201-205 (1)
1 10.245.2.110,10.245.2.201-205 have been added to the list.
Update the external DNS to match the set of VIPs.

Configuring Service ExternalIP and NodePort

The user can assign VIPs as ExternalIPs in a service. Keepalived makes sure that each VIP is served on some node in the ipfailover configuration. When a request arrives on the node, the service that is running on all nodes in the cluster, load balances the request among the service’s endpoints.

The NodePorts can be set to the ipfailover watch port so that keepalived can check the application is running. The NodePort is exposed on all nodes in the cluster, therefore it is available to keepalived on all ipfailover nodes.

High Availability For IngressIP

In non-cloud clusters, ipfailover and ingressIP to a service can be combined. The result is high availability services for users that create services using ingressIP.

The approach is to specify an ingressIPNetworkCIDR range and then use the same range in creating the ipfailover configuration.

Since, ipfailover can support up to a maximum of 255 VIPs for the entire cluster, the ingressIPNetworkCIDR needs to be /24 or less.