Preparing for the cluster update - Day 2 operations for telco core CNF clusters | Edge computing

Ensuring the host firmware is compatible with the update
Ensuring that layered products are compatible with the update
Applying MachineConfigPool labels to nodes before the update
Telco deployment environment considerations
Preparing the cluster platform for update

Typically, telco clusters run on bare-metal hardware. Often you must update the firmware to take on important security fixes, take on new functionality, or maintain compatibility with the new release of OKD.

Ensuring the host firmware is compatible with the update

You are responsible for the firmware versions that you run in your clusters. Updating host firmware is not a part of the OKD update process. It is not recommended to update firmware in conjunction with the OKD version.

Hardware vendors advise that it is best to apply the latest certified firmware version for the specific hardware that you are running. For telco use cases, always verify firmware updates in test environments before applying them in production. The high throughput nature of telco CNF workloads can be adversely affected by sub-optimal host firmware.

You should thoroughly test new firmware updates to ensure that they work as expected with the current version of OKD. Ideally, you test the latest firmware version with the target OKD update version.

Ensuring that layered products are compatible with the update

Verify that all layered products run on the version of OKD that you are updating to before you begin the update. This generally includes all Operators.

Procedure

Verify the currently installed Operators in the cluster. For example, run the following command:

$ oc get csv -A

Example output

NAMESPACE                              NAME            DISPLAY          VERSION   REPLACES                             PHASE
gitlab-operator-kubernetes.v0.17.2     GitLab                           0.17.2    gitlab-operator-kubernetes.v0.17.1   Succeeded
openshift-operator-lifecycle-manager   packageserver   Package Server   0.19.0                                         Succeeded

Check that Operators that you install with OLM are compatible with the update version.

Operators that are installed with the Operator Lifecycle Manager (OLM) are not part of the standard cluster Operators set.

Use the Operator Update Information Checker to understand if you must update an Operator after each y-stream update or if you can wait until you have fully updated to the next EUS release.

You can also use the Operator Update Information Checker to see what versions of OKD are compatible with specific releases of an Operator.
Check that Operators that you install outside of OLM are compatible with the update version.
For all OLM-installed Operators that are not directly supported by Red Hat, contact the Operator vendor to ensure release compatibility.
- Some Operators are compatible with several releases of OKD. You might not must update the Operators until after you complete the cluster update. See "Updating the worker nodes" for more information.
- See "Updating all the OLM Operators" for information about updating an Operator after performing the first y-stream control plane update.

Additional resources

Applying MachineConfigPool labels to nodes before the update

Prepare MachineConfigPool (mcp) node labels to group nodes together in groups of roughly 8 to 10 nodes. With mcp groups, you can reboot groups of nodes independently from the rest of the cluster.

You use the mcp node labels to pause and unpause the set of nodes during the update process so that you can do the update and reboot at a time of your choosing.

Staggering the cluster update

Sometimes there are problems during the update. Often the problem is related to hardware failure or nodes needing to be reset. Using mcp node labels, you can update nodes in stages by pausing the update at critical moments, tracking paused and unpaused nodes as you proceed. When a problem occurs, you use the nodes that are in an unpaused state to ensure that there are enough nodes running to keep all applications pods running.

Dividing worker nodes into MachineConfigPool groups

How you divide worker nodes into mcp groups can vary depending on how many nodes are in the cluster or how many nodes you assign to a node role. By default the 2 roles in a cluster are control plane and worker.

In clusters that run telco workloads, you can further split the worker nodes between CNF control plane and CNF data plane roles. Add mcp role labels that split the worker nodes into each of these two groups.

Larger clusters can have as many as 100 worker nodes in the CNF control plane role. No matter how many nodes there are in the cluster, keep each MachineConfigPool group to around 10 nodes. This allows you to control how many nodes are taken down at a time. With multiple MachineConfigPool groups, you can unpause several groups at a time to accelerate the update, or separate the update over 2 or more maintenance windows.

Example cluster with 15 worker nodes

Consider a cluster with 15 worker nodes:

10 worker nodes are CNF control plane nodes.
5 worker nodes are CNF data plane nodes.

Split the CNF control plane and data plane worker node roles into at least 2 mcp groups each. Having 2 mcp groups per role means that you can have one set of nodes that are not affected by the update.

Example cluster with 6 worker nodes

Consider a cluster with 6 worker nodes:

Split the worker nodes into 3 mcp groups of 2 nodes each.

Upgrade one of the mcp groups. Allow the updated nodes to sit through a day to allow for verification of CNF compatibility before completing the update on the other 4 nodes.

The process and pace at which you unpause the mcp groups is determined by your CNF applications and configuration.

If your CNF pod can handle being scheduled across nodes in a cluster, you can unpause several mcp groups at a time and set the MaxUnavailable in the mcp custom resource (CR) to as high as 50%. This allows up to half of the nodes in an mcp group to restart and get updated.

Reviewing configured cluster MachineConfigPool roles

Review the currently configured MachineConfigPool roles in the cluster.

Procedure

Get the currently configured mcp groups in the cluster:

$ oc get mcp

Example output

NAME     CONFIG                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-bere83   True      False      False      3              3                   3                     0                      25d
worker   rendered-worker-245c4f   True      False      False      2              2                   2                     0                      25d

Compare the list of mcp roles to list of nodes in the cluster:

$ oc get nodes

Example output

NAME           STATUS   ROLES                  AGE   VERSION
ctrl-plane-0   Ready    control-plane,master   39d   v1.27.15+6147456
ctrl-plane-1   Ready    control-plane,master   39d   v1.27.15+6147456
ctrl-plane-2   Ready    control-plane,master   39d   v1.27.15+6147456
worker-0       Ready    worker                 39d   v1.27.15+6147456
worker-1       Ready    worker                 39d   v1.27.15+6147456

When you apply an mcp group change, the node roles are updated.

Determine how you want to separate the worker nodes into mcp groups.

Creating MachineConfigPool groups for the cluster

Creating mcp groups is a 2-step process:

Add an mcp label to the nodes in the cluster
Apply an mcp CR to the cluster that organizes the nodes based on their labels

Procedure

Label the nodes so that they can be put into mcp groups. Run the following commands:

$ oc label node worker-0 node-role.kubernetes.io/mcp-1=

$ oc label node worker-1 node-role.kubernetes.io/mcp-2=

The mcp-1 and mcp-2 labels are applied to the nodes. For example:

Example output

NAME           STATUS   ROLES                  AGE   VERSION
ctrl-plane-0   Ready    control-plane,master   39d   v1.27.15+6147456
ctrl-plane-1   Ready    control-plane,master   39d   v1.27.15+6147456
ctrl-plane-2   Ready    control-plane,master   39d   v1.27.15+6147456
worker-0       Ready    mcp-1,worker           39d   v1.27.15+6147456
worker-1       Ready    mcp-2,worker           39d   v1.27.15+6147456

Create YAML custom resources (CRs) that apply the labels as mcp CRs in the cluster. Save the following YAML in the mcps.yaml file:

---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: mcp-2
spec:
  machineConfigSelector:
    matchExpressions:
      - {
         key: machineconfiguration.openshift.io/role,
         operator: In,
         values: [worker,mcp-2]
        }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/mcp-2: ""
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: mcp-1
spec:
  machineConfigSelector:
    matchExpressions:
      - {
         key: machineconfiguration.openshift.io/role,
         operator: In,
         values: [worker,mcp-1]
        }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/mcp-1: ""

Create the MachineConfigPool resources:

$ oc apply -f mcps.yaml

Example output

machineconfigpool.machineconfiguration.openshift.io/mcp-2 created

Verification

Monitor the MachineConfigPool resources as they are applied in the cluster. After you apply the mcp resources, the nodes are added into the new machine config pools. This takes a few minutes.

The nodes do not reboot while being added into the mcp groups. The original worker and master mcp groups remain unchanged.

Check the status of the new mcp resources:

$ oc get mcp

Example output

NAME     CONFIG                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-be3e83   True      False      False      3              3                 3                     0                      25d
mcp-1    rendered-mcp-1-2f4c4f    False     True       True       1              0                 0                     0                      10s
mcp-2    rendered-mcp-2-2r4s1f    False     True       True       1              0                 0                     0                      10s
worker   rendered-worker-23fc4f   False     True       True       0              0                 0                     2                      25d

Eventually, the resources are fully applied:

NAME     CONFIG                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-be3e83   True      False      False      3              3                 3                     0                      25d
mcp-1    rendered-mcp-1-2f4c4f    True      False      False      1              1                 1                     0                      7m33s
mcp-2    rendered-mcp-2-2r4s1f    True      False      False      1              1                 1                     0                      51s
worker   rendered-worker-23fc4f   True      False      False      0              0                 0                     0                      25d

Additional resources

Telco deployment environment considerations

In telco environments, most clusters are in disconnected networks. To update clusters in these environments, you must update your offline image repository.

Additional resources

Preparing the cluster platform for update

Before you update the cluster, perform some basic checks and verifications to make sure that the cluster is ready for the update.

Procedure

Verify that there are no failed or in progress pods in the cluster by running the following command:
```
$ oc get pods -A | grep -E -vi 'complete|running'
```
You might have to run this command more than once if there are pods that are in a pending state.

Verify that all nodes in the cluster are available:

$ oc get nodes

Example output

NAME           STATUS   ROLES                  AGE   VERSION
ctrl-plane-0   Ready    control-plane,master   32d   v1.27.15+6147456
ctrl-plane-1   Ready    control-plane,master   32d   v1.27.15+6147456
ctrl-plane-2   Ready    control-plane,master   32d   v1.27.15+6147456
worker-0       Ready    mcp-1,worker           32d   v1.27.15+6147456
worker-1       Ready    mcp-2,worker           32d   v1.27.15+6147456

Verify that all bare-metal nodes are provisioned and ready.

$ oc get bmh -n openshift-machine-api

Example output

NAME           STATE         CONSUMER                   ONLINE   ERROR   AGE
ctrl-plane-0   unmanaged     cnf-58879-master-0         true             33d
ctrl-plane-1   unmanaged     cnf-58879-master-1         true             33d
ctrl-plane-2   unmanaged     cnf-58879-master-2         true             33d
worker-0       unmanaged     cnf-58879-worker-0-45879   true             33d
worker-1       progressing   cnf-58879-worker-0-dszsh   false            1d (1)

1	An error occurred while provisioning the `worker-1` node.

Verification

Verify that all cluster Operators are ready:

$ oc get co

Example output

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.34   True        False         False      17h
baremetal                                  4.14.34   True        False         False      32d

...

service-ca                                 4.14.34   True        False         False      32d
storage                                    4.14.34   True        False         False      32d

Additional resources

Investigating pod issues

Preparing the telco core cluster platform for update

Ensuring the host firmware is compatible with the update

Ensuring that layered products are compatible with the update

Applying MachineConfigPool labels to nodes before the update

Staggering the cluster update

Dividing worker nodes into MachineConfigPool groups

Reviewing configured cluster MachineConfigPool roles

Creating MachineConfigPool groups for the cluster

Telco deployment environment considerations

Preparing the cluster platform for update