
Follow these steps to perform the y-stream cluster update and monitor the update through to completion. Completing a y-stream update is more straightforward than a Control Plane Only update.

Acknowledging the Control Plane Only or y-stream update

When you update through all versions from 4.11 and later, you must manually acknowledge that the update can continue.

Before you acknowledge the update, verify that there are no Kubernetes APIs in use that are removed in the version you are updating to. For example, in OKD 4.17, there are no API removals. See "Kubernetes API removals" for more information.
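
One way to check is to query the APIRequestCount resources for any API that is flagged for removal in an upcoming release, for example:

    $ oc get apirequestcounts -o jsonpath='{range .items[?(@.status.removedInRelease!="")]}{.status.removedInRelease}{"\t"}{.metadata.name}{"\n"}{end}'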

Procedure
  1. Run the following command:

    $ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-<update_version_from>-kube-<kube_api_version>-api-removals-in-<update_version_to>":"true"}}' --type=merge

    where:

    <update_version_from>
        Is the cluster version that you are moving from, for example, 4.14.

    <kube_api_version>
        Is the Kubernetes API version, for example, 1.28.

    <update_version_to>
        Is the cluster version that you are moving to, for example, 4.15.
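
    For example, when updating from 4.14 to 4.15, the fully substituted command looks like this:

    $ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.14-kube-1.28-api-removals-in-4.15":"true"}}' --type=merge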

Verification
  • Verify the update. Run the following command:

    $ oc get configmap admin-acks -n openshift-config -o json | jq .data
    Example output
    {
      "ack-4.14-kube-1.28-api-removals-in-4.15": "true",
      "ack-4.15-kube-1.29-api-removals-in-4.16": "true"
    }

    In this example, the cluster is updated from version 4.14 to 4.15, and then from 4.15 to 4.16 in a Control Plane Only update.

Starting the cluster update

When updating from one y-stream release to the next, you must ensure that the intermediate z-stream releases are also compatible.

You can verify that you are updating to a viable release by running the oc adm upgrade command, which lists the compatible update releases.
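
For example, running the command without any arguments shows the current cluster version and the list of recommended update releases for your channel:

    $ oc adm upgrade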

Procedure
  1. Start the update:

    $ oc adm upgrade --to=4.15.33
    • Control Plane Only update: Make sure you point to the interim <y+1> release path.

    • y-stream update: Make sure you use the correct <y.z> release that follows the Kubernetes version skew policy.

    • z-stream update: Verify that there are no problems moving to that specific release.

    Example output
    Requested update to 4.15.33 (1)
    
    1 The Requested update value changes depending on your particular update.

Monitoring the cluster update

Check the cluster health frequently during the update. Monitor the node status, the cluster Operator status, and any failed pods.

Procedure
  • Monitor the cluster update. For example, to monitor the cluster update from version 4.14 to 4.15, run the following command:

    $ watch "oc get clusterversion; echo; oc get co | head -1; oc get co | grep 4.14; oc get co | grep 4.15; echo; oc get no; echo; oc get po -A | grep -E -iv 'running|complete'"
    Example output
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.14.34   True        True          4m6s    Working towards 4.15.33: 111 of 873 done (12% complete), waiting on kube-apiserver
    
    NAME                           VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                 4.14.34   True        False         False      4d22h
    baremetal                      4.14.34   True        False         False      4d23h
    cloud-controller-manager       4.14.34   True        False         False      4d23h
    cloud-credential               4.14.34   True        False         False      4d23h
    cluster-autoscaler             4.14.34   True        False         False      4d23h
    console                        4.14.34   True        False         False      4d22h
    
    ...
    
    storage                        4.14.34   True        False         False      4d23h
    config-operator                4.15.33   True        False         False      4d23h
    etcd                           4.15.33   True        False         False      4d23h
    
    NAME           STATUS   ROLES                  AGE     VERSION
    ctrl-plane-0   Ready    control-plane,master   4d23h   v1.27.15+6147456
    ctrl-plane-1   Ready    control-plane,master   4d23h   v1.27.15+6147456
    ctrl-plane-2   Ready    control-plane,master   4d23h   v1.27.15+6147456
    worker-0       Ready    mcp-1,worker           4d23h   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           4d23h   v1.27.15+6147456
    
    NAMESPACE               NAME                       READY   STATUS              RESTARTS   AGE
    openshift-marketplace   redhat-marketplace-rf86t   0/1     ContainerCreating   0          0s
Verification

During the update, the watch command cycles through one or several of the cluster Operators at a time, providing a status of each Operator update in the MESSAGE column.

When the cluster Operator update process is complete, each control plane node is rebooted, one at a time.

During this part of the update, messages are reported that state that cluster Operators are being updated again or are in a degraded state. This is because the control plane nodes are temporarily offline while they reboot.

As soon as the last control plane node reboot is complete, the cluster version is displayed as updated.

When the control plane update is complete, a message such as the following is displayed. This example shows an update completed to the intermediate y-stream release.

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.33   True        False         28m     Cluster version is 4.15.33

NAME                         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication               4.15.33   True        False         False      5d
baremetal                    4.15.33   True        False         False      5d
cloud-controller-manager     4.15.33   True        False         False      5d1h
cloud-credential             4.15.33   True        False         False      5d1h
cluster-autoscaler           4.15.33   True        False         False      5d
config-operator              4.15.33   True        False         False      5d
console                      4.15.33   True        False         False      5d

...

service-ca                   4.15.33   True        False         False      5d
storage                      4.15.33   True        False         False      5d

NAME           STATUS   ROLES                  AGE   VERSION
ctrl-plane-0   Ready    control-plane,master   5d    v1.28.13+2ca1a23
ctrl-plane-1   Ready    control-plane,master   5d    v1.28.13+2ca1a23
ctrl-plane-2   Ready    control-plane,master   5d    v1.28.13+2ca1a23
worker-0       Ready    mcp-1,worker           5d    v1.28.13+2ca1a23
worker-1       Ready    mcp-2,worker           5d    v1.28.13+2ca1a23

Updating the OLM Operators

In telco environments, software needs to be vetted before it is loaded onto a production cluster. Production clusters are also configured in a disconnected network, which means that they are not always directly connected to the internet. Because the clusters are in a disconnected network, the OpenShift Operators are configured for manual update during installation so that new versions can be managed on a cluster-by-cluster basis. Perform the following procedure to move the Operators to the newer versions.
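
You can confirm that the Operator subscriptions are configured for manual approval by inspecting the installPlanApproval field, for example:

    $ oc get subscriptions.operators.coreos.com -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,APPROVAL:.spec.installPlanApproval'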

Procedure
  1. Check to see which Operators need to be updated:

    $ oc get installplan -A | grep -E 'APPROVED|false'
    Example output
    NAMESPACE           NAME            CSV                                               APPROVAL   APPROVED
    metallb-system      install-nwjnh   metallb-operator.v4.16.0-202409202304             Manual     false
    openshift-nmstate   install-5r7wr   kubernetes-nmstate-operator.4.16.0-202409251605   Manual     false
  2. Patch the InstallPlan resources for those Operators:

    $ oc patch installplan -n metallb-system install-nwjnh --type merge --patch \
    '{"spec":{"approved":true}}'
    Example output
    installplan.operators.coreos.com/install-nwjnh patched
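
    If several Operators are pending, you can approve every unapproved InstallPlan in one pass. The following loop is a sketch that assumes all pending plans in the cluster should be approved:

    $ oc get installplan -A -o json | \
        jq -r '.items[] | select(.spec.approved == false) | [.metadata.namespace, .metadata.name] | @tsv' | \
        while read -r namespace name; do
          oc patch installplan -n "$namespace" "$name" --type merge --patch '{"spec":{"approved":true}}'
        done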
  3. Monitor the namespace by running the following command:

    $ oc get all -n metallb-system
    Example output
    NAME                                                       READY   STATUS              RESTARTS   AGE
    pod/metallb-operator-controller-manager-69b5f884c-8bp22    0/1     ContainerCreating   0          4s
    pod/metallb-operator-controller-manager-77895bdb46-bqjdx   1/1     Running             0          4m1s
    pod/metallb-operator-webhook-server-5d9b968896-vnbhk       0/1     ContainerCreating   0          4s
    pod/metallb-operator-webhook-server-d76f9c6c8-57r4w        1/1     Running             0          4m1s
    
    ...
    
    NAME                                                             DESIRED   CURRENT   READY   AGE
    replicaset.apps/metallb-operator-controller-manager-69b5f884c    1         1         0       4s
    replicaset.apps/metallb-operator-controller-manager-77895bdb46   1         1         1       4m1s
    replicaset.apps/metallb-operator-controller-manager-99b76f88     0         0         0       4m40s
    replicaset.apps/metallb-operator-webhook-server-5d9b968896       1         1         0       4s
    replicaset.apps/metallb-operator-webhook-server-6f7dbfdb88       0         0         0       4m40s
    replicaset.apps/metallb-operator-webhook-server-d76f9c6c8        1         1         1       4m1s

    When the update is complete, the required pods should be in a Running state, and the required ReplicaSet resources should be ready:

    NAME                                                      READY   STATUS    RESTARTS   AGE
    pod/metallb-operator-controller-manager-69b5f884c-8bp22   1/1     Running   0          25s
    pod/metallb-operator-webhook-server-5d9b968896-vnbhk      1/1     Running   0          25s
    
    ...
    
    NAME                                                             DESIRED   CURRENT   READY   AGE
    replicaset.apps/metallb-operator-controller-manager-69b5f884c    1         1         1       25s
    replicaset.apps/metallb-operator-controller-manager-77895bdb46   0         0         0       4m22s
    replicaset.apps/metallb-operator-webhook-server-5d9b968896       1         1         1       25s
    replicaset.apps/metallb-operator-webhook-server-d76f9c6c8        0         0         0       4m22s
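
    Rather than polling manually, you can also wait for the Operator deployments to report available, adjusting the timeout to suit your environment, for example:

    $ oc wait deployment --all -n metallb-system --for=condition=Available --timeout=300s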
Verification
  • Verify that the Operators do not need to be updated for a second time:

    $ oc get installplan -A | grep -E 'APPROVED|false'

    Apart from the header line, there should be no output returned.

    Sometimes you have to approve an update twice because some Operators have interim z-stream release versions that need to be installed before the final version.

Updating the worker nodes

After you have updated the control plane, you upgrade the worker nodes by unpausing the relevant mcp groups that you created. Unpausing an mcp group starts the upgrade process for the worker nodes in that group. Each worker node in the cluster reboots to upgrade to the new EUS, y-stream, or z-stream version, as required.

In the case of Control Plane Only upgrades, note that when a worker node is updated, it requires only one reboot and jumps <y+2> release versions. This feature was added to decrease the amount of time that it takes to upgrade large bare-metal clusters.

This is a potential holding point. You can have a cluster version that is fully supported to run in production, with the control plane updated to the new EUS release while the worker nodes remain at the <y-2> release. This allows large clusters to upgrade in steps across several maintenance windows.

Procedure
  1. You can check how many nodes are managed in an mcp group. Run the following command to get the list of mcp groups:

    $ oc get mcp
    Example output
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-c9a52144456dbff9c9af9c5a37d1b614   True      False      False      3              3                   3                     0                      36d
    mcp-1    rendered-mcp-1-07fe50b9ad51fae43ed212e84e1dcc8e    False     False      False      1              0                   0                     0                      47h
    mcp-2    rendered-mcp-2-07fe50b9ad51fae43ed212e84e1dcc8e    False     False      False      1              0                   0                     0                      47h
    worker   rendered-worker-f1ab7b9a768e1b0ac9290a18817f60f0   True      False      False      0              0                   0                     0                      36d

    You decide how many mcp groups to upgrade at a time. This depends on how many CNF pods can be taken down at a time and how your pod disruption budget and anti-affinity settings are configured.
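
    For example, you can review the configured pod disruption budgets across all namespaces before choosing which groups to unpause together:

    $ oc get pdb -A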

  2. Get the list of nodes in the cluster:

    $ oc get nodes
    Example output
    NAME           STATUS   ROLES                  AGE    VERSION
    ctrl-plane-0   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-1   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-2   Ready    control-plane,master   5d8h   v1.29.8+f10c92d
    worker-0       Ready    mcp-1,worker           5d8h   v1.27.15+6147456
    worker-1       Ready    mcp-2,worker           5d8h   v1.27.15+6147456
  3. Confirm the MachineConfigPool groups that are paused:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
    Example output
    MCP     Paused
    ---     ------
    master  false
    mcp-1   true
    mcp-2   true

    Each MachineConfigPool can be unpaused independently. Therefore, if a maintenance window runs out of time, other MCPs do not need to be unpaused immediately. The cluster is supported to run with some worker nodes still at the <y-2> release version.
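
    If a maintenance window runs out, you can pause a group again to stop further nodes in that pool from updating, for example:

    $ oc patch mcp/mcp-2 --type merge --patch '{"spec":{"paused":true}}'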

  4. Unpause the required mcp group to begin the upgrade:

    $ oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":false}}'
    Example output
    machineconfigpool.machineconfiguration.openshift.io/mcp-1 patched
  5. Confirm that the required mcp group is unpaused:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
    Example output
    MCP     Paused
    ---     ------
    master  false
    mcp-1   false
    mcp-2   true
  6. As each mcp group is upgraded, continue to unpause and upgrade the remaining groups. Monitor the progress of each node by running the following command:

    $ oc get nodes
    Example output
    NAME           STATUS                        ROLES                  AGE    VERSION
    ctrl-plane-0   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-1   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    ctrl-plane-2   Ready                         control-plane,master   5d8h   v1.29.8+f10c92d
    worker-0       Ready                         mcp-1,worker           5d8h   v1.29.8+f10c92d
    worker-1       NotReady,SchedulingDisabled   mcp-2,worker           5d8h   v1.27.15+6147456
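
    You can also follow the rollout at the pool level by watching the MachineConfigPool status, for example:

    $ oc get mcp mcp-1 --watch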

Verifying the health of the newly updated cluster

Run the following commands after updating the cluster to verify that the cluster is back up and running.

Procedure
  1. Check the cluster version by running the following command:

    $ oc get clusterversion
    Example output
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.16.14   True        False         4h38m   Cluster version is 4.16.14

    This should return the new cluster version and the PROGRESSING column should return False.

  2. Check that all nodes are ready:

    $ oc get nodes
    Example output
    NAME           STATUS   ROLES                  AGE    VERSION
    ctrl-plane-0   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    ctrl-plane-1   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    ctrl-plane-2   Ready    control-plane,master   5d9h   v1.29.8+f10c92d
    worker-0       Ready    mcp-1,worker           5d9h   v1.29.8+f10c92d
    worker-1       Ready    mcp-2,worker           5d9h   v1.29.8+f10c92d

    All nodes in the cluster should be in a Ready status and running the same version.
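
    To confirm at a glance that every node reports the same kubelet version, you can list the unique versions, for example:

    $ oc get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' | sort -u

    A single line of output confirms that all nodes run the same version.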

  3. Check that there are no paused mcp resources in the cluster:

    $ oc get mcp -o json | jq -r '["MCP","Paused"], ["---","------"], (.items[] | [(.metadata.name), (.spec.paused)]) | @tsv' | grep -v worker
    Example output
    MCP     Paused
    ---     ------
    master  false
    mcp-1   false
    mcp-2   false
  4. Check that all cluster Operators are available:

    $ oc get co
    Example output
    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.16.14   True        False         False      5d9h
    baremetal                                  4.16.14   True        False         False      5d9h
    cloud-controller-manager                   4.16.14   True        False         False      5d10h
    cloud-credential                           4.16.14   True        False         False      5d10h
    cluster-autoscaler                         4.16.14   True        False         False      5d9h
    config-operator                            4.16.14   True        False         False      5d9h
    console                                    4.16.14   True        False         False      5d9h
    control-plane-machine-set                  4.16.14   True        False         False      5d9h
    csi-snapshot-controller                    4.16.14   True        False         False      5d9h
    dns                                        4.16.14   True        False         False      5d9h
    etcd                                       4.16.14   True        False         False      5d9h
    image-registry                             4.16.14   True        False         False      85m
    ingress                                    4.16.14   True        False         False      5d9h
    insights                                   4.16.14   True        False         False      5d9h
    kube-apiserver                             4.16.14   True        False         False      5d9h
    kube-controller-manager                    4.16.14   True        False         False      5d9h
    kube-scheduler                             4.16.14   True        False         False      5d9h
    kube-storage-version-migrator              4.16.14   True        False         False      4h48m
    machine-api                                4.16.14   True        False         False      5d9h
    machine-approver                           4.16.14   True        False         False      5d9h
    machine-config                             4.16.14   True        False         False      5d9h
    marketplace                                4.16.14   True        False         False      5d9h
    monitoring                                 4.16.14   True        False         False      5d9h
    network                                    4.16.14   True        False         False      5d9h
    node-tuning                                4.16.14   True        False         False      5d7h
    openshift-apiserver                        4.16.14   True        False         False      5d9h
    openshift-controller-manager               4.16.14   True        False         False      5d9h
    openshift-samples                          4.16.14   True        False         False      5h24m
    operator-lifecycle-manager                 4.16.14   True        False         False      5d9h
    operator-lifecycle-manager-catalog         4.16.14   True        False         False      5d9h
    operator-lifecycle-manager-packageserver   4.16.14   True        False         False      5d9h
    service-ca                                 4.16.14   True        False         False      5d9h
    storage                                    4.16.14   True        False         False      5d9h

    All cluster Operators should report True in the AVAILABLE column.

  5. Check that all pods are healthy:

    $ oc get po -A | grep -E -iv 'complete|running'

    This should not return any pods.

    You might see a few pods still transitioning after the update. Continue to watch them for a while to make sure that all pods are cleared.
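
    For example, you can reuse the earlier watch pattern and wait until the command returns only the header line:

    $ watch "oc get po -A | grep -E -iv 'running|complete'"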