×

Prerequisites

  • Have access to the cluster as a user with admin privileges. See Using RBAC to define and apply permissions.

  • Have a recent etcd backup in case your upgrade fails and you must restore your cluster to a previous state.

  • Ensure all Operators previously installed through Operator Lifecycle Manager (OLM) are updated to their latest version in their latest channel. Updating the Operators ensures they have a valid upgrade path when the default OperatorHub catalogs switch from the current minor version to the next during a cluster upgrade. See Upgrading installed Operators for more information.

  • Ensure that all machine config pools (MCPs) are running and not paused. Nodes associated with a paused MCP are skipped during the update process. You can pause the MCPs if you are performing a canary rollout update strategy.

  • If your cluster uses manually maintained credentials, ensure that the Cloud Credential Operator (CCO) is in an upgradeable state. For more information, see Upgrading clusters with manually maintained credentials for AWS, Azure, or GCP.

  • If your cluster uses manually maintained credentials with the AWS Secure Token Service (STS), obtain a copy of the ccoctl utility from the release image being upgraded to and use it to process any updated credentials. For more information, see Upgrading an OpenShift Container Platform cluster configured for manual mode with STS.

  • Ensure that you address all Upgradeable=False conditions so the cluster allows an upgrade to the next minor version. You can run the oc adm upgrade command for an output of all Upgradeable=False conditions and the condition reasoning to help you prepare for a minor version upgrade.

Using the unsupportedConfigOverrides section to modify the configuration of an Operator is unsupported and might block cluster upgrades. You must remove this setting before you can upgrade your cluster.

If you are running cluster monitoring with an attached PVC for Prometheus, you might experience OOM kills during cluster upgrade. When persistent storage is in use for Prometheus, Prometheus memory usage doubles during cluster upgrade and for several hours after upgrade is complete. To avoid the OOM kill issue, allow worker nodes with double the size of memory that was available prior to the upgrade. For example, if you are running monitoring on the minimum recommended nodes, which is 2 cores with 8 GB of RAM, increase memory to 16 GB. For more information, see BZ#1925061.

About the OpenShift Update Service

The OpenShift Update Service (OSUS) provides over-the-air updates to OKD, including Fedora CoreOS (FCOS). It provides a graph, or diagram, that contains the vertices of component Operators and the edges that connect them. The edges in the graph show which versions you can safely update to. The vertices are update payloads that specify the intended state of the managed cluster components.

The Cluster Version Operator (CVO) in your cluster checks with the OpenShift Update Service to see the valid updates and update paths based on current component versions and information in the graph. When you request an update, the CVO uses the release image for that update to upgrade your cluster. The release artifacts are hosted in Quay as container images.

To allow the OpenShift Update Service to provide only compatible updates, a release verification pipeline drives automation. Each release artifact is verified for compatibility with supported cloud platforms and system architectures, as well as other component packages. After the pipeline confirms the suitability of a release, the OpenShift Update Service notifies you that it is available.

The OpenShift Update Service displays all recommended updates for your current cluster. If an upgrade path is not recommended by the OpenShift Update Service, it might be because of a known issue with the update or the target release.

Two controllers run during continuous update mode. The first controller continuously updates the payload manifests, applies the manifests to the cluster, and outputs the controlled rollout status of the Operators to indicate whether they are available, upgrading, or failed. The second controller polls the OpenShift Update Service to determine if updates are available.

Only upgrading to a newer version is supported. Reverting or rolling back your cluster to a previous version is not supported. If your upgrade fails, contact Red Hat support.

During the upgrade process, the Machine Config Operator (MCO) applies the new configuration to your cluster machines. The MCO cordons the number of nodes as specified by the maxUnavailable field on the machine configuration pool and marks them as unavailable. By default, this value is set to 1. The MCO then applies the new configuration and reboots the machine.

If you use Fedora machines as workers, the MCO does not update the kubelet because you must update the OpenShift API on the machines first.

With the specification for the new version applied to the old kubelet, the Fedora machine cannot return to the Ready state. You cannot complete the update until the machines are available. However, the maximum number of unavailable nodes is set to ensure that normal cluster operations can continue with that number of machines out of service.

The OpenShift Update Service is composed of an Operator and one or more application instances.

OKD upgrade channels and releases

In OKD 4.1, Red Hat introduced the concept of channels for recommending the appropriate release versions for cluster upgrades. By controlling the pace of upgrades, these upgrade channels allow you to choose an upgrade strategy. Upgrade channels are tied to a minor version of OKD. For instance, OKD 4.9 upgrade channels recommend upgrades to 4.9 and upgrades within 4.9. They also recommend upgrades within 4.8 and from 4.8 to 4.9, to allow clusters on 4.8 to eventually upgrade to 4.9. They do not recommend upgrades to 4.10 or later releases. This strategy ensures that administrators explicitly decide to upgrade to the next minor version of OKD.

Upgrade channels control only release selection and do not impact the version of the cluster that you install; the openshift-install binary file for a specific version of OKD always installs that version.

OKD 4 offers the following upgrade channel:

  • stable-4

stable-4 channel

Releases are added to the stable-4 channel after passing all tests.

You can use the stable-4 channel to upgrade from a previous minor version of OKD.

Upgrade version paths

OKD maintains an upgrade recommendation service that understands the version of OKD you have installed as well as the path to take within the channel you choose to get you to the next release.

You can imagine seeing the following in the stable-4 channel:

  • 4.0

  • 4.1

  • 4.3

  • 4.4

The service recommends only upgrades that have been tested and have no serious issues. It will not suggest updating to a version of OKD that contains known vulnerabilities. For example, if your cluster is on 4.1 and OKD suggests 4.4, then it is safe for you to update from 4.1 to 4.4. Do not rely on consecutive patch numbers. In this example, 4.2 is not and never was available in the channel.

The presence of an update recommendation in the stable-4 channel at any point is a declaration that the update is supported. While releases will never be removed from the channel, update recommendations that exhibit serious issues will be removed from the channel. Updates initiated after the update recommendation has been removed are still supported.

Restricted network clusters

If you manage the container images for your OKD clusters yourself, you must consult the Red Hat errata that is associated with product releases and note any comments that impact upgrades. During upgrade, the user interface might warn you about switching between these versions, so you must ensure that you selected an appropriate version before you bypass those warnings.

Pausing a MachineHealthCheck resource

During the upgrade process, nodes in the cluster might become temporarily unavailable. In the case of worker nodes, the machine health check might identify such nodes as unhealthy and reboot them. To avoid rebooting such nodes, pause all the MachineHealthCheck resources before updating the cluster.

Prerequisites
  • Install the OpenShift CLI (oc).

Procedure
  1. To list all the available MachineHealthCheck resources that you want to pause, run the following command:

    $ oc get machinehealthcheck -n openshift-machine-api
  2. To pause the machine health checks, add the cluster.x-k8s.io/paused="" annotation to the MachineHealthCheck resource. Run the following command:

    $ oc -n openshift-machine-api annotate mhc <mhc-name> cluster.x-k8s.io/paused=""

    The annotated MachineHealthCheck resource resembles the following YAML file:

    apiVersion: machine.openshift.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: example
      namespace: openshift-machine-api
      annotations:
        cluster.x-k8s.io/paused: ""
    spec:
      selector:
        matchLabels:
          role: worker
      unhealthyConditions:
      - type:    "Ready"
        status:  "Unknown"
        timeout: "300s"
      - type:    "Ready"
        status:  "False"
        timeout: "300s"
      maxUnhealthy: "40%"
    status:
      currentHealthy: 5
      expectedMachines: 5

    Resume the machine health checks after updating the cluster. To resume the check, remove the pause annotation from the MachineHealthCheck resource by running the following command:

    $ oc -n openshift-machine-api annotate mhc <mhc-name> cluster.x-k8s.io/paused-

About updating single node OKD

You can update, or upgrade, a single-node OKD cluster by using either the console or CLI.

However, note the following limitations:

  • The prerequisite to pause the MachineHealthCheck resources is not required because there is no other node to perform the health check.

  • Restoring a single-node OKD cluster using an etcd backup is not officially supported. However, it is good practice to perform the etcd backup in case your upgrade fails. If your control plane is healthy, you might be able to restore your cluster to a previous state by using the backup.

  • Updating a single-node OKD cluster requires downtime and can include an automatic reboot. The amount of downtime depends on the update payload, as described in the following scenarios:

    • If the update payload contains an operating system update, which requires a reboot, the downtime is significant and impacts cluster management and user workloads.

    • If the update contains machine configuration changes that do not require a reboot, the downtime is less, and the impact on the cluster management and user workloads is lessened. In this case, the node draining step is skipped with single-node OKD because there is no other node in the cluster to reschedule the workloads to.

    • If the update payload does not contain an operating system update or machine configuration changes, a short API outage occurs and resolves quickly.

There are conditions, such as bugs in an updated package, that can cause the single node to not restart after a reboot. In this case, the update does not rollback automatically.

Additional resources

For information on which machine configuration changes require a reboot, see the note in Understanding the Machine Config Operator.

Updating a cluster by using the CLI

If updates are available, you can update your cluster by using the OpenShift CLI (oc).

You can find information about available OKD advisories and updates in the errata section of the Customer Portal.

Prerequisites
  • Install the OpenShift CLI (oc) that matches the version for your updated version.

  • Log in to the cluster as user with cluster-admin privileges.

  • Install the jq package.

  • Pause all MachineHealthCheck resources.

Procedure
  1. Ensure that your cluster is available:

    $ oc get clusterversion
    Example output
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.8.13     True        False         158m    Cluster version is 4.8.13
  2. Review the current update channel information and confirm that your channel is set to stable-4.9:

    $ oc get clusterversion -o json|jq ".items[0].spec"
    Example output
    {
      "channel": "stable-4.9",
      "clusterID": "990f7ab8-109b-4c95-8480-2bd1deec55ff"
    }

    For production clusters, you must subscribe to a stable-* or fast-* channel.

  3. View the available updates and note the version number of the update that you want to apply:

    $ oc adm upgrade
    Example output
    Cluster version is 4.8.13
    
    Updates:
    
    VERSION IMAGE
    4.9.0   quay.io/openshift-release-dev/ocp-release@sha256:9c5f0df8b192a0d7b46cd5f6a4da2289c155fd5302dec7954f8f06c878160b8b
  4. Apply an update:

    • To update to the latest version:

      $ oc adm upgrade --to-latest=true (1)
    • To update to a specific version:

      $ oc adm upgrade --to=<version> (1)
      1 <version> is the update version that you obtained from the output of the previous command.
  5. Review the status of the Cluster Version Operator:

    $ oc get clusterversion -o json|jq ".items[0].spec"
    Example output
    {
      "channel": "stable-4.9",
      "clusterID": "990f7ab8-109b-4c95-8480-2bd1deec55ff",
      "desiredUpdate": {
        "force": false,
        "image": "quay.io/openshift-release-dev/ocp-release@sha256:9c5f0df8b192a0d7b46cd5f6a4da2289c155fd5302dec7954f8f06c878160b8b",
        "version": "4.9.0" (1)
      }
    }
    1 If the version number in the desiredUpdate stanza matches the value that you specified, the update is in progress.
  6. Review the cluster version status history to monitor the status of the update. It might take some time for all the objects to finish updating.

    $ oc get clusterversion -o json|jq ".items[0].status.history"
    Example output
    [
      {
        "completionTime": null,
        "image": "quay.io/openshift-release-dev/ocp-release@sha256:b8fa13e09d869089fc5957c32b02b7d3792a0b6f36693432acc0409615ab23b7",
        "startedTime": "2021-01-28T20:30:50Z",
        "state": "Partial",
        "verified": true,
        "version": "4.9.0"
      },
      {
        "completionTime": "2021-01-28T20:30:50Z",
        "image": "quay.io/openshift-release-dev/ocp-release@sha256:b8fa13e09d869089fc5957c32b02b7d3792a0b6f36693432acc0409615ab23b7",
        "startedTime": "2021-01-28T17:38:10Z",
        "state": "Completed",
        "verified": false,
        "version": "4.8.13"
      }
    ]

    The history contains a list of the most recent versions applied to the cluster. This value is updated when the CVO applies an update. The list is ordered by date, where the newest update is first in the list. Updates in the history have state Completed if the rollout completed and Partial if the update failed or did not complete.

    If an upgrade fails, the Operator stops and reports the status of the failing component. Rolling your cluster back to a previous version is not supported. If your upgrade fails, contact Red Hat support.

  7. After the update completes, you can confirm that the cluster version has updated to the new version:

    $ oc get clusterversion
    Example output
    NAME      VERSION     AVAILABLE   PROGRESSING   SINCE     STATUS
    version   4.9.0       True        False         2m        Cluster version is 4.9.0
  8. If you are upgrading your cluster to the next minor version, like version 4.y to 4.(y+1), it is recommended to confirm your nodes are upgraded before deploying workloads that rely on a new feature:

    $ oc get nodes
    Example output
    NAME                           STATUS   ROLES    AGE   VERSION
    ip-10-0-168-251.ec2.internal   Ready    master   82m   v1.22.1
    ip-10-0-170-223.ec2.internal   Ready    master   82m   v1.22.1
    ip-10-0-179-95.ec2.internal    Ready    worker   70m   v1.22.1
    ip-10-0-182-134.ec2.internal   Ready    worker   70m   v1.22.1
    ip-10-0-211-16.ec2.internal    Ready    master   82m   v1.22.1
    ip-10-0-250-100.ec2.internal   Ready    worker   69m   v1.22.1

Changing the update server by using the CLI

Changing the update server is optional. If you have an OpenShift Update Service (OSUS) installed and configured locally, you must set the URL for the server as the upstream to use the local server during updates. The default value for upstream is https://api.openshift.com/api/upgrades_info/v1/graph.

Procedure
  • Change the upstream parameter value in the cluster version:

    $ oc patch clusterversion/version --patch '{"spec":{"upstream":"<update-server-url>"}}' --type=merge

    The <update-server-url> variable specifies the URL for the update server.

    Example output
    clusterversion.config.openshift.io/version patched