
Moving to the Prep stage of the image-based upgrade with Lifecycle Agent

When you deploy the Lifecycle Agent on a cluster, the Lifecycle Agent automatically creates an ImageBasedUpgrade custom resource (CR).
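
You can inspect the automatically created CR, for example, by running the following command:

$ oc get ibu upgrade -o yaml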

After you create all the resources that you need during the upgrade, you can move on to the Prep stage. For more information, see the "Creating ConfigMap objects for the image-based upgrade with Lifecycle Agent" section.

In a disconnected environment, if the seed cluster’s release image registry is different from the target cluster’s release image registry, you must create an ImageDigestMirrorSet (IDMS) resource to configure alternative mirrored repository locations. For more information, see "Configuring image registry repository mirroring".
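
The following is a minimal sketch of such an ImageDigestMirrorSet resource; the resource name and repository paths are placeholders for your environment:

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: seed-image-mirror
spec:
  imageDigestMirrors:
  - source: <seed_release_registry>/ocp/release
    mirrors:
    - <target_mirror_registry>/ocp/release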

You can retrieve the release registry used in the seed image by running the following command:

$ skopeo inspect docker://<imagename> | jq -r '.Labels."com.openshift.lifecycle-agent.seed_cluster_info" | fromjson | .release_registry'
Prerequisites
  • You have created resources to back up and restore your clusters.

Procedure
  1. Check that you have patched your ImageBasedUpgrade CR:

    apiVersion: lca.openshift.io/v1
    kind: ImageBasedUpgrade
    metadata:
      name: upgrade
    spec:
      stage: Idle
      seedImageRef:
        version: <target_version>
        image: <seed_container_image>
        pullSecretRef: <seed_pull_secret>
      autoRollbackOnFailure: {}
    #    initMonitorTimeoutSeconds: <initMonitorTimeoutSeconds>
      extraManifests:
      - name: example-extra-manifests-cm
        namespace: openshift-lifecycle-agent
      - name: example-catalogsources-cm
        namespace: openshift-lifecycle-agent
      oadpContent:
      - name: oadp-cm-example
        namespace: openshift-adp

    where:

    <target_version>

    Target platform version. The value must match the version of the seed image.

    <seed_container_image>

    Repository where the target cluster can pull the seed image from.

    <seed_pull_secret>

    Reference to a secret with credentials to pull container images if the images are in a private registry.

    <initMonitorTimeoutSeconds>

    Optional: Time frame in seconds to roll back if the upgrade does not complete within that time frame after the first reboot. If not defined or set to 0, the Lifecycle Agent uses the default value of 1800 seconds (30 minutes).

    extraManifests

    Optional: List of ConfigMap resources. These resources contain your custom catalog sources to retain after the upgrade and any extra manifests that the seed image does not include. A minimal example ConfigMap follows this list.

    oadpContent

    List of ConfigMap resources that contain the OADP Backup and Restore CRs.
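
    Each extraManifests entry references a ConfigMap object. The following is a minimal sketch of such a ConfigMap; the wrapped manifest content is a placeholder:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: example-extra-manifests-cm
      namespace: openshift-lifecycle-agent
    data:
      example-manifest.yaml: |
        # Extra manifest to apply on the target cluster after the upgrade.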

  2. To start the Prep stage, change the value of the stage field to Prep in the ImageBasedUpgrade CR by running the following command:

    $ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Prep"}}' --type=merge -n openshift-lifecycle-agent

    If you include ConfigMap objects for OADP resources and extra manifests, the Lifecycle Agent validates the specified ConfigMap objects during the Prep stage.

    The following issues might occur:

    • Validation warnings or errors if the Lifecycle Agent detects any issues with the extraManifests parameters.

    • Validation errors if the Lifecycle Agent detects any issues with the oadpContent parameters.

    Validation warnings do not block the Upgrade stage, but you must decide whether it is safe to proceed with the upgrade. These warnings, for example missing custom resource definitions (CRDs), namespaces, or dry-run failures, update the status.conditions field for the Prep stage and annotation fields in the ImageBasedUpgrade CR with details about the warning. The following example shows these details:

    # ...
    metadata:
      annotations:
        extra-manifest.lca.openshift.io/validation-warning: '...'
    # ...

    However, validation errors, such as adding MachineConfig or Operator manifests to extra manifests, cause the Prep stage to fail and block the Upgrade stage.

    When the validations pass, the cluster creates a new ostree stateroot, which involves pulling and unpacking the seed image, and running host-level commands. Finally, the Lifecycle Agent precaches all the required images on the target cluster.
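
    You can follow the progress of these operations in the Lifecycle Agent logs, for example by running the following command:

    $ oc logs -n openshift-lifecycle-agent deployment/lifecycle-agent-controller-manager -c manager -f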

Verification
  • Check the status of the ImageBasedUpgrade CR by running the following command:

    $ oc get ibu -o yaml

    The following example shows a successful Prep stage:

    status:
      conditions:
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: In progress
        observedGeneration: 13
        reason: InProgress
        status: "False"
        type: Idle
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: Prep completed
        observedGeneration: 13
        reason: Completed
        status: "False"
        type: PrepInProgress
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: Prep stage completed successfully
        observedGeneration: 13
        reason: Completed
        status: "True"
        type: PrepCompleted
      observedGeneration: 13
      validNextStages:
      - Idle
      - Upgrade
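
    Alternatively, to check only the PrepCompleted condition, you can run a command similar to the following:

    $ oc get ibu upgrade -o jsonpath='{.status.conditions[?(@.type=="PrepCompleted")].status}'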

Moving to the Upgrade stage of the image-based upgrade with Lifecycle Agent

After you generate the seed image and complete the Prep stage, you can upgrade the target cluster. During the upgrade process, the OADP Operator creates a backup of the artifacts specified in the OADP custom resources (CRs), then the Lifecycle Agent upgrades the cluster.

If the upgrade fails or stops, the Lifecycle Agent initiates an automatic rollback. If you have an issue after the upgrade, you can perform a manual rollback. For more information about manual rollback, see "Moving to the Rollback stage of the image-based upgrade with Lifecycle Agent".

Prerequisites
  • You have completed the Prep stage.

Procedure
  1. To move to the Upgrade stage, change the value of the stage field to Upgrade in the ImageBasedUpgrade CR by running the following command:

    $ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Upgrade"}}' --type=merge
  2. Check the status of the ImageBasedUpgrade CR by running the following command:

    $ oc get ibu -o yaml

    The following example shows an upgrade in progress:

    status:
      conditions:
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: In progress
        observedGeneration: 5
        reason: InProgress
        status: "False"
        type: Idle
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: Prep completed
        observedGeneration: 5
        reason: Completed
        status: "False"
        type: PrepInProgress
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: Prep completed successfully
        observedGeneration: 5
        reason: Completed
        status: "True"
        type: PrepCompleted
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: |-
          Waiting for system to stabilize: one or more health checks failed
            - one or more ClusterOperators not yet ready: authentication
            - one or more MachineConfigPools not yet ready: master
            - one or more ClusterServiceVersions not yet ready: sriov-fec.v2.8.0
        observedGeneration: 1
        reason: InProgress
        status: "True"
        type: UpgradeInProgress
      observedGeneration: 1
      rollbackAvailabilityExpiration: "2024-05-19T14:01:52Z"
      validNextStages:
      - Rollback

    The OADP Operator creates a backup of the data specified in the OADP Backup and Restore CRs and the target cluster reboots.
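
    You can check the backup progress before the reboot by querying the Backup CRs, for example with a command similar to the restore check shown in the Verification section:

    $ oc get backups.velero.io -n openshift-adp -o custom-columns=NAME:.metadata.name,Status:.status.phase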

  3. Monitor the status of the CR by running the following command:

    $ oc get ibu -o yaml
  4. After the upgrade, complete the changes by patching the value of the stage field to Idle in the ImageBasedUpgrade CR by running the following command:

    $ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Idle"}}' --type=merge

    You cannot roll back the changes once you move to the Idle stage after an upgrade.

    The Lifecycle Agent deletes all resources created during the upgrade process.
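
    You can confirm that the cleanup completed by checking that the Idle condition is true, for example:

    $ oc get ibu upgrade -o jsonpath='{.status.conditions[?(@.type=="Idle")].status}'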

  5. You can remove the OADP Operator and its configuration files after a successful upgrade. For more information, see "Deleting Operators from a cluster".

Verification
  1. Check the status of the ImageBasedUpgrade CR by running the following command:

    $ oc get ibu -o yaml

    The following example shows a completed upgrade:

    status:
      conditions:
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: In progress
        observedGeneration: 5
        reason: InProgress
        status: "False"
        type: Idle
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: Prep completed
        observedGeneration: 5
        reason: Completed
        status: "False"
        type: PrepInProgress
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: Prep completed successfully
        observedGeneration: 5
        reason: Completed
        status: "True"
        type: PrepCompleted
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: Upgrade completed
        observedGeneration: 1
        reason: Completed
        status: "False"
        type: UpgradeInProgress
      - lastTransitionTime: "2024-01-01T09:00:00Z"
        message: Upgrade completed
        observedGeneration: 1
        reason: Completed
        status: "True"
        type: UpgradeCompleted
      observedGeneration: 1
      rollbackAvailabilityExpiration: "2024-01-01T09:00:00Z"
      validNextStages:
      - Idle
      - Rollback
  2. Check the status of the cluster restoration by running the following command:

    $ oc get restores -n openshift-adp -o custom-columns=NAME:.metadata.name,Status:.status.phase,Reason:.status.failureReason

    The following example shows completed restores:

    NAME             Status      Reason
    acm-klusterlet   Completed   <none>
    apache-app       Completed   <none>
    localvolume      Completed   <none>

    The acm-klusterlet is specific to RHACM environments only.

Moving to the Rollback stage of the image-based upgrade with Lifecycle Agent

The Lifecycle Agent initiates an automatic rollback if the upgrade does not complete within the time frame specified in the initMonitorTimeoutSeconds field after rebooting, as shown in the following example:

apiVersion: lca.openshift.io/v1
kind: ImageBasedUpgrade
metadata:
  name: upgrade
spec:
  stage: Idle
  seedImageRef:
    version: 4.15.2
    image: <seed_container_image>
  autoRollbackOnFailure: {}
#    initMonitorTimeoutSeconds: <initMonitorTimeoutSeconds>
# ...

where:

<initMonitorTimeoutSeconds>

Optional: The time frame in seconds to roll back if the upgrade does not complete within that time frame after the first reboot. If not defined or set to 0, the Lifecycle Agent uses the default value of 1800 seconds (30 minutes).

You can manually roll back the changes if you see unresolvable issues after an upgrade.

Prerequisites
  • You have logged in to the hub cluster as a user with cluster-admin privileges.

  • You ensured that the control plane certificates on the original stateroot are valid. If the certificates expired, see "Recovering from expired control plane certificates".

If you choose to upgrade a recently installed single-node OpenShift cluster, for example for testing purposes, you have a limited rollback time frame of 24 hours or less. You can verify the rollback deadline by checking the rollbackAvailabilityExpiration field of the ImageBasedUpgrade custom resource.
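
For example, you can retrieve the expiration timestamp by running the following command:

$ oc get ibu upgrade -o jsonpath='{.status.rollbackAvailabilityExpiration}'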

Procedure
  1. To move to the rollback stage, patch the value of the stage field to Rollback in the ImageBasedUpgrade CR by running the following command:

    $ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Rollback"}}' --type=merge

    The Lifecycle Agent reboots the cluster with the previously installed version of OKD and restores the applications.

  2. After reviewing the changes, complete the rollback by patching the value of the stage field to Idle in the ImageBasedUpgrade CR by running the following command:

    $ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Idle"}}' --type=merge -n openshift-lifecycle-agent

    If you move to the Idle stage after a rollback, the Lifecycle Agent cleans up resources that you can use to troubleshoot a failed upgrade.

Troubleshooting image-based upgrades with Lifecycle Agent

Perform troubleshooting steps on the managed clusters to resolve any issues.

If you are using the ImageBasedGroupUpgrade CR to upgrade your clusters, ensure that you update the lcm.openshift.io/ibgu-<stage>-completed or lcm.openshift.io/ibgu-<stage>-failed cluster labels properly after performing troubleshooting or recovery steps on the managed clusters. This ensures that the Topology Aware Lifecycle Manager (TALM) continues to manage the image-based upgrade for the cluster.
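
For example, assuming the labels are set on the ManagedCluster resource on the hub cluster, you can mark the Prep stage as completed for a cluster with a command similar to the following:

$ oc label managedcluster <managed_cluster_name> lcm.openshift.io/ibgu-prep-completed=""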

Collecting logs

You can use the oc adm must-gather CLI to collect information for debugging and troubleshooting.

To collect data about the Operators, run the following command:

$ oc adm must-gather \
  --dest-dir=must-gather/tmp \
  --image=$(oc -n openshift-lifecycle-agent get deployment.apps/lifecycle-agent-controller-manager -o jsonpath='{.spec.template.spec.containers[?(@.name == "manager")].image}') \
  --image=<oadp_must_gather_image> \
  --image=<origin_must_gather_image>

where:

<oadp_must_gather_image>

Optional: Add this option, for example quay.io/konveyor/oadp-must-gather:latest, if you need to gather more information from the OADP Operator.

<origin_must_gather_image>

Optional: Add this option, for example quay.io/openshift/origin-must-gather:latest, if you need to gather more information from the SR-IOV Operator.

AbortFailed or FinalizeFailed error

Issue

During the finalization stage or when you stop the process at the Prep stage, the Lifecycle Agent cleans up the following resources:

  • Stateroot that is no longer required

  • Precaching resources

  • OADP CRs

  • ImageBasedUpgrade CR

If the Lifecycle Agent fails to clean up these resources, it transitions to the AbortFailed or FinalizeFailed states. The condition message and log show the steps that failed, as shown in the following example:

      message: failed to delete all the backup CRs. Perform cleanup manually then add
        'lca.openshift.io/manual-cleanup-done' annotation to ibu CR to transition back
        to Idle
      observedGeneration: 5
      reason: AbortFailed
      status: "False"
      type: Idle
Resolution
  1. Inspect the logs to find the reason for failure.

  2. To prompt the Lifecycle Agent to retry the cleanup, add the lca.openshift.io/manual-cleanup-done annotation to the ImageBasedUpgrade CR.
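
    For example, you can add the annotation by running a command similar to the following:

    $ oc annotate imagebasedupgrades.lca.openshift.io upgrade lca.openshift.io/manual-cleanup-done=""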

    After observing this annotation, the Lifecycle Agent retries the cleanup and, if it is successful, the ImageBasedUpgrade stage transitions to Idle.

    If the cleanup fails again, you can manually clean up the resources.

Cleaning up stateroot manually

Issue

When you stop at the Prep stage, the Lifecycle Agent cleans up the new stateroot. When finalizing after a successful upgrade or a rollback, the Lifecycle Agent cleans up the old stateroot. If this step fails, you must inspect the logs to determine why the failure occurred.

Resolution
  1. Check if there are any existing deployments in the stateroot by running the following command:

    $ ostree admin status
  2. If there are any, clean up the existing deployment by running the following command:

    $ ostree admin undeploy <index_of_deployment>
  3. After cleaning up all the deployments of the stateroot, wipe the stateroot directory by running the following commands:

    Ensure that the booted deployment is not in this stateroot.

    $ stateroot="<stateroot_to_delete>"
    $ unshare -m /bin/sh -c "mount -o remount,rw /sysroot && rm -rf /sysroot/ostree/deploy/${stateroot}"

Cleaning up OADP resources manually

Issue

Automatic cleanup of OADP resources can fail due to connection issues between the Lifecycle Agent and the S3 backend. After you restore the connection and add the lca.openshift.io/manual-cleanup-done annotation, the Lifecycle Agent can successfully clean up backup resources.

Resolution
  1. Check the backend connectivity by running the following command:

    $ oc get backupstoragelocations.velero.io -n openshift-adp

    The following example shows successful backend connectivity:

    NAME                          PHASE       LAST VALIDATED   AGE   DEFAULT
    dataprotectionapplication-1   Available   33s              8d    true
  2. Remove all backup resources and then add the lca.openshift.io/manual-cleanup-done annotation to the ImageBasedUpgrade CR.
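
    For example, assuming the default OADP namespace, you can delete the Backup CRs and then re-add the annotation with commands similar to the following:

    $ oc delete backups.velero.io -n openshift-adp --all
    $ oc annotate imagebasedupgrades.lca.openshift.io upgrade lca.openshift.io/manual-cleanup-done=""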

LVM Storage volume contents not restored

When you use LVM Storage to configure dynamic persistent volume storage, LVM Storage might not restore the persistent volume contents if you have configured it incorrectly.

Missing LVM Storage-related fields in Backup CR

Issue

Your Backup CRs might be missing fields that you need to restore your persistent volumes. You can check for events in your application pod to determine whether you have this issue by running the following command:

$ oc describe pod <your_app_name>

The following example output shows a pod failing due to missing LVM Storage-related fields in the Backup CR:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  58s (x2 over 66s)  default-scheduler  0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Normal   Scheduled         56s                default-scheduler  Successfully assigned default/db-1234 to sno1.example.lab
  Warning  FailedMount       24s (x7 over 55s)  kubelet            MountVolume.SetUp failed for volume "pvc-1234" : rpc error: code = Unknown desc = VolumeID is not found
Resolution

You must include logicalvolumes.topolvm.io in the application Backup CR. Without this resource, the application restores its persistent volume claims and persistent volume manifests correctly. However, the logicalvolume associated with the persistent volume is not restored properly after the pivot. The following example shows a correctly configured Backup CR:

apiVersion: velero.io/v1
kind: Backup
metadata:
  labels:
    velero.io/storage-location: default
  name: small-app
  namespace: openshift-adp
spec:
  includedNamespaces:
  - test
  includedNamespaceScopedResources:
  - secrets
  - persistentvolumeclaims
  - deployments
  - statefulsets
  includedClusterScopedResources:
  - persistentVolumes
  - volumesnapshotcontents
  - logicalvolumes.topolvm.io

To restore the persistent volumes for your application, you must configure the includedClusterScopedResources section as shown.

Missing LVM Storage-related fields in Restore CR

Issue

LVM Storage restores the expected resources for the applications but it does not preserve the persistent volume contents after upgrading.

  1. List the persistent volumes for your applications by running the following command before pivot:

    $ oc get pv,pvc,logicalvolumes.topolvm.io -A

    The following shows the output before pivot:

    NAME                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
    persistentvolume/pvc-1234   1Gi        RWO            Retain           Bound    default/pvc-db   lvms-vg1                4h45m
    
    NAMESPACE   NAME                           STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    default     persistentvolumeclaim/pvc-db   Bound    pvc-1234   1Gi        RWO            lvms-vg1       4h45m
    
    NAMESPACE   NAME                                AGE
                logicalvolume.topolvm.io/pvc-1234   4h45m
  2. List the persistent volumes for your applications by running the following command after pivot:

    $ oc get pv,pvc,logicalvolumes.topolvm.io -A

    The following shows the output after pivot:

    NAME                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
    persistentvolume/pvc-1234   1Gi        RWO            Delete           Bound    default/pvc-db   lvms-vg1                19s
    
    NAMESPACE   NAME                           STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    default     persistentvolumeclaim/pvc-db   Bound    pvc-1234   1Gi        RWO            lvms-vg1       19s
    
    NAMESPACE   NAME                                AGE
                logicalvolume.topolvm.io/pvc-1234   18s
Resolution

The reason for this issue is that the logicalvolume status is not preserved in the Restore CR. This status is important because Velero requires this status to reference the volumes that you must preserve after pivoting. You must include the following fields in the application Restore CR, as shown in the following example:

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: sample-vote-app
  namespace: openshift-adp
  labels:
    velero.io/storage-location: default
  annotations:
    lca.openshift.io/apply-wave: "3"
spec:
  backupName: sample-vote-app
  restorePVs: <restore_pvs>
  restoreStatus:
    includedResources:
      - logicalvolumes

where:

<restore_pvs>

To preserve the persistent volumes for your application, you must set restorePVs to true.

restoreStatus

To preserve the persistent volumes for your application, you must configure this field as shown.

Debugging failed Backup and Restore CRs

Issue

The backup or restoration of artifacts failed.

Resolution

You can debug Backup and Restore CRs and retrieve logs with the Velero CLI tool. The Velero CLI tool offers more detailed information than the OpenShift CLI tool.
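
The Velero pod name in the following commands is an example; you can avoid looking up the pod name by targeting the Velero deployment instead, assuming the default OADP deployment name velero:

$ oc exec -n openshift-adp deployment/velero -c velero -- ./velero version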

  1. Describe the Backup CR that has errors by running the following command:

    $ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero describe backup -n openshift-adp backup-acm-klusterlet --details
  2. Describe the Restore CR that has errors by running the following command:

    $ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero describe restore -n openshift-adp restore-acm-klusterlet --details
  3. Download the backed up resources to a local directory by running the following command:

    $ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero backup download -n openshift-adp backup-acm-klusterlet -o ~/backup-acm-klusterlet.tar.gz