You can upgrade your managed single-node OpenShift cluster with the image-based upgrade through GitOps Zero Touch Provisioning (ZTP).
When you deploy the Lifecycle Agent on a cluster, an ImageBasedUpgrade CR is automatically created. You update this CR to specify the image repository of the seed image and to move through the different stages.
apiVersion: lca.openshift.io/v1
kind: ImageBasedUpgrade
metadata:
  name: upgrade
spec:
  stage: Idle
Create policies and ConfigMap objects for resources used in the image-based upgrade. For more information, see "Creating ConfigMap objects for the image-based upgrade with GitOps ZTP".
Add policies for the Prep, Upgrade, and Idle stages to your existing group PolicyGenTemplate called ibu-upgrade-ranGen.yaml:
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: example-group-ibu
  namespace: "ztp-group"
spec:
  bindingRules:
    group-du-sno: ""
  mcp: "master"
  evaluationInterval: (1)
    compliant: 10s
    noncompliant: 10s
  sourceFiles:
    - fileName: ConfigMapGeneric.yaml
      complianceType: mustonlyhave
      policyName: "oadp-cm-policy"
      metadata:
        name: oadp-cm
        namespace: openshift-adp
    - fileName: ibu/ImageBasedUpgrade.yaml
      policyName: "prep-stage-policy"
      spec:
        stage: Prep
        seedImageRef: (2)
          version: "4.15.0"
          image: "quay.io/user/lca-seed:4.15.0"
          pullSecretRef:
            name: "<seed_pull_secret>"
        oadpContent: (3)
          - name: "oadp-cm"
            namespace: "openshift-adp"
      status:
        conditions:
          - reason: Completed
            status: "True"
            type: PrepCompleted
            message: "Prep stage completed successfully"
    - fileName: ibu/ImageBasedUpgrade.yaml
      policyName: "upgrade-stage-policy"
      spec:
        stage: Upgrade
      status:
        conditions:
          - reason: Completed
            status: "True"
            type: UpgradeCompleted
    - fileName: ibu/ImageBasedUpgrade.yaml
      policyName: "finalize-stage-policy"
      complianceType: mustonlyhave
      spec:
        stage: Idle
    - fileName: ibu/ImageBasedUpgrade.yaml
      policyName: "finalize-stage-policy"
      status:
        conditions:
          - reason: Idle
            status: "True"
            type: Idle
(1) The policy evaluation interval for compliant and non-compliant policies. Set them to 10s to ensure that the policy status accurately reflects the current upgrade status.
(2) Define the seed image, OKD version, and pull secret for the upgrade in the Prep stage.
(3) Define the OADP ConfigMap resources required for backup and restore.
Verify that the policies required for an image-based upgrade are created by running the following command:
$ oc get policies -n spoke1 | grep -E "example-group-ibu"
ztp-group.example-group-ibu-oadp-cm-policy inform NonCompliant 31h
ztp-group.example-group-ibu-prep-stage-policy inform NonCompliant 31h
ztp-group.example-group-ibu-upgrade-stage-policy inform NonCompliant 31h
ztp-group.example-group-ibu-finalize-stage-policy inform NonCompliant 31h
ztp-group.example-group-ibu-rollback-stage-policy inform NonCompliant 31h
Update the du-profile cluster label to the target platform version or the corresponding policy-binding label in the SiteConfig CR.
apiVersion: ran.openshift.io/v1
kind: SiteConfig
[...]
spec:
  [...]
  clusterLabels:
    du-profile: "4.15.0"
Updating the labels to the target platform version unbinds the existing set of policies.
Commit and push the updated SiteConfig CR to the Git repository.
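For example, a minimal sketch of the commit and push, assuming the SiteConfig CR is stored at site-configs/spoke1.yaml and your branch is main (both are placeholders for your repository layout):
$ git add site-configs/spoke1.yaml
$ git commit -m "Update du-profile label for the image-based upgrade"
$ git push origin main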
When you are ready to move to the Prep stage, create the ClusterGroupUpgrade CR on the target hub cluster with the Prep and OADP ConfigMap policies:
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-ibu-prep
  namespace: default
spec:
  clusters:
    - spoke1
  enable: true
  managedPolicies:
    - example-group-ibu-oadp-cm-policy
    - example-group-ibu-prep-stage-policy
  remediationStrategy:
    canaries:
      - spoke1
    maxConcurrency: 1
    timeout: 240
Apply the Prep policy by running the following command:
$ oc apply -f cgu-ibu-prep.yml
If you provide ConfigMap objects for OADP resources and extra manifests, the Lifecycle Agent validates the specified ConfigMap objects during the Prep stage.
You might encounter the following issues:
Validation warnings or errors if the Lifecycle Agent detects any issues with extraManifests
Validation errors if the Lifecycle Agent detects any issues with oadpContent
Validation warnings do not block the Upgrade stage, but you must decide if it is safe to proceed with the upgrade.
These warnings, for example missing CRDs, namespaces, or dry run failures, update the status.conditions in the Prep stage and the annotation fields in the ImageBasedUpgrade CR with details about the warning.
[...]
metadata:
  annotations:
    extra-manifest.lca.openshift.io/validation-warning: '...'
[...]
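To review any reported warnings directly on the target cluster, you can inspect the ImageBasedUpgrade CR, for example:
$ oc get imagebasedupgrades.lca.openshift.io upgrade -o yaml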
However, validation errors, such as adding MachineConfig or Operator manifests to extra manifests, cause the Prep stage to fail and block the Upgrade stage.
When the validations pass, the cluster creates a new ostree stateroot, which involves pulling and unpacking the seed image, and running host-level commands. Finally, all the required images are precached on the target cluster.
Monitor the status and wait for the cgu-ibu-prep ClusterGroupUpgrade to report Completed by running the following command:
$ oc get cgu -n default
NAME AGE STATE DETAILS
cgu-ibu-prep 31h Completed All clusters are compliant with all the managed policies
After you complete the Prep stage, you can upgrade the target cluster. During the upgrade process, the OADP Operator creates a backup of the artifacts specified in the OADP CRs, then the Lifecycle Agent upgrades the cluster.
If the upgrade fails or stops, an automatic rollback is initiated. If you have an issue after the upgrade, you can initiate a manual rollback. For more information about manual rollback, see "(Optional) Initiating a rollback with Lifecycle Agent and GitOps ZTP".
Complete the Prep stage.
When you are ready to move to the Upgrade stage, create the ClusterGroupUpgrade CR on the target hub cluster that references the Upgrade policy:
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-ibu-upgrade
  namespace: default
spec:
  actions:
    beforeEnable:
      addClusterAnnotations:
        import.open-cluster-management.io/disable-auto-import: "true" (1)
    afterCompletion:
      removeClusterAnnotations:
        - import.open-cluster-management.io/disable-auto-import (2)
  clusters:
    - spoke1
  enable: true
  managedPolicies:
    - example-group-ibu-upgrade-stage-policy
  remediationStrategy:
    canaries:
      - spoke1
    maxConcurrency: 1
    timeout: 240
(1) Applies the disable-auto-import annotation to the managed cluster before starting the upgrade. This annotation ensures that automatic importing of the managed cluster is disabled during the upgrade stage until the cluster is ready.
(2) Removes the disable-auto-import annotation after the upgrade is complete.
Apply the Upgrade policy by running the following command:
$ oc apply -f cgu-ibu-upgrade.yml
Monitor the status by running the following command and wait for the cgu-ibu-upgrade ClusterGroupUpgrade to report Completed:
$ oc get cgu -n default
NAME AGE STATE DETAILS
cgu-ibu-prep 31h Completed All clusters are compliant with all the managed policies
cgu-ibu-upgrade 31h Completed All clusters are compliant with all the managed policies
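Optionally, you can also confirm the new release directly on the target cluster, for example:
$ oc get clusterversion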
When you are satisfied with the changes and ready to finalize the upgrade, create a ClusterGroupUpgrade CR on the target hub cluster that references the policy that finalizes the upgrade:
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-ibu-finalize
  namespace: default
spec:
  actions:
    beforeEnable:
      removeClusterAnnotations:
        - import.open-cluster-management.io/disable-auto-import
  clusters:
    - spoke1
  enable: true
  managedPolicies:
    - example-group-ibu-finalize-stage-policy
  remediationStrategy:
    canaries:
      - spoke1
    maxConcurrency: 1
    timeout: 240
Apply the policy by running the following command:
$ oc apply -f cgu-ibu-finalize.yaml
Monitor the status and wait for the cgu-ibu-finalize ClusterGroupUpgrade to report Completed by running the following command:
$ oc get cgu -n default
NAME AGE STATE DETAILS
cgu-ibu-finalize 30h Completed All clusters are compliant with all the managed policies
cgu-ibu-prep 31h Completed All clusters are compliant with all the managed policies
cgu-ibu-upgrade 31h Completed All clusters are compliant with all the managed policies
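Optionally, you can confirm on the target cluster that the ImageBasedUpgrade CR has returned to the Idle stage, for example:
$ oc get imagebasedupgrades.lca.openshift.io upgrade -o jsonpath='{.spec.stage}'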
You can remove the OADP Operator and its configuration files after a successful upgrade.
Change the complianceType to mustnothave for the OADP Operator namespace, Operator group, and subscription in the common-ranGen.yaml file.
[...]
- fileName: OadpSubscriptionNS.yaml
  policyName: "subscriptions-policy"
  complianceType: mustnothave
- fileName: OadpSubscriptionOperGroup.yaml
  policyName: "subscriptions-policy"
  complianceType: mustnothave
- fileName: OadpSubscription.yaml
  policyName: "subscriptions-policy"
  complianceType: mustnothave
- fileName: OadpOperatorStatus.yaml
  policyName: "subscriptions-policy"
  complianceType: mustnothave
[...]
Change the complianceType to mustnothave for the OADP Operator configuration in the site PolicyGenTemplate file.
- fileName: OadpSecret.yaml
  policyName: "config-policy"
  complianceType: mustnothave
- fileName: OadpBackupStorageLocationStatus.yaml
  policyName: "config-policy"
  complianceType: mustnothave
- fileName: DataProtectionApplication.yaml
  policyName: "config-policy"
  complianceType: mustnothave
Merge the changes with your custom site repository and wait for the ArgoCD application to synchronize the change to the hub cluster. The status of the common-subscriptions-policy and the example-cnf-config-policy policies changes to Non-Compliant.
Apply the change to your target clusters by using the Topology Aware Lifecycle Manager. For more information about rolling out configuration changes, see "Update policies on managed clusters".
Monitor the process. When the status of the common-subscriptions-policy and the example-cnf-config-policy policies for a target cluster is Compliant, the OADP Operator has been removed from the cluster. Get the status of the policies by running the following commands:
$ oc get policy -n ztp-common common-subscriptions-policy
$ oc get policy -n ztp-site example-cnf-config-policy
Delete the OADP Operator namespace, Operator group and subscription, and configuration CRs from spec.sourceFiles in the common-ranGen.yaml and the site PolicyGenTemplate files.
Merge the changes with your custom site repository and wait for the ArgoCD application to synchronize the change to the hub cluster. The policy remains compliant.
If you encounter an issue after the upgrade, you can start a manual rollback.
Ensure that the control plane certificates on the original stateroot are valid. If the certificates expired, see "Recovering from expired control plane certificates".
Revert the du-profile or the corresponding policy-binding label to the original platform version in the SiteConfig CR:
apiVersion: ran.openshift.io/v1
kind: SiteConfig
[...]
spec:
  [...]
  clusterLabels:
    du-profile: "4.14.x"
When you are ready to initiate the rollback, add the Rollback policy to your existing group PolicyGenTemplate CR:
[...]
- fileName: ibu/ImageBasedUpgrade.yaml
  policyName: "rollback-stage-policy"
  spec:
    stage: Rollback
  status:
    conditions:
      - message: Rollback completed
        reason: Completed
        status: "True"
        type: RollbackCompleted
Create a ClusterGroupUpgrade CR on the target hub cluster that references the Rollback policy:
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-ibu-rollback
  namespace: default
spec:
  actions:
    beforeEnable:
      removeClusterAnnotations:
        - import.open-cluster-management.io/disable-auto-import
  clusters:
    - spoke1
  enable: true
  managedPolicies:
    - example-group-ibu-rollback-stage-policy
  remediationStrategy:
    canaries:
      - spoke1
    maxConcurrency: 1
    timeout: 240
Apply the Rollback policy by running the following command:
$ oc apply -f cgu-ibu-rollback.yml
When you are satisfied with the changes and you are ready to finalize the rollback, create a ClusterGroupUpgrade CR on the target hub cluster that references the policy that finalizes the rollback:
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-ibu-finalize
  namespace: default
spec:
  actions:
    beforeEnable:
      removeClusterAnnotations:
        - import.open-cluster-management.io/disable-auto-import
  clusters:
    - spoke1
  enable: true
  managedPolicies:
    - example-group-ibu-finalize-stage-policy
  remediationStrategy:
    canaries:
      - spoke1
    maxConcurrency: 1
    timeout: 240
Apply the policy by running the following command:
$ oc apply -f cgu-ibu-finalize.yml
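As with the earlier stages, you can monitor the ClusterGroupUpgrade CRs on the hub cluster and wait for them to report Completed, for example:
$ oc get cgu -n default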
You can encounter issues during the image-based upgrade.
You can use the oc adm must-gather CLI to collect information for debugging and troubleshooting.
Collect data about the Operators by running the following command:
$ oc adm must-gather \
  --dest-dir=must-gather/tmp \
  --image=$(oc -n openshift-lifecycle-agent get deployment.apps/lifecycle-agent-controller-manager -o jsonpath='{.spec.template.spec.containers[?(@.name == "manager")].image}') \
  --image=quay.io/konveyor/oadp-must-gather:latest \ (1)
  --image=quay.io/openshift/origin-must-gather:latest (2)
(1) (Optional) You can add this option if you need to gather more information from the OADP Operator.
(2) (Optional) You can add this option if you need to gather more information from the SR-IOV Operator.
During the finalize stage or when you stop the process at the Prep stage, the Lifecycle Agent cleans up the following resources:
Stateroot that is no longer required
Precaching resources
OADP CRs
ImageBasedUpgrade CR
If the Lifecycle Agent fails to perform these cleanup steps, it transitions to the AbortFailed or FinalizeFailed states. The condition message and log show which steps failed.
message: failed to delete all the backup CRs. Perform cleanup manually then add 'lca.openshift.io/manual-cleanup-done' annotation to ibu CR to transition back to Idle
observedGeneration: 5
reason: AbortFailed
status: "False"
type: Idle
Inspect the logs to determine why the failure occurred.
To prompt the Lifecycle Agent to retry the cleanup, add the lca.openshift.io/manual-cleanup-done annotation to the ImageBasedUpgrade CR.
After observing this annotation, the Lifecycle Agent retries the cleanup and, if it is successful, the ImageBasedUpgrade stage transitions to Idle.
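For example, a minimal sketch that adds the annotation, assuming the default ImageBasedUpgrade CR named upgrade (true is used here only as an example value):
$ oc annotate imagebasedupgrades.lca.openshift.io upgrade lca.openshift.io/manual-cleanup-done=true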
If the cleanup fails again, you can manually clean up the resources.
When you stop the process at the Prep stage, the Lifecycle Agent cleans up the new stateroot. When finalizing after a successful upgrade or a rollback, the Lifecycle Agent cleans up the old stateroot.
If this step fails, it is recommended that you inspect the logs to determine why the failure occurred.
Check if there are any existing deployments in the stateroot by running the following command:
$ ostree admin status
If there are any, clean up the existing deployment by running the following command:
$ ostree admin undeploy <index_of_deployment>
After cleaning up all the deployments of the stateroot, wipe the stateroot directory by running the following commands:
Ensure that the booted deployment is not in this stateroot.
$ stateroot="<stateroot_to_delete>"
$ unshare -m /bin/sh -c "mount -o remount,rw /sysroot && rm -rf /sysroot/ostree/deploy/${stateroot}"
Automatic cleanup of OADP resources can fail due to connection issues between the Lifecycle Agent and the S3 backend. By restoring the connection and adding the lca.openshift.io/manual-cleanup-done annotation, the Lifecycle Agent can successfully clean up backup resources.
Check the backend connectivity by running the following command:
$ oc get backupstoragelocations.velero.io -n openshift-adp
NAME PHASE LAST VALIDATED AGE DEFAULT
dataprotectionapplication-1 Available 33s 8d true
Remove all backup resources and then add the lca.openshift.io/manual-cleanup-done annotation to the ImageBasedUpgrade CR.
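For example, a minimal sketch under the assumption that the OADP resources are in the openshift-adp namespace; depending on your Velero configuration, backups that were synced from the object storage backend might also need to be removed there:
$ oc delete backups.velero.io -n openshift-adp --all
$ oc annotate imagebasedupgrades.lca.openshift.io upgrade lca.openshift.io/manual-cleanup-done=true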
When LVM Storage is used to provide dynamic persistent volume storage, LVM Storage might not restore the persistent volume contents if it is configured incorrectly.
Your Backup CRs might be missing fields that are needed to restore your persistent volumes.
You can check for events in your application pod to determine if you have this issue by running the following command:
$ oc describe pod <your_app_name>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 58s (x2 over 66s) default-scheduler 0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Normal Scheduled 56s default-scheduler Successfully assigned default/db-1234 to sno1.example.lab
Warning FailedMount 24s (x7 over 55s) kubelet MountVolume.SetUp failed for volume "pvc-1234" : rpc error: code = Unknown desc = VolumeID is not found
You must include logicalvolumes.topolvm.io in the application Backup CR. Without this resource, the application restores its persistent volume claims and persistent volume manifests correctly; however, the logicalvolume associated with this persistent volume is not restored properly after pivot.
apiVersion: velero.io/v1
kind: Backup
metadata:
  labels:
    velero.io/storage-location: default
  name: small-app
  namespace: openshift-adp
spec:
  includedNamespaces:
    - test
  includedNamespaceScopedResources:
    - secrets
    - persistentvolumeclaims
    - deployments
    - statefulsets
  includedClusterScopedResources: (1)
    - persistentVolumes
    - volumesnapshotcontents
    - logicalvolumes.topolvm.io
(1) To restore the persistent volumes for your application, you must configure this section as shown.
The expected resources for the applications are restored but the persistent volume contents are not preserved after upgrading.
List the persistent volumes for your applications by running the following command before pivot:
$ oc get pv,pvc,logicalvolumes.topolvm.io -A
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/pvc-1234 1Gi RWO Retain Bound default/pvc-db lvms-vg1 4h45m
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
default persistentvolumeclaim/pvc-db Bound pvc-1234 1Gi RWO lvms-vg1 4h45m
NAMESPACE NAME AGE
logicalvolume.topolvm.io/pvc-1234 4h45m
List the persistent volumes for your applications by running the following command after pivot:
$ oc get pv,pvc,logicalvolumes.topolvm.io -A
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/pvc-1234 1Gi RWO Delete Bound default/pvc-db lvms-vg1 19s
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
default persistentvolumeclaim/pvc-db Bound pvc-1234 1Gi RWO lvms-vg1 19s
NAMESPACE NAME AGE
logicalvolume.topolvm.io/pvc-1234 18s
The reason for this issue is that the logicalvolume status is not preserved in the Restore CR. This status is important because it is required for Velero to reference the volumes that must be preserved after pivoting. You must include the following fields in the application Restore CR:
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: sample-vote-app
  namespace: openshift-adp
  labels:
    velero.io/storage-location: default
  annotations:
    lca.openshift.io/apply-wave: "3"
spec:
  backupName: sample-vote-app
  restorePVs: true (1)
  restoreStatus: (2)
    includedResources:
      - logicalvolumes
(1) To preserve the persistent volumes for your application, you must set restorePVs to true.
(2) To preserve the persistent volumes for your application, you must configure this section as shown.
The backup or restoration of artifacts failed.
You can debug Backup and Restore CRs and retrieve logs with the Velero CLI tool. The Velero CLI tool provides more detailed information than the OpenShift CLI tool.
Describe the Backup CR that contains errors by running the following command:
$ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero describe backup -n openshift-adp backup-acm-klusterlet --details
Describe the Restore CR that contains errors by running the following command:
$ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero describe restore -n openshift-adp restore-acm-klusterlet --details
Download the backed up resources to a local directory by running the following command:
$ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero backup download -n openshift-adp backup-acm-klusterlet -o ~/backup-acm-klusterlet.tar.gz
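You can then extract the downloaded archive locally to inspect the backed-up manifests, for example:
$ mkdir -p backup-acm-klusterlet && tar -xzf ~/backup-acm-klusterlet.tar.gz -C backup-acm-klusterlet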