Disaster recovery for a hosted cluster by using OADP - High availability for hosted control planes | Hosted control planes

Prerequisites
Preparing AWS to use OADP
Preparing bare metal to use OADP
Backing up the data plane workload
Backing up the control plane workload
Restoring a hosted cluster by using OADP
Observing the backup and restore process
Using the velero CLI to describe the Backup and Restore resources

You can use the OpenShift API for Data Protection (OADP) Operator to perform disaster recovery on Amazon Web Services (AWS) and bare metal.

The disaster recovery process with OpenShift API for Data Protection (OADP) involves the following steps:

Preparing your platform, such as Amazon Web Services or bare metal, to use OADP
Backing up the data plane workload
Backing up the control plane workload
Restoring a hosted cluster by using OADP

Prerequisites

You must meet the following prerequisites on the management cluster:

You installed the OADP Operator.
You created a storage class.
You have access to the cluster with cluster-admin privileges.
You have access to the OADP subscription through a catalog source.
You have access to a cloud storage provider that is compatible with OADP, such as S3, Microsoft Azure, Google Cloud Platform, or MinIO.
In a disconnected environment, you have access to a self-hosted storage provider, for example Red Hat OpenShift Data Foundation or MinIO, that is compatible with OADP.
Your hosted control planes pods are up and running.

Preparing AWS to use OADP

To perform disaster recovery for a hosted cluster, you can use OpenShift API for Data Protection (OADP) on Amazon Web Services (AWS) S3 compatible storage. After creating the DataProtectionApplication object, new velero deployment and node-agent pods are created in the openshift-adp namespace.

To prepare AWS to use OADP, see "Configuring the OpenShift API for Data Protection with Multicloud Object Gateway".

Additional resources

Configuring the OpenShift API for Data Protection with Multicloud Object Gateway

Next steps

Backing up the data plane workload
Backing up the control plane workload

Preparing bare metal to use OADP

To perform disaster recovery for a hosted cluster, you can use OpenShift API for Data Protection (OADP) on bare metal. After creating the DataProtectionApplication object, new velero deployment and node-agent pods are created in the openshift-adp namespace.

To prepare bare metal to use OADP, see "Configuring the OpenShift API for Data Protection with AWS S3 compatible storage".

Additional resources

Configuring the OpenShift API for Data Protection with AWS S3 compatible storage

Next steps

Backing up the data plane workload
Backing up the control plane workload

Backing up the data plane workload

If the data plane workload is not important, you can skip this procedure. To back up the data plane workload by using the OADP Operator, see "Backing up applications".

Additional resources

Backing up applications

Next steps

Restoring a hosted cluster by using OADP

Backing up the control plane workload

You can back up the control plane workload by creating the Backup custom resource (CR).

To monitor and observe the backup process, see "Observing the backup and restore process".

Procedure

Pause the reconciliation of the HostedCluster resource by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  patch hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> \
  --type json -p '[{"op": "add", "path": "/spec/pausedUntil", "value": "true"}]'

Get the infrastructure ID of your hosted cluster by running the following command:
```
$ oc get hostedcluster -n local-cluster <hosted_cluster_name> -o=jsonpath="{.spec.infraID}"
```
Note the infrastructure ID to use in the next step.

Pause the reconciliation of the cluster.cluster.x-k8s.io resource by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  patch cluster.cluster.x-k8s.io \
  -n local-cluster-<hosted_cluster_name> <hosted_cluster_infra_id> \
  --type json -p '[{"op": "add", "path": "/spec/paused", "value": true}]'

Pause the reconciliation of the NodePool resource by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  patch nodepool -n <hosted_cluster_namespace> <node_pool_name> \
  --type json -p '[{"op": "add", "path": "/spec/pausedUntil", "value": "true"}]'

Pause the reconciliation of the AgentCluster resource by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  annotate agentcluster -n <hosted_control_plane_namespace>  \
  cluster.x-k8s.io/paused=true --all'

Pause the reconciliation of the AgentMachine resource by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  annotate agentmachine -n <hosted_control_plane_namespace>  \
  cluster.x-k8s.io/paused=true --all'

Annotate the HostedCluster resource to prevent the deletion of the hosted control plane namespace by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  annotate hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> \
  hypershift.openshift.io/skip-delete-hosted-controlplane-namespace=true

Create a YAML file that defines the Backup CR:

Example backup-control-plane.yaml file

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: <backup_resource_name> (1)
  namespace: openshift-adp
  labels:
    velero.io/storage-location: default
spec:
  hooks: {}
  includedNamespaces: (2)
  - <hosted_cluster_namespace> (3)
  - <hosted_control_plane_namespace> (4)
  includedResources:
  - sa
  - role
  - rolebinding
  - pod
  - pvc
  - pv
  - bmh
  - configmap
  - infraenv (5)
  - priorityclasses
  - pdb
  - agents
  - hostedcluster
  - nodepool
  - secrets
  - hostedcontrolplane
  - cluster
  - agentcluster
  - agentmachinetemplate
  - agentmachine
  - machinedeployment
  - machineset
  - machine
  excludedResources: []
  storageLocation: default
  ttl: 2h0m0s
  snapshotMoveData: true (6)
  datamover: "velero" (6)
  defaultVolumesToFsBackup: true (7)

1	Replace `backup_resource_name` with the name of your `Backup` resource.
2	Selects specific namespaces to back up objects from them. You must include your hosted cluster namespace and the hosted control plane namespace.
3	Replace `<hosted_cluster_namespace>` with the name of the hosted cluster namespace, for example, `clusters`.
4	Replace `<hosted_control_plane_namespace>` with the name of the hosted control plane namespace, for example, `clusters-hosted`.
5	You must create the `infraenv` resource in a separate namespace. Do not delete the `infraenv` resource during the backup process.
6	Enables the CSI volume snapshots and uploads the control plane workload automatically to the cloud storage.
7	Sets the `fs-backup` backing up method for persistent volumes (PVs) as default. This setting is useful when you use a combination of Container Storage Interface (CSI) volume snapshots and the `fs-backup` method.

If you want to use CSI volume snapshots, you must add the backup.velero.io/backup-volumes-excludes=<pv_name> annotation to your PVs.

Apply the Backup CR by running the following command:
```
$ oc apply -f backup-control-plane.yaml
```

Verification

Verify if the value of the status.phase is Completed by running the following command:

$ oc get backups.velero.io <backup_resource_name> -n openshift-adp \
  -o jsonpath='{.status.phase}'

Next steps

Restoring a hosted cluster by using OADP

Restoring a hosted cluster by using OADP

You can restore the hosted cluster by creating the Restore custom resource (CR).

If you are using an in-place update, InfraEnv does not need spare nodes. You need to re-provision the worker nodes from the new management cluster.
If you are using a replace update, you need some spare nodes for InfraEnv to deploy the worker nodes.

After you back up your hosted cluster, you must destroy it to initiate the restoring process. To initiate node provisioning, you must back up workloads in the data plane before deleting the hosted cluster.

Prerequisites

You completed the steps in Removing a cluster by using the console to delete your hosted cluster.
You completed the steps in Removing remaining resources after removing a cluster.

To monitor and observe the backup process, see "Observing the backup and restore process".

Procedure

Verify that no pods and persistent volume claims (PVCs) are present in the hosted control plane namespace by running the following command:
```
$ oc get pod pvc -n <hosted_control_plane_namespace>
```
Expected output
```
No resources found
```

Create a YAML file that defines the Restore CR:

Example restore-hosted-cluster.yaml file

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: <restore_resource_name> (1)
  namespace: openshift-adp
spec:
  backupName: <backup_resource_name> (2)
  restorePVs: true (3)
  existingResourcePolicy: update (4)
  excludedResources:
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io

1	Replace `<restore_resource_name>` with the name of your `Restore` resource.
2	Replace `<backup_resource_name>` with the name of your `Backup` resource.
3	Initiates the recovery of persistent volumes (PVs) and its pods.
4	Ensures that the existing objects are overwritten with the backed up content.

You must create the infraenv resource in a separate namespace. Do not delete the infraenv resource during the restore process. The infraenv resource is mandatory for the new nodes to be reprovisioned.

Apply the Restore CR by running the following command:
```
$ oc apply -f restore-hosted-cluster.yaml
```

Verify if the value of the status.phase is Completed by running the following command:

$ oc get hostedcluster <hosted_cluster_name> -n <hosted_cluster_namespace> \
  -o jsonpath='{.status.phase}'

After the restore process is complete, start the reconciliation of the HostedCluster and NodePool resources that you paused during backing up of the control plane workload:

Start the reconciliation of the HostedCluster resource by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  patch hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> \
  --type json \
  -p '[{"op": "add", "path": "/spec/pausedUntil", "value": "false"}]'

Start the reconciliation of the NodePool resource by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  patch nodepool -n <hosted_cluster_namespace> <node_pool_name> \
  --type json \
  -p '[{"op": "add", "path": "/spec/pausedUntil", "value": "false"}]'

Start the reconciliation of the Agent provider resources that you paused during backing up of the control plane workload:

Start the reconciliation of the AgentCluster resource by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  annotate agentcluster -n <hosted_control_plane_namespace>  \
  cluster.x-k8s.io/paused- --overwrite=true --all

Start the reconciliation of the AgentMachine resource by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  annotate agentmachine -n <hosted_control_plane_namespace>  \
  cluster.x-k8s.io/paused- --overwrite=true --all

Remove the hypershift.openshift.io/skip-delete-hosted-controlplane-namespace- annotation in the HostedCluster resource to avoid manually deleting the hosted control plane namespace by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  annotate hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> \
  hypershift.openshift.io/skip-delete-hosted-controlplane-namespace- \
  --overwrite=true --all

Scale the NodePool resource to the desired number of replicas by running the following command:

$ oc --kubeconfig <management_cluster_kubeconfig_file> \
  scale nodepool -n <hosted_cluster_namespace> <node_pool_name> \
  --replicas <replica_count> (1)

1	Replace `<replica_count>` by an integer value, for example, `3`.

Observing the backup and restore process

When using OpenShift API for Data Protection (OADP) to backup and restore a hosted cluster, you can monitor and observe the process.

Procedure

Observe the backup process by running the following command:

$ watch "oc get backups.velero.io -n openshift-adp <backup_resource_name> -o jsonpath='{.status}'"

Observe the restore process by running the following command:

$ watch "oc get restores.velero.io -n openshift-adp <backup_resource_name> -o jsonpath='{.status}'"

Observe the Velero logs by running the following command:
```
$ oc logs -n openshift-adp -ldeploy=velero -f
```

Observe the progress of all of the OADP objects by running the following command:

$ watch "echo BackupRepositories:;echo;oc get backuprepositories.velero.io -A;echo; echo BackupStorageLocations: ;echo; oc get backupstoragelocations.velero.io -A;echo;echo DataUploads: ;echo;oc get datauploads.velero.io -A;echo;echo DataDownloads: ;echo;oc get datadownloads.velero.io -n openshift-adp; echo;echo VolumeSnapshotLocations: ;echo;oc get volumesnapshotlocations.velero.io -A;echo;echo Backups:;echo;oc get backup -A; echo;echo Restores:;echo;oc get restore -A"

Using the velero CLI to describe the Backup and Restore resources

When using OpenShift API for Data Protection, you can get more details of the Backup and Restore resources by using the velero command-line interface (CLI).

Procedure

Create an alias to use the velero CLI from a container by running the following command:

$ alias velero='oc -n openshift-adp exec deployment/velero -c velero -it -- ./velero'

Get details of your Restore custom resource (CR) by running the following command:
```
$ velero restore describe <restore_resource_name> --details (1)
```
1 Replace <restore_resource_name> with the name of your Restore resource.
Get details of your Backup CR by running the following command:
```
$ velero restore describe <backup_resource_name> --details (1)
```
1 Replace <backup_resource_name> with the name of your Backup resource.