Disaster recovery


OKD Virtualization supports using disaster recovery (DR) solutions to ensure that your environment can recover after a site outage. To use these methods, you must plan your OKD Virtualization deployment in advance.

About disaster recovery methods

The two primary DR methods for OKD Virtualization are Metropolitan Disaster Recovery (Metro-DR) and Regional-DR.

For an overview of disaster recovery (DR) concepts, architecture, and planning considerations, see the Red Hat OKD Virtualization disaster recovery guide in the Red Hat Knowledgebase.

Metro-DR

Metro-DR uses synchronous replication. It writes to storage at both the primary and secondary sites so that the data is always synchronized between sites. Because the storage provider is responsible for ensuring that the synchronization succeeds, the environment must meet the throughput and latency requirements of the storage provider.

Regional-DR

Regional-DR uses asynchronous replication. The data in the primary site is synchronized with the secondary site at regular intervals. For this type of replication, you can have a higher latency connection between the primary and secondary sites.

Defining applications for disaster recovery

Define applications for disaster recovery by using VMs that Red Hat Advanced Cluster Management (RHACM) manages or discovers.

Best practices when defining an RHACM-managed VM

When creating an RHACM-managed application that includes a VM, you must use a GitOps workflow and create an RHACM application or ApplicationSet resource.

You can take several actions to improve your experience and chance of success when defining an RHACM-managed VM.

- Use a PVC and populator to define storage for the VM: Because data volumes create persistent volume claims (PVCs) implicitly, data volumes and VMs with data volume templates do not fit as neatly into the GitOps model.
- Use the import method when choosing a population source for your VM disk: Select a Fedora image from the software catalog to use the import method. Red Hat recommends using a specific version of the image rather than a floating tag for consistent results. The KubeVirt community maintains container disks for other operating systems in a Quay repository.
- Use pullMethod: node: Set pullMethod: node when creating a data volume from a registry source so that the import uses the OKD pull secret, which is required to pull container images from the Red Hat registry.
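The three recommendations above can be combined in one pair of manifests. The following Python sketch builds them as plain dictionaries so that they can be committed to a Git repository as JSON. It assumes the upstream Containerized Data Importer (CDI) populator API (a VolumeImportSource object referenced from the dataSourceRef of a PVC). The function name, object names, namespace, image tag, access mode, and disk size are hypothetical placeholders, not values from this documentation.

```python
import json


def registry_import_manifests(name, namespace, image_url, size, pull_method="node"):
    """Return [VolumeImportSource, PersistentVolumeClaim] as dictionaries.

    Assumption: field names follow the upstream CDI populator API.
    """
    if pull_method not in ("pod", "node"):
        raise ValueError("pull_method must be 'pod' or 'node'")

    source_name = f"{name}-source"
    import_source = {
        "apiVersion": "cdi.kubevirt.io/v1beta1",
        "kind": "VolumeImportSource",
        "metadata": {"name": source_name, "namespace": namespace},
        "spec": {
            "source": {
                # pullMethod "node" pulls the image through the node's
                # container runtime, so the cluster pull secret applies.
                "registry": {"url": image_url, "pullMethod": pull_method}
            }
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],  # assumption; match your storage class
            "resources": {"requests": {"storage": size}},
            # The populator fills the PVC from the import source above.
            "dataSourceRef": {
                "apiGroup": "cdi.kubevirt.io",
                "kind": "VolumeImportSource",
                "name": source_name,
            },
        },
    }
    return [import_source, pvc]


if __name__ == "__main__":
    # Pinned image tag (illustrative) instead of a floating tag such as "latest".
    manifests = registry_import_manifests(
        name="fedora-vm-disk",
        namespace="vm-apps",
        image_url="docker://quay.io/containerdisks/fedora:40",
        size="30Gi",
    )
    print(json.dumps(manifests, indent=2))
```

Because the PVC is declared explicitly, every object that the VM depends on exists in Git, which is the property the GitOps workflow relies on.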

Best practices when defining an RHACM-discovered VM

You can configure any VM in the cluster that is not an RHACM-managed application as an RHACM-discovered application. This includes VMs imported by using the Migration Toolkit for Virtualization (MTV), VMs created by using the OKD web console, or VMs created by any other means, such as the CLI.

You can take several actions to improve your experience and chance of success when defining an RHACM-discovered VM.

Protecting the VM when using MTV, the OKD web console, or a custom VM

Because automatic labeling is not currently available, the application owner must manually label the components of the VM application when using MTV, the OKD web console, or a custom VM.

After creating the VM, apply a common label to the following resources associated with the VM: VirtualMachine, DataVolume, PersistentVolumeClaim, Service, Route, Secret, and ConfigMap. If the VM uses an instance type or preference, you must also label the ControllerRevision copies of these objects that the spec or status of the VM references. Do not label virtual machine instances (VMIs) or pods; OKD Virtualization creates and manages these automatically.

You must apply the common label to everything in the namespace that you want to protect, including objects that you added to the VM that are not listed here.
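As a rough illustration of the manual labeling step, the following Python sketch builds one `oc label` command per resource and skips the kinds that must not be labeled. It only constructs the commands and does not run them. The helper name, the label key and value, the namespace, and the resource names are hypothetical.

```python
# Kinds that OKD Virtualization creates and manages automatically.
SKIPPED_KINDS = {"virtualmachineinstance", "pod"}


def label_commands(namespace, resources, label):
    """Build `oc label` argument lists for (kind, name) pairs.

    `label` is a "key=value" string. VMIs and pods are skipped.
    """
    commands = []
    for kind, name in resources:
        if kind.lower() in SKIPPED_KINDS:
            continue
        commands.append(
            ["oc", "label", f"{kind.lower()}/{name}", "-n", namespace, label]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical inventory of one VM application.
    resources = [
        ("VirtualMachine", "db-vm"),
        ("PersistentVolumeClaim", "db-vm-rootdisk"),
        ("Service", "db-vm-ssh"),
        ("Secret", "db-vm-cloudinit"),
        ("Pod", "virt-launcher-db-vm-abcde"),  # skipped
    ]
    for cmd in label_commands("vm-apps", resources, "appname=db-vm"):
        print(" ".join(cmd))
```

Each argument list can be passed to `subprocess.run` after review. Keeping the inventory in one place makes it easier to confirm that every object you want to protect carries the same label.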

Including more than the VirtualMachine object in the VM

Working VMs typically also contain data volumes, persistent volume claims (PVCs), services, routes, secrets, ConfigMap objects, and VirtualMachineSnapshot objects.

Including the VM as part of a larger logical application

The larger application can include other pod-based workloads and other VMs.

VM behavior during disaster recovery scenarios

VMs typically act similarly to pod-based workloads during both relocate and failover disaster recovery flows.

Relocate

Use relocate to move an application from the primary environment to the secondary environment when the primary environment is still accessible. During relocate, the VM is gracefully terminated, any unreplicated data is synchronized to the secondary environment, and the VM starts in the secondary environment.

Because the VM terminates gracefully, there is no data loss. Therefore, the VM operating system will not perform crash recovery.

Failover

Use failover when there is a critical failure in the primary environment that makes it impractical or impossible to use relocation to move the workload to a secondary environment. When failover is executed, the storage is fenced from the primary environment, the I/O to the VM disks is abruptly halted, and the VM restarts in the secondary environment using the replicated data.

You should expect data loss due to failover. The extent of loss depends on whether you use Metro-DR, which uses synchronous replication, or Regional-DR, which uses asynchronous replication. Because Regional-DR uses snapshot-based replication intervals, the window of data loss is proportional to the replication interval length. When the VM restarts, the operating system might perform crash recovery.
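To make the Regional-DR relationship concrete, the following Python sketch estimates an upper bound on lost writes as the replication interval multiplied by the sustained write rate. This is a simplified model, not a formula from the storage provider: it ignores the time needed to transfer each snapshot, and the interval and write rate in the example are assumed values.

```python
def worst_case_loss_mib(interval_minutes, write_rate_mib_per_s):
    """Rough upper bound on data written since the last replicated snapshot."""
    if interval_minutes < 0 or write_rate_mib_per_s < 0:
        raise ValueError("inputs must be non-negative")
    return interval_minutes * 60 * write_rate_mib_per_s


if __name__ == "__main__":
    # Assumed values: 5-minute replication interval, 2 MiB/s sustained writes.
    print(worst_case_loss_mib(5, 2))  # 600
```

Doubling the replication interval doubles the estimate, which is the proportionality described above.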

Disaster recovery solutions for Red Hat managed clusters

The following DR solutions combine Red Hat Advanced Cluster Management (RHACM), Red Hat Ceph Storage, and OpenShift Data Foundation components. You can use them to fail over applications from the primary site to the secondary site, and to relocate the applications back to the primary site after you restore the disaster site.

Metro-DR for Red Hat OpenShift Data Foundation

OKD Virtualization supports the Metro-DR solution for OpenShift Data Foundation, which provides two-way synchronous data replication between managed OKD Virtualization clusters installed on primary and secondary sites.

Metro-DR differences

- This synchronous solution is available only to data centers within metropolitan distance of each other, with a network round-trip latency of 10 milliseconds or less.
- Multiple-disk VMs are supported.
- To prevent data corruption, you must ensure that storage is fenced during failover.

Fencing means isolating a node so that workloads do not run on it.
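The latency requirement can be checked during planning. The following Python sketch compares measured round-trip times between the two sites against the 10 millisecond limit stated above. Using the worst observed sample as the deciding value is an assumption of this sketch, and the function name and sample values are hypothetical.

```python
METRO_DR_MAX_RTT_MS = 10.0  # limit stated for the Metro-DR solution


def suggest_dr_method(rtt_samples_ms):
    """Return "Metro-DR" if every sample meets the limit, else "Regional-DR"."""
    samples = list(rtt_samples_ms)
    if not samples:
        raise ValueError("at least one round-trip sample is required")
    # Assumption: judge by the worst sample, because synchronous writes
    # wait on the slowest round trip.
    worst = max(samples)
    return "Metro-DR" if worst <= METRO_DR_MAX_RTT_MS else "Regional-DR"


if __name__ == "__main__":
    print(suggest_dr_method([3.2, 4.8, 9.9]))  # Metro-DR
    print(suggest_dr_method([8.0, 12.5]))      # Regional-DR
```

A site pair that fails this check might still be a candidate for Regional-DR, which tolerates higher latency.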

For more information about using the Metro-DR solution for OpenShift Data Foundation with OKD Virtualization, see IBM’s OpenShift Data Foundation Metro-DR documentation.

Regional-DR for Red Hat OpenShift Data Foundation

OKD Virtualization supports the Regional-DR solution for OpenShift Data Foundation, which provides asynchronous data replication at regular intervals between managed OKD Virtualization clusters installed on primary and secondary sites.

Regional-DR differences

- Regional-DR supports higher network latency between the primary and secondary sites.
- Regional-DR uses RBD snapshots to replicate data asynchronously. Currently, your applications must be resilient to small variances between VM disks. You can prevent these variances by using single-disk VMs.
- Use the import method when selecting a population source for your VM disk. However, you can protect VMs that use cloned PVCs if you select a VolumeReplicationClass that enables image flattening. For more information, see the OpenShift Data Foundation documentation.

For more information about using the Regional-DR solution for OpenShift Data Foundation with OKD Virtualization, see IBM’s OpenShift Data Foundation Regional-DR documentation.