Control plane backup and restore operations

As a cluster administrator, you might need to stop an OKD cluster for a period and restart it later. Common reasons for shutting down a cluster are to perform maintenance or to reduce resource costs. In OKD, you can perform a graceful shutdown of a cluster so that you can easily restart the cluster later.

You must back up etcd data before shutting down a cluster. etcd is the key-value store for OKD and persists the state of all resource objects, so an etcd backup plays a crucial role in disaster recovery. In OKD, you can also replace an unhealthy etcd member.

When you want to get your cluster running again, restart the cluster gracefully.

A cluster’s certificates expire one year after the installation date. While the certificates are still valid, you can shut down a cluster and expect it to restart gracefully. If you restart the cluster after the certificates have expired, the cluster automatically retrieves the expired control plane certificates, but you must still approve the certificate signing requests (CSRs).
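The certificate timeline above can be sketched as simple date arithmetic. This is only an illustration, assuming the one-year validity period stated above; the function names `certificate_expiry` and `expect_manual_csr_approval` are hypothetical and are not part of OKD or the `oc` CLI.

```python
from datetime import date


def certificate_expiry(install_date: date) -> date:
    """Return the date one calendar year after installation (assumed validity)."""
    try:
        return install_date.replace(year=install_date.year + 1)
    except ValueError:
        # Installed on Feb 29: the following year has no such day, so use Feb 28.
        return install_date.replace(year=install_date.year + 1, day=28)


def expect_manual_csr_approval(install_date: date, restart_date: date) -> bool:
    """True if the planned restart falls on or after the assumed expiry date."""
    return restart_date >= certificate_expiry(install_date)


if __name__ == "__main__":
    installed = date(2023, 3, 1)
    planned_restart = date(2024, 6, 1)
    print("Certificates expire on:", certificate_expiry(installed))
    print("Expect to approve CSRs:", expect_manual_csr_approval(installed, planned_restart))
```

A check like this might help when planning a long shutdown: if the planned restart date falls past the expiry date, allow time to approve pending CSRs after the restart.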

You might run into several situations where OKD does not work as expected, such as:

  • You have a cluster that is not functional after the restart because of unexpected conditions, such as node failure or network connectivity issues.

  • You have deleted something critical in the cluster by mistake.

  • You have lost the majority of your control plane hosts, leading to etcd quorum loss.
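The quorum loss in the last scenario follows from how etcd reaches consensus: a majority of members, that is floor(n/2) + 1 out of n, must be available. A minimal sketch of that arithmetic follows; the helper names `quorum_size` and `has_quorum` are illustrative only and do not correspond to any OKD or etcd API.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members still form a majority of the cluster."""
    return healthy >= quorum_size(members)


if __name__ == "__main__":
    # A three-member control plane tolerates one lost host but not two.
    for healthy in (3, 2, 1):
        print(f"3 members, {healthy} healthy -> quorum: {has_quorum(3, healthy)}")
```

With three control plane hosts, losing one leaves two healthy members, which is still a majority. Losing two leaves a single member, and etcd can no longer accept writes until the cluster is restored.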

As long as you have a saved etcd snapshot, you can recover from a disaster situation by restoring your cluster to its previous state from that snapshot.