In a highly available control plane, three etcd pods run as part of a stateful set in an etcd cluster. To recover an etcd cluster, first identify unhealthy etcd pods by checking the etcd cluster health.
You can check the health of the etcd cluster by logging in to any etcd pod.
Log in to an etcd pod by entering the following command:
$ oc rsh -n <hosted_control_plane_namespace> -c etcd <etcd_pod_name>
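If you are not sure of the etcd pod name, you can list the etcd pods first by using the same label selector that is used later in this procedure, for example:
$ oc get pods -l app=etcd -n <hosted_control_plane_namespace>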
Print the health status of an etcd cluster by entering the following command:
sh-4.4$ etcdctl endpoint health --cluster -w table
ENDPOINT                                                 HEALTH   TOOK         ERROR
https://etcd-0.etcd-discovery.clusters-hosted.svc:2379   true     9.117698ms
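If a member reports an error or is missing from the output, you can gather more detail from inside the same pod. As a minimal sketch, assuming the etcdctl environment that the pod already provides, the following commands list the cluster members and the status of each endpoint:
sh-4.4$ etcdctl member list -w table
sh-4.4$ etcdctl endpoint status --cluster -w table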
Each etcd pod of a 3-node cluster has its own persistent volume claim (PVC) to store its data. An etcd pod might fail because of corrupted or missing data. You can recover a failing etcd pod and its PVC.
To confirm that the etcd pod is failing, enter the following command:
$ oc get pods -l app=etcd -n <hosted_control_plane_namespace>
NAME     READY   STATUS             RESTARTS     AGE
etcd-0   2/2     Running            0            64m
etcd-1   2/2     Running            0            45m
etcd-2   1/2     CrashLoopBackOff   1 (5s ago)   64m
The failing etcd pod might have the CrashLoopBackOff or Error status.
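Before you delete the failing pod, identify the PVC that belongs to it. As a sketch, the following command lists the PVCs in the namespace; this assumes the usual stateful set naming pattern of <claim_template_name>-<pod_name>, so the claim for etcd-2 typically has a name such as data-etcd-2, but confirm the actual name in your environment:
$ oc get pvc -n <hosted_control_plane_namespace>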
Delete the failing pod and its PVC by entering the following command:
$ oc delete pvc/<etcd_pvc_name> pod/<etcd_pod_name> --wait=false
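For example, if etcd-2 is the failing pod and its claim is named data-etcd-2 (a hypothetical name; use the PVC name that you identified in your namespace), the command might look like this:
$ oc delete pvc/data-etcd-2 pod/etcd-2 --wait=false -n <hosted_control_plane_namespace>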
Verify that a new etcd pod is up and running by entering the following command:
$ oc get pods -l app=etcd -n <hosted_control_plane_namespace>
NAME     READY   STATUS    RESTARTS   AGE
etcd-0   2/2     Running   0          67m
etcd-1   2/2     Running   0          48m
etcd-2   2/2     Running   0          2m2s
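Optionally, you can confirm from inside one of the running etcd pods that the recovered member has rejoined the cluster and is healthy, for example by repeating the health check from the beginning of this procedure:
sh-4.4$ etcdctl endpoint health --cluster -w table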