Restic issues - OADP Application backup and restore | Backup and restore

Table of Contents

Troubleshooting Restic permission errors for NFS data volumes
Troubleshooting Restic Backup CR issue that cannot be re-created after bucket is emptied
Troubleshooting restic restore partially failed issue on OpenShift Container Platform 4.14 onward due to changed PSA policy

Troubleshoot common Restic issues during application backups and restores to maintain reliable data protection. Common Restic issues include NFS permission errors, backup custom resource re-creation failures, and restore failures caused by pod security admission policy changes.

Troubleshooting Restic permission errors for NFS data volumes

Create a supplemental group and add its group ID to the DataProtectionApplication customer resource CR to resolve Restic permission errors on NFS data volumes with root_squash enabled. This helps you to restore backup functionality for NFS volumes without disabling root squash.

If your NFS data volumes have the root_squash parameter enabled, Restic maps set to the nfsnobody value, and do not have permission to create backups, the Restic pod log displays the following error message:

controller=pod-volume-backup error="fork/exec/usr/bin/restic: permission denied".

Procedure

Create a supplemental group for Restic on the NFS data volume.
Set the setgid bit on the NFS directories so that group ownership is inherited.

Add the spec.configuration.nodeAgent.supplementalGroups parameter and the group ID to the DataProtectionApplication manifest, as shown in the following example:

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  configuration:
    nodeAgent:
      enable: true
      uploaderType: restic
      supplementalGroups:
      - <group_id>
# ...

where:

<group_id>: Specifies the supplemental group ID.

Wait for the Restic pods to restart so that the changes are applied.

Troubleshooting Restic Backup CR issue that cannot be re-created after bucket is emptied

Resolve the Backup custom resource (CR) re-creation failure that occurs after you empty the object storage bucket. This failure occurs because Velero does not automatically re-create the Restic repository from the ResticRepository manifest.

For more information, see Velero issue 4421.

The velero pod log displays the following error message:

stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?

Procedure

Remove the related Restic repository from the namespace by running the following command:

$ oc delete resticrepository openshift-adp <name_of_the_restic_repository>

In the following error log, mysql-persistent is the problematic Restic repository. The name of the repository is displayed in italics for clarity.

 time="2021-12-29T18:29:14Z" level=info msg="1 errors
 encountered backup up item" backup=velero/backup65
 logSource="pkg/backup/backup.go:431" name=mysql-7d99fc949-qbkds
 time="2021-12-29T18:29:14Z" level=error msg="Error backing up item"
 backup=velero/backup65 error="pod volume backup failed: error running
 restic backup, stderr=Fatal: unable to open config file: Stat: The
 specified key does not exist.\nIs there a repository at the following
 location?\ns3:http://minio-minio.apps.mayap-oadp-
 veleo-1234.qe.devcluster.openshift.com/mayapvelerooadp2/velero1/
 restic/mysql-persistent\n: exit status 1" error.file="/remote-source/
 src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:184"
 error.function="github.com/vmware-tanzu/velero/
 pkg/restic.(*backupper).BackupPodVolumes"
 logSource="pkg/backup/backup.go:435" name=mysql-7d99fc949-qbkds

Troubleshooting restic restore partially failed issue on OpenShift Container Platform 4.14 onward due to changed PSA policy

Resolve a partial failure of Restic restore on OpenShift Container Platform 4.14 onward caused by Pod Security Admission (PSA) policy enforcement by adjusting the restore-resource-priorities field in your DataProtectionApplication (DPA) custom resource (CR). By doing so, you ensure that SecurityContextConstraints (SCC) resources are restored before pods. This helps you to complete restore operations successfully when PSA policies deny pod admission due to Velero resource restore order.

From 4.14 onward, OpenShift Container Platform enforces a PSA policy that can hinder the readiness of pods during a Restic restore process. If an SCC resource is not found when a pod is created, and the PSA policy on the pod is not set up to meet the required standards, pod admission is denied.

Review the following example error:

\"level=error\" in line#2273: time=\"2023-06-12T06:50:04Z\"
level=error msg=\"error restoring mysql-869f9f44f6-tp5lv: pods\\\
"mysql-869f9f44f6-tp5lv\\\" is forbidden: violates PodSecurity\\\
"restricted:v1.24\\\": privil eged (container \\\"mysql\\\
" must not set securityContext.privileged=true),
allowPrivilegeEscalation != false (containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.capabilities.drop=[\\\"ALL\\\"]), seccompProfile (pod or containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.seccompProfile.type to \\\
"RuntimeDefault\\\" or \\\"Localhost\\\")\" logSource=\"/remote-source/velero/app/pkg/restore/restore.go:1388\" restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n
velero container contains \"level=error\" in line#2447: time=\"2023-06-12T06:50:05Z\"
level=error msg=\"Namespace todolist-mariadb,
resource restore error: error restoring pods/todolist-mariadb/mysql-869f9f44f6-tp5lv: pods \\\
"mysql-869f9f44f6-tp5lv\\\" is forbidden: violates PodSecurity \\\"restricted:v1.24\\\": privileged (container \\\
"mysql\\\" must not set securityContext.privileged=true),
allowPrivilegeEscalation != false (containers \\\
"restic-wait\\\",\\\"mysql\\\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.capabilities.drop=[\\\"ALL\\\"]), seccompProfile (pod or containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.seccompProfile.type to \\\
"RuntimeDefault\\\" or \\\"Localhost\\\")\"
logSource=\"/remote-source/velero/app/pkg/controller/restore_controller.go:510\"
restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n]",

Procedure

In your DPA custom resource (CR), check or set the restore-resource-priorities field on the Velero server to ensure that securitycontextconstraints is listed in order before pods in the list of resources:

$ oc get dpa -o yaml

# ...
configuration:
  restic:
    enable: true
  velero:
    args:
      restore-resource-priorities: 'securitycontextconstraints,customresourcedefinitions,namespaces,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io'
    defaultPlugins:
    - gcp
    - openshift

where:

restore-resource-priorities: If you have an existing restore resource priority list, ensure you combine that existing list with the complete list.

Ensure that the security standards for the application pods are aligned, as provided in Fixing PodSecurity Admission warnings for deployments, to prevent deployment warnings. If the application is not aligned with security standards, an error can occur regardless of the SCC.

This solution is temporary, and ongoing discussions are in progress to address it.

Additional resources

Fixing PodSecurity Admission warnings for deployments