You can debug Velero custom resources (CRs) by using the OpenShift CLI tool or the Velero CLI tool. The Velero CLI tool provides more detailed logs and information.
You can check installation issues, backup and restore CR issues, and Restic issues.
You can collect logs and CR information by using the must-gather tool.
You can obtain the Velero CLI tool by:
Downloading the Velero CLI tool
Accessing the Velero binary in the Velero deployment in the cluster
You can download and install the Velero CLI tool by following the instructions on the Velero documentation page.
The page includes instructions for:
macOS by using Homebrew
GitHub
Windows by using Chocolatey
You have access to a Kubernetes cluster, v1.16 or later, with DNS and container networking enabled.
You have installed kubectl locally.
Open a browser and navigate to "Install the CLI" on the Velero website.
Follow the appropriate procedure for macOS, GitHub, or Windows.
Download the Velero version appropriate for your version of OADP and OKD.
OADP version | Velero version | OKD version |
---|---|---|
1.1.0 | | 4.9 and later |
1.1.1 | | 4.9 and later |
1.1.2 | | 4.9 and later |
1.1.3 | | 4.9 and later |
1.1.4 | | 4.9 and later |
1.1.5 | | 4.9 and later |
1.1.6 | | 4.11 and later |
1.1.7 | | 4.11 and later |
1.2.0 | | 4.11 and later |
1.2.1 | | 4.11 and later |
1.2.2 | | 4.11 and later |
1.2.3 | | 4.11 and later |
1.3.0 | | 4.10 - 4.15 |
1.3.1 | | 4.10 - 4.15 |
1.3.2 | | 4.10 - 4.15 |
1.3.3 | | 4.10 - 4.15 |
1.4.0 | | 4.14 and later |
1.4.1 | | 4.14 and later |
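After you identify the matching Velero release from the table above, you can fetch the binary from the Velero GitHub releases page. The following is a minimal sketch for a Linux x86_64 workstation; the release URL pattern and install path are assumptions, and you substitute the Velero version that matches your OADP version:
$ VELERO_VERSION=v<velero_version>
$ curl -fsSL -o velero.tar.gz \
  https://github.com/vmware-tanzu/velero/releases/download/${VELERO_VERSION}/velero-${VELERO_VERSION}-linux-amd64.tar.gz
$ tar -xzf velero.tar.gz
$ sudo mv velero-${VELERO_VERSION}-linux-amd64/velero /usr/local/bin/velero
$ velero version --client-only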
You can use a shell command to access the Velero binary in the Velero deployment in the cluster.
Your DataProtectionApplication custom resource has a status of Reconcile complete.
Enter the following command to set the needed alias:
$ alias velero='oc -n openshift-adp exec deployment/velero -c velero -it -- ./velero'
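To confirm that the alias works, you can run any Velero CLI command through it. For example, the following command prints the Velero client version; the exact output depends on your OADP version:
$ velero version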
You can debug a failed backup or restore by checking Velero custom resources (CRs) and the Velero pod log with the OpenShift CLI tool.
Use the oc describe command to retrieve a summary of warnings and errors associated with a Backup or Restore CR:
$ oc describe <velero_cr> <cr_name>
Use the oc logs command to retrieve the Velero pod logs:
$ oc logs pod/<velero>
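If you do not want to look up the Velero pod name first, you can read the same logs through the deployment. This is an equivalent convenience form, not an additional requirement:
$ oc -n openshift-adp logs deployment/velero -c velero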
You can specify the Velero log level in the DataProtectionApplication resource as shown in the following example.
This option is available starting from OADP 1.0.3.
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: velero-sample
spec:
  configuration:
    velero:
      logLevel: warning
The following logLevel values are available:
trace
debug
info
warning
error
fatal
panic
It is recommended to use debug for most logs.
You can debug Backup and Restore custom resources (CRs) and retrieve logs with the Velero CLI tool.
The Velero CLI tool provides more detailed information than the OpenShift CLI tool.
Use the oc exec command to run a Velero CLI command:
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
<backup_restore_cr> <command> <cr_name>
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
Use the velero --help option to list all Velero CLI commands:
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
--help
Use the velero describe command to retrieve a summary of warnings and errors associated with a Backup or Restore CR:
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
<backup_restore_cr> describe <cr_name>
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
The following types of restore errors and warnings are shown in the output of a velero describe request:
Velero: A list of messages related to the operation of Velero itself, for example, messages related to connecting to the cloud, reading a backup file, and so on.
Cluster: A list of messages related to backing up or restoring cluster-scoped resources.
Namespaces: A list of messages related to backing up or restoring resources stored in namespaces.
One or more errors in one of these categories results in a Restore operation receiving the status of PartiallyFailed and not Completed. Warnings do not lead to a change in the completion status.
Use the velero logs command to retrieve the logs of a Backup or Restore CR:
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
<backup_restore_cr> logs <cr_name>
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf
If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources.
You can use the configuration.velero.podConfig.resourceAllocations specification field in the oadp_v1alpha1_dpa.yaml file to set specific resource requests for a Velero pod.
Set the cpu and memory resource requests in the YAML file:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
...
configuration:
  velero:
    podConfig:
      resourceAllocations: (1)
        requests:
          cpu: 200m
          memory: 256Mi
1 | The resourceAllocations listed are for average usage. |
You can use the configuration.restic.podConfig.resourceAllocations specification field to set specific resource requests for a Restic pod.
Set the cpu and memory resource requests in the YAML file:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
...
configuration:
  restic:
    podConfig:
      resourceAllocations: (1)
        requests:
          cpu: 1000m
          memory: 16Gi
1 | The resourceAllocations listed are for average usage. |
The values for the resource request fields must follow the same format as Kubernetes resource requirements. Also, if you do not specify the resource request values, the default resource specification for a Velero pod or a Restic pod is used.
The restore operation fails when there is more than one volume during an NFS restore by using Restic or Kopia. PodVolumeRestore either fails with the following error or keeps trying to restore before finally failing.
Velero: pod volume restore failed: data path restore failed: \
Failed to run kopia restore: Failed to copy snapshot data to the target: \
restore error: copy file: error creating file: \
open /host_pods/b4d...6/volumes/kubernetes.io~nfs/pvc-53...4e5/userdata/base/13493/2681: \
no such file or directory
The NFS mount path is not unique for the two volumes to restore. As a result, the velero lock files use the same file on the NFS server during the restore, causing the PodVolumeRestore to fail.
You can resolve this issue by setting up a unique pathPattern for each volume, while defining the StorageClass for nfs-subdir-external-provisioner in the deploy/class.yaml file. Use the following nfs-subdir-external-provisioner StorageClass example:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
provisioner: k8s-sigs.io/nfs-subdir-external-provisioner
parameters:
  pathPattern: "${.PVC.namespace}/${.PVC.annotations.nfs.io/storage-path}" (1)
  onDelete: delete
1 | Specifies a template for creating a directory path by using PVC metadata such as labels, annotations, name, or namespace. To specify metadata, use ${.PVC.<metadata>}. For example, to name a folder <pvc-namespace>-<pvc-name>, use ${.PVC.namespace}-${.PVC.name} as the pathPattern. |
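The pathPattern in the preceding example reads the nfs.io/storage-path annotation from each PVC. As an illustration only, a claim that uses this StorageClass might look like the following sketch; the namespace, claim name, and annotation value are hypothetical:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim
  namespace: example-namespace
  annotations:
    nfs.io/storage-path: "example-path" (1)
spec:
  storageClassName: nfs-client
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
1 | With the pathPattern shown above, this claim resolves to the example-namespace/example-path directory on the NFS server, giving each volume a unique path. |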
Velero has limited abilities to resolve admission webhook issues during a restore. If you have workloads with admission webhooks, you might need to use an additional Velero plugin or make changes to how you restore the workload.
Typically, workloads with admission webhooks require you to create a resource of a specific kind first. This is especially true if your workload has child resources because admission webhooks typically block child resources.
For example, creating or restoring a top-level object such as service.serving.knative.dev
typically creates child resources automatically. If you do this first, you will not need to use Velero to create and restore these resources. This avoids the problem of child resources being blocked by an admission webhook that Velero might use.
This section describes the additional steps required to restore resources for several types of Velero backups that use admission webhooks.
You might encounter problems using Velero to back up Knative resources that use admission webhooks.
You can avoid such problems by restoring the top level Service resource first whenever you back up and restore Knative resources that use admission webhooks.
Restore the top level service.serving.knative.dev Service resource:
$ velero restore <restore_name> \
--from-backup=<backup_name> --include-resources \
service.serving.knative.dev
If you experience issues when you use Velero to restore an IBM AppConnect resource that has an admission webhook, you can run the checks in this procedure.
Check if you have any mutating admission plugins of kind: MutatingWebhookConfiguration in the cluster:
$ oc get mutatingwebhookconfigurations
Examine the YAML file of each kind: MutatingWebhookConfiguration to ensure that none of its rules block the creation of the objects that are experiencing issues. For more information, see the official Kubernetes documentation.
Check that any spec.version in type: Configuration.appconnect.ibm.com/v1beta1 used at backup time is supported by the installed Operator.
The following section describes known issues in OpenShift API for Data Protection (OADP) plugins:
When the backup and the Backup Storage Location (BSL) are managed outside the scope of the Data Protection Application (DPA), the OADP controller (that is, the DPA reconciliation) does not create the relevant oadp-<bsl_name>-<bsl_provider>-registry-secret.
When the backup is run, the OpenShift Velero plugin panics on the imagestream backup, with the following panic error:
2024-02-27T10:46:50.028951744Z time="2024-02-27T10:46:50Z" level=error msg="Error backing up item"
backup=openshift-adp/<backup name> error="error executing custom action (groupResource=imagestreams.image.openshift.io,
namespace=<BSL Name>, name=postgres): rpc error: code = Aborted desc = plugin panicked:
runtime error: index out of range with length 1, stack trace: goroutine 94…
To avoid the Velero plugin panic error, perform the following steps:
Label the custom BSL with the relevant label:
$ oc label backupstoragelocations.velero.io <bsl_name> app.kubernetes.io/component=bsl
After the BSL is labeled, wait until the DPA reconciles.
You can force the reconciliation by making any minor change to the DPA itself.
When the DPA reconciles, confirm that the relevant oadp-<bsl_name>-<bsl_provider>-registry-secret has been created and that the correct registry data has been populated into it:
$ oc -n openshift-adp get secret/oadp-<bsl_name>-<bsl_provider>-registry-secret -o json | jq -r '.data'
If you configure a DPA with both cloudstorage and restic enabled, the openshift-adp-controller-manager pod crashes and restarts indefinitely until the pod fails with a crash loop segmentation fault.
You can have either velero or cloudstorage defined, because they are mutually exclusive fields.
If you have both velero and cloudstorage defined, the openshift-adp-controller-manager fails.
If you have neither velero nor cloudstorage defined, the openshift-adp-controller-manager fails.
For more information about this issue, see OADP-1054.
Velero plugins are started as separate processes. After the Velero operation has completed, either successfully or not, they exit. Receiving a received EOF, stopping recv loop message in the debug logs indicates that a plugin operation has completed; it does not mean that an error has occurred.
You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application.
The Velero pod log displays the error message, Backup storage contains invalid top-level directories.
The object storage contains top-level directories that are not Velero directories.
If the object storage is not dedicated to Velero, you must specify a prefix for the bucket by setting the spec.backupLocations.velero.objectStorage.prefix parameter in the DataProtectionApplication manifest.
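As an illustration, a minimal sketch of the relevant DPA fields follows; the provider, bucket name, and prefix are placeholder values that you replace with your own:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  backupLocations:
  - velero:
      provider: aws
      default: true
      objectStorage:
        bucket: <bucket_name>
        prefix: <prefix> (1)
# ...
1 | Velero reads and writes its data only under this top-level directory in the bucket. |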
The oadp-aws-registry pod log displays the error message, InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.
The Velero pod log displays the error message, NoCredentialProviders: no valid providers in chain.
The credentials-velero file used to create the Secret object is incorrectly formatted.
Ensure that the credentials-velero file is correctly formatted, as in the following example:
credentials-velero file
[default] (1)
aws_access_key_id=AKIAIOSFODNN7EXAMPLE (2)
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
1 | AWS default profile. |
2 | Do not enclose the values with quotation marks (" , ' ). |
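After you correct the credentials-velero file, recreate the Secret object from it. The following sketch assumes the default cloud-credentials secret name that OADP uses for AWS:
$ oc delete secret cloud-credentials -n openshift-adp
$ oc create secret generic cloud-credentials -n openshift-adp \
  --from-file cloud=credentials-velero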
The OpenShift API for Data Protection (OADP) Operator might encounter issues caused by problems it is not able to resolve.
The S3 buckets of an OADP Operator might be empty, but when you run the command oc get po -n <OADP_Operator_namespace>, you see that the Operator has a status of Running. In such a case, the Operator is said to have failed silently because it incorrectly reports that it is running.
The problem is caused when cloud credentials provide insufficient permissions.
Retrieve a list of backup storage locations (BSLs) and check the manifest of each BSL for credential issues.
Run one of the following commands to retrieve a list of BSLs:
Using the OpenShift CLI:
$ oc get backupstoragelocations.velero.io -A
Using the Velero CLI:
$ velero backup-location get -n <OADP_Operator_namespace>
Using the list of BSLs, run the following command to display the manifest of each BSL, and examine each manifest for an error.
$ oc get backupstoragelocations.velero.io -n <namespace> -o yaml
apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: BackupStorageLocation
  metadata:
    creationTimestamp: "2023-11-03T19:49:04Z"
    generation: 9703
    name: example-dpa-1
    namespace: openshift-adp-operator
    ownerReferences:
    - apiVersion: oadp.openshift.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: DataProtectionApplication
      name: example-dpa
      uid: 0beeeaff-0287-4f32-bcb1-2e3c921b6e82
    resourceVersion: "24273698"
    uid: ba37cd15-cf17-4f7d-bf03-8af8655cea83
  spec:
    config:
      enableSharedConfig: "true"
      region: us-west-2
    credential:
      key: credentials
      name: cloud-credentials
    default: true
    objectStorage:
      bucket: example-oadp-operator
      prefix: example
    provider: aws
  status:
    lastValidationTime: "2023-11-10T22:06:46Z"
    message: "BackupStorageLocation \"example-dpa-1\" is unavailable: rpc
      error: code = Unknown desc = WebIdentityErr: failed to retrieve credentials\ncaused
      by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus
      code: 403, request id: d3f2e099-70a0-467b-997e-ff62345e3b54"
    phase: Unavailable
kind: List
metadata:
  resourceVersion: ""
Extending a timeout allows complex or resource-intensive processes to complete successfully without premature termination. This configuration can reduce the likelihood of errors, retries, or failures.
Ensure that you balance timeout extensions in a logical manner so that you do not configure excessively long timeouts that might hide underlying issues in the process. Carefully consider and monitor an appropriate timeout value that meets the needs of the process and the overall system performance.
The following are various OADP timeouts, with instructions on how and when to implement these parameters:
The spec.configuration.nodeAgent.timeout parameter defines the Restic timeout. The default value is 1h.
Use the Restic timeout parameter in the nodeAgent section for the following scenarios:
For Restic backups with total PV data usage that is greater than 500GB.
If backups are timing out with the following error:
level=error msg="Error backing up item" backup=velero/monitoring error="timed out waiting for all PodVolumeBackups to complete"
Edit the values in the spec.configuration.nodeAgent.timeout block of the DataProtectionApplication custom resource (CR) manifest, as shown in the following example:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: <dpa_name>
spec:
  configuration:
    nodeAgent:
      enable: true
      uploaderType: restic
      timeout: 1h
# ...
resourceTimeout defines how long to wait for several Velero resources before timeout occurs, such as Velero custom resource definition (CRD) availability, volumeSnapshot deletion, and repository availability. The default is 10m.
Use the resourceTimeout for the following scenarios:
For backups with total PV data usage that is greater than 1TB. This parameter is used as a timeout value when Velero tries to clean up or delete the Container Storage Interface (CSI) snapshots, before marking the backup as complete.
A sub-task of this cleanup tries to patch VSC and this timeout can be used for that task.
To create or ensure a backup repository is ready for filesystem based backups for Restic or Kopia.
To check if the Velero CRD is available in the cluster before restoring the custom resource (CR) or resource from the backup.
Edit the values in the spec.configuration.velero.resourceTimeout block of the DataProtectionApplication CR manifest, as in the following example:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: <dpa_name>
spec:
  configuration:
    velero:
      resourceTimeout: 10m
# ...
timeout is a user-supplied timeout to complete VolumeSnapshotBackup and VolumeSnapshotRestore. The default value is 10m.
Use the Data Mover timeout for the following scenarios:
If creation of VolumeSnapshotBackups (VSBs) and VolumeSnapshotRestores (VSRs) times out after 10 minutes.
For large scale environments with total PV data usage that is greater than 500GB. Set the timeout to 1h.
With the VolumeSnapshotMover (VSM) plugin.
Only with OADP 1.1.x.
Edit the values in the spec.features.dataMover.timeout block of the DataProtectionApplication CR manifest, as in the following example:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: <dpa_name>
spec:
  features:
    dataMover:
      timeout: 10m
# ...
CSISnapshotTimeout specifies, during creation, the time to wait until the CSI VolumeSnapshot status becomes ReadyToUse before returning a timeout error. The default value is 10m.
Use the CSISnapshotTimeout for the following scenarios:
With the CSI plugin.
For very large storage volumes that may take longer than 10 minutes to snapshot. Adjust this timeout if timeouts are found in the logs.
Typically, the default value for CSISnapshotTimeout does not require adjustment, because the default value addresses most common use case scenarios.
Edit the values in the spec.csiSnapshotTimeout block of the Backup CR manifest, as in the following example:
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: <backup_name>
spec:
  csiSnapshotTimeout: 10m
# ...
defaultItemOperationTimeout defines how long to wait on asynchronous BackupItemActions and RestoreItemActions to complete before timing out. The default value is 1h.
Use the defaultItemOperationTimeout for the following scenarios:
Only with Data Mover 1.2.x.
To specify the amount of time a particular backup or restore should wait for the asynchronous actions to complete. In the context of OADP features, this value is used for the asynchronous actions involved in the Container Storage Interface (CSI) Data Mover feature.
When defaultItemOperationTimeout is defined in the Data Protection Application (DPA), it applies to both backup and restore operations. You can use itemOperationTimeout to define only the backup or only the restore of those CRs, as described in the following "Item operation timeout - restore" and "Item operation timeout - backup" sections.
Edit the values in the spec.configuration.velero.defaultItemOperationTimeout block of the DataProtectionApplication CR manifest, as in the following example:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: <dpa_name>
spec:
  configuration:
    velero:
      defaultItemOperationTimeout: 1h
# ...
ItemOperationTimeout specifies the time that is used to wait for RestoreItemAction operations. The default value is 1h.
Use the restore ItemOperationTimeout for the following scenarios:
Only with Data Mover 1.2.x.
For Data Mover uploads and downloads to or from the BackupStorageLocation. If the restore action is not completed when the timeout is reached, it is marked as failed. If Data Mover operations are failing due to timeout issues because of large storage volume sizes, you might need to increase this timeout setting.
Edit the values in the Restore.spec.itemOperationTimeout block of the Restore CR manifest, as in the following example:
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: <restore_name>
spec:
  itemOperationTimeout: 1h
# ...
ItemOperationTimeout specifies the time used to wait for asynchronous BackupItemAction operations. The default value is 1h.
Use the backup ItemOperationTimeout for the following scenarios:
Only with Data Mover 1.2.x.
For Data Mover uploads and downloads to or from the BackupStorageLocation. If the backup action is not completed when the timeout is reached, it is marked as failed. If Data Mover operations are failing due to timeout issues because of large storage volume sizes, you might need to increase this timeout setting.
Edit the values in the Backup.spec.itemOperationTimeout block of the Backup CR manifest, as in the following example:
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: <backup_name>
spec:
  itemOperationTimeout: 1h
# ...
You might encounter these common issues with Backup and Restore custom resources (CRs).
The Backup CR displays the error message, InvalidVolume.NotFound: The volume 'vol-xxxx' does not exist.
The persistent volume (PV) and the snapshot locations are in different regions.
Edit the value of the spec.snapshotLocations.velero.config.region key in the DataProtectionApplication manifest so that the snapshot location is in the same region as the PV, as shown in the sketch after this procedure.
Create a new Backup CR.
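The following sketch shows where the region key sits in the DPA manifest; the provider and region values are examples only:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  snapshotLocations:
  - velero:
      provider: aws
      config:
        region: us-east-1 (1)
# ...
1 | Set the region to the same region as the persistent volumes that you are backing up. |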
The status of a Backup CR remains in the InProgress phase and does not complete.
If a backup is interrupted, it cannot be resumed.
Retrieve the details of the Backup CR:
$ oc -n {namespace} exec deployment/velero -c velero -- ./velero \
backup describe <backup>
Delete the Backup CR:
$ oc delete backups.velero.io <backup> -n openshift-adp
You do not need to clean up the backup location because a Backup CR in progress has not uploaded files to object storage.
Create a new Backup CR.
View the Velero backup details:
$ velero backup describe <backup-name> --details
The status of a Backup CR without Restic in use remains in the PartiallyFailed phase and does not complete. A snapshot of the affiliated PVC is not created.
If the backup is created based on the CSI snapshot class, but the label is missing, the CSI snapshot plugin fails to create a snapshot. As a result, the Velero pod logs an error similar to the following:
time="2023-02-17T16:33:13Z" level=error msg="Error backing up item" backup=openshift-adp/user1-backup-check5 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=busy1, name=pvc1-user1): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass ocs-storagecluster-ceph-rbd: failed to get volumesnapshotclass for provisioner openshift-storage.rbd.csi.ceph.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=busybox-79799557b5-vprq
Delete the Backup CR:
$ oc delete backups.velero.io <backup> -n openshift-adp
If required, clean up the stored data on the BackupStorageLocation to free up space.
Apply the label velero.io/csi-volumesnapshot-class=true to the VolumeSnapshotClass object:
$ oc label volumesnapshotclass/<snapclass_name> velero.io/csi-volumesnapshot-class=true
Create a new Backup CR.
You might encounter these issues when you back up applications with Restic.
The Restic pod log displays the error message: controller=pod-volume-backup error="fork/exec /usr/bin/restic: permission denied".
If your NFS data volumes have root_squash enabled, Restic maps to nfsnobody and does not have permission to create backups.
You can resolve this issue by creating a supplemental group for Restic and adding the group ID to the DataProtectionApplication manifest:
Create a supplemental group for Restic on the NFS data volume.
Set the setgid bit on the NFS directories so that group ownership is inherited.
Add the spec.configuration.nodeAgent.supplementalGroups parameter and the group ID to the DataProtectionApplication manifest, as shown in the following example:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
# ...
spec:
  configuration:
    nodeAgent:
      enable: true
      uploaderType: restic
      supplementalGroups:
      - <group_id> (1)
# ...
1 | Specify the supplemental group ID. |
Wait for the Restic pods to restart so that the changes are applied.
If you create a Restic Backup CR for a namespace, empty the object storage bucket, and then recreate the Backup CR for the same namespace, the recreated Backup CR fails.
The velero pod log displays the following error message: stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?
Velero does not recreate or update the Restic repository from the ResticRepository manifest if the Restic directories are deleted from object storage. See Velero issue 4421 for more information.
Remove the related Restic repository from the namespace by running the following command:
$ oc delete resticrepository -n openshift-adp <name_of_the_restic_repository>
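If you are not sure of the repository name, you can first list the Restic repositories in the OADP namespace and pick the entry for the affected namespace. This listing step is a convenience suggestion, not part of the documented procedure:
$ oc get resticrepositories.velero.io -n openshift-adp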
In the following error log, mysql-persistent is the problematic Restic repository. The repository name appears at the end of the repository URL in the log.
time="2021-12-29T18:29:14Z" level=info msg="1 errors
encountered backup up item" backup=velero/backup65
logSource="pkg/backup/backup.go:431" name=mysql-7d99fc949-qbkds
time="2021-12-29T18:29:14Z" level=error msg="Error backing up item"
backup=velero/backup65 error="pod volume backup failed: error running
restic backup, stderr=Fatal: unable to open config file: Stat: The
specified key does not exist.\nIs there a repository at the following
location?\ns3:http://minio-minio.apps.mayap-oadp-
veleo-1234.qe.devcluster.openshift.com/mayapvelerooadp2/velero1/
restic/mysql-persistent\n: exit status 1" error.file="/remote-source/
src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:184"
error.function="github.com/vmware-tanzu/velero/
pkg/restic.(*backupper).BackupPodVolumes"
logSource="pkg/backup/backup.go:435" name=mysql-7d99fc949-qbkds
You can collect logs, metrics, and information about OADP custom resources by using the must-gather tool.
The must-gather data must be attached to all customer cases.
You can run the must-gather tool with the following data collection options:
Full must-gather data collection collects Prometheus metrics, pod logs, and Velero CR information for all namespaces where the OADP Operator is installed.
Essential must-gather data collection collects pod logs and Velero CR information for a specific duration of time, for example, one hour or 24 hours. Prometheus metrics and duplicate logs are not included.
must-gather data collection with timeout. Data collection can take a long time if there are many failed Backup CRs. You can improve performance by setting a timeout value.
Prometheus metrics data dump downloads an archive file containing the metrics data collected by Prometheus.
You must be logged in to the OKD cluster as a user with the cluster-admin role.
You must have the OpenShift CLI (oc) installed.
You must use Fedora 8.x with OADP 1.2.
You must use Fedora 35 with OADP 1.3.
Navigate to the directory where you want to store the must-gather data.
Run the oc adm must-gather command for one of the following data collection options:
Full must-gather data collection, including Prometheus metrics:
For OADP 1.2, run the following command:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel8:v1.2
For OADP 1.3, run the following command:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3
The data is saved as must-gather/must-gather.tar.gz. You can upload this file to a support case on the Red Hat Customer Portal.
Essential must-gather data collection, without Prometheus metrics, for a specific time duration:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel8:v1.1 \
-- /usr/bin/gather_<time>_essential (1)
1 | Specify the time in hours. Allowed values are 1h , 6h , 24h , 72h , or all , for example, gather_1h_essential or gather_all_essential . |
must-gather data collection with timeout:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel8:v1.1 \
-- /usr/bin/gather_with_timeout <timeout> (1)
1 | Specify a timeout value in seconds. |
Prometheus metrics data dump:
For OADP 1.2, run the following command:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel8:v1.2 -- /usr/bin/gather_metrics_dump
For OADP 1.3, run the following command:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3 -- /usr/bin/gather_metrics_dump
This operation can take a long time. The data is saved as must-gather/metrics/prom_data.tar.gz.
If a custom CA certificate is used, the must-gather pod fails to grab the output for velero logs/describe. To use the must-gather tool with insecure TLS connections, you can pass the gather_without_tls flag to the must-gather command.
Pass the gather_without_tls flag, with its value set to true, to the must-gather tool by using the following command:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3 -- /usr/bin/gather_without_tls <true/false>
By default, the flag value is set to false. Set the value to true to allow insecure TLS connections.
Currently, it is not possible to combine must-gather scripts, for example, specifying a timeout threshold while permitting insecure TLS connections. In some situations, you can get around this limitation by setting up internal variables on the must-gather command line, as in the following example:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3 -- skip_tls=true /usr/bin/gather_with_timeout <timeout_value_in_seconds>
In this example, set the skip_tls variable before running the gather_with_timeout script. The result is a combination of gather_with_timeout and gather_without_tls.
The only other variables that you can specify this way are the following:
logs_since, with a default value of 72h
request_timeout, with a default value of 0s
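For example, following the same internal-variable pattern, you might shorten the collected log window while still applying a timeout; the 24h value is arbitrary:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3 \
  -- logs_since=24h /usr/bin/gather_with_timeout <timeout_value_in_seconds>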
If the DataProtectionApplication custom resource (CR) is configured with s3Url and insecureSkipTLS: true, the CR does not collect the necessary logs because of a missing CA certificate. To collect those logs, run the must-gather command with the following option:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel9:v1.3 -- /usr/bin/gather_without_tls true
OKD provides a monitoring stack that allows users and administrators to effectively monitor and manage their clusters, as well as monitor and analyze the workload performance of user applications and services running on the clusters, including receiving alerts if an event occurs.
The OADP Operator leverages the User Workload Monitoring provided by the OpenShift Monitoring stack to retrieve metrics from the Velero service endpoint. The monitoring stack allows you to create user-defined alerting rules or query metrics by using the OpenShift metrics query front end.
With User Workload Monitoring enabled, you can configure and use any Prometheus-compatible third-party UI, such as Grafana, to visualize Velero metrics.
Monitoring metrics requires enabling monitoring for the user-defined projects and creating a ServiceMonitor resource to scrape those metrics from the already enabled OADP service endpoint that resides in the openshift-adp namespace.
You have access to an OKD cluster using an account with cluster-admin permissions.
You have created a cluster monitoring config map.
Edit the cluster-monitoring-config ConfigMap object in the openshift-monitoring namespace:
$ oc edit configmap cluster-monitoring-config -n openshift-monitoring
Add or enable the enableUserWorkload option in the data section's config.yaml field:
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true (1)
kind: ConfigMap
metadata:
# ...
1 | Add this option or set it to true. |
Wait a short period of time to verify the User Workload Monitoring setup by checking whether the following components are up and running in the openshift-user-workload-monitoring namespace:
$ oc get pods -n openshift-user-workload-monitoring
NAME READY STATUS RESTARTS AGE
prometheus-operator-6844b4b99c-b57j9 2/2 Running 0 43s
prometheus-user-workload-0 5/5 Running 0 32s
prometheus-user-workload-1 5/5 Running 0 32s
thanos-ruler-user-workload-0 3/3 Running 0 32s
thanos-ruler-user-workload-1 3/3 Running 0 32s
Verify the existence of the user-workload-monitoring-config ConfigMap in the openshift-user-workload-monitoring namespace. If it exists, skip the remaining steps in this procedure.
$ oc get configmap user-workload-monitoring-config -n openshift-user-workload-monitoring
Error from server (NotFound): configmaps "user-workload-monitoring-config" not found
Create a user-workload-monitoring-config ConfigMap object for the User Workload Monitoring, and save it under the 2_configure_user_workload_monitoring.yaml file name:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
Apply the 2_configure_user_workload_monitoring.yaml file:
$ oc apply -f 2_configure_user_workload_monitoring.yaml
configmap/user-workload-monitoring-config created
OADP provides an openshift-adp-velero-metrics-svc service which is created when the DPA is configured. The service monitor used by the user workload monitoring must point to the defined service.
Get details about the service by running the following commands:
Ensure the openshift-adp-velero-metrics-svc service exists. It should contain the app.kubernetes.io/name=velero label, which is used as the selector for the ServiceMonitor object.
$ oc get svc -n openshift-adp -l app.kubernetes.io/name=velero
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
openshift-adp-velero-metrics-svc ClusterIP 172.30.38.244 <none> 8085/TCP 1h
Create a ServiceMonitor YAML file that matches the existing service label, and save the file as 3_create_oadp_service_monitor.yaml. The service monitor is created in the openshift-adp namespace where the openshift-adp-velero-metrics-svc service resides.
ServiceMonitor object
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: oadp-service-monitor
  name: oadp-service-monitor
  namespace: openshift-adp
spec:
  endpoints:
  - interval: 30s
    path: /metrics
    targetPort: 8085
    scheme: http
  selector:
    matchLabels:
      app.kubernetes.io/name: "velero"
Apply the 3_create_oadp_service_monitor.yaml file:
$ oc apply -f 3_create_oadp_service_monitor.yaml
servicemonitor.monitoring.coreos.com/oadp-service-monitor created
Confirm that the new service monitor is in an Up state by using the Administrator perspective of the OKD web console:
Navigate to the Observe → Targets page.
Ensure the Filter is unselected or that the User source is selected, and type openshift-adp in the Text search field.
Verify that the Status for the service monitor is Up.
The OKD monitoring stack allows you to receive alerts configured by using alerting rules. To create an alerting rule for the OADP project, use one of the metrics that are scraped with the user workload monitoring.
Create a PrometheusRule YAML file with the sample OADPBackupFailing alert and save it as 4_create_oadp_alert_rule.yaml.
OADPBackupFailing alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-oadp-alert
  namespace: openshift-adp
spec:
  groups:
  - name: sample-oadp-backup-alert
    rules:
    - alert: OADPBackupFailing
      annotations:
        description: 'OADP had {{$value | humanize}} backup failures over the last 2 hours.'
        summary: OADP has issues creating backups
      expr: |
        increase(velero_backup_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
      for: 5m
      labels:
        severity: warning
In this sample, the alert displays under the following conditions:
There is an increase of new failing backups during the last 2 hours that is greater than 0 and the state persists for at least 5 minutes.
If the time of the first increase is less than 5 minutes, the alert will be in a Pending state, after which it will turn into a Firing state.
Apply the 4_create_oadp_alert_rule.yaml file, which creates the PrometheusRule object in the openshift-adp namespace:
$ oc apply -f 4_create_oadp_alert_rule.yaml
prometheusrule.monitoring.coreos.com/sample-oadp-alert created
After the alert is triggered, you can view it in the following ways:
In the Developer perspective, select the Observe menu.
In the Administrator perspective under the Observe → Alerting menu, select User in the Filter box. Otherwise, by default only the Platform Alerts are displayed.
The following is a list of metrics provided by OADP, together with their types.
Metric name | Description | Type |
---|---|---|
 | Number of bytes retrieved from the cache | Counter |
 | Number of times content was retrieved from the cache | Counter |
 | Number of times malformed content was read from the cache | Counter |
 | Number of times content was not found in the cache and fetched | Counter |
 | Number of bytes retrieved from the underlying storage | Counter |
 | Number of times content could not be found in the underlying storage | Counter |
 | Number of times content could not be saved in the cache | Counter |
 | Number of bytes retrieved using | Counter |
 | Number of times | Counter |
 | Number of times | Counter |
 | Number of times | Counter |
 | Number of bytes passed to | Counter |
 | Number of times | Counter |
 | Total number of attempted backups | Counter |
 | Total number of attempted backup deletions | Counter |
 | Total number of failed backup deletions | Counter |
 | Total number of successful backup deletions | Counter |
 | Time taken to complete backup, in seconds | Histogram |
 | Total number of failed backups | Counter |
 | Total number of errors encountered during backup | Gauge |
 | Total number of items backed up | Gauge |
 | Last status of the backup. A value of 1 is success, 0 is failure | Gauge |
 | Last time a backup ran successfully, Unix timestamp in seconds | Gauge |
 | Total number of partially failed backups | Counter |
 | Total number of successful backups | Counter |
 | Size, in bytes, of a backup | Gauge |
 | Current number of existent backups | Gauge |
 | Total number of validation failed backups | Counter |
 | Total number of warned backups | Counter |
 | Total number of CSI attempted volume snapshots | Counter |
 | Total number of CSI failed volume snapshots | Counter |
 | Total number of CSI successful volume snapshots | Counter |
 | Total number of attempted restores | Counter |
 | Total number of failed restores | Counter |
 | Total number of partially failed restores | Counter |
 | Total number of successful restores | Counter |
 | Current number of existent restores | Gauge |
 | Total number of failed restores failing validations | Counter |
 | Total number of attempted volume snapshots | Counter |
 | Total number of failed volume snapshots | Counter |
 | Total number of successful volume snapshots | Counter |
You can view metrics in the OKD web console from the Administrator or Developer perspective, which must have access to the openshift-adp project.
Navigate to the Observe → Metrics page:
If you are using the Developer perspective, follow these steps:
Select Custom query, or click on the Show PromQL link.
Type the query and click Enter.
If you are using the Administrator perspective, type the expression in the text field and select Run Queries.
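For example, you can reuse the expression from the sample OADPBackupFailing alert as a query to chart recent backup failures; the metric name and job label come from the openshift-adp-velero-metrics-svc service described earlier:
increase(velero_backup_failure_total{job="openshift-adp-velero-metrics-svc"}[2h])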