By using the OKD monitoring stack, users and administrators can effectively perform the following tasks:

  • Monitor and manage clusters

  • Analyze the workload performance of user applications

  • Monitor services running on the clusters

  • Receive alerts if an event occurs

OADP monitoring setup

The OADP Operator uses OpenShift User Workload Monitoring, which is provided by the OpenShift monitoring stack, to retrieve metrics from the Velero service endpoint. With the monitoring stack, you can create user-defined alerting rules or query metrics by using the OpenShift Metrics query front end.

With User Workload Monitoring enabled, you can configure and use any Prometheus-compatible third-party UI, such as Grafana, to visualize Velero metrics.

To monitor metrics, you must enable monitoring for user-defined projects and create a ServiceMonitor resource that scrapes the metrics from the already enabled OADP service endpoint in the openshift-adp namespace.

Prerequisites
  • You have access to an OKD cluster using an account with cluster-admin permissions.

  • You have created a cluster monitoring config map.

Procedure
  1. Edit the cluster-monitoring-config ConfigMap object in the openshift-monitoring namespace:

    $ oc edit configmap cluster-monitoring-config -n openshift-monitoring
  2. Add or enable the enableUserWorkload option in the data section’s config.yaml field:

    apiVersion: v1
    data:
      config.yaml: |
        enableUserWorkload: true (1)
    kind: ConfigMap
    metadata:
    # ...
    (1) Add this option or set it to true.
  3. After a short wait, verify the User Workload Monitoring setup by checking that the following components are running in the openshift-user-workload-monitoring namespace:

    $ oc get pods -n openshift-user-workload-monitoring
    Example output

    NAME                                   READY   STATUS    RESTARTS   AGE
    prometheus-operator-6844b4b99c-b57j9   2/2     Running   0          43s
    prometheus-user-workload-0             5/5     Running   0          32s
    prometheus-user-workload-1             5/5     Running   0          32s
    thanos-ruler-user-workload-0           3/3     Running   0          32s
    thanos-ruler-user-workload-1           3/3     Running   0          32s
  4. Verify whether the user-workload-monitoring-config ConfigMap exists in the openshift-user-workload-monitoring namespace. If it exists, skip the remaining steps in this procedure.

    $ oc get configmap user-workload-monitoring-config -n openshift-user-workload-monitoring
    Example output when the ConfigMap does not exist

    Error from server (NotFound): configmaps "user-workload-monitoring-config" not found
  5. Create a user-workload-monitoring-config ConfigMap object for User Workload Monitoring, and save it as 2_configure_user_workload_monitoring.yaml:

    Example ConfigMap object

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |
  6. Apply the 2_configure_user_workload_monitoring.yaml file:

    $ oc apply -f 2_configure_user_workload_monitoring.yaml
    Example output

    configmap/user-workload-monitoring-config created
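
The checks in steps 3 and 4 can also be scripted. The following is a minimal sketch and not part of the documented procedure: the function names uwm_enabled and wait_for_uwm_pods are hypothetical, and the 300-second timeout is an arbitrary assumption.

```shell
#!/bin/sh
# Hypothetical helpers; names and the timeout are assumptions for illustration.

# Succeeds (exit status 0) if config.yaml in cluster-monitoring-config
# contains the line "enableUserWorkload: true".
uwm_enabled() {
  oc get configmap cluster-monitoring-config -n openshift-monitoring \
    -o jsonpath='{.data.config\.yaml}' |
    grep -Eq '^[[:space:]]*enableUserWorkload:[[:space:]]*true[[:space:]]*$'
}

# Blocks until every pod in the user workload monitoring namespace
# reports Ready, or fails when the timeout expires.
wait_for_uwm_pods() {
  oc wait pod --all -n openshift-user-workload-monitoring \
    --for=condition=Ready --timeout=300s
}
```

For example, run uwm_enabled && wait_for_uwm_pods after editing the config map. If the pods have not been created yet, oc wait can fail immediately because it finds no matching resources. In that case, retry after a few seconds.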

Creating OADP service monitor

OADP provides an openshift-adp-velero-metrics-svc service, which is created when the DPA is configured. The service monitor used by user workload monitoring must point to this service.

Check the details of the service, and then create the service monitor, by completing the following procedure.

Procedure
  1. Ensure that the openshift-adp-velero-metrics-svc service exists. It should have the app.kubernetes.io/name=velero label, which is used as the selector for the ServiceMonitor object.

    $ oc get svc -n openshift-adp -l app.kubernetes.io/name=velero
    Example output

    NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
    openshift-adp-velero-metrics-svc   ClusterIP   172.30.38.244   <none>        8085/TCP   1h
  2. Create a ServiceMonitor YAML file that matches the existing service label, and save the file as 3_create_oadp_service_monitor.yaml. The service monitor is created in the openshift-adp namespace where the openshift-adp-velero-metrics-svc service resides.

    Example ServiceMonitor object

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        app: oadp-service-monitor
      name: oadp-service-monitor
      namespace: openshift-adp
    spec:
      endpoints:
      - interval: 30s
        path: /metrics
        targetPort: 8085
        scheme: http
      selector:
        matchLabels:
          app.kubernetes.io/name: "velero"
  3. Apply the 3_create_oadp_service_monitor.yaml file:

    $ oc apply -f 3_create_oadp_service_monitor.yaml
    Example output

    servicemonitor.monitoring.coreos.com/oadp-service-monitor created
Verification
  • Confirm that the new service monitor is in an Up state by using the Administrator perspective of the OKD web console:

    1. Navigate to the Observe → Targets page.

    2. Ensure that no filter is selected or that the User source is selected, and type openshift-adp in the Text search field.

    3. Verify that the Status for the service monitor is Up.

      Figure 1. OADP metrics targets
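
As a command-line cross-check of the web console verification, you can read the Velero metrics endpoint directly through a port-forward. This is a sketch under assumptions: the helper names velero_samples and check_velero_endpoint are hypothetical, local port 8085 is assumed to be free, and the 3-second pause is an arbitrary wait for the tunnel to come up.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Keep only Velero sample lines. This drops "# HELP" and "# TYPE"
# comment lines and metrics that belong to other subsystems.
velero_samples() {
  grep -E '^velero_[a-z_]+[{ ]'
}

# Forward the metrics service to localhost, print the Velero samples,
# and stop the port-forward again.
check_velero_endpoint() {
  oc port-forward -n openshift-adp svc/openshift-adp-velero-metrics-svc 8085:8085 \
    >/dev/null 2>&1 &
  pf_pid=$!
  sleep 3   # assumption: the tunnel is ready within 3 seconds
  curl -s http://localhost:8085/metrics | velero_samples
  kill "$pf_pid"
}
```

If check_velero_endpoint prints lines such as velero_backup_total followed by a value, the endpoint that the ServiceMonitor scrapes is serving metrics.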

Creating an alerting rule

With the OKD monitoring stack, you can receive alerts that are configured by using alerting rules. To create an alerting rule for the OADP project, use one of the metrics that are scraped by user workload monitoring.

Procedure
  1. Create a PrometheusRule YAML file with the sample OADPBackupFailing alert and save it as 4_create_oadp_alert_rule.yaml.

    Sample OADPBackupFailing alert

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: sample-oadp-alert
      namespace: openshift-adp
    spec:
      groups:
      - name: sample-oadp-backup-alert
        rules:
        - alert: OADPBackupFailing
          annotations:
            description: 'OADP had {{$value | humanize}} backup failures over the last 2 hours.'
            summary: OADP has issues creating backups
          expr: |
            increase(velero_backup_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
          for: 5m
          labels:
            severity: warning

    In this sample, the alert behaves as follows:

    • The alert condition is met when the number of failed backups increased during the last 2 hours, that is, the increase is greater than 0.

    • While the condition has been met for less than 5 minutes, the alert is in a Pending state. After the condition has persisted for 5 minutes, the alert changes to a Firing state.

  2. Apply the 4_create_oadp_alert_rule.yaml file, which creates the PrometheusRule object in the openshift-adp namespace:

    $ oc apply -f 4_create_oadp_alert_rule.yaml
    Example output

    prometheusrule.monitoring.coreos.com/sample-oadp-alert created
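
The Pending and Firing behavior described in step 1 follows from the for: 5m clause of the rule. The following sketch models that timing in shell. It illustrates the state transitions only and is not how the monitoring stack is implemented. The function name alert_state is hypothetical.

```shell
#!/bin/sh
# alert_state ACTIVE_SECONDS HOLD_SECONDS
#   ACTIVE_SECONDS: how long the rule expression has been continuously true
#                   (a negative value means it is currently false).
#   HOLD_SECONDS:   the rule's "for" duration (300 for "for: 5m").
alert_state() {
  if [ "$1" -lt 0 ]; then
    echo inactive
  elif [ "$1" -lt "$2" ]; then
    echo pending
  else
    echo firing
  fi
}

# Timeline for the sample rule at a few points in time.
for t in -60 0 60 240 300 360; do
  printf '%5ss  %s\n' "$t" "$(alert_state "$t" 300)"
done
```

The loop reports inactive before the first failure, pending from 0 to 240 seconds, and firing from 300 seconds on. Because the sample expression looks back over 2 hours, a single failed backup usually keeps the expression true long enough to pass the 5-minute hold.
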
Verification
  • After the Alert is triggered, you can view it in the following ways:

    • In the Developer perspective, select the Observe menu.

    • In the Administrator perspective, under the Observe → Alerting menu, select User in the Filter box. By default, only Platform alerts are displayed.

      Figure 2. OADP backup failing alert

List of available metrics

The following metrics are provided by OADP, listed with their types:

Metric name                               Description                                                            Type
kopia_content_cache_hit_bytes             Number of bytes retrieved from the cache                               Counter
kopia_content_cache_hit_count             Number of times content was retrieved from the cache                   Counter
kopia_content_cache_malformed             Number of times malformed content was read from the cache              Counter
kopia_content_cache_miss_count            Number of times content was not found in the cache and fetched         Counter
kopia_content_cache_missed_bytes          Number of bytes retrieved from the underlying storage                  Counter
kopia_content_cache_miss_error_count      Number of times content could not be found in the underlying storage   Counter
kopia_content_cache_store_error_count     Number of times content could not be saved in the cache                Counter
kopia_content_get_bytes                   Number of bytes retrieved using GetContent()                           Counter
kopia_content_get_count                   Number of times GetContent() was called                                Counter
kopia_content_get_error_count             Number of times GetContent() was called and the result was an error    Counter
kopia_content_get_not_found_count         Number of times GetContent() was called and the result was not found   Counter
kopia_content_write_bytes                 Number of bytes passed to WriteContent()                               Counter
kopia_content_write_count                 Number of times WriteContent() was called                              Counter
velero_backup_attempt_total               Total number of attempted backups                                      Counter
velero_backup_deletion_attempt_total      Total number of attempted backup deletions                             Counter
velero_backup_deletion_failure_total      Total number of failed backup deletions                                Counter
velero_backup_deletion_success_total      Total number of successful backup deletions                            Counter
velero_backup_duration_seconds            Time taken to complete backup, in seconds                              Histogram
velero_backup_failure_total               Total number of failed backups                                         Counter
velero_backup_items_errors                Total number of errors encountered during backup                       Gauge
velero_backup_items_total                 Total number of items backed up                                        Gauge
velero_backup_last_status                 Last status of the backup. A value of 1 is success, 0 is failure       Gauge
velero_backup_last_successful_timestamp   Last time a backup ran successfully, Unix timestamp in seconds         Gauge
velero_backup_partial_failure_total       Total number of partially failed backups                               Counter
velero_backup_success_total               Total number of successful backups                                     Counter
velero_backup_tarball_size_bytes          Size, in bytes, of a backup                                            Gauge
velero_backup_total                       Current number of existent backups                                     Gauge
velero_backup_validation_failure_total    Total number of validation failed backups                              Counter
velero_backup_warning_total               Total number of warned backups                                         Counter
velero_csi_snapshot_attempt_total         Total number of CSI attempted volume snapshots                         Counter
velero_csi_snapshot_failure_total         Total number of CSI failed volume snapshots                            Counter
velero_csi_snapshot_success_total         Total number of CSI successful volume snapshots                        Counter
velero_restore_attempt_total              Total number of attempted restores                                     Counter
velero_restore_failed_total               Total number of failed restores                                        Counter
velero_restore_partial_failure_total      Total number of partially failed restores                              Counter
velero_restore_success_total              Total number of successful restores                                    Counter
velero_restore_total                      Current number of existent restores                                    Gauge
velero_restore_validation_failed_total    Total number of failed restores failing validations                    Counter
velero_volume_snapshot_attempt_total      Total number of attempted volume snapshots                             Counter
velero_volume_snapshot_failure_total      Total number of failed volume snapshots                                Counter
velero_volume_snapshot_success_total      Total number of successful volume snapshots                            Counter
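
The metric type suggests how a metric is usually queried: counters are typically wrapped in increase() or rate(), gauges are read directly, and histograms are queried through their _bucket series. The following sketch prints a starting-point PromQL expression for each type. The function name promql_for, the 1-hour window, and the 0.95 quantile are assumptions chosen for illustration.

```shell
#!/bin/sh
# promql_for METRIC TYPE
# Prints a starting-point PromQL expression for a metric of the given type.
# The 1h window and the 0.95 quantile are arbitrary example values.
promql_for() {
  case "$2" in
    Counter)   printf 'increase(%s[1h])\n' "$1" ;;
    Gauge)     printf '%s\n' "$1" ;;
    Histogram) printf 'histogram_quantile(0.95, sum by (le) (rate(%s_bucket[1h])))\n' "$1" ;;
    *)         echo "unknown metric type: $2" >&2; return 1 ;;
  esac
}

promql_for velero_backup_failure_total Counter
promql_for velero_backup_total Gauge
promql_for velero_backup_duration_seconds Histogram
```

You can paste the printed expressions into the Observe → Metrics page and adjust the window to match your backup schedule.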

Viewing metrics using the Observe UI

You can view metrics in the OKD web console from the Administrator or Developer perspective. Your user account must have access to the openshift-adp project.

Procedure
  • Navigate to the Observe → Metrics page:

    • If you are using the Developer perspective, follow these steps:

      1. Select Custom query, or click the Show PromQL link.

      2. Type the query and press Enter.

    • If you are using the Administrator perspective, type the expression in the text field and select Run Queries.

      Figure 3. OADP metrics query
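
The same expressions can also be run from a terminal against the Prometheus-compatible HTTP API. The following is a sketch under assumptions that this section does not state: it assumes that the thanos-querier route in the openshift-monitoring namespace is reachable from your machine, that the token returned by oc whoami -t has permission to query metrics, and that skipping TLS verification with -k is acceptable in a test environment. The function name promql_query is hypothetical.

```shell
#!/bin/sh
# promql_query EXPRESSION
# Sends an instant query and prints the raw JSON response.
# Assumptions: the thanos-querier route exists and the current token
# is allowed to query metrics.
promql_query() {
  host=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
  token=$(oc whoami -t)
  curl -s -k -G "https://${host}/api/v1/query" \
    -H "Authorization: Bearer ${token}" \
    --data-urlencode "query=$1"
}
```

For example, promql_query 'velero_backup_total' returns the current value of the gauge as JSON, which you can pipe to a JSON formatter of your choice.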