By using the OKD monitoring stack, users and administrators can effectively perform the following tasks:
Monitor and manage clusters
Analyze the workload performance of user applications
Monitor services running on the clusters
Receive alerts if an event occurs
The OADP Operator uses OpenShift User Workload Monitoring, which is provided by the OpenShift Monitoring Stack, to retrieve metrics from the Velero service endpoint. The monitoring stack lets you create user-defined Alerting Rules or query metrics by using the OpenShift Metrics query front end.
With User Workload Monitoring enabled, you can configure and use any Prometheus-compatible third-party UI, such as Grafana, to visualize Velero metrics.
Monitoring metrics requires enabling monitoring for user-defined projects and creating a ServiceMonitor resource to scrape those metrics from the already enabled OADP service endpoint that resides in the openshift-adp namespace.
You have access to an OKD cluster using an account with cluster-admin permissions.
You have created a cluster monitoring config map.
Edit the cluster-monitoring-config ConfigMap object in the openshift-monitoring namespace:
$ oc edit configmap cluster-monitoring-config -n openshift-monitoring
Add or enable the enableUserWorkload option in the data section’s config.yaml field:
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true # (1)
kind: ConfigMap
metadata:
# ...
(1) Add this option or set it to true.
Wait a short time, then verify the User Workload Monitoring setup by checking that the following components are up and running in the openshift-user-workload-monitoring namespace:
$ oc get pods -n openshift-user-workload-monitoring
NAME READY STATUS RESTARTS AGE
prometheus-operator-6844b4b99c-b57j9 2/2 Running 0 43s
prometheus-user-workload-0 5/5 Running 0 32s
prometheus-user-workload-1 5/5 Running 0 32s
thanos-ruler-user-workload-0 3/3 Running 0 32s
thanos-ruler-user-workload-1 3/3 Running 0 32s
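Rather than re-running `oc get pods` until every component is up, the wait can be scripted. The following is a minimal sketch and not part of the documented procedure: the function name `wait_for_uwm_pods` and the 120-second default timeout are assumptions chosen for illustration. Note that `oc wait` may return an error if no pods exist in the namespace yet, so a short delay before the first call can still be needed.

```shell
# Hypothetical helper: block until every pod in the user workload
# monitoring namespace reports the Ready condition.
# $1 is an optional timeout (default 120s, an arbitrary illustrative value).
wait_for_uwm_pods() {
  oc wait pods --all \
    --for=condition=Ready \
    -n openshift-user-workload-monitoring \
    --timeout="${1:-120s}"
}
```

Calling `wait_for_uwm_pods 5m` would wait up to five minutes and exit with a non-zero status if any pod is still not Ready.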
Verify that the user-workload-monitoring-config ConfigMap exists in the openshift-user-workload-monitoring namespace. If it exists, skip the remaining steps in this procedure.
$ oc get configmap user-workload-monitoring-config -n openshift-user-workload-monitoring
Error from server (NotFound): configmaps "user-workload-monitoring-config" not found
Create a user-workload-monitoring-config ConfigMap object for User Workload Monitoring, and save it as 2_configure_user_workload_monitoring.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
Apply the 2_configure_user_workload_monitoring.yaml file:
$ oc apply -f 2_configure_user_workload_monitoring.yaml
configmap/user-workload-monitoring-config created
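The check-then-create sequence in the preceding steps can be folded into one idempotent helper. This is a sketch under stated assumptions: the function name `ensure_uwm_configmap` is hypothetical, and it expects the 2_configure_user_workload_monitoring.yaml file from the earlier step to be in the current directory.

```shell
# Hypothetical helper: apply the ConfigMap manifest only when the
# user-workload-monitoring-config ConfigMap does not exist yet.
ensure_uwm_configmap() {
  if oc get configmap user-workload-monitoring-config \
       -n openshift-user-workload-monitoring >/dev/null 2>&1; then
    # The ConfigMap is already present, so leave it untouched.
    echo "user-workload-monitoring-config already exists; nothing to do"
  else
    oc apply -f 2_configure_user_workload_monitoring.yaml
  fi
}
```

Because the existing ConfigMap is never overwritten, running the helper a second time leaves any later customization of config.yaml in place.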
OADP provides an openshift-adp-velero-metrics-svc service, which is created when the DPA is configured. The ServiceMonitor used by user workload monitoring must point to this service.
Get details about the service by running the following commands:
Ensure that the openshift-adp-velero-metrics-svc service exists. It should carry the app.kubernetes.io/name=velero label, which is used as the selector for the ServiceMonitor object.
$ oc get svc -n openshift-adp -l app.kubernetes.io/name=velero
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
openshift-adp-velero-metrics-svc ClusterIP 172.30.38.244 <none> 8085/TCP 1h
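The same label check can be made scriptable, so that a missing or unlabelled service fails early instead of producing a ServiceMonitor that selects nothing. This is an illustrative sketch: the function name `velero_metrics_service_labelled` is an assumption, while the namespace, label, and service name are the ones used in this procedure.

```shell
# Hypothetical helper: succeed only if the Velero metrics service is
# returned by the label selector that the ServiceMonitor relies on.
velero_metrics_service_labelled() {
  local svc
  # "-o name" prints matches as "service/<name>", one per line.
  svc="$(oc get svc -n openshift-adp -l app.kubernetes.io/name=velero -o name)"
  case "$svc" in
    *openshift-adp-velero-metrics-svc*)
      echo "label selector matches ${svc}"
      ;;
    *)
      echo "no openshift-adp-velero-metrics-svc service carries the label" >&2
      return 1
      ;;
  esac
}
```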
Create a ServiceMonitor YAML file that matches the existing service label, and save the file as 3_create_oadp_service_monitor.yaml. The service monitor is created in the openshift-adp namespace, where the openshift-adp-velero-metrics-svc service resides.
ServiceMonitor object:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: oadp-service-monitor
  name: oadp-service-monitor
  namespace: openshift-adp
spec:
  endpoints:
  - interval: 30s
    path: /metrics
    targetPort: 8085
    scheme: http
  selector:
    matchLabels:
      app.kubernetes.io/name: "velero"
Apply the 3_create_oadp_service_monitor.yaml file:
$ oc apply -f 3_create_oadp_service_monitor.yaml
servicemonitor.monitoring.coreos.com/oadp-service-monitor created
Confirm that the new service monitor is in an Up state by using the Administrator perspective of the OKD web console:
Navigate to the Observe → Targets page.
Ensure that the Filter is unselected or that the User source is selected, and type openshift-adp in the Text search field.
Verify that the Status for the service monitor is Up.
The OKD monitoring stack can deliver Alerts that are configured by using Alerting Rules. To create an Alerting Rule for the OADP project, use one of the metrics that are scraped by user workload monitoring.
Create a PrometheusRule YAML file with the sample OADPBackupFailing alert and save it as 4_create_oadp_alert_rule.yaml.
OADPBackupFailing alert:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-oadp-alert
  namespace: openshift-adp
spec:
  groups:
  - name: sample-oadp-backup-alert
    rules:
    - alert: OADPBackupFailing
      annotations:
        description: 'OADP had {{$value | humanize}} backup failures over the last 2 hours.'
        summary: OADP has issues creating backups
      expr: |
        increase(velero_backup_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
      for: 5m
      labels:
        severity: warning
In this sample, the Alert displays under the following conditions:
The number of failed backups has increased by more than 0 during the last 2 hours, and this state persists for at least 5 minutes.
If less than 5 minutes have passed since the first increase, the Alert is in a Pending state, after which it turns into a Firing state.
Apply the 4_create_oadp_alert_rule.yaml file, which creates the PrometheusRule object in the openshift-adp namespace:
$ oc apply -f 4_create_oadp_alert_rule.yaml
prometheusrule.monitoring.coreos.com/sample-oadp-alert created
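To confirm from the command line that the rule object holds the expected alert, the alert names can be read back from the cluster. This is a minimal sketch: the function name `list_oadp_alert_names` is an assumption, and the JSONPath expression assumes the rule layout shown in the sample file.

```shell
# Hypothetical helper: print the names of all alerts defined in the
# sample-oadp-alert PrometheusRule object.
list_oadp_alert_names() {
  oc get prometheusrule sample-oadp-alert -n openshift-adp \
    -o jsonpath='{.spec.groups[*].rules[*].alert}'
}
```

For the sample rule, the helper should print OADPBackupFailing. An error instead indicates that the object was not created in the openshift-adp namespace.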
After the Alert is triggered, you can view it in the following ways:
In the Developer perspective, select the Observe menu.
In the Administrator perspective under the Observe → Alerting menu, select User in the Filter box. Otherwise, by default only the Platform Alerts are displayed.
The following table lists the metrics provided by OADP, together with their types.
| Metric name | Description | Type |
|---|---|---|
| | Number of bytes retrieved from the cache | Counter |
| | Number of times content was retrieved from the cache | Counter |
| | Number of times malformed content was read from the cache | Counter |
| | Number of times content was not found in the cache and fetched | Counter |
| | Number of bytes retrieved from the underlying storage | Counter |
| | Number of times content could not be found in the underlying storage | Counter |
| | Number of times content could not be saved in the cache | Counter |
| | Number of bytes retrieved using | Counter |
| | Number of times | Counter |
| | Number of times | Counter |
| | Number of times | Counter |
| | Number of bytes passed to | Counter |
| | Number of times | Counter |
| | Total number of attempted backups | Counter |
| | Total number of attempted backup deletions | Counter |
| | Total number of failed backup deletions | Counter |
| | Total number of successful backup deletions | Counter |
| | Time taken to complete backup, in seconds | Histogram |
| | Total number of failed backups | Counter |
| | Total number of errors encountered during backup | Gauge |
| | Total number of items backed up | Gauge |
| | Last status of the backup. A value of 1 is success, 0. | Gauge |
| | Last time a backup ran successfully, Unix timestamp in seconds | Gauge |
| | Total number of partially failed backups | Counter |
| | Total number of successful backups | Counter |
| | Size, in bytes, of a backup | Gauge |
| | Current number of existent backups | Gauge |
| | Total number of validation failed backups | Counter |
| | Total number of warned backups | Counter |
| | Total number of CSI attempted volume snapshots | Counter |
| | Total number of CSI failed volume snapshots | Counter |
| | Total number of CSI successful volume snapshots | Counter |
| | Total number of attempted restores | Counter |
| | Total number of failed restores | Counter |
| | Total number of partially failed restores | Counter |
| | Total number of successful restores | Counter |
| | Current number of existent restores | Gauge |
| | Total number of failed restores failing validations | Counter |
| | Total number of attempted volume snapshots | Counter |
| | Total number of failed volume snapshots | Counter |
| | Total number of successful volume snapshots | Counter |
You can view metrics in the OKD web console from the Administrator or Developer perspective. The account that you use must have access to the openshift-adp project.
Navigate to the Observe → Metrics page:
If you are using the Developer perspective, follow these steps:
Select Custom query, or click the Show PromQL link.
Type the query and press Enter.
If you are using the Administrator perspective, type the expression in the text field and select Run Queries.