OKD generates a large amount of data, such as performance metrics and logs from both the platform and the workloads running on it. As an administrator, you can use various tools to collect and analyze all the data available. What follows is an outline of best practices for system engineers, architects, and administrators configuring the observability stack.
Unless explicitly stated, the material in this document refers to both Edge and Core deployments.
The monitoring stack uses the following components:
Prometheus collects and analyzes metrics from OKD components and from workloads, if configured to do so.
Alertmanager is a component of Prometheus that handles routing, grouping, and silencing of alerts.
Thanos handles long-term storage of metrics.
For single-node OpenShift clusters, disable Alertmanager and Thanos because the cluster sends all metrics to the hub cluster for analysis and retention.
Depending on your system, you can have hundreds of available measurements.
Consider the following key metrics:
etcd response times
API response times
Pod restarts and scheduling
Resource usage
OVN health
Overall cluster operator health
If a metric is important, set up an alert for it.
You can check the metrics that are available for querying.
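For example, one way to list every available metric name, assuming you are logged in as a user who can query cluster metrics and that jq is installed, is to query the Thanos Querier route:

$ TOKEN=$(oc whoami -t)
$ HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.status.ingress[0].host}')
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/label/__name__/values" | jq -r '.data[]'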
Using the OKD console, you can explore the following queries in the metrics query browser.
The URL for the console is https://<OpenShift Console FQDN>/monitoring/query-browser. You can get the OpenShift console FQDN from the console route.
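For example, assuming the default console route in the openshift-console namespace:

$ oc whoami --show-console

Alternatively, read the host name directly from the console route:

$ oc get route console -n openshift-console -o jsonpath='{.status.ingress[0].host}'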
| Metric | Query |
|---|---|
| CPU % requests by node | |
| Overall cluster CPU % utilization | |
| Memory % requests by node | |
| Overall cluster memory % utilization | |
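The following PromQL expressions are illustrative sketches for these measurements, built on standard kube-state-metrics and node-exporter metrics; they are not necessarily the exact queries used by the console dashboards:

# CPU requests as a percentage of allocatable CPU, per node
sum by (node) (kube_pod_container_resource_requests{resource="cpu"}) / sum by (node) (kube_node_status_allocatable{resource="cpu"}) * 100

# Overall cluster CPU % utilization
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Memory requests as a percentage of allocatable memory, per node
sum by (node) (kube_pod_container_resource_requests{resource="memory"}) / sum by (node) (kube_node_status_allocatable{resource="memory"}) * 100

# Overall cluster memory % utilization
100 * (1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes))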
| Metric | Query |
|---|---|
| Leader elections | |
| Network latency | |
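As an illustration, the following sketches use the standard etcd metrics for these measurements; adjust the time ranges to suit your environment:

# Leader elections: etcd leader changes during the last 15 minutes
increase(etcd_server_leader_changes_seen_total[15m])

# Network latency: 99th percentile round-trip time between etcd peers
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))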
| Metric | Query |
|---|---|
| Degraded operators | |
| Total degraded operators per cluster | |
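For example, the cluster_operator_conditions metric exposed by the Cluster Version Operator covers both measurements. The following queries are sketches; the cluster label is only meaningful when you query aggregated metrics, for example through RHACM observability:

# Degraded operators
cluster_operator_conditions{condition="Degraded"} == 1

# Total degraded operators per cluster
count by (cluster) (cluster_operator_conditions{condition="Degraded"} == 1)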
By default, Prometheus does not back up saved metrics with persistent storage. If you restart the Prometheus pods, all metrics data are lost. You must configure the monitoring stack to use the back-end storage that is available on the platform. To meet the high IO demands of Prometheus, use local storage.
For smaller clusters, you can use the Local Storage Operator to provide persistent storage for Prometheus. Red Hat OpenShift Data Foundation (ODF), which deploys a Ceph cluster for block, file, and object storage, is suitable for larger clusters.
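For example, a cluster-monitoring-config ConfigMap similar to the following attaches a persistent volume to Prometheus. This is a sketch: the storage class name local-sc, the retention period, and the volume size are placeholders for values from your environment.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          storageClassName: local-sc   # placeholder: class created by the Local Storage Operator
          resources:
            requests:
              storage: 100Gi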
To keep system resource requirements low on a single-node OpenShift cluster, do not provision back-end storage for the monitoring stack. Such clusters forward all metrics to the hub cluster, where you can provision a third-party monitoring platform.
OKD clusters at the edge must keep the footprint of the platform components to a minimum. The following procedure is an example of how to configure a single-node OpenShift cluster or a node at the far edge of the network with a small monitoring footprint.
For environments that use Red Hat Advanced Cluster Management (RHACM), you have enabled the Observability service.
The hub cluster is running Red Hat OpenShift Data Foundation (ODF).
Create a ConfigMap CR, and save it as monitoringConfigMap.yaml, as in the following example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      enabled: false
    telemeterClient:
      enabled: false
    prometheusK8s:
      retention: 24h
Apply the ConfigMap CR by running the following command on the single-node OpenShift cluster:
$ oc apply -f monitoringConfigMap.yaml
Create a Namespace CR, and save it as monitoringNamespace.yaml, as in the following example:
apiVersion: v1
kind: Namespace
metadata:
  name: open-cluster-management-observability
Apply the Namespace CR by running the following command on the hub cluster:
$ oc apply -f monitoringNamespace.yaml
Create an ObjectBucketClaim CR, and save it as monitoringObjectBucketClaim.yaml, as in the following example:
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: multi-cloud-observability
  namespace: open-cluster-management-observability
spec:
  storageClassName: openshift-storage.noobaa.io
  generateBucketName: acm-multi
Apply the ObjectBucketClaim CR by running the following command on the hub cluster:
$ oc apply -f monitoringObjectBucketClaim.yaml
Create a Secret CR, and save it as monitoringSecret.yaml, as in the following example:
apiVersion: v1
kind: Secret
metadata:
  name: multiclusterhub-operator-pull-secret
  namespace: open-cluster-management-observability
stringData:
  .dockerconfigjson: 'PULL_SECRET'
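Replace PULL_SECRET with the contents of a valid pull secret. If the hub cluster already has the RHACM pull secret in the open-cluster-management namespace, one way to retrieve its contents is, for example:

$ oc extract secret/multiclusterhub-operator-pull-secret -n open-cluster-management --to=-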
Apply the Secret CR by running the following command on the hub cluster:
$ oc apply -f monitoringSecret.yaml
Get the keys for the NooBaa service and the back-end bucket name from the hub cluster by running the following commands:
$ NOOBAA_ACCESS_KEY=$(oc get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_ACCESS_KEY_ID|@base64d')
$ NOOBAA_SECRET_KEY=$(oc get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_SECRET_ACCESS_KEY|@base64d')
$ OBJECT_BUCKET=$(oc get objectbucketclaim -n open-cluster-management-observability multi-cloud-observability -o json | jq -r .spec.bucketName)
Create a Secret CR for bucket storage and save it as monitoringBucketSecret.yaml, as in the following example:
apiVersion: v1
kind: Secret
metadata:
  name: thanos-object-storage
  namespace: open-cluster-management-observability
type: Opaque
stringData:
  thanos.yaml: |
    type: s3
    config:
      bucket: ${OBJECT_BUCKET}
      endpoint: s3.openshift-storage.svc
      insecure: true
      access_key: ${NOOBAA_ACCESS_KEY}
      secret_key: ${NOOBAA_SECRET_KEY}
Apply the Secret CR by running the following command on the hub cluster:
$ oc apply -f monitoringBucketSecret.yaml
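Note that oc apply does not expand the shell variables in the file. If you keep the ${OBJECT_BUCKET}, ${NOOBAA_ACCESS_KEY}, and ${NOOBAA_SECRET_KEY} placeholders in monitoringBucketSecret.yaml, one option is to render them before applying, for example with envsubst if it is installed:

# envsubst reads environment variables, so export them first
$ export OBJECT_BUCKET NOOBAA_ACCESS_KEY NOOBAA_SECRET_KEY
$ envsubst < monitoringBucketSecret.yaml | oc apply -f -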
Create the MultiClusterObservability CR and save it as monitoringMultiClusterObservability.yaml, as in the following example:
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  advanced:
    retentionConfig:
      blockDuration: 2h
      deleteDelay: 48h
      retentionInLocal: 24h
      retentionResolutionRaw: 3d
  enableDownsampling: false
  observabilityAddonSpec:
    enableMetrics: true
    interval: 300
  storageConfig:
    alertmanagerStorageSize: 10Gi
    compactStorageSize: 100Gi
    metricObjectStorage:
      key: thanos.yaml
      name: thanos-object-storage
    receiveStorageSize: 25Gi
    ruleStorageSize: 10Gi
    storeStorageSize: 25Gi
Apply the MultiClusterObservability CR by running the following command on the hub cluster:
$ oc apply -f monitoringMultiClusterObservability.yaml
Check the routes and pods in the namespace to validate that the services have deployed on the hub cluster by running the following command:
$ oc get routes,pods -n open-cluster-management-observability
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
route.route.openshift.io/alertmanager alertmanager-open-cluster-management-observability.cloud.example.com /api/v2 alertmanager oauth-proxy reencrypt/Redirect None
route.route.openshift.io/grafana grafana-open-cluster-management-observability.cloud.example.com grafana oauth-proxy reencrypt/Redirect None (1)
route.route.openshift.io/observatorium-api observatorium-api-open-cluster-management-observability.cloud.example.com observability-observatorium-api public passthrough/None None
route.route.openshift.io/rbac-query-proxy rbac-query-proxy-open-cluster-management-observability.cloud.example.com rbac-query-proxy https reencrypt/Redirect None
NAME READY STATUS RESTARTS AGE
pod/observability-alertmanager-0 3/3 Running 0 1d
pod/observability-alertmanager-1 3/3 Running 0 1d
pod/observability-alertmanager-2 3/3 Running 0 1d
pod/observability-grafana-685b47bb47-dq4cw 3/3 Running 0 1d
<...snip...>
pod/observability-thanos-store-shard-0-0 1/1 Running 0 1d
pod/observability-thanos-store-shard-1-0 1/1 Running 0 1d
pod/observability-thanos-store-shard-2-0 1/1 Running 0 1d
| 1 | A dashboard is accessible at the grafana route listed. You can use this to view metrics across all managed clusters. |
For more information on observability in Red Hat Advanced Cluster Management, see Observability.
OKD includes a large number of alert rules, which can change from release to release.
To review all the alert rules in a cluster, run the following command:
$ oc get cm -n openshift-monitoring prometheus-k8s-rulefiles-0 -o yaml
Rules can include a description and provide a link to additional information and mitigation steps.
For example, see the rule for etcdHighFsyncDurations:
- alert: etcdHighFsyncDurations
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations
      are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
    summary: etcd cluster 99th percentile fsync durations are too high.
  expr: |
    histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
    > 1
  for: 10m
  labels:
    severity: critical
You can view alerts in the OKD console. However, an administrator must configure an external receiver to forward the alerts to. OKD supports the following receiver types, and an example receiver configuration follows the list:
PagerDuty: a third-party incident response platform.
Webhook: an arbitrary API endpoint that receives an alert through a POST request and can take any necessary action.
Email: sends an email to a designated address.
Slack: sends a notification to either a Slack channel or an individual user.
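For the platform monitoring stack, the Alertmanager configuration is stored in the alertmanager-main secret in the openshift-monitoring namespace. The following is a minimal sketch that routes critical alerts to a webhook receiver; the receiver name and URL are placeholders:

global:
  resolve_timeout: 5m
route:
  group_by: [alertname, namespace]
  receiver: default
  routes:
  - matchers:
    - severity = "critical"
    receiver: incident-webhook
receivers:
- name: default
- name: incident-webhook
  webhook_configs:
  - url: 'https://alert-receiver.example.com/api/alerts'   # placeholder endpoint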
By default, OKD does not collect metrics for application workloads. You can configure a cluster to collect workload metrics.
You have defined endpoints to gather workload metrics on the cluster.
Create a ConfigMap CR and save it as monitoringConfigMap.yaml, as in the following example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true (1)
| 1 | Set to true to enable workload monitoring. |
Apply the ConfigMap CR by running the following command:
$ oc apply -f monitoringConfigMap.yaml
Create a ServiceMonitor CR, and save it as monitoringServiceMonitor.yaml, as in the following example:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: ui
  name: myapp
  namespace: myns
spec:
  endpoints: (1)
  - interval: 30s
    port: ui-http
    scheme: http
    path: /healthz (2)
  selector:
    matchLabels:
      app: ui
| 1 | Use endpoints to define workload metrics. |
| 2 | Prometheus scrapes the path /metrics by default. You can define a custom path here. |
Apply the ServiceMonitor CR by running the following command:
$ oc apply -f monitoringServiceMonitor.yaml
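To verify that metrics collection for user-defined projects is running, check the monitoring components in the openshift-user-workload-monitoring namespace, for example:

$ oc -n openshift-user-workload-monitoring get pods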
Prometheus scrapes the /metrics path by default, but you can define a custom path, as shown in the previous example.
It is up to the application vendor to expose an endpoint for scraping and to decide which metrics are relevant.
You can enable alerts for user workloads on a cluster.
Create a ConfigMap CR, and save it as monitoringConfigMap.yaml, as in the following example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true (1)
# ...
| 1 | Set to true to enable workload monitoring. |
Apply the ConfigMap CR by running the following command:
$ oc apply -f monitoringConfigMap.yaml
Create a YAML file for alerting rules, monitoringAlertRule.yaml, as in the following example:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alert
  namespace: myns
spec:
  groups:
  - name: example
    rules:
    - alert: InternalErrorsAlert
      expr: flask_http_request_total{status="500"} > 0
# ...
Apply the alert rule by running the following command:
$ oc apply -f monitoringAlertRule.yaml
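To confirm that the rule was created, you can list the PrometheusRule objects in the example namespace:

$ oc -n myns get prometheusrules myapp-alert

When the expression evaluates to true, the alert fires and appears in the OKD console.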