OKD generates a large amount of data, such as performance metrics and logs from both the platform and the workloads running on it. As an administrator, you can use various tools to collect and analyze all the available data. What follows is an outline of best practices for system engineers, architects, and administrators configuring the observability stack.
Unless explicitly stated, the material in this document refers to both Edge and Core deployments.
The monitoring stack uses the following components:
Prometheus collects and analyzes metrics from OKD components and from workloads, if configured to do so.
Alertmanager is a component of the monitoring stack that handles routing, grouping, and silencing of the alerts that Prometheus raises.
Thanos handles long-term storage of metrics.
For a single-node OpenShift cluster, you should disable Alertmanager and Thanos because the cluster sends all metrics to the hub cluster for analysis and retention.
Depending on your system, there can be hundreds of available measurements.
Here are some key metrics that you should pay attention to:
etcd response times
API response times
Pod restarts and scheduling
Resource usage
OVN health
Overall cluster operator health
A good rule to follow is that if you decide that a metric is important, there should be an alert for it.
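For example, a minimal expression that you might alert on for pod restarts could look like the following. This is an illustrative sketch only; the kube_pod_container_status_restarts_total metric comes from kube-state-metrics, and the threshold is arbitrary:
# Illustrative: fire when a container restarts more than 3 times within one hour
increase(kube_pod_container_status_restarts_total[1h]) > 3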
You can list the available metrics by querying the Prometheus API.
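For example, assuming curl is available in the prometheus container of the prometheus-k8s-0 pod and jq is installed on your workstation, you can dump the metric metadata as follows:
$ oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl -qsk http://localhost:9090/api/v1/metadata | jq '.data'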
The following tables show some queries that you can explore in the metrics query browser using the OKD console.
The URL for the console is https://<OpenShift Console FQDN>/monitoring/query-browser. You can get the OpenShift Console FQDN from the console route.
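For example, one way to retrieve the host, assuming the route is named console in the openshift-console namespace (the default), is:
$ oc get routes -n openshift-console console -o jsonpath='{.spec.host}'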
Metric | Query |
---|---|
CPU % requests by node | |
Overall cluster CPU % utilization | |
Memory % requests by node | |
Overall cluster memory % utilization | |
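The following PromQL expressions are examples of how you might query these values; they assume the standard kube-state-metrics and node-exporter metric names (kube_pod_container_resource_requests, kube_node_status_allocatable, node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_memory_MemTotal_bytes) and are not the only way to compute them:
# CPU % requests by node (example)
sum by (node) (kube_pod_container_resource_requests{resource="cpu"}) / sum by (node) (kube_node_status_allocatable{resource="cpu"}) * 100
# Overall cluster CPU % utilization (example)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))
# Memory % requests by node (example)
sum by (node) (kube_pod_container_resource_requests{resource="memory"}) / sum by (node) (kube_node_status_allocatable{resource="memory"}) * 100
# Overall cluster memory % utilization (example)
100 * (1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes))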
Metric | Query |
---|---|
Combined | |
Metric | Query |
---|---|
Leader elections | |
Network latency | |
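For example, leader elections and peer network latency can be approximated with expressions such as the following, which use the etcd_server_leader_changes_seen_total and etcd_network_peer_round_trip_time_seconds_bucket metrics:
# Leader elections seen over the last 15 minutes (example)
increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m])
# 99th percentile etcd peer round-trip time (example)
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m]))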
Metric | Query |
---|---|
Degraded operators | |
Total degraded operators per cluster | |
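For example, queries such as the following use the cluster_operator_conditions metric; the cluster label is typically only meaningful when querying aggregated metrics on a hub cluster:
# Operators currently reporting the Degraded condition (example)
cluster_operator_conditions{condition="Degraded"} == 1
# Total degraded operators per cluster (example)
count by (cluster) (cluster_operator_conditions{condition="Degraded"} == 1)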
By default, Prometheus does not back up saved metrics with persistent storage. If you restart the Prometheus pods, all metrics data is lost. You should configure the monitoring stack to use the back-end storage that is available on the platform. To meet the high IO demands of Prometheus, use local storage.
For Telco core clusters, you can use the Local Storage Operator for persistent storage for Prometheus.
Red Hat OpenShift Data Foundation (ODF), which deploys a Ceph cluster for block, file, and object storage, is also a suitable candidate for a Telco core cluster.
To keep system resource requirements low on a RAN single-node OpenShift or far edge cluster, you should not provision backend storage for the monitoring stack. Such clusters forward all metrics to the hub cluster, where you can provision a third-party monitoring platform.
Single-node OpenShift at the edge keeps the footprint of the platform components to a minimum. The following procedure is an example of how you can configure a single-node OpenShift node with a small monitoring footprint.
For environments that use Red Hat Advanced Cluster Management (RHACM), you have enabled the Observability service.
The hub cluster is running Red Hat OpenShift Data Foundation (ODF).
Create a ConfigMap CR, and save it as monitoringConfigMap.yaml, as in the following example:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
alertmanagerMain:
enabled: false
telemeterClient:
enabled: false
prometheusK8s:
retention: 24h
On the single-node OpenShift cluster, apply the ConfigMap CR by running the following command:
$ oc apply -f monitoringConfigMap.yaml
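As an optional check, verify that the Alertmanager pods are no longer running in the openshift-monitoring namespace:
$ oc get pods -n openshift-monitoring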
Create a Namespace CR, and save it as monitoringNamespace.yaml, as in the following example:
apiVersion: v1
kind: Namespace
metadata:
name: open-cluster-management-observability
On the hub cluster, apply the Namespace CR by running the following command:
$ oc apply -f monitoringNamespace.yaml
Create an ObjectBucketClaim CR, and save it as monitoringObjectBucketClaim.yaml, as in the following example:
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
name: multi-cloud-observability
namespace: open-cluster-management-observability
spec:
storageClassName: openshift-storage.noobaa.io
generateBucketName: acm-multi
On the hub cluster, apply the ObjectBucketClaim CR by running the following command:
$ oc apply -f monitoringObjectBucketClaim.yaml
Create a Secret CR, and save it as monitoringSecret.yaml, as in the following example:
apiVersion: v1
kind: Secret
metadata:
name: multiclusterhub-operator-pull-secret
namespace: open-cluster-management-observability
stringData:
.dockerconfigjson: 'PULL_SECRET'
On the hub cluster, apply the Secret CR by running the following command:
$ oc apply -f monitoringSecret.yaml
Get the keys for the NooBaa service and the backend bucket name from the hub cluster by running the following commands:
$ NOOBAA_ACCESS_KEY=$(oc get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_ACCESS_KEY_ID|@base64d')
$ NOOBAA_SECRET_KEY=$(oc get secret noobaa-admin -n openshift-storage -o json | jq -r '.data.AWS_SECRET_ACCESS_KEY|@base64d')
$ OBJECT_BUCKET=$(oc get objectbucketclaim -n open-cluster-management-observability multi-cloud-observability -o json | jq -r .spec.bucketName)
Create a Secret CR for bucket storage and save it as monitoringBucketSecret.yaml, as in the following example:
apiVersion: v1
kind: Secret
metadata:
name: thanos-object-storage
namespace: open-cluster-management-observability
type: Opaque
stringData:
thanos.yaml: |
type: s3
config:
bucket: ${OBJECT_BUCKET}
endpoint: s3.openshift-storage.svc
insecure: true
access_key: ${NOOBAA_ACCESS_KEY}
secret_key: ${NOOBAA_SECRET_KEY}
On the hub cluster, apply the Secret CR by running the following command:
$ oc apply -f monitoringBucketSecret.yaml
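Note that oc apply does not expand the ${OBJECT_BUCKET}, ${NOOBAA_ACCESS_KEY}, and ${NOOBAA_SECRET_KEY} shell variables inside the file. One way to substitute them first, assuming the envsubst utility is available, is:
$ envsubst < monitoringBucketSecret.yaml | oc apply -f -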
Create the MultiClusterObservability CR and save it as monitoringMultiClusterObservability.yaml, as in the following example:
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
name: observability
spec:
advanced:
retentionConfig:
blockDuration: 2h
deleteDelay: 48h
retentionInLocal: 24h
retentionResolutionRaw: 3d
enableDownsampling: false
observabilityAddonSpec:
enableMetrics: true
interval: 300
storageConfig:
alertmanagerStorageSize: 10Gi
compactStorageSize: 100Gi
metricObjectStorage:
key: thanos.yaml
name: thanos-object-storage
receiveStorageSize: 25Gi
ruleStorageSize: 10Gi
storeStorageSize: 25Gi
On the hub cluster, apply the MultiClusterObservability CR by running the following command:
$ oc apply -f monitoringMultiClusterObservability.yaml
Check the routes and pods in the namespace to validate that the services have deployed on the hub cluster by running the following command:
$ oc get routes,pods -n open-cluster-management-observability
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
route.route.openshift.io/alertmanager alertmanager-open-cluster-management-observability.cloud.example.com /api/v2 alertmanager oauth-proxy reencrypt/Redirect None
route.route.openshift.io/grafana grafana-open-cluster-management-observability.cloud.example.com grafana oauth-proxy reencrypt/Redirect None (1)
route.route.openshift.io/observatorium-api observatorium-api-open-cluster-management-observability.cloud.example.com observability-observatorium-api public passthrough/None None
route.route.openshift.io/rbac-query-proxy rbac-query-proxy-open-cluster-management-observability.cloud.example.com rbac-query-proxy https reencrypt/Redirect None
NAME READY STATUS RESTARTS AGE
pod/observability-alertmanager-0 3/3 Running 0 1d
pod/observability-alertmanager-1 3/3 Running 0 1d
pod/observability-alertmanager-2 3/3 Running 0 1d
pod/observability-grafana-685b47bb47-dq4cw 3/3 Running 0 1d
<...snip…>
pod/observability-thanos-store-shard-0-0 1/1 Running 0 1d
pod/observability-thanos-store-shard-1-0 1/1 Running 0 1d
pod/observability-thanos-store-shard-2-0 1/1 Running 0 1d
1 | A dashboard is accessible at the Grafana route listed. You can use this to view metrics across all managed clusters. |
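For example, you can retrieve the Grafana host directly from the route listed above:
$ oc get route grafana -n open-cluster-management-observability -o jsonpath='{.spec.host}'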
For more information on observability in Red Hat Advanced Cluster Management, see Observability.
OKD includes a large number of alert rules, which can change from release to release.
To review all of the alert rules in a cluster, run the following command:
$ oc get cm -n openshift-monitoring prometheus-k8s-rulefiles-0 -o yaml
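The output is long. To list only the alert names, you can filter it, for example:
$ oc get cm -n openshift-monitoring prometheus-k8s-rulefiles-0 -o yaml | grep 'alert:'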
Rules can include a description and provide a link to additional information and mitigation steps.
For example, this is the rule for etcdHighFsyncDurations:
- alert: etcdHighFsyncDurations
annotations:
description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations
are {{ $value }}s on etcd instance {{ $labels.instance }}.'
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
summary: etcd cluster 99th percentile fsync durations are too high.
expr: |
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 1
for: 10m
labels:
severity: critical
You can view alerts in the OKD console. However, an administrator should configure an external receiver to forward the alerts to. OKD supports the following receiver types:
PagerDuty: a third-party incident response platform
Webhook: an arbitrary API endpoint that receives an alert through a POST request and can take any necessary action
Email: sends an email to a designated address
Slack: sends a notification to either a Slack channel or an individual user
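As an illustration only, a minimal Alertmanager configuration with a webhook receiver might look like the following; the receiver name and URL are placeholders, and on OKD this configuration is typically provided through the alertmanager-main secret in the openshift-monitoring namespace:
route:
  receiver: default-webhook
receivers:
- name: default-webhook
  webhook_configs:
  - url: 'https://example.com/alert-handler'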
By default, OKD does not collect metrics for application workloads. You can configure a cluster to collect workload metrics.
You have defined endpoints to gather workload metrics on the cluster.
Create a ConfigMap CR and save it as monitoringConfigMap.yaml, as in the following example:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
enableUserWorkload: true (1)
1 | Set to true to enable workload monitoring. |
Apply the ConfigMap CR by running the following command:
$ oc apply -f monitoringConfigMap.yaml
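After the change takes effect, the user workload monitoring components start in the openshift-user-workload-monitoring namespace. You can verify this, for example, by running:
$ oc get pods -n openshift-user-workload-monitoring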
Create a ServiceMonitor CR, and save it as monitoringServiceMonitor.yaml, as in the following example:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
labels:
app: ui
name: myapp
namespace: myns
spec:
endpoints: (1)
- interval: 30s
port: ui-http
scheme: http
path: /healthz (2)
selector:
matchLabels:
app: ui
1 | Use endpoints to define workload metrics. |
2 | Prometheus scrapes the path /metrics by default. You can define a custom path here. |
Apply the ServiceMonitor CR by running the following command:
$ oc apply -f monitoringServiceMonitor.yaml
Prometheus scrapes the path /metrics by default; however, you can define a custom path.
It is up to the vendor of the application to expose this endpoint for scraping, with the metrics that they deem relevant.
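To confirm that the monitor was created, you can check for the ServiceMonitor resource in the workload namespace, using the names from the example above:
$ oc get servicemonitor -n myns myapp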
You can enable alerts for user workloads on a cluster.
Create a ConfigMap CR, and save it as monitoringConfigMap.yaml, as in the following example:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
enableUserWorkload: true (1)
# ...
1 | Set to true to enable workload monitoring. |
Apply the ConfigMap CR by running the following command:
$ oc apply -f monitoringConfigMap.yaml
Create a YAML file for alerting rules, and save it as monitoringAlertRule.yaml, as in the following example:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: myapp-alert
namespace: myns
spec:
groups:
- name: example
rules:
- alert: InternalErrorsAlert
expr: flask_http_request_total{status="500"} > 0
# ...
Apply the alert rule by running the following command:
$ oc apply -f monitoringAlertRule.yaml
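To confirm that the rule was created, you can check for the PrometheusRule resource in the workload namespace, using the names from the example above, and then watch for the alert in the Alerting section of the OKD console:
$ oc get prometheusrule -n myns myapp-alert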