Prometheus Cluster Monitoring

Overview

OKD ships with a pre-configured and self-updating monitoring stack based on the Prometheus open source project and its wider ecosystem. It provides monitoring of cluster components, a set of alerts to immediately notify the cluster administrator about any occurring problems, and a set of Grafana dashboards.

Figure: Monitoring stack diagram

Highlighted in the diagram above, at the heart of the monitoring stack sits the OKD Cluster Monitoring Operator (CMO), which watches over the deployed monitoring components and resources, and ensures that they are always up to date.

The Prometheus Operator (PO) creates, configures, and manages Prometheus and Alertmanager instances. It also automatically generates monitoring target configurations based on familiar Kubernetes label queries.

In addition to Prometheus and Alertmanager, OKD Monitoring also includes node-exporter and kube-state-metrics. Node-exporter is an agent deployed on every node to collect metrics about it. The kube-state-metrics exporter agent converts Kubernetes objects to metrics consumable by Prometheus.

The targets monitored as part of the cluster monitoring are:

  • Prometheus itself

  • Prometheus-Operator

  • cluster-monitoring-operator

  • Alertmanager cluster instances

  • Kubernetes apiserver

  • kubelets (the kubelet embeds cAdvisor for per container metrics)

  • kube-controllers

  • kube-state-metrics

  • node-exporter

  • etcd (if etcd monitoring is enabled)

All these components are automatically updated.

For more information about the OKD Cluster Monitoring Operator, see the Cluster Monitoring Operator GitHub project.

To deliver updates with guaranteed compatibility, configurability of the OKD Monitoring stack is limited to the explicitly available options.

Configuring OKD cluster monitoring

The OKD Ansible openshift_cluster_monitoring_operator role configures and deploys the Cluster Monitoring Operator using the variables from the inventory file.

Table 1. Ansible variables
Variable Description

openshift_cluster_monitoring_operator_install

Deploy the Cluster Monitoring Operator if true. Otherwise, undeploy. This variable is set to true by default.

openshift_cluster_monitoring_operator_prometheus_storage_capacity

The persistent volume claim size for each of the Prometheus instances. This variable applies only if openshift_cluster_monitoring_operator_prometheus_storage_enabled is set to true. Defaults to 50Gi.

openshift_cluster_monitoring_operator_alertmanager_storage_capacity

The persistent volume claim size for each of the Alertmanager instances. This variable applies only if openshift_cluster_monitoring_operator_alertmanager_storage_enabled is set to true. Defaults to 2Gi.

openshift_cluster_monitoring_operator_node_selector

Set to the desired, existing node selector to ensure that pods are placed onto nodes with specific labels. Defaults to node-role.kubernetes.io/infra=true.

openshift_cluster_monitoring_operator_alertmanager_config

Configures Alertmanager.

openshift_cluster_monitoring_operator_prometheus_storage_enabled

Enable persistent storage of Prometheus' time-series data. This variable is set to false by default.

openshift_cluster_monitoring_operator_alertmanager_storage_enabled

Enable persistent storage of Alertmanager notifications and silences. This variable is set to false by default.
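
As an illustration, the following sketch shows how a few of these variables might be set. It assumes an INI-style inventory with the [OSEv3:vars] group used by openshift-ansible; adjust the values to your environment.

[OSEv3:vars]
# Deploy the Cluster Monitoring Operator (true is the default; false undeploys it)
openshift_cluster_monitoring_operator_install=true

# Persist Prometheus time-series data and Alertmanager data, using the default claim sizes
openshift_cluster_monitoring_operator_prometheus_storage_enabled=true
openshift_cluster_monitoring_operator_prometheus_storage_capacity=50Gi
openshift_cluster_monitoring_operator_alertmanager_storage_enabled=true
openshift_cluster_monitoring_operator_alertmanager_storage_capacity=2Gi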

Monitoring prerequisites

The monitoring stack imposes additional resource requirements. See computing resources recommendations for details.

Installing the monitoring stack

The monitoring stack is installed with OKD by default. To prevent it from being installed, set the following variable to false in the Ansible inventory file:

openshift_cluster_monitoring_operator_install

Persistent storage

Running cluster monitoring with persistent storage means that your metrics are stored to a persistent volume and can survive a pod being restarted or recreated. This is ideal if you require your metrics or alerting data to be guarded against data loss. For production environments, it is highly recommended to configure persistent storage.

Enabling persistent storage

By default, persistent storage is disabled for both Prometheus time-series data and for Alertmanager notifications and silences. You can configure the cluster to persistently store either one of them or both.

  • To enable persistent storage of Prometheus time-series data, set this variable to true in the Ansible inventory file:

    openshift_cluster_monitoring_operator_prometheus_storage_enabled

  • To enable persistent storage of Alertmanager notifications and silences, set this variable to true in the Ansible inventory file:

    openshift_cluster_monitoring_operator_alertmanager_storage_enabled

Determining how much storage is necessary

How much storage you need depends on the number of pods. It is the administrator’s responsibility to dedicate sufficient storage to ensure that the disk does not become full. For information on system requirements for persistent storage, see Capacity Planning for Cluster Monitoring Operator.

Setting persistent storage size

To specify the size of the persistent volume claim for Prometheus and Alertmanager, change these Ansible variables:

  • openshift_cluster_monitoring_operator_prometheus_storage_capacity (default: 50Gi)

  • openshift_cluster_monitoring_operator_alertmanager_storage_capacity (default: 2Gi)

Each of these variables applies only if its corresponding storage_enabled variable is set to true.

Allocating enough persistent volumes

Unless you use dynamically-provisioned storage, you need to make sure you have a persistent volume (PV) ready to be claimed by the PVC, one PV for each replica. Prometheus has two replicas and Alertmanager has three replicas, which amounts to five PVs.
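
When provisioning those PVs statically, each one must exist before the corresponding claim is created. The following is a hypothetical sketch of one such PV backed by NFS; the name, server, and path are placeholders, and any other supported volume plugin can be used in the same way.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: monitoring-pv-01            # one such PV per replica, five in total
spec:
  capacity:
    storage: 50Gi                   # at least as large as the matching claim size
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:                              # placeholder backend; substitute your own
    server: nfs.example.com
    path: /exports/monitoring-pv-01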

Enabling dynamically-provisioned storage

Instead of statically-provisioned storage, you can use dynamically-provisioned storage. See Dynamic Volume Provisioning for details.
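
With dynamic provisioning, a StorageClass takes the place of pre-created PVs. A minimal sketch, assuming a cluster on AWS using the in-tree EBS provisioner (substitute the provisioner that matches your environment):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: monitoring-storage
  annotations:
    # marking the class as default lets the monitoring PVCs bind to it automatically
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2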

Supported configuration

The supported way of configuring OKD Monitoring is by using the options described in Configuring OKD cluster monitoring. Beyond those explicit configuration options, it is possible to inject additional configuration into the stack; however, this is unsupported, because configuration paradigms can change across Prometheus releases, and such cases can be handled gracefully only if all configuration possibilities are controlled.

Explicitly unsupported cases include:

  • Creating additional ServiceMonitor objects in the openshift-monitoring namespace, thereby extending the targets that the cluster monitoring Prometheus instance scrapes. This can cause collisions and load differences that cannot be accounted for, so the Prometheus setup can become unstable.

  • Creating additional ConfigMap objects that cause the cluster monitoring Prometheus instance to include additional alerting and recording rules. Note that this is known to break the deployment if applied, because Prometheus 2.0 ships with a new rule file syntax.

Configuring Alertmanager

The Alertmanager manages incoming alerts, including silencing, inhibition, aggregation, and sending out notifications through methods such as email, PagerDuty, and HipChat.

The default configuration of the OKD Monitoring Alertmanager cluster is:

  global:
    resolve_timeout: 5m
  route:
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: default
    routes:
    - match:
        alertname: DeadMansSwitch
      repeat_interval: 5m
      receiver: deadmansswitch
  receivers:
  - name: default
  - name: deadmansswitch

This configuration can be overridden using the Ansible variable openshift_cluster_monitoring_operator_alertmanager_config from the openshift_cluster_monitoring_operator role.

The following example configures PagerDuty for notifications. See the PagerDuty documentation for Alertmanager to learn how to retrieve the service_key.

openshift_cluster_monitoring_operator_alertmanager_config: |+
  global:
    resolve_timeout: 5m
  route:
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: default
    routes:
    - match:
        alertname: DeadMansSwitch
      repeat_interval: 5m
      receiver: deadmansswitch
    - match:
        service: example-app
      routes:
      - match:
          severity: critical
        receiver: team-frontend-page
  receivers:
  - name: default
  - name: deadmansswitch
  - name: team-frontend-page
    pagerduty_configs:
    - service_key: "<key>"

The sub-route matches only on alerts that have a severity of critical, and sends them via the receiver called team-frontend-page. As the name indicates, someone should be paged for alerts that are critical. See Alertmanager configuration for configuring alerting through different alert receivers.

Dead man’s switch

OKD Monitoring ships with a "Dead man’s switch" to ensure the availability of the monitoring infrastructure.

The "Dead man’s switch" is a simple Prometheus alerting rule that always triggers. The Alertmanager continuously sends notifications for the dead man’s switch to the notification provider that supports this functionality. This also ensures that communication between the Alertmanager and the notification provider is working.

This mechanism is supported by PagerDuty to issue alerts when the monitoring system itself is down. For more information, see Dead man’s switch PagerDuty below.

Grouping alerts

Once alerts are firing against the Alertmanager, it must be configured to know how to logically group them.

For this example a new route will be added to reflect alert routing of the "frontend" team.

First, add new routes. Multiple routes may be added beneath the original route, typically to define the receiver for the notification. The following example uses a matcher to ensure that only alerts coming from the service example-app are used.

global:
  resolve_timeout: 5m
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
  - match:
      alertname: DeadMansSwitch
    repeat_interval: 5m
    receiver: deadmansswitch
  - match:
      service: example-app
    routes:
    - match:
        severity: critical
      receiver: team-frontend-page
receivers:
- name: default
- name: deadmansswitch
- name: team-frontend-page
  pagerduty_configs:
  - service_key: "<key>"

The sub-route matches only on alerts that have a severity of critical, and sends them via the receiver called team-frontend-page. As the name indicates, someone should be paged for alerts that are critical.

Dead man’s switch PagerDuty

PagerDuty supports this mechanism through an integration called Dead Man’s Snitch. Simply add a PagerDuty configuration to the default deadmansswitch receiver. Use the process described above to add this configuration.

Configure Dead Man’s Snitch to page the operator if the Dead man’s switch alert is silent for 15 minutes. With the default Alertmanager configuration, the Dead man’s switch alert is repeated every five minutes. If Dead Man’s Snitch triggers after 15 minutes, it indicates that the notification has been unsuccessful at least twice.
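
As a sketch of what that addition could look like, the deadmansswitch receiver from the default configuration gains a pagerduty_configs entry whose service_key points at the Dead Man’s Snitch integration (the key is a placeholder):

receivers:
- name: default
- name: deadmansswitch
  pagerduty_configs:
  - service_key: "<dead-mans-snitch-integration-key>"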

Alerting rules

OKD Cluster Monitoring ships with the following alerting rules configured by default. Currently you cannot add custom alerting rules.

Some alerting rules have identical names. This is intentional. They are alerting about the same event with different thresholds, with different severity, or both. With the inhibition rules, the lower severity is inhibited when the higher severity is firing.
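
As background, Alertmanager expresses this kind of suppression with inhibit_rules. The following fragment is purely illustrative and is not the configuration shipped with OKD:

inhibit_rules:
# Suppress warning-level notifications for an alert while the critical-level
# alert with the same name is firing in the same namespace.
- source_match:
    severity: critical
  target_match:
    severity: warning
  equal:
  - alertname
  - namespace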

For more details on the alerting rules, see the configuration file.

In the descriptions below, placeholders such as X, Namespace, Pod, Deployment, Job, and Instance stand for values that are filled in from the alert’s labels when it fires.

Alert | Severity | Description

ClusterMonitoringOperatorErrors | critical | Cluster Monitoring Operator is experiencing X% errors.
AlertmanagerDown | critical | Alertmanager has disappeared from Prometheus target discovery.
ClusterMonitoringOperatorDown | critical | ClusterMonitoringOperator has disappeared from Prometheus target discovery.
KubeAPIDown | critical | KubeAPI has disappeared from Prometheus target discovery.
KubeControllerManagerDown | critical | KubeControllerManager has disappeared from Prometheus target discovery.
KubeSchedulerDown | critical | KubeScheduler has disappeared from Prometheus target discovery.
KubeStateMetricsDown | critical | KubeStateMetrics has disappeared from Prometheus target discovery.
KubeletDown | critical | Kubelet has disappeared from Prometheus target discovery.
NodeExporterDown | critical | NodeExporter has disappeared from Prometheus target discovery.
PrometheusDown | critical | Prometheus has disappeared from Prometheus target discovery.
PrometheusOperatorDown | critical | PrometheusOperator has disappeared from Prometheus target discovery.
KubePodCrashLooping | critical | Namespace/Pod (Container) is restarting X times per second.
KubePodNotReady | critical | Namespace/Pod is not ready.
KubeDeploymentGenerationMismatch | critical | Deployment Namespace/Deployment generation mismatch.
KubeDeploymentReplicasMismatch | critical | Deployment Namespace/Deployment replica mismatch.
KubeStatefulSetReplicasMismatch | critical | StatefulSet Namespace/StatefulSet replica mismatch.
KubeStatefulSetGenerationMismatch | critical | StatefulSet Namespace/StatefulSet generation mismatch.
KubeDaemonSetRolloutStuck | critical | Only X% of desired pods scheduled and ready for daemon set Namespace/DaemonSet.
KubeDaemonSetNotScheduled | warning | A number of pods of daemon set Namespace/DaemonSet are not scheduled.
KubeDaemonSetMisScheduled | warning | A number of pods of daemon set Namespace/DaemonSet are running where they are not supposed to run.
KubeCronJobRunning | warning | CronJob Namespace/CronJob is taking more than 1h to complete.
KubeJobCompletion | warning | Job Namespace/Job is taking more than 1h to complete.
KubeJobFailed | warning | Job Namespace/Job failed to complete.
KubeCPUOvercommit | warning | Overcommitted CPU resource requests on pods, cannot tolerate node failure.
KubeMemOvercommit | warning | Overcommitted memory resource requests on pods, cannot tolerate node failure.
KubeCPUOvercommit | warning | Overcommitted CPU resource request quota on namespaces.
KubeMemOvercommit | warning | Overcommitted memory resource request quota on namespaces.
KubeQuotaExceeded | warning | X% usage of Resource in namespace Namespace.
KubePersistentVolumeUsageCritical | critical | The persistent volume claimed by PersistentVolumeClaim in namespace Namespace has X% free.
KubePersistentVolumeFullInFourDays | critical | Based on recent sampling, the persistent volume claimed by PersistentVolumeClaim in namespace Namespace is expected to fill up within four days. Currently X bytes are available.
KubeNodeNotReady | warning | Node has been unready for more than an hour.
KubeVersionMismatch | warning | There are X different versions of Kubernetes components running.
KubeClientErrors | warning | Kubernetes API server client Job/Instance is experiencing X% errors.
KubeClientErrors | warning | Kubernetes API server client Job/Instance is experiencing X errors per second.
KubeletTooManyPods | warning | Kubelet Instance is running X pods, close to the limit of 110.
KubeAPILatencyHigh | warning | The API server has a 99th percentile latency of X seconds for Verb Resource.
KubeAPILatencyHigh | critical | The API server has a 99th percentile latency of X seconds for Verb Resource.
KubeAPIErrorsHigh | critical | API server is erroring for X% of requests.
KubeAPIErrorsHigh | warning | API server is erroring for X% of requests.
KubeClientCertificateExpiration | warning | Kubernetes API certificate is expiring in less than 7 days.
KubeClientCertificateExpiration | critical | Kubernetes API certificate is expiring in less than 1 day.
AlertmanagerConfigInconsistent | critical | Summary: Configuration out of sync. Description: The configuration of the instances of the Alertmanager cluster Service are out of sync.
AlertmanagerFailedReload | warning | Summary: Alertmanager’s configuration reload failed. Description: Reloading Alertmanager’s configuration has failed for Namespace/Pod.
TargetDown | warning | Summary: Targets are down. Description: X% of Job targets are down.
DeadMansSwitch | none | Summary: Alerting DeadMansSwitch. Description: This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional.
NodeDiskRunningFull | warning | Device Device of node-exporter Namespace/Pod is running full within the next 24 hours.
NodeDiskRunningFull | critical | Device Device of node-exporter Namespace/Pod is running full within the next 2 hours.
PrometheusConfigReloadFailed | warning | Summary: Reloading Prometheus' configuration failed. Description: Reloading Prometheus' configuration has failed for Namespace/Pod.
PrometheusNotificationQueueRunningFull | warning | Summary: Prometheus' alert notification queue is running full. Description: Prometheus' alert notification queue is running full for Namespace/Pod.
PrometheusErrorSendingAlerts | warning | Summary: Errors while sending alerts from Prometheus. Description: Errors while sending alerts from Prometheus Namespace/Pod to Alertmanager Alertmanager.
PrometheusErrorSendingAlerts | critical | Summary: Errors while sending alerts from Prometheus. Description: Errors while sending alerts from Prometheus Namespace/Pod to Alertmanager Alertmanager.
PrometheusNotConnectedToAlertmanagers | warning | Summary: Prometheus is not connected to any Alertmanagers. Description: Prometheus Namespace/Pod is not connected to any Alertmanagers.
PrometheusTSDBReloadsFailing | warning | Summary: Prometheus has issues reloading data blocks from disk. Description: Job at Instance had X reload failures over the last four hours.
PrometheusTSDBCompactionsFailing | warning | Summary: Prometheus has issues compacting sample blocks. Description: Job at Instance had X compaction failures over the last four hours.
PrometheusTSDBWALCorruptions | warning | Summary: Prometheus write-ahead log is corrupted. Description: Job at Instance has a corrupted write-ahead log (WAL).
PrometheusNotIngestingSamples | warning | Summary: Prometheus isn’t ingesting samples. Description: Prometheus Namespace/Pod isn’t ingesting samples.
PrometheusTargetScrapesDuplicate | warning | Summary: Prometheus has many samples rejected. Description: Namespace/Pod has many samples rejected due to duplicate timestamps but different values.
EtcdInsufficientMembers | critical | Etcd cluster "Job": insufficient members (X).
EtcdNoLeader | critical | Etcd cluster "Job": member Instance has no leader.
EtcdHighNumberOfLeaderChanges | warning | Etcd cluster "Job": instance Instance has seen X leader changes within the last hour.
EtcdHighNumberOfFailedGRPCRequests | warning | Etcd cluster "Job": X% of requests for GRPC_Method failed on etcd instance Instance.
EtcdHighNumberOfFailedGRPCRequests | critical | Etcd cluster "Job": X% of requests for GRPC_Method failed on etcd instance Instance.
EtcdGRPCRequestsSlow | critical | Etcd cluster "Job": gRPC requests to GRPC_Method are taking X seconds on etcd instance Instance.
EtcdMemberCommunicationSlow | warning | Etcd cluster "Job": member communication with To is taking X seconds on etcd instance Instance.
EtcdHighNumberOfFailedProposals | warning | Etcd cluster "Job": X proposal failures within the last hour on etcd instance Instance.
EtcdHighFsyncDurations | warning | Etcd cluster "Job": 99th percentile fsync durations are X seconds on etcd instance Instance.
EtcdHighCommitDurations | warning | Etcd cluster "Job": 99th percentile commit durations are X seconds on etcd instance Instance.
FdExhaustionClose | warning | Job instance Instance will exhaust its file descriptors soon.
FdExhaustionClose | critical | Job instance Instance will exhaust its file descriptors soon.

Configuring etcd monitoring

If the etcd service does not run correctly, the operation of the whole OKD cluster is at risk. Therefore, it is reasonable to configure monitoring of etcd.

Follow these steps to configure etcd monitoring:

  1. Verify that the monitoring stack is running:

    $ oc -n openshift-monitoring get pods
    NAME                                           READY     STATUS              RESTARTS   AGE
    alertmanager-main-0                            3/3       Running             0          34m
    alertmanager-main-1                            3/3       Running             0          33m
    alertmanager-main-2                            3/3       Running             0          33m
    cluster-monitoring-operator-67b8797d79-sphxj   1/1       Running             0          36m
    grafana-c66997f-pxrf7                          2/2       Running             0          37s
    kube-state-metrics-7449d589bc-rt4mq            3/3       Running             0          33m
    node-exporter-5tt4f                            2/2       Running             0          33m
    node-exporter-b2mrp                            2/2       Running             0          33m
    node-exporter-fd52p                            2/2       Running             0          33m
    node-exporter-hfqgv                            2/2       Running             0          33m
    prometheus-k8s-0                               4/4       Running             1          35m
    prometheus-k8s-1                               0/4       ContainerCreating   0          21s
    prometheus-operator-6c9fddd47f-9jfgk           1/1       Running             0          36m
  2. Open the configuration file for the cluster monitoring stack:

    $ oc -n openshift-monitoring edit configmap cluster-monitoring-config
  3. Under config.yaml: |+, add the etcd section.

    1. If you run etcd in static pods on your master nodes, you can specify the etcd nodes using the selector:

      ...
      data:
        config.yaml: |+
          ...
          etcd:
            targets:
              selector:
                openshift.io/component: etcd
                openshift.io/control-plane: "true"
    2. If you run etcd on separate hosts, you need to specify the nodes using IP addresses:

      ...
      data:
        config.yaml: |+
          ...
          etcd:
            targets:
             ips:
             - "127.0.0.1"
             - "127.0.0.2"
             - "127.0.0.3"

      If the etcd nodes' IP addresses change, you must update this list.

  4. Verify that the etcd service monitor is now running:

    $ oc -n openshift-monitoring get servicemonitor
    NAME                  AGE
    alertmanager          35m
    etcd                  1m
    kube-apiserver        36m
    kube-controllers      36m
    kube-state-metrics    34m
    kubelet               36m
    node-exporter         34m
    prometheus            36m
    prometheus-operator   37m

    It might take up to a minute for the etcd service monitor to start.

  5. Now you can navigate to the Web interface to see more information about the status of etcd monitoring:

    1. To get the URL, run:

      $ oc -n openshift-monitoring get routes
      NAME                HOST/PORT                                                                           PATH      SERVICES            PORT      TERMINATION   WILDCARD
      ...
      prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.msvistun.origin-gce.dev.openshift.com                prometheus-k8s      web       reencrypt     None
    2. Using https, navigate to the URL listed for prometheus-k8s. Log in.

  6. Ensure the user belongs to the cluster-monitoring-view role. This role provides access to viewing cluster monitoring UIs. For example, to add user developer to cluster-monitoring-view, run:

    $ oc adm policy add-cluster-role-to-user cluster-monitoring-view developer
  7. In the Web interface, log in as the user belonging to the cluster-monitoring-view role.

  8. Click Status, then Targets. If you see an etcd entry, etcd is being monitored.

    Figure: etcd target listed, but without certificate authentication

While etcd is being monitored, Prometheus is not yet able to authenticate against etcd, and so cannot gather metrics. To configure Prometheus authentication against etcd:

  1. Copy the /etc/etcd/ca/ca.crt and /etc/etcd/ca/ca.key credential files from the master node to the local machine:

    $ ssh -i gcp-dev/ssh-privatekey cloud-user@35.237.54.213
    ...
  2. Create the openssl.cnf file with these contents:

    [ req ]
    req_extensions = v3_req
    distinguished_name = req_distinguished_name
    [ req_distinguished_name ]
    [ v3_req ]
    basicConstraints = CA:FALSE
    keyUsage = nonRepudiation, keyEncipherment, digitalSignature
    extendedKeyUsage=serverAuth, clientAuth
  3. Generate the etcd.key private key file:

    $ openssl genrsa -out etcd.key 2048
  4. Generate the etcd.csr certificate signing request file:

    $ openssl req -new -key etcd.key -out etcd.csr -subj "/CN=etcd" -config openssl.cnf
  5. Generate the etcd.crt certificate file:

    $ openssl x509 -req -in etcd.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out etcd.crt -days 365 -extensions v3_req -extfile openssl.cnf
  6. Put the credentials into the format used by OKD:

    cat <<-EOF > etcd-cert-secret.yaml
    apiVersion: v1
    data:
      etcd-client-ca.crt: "$(cat ca.crt | base64 --wrap=0)"
      etcd-client.crt: "$(cat etcd.crt | base64 --wrap=0)"
      etcd-client.key: "$(cat etcd.key | base64 --wrap=0)"
    kind: Secret
    metadata:
      name: kube-etcd-client-certs
      namespace: openshift-monitoring
    type: Opaque
    EOF

    This creates the etcd-cert-secret.yaml file.

  7. Apply the credentials file to the cluster:

    $ oc apply -f etcd-cert-secret.yaml
  8. Visit the "Targets" page of the Web interface again. Verify that etcd is now being correctly monitored. It might take several minutes for changes to take effect.

    Figure: etcd monitoring working

Accessing Prometheus, Alertmanager, and Grafana

OKD Monitoring ships with a Prometheus instance for cluster monitoring and a central Alertmanager cluster. In addition to Prometheus and Alertmanager, OKD Monitoring also includes a Grafana instance as well as pre-built dashboards for cluster monitoring troubleshooting.

You can get the addresses for accessing Prometheus, Alertmanager, and Grafana web UIs by running:

$ oc -n openshift-monitoring get routes
NAME                HOST/PORT                                                     ...
alertmanager-main   alertmanager-main-openshift-monitoring.apps.url.openshift.com ...
grafana             grafana-openshift-monitoring.apps.url.openshift.com           ...
prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.url.openshift.com    ...

Make sure to prepend https:// to these addresses. You cannot access the web UIs over an unencrypted connection.

Authentication is performed against the OKD identity and uses the same credentials or means of authentication as used elsewhere in OKD. You must use a role that has read access to all namespaces, such as the cluster-monitoring-view cluster role.

Update and compatibility guarantees

To deliver updates with guaranteed compatibility, configurability of the OKD Monitoring stack is limited to the explicitly available options. This topic describes which types of configuration and customization are unsupported, as well as known pitfalls of misusing the resources provided by OKD Monitoring. All configuration options described in this topic are explicitly supported.

Modifying OKD monitoring resources

The OKD Monitoring stack ensures that its resources are always in the state it expects them to be in. If they are modified, OKD Monitoring resets them. It is nonetheless possible to pause this behavior by setting the paused field in the AppVersion called openshift-monitoring. Pausing the OKD Monitoring stack stops all future updates and allows the otherwise managed resources to be modified. Modifying resources in an uncontrolled manner causes undefined behavior during updates.

To ensure compatible and functioning updates, the paused field must be set to false on upgrades.

Using resources created by OKD monitoring

OKD Monitoring creates a number of resources. These resources are not meant to be used by anything else, because there are no guarantees about their backward compatibility. For example, a ClusterRole called prometheus-k8s is created with very specific permissions that exist solely so that the cluster monitoring Prometheus pods can access the resources they require. None of these resources carries any compatibility guarantee going forward. While some of them may incidentally contain the information needed, for example, for RBAC purposes, they can change in any upcoming release with no backward compatibility.

If similar Role or ClusterRole objects are needed, we recommend creating a new object that has exactly the permissions required for the case at hand, rather than reusing the resources created and maintained by OKD Monitoring.
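
A hypothetical sketch of such a purpose-built object: a standalone ClusterRole that grants only the read access a workload needs, instead of reusing prometheus-k8s (the name and rules are placeholders to be tailored to the actual use case):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: my-app-metrics-reader       # placeholder name
rules:
# Grant exactly what the workload needs and nothing more
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]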