
Overview

The kubelet exposes metrics that can be collected and stored in back-ends by Heapster.

As an OKD administrator, you can view a cluster’s metrics from all containers and components in one user interface. These metrics are also used by horizontal pod autoscalers in order to determine when and how to scale.

This topic describes using Hawkular Metrics as a metrics engine which stores the data persistently in a Cassandra database. When this is configured, CPU, memory and network-based metrics are viewable from the OKD web console and are available for use by horizontal pod autoscalers.

Heapster retrieves a list of all nodes from the master server, then contacts each node individually through the /stats endpoint. From there, Heapster scrapes the metrics for CPU, memory and network usage, then exports them into Hawkular Metrics.

The storage volume metrics available on the kubelet are not available through the /stats endpoint, but are available through the /metrics endpoint. See OKD Metrics via Prometheus for detailed information.

Browsing individual pods in the web console displays separate sparkline charts for memory and CPU. The time range displayed is selectable, and these charts automatically update every 30 seconds. If there are multiple containers on the pod, then you can select a specific container to display its metrics.

If resource limits are defined for your project, then you can also see a donut chart for each pod. The donut chart displays usage against the resource limit. For example: 145 Available of 200 MiB, with the donut chart showing 55 MiB Used.

For more information about the metrics integration, refer to the Origin Metrics GitHub project.

Before You Begin

If your OKD installation was originally performed on a version earlier than v1.0.8, even if it has since been updated to a newer version, follow the instructions for node certificates outlined in Updating Master and Node Certificates. If the node certificate does not contain the IP address of the node, then Heapster fails to retrieve any metrics.

An Ansible playbook is available to deploy and upgrade cluster metrics. You should familiarize yourself with the Advanced Installation section. This provides information for preparing to use Ansible and includes information about configuration. Parameters are added to the Ansible inventory file to configure various areas of cluster metrics.

The following sections describe the various areas and the parameters that can be added to the Ansible inventory file in order to modify the defaults:

Metrics Project

The components for cluster metrics must be deployed to the openshift-infra project in order for autoscaling to work. Horizontal pod autoscalers specifically use this project to discover the Heapster service and use it to retrieve metrics. The metrics project can be changed by adding openshift_metrics_project to the inventory file.

Metrics Data Storage

You can store the metrics data to either persistent storage or to a temporary pod volume.

Persistent Storage

Running OKD cluster metrics with persistent storage means that your metrics are stored to a persistent volume and are able to survive a pod being restarted or recreated. This is ideal if you require your metrics data to be guarded from data loss. For production environments it is highly recommended to configure persistent storage for your metrics pods.

The size requirement of the Cassandra storage is dependent on the number of pods. It is the administrator’s responsibility to ensure that the size requirements are sufficient for their setup and to monitor usage to ensure that the disk does not become full. The size of the persistent volume claim is specified with the openshift_metrics_cassandra_pvc_size Ansible variable, which is set to 10 GB by default.

If you would like to use dynamically provisioned persistent volumes, set the openshift_metrics_cassandra_storage_type variable to dynamic in the inventory file.

Capacity Planning for Cluster Metrics

After running the openshift_metrics Ansible role, the output of oc get pods should resemble the following:

 # oc get pods -n openshift-infra
 NAME                                READY             STATUS      RESTARTS          AGE
 hawkular-cassandra-1-l5y4g          1/1               Running     0                 17h
 hawkular-metrics-1t9so              1/1               Running     0                 17h
 heapster-febru                      1/1               Running     0                 17h

OKD metrics are stored using the Cassandra database, which is deployed with settings of openshift_metrics_cassandra_limits_memory: 2G; this value could be adjusted further based upon the available memory as determined by the Cassandra start script. This value should cover most OKD metrics installations, but using environment variables you can modify the MAX_HEAP_SIZE along with heap new generation size, HEAP_NEWSIZE, in the Cassandra Dockerfile prior to deploying cluster metrics.

By default, metrics data is stored for seven days. After seven days, Cassandra begins to purge the oldest metrics data. Metrics data for deleted pods and projects is not automatically purged; it is only removed once the data is more than seven days old.

Example 1. Data Accumulated by 10 Nodes and 1000 Pods

In a test scenario including 10 nodes and 1000 pods, a 24 hour period accumulated 2.5 GB of metrics data. Therefore, the capacity planning formula for metrics data in this scenario is:

(((2.5 × 10^9) ÷ 1000) ÷ 24) ÷ 10^6 = ~0.104 MB/hour per pod.

Example 2. Data Accumulated by 120 Nodes and 10000 Pods

In a test scenario including 120 nodes and 10000 pods, a 24 hour period accumulated 11.4 GB of metrics data. Therefore, the capacity planning formula for metrics data in this scenario is:

(((11.410 × 10^9) ÷ 10000) ÷ 24) ÷ 10^6 = ~0.0475 MB/hour per pod.

| | 1000 pods | 10000 pods |
| --- | --- | --- |
| Cassandra storage data accumulated over 24 hours (default metrics parameters) | 2.5 GB | 11.4 GB |
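The per-pod expressions in the two examples can be evaluated with a few lines of Python. This is only a sketch of the arithmetic: the function name is illustrative, the 24-hour totals are the ones quoted in the examples, and each total is divided by the pod count of its own scenario.

```python
def per_pod_rate_mb_per_hour(total_bytes: float, pods: int, hours: float = 24) -> float:
    """Average metrics growth per pod, in MB per hour (1 MB = 10**6 bytes)."""
    return total_bytes / pods / hours / 1e6


# 24-hour totals and pod counts from the two test scenarios.
scenarios = {
    "1000 pods": (2.5e9, 1000),
    "10000 pods": (11.41e9, 10000),
}

for name, (total_bytes, pods) in scenarios.items():
    rate = per_pod_rate_mb_per_hour(total_bytes, pods)
    print(f"{name}: {rate:.4f} MB/hour per pod")
```

Multiplying the per-pod rate by an expected pod count and a retention period gives a first estimate for a differently sized cluster, although real growth depends on the workload being monitored.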

These two test cases are presented on the following graph:

[Figure: 1000 pods versus 10000 pods monitored during 24 hours]

If the default value of 7 days for openshift_metrics_duration and 30 seconds for openshift_metrics_resolution are preserved, then weekly storage requirements for the Cassandra pod would be:

| | 1000 pods | 10000 pods |
| --- | --- | --- |
| Cassandra storage data accumulated over seven days (default metrics parameters) | 20 GB | 90 GB |

In the previous table, an additional 10 percent was added to the expected storage space as a buffer for unexpected monitored pod usage.
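The weekly figures can be approximated from the 24-hour totals as follows. This sketch assumes the weekly value is simply the daily total multiplied by the retention period plus the 10 percent buffer; the function name is illustrative.

```python
def weekly_storage_gb(daily_gb: float, days: int = 7, buffer: float = 0.10) -> float:
    """Storage accumulated over the retention period, plus a safety buffer."""
    return daily_gb * days * (1 + buffer)


# Daily totals (GB) measured in the two test scenarios.
for pods, daily_gb in ((1000, 2.5), (10000, 11.4)):
    print(f"{pods} pods: {weekly_storage_gb(daily_gb):.2f} GB")
```

The results are roughly 19 GB and 88 GB, which the table appears to round up to 20 GB and 90 GB.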

If the Cassandra persisted volume runs out of sufficient space, then data loss occurs.

For cluster metrics to work with persistent storage, ensure that the persistent volume has the ReadWriteOnce access mode. If this mode is not active, then the persistent volume claim cannot locate the persistent volume, and Cassandra fails to start.

To use persistent storage with the metric components, ensure that a persistent volume of sufficient size is available. The creation of persistent volume claims is handled by the OpenShift Ansible openshift_metrics role.

OKD metrics also supports dynamically-provisioned persistent volumes. To use this feature with OKD metrics, it is necessary to set the value of openshift_metrics_cassandra_storage_type to dynamic. You can use EBS, GCE, and Cinder storage back-ends to dynamically provision persistent volumes.

For information on configuring the performance and scaling the cluster metrics pods, see the Scaling Cluster Metrics topic.

Table 1. Cassandra Database storage requirements based on number of nodes/pods in the cluster
| Number of Nodes | Number of Pods | Cassandra storage growth speed | Cassandra storage growth per day | Cassandra storage growth per week |
| --- | --- | --- | --- | --- |
| 210 | 10500 | 500 MB per hour | 15 GB | 75 GB |
| 990 | 11000 | 1 GB per hour | 30 GB | 210 GB |

In the above calculation, approximately 20 percent of the expected size was added as overhead to ensure that the storage requirements do not exceed the calculated value.

If the METRICS_DURATION and METRICS_RESOLUTION values are kept at the default (7 days and 15 seconds respectively), it is safe to plan Cassandra storage size requirements for a week, as in the values above.

Because OKD metrics uses the Cassandra database as a datastore for metrics data, if USE_PERSISTANT_STORAGE=true is set during the metrics setup process, the persistent volume is placed on network storage, with NFS as the default. However, using network storage in combination with Cassandra is not recommended, as per the Cassandra documentation.

Known Issues and Limitations

Testing found that the Heapster metrics component is capable of handling up to 25,000 pods. If the number of pods exceeds that figure, Heapster begins to fall behind in metrics processing, which can result in metrics graphs no longer appearing. Work is ongoing to increase the number of pods that Heapster can gather metrics on, as well as upstream development of alternate metrics-gathering solutions.

Non-Persistent Storage

Running OKD cluster metrics with non-persistent storage means that any stored metrics are deleted when the pod is deleted. While it is much easier to run cluster metrics with non-persistent data, running with non-persistent data does come with the risk of permanent data loss. However, metrics can still survive a container being restarted.

In order to use non-persistent storage, you must set the openshift_metrics_cassandra_storage_type variable to emptydir in the inventory file.

When using non-persistent storage, metrics data is written to /var/lib/origin/openshift.local.volumes/pods on the node where the Cassandra pod runs. Ensure /var has enough free space to accommodate metrics storage.

Metrics Ansible Role

The OKD Ansible openshift_metrics role configures and deploys all of the metrics components using the variables from the Configuring Ansible inventory file.

Specifying Metrics Ansible Variables

The openshift_metrics role included with OpenShift Ansible defines the tasks to deploy cluster metrics. The following is a list of role variables that can be added to your inventory file if it is necessary to override them.

Table 2. Ansible Variables
Variable Description

openshift_metrics_install_metrics

Deploy metrics if true. Otherwise, undeploy.

openshift_metrics_start_cluster

Start the metrics cluster after deploying the components.

openshift_metrics_image_prefix

The prefix for the component images. For example, with openshift/origin-metrics-cassandra:v1.3, set the prefix to openshift/origin-.

openshift_metrics_image_version

The version for the component images. For example, with openshift/origin-metrics-cassandra:v1.3, set version as v1.3.

openshift_metrics_startup_timeout

The time, in seconds, to wait until Hawkular Metrics and Heapster start up before attempting a restart.

openshift_metrics_duration

The number of days to store metrics before they are purged.

openshift_metrics_resolution

The frequency that metrics are gathered. Defined as a number and time identifier: seconds (s), minutes (m), hours (h).

openshift_metrics_cassandra_pvc_prefix

The persistent volume claim prefix created for Cassandra. A serial number is appended to the prefix starting from 1.

openshift_metrics_cassandra_pvc_size

The persistent volume claim size for each of the Cassandra nodes.

openshift_metrics_cassandra_storage_class_name

If you want to explicitly set the storage class, you must not set openshift_metrics_cassandra_storage_type to emptydir or dynamic.

openshift_metrics_cassandra_storage_type

Use emptydir for ephemeral storage (for testing); pv for persistent volumes, which need to be created before the installation; or dynamic for dynamic persistent volumes.

openshift_metrics_cassandra_replicas

The number of Cassandra nodes for the metrics stack. This value dictates the number of Cassandra replication controllers.

openshift_metrics_cassandra_limits_memory

The memory limit for the Cassandra pod. For example, a value of 2Gi would limit Cassandra to 2 GB of memory. This value could be further adjusted by the start script based on available memory of the node on which it is scheduled.

openshift_metrics_cassandra_limits_cpu

The CPU limit for the Cassandra pod. For example, a value of 4000m (4000 millicores) would limit Cassandra to 4 CPUs.

openshift_metrics_cassandra_requests_memory

The amount of memory to request for Cassandra pod. For example, a value of 2Gi would request 2 GB of memory.

openshift_metrics_cassandra_requests_cpu

The CPU request for the Cassandra pod. For example, a value of 4000m (4000 millicores) would request 4 CPUs.

openshift_metrics_cassandra_storage_group

The supplemental storage group to use for Cassandra.

openshift_metrics_cassandra_nodeselector

Set to the desired, existing node selector to ensure that pods are placed onto nodes with specific labels. For example, {"region":"infra"}.

openshift_metrics_hawkular_ca

An optional certificate authority (CA) file used to sign the Hawkular certificate.

openshift_metrics_hawkular_cert

The certificate file used for re-encrypting the route to Hawkular metrics. The certificate must contain the host name used by the route. If unspecified, the default router certificate is used.

openshift_metrics_hawkular_key

The key file used with the Hawkular certificate.

openshift_metrics_hawkular_limits_memory

The amount of memory to limit the Hawkular pod. For example, a value of 2Gi would limit the Hawkular pod to 2 GB of memory. This value could be further adjusted by the start script based on available memory of the node on which it is scheduled.

openshift_metrics_hawkular_limits_cpu

The CPU limit for the Hawkular pod. For example, a value of 4000m (4000 millicores) would limit the Hawkular pod to 4 CPUs.

openshift_metrics_hawkular_replicas

The number of replicas for Hawkular metrics.

openshift_metrics_hawkular_requests_memory

The amount of memory to request for the Hawkular pod. For example, a value of 2Gi would request 2 GB of memory.

openshift_metrics_hawkular_requests_cpu

The CPU request for the Hawkular pod. For example, a value of 4000m (4000 millicores) would request 4 CPUs.

openshift_metrics_hawkular_nodeselector

Set to the desired, existing node selector to ensure that pods are placed onto nodes with specific labels. For example, {"region":"infra"}.

openshift_metrics_heapster_allowed_users

A comma-separated list of CN to accept. By default, this is set to allow the OpenShift service proxy to connect. Add system:master-proxy to the list when overriding in order to allow horizontal pod autoscaling to function properly.

openshift_metrics_heapster_limits_memory

The amount of memory to limit the Heapster pod. For example, a value of 2Gi would limit the Heapster pod to 2 GB of memory.

openshift_metrics_heapster_limits_cpu

The CPU limit for the Heapster pod. For example, a value of 4000m (4000 millicores) would limit the Heapster pod to 4 CPUs.

openshift_metrics_heapster_requests_memory

The amount of memory to request for Heapster pod. For example, a value of 2Gi would request 2 GB of memory.

openshift_metrics_heapster_requests_cpu

The CPU request for the Heapster pod. For example, a value of 4000m (4000 millicores) would request 4 CPUs.

openshift_metrics_heapster_standalone

Deploy only Heapster, without the Hawkular Metrics and Cassandra components.

openshift_metrics_heapster_nodeselector

Set to the desired, existing node selector to ensure that pods are placed onto nodes with specific labels. For example, {"region":"infra"}.

openshift_metrics_install_hawkular_agent

Set to true to install the Hawkular OpenShift Agent (HOSA). Set to false to remove the HOSA from an installation. HOSA can be used to collect custom metrics from your pods. This component is currently in Technology Preview and is not installed by default.

openshift_metrics_hawkular_hostname

Set when executing the openshift_metrics Ansible role, since it uses the host name for the Hawkular Metrics route. This value should correspond to a fully qualified domain name.
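The openshift_metrics_duration and openshift_metrics_resolution entries in Table 2 together bound how many data points are retained for each metric series. The following sketch parses the resolution format described in the table (a number followed by s, m, or h) and assumes one data point per series per resolution interval. The function names are illustrative, and the parser may be stricter or looser than what the role itself accepts.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def resolution_to_seconds(value: str) -> int:
    """Convert a resolution such as '30s', '5m', or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported resolution: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit]


def samples_per_series(duration_days: int, resolution: str) -> int:
    """Data points kept per series, assuming one point per resolution interval."""
    return duration_days * 24 * 3600 // resolution_to_seconds(resolution)


print(samples_per_series(7, "30s"))  # 20160
```

Halving the resolution interval doubles the number of retained data points, which is worth keeping in mind when sizing the Cassandra volume.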

The Hawkular OpenShift Agent on OKD is a Technology Preview feature only.

See Compute Resources for further discussion on how to specify requests and limits.
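As a rough illustration of the request and limit values used in Table 2, the sketch below converts a CPU quantity such as 4000m into cores and a memory quantity such as 2Gi into bytes. It handles only the forms that appear in this topic (millicores and binary suffixes) and is not a complete quantity parser; the function names are illustrative.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def cpu_to_cores(quantity: str) -> float:
    """Convert a CPU quantity ('4000m' or '4') to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity with a binary suffix ('2Gi') to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


print(cpu_to_cores("4000m"))   # 4.0
print(memory_to_bytes("2Gi"))  # 2147483648
```

Note that the Gi suffix is binary, so 2Gi is 2 × 2^30 bytes, slightly more than 2 × 10^9 bytes.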

If you are using persistent storage with Cassandra, it is the administrator’s responsibility to set a sufficient disk size for the cluster using the openshift_metrics_cassandra_pvc_size variable. It is also the administrator’s responsibility to monitor disk usage to make sure that it does not become full.

Data loss results if the Cassandra persisted volume runs out of sufficient space.

All of the other variables are optional and allow for greater customization. For instance, if you have a custom install in which the Kubernetes master is not available under https://kubernetes.default.svc:443 you can specify the value to use instead with the openshift_metrics_master_url parameter. To deploy a specific version of the metrics components, modify the openshift_metrics_image_version variable.

It is highly recommended to not use latest for the openshift_metrics_image_version. The latest version corresponds to the very latest version available and can cause issues if it brings in a newer version not meant to function on the version of OKD you are currently running.

Using Secrets

The OKD Ansible openshift_metrics role auto-generates self-signed certificates for use between its components and generates a re-encrypting route to expose the Hawkular Metrics service. This route is what allows the web console to access the Hawkular Metrics service.

In order for the browser running the web console to trust the connection through this route, it must trust the route’s certificate. This can be accomplished by providing your own certificates signed by a trusted Certificate Authority. The openshift_metrics role allows you to specify your own certificates, which it then uses when creating the route.

The router’s default certificate is used if you do not provide your own.

Providing Your Own Certificates

To provide your own certificate, which is used by the re-encrypting route, you can set the openshift_metrics_hawkular_cert, openshift_metrics_hawkular_key, and openshift_metrics_hawkular_ca variables in your inventory file.

The hawkular-metrics.pem value needs to contain the certificate in its .pem format. You may also need to provide the certificate for the Certificate Authority which signed this pem file via the hawkular-metrics-ca.cert secret.

For more information, see the re-encryption route documentation.

Deploying the Metric Components

Because deploying and configuring all the metric components is handled with OKD Ansible, you can deploy everything in one step.

The following examples show you how to deploy metrics with and without persistent storage using the default parameters.

The host that you run the Ansible playbook on must have at least 75MiB of free memory per host in the inventory.

In accordance with upstream Kubernetes rules, metrics can be collected only on the default interface of eth0.

Example 3. Deploying with Persistent Storage

The following command sets the Hawkular Metrics route to use hawkular-metrics.example.com and is deployed using persistent storage.

You must have a persistent volume of sufficient size available.

$ ansible-playbook [-i </path/to/inventory>] <OPENSHIFT_ANSIBLE_DIR>/playbooks/openshift-metrics/config.yml \
   -e openshift_metrics_install_metrics=True \
   -e openshift_metrics_hawkular_hostname=hawkular-metrics.example.com \
   -e openshift_metrics_cassandra_storage_type=pv

Example 4. Deploying without Persistent Storage

The following command sets the Hawkular Metrics route to use hawkular-metrics.example.com and deploys without persistent storage.

$ ansible-playbook [-i </path/to/inventory>] <OPENSHIFT_ANSIBLE_DIR>/playbooks/openshift-metrics/config.yml \
   -e openshift_metrics_install_metrics=True \
   -e openshift_metrics_hawkular_hostname=hawkular-metrics.example.com

Because this is being deployed without persistent storage, metric data loss can occur.

Metrics Diagnostics

There are some diagnostics for metrics to assist in evaluating the state of the metrics stack. To execute diagnostics for metrics:

$ oc adm diagnostics MetricsApiProxy

Setting the Metrics Public URL

The OKD web console uses the data coming from the Hawkular Metrics service to display its graphs. The URL for accessing the Hawkular Metrics service must be configured with the metricsPublicURL option in the master webconsole-config configmap file. This URL corresponds to the route created with the openshift_metrics_hawkular_hostname inventory variable used during the deployment of the metrics components.

You must be able to resolve the openshift_metrics_hawkular_hostname from the browser accessing the console.

For example, if your openshift_metrics_hawkular_hostname corresponds to hawkular-metrics.example.com, then you must make the following change in the webconsole-config configmap file:

clusterInfo:
  ...
  metricsPublicURL: "https://hawkular-metrics.example.com/hawkular/metrics"

Once you have updated and saved the webconsole-config configmap file, you must restart your OKD instance.

When your OKD server is back up and running, metrics are displayed on the pod overview pages.

If you are using self-signed certificates, remember that the Hawkular Metrics service is hosted under a different host name and uses different certificates than the console. You may need to explicitly open a browser tab to the value specified in metricsPublicURL and accept that certificate.

To avoid this issue, use certificates which are configured to be acceptable by your browser.

Accessing Hawkular Metrics Directly

To access and manage metrics more directly, use the Hawkular Metrics API.

When accessing Hawkular Metrics from the API, you are only able to perform reads. Writing metrics is disabled by default. If you want individual users to also be able to write metrics, you must set the openshift_metrics_hawkular_user_write_access variable to true.

However, it is recommended to use the default configuration and only have metrics enter the system via Heapster. If write access is enabled, any user can write metrics to the system, which can affect performance and cause Cassandra disk usage to unpredictably increase.

The Hawkular Metrics documentation covers how to use the API, but there are a few differences when dealing with the version of Hawkular Metrics configured for use on OKD:

OKD Projects and Hawkular Tenants

Hawkular Metrics is a multi-tenanted application. It is configured so that a project in OKD corresponds to a tenant in Hawkular Metrics.

As such, when accessing metrics for a project named MyProject you must set the Hawkular-Tenant header to MyProject.

There is also a special tenant named _system which contains system level metrics. Accessing it requires either cluster-reader or cluster-admin privileges.

Authorization

The Hawkular Metrics service authenticates the user against OKD to determine if the user has access to the project it is trying to access.

Hawkular Metrics accepts a bearer token from the client and verifies that token with the OKD server using a SubjectAccessReview. If the user has proper read privileges for the project, they are allowed to read the metrics for that project. For the _system tenant, the user requesting to read from this tenant must have cluster-reader permission.

When accessing the Hawkular Metrics API, you must pass a bearer token in the Authorization header.
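The two requirements above (the Hawkular-Tenant header and the bearer token) can be sketched with the Python standard library. The request is only built and inspected here, not sent. The base URL is the metricsPublicURL example used earlier in this topic, while the /metrics path, the token value, and the function name are placeholders chosen for illustration; check the Hawkular Metrics documentation for the endpoints you need.

```python
import urllib.request


def hawkular_request(base_url: str, path: str, project: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a read request for one project's metrics."""
    headers = {
        "Hawkular-Tenant": project,          # tenant == OKD project name
        "Authorization": f"Bearer {token}",  # token is verified against OKD
        "Accept": "application/json",
    }
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder values; the path and token are assumptions for illustration.
request = hawkular_request(
    "https://hawkular-metrics.example.com/hawkular/metrics",
    "/metrics",
    project="MyProject",
    token="XXXXXXXXXXXXXXXXX",
)
print(request.full_url)
for name, value in request.header_items():
    print(f"{name}: {value}")
```

To query the _system tenant instead, pass _system as the project and use a token that belongs to a user with cluster-reader permission.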

Accessing Heapster Directly

Heapster is configured to only be accessible via the API proxy. Accessing Heapster requires either cluster-reader or cluster-admin privileges.

For example, to access the Heapster validate page, you need to access it using something similar to:

$ curl -H "Authorization: Bearer XXXXXXXXXXXXXXXXX" \
       -X GET https://${KUBERNETES_MASTER}/api/v1/proxy/namespaces/openshift-infra/services/https:heapster:/validate

For more information about Heapster and how to access its APIs, refer to the Heapster project.

Scaling OKD Cluster Metrics Pods

Information about scaling cluster metrics capabilities is available in the Scaling and Performance Guide.

Integration with Aggregated Logging

Hawkular Alerts must be connected to the Aggregated Logging’s Elasticsearch to react to log events. By default, Hawkular tries to find Elasticsearch in its default location (namespace logging, pod logging-es) at every boot. If Aggregated Logging is installed after Hawkular, the Hawkular Metrics pod might need to be restarted in order to recognize the new Elasticsearch server. The Hawkular boot log provides a clear indication if the integration could not be properly configured, with messages like:

Failed to import the logging certificate into the store. Continuing, but the
logging integration might fail.

or

Could not get the logging secret! Status code: 000. The Hawkular Alerts
integration with Logging might not work properly.

This feature is available from version v1.7. You can confirm if logging is available by checking the log for an entry like:

Retrieving the Logging's CA and adding to the trust store, if Logging is
available.

Cleanup

You can remove everything deployed by the OKD Ansible openshift_metrics role by performing the following steps:

$ ansible-playbook [-i </path/to/inventory>] <OPENSHIFT_ANSIBLE_DIR>/playbooks/openshift-metrics/config.yml \
   -e openshift_metrics_install_metrics=False

Prometheus on OKD

Prometheus is a stand-alone, open source systems monitoring and alerting toolkit. You can use Prometheus to visualize metrics and alerts for OKD system resources.

Prometheus on OKD is a Technology Preview feature only.

Setting Prometheus Role Variables

The Prometheus role creates:

  • The openshift-metrics namespace.

  • Prometheus clusterrolebinding and service account.

  • Prometheus pod with Prometheus behind OAuth proxy, Alertmanager, and Alert Buffer as a stateful set.

  • Prometheus and prometheus-alerts ConfigMaps.

  • Prometheus and Prometheus Alerts services and direct routes.

Prometheus deployment is enabled by default. To uninstall it, set openshift_prometheus_state to absent. For example:

# openshift_prometheus_state=absent

Set the following role variables to install and configure Prometheus.

Table 3. Prometheus Variables
Variable Description

openshift_prometheus_state

The default value is present, which results in the installation or update of Prometheus. To uninstall Prometheus, set to absent.

openshift_prometheus_namespace

Project namespace where the components are deployed. Default set to openshift-metrics. For example, openshift_prometheus_namespace=${USER_PROJECT}.

openshift_prometheus_node_selector

Selector for the nodes on which Prometheus is deployed. Default set to node-role.kubernetes.io/infra=true.

openshift_prometheus_storage_kind

Set to create PV for Prometheus. For example, openshift_prometheus_storage_kind=nfs.

openshift_prometheus_alertmanager_storage_kind

Set to create PV for Alertmanager. For example, openshift_prometheus_alertmanager_storage_kind=nfs.

openshift_prometheus_alertbuffer_storage_kind

Set to create PV for Alert Buffer. For example, openshift_prometheus_alertbuffer_storage_kind=nfs.

openshift_prometheus_storage_type

Set to create PVC for Prometheus. For example, openshift_prometheus_storage_type=pvc.

openshift_prometheus_alertmanager_storage_type

Set to create PVC for Alertmanager. For example, openshift_prometheus_alertmanager_storage_type=pvc.

openshift_prometheus_alertbuffer_storage_type

Set to create PVC for Alert Buffer. For example, openshift_prometheus_alertbuffer_storage_type=pvc.

openshift_prometheus_additional_rules_file

Additional Prometheus rules file. Set to null by default.

Deploying Prometheus Using Ansible Installer

The host that you run the Ansible playbook on must have at least 75MiB of free memory per host in the inventory.

The Ansible Installer is the default method of deploying Prometheus.

Run the playbook:

$ ansible-playbook -vvv -i ${INVENTORY_FILE} playbooks/openshift-prometheus/config.yml

Make sure you have nodes labeled with node-role.kubernetes.io/infra=true, which is the default value for openshift_prometheus_node_selector. If you want to use other node selectors, please see Deploy Using Node-Selector.

Additional Methods for Deploying Prometheus

Deploy Using Node-Selector

Label the node on which you want to deploy Prometheus:

# oc label node/$NODE ${KEY}=${VALUE}

Deploy Prometheus with Ansible and container resources:

# Set node selector for prometheus
openshift_prometheus_node_selector={"${KEY}":"${VALUE}"}

Run the playbook:

$ ansible-playbook -vvv -i ${INVENTORY_FILE} playbooks/openshift-prometheus/config.yml

Deploy Using a Non-default Namespace

Identify your namespace:

# Set non-default openshift_prometheus_namespace
openshift_prometheus_namespace=${USER_PROJECT}

Run the playbook:

$ ansible-playbook -vvv -i ${INVENTORY_FILE} playbooks/openshift-prometheus/config.yml

Accessing the Prometheus Web UI

The Prometheus server automatically exposes a Web UI at localhost:9090. You can access the Prometheus Web UI with the view role.

Configuring Prometheus for OKD

Prometheus Storage Related Variables

With each Prometheus component (including Prometheus, Alertmanager, Alert Buffer, and OAuth proxy) you can set the PV claim by setting the corresponding role variable, for example:

openshift_prometheus_storage_type: pvc
openshift_prometheus_alertmanager_pvc_name: alertmanager
openshift_prometheus_alertbuffer_pvc_size: 10G
openshift_prometheus_pvc_access_modes: [ReadWriteOnce]

Prometheus Alert Rules File Variable

You can add an external file with alert rules by setting the path to an additional rules variable:

openshift_prometheus_additional_rules_file: <PATH>

The file must follow the Prometheus Alert rules format. The following example sets a rule to send an alert when one of the cluster nodes is down:

groups:
- name: example-rules
  interval: 30s # defaults to global interval
  rules:
  - alert: Node Down
    expr: up{job="kubernetes-nodes"} == 0
    for: 10m # (1)
    annotations:
      miqTarget: "ContainerNode"
      severity: "HIGH"
      message: "{{ '{{' }}{{ '$labels.instance' }}{{ '}}' }} is down"

(1) The optional for value specifies the amount of time Prometheus waits before it sends an alert for this element. For example, if you set 10m, Prometheus waits 10 minutes after it encounters this issue before sending an alert.
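The effect of the for value can be illustrated with a small simulation. This is a simplified model written for this topic, not Prometheus code: it assumes the rule is evaluated at fixed intervals, that the alert becomes pending the first time the expression is true, and that it fires once the expression has stayed true for the whole hold period. The function name is illustrative.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(samples: Iterable[Tuple[int, bool]], hold_seconds: int) -> Optional[int]:
    """Return the evaluation timestamp at which the alert fires, or None.

    samples: (timestamp_seconds, condition_is_true) pairs in time order,
    one per rule evaluation.
    """
    active_since = None
    for timestamp, condition in samples:
        if not condition:
            active_since = None  # condition cleared: the pending state resets
            continue
        if active_since is None:
            active_since = timestamp  # alert becomes pending
        if timestamp - active_since >= hold_seconds:
            return timestamp  # condition held long enough: alert fires
    return None


# Node is down from t=0 and evaluated every 30 seconds for 20 minutes.
always_down = [(t, True) for t in range(0, 1201, 30)]
print(first_firing_time(always_down, hold_seconds=600))  # 600

# Node briefly recovers at t=300, so the 10-minute wait starts over at t=330.
flapping = [(t, t != 300) for t in range(0, 1201, 30)]
print(first_firing_time(flapping, hold_seconds=600))  # 930
```

The second case shows why a for value helps suppress alerts for nodes that recover quickly: any evaluation where the expression is false restarts the wait.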

Prometheus Variables to Control Resource Limits

With each Prometheus component (including Prometheus, Alertmanager, Alert Buffer, and OAuth proxy) you can specify CPU and memory limits and requests by setting the corresponding role variable, for example:

openshift_prometheus_alertmanager_limits_memory: 1Gi
openshift_prometheus_oauth_proxy_cpu_requests: 100m

Once openshift_metrics_project: openshift-infra is installed, metrics can be gathered from the http://${POD_IP}:7575/metrics endpoint.

OKD Metrics via Prometheus

The state of a system can be gauged by the metrics that it emits. This section describes current and proposed metrics that identify the health of the storage subsystem and cluster.

Current Metrics

This section describes the metrics currently emitted from Kubernetes’s storage subsystem.

Cloud Provider API Call Metrics

This metric reports the time and count of success and failures of all cloudprovider API calls. These metrics include aws_attach_time and aws_detach_time. The type of emitted metrics is a histogram, and hence, Prometheus also generates sum, count, and bucket metrics for these metrics.

Example summary of cloudprovider metrics from GCE:
cloudprovider_gce_api_request_duration_seconds { request = "instance_list"}
cloudprovider_gce_api_request_duration_seconds { request = "disk_insert"}
cloudprovider_gce_api_request_duration_seconds { request = "disk_delete"}
cloudprovider_gce_api_request_duration_seconds { request = "attach_disk"}
cloudprovider_gce_api_request_duration_seconds { request = "detach_disk"}
cloudprovider_gce_api_request_duration_seconds { request = "list_disk"}

Example summary of cloudprovider metrics from AWS:
cloudprovider_aws_api_request_duration_seconds { request = "attach_volume"}
cloudprovider_aws_api_request_duration_seconds { request = "detach_volume"}
cloudprovider_aws_api_request_duration_seconds { request = "create_tags"}
cloudprovider_aws_api_request_duration_seconds { request = "create_volume"}
cloudprovider_aws_api_request_duration_seconds { request = "delete_volume"}
cloudprovider_aws_api_request_duration_seconds { request = "describe_instance"}
cloudprovider_aws_api_request_duration_seconds { request = "describe_volume"}

Volume Operation Metrics

These metrics report the time taken by a storage operation once it has started. They track operation time at the plug-in level, but do not include the time taken by the goroutine to run or by the operation to be picked up from the internal queue. These metrics are a type of histogram.

Example summary of available volume operation metrics
storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_attach" }
storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_detach" }
storage_operation_duration_seconds { volume_plugin = "glusterfs", operation_name = "volume_provision" }
storage_operation_duration_seconds { volume_plugin = "gce-pd", operation_name = "volume_delete" }
storage_operation_duration_seconds { volume_plugin = "vsphere", operation_name = "volume_mount" }
storage_operation_duration_seconds { volume_plugin = "iscsi" , operation_name = "volume_unmount" }
storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "unmount_device" }
storage_operation_duration_seconds { volume_plugin = "cinder" , operation_name = "verify_volumes_are_attached" }
storage_operation_duration_seconds { volume_plugin = "<n/a>" , operation_name = "verify_volumes_are_attached_per_node" }

See Volume operation metrics for more information.

Volume Stats Metrics

These metrics typically report usage stats of PVCs (such as used space versus available space). The type of metrics emitted is gauge.

Table 4. Volume Stats Metrics
| Metric | Type | Labels/tags |
| --- | --- | --- |
| volume_stats_capacityBytes | Gauge | namespace=<persistentvolumeclaim-namespace> persistentvolumeclaim=<persistentvolumeclaim-name> persistentvolume=<persistentvolume-name> |
| volume_stats_usedBytes | Gauge | namespace=<persistentvolumeclaim-namespace> persistentvolumeclaim=<persistentvolumeclaim-name> persistentvolume=<persistentvolume-name> |
| volume_stats_availableBytes | Gauge | namespace=<persistentvolumeclaim-namespace> persistentvolumeclaim=<persistentvolumeclaim-name> persistentvolume=<persistentvolume-name> |
| volume_stats_InodesFree | Gauge | namespace=<persistentvolumeclaim-namespace> persistentvolumeclaim=<persistentvolumeclaim-name> persistentvolume=<persistentvolume-name> |
| volume_stats_Inodes | Gauge | namespace=<persistentvolumeclaim-namespace> persistentvolumeclaim=<persistentvolumeclaim-name> persistentvolume=<persistentvolume-name> |
| volume_stats_InodesUsed | Gauge | namespace=<persistentvolumeclaim-namespace> persistentvolumeclaim=<persistentvolumeclaim-name> persistentvolume=<persistentvolume-name> |
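One way to use the volume stats gauges is to compare used bytes against capacity for each claim. The sketch below assumes the gauge values have already been collected into dictionaries keyed by namespace and claim name; the sample values, the claim names, the 80 percent threshold, and the function name are all hypothetical.

```python
from typing import Dict, List, Tuple

PvcKey = Tuple[str, str]  # (namespace, persistentvolumeclaim)


def pvcs_over_threshold(
    used: Dict[PvcKey, float],
    capacity: Dict[PvcKey, float],
    threshold: float = 0.8,
) -> List[Tuple[PvcKey, float]]:
    """Return claims whose used/capacity ratio is at or above the threshold."""
    result = []
    for key, capacity_bytes in capacity.items():
        if capacity_bytes <= 0 or key not in used:
            continue  # skip claims without usable data
        ratio = used[key] / capacity_bytes
        if ratio >= threshold:
            result.append((key, ratio))
    # Fullest claims first.
    return sorted(result, key=lambda item: item[1], reverse=True)


# Hypothetical values for volume_stats_capacityBytes and volume_stats_usedBytes.
capacity = {("openshift-infra", "metrics-cassandra-1"): 10e9, ("demo", "data"): 5e9}
used = {("openshift-infra", "metrics-cassandra-1"): 9e9, ("demo", "data"): 1e9}

for (namespace, claim), ratio in pvcs_over_threshold(used, capacity):
    print(f"{namespace}/{claim}: {ratio:.0%} used")
```

A check like this is one way to watch the Cassandra volume, since data loss occurs if that volume runs out of space.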

Undeploying Prometheus

To undeploy Prometheus, run:

$ ansible-playbook -vvv -i ${INVENTORY_FILE} playbooks/openshift-prometheus/config.yml -e openshift_prometheus_state=absent