Prometheus queries for virtual resources

OKD Virtualization provides metrics that you can use to monitor the consumption of cluster infrastructure resources, including vCPU, network, storage, and guest memory swapping. You can also use metrics to query live migration status.

Use the OKD monitoring dashboard to query virtualization metrics.

Prerequisites

To use the vCPU metric, the schedstats=enable kernel argument must be applied to the MachineConfig object. This kernel argument enables scheduler statistics used for debugging and performance tuning and adds a minor additional load to the scheduler. For more information, see Adding kernel arguments to nodes.
For guest memory swapping queries to return data, memory swapping must be enabled on the virtual guests.

Querying metrics

The OKD monitoring dashboard enables you to run Prometheus Query Language (PromQL) queries to examine metrics visualized on a plot. This functionality provides information about the state of a cluster and any user-defined workloads that you are monitoring.

As a cluster administrator, you can query metrics for all core OKD and user-defined projects.

As a developer, you must specify a project name when querying metrics. You must have the required privileges to view metrics for the selected project.

Querying metrics for all projects as a cluster administrator

As a cluster administrator or as a user with view permissions for all projects, you can access metrics for all default OKD and user-defined projects in the Metrics UI.

Prerequisites

You have access to the cluster as a user with the cluster-admin cluster role or with view permissions for all projects.
You have installed the OpenShift CLI (oc).

Procedure

From the Administrator perspective in the OKD web console, select Observe → Metrics.

To add one or more queries, do any of the following:

Option	Description
Create a custom query.	Add your Prometheus Query Language (PromQL) query to the Expression field. As you type a PromQL expression, autocomplete suggestions appear in a drop-down list. These suggestions include functions, metrics, labels, and time tokens. You can use the keyboard arrows to select one of these suggested items and then press Enter to add the item to your expression. You can also move your mouse pointer over a suggested item to view a brief description of that item.
Add multiple queries.	Select Add query.
Duplicate an existing query.	Select the Options menu next to the query, then choose Duplicate query.
Disable a query from being run.	Select the Options menu next to the query and choose Disable query.

To run queries that you created, select Run queries. The metrics from the queries are visualized on the plot. If a query is invalid, the UI shows an error message.

Queries that operate on large amounts of data might time out or overload the browser when drawing time series graphs. To avoid this, select Hide graph and calibrate your query using only the metrics table. Then, after finding a feasible query, enable the plot to draw the graphs.

By default, the query table shows an expanded view that lists every metric and its current value. You can select ˅ to minimize the expanded view for a query.

Optional: The page URL now contains the queries you ran. To use this set of queries again in the future, save this URL.

Explore the visualized metrics. Initially, all metrics from all enabled queries are shown on the plot. You can select which metrics are shown by doing any of the following:

Option	Description
Hide all metrics from a query.	Click the Options menu for the query and click Hide all series.
Hide a specific metric.	Go to the query table and click the colored square near the metric name.
Zoom into the plot and change the time range.	Either visually select the time range by clicking and dragging on the plot horizontally, or use the menu in the upper-left corner to select the time range.
Reset the time range.	Select Reset zoom.
Display outputs for all queries at a specific point in time.	Hold the mouse cursor on the plot at that point. The query outputs appear in a pop-up box.
Hide the plot.	Select Hide graph.

Querying metrics for user-defined projects as a developer

You can access metrics for a user-defined project as a developer or as a user with view permissions for the project.

In the Developer perspective, the Metrics UI includes some predefined CPU, memory, bandwidth, and network packet queries for the selected project. You can also run custom Prometheus Query Language (PromQL) queries for CPU, memory, bandwidth, network packet, and application metrics for the project.

Developers can only use the Developer perspective and not the Administrator perspective. As a developer, you can only query metrics for one project at a time.

Prerequisites

You have access to the cluster as a developer or as a user with view permissions for the project that you are viewing metrics for.
You have enabled monitoring for user-defined projects.
You have deployed a service in a user-defined project.
You have created a ServiceMonitor custom resource (CR) for the service to define how the service is monitored.

Procedure

From the Developer perspective in the OKD web console, select Observe → Metrics.
Select the project that you want to view metrics for in the Project: list.
Select a query from the Select query list, or create a custom PromQL query based on the selected query by selecting Show PromQL. The metrics from the queries are visualized on the plot.

In the Developer perspective, you can only run one query at a time.

Explore the visualized metrics by doing any of the following:

Option	Description
Zoom into the plot and change the time range.	Either visually select the time range by clicking and dragging on the plot horizontally, or use the menu in the upper-left corner to select the time range.
Reset the time range.	Select Reset zoom.
Display outputs for all queries at a specific point in time.	Hold the mouse cursor on the plot at that point. The query outputs appear in a pop-up box.

Virtualization metrics

The following metric descriptions include example Prometheus Query Language (PromQL) queries. These metrics are not an API and might change between versions.

The following examples use topk queries that specify a time period. If virtual machines are deleted during that time period, they can still appear in the query output.

vCPU metrics

The following query can identify virtual machines that are waiting for Input/Output (I/O):

kubevirt_vmi_vcpu_wait_seconds: Returns the wait time (in seconds) for a virtual machine’s vCPU. Type: Counter.

A value above '0' means that the vCPU wants to run, but the host scheduler cannot run it yet. This inability to run indicates that there is an issue with I/O.

To query the vCPU metric, the schedstats=enable kernel argument must first be applied to the MachineConfig object. This kernel argument enables scheduler statistics used for debugging and performance tuning and adds a minor additional load to the scheduler.

Example vCPU wait time query

topk(3, sum by (name, namespace) (rate(kubevirt_vmi_vcpu_wait_seconds[6m]))) > 0 (1)

1	This query returns the top 3 VMs waiting for I/O at every given moment over a six-minute time period.
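To make the structure of this query easier to follow, here is a minimal Python sketch of what rate, sum by, and topk do conceptually. It is a simplification, not how Prometheus evaluates the query: the real rate() function also handles counter resets and extrapolates to the window boundaries. The sample data, the "id" label, and the function names are hypothetical.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplified on purpose: no counter-reset handling and no extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def top_waiting_vms(series, k=3):
    """Mimic topk(k, sum by (name, namespace) (rate(...))) > 0."""
    totals = defaultdict(float)
    for labels, samples in series:
        # "sum by (name, namespace)": add up the rates of all vCPUs of one VM.
        totals[(labels["name"], labels["namespace"])] += simple_rate(samples)
    # "topk": keep the k largest sums, then "> 0" drops VMs that did not wait.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(key, value) for key, value in ranked[:k] if value > 0]


# Hypothetical samples: (timestamp in seconds, cumulative wait seconds) per vCPU,
# covering a six-minute (360 second) window.
series = [
    ({"name": "vm-a", "namespace": "demo", "id": "0"}, [(0, 10.0), (360, 46.0)]),
    ({"name": "vm-a", "namespace": "demo", "id": "1"}, [(0, 5.0), (360, 23.0)]),
    ({"name": "vm-b", "namespace": "demo", "id": "0"}, [(0, 2.0), (360, 2.0)]),
]
result = top_waiting_vms(series)
```

In this sample data, only vm-a is returned, because the wait counter of vm-b did not increase during the window.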

Network metrics

The following queries can identify virtual machines that are saturating the network:

kubevirt_vmi_network_receive_bytes_total: Returns the total amount of traffic received (in bytes) on the virtual machine’s network. Type: Counter.
kubevirt_vmi_network_transmit_bytes_total: Returns the total amount of traffic transmitted (in bytes) on the virtual machine’s network. Type: Counter.

Example network traffic query

topk(3, sum by (name, namespace) (rate(kubevirt_vmi_network_receive_bytes_total[6m])) + sum by (name, namespace) (rate(kubevirt_vmi_network_transmit_bytes_total[6m]))) > 0 (1)

1	This query returns the top 3 VMs receiving and transmitting the most network traffic at every given moment over a six-minute time period.
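If you prefer to run such a query outside the web console, the Prometheus HTTP API exposes instant queries at /api/v1/query. The following Python sketch only builds the request URL and parses a response body, so it runs without cluster access. The base URL is a placeholder, the sample response is hand-written, and authentication (for example, a bearer token) is omitted. Treat all of these as assumptions to adapt to your cluster.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body):
    """Turn an instant-vector response into (name, namespace, value) tuples."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for item in payload["data"]["result"]:
        labels = item["metric"]
        _timestamp, value = item["value"]  # the sample value arrives as a string
        rows.append((labels.get("name"), labels.get("namespace"), float(value)))
    return rows


QUERY = (
    "topk(3, sum by (name, namespace) (rate(kubevirt_vmi_network_receive_bytes_total[6m])) "
    "+ sum by (name, namespace) (rate(kubevirt_vmi_network_transmit_bytes_total[6m]))) > 0"
)

# Placeholder endpoint; substitute the query endpoint of your own cluster.
url = instant_query_url("https://prometheus.example.com", QUERY)

# Hand-written sample response in the instant-vector shape.
sample_body = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"name":"vm-a","namespace":"demo"},"value":[1700000000,"1234.5"]}]}}'
)
rows = parse_vector(sample_body)
```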

Storage metrics

Storage-related traffic

The following queries can identify VMs that are reading or writing large amounts of data:

kubevirt_vmi_storage_read_traffic_bytes_total: Returns the total amount of storage reads (in bytes) of the virtual machine’s storage-related traffic. Type: Counter.
kubevirt_vmi_storage_write_traffic_bytes_total: Returns the total amount of storage writes (in bytes) of the virtual machine’s storage-related traffic. Type: Counter.

Example storage-related traffic query

topk(3, sum by (name, namespace) (rate(kubevirt_vmi_storage_read_traffic_bytes_total[6m])) + sum by (name, namespace) (rate(kubevirt_vmi_storage_write_traffic_bytes_total[6m]))) > 0 (1)

1	This query returns the top 3 VMs performing the most storage traffic at every given moment over a six-minute time period.

Storage snapshot data

kubevirt_vmsnapshot_disks_restored_from_source_total: Returns the total number of virtual machine disks restored from the source virtual machine. Type: Gauge.
kubevirt_vmsnapshot_disks_restored_from_source_bytes: Returns the amount of space in bytes restored from the source virtual machine. Type: Gauge.

Examples of storage snapshot data queries

kubevirt_vmsnapshot_disks_restored_from_source_total{vm_name="simple-vm", vm_namespace="default"} (1)

1	This query returns the total number of virtual machine disks restored from the source virtual machine.

kubevirt_vmsnapshot_disks_restored_from_source_bytes{vm_name="simple-vm", vm_namespace="default"} (1)

1	This query returns the amount of space in bytes restored from the source virtual machine.

I/O performance

The following queries can determine the I/O performance of storage devices:

kubevirt_vmi_storage_iops_read_total: Returns the amount of read I/O operations the virtual machine is performing per second. Type: Counter.
kubevirt_vmi_storage_iops_write_total: Returns the amount of write I/O operations the virtual machine is performing per second. Type: Counter.

Example I/O performance query

topk(3, sum by (name, namespace) (rate(kubevirt_vmi_storage_iops_read_total[6m])) + sum by (name, namespace) (rate(kubevirt_vmi_storage_iops_write_total[6m]))) > 0 (1)

1	This query returns the top 3 VMs performing the most I/O operations per second at every given moment over a six-minute time period.
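The example queries in this section share one template: the per-VM sum of one or more counter rates, ranked with topk and filtered to values above zero. As an illustration, this Python sketch generates such a query string. The function name and its parameters are hypothetical helpers, not part of any OKD or Prometheus tooling.

```python
def top_vms_query(metrics, k=3, window="6m"):
    """Return a PromQL string ranking VMs by the summed rate of the given counters."""
    terms = [
        f"sum by (name, namespace) (rate({metric}[{window}]))"
        for metric in metrics
    ]
    return f"topk({k}, {' + '.join(terms)}) > 0"


# Rebuilds the I/O performance example query from its two metric names.
iops_query = top_vms_query(
    [
        "kubevirt_vmi_storage_iops_read_total",
        "kubevirt_vmi_storage_iops_write_total",
    ]
)
print(iops_query)
```

Passing a different k or window, for example top_vms_query(metrics, k=5, window="15m"), adjusts how many VMs are listed and over which time period the rate is computed.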

Guest memory swapping metrics

The following queries can identify which swap-enabled guests are performing the most memory swapping:

kubevirt_vmi_memory_swap_in_traffic_bytes_total: Returns the total amount (in bytes) of memory the virtual guest is swapping in. Type: Gauge.
kubevirt_vmi_memory_swap_out_traffic_bytes_total: Returns the total amount (in bytes) of memory the virtual guest is swapping out. Type: Gauge.

Example memory swapping query

topk(3, sum by (name, namespace) (rate(kubevirt_vmi_memory_swap_in_traffic_bytes_total[6m])) + sum by (name, namespace) (rate(kubevirt_vmi_memory_swap_out_traffic_bytes_total[6m]))) > 0 (1)

1	This query returns the top 3 VMs where the guest is performing the most memory swapping at every given moment over a six-minute time period.

Memory swapping indicates that the virtual machine is under memory pressure. Increasing the memory allocation of the virtual machine can mitigate this issue.

Live migration metrics

The following metrics can be queried to show live migration status:

kubevirt_migrate_vmi_data_processed_bytes: The amount of guest operating system data that has migrated to the new virtual machine (VM). Type: Gauge.
kubevirt_migrate_vmi_data_remaining_bytes: The amount of guest operating system data that remains to be migrated. Type: Gauge.
kubevirt_migrate_vmi_dirty_memory_rate_bytes: The rate at which memory is becoming dirty in the guest operating system. Dirty memory is data that has been changed but not yet written to disk. Type: Gauge.
kubevirt_migrate_vmi_pending_count: The number of pending migrations. Type: Gauge.
kubevirt_migrate_vmi_scheduling_count: The number of migrations that are being scheduled. Type: Gauge.
kubevirt_migrate_vmi_running_count: The number of running migrations. Type: Gauge.
kubevirt_migrate_vmi_succeeded: The number of successfully completed migrations. Type: Gauge.
kubevirt_migrate_vmi_failed: The number of failed migrations. Type: Gauge.
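As a rough illustration of how the processed and remaining byte gauges could be combined, the following Python sketch estimates the completed fraction of a migration. It assumes that the sum of processed and remaining data approximates the total to transfer. That is only an approximation, because memory that becomes dirty during the migration might be sent more than once. The function name and the sample values are hypothetical.

```python
def migration_progress(processed_bytes, remaining_bytes):
    """Estimate the completed fraction of a live migration (0.0 to 1.0).

    Returns None when both gauges are zero, for example before data transfer starts.
    """
    total = processed_bytes + remaining_bytes
    if total <= 0:
        return None
    return processed_bytes / total


GIB = 1024 ** 3

# Hypothetical gauge readings: 3 GiB already migrated, 1 GiB still to migrate.
progress = migration_progress(3 * GIB, 1 * GIB)
print(f"{progress:.0%}")
```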
