|
Network observability alerts is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope. |
The Network Observability Operator provides a set of alerts for monitoring the network in your OKD cluster. The alerts are based on the Operator's built-in metrics, but can also include other metrics, such as those provided by the OKD monitoring stack. Alerts are designed to give you a quick indication of your cluster's network health.
Network observability includes predefined alerts. Use these alerts to gain insight into the health and performance of your OKD applications and infrastructure.
The predefined alerts provide a quick health indication of your cluster’s network in the Network Health dashboard. You can also customize alerts using Prometheus Query Language (PromQL) queries.
By default, network observability creates alerts that are contextual to the features you enable.
For example, packet drop-related alerts are created only if the PacketDrop agent feature is enabled in the FlowCollector custom resource (CR). Alerts are built on metrics, and you might see configuration warnings if enabled alerts are missing their required metrics.
You can configure these metrics in the spec.processor.metrics.includeList object of the FlowCollector CR.
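For example, the following sketch enables the PacketDrop agent feature and adds related metrics to the include list. The metric names shown here are illustrative; see the FlowCollector API reference for the full list of accepted values.
apiVersion: flows.netobserv.io/v1beta1
kind: FlowCollector
metadata:
  name: flow-collector
spec:
  agent:
    ebpf:
      features:
        # enables kernel packet drop tracking, required by drop-related alerts
        - PacketDrop
  processor:
    metrics:
      includeList:
        - namespace_flows_total
        - namespace_drop_packets_total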
These alert templates are installed by default:
PacketDropsByDevice: Triggers on a high percentage of packet drops from devices (/proc/net/dev).
PacketDropsByKernel: Triggers on a high percentage of packet drops by the kernel; it requires the PacketDrop agent feature.
IPsecErrors: Triggers when IPsec encryption errors are detected by network observability; it requires the IPSec agent feature.
NetpolDenied: Triggers when traffic denied by network policies is detected by network observability; it requires the NetworkEvents agent feature.
LatencyHighTrend: Triggers when an increase in TCP latency is detected by network observability; it requires the FlowRTT agent feature.
DNSErrors: Triggers when DNS errors are detected by network observability; it requires the DNSTracking agent feature.
These are operational alerts that relate to the self-health of network observability:
NetObservNoFlows: Triggers when no flows are being observed for a certain period.
NetObservLokiError: Triggers when flows are being dropped due to Loki errors.
You can configure, extend, or disable alerts for network observability. You can view the resulting PrometheusRule resource in the default netobserv namespace by running the following command:
$ oc get prometheusrules -n netobserv -oyaml
When alerts are enabled in the Network Observability Operator, two things happen:
New alerts appear on the Observe → Alerting → Alerting rules tab in the OKD web console.
A new Network Health dashboard appears in the OKD web console → Observe.
The Network Health dashboard provides a summary of triggered alerts and pending alerts, distinguishing between critical, warning, and minor issues. Alerts for rule violations are displayed in the following tabs:
Global: Shows alerts that are global to the cluster.
Nodes: Shows alerts for rule violations per node.
Namespaces: Shows alerts for rule violations per namespace.
Click a resource card to see more information. A three-dot menu appears next to each alert. From this menu, you can navigate to Network Traffic → Traffic flows to see more detailed information for the selected resource.
Network Observability Operator alerts are a Technology Preview feature. To use this feature, you must enable it in the FlowCollector custom resource (CR), and then configure alerts for your specific needs.
Edit the FlowCollector CR to set the experimental alerts flag to true:
apiVersion: flows.netobserv.io/v1beta1
kind: FlowCollector
metadata:
  name: flow-collector
spec:
  processor:
    advanced:
      env:
        EXPERIMENTAL_ALERTS_HEALTH: "true"
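After you edit the CR, apply the change. For example, if you saved the resource as flowcollector.yaml, run the following command:
$ oc apply -f flowcollector.yaml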
You can still use the existing method for creating alerts. For more information, see "Creating alerts".
Alerts in the Network Observability Operator are defined using alert templates and variants in the spec.processor.metrics.alerts object of the FlowCollector custom resource (CR). You can customize the default templates and variants for flexible, fine-grained alerting.
After you enable alerts, the Network Health dashboard appears in the Observe section of the OKD web console.
For each template, you can define a list of variants, each with their own thresholds and grouping configurations. For more information, see the "List of default alert templates".
Here is an example:
apiVersion: flows.netobserv.io/v1beta1
kind: FlowCollector
metadata:
  name: flow-collector
spec:
  processor:
    metrics:
      alerts:
        - template: PacketDropsByKernel
          variants:
            # triggered when the whole cluster traffic (no grouping) reaches 10% of drops
            - thresholds:
                critical: "10"
            # triggered when per-node traffic reaches 5% of drops, with gradual severity
            - thresholds:
                critical: "15"
                warning: "10"
                info: "5"
              groupBy: Node
|
Customizing an alert replaces the default configuration for that template. If you want to keep the default configurations, you must manually replicate them. |
Learn about the base query for Prometheus Query Language (PromQL), and how to customize it so you can configure network observability alerts for your specific needs.
The alerting API in the network observability FlowCollector custom resource (CR) is mapped to the Prometheus Operator API, generating a PrometheusRule. You can see the PrometheusRule in the default netobserv namespace by running the following command:
$ oc get prometheusrules -n netobserv -oyaml
This example provides the base PromQL query pattern for an alert about a surge in incoming traffic:
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)
This query calculates the byte rate coming from the openshift-ingress namespace to any of your workloads' namespaces over the past 30 minutes.
You can customize the query, including retaining only some rates, running the query for specific time periods, and setting a final threshold.
Appending > 1000 to this query retains only the rates observed that are greater than 1 KB/s, which eliminates noise from low-bandwidth consumers.
(sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
The byte rate is relative to the sampling interval defined in the FlowCollector custom resource (CR) configuration. If the sampling interval is 1:100, the actual traffic might be approximately 100 times higher than the reported metrics.
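For example, with a 1:100 sampling interval, a reported rate of 10 KB/s can correspond to roughly 1 MB/s of actual traffic.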
You can run the same query for a particular period of time using the offset modifier. For example, a query for one day earlier can be run using offset 1d, and a query for five hours ago can be run using offset 5h.
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
You can use the formula 100 * (<query now> - <query from the previous day>) / <query from the previous day> to calculate the percentage of increase compared to the previous day. This value can be negative if the byte rate today is lower than the previous day.
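For example, if the current byte rate is 5000 and the rate from the previous day was 2000, the increase is 100 * (5000 - 2000) / 2000 = 150%.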
You can apply a final threshold to filter increases that are lower than the desired percentage. For example, > 100 eliminates increases that are lower than 100%.
Together, the complete expression for the PrometheusRule looks like the following:
...
expr: |-
  (100 *
    (
      (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
      - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
    )
    / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
  > 100
The Network Observability Operator uses components from other OKD features, such as the monitoring stack, to enhance visibility into network traffic. For more information, see: "Monitoring stack architecture".
Some metadata must be configured for the alert definitions. This metadata is used by Prometheus and the Alertmanager service from the monitoring stack, or by the Network Health dashboard.
The following example shows an AlertingRule resource with the configured metadata:
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-alerts
  namespace: openshift-monitoring
spec:
  groups:
    - name: NetObservAlerts
      rules:
        - alert: NetObservIncomingBandwidth
          annotations:
            netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
            message: |-
              NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday.
            summary: "Surge in incoming traffic"
          expr: |-
            (100 *
              (
                (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
                - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
              )
              / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
            > 100
          for: 1m
          labels:
            app: netobserv
            netobserv: "true"
            severity: warning
where:
spec.groups.rules.alert.labels.netobserv: When set to true, specifies that the Network Health dashboard detects the alert.
spec.groups.rules.alert.labels.severity: Specifies the severity of the alert. The following values are valid: critical, warning, or info.
You can leverage the output labels from the defined PromQL expression in the message annotation. In the example, since results are grouped per DstK8S_Namespace, the expression {{ $labels.DstK8S_Namespace }} is used in the message text.
The netobserv_io_network_health annotation is optional, and controls how the alert is rendered on the Network Health page.
The netobserv_io_network_health annotation is a JSON string consisting of the following fields:
| Field | Type | Description |
|---|---|---|
| namespaceLabels | List of strings | One or more labels that hold namespaces. When provided, the alert appears under the Namespaces tab. |
| nodeLabels | List of strings | One or more labels that hold node names. When provided, the alert appears under the Nodes tab. |
| threshold | String | The alert threshold, expected to match the threshold defined in the alert expression. |
| unit | String | The data unit, used only for display purposes. |
| upperBound | String | An upper bound value used to compute the score on a closed scale. Metric values exceeding this bound are clamped. |
| links | List of objects | A list of links to display contextually with the alert. |
|  | String | An additional filter to inject into the URL for the Network Traffic page. |
The namespaceLabels and nodeLabels fields are mutually exclusive. If neither is provided, the alert appears under the Global tab.
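For example, a per-node alert annotation might look like the following sketch, assuming the alert expression groups results by a node label such as DstK8S_HostName; the threshold and bounds shown are illustrative:
annotations:
  netobserv_io_network_health: '{"nodeLabels":["DstK8S_HostName"],"threshold":"5","unit":"%","upperBound":"20"}'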
Use Prometheus Query Language (PromQL) to define a custom AlertingRule resource that triggers alerts based on specific network metrics, such as traffic surges.
You are familiar with PromQL.
You have installed OKD 4.14 or later.
You have access to the cluster as a user with the cluster-admin role.
You have installed the Network Observability Operator.
Create a YAML file named custom-alert.yaml that contains your AlertingRule resource, such as the netobserv-alerts example shown previously.
Apply the custom alert rule by running the following command:
$ oc apply -f custom-alert.yaml
Verify that the PrometheusRule resource was created in the netobserv namespace by running the following command:
$ oc get prometheusrules -n netobserv -oyaml
The output should include the netobserv-alerts rule you just created, confirming that the resource was generated correctly.
Confirm the rule is active by checking the Network Health dashboard in the OKD web console → Observe.
Alert templates can be disabled in the spec.processor.metrics.disableAlerts field of the FlowCollector custom resource (CR). This setting accepts a list of alert template names. For a list of alert template names, see: "List of default alert templates".
If a template is disabled and overridden in the spec.processor.metrics.alerts field, the disable setting takes precedence and the alert rule is not created.
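For example, a minimal sketch that disables the two operational alert templates named earlier in this section:
spec:
  processor:
    metrics:
      disableAlerts:
        - NetObservNoFlows
        - NetObservLokiError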