Configuring LokiStack storage - Logging | Observability

Prerequisites
Core Setup and Configuration
Loki deployment sizing
Authorizing LokiStack rules RBAC permissions
- Examples
Creating a log-based alerting rule with Loki
Configuring Loki to tolerate memberlist creation failure
Enabling stream-based retention with Loki
Loki pod placement

You can configure a LokiStack CR to store application, audit, and infrastructure-related logs.

Loki is a horizontally scalable, highly available, multi-tenant log aggregation system offered as a GA log store for logging for Red Hat OpenShift that can be visualized with the OpenShift Observability UI. The Loki configuration provided by OpenShift Logging is a short-term log store designed to enable users to perform fast troubleshooting with the collected logs. For that purpose, the logging for Red Hat OpenShift configuration of Loki has short-term storage, and is optimized for very recent queries.

For long-term storage or queries over a long time period, users should look to log stores external to their cluster. Loki sizing is only tested and supported for short term storage, for a maximum of 30 days.

Prerequisites

You have installed the Loki Operator by using the CLI or web console.
You have a serviceAccount in the same namespace in which you create the ClusterLogForwarder.
The serviceAccount is assigned collect-audit-logs, collect-application-logs, and collect-infrastructure-logs cluster roles.

Core Setup and Configuration

Role-based access controls, basic monitoring, and pod placement to deploy Loki.

Loki deployment sizing

Sizing for Loki follows the format of 1x.<size> where the value 1x is number of instances and <size> specifies performance capabilities.

The 1x.pico configuration defines a single Loki deployment with minimal resource and limit requirements, offering high availability (HA) support for all Loki components. This configuration is suited for deployments that do not require a single replication factor or auto-compaction.

Disk requests are similar across size configurations, allowing customers to test different sizes to determine the best fit for their deployment needs.

It is not possible to change the number 1x for the deployment size.

Table 1. Loki sizing
	1x.demo	1x.pico [6.1+ only]	1x.extra-small	1x.small	1x.medium
Data transfer	Demo use only	50GB/day	100GB/day	500GB/day	2TB/day
Queries per second (QPS)	Demo use only	1-25 QPS at 200ms	1-25 QPS at 200ms	25-50 QPS at 200ms	25-75 QPS at 200ms
Replication factor	None	2	2	2	2
Total CPU requests	None	7 vCPUs	14 vCPUs	34 vCPUs	54 vCPUs
Total CPU requests if using the ruler	None	8 vCPUs	16 vCPUs	42 vCPUs	70 vCPUs
Total memory requests	None	17Gi	31Gi	67Gi	139Gi
Total memory requests if using the ruler	None	18Gi	35Gi	83Gi	171Gi
Total disk requests	40Gi	590Gi	430Gi	430Gi	590Gi
Total disk requests if using the ruler	60Gi	910Gi	750Gi	750Gi	910Gi

Authorizing LokiStack rules RBAC permissions

Administrators can allow users to create and manage their own alerting and recording rules by binding cluster roles to usernames. Cluster roles are defined as ClusterRole objects that contain necessary role-based access control (RBAC) permissions for users.

The following cluster roles for alerting and recording rules are available for LokiStack:

Rule name Description

Rule name	Description
`alertingrules.loki.grafana.com-v1-admin`	Users with this role have administrative-level access to manage alerting rules. This cluster role grants permissions to create, read, update, delete, list, and watch `AlertingRule` resources within the `loki.grafana.com/v1` API group.
`alertingrules.loki.grafana.com-v1-crdview`	Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to `AlertingRule` resources within the `loki.grafana.com/v1` API group, but do not have permissions for modifying or managing these resources.
`alertingrules.loki.grafana.com-v1-edit`	Users with this role have permission to create, update, and delete `AlertingRule` resources.
`alertingrules.loki.grafana.com-v1-view`	Users with this role can read `AlertingRule` resources within the `loki.grafana.com/v1` API group. They can inspect configurations, labels, and annotations for existing alerting rules but cannot make any modifications to them.
`recordingrules.loki.grafana.com-v1-admin`	Users with this role have administrative-level access to manage recording rules. This cluster role grants permissions to create, read, update, delete, list, and watch `RecordingRule` resources within the `loki.grafana.com/v1` API group.
`recordingrules.loki.grafana.com-v1-crdview`	Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to `RecordingRule` resources within the `loki.grafana.com/v1` API group, but do not have permissions for modifying or managing these resources.
`recordingrules.loki.grafana.com-v1-edit`	Users with this role have permission to create, update, and delete `RecordingRule` resources.
`recordingrules.loki.grafana.com-v1-view`	Users with this role can read `RecordingRule` resources within the `loki.grafana.com/v1` API group. They can inspect configurations, labels, and annotations for existing alerting rules but cannot make any modifications to them.

alertingrules.loki.grafana.com-v1-admin

Users with this role have administrative-level access to manage alerting rules. This cluster role grants permissions to create, read, update, delete, list, and watch AlertingRule resources within the loki.grafana.com/v1 API group.

alertingrules.loki.grafana.com-v1-crdview

Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to AlertingRule resources within the loki.grafana.com/v1 API group, but do not have permissions for modifying or managing these resources.

alertingrules.loki.grafana.com-v1-edit

Users with this role have permission to create, update, and delete AlertingRule resources.

alertingrules.loki.grafana.com-v1-view

Users with this role can read AlertingRule resources within the loki.grafana.com/v1 API group. They can inspect configurations, labels, and annotations for existing alerting rules but cannot make any modifications to them.

recordingrules.loki.grafana.com-v1-admin

Users with this role have administrative-level access to manage recording rules. This cluster role grants permissions to create, read, update, delete, list, and watch RecordingRule resources within the loki.grafana.com/v1 API group.

recordingrules.loki.grafana.com-v1-crdview

Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to RecordingRule resources within the loki.grafana.com/v1 API group, but do not have permissions for modifying or managing these resources.

recordingrules.loki.grafana.com-v1-edit

Users with this role have permission to create, update, and delete RecordingRule resources.

recordingrules.loki.grafana.com-v1-view

Users with this role can read RecordingRule resources within the loki.grafana.com/v1 API group. They can inspect configurations, labels, and annotations for existing alerting rules but cannot make any modifications to them.

Examples

To apply cluster roles for a user, you must bind an existing cluster role to a specific username.

Cluster roles can be cluster or namespace scoped, depending on which type of role binding you use. When a RoleBinding object is used, as when using the oc adm policy add-role-to-user command, the cluster role only applies to the specified namespace. When a ClusterRoleBinding object is used, as when using the oc adm policy add-cluster-role-to-user command, the cluster role applies to all namespaces in the cluster.

The following example command gives the specified user create, read, update and delete (CRUD) permissions for alerting rules in a specific namespace in the cluster:

Example cluster role binding command for alerting rule CRUD permissions in a specific namespace

$ oc adm policy add-role-to-user alertingrules.loki.grafana.com-v1-admin -n <namespace> <username>

The following command gives the specified user administrator permissions for alerting rules in all namespaces:

Example cluster role binding command for administrator permissions

$ oc adm policy add-cluster-role-to-user alertingrules.loki.grafana.com-v1-admin <username>

Creating a log-based alerting rule with Loki

The AlertingRule CR contains a set of specifications and webhook validation definitions to declare groups of alerting rules for a single LokiStack instance. In addition, the webhook validation definition provides support for rule validation conditions:

If an AlertingRule CR includes an invalid interval period, it is an invalid alerting rule
If an AlertingRule CR includes an invalid for period, it is an invalid alerting rule.
If an AlertingRule CR includes an invalid LogQL expr, it is an invalid alerting rule.
If an AlertingRule CR includes two groups with the same name, it is an invalid alerting rule.
If none of the above applies, an alerting rule is considered valid.

Table 2. AlertingRule definitions
Tenant type	Valid namespaces for `AlertingRule` CRs
application	`<your_application_namespace>`
audit	`openshift-logging`
infrastructure	`openshift-/`, `kube-/\`, `default`

Procedure

Create an AlertingRule custom resource (CR):

Example infrastructure AlertingRule CR

  apiVersion: loki.grafana.com/v1
  kind: AlertingRule
  metadata:
    name: loki-operator-alerts
    namespace: openshift-operators-redhat (1)
    labels: (2)
      openshift.io/<label_name>: "true"
  spec:
    tenantID: "infrastructure" (3)
    groups:
      - name: LokiOperatorHighReconciliationError
        rules:
          - alert: HighPercentageError
            expr: | (4)
              sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"} |= "error" [1m])) by (job)
                /
              sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"}[1m])) by (job)
                > 0.01
            for: 10s
            labels:
              severity: critical (5)
            annotations:
              summary: High Loki Operator Reconciliation Errors (6)
              description: High Loki Operator Reconciliation Errors (7)

1	The namespace where this `AlertingRule` CR is created must have a label matching the LokiStack `spec.rules.namespaceSelector` definition.
2	The `labels` block must match the LokiStack `spec.rules.selector` definition.
3	`AlertingRule` CRs for `infrastructure` tenants are only supported in the `openshift-`, `kube-\`, or `default` namespaces.
4	The value for `kubernetes_namespace_name:` must match the value for `metadata.namespace`.
5	The value of this mandatory field must be `critical`, `warning`, or `info`.
6	This field is mandatory.
7	This field is mandatory.

Example application AlertingRule CR

  apiVersion: loki.grafana.com/v1
  kind: AlertingRule
  metadata:
    name: app-user-workload
    namespace: app-ns (1)
    labels: (2)
      openshift.io/<label_name>: "true"
  spec:
    tenantID: "application"
    groups:
      - name: AppUserWorkloadHighError
        rules:
          - alert:
            expr: | (3)
              sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job)
            for: 10s
            labels:
              severity: critical (4)
            annotations:
              summary:  (5)
              description:  (6)

1	The namespace where this `AlertingRule` CR is created must have a label matching the LokiStack `spec.rules.namespaceSelector` definition.
2	The `labels` block must match the LokiStack `spec.rules.selector` definition.
3	Value for `kubernetes_namespace_name:` must match the value for `metadata.namespace`.
4	The value of this mandatory field must be `critical`, `warning`, or `info`.
5	The value of this mandatory field is a summary of the rule.
6	The value of this mandatory field is a detailed description of the rule.

Apply the AlertingRule CR:
```
$ oc apply -f <filename>.yaml
```

Configuring Loki to tolerate memberlist creation failure

In an OKD cluster, administrators generally use a non-private IP network range. As a result, the LokiStack memberlist configuration fails because, by default, it only uses private IP networks.

As an administrator, you can select the pod network for the memberlist configuration. You can modify the LokiStack custom resource (CR) to use the podIP address in the hashRing spec. To configure the LokiStack CR, use the following command:

$ oc patch LokiStack logging-loki -n openshift-logging  --type=merge -p '{"spec": {"hashRing":{"memberlist":{"instanceAddrType":"podIP"},"type":"memberlist"}}}'

Example LokiStack to include podIP

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
# ...
  hashRing:
    type: memberlist
    memberlist:
      instanceAddrType: podIP
# ...

Enabling stream-based retention with Loki

You can configure retention policies based on log streams. Rules for these may be set globally, per-tenant, or both. If you configure both, tenant rules apply before global rules.

If there is no retention period defined on the s3 bucket or in the LokiStack custom resource (CR), then the logs are not pruned and they stay in the s3 bucket forever, which might fill up the s3 storage.

Schema v13 is recommended.

Procedure

Create a LokiStack CR:

Enable stream-based retention globally as shown in the following example:

Example global stream-based retention for AWS

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  limits:
   global: (1)
      retention: (2)
        days: 20
        streams:
        - days: 4
          priority: 1
          selector: '{kubernetes_namespace_name=~"test.+"}' (3)
        - days: 1
          priority: 1
          selector: '{log_type="infrastructure"}'
  managementState: Managed
  replicationFactor: 1
  size: 1x.small
  storage:
    schemas:
    - effectiveDate: "2020-10-11"
      version: v13
    secret:
      name: logging-loki-s3
      type: aws
  storageClassName: gp3-csi
  tenants:
    mode: openshift-logging

1	Sets retention policy for all log streams. Note: This field does not impact the retention period for stored logs in object storage.
2	Retention is enabled in the cluster when this block is added to the CR.
3	Contains the LogQL query used to define the log stream.spec: limits:

Enable stream-based retention per-tenant basis as shown in the following example:

Example per-tenant stream-based retention for AWS

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  limits:
    global:
      retention:
        days: 20
    tenants: (1)
      application:
        retention:
          days: 1
          streams:
            - days: 4
              selector: '{kubernetes_namespace_name=~"test.+"}' (2)
      infrastructure:
        retention:
          days: 5
          streams:
            - days: 1
              selector: '{kubernetes_namespace_name=~"openshift-cluster.+"}'
  managementState: Managed
  replicationFactor: 1
  size: 1x.small
  storage:
    schemas:
    - effectiveDate: "2020-10-11"
      version: v13
    secret:
      name: logging-loki-s3
      type: aws
  storageClassName: gp3-csi
  tenants:
    mode: openshift-logging

1	Sets retention policy by tenant. Valid tenant types are `application`, `audit`, and `infrastructure`.
2	Contains the LogQL query used to define the log stream.

Apply the LokiStack CR:
```
$ oc apply -f <filename>.yaml
```

Loki pod placement

You can control which nodes the Loki pods run on, and prevent other workloads from using those nodes, by using tolerations or node selectors on the pods.

You can apply tolerations to the log store pods with the LokiStack custom resource (CR) and apply taints to a node with the node specification. A taint on a node is a key:value pair that instructs the node to repel all pods that do not allow the taint. Using a specific key:value pair that is not on other pods ensures that only the log store pods can run on that node.

Example LokiStack with node selectors

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
# ...
  template:
    compactor: (1)
      nodeSelector:
        node-role.kubernetes.io/infra: "" (2)
    distributor:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    gateway:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    indexGateway:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    ingester:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    querier:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    queryFrontend:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    ruler:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
# ...

1	Specifies the component pod type that applies to the node selector.
2	Specifies the pods that are moved to nodes containing the defined label.

Example LokiStack CR with node selectors and tolerations

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
# ...
  template:
    compactor:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    distributor:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    indexGateway:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    ingester:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    querier:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    queryFrontend:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    ruler:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    gateway:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
# ...

To configure the nodeSelector and tolerations fields of the LokiStack (CR), you can use the oc explain command to view the description and fields for a particular resource:

$ oc explain lokistack.spec.template

Example output

KIND:     LokiStack
VERSION:  loki.grafana.com/v1

RESOURCE: template <Object>

DESCRIPTION:
     Template defines the resource/limits/tolerations/nodeselectors per
     component

FIELDS:
   compactor	<Object>
     Compactor defines the compaction component spec.

   distributor	<Object>
     Distributor defines the distributor component spec.
...

For more detailed information, you can add a specific field:

$ oc explain lokistack.spec.template.compactor

Example output

KIND:     LokiStack
VERSION:  loki.grafana.com/v1

RESOURCE: compactor <Object>

DESCRIPTION:
     Compactor defines the compaction component spec.

FIELDS:
   nodeSelector	<map[string]string>
     NodeSelector defines the labels required by a node to schedule the
     component onto it.
...

Enhanced Reliability and Performance

Configurations to ensure Loki’s reliability and efficiency in production.

Enabling authentication to cloud-based log stores using short-lived tokens

Workload identity federation enables authentication to cloud-based log stores using short-lived tokens.

Procedure

Use one of the following options to enable authentication:

If you use the OKD web console to install the Loki Operator, clusters that use short-lived tokens are automatically detected. You are prompted to create roles and supply the data required for the Loki Operator to create a CredentialsRequest object, which populates a secret.

If you use the OpenShift CLI (oc) to install the Loki Operator, you must manually create a Subscription object using the appropriate template for your storage provider, as shown in the following examples. This authentication strategy is only supported for the storage providers indicated.

Example Azure sample subscription

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: loki-operator
  namespace: openshift-operators-redhat
spec:
  channel: "stable-6.0"
  installPlanApproval: Manual
  name: loki-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
      - name: CLIENTID
        value: <your_client_id>
      - name: TENANTID
        value: <your_tenant_id>
      - name: SUBSCRIPTIONID
        value: <your_subscription_id>
      - name: REGION
        value: <your_region>

Example AWS sample subscription

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: loki-operator
  namespace: openshift-operators-redhat
spec:
  channel: "stable-6.0"
  installPlanApproval: Manual
  name: loki-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
    - name: ROLEARN
      value: <role_ARN>

Configuring Loki to tolerate node failure

The Loki Operator supports setting pod anti-affinity rules to request that pods of the same component are scheduled on different available nodes in the cluster.

Affinity is a property of pods that controls the nodes on which they prefer to be scheduled. Anti-affinity is a property of pods that prevents a pod from being scheduled on a node.

In OKD, pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key-value labels on other pods.

The Operator sets default, preferred podAntiAffinity rules for all Loki components, which includes the compactor, distributor, gateway, indexGateway, ingester, querier, queryFrontend, and ruler components.

You can override the preferred podAntiAffinity settings for Loki components by configuring required settings in the requiredDuringSchedulingIgnoredDuringExecution field:

Example user settings for the ingester component

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
# ...
  template:
    ingester:
      podAntiAffinity:
      # ...
        requiredDuringSchedulingIgnoredDuringExecution: (1)
        - labelSelector:
            matchLabels: (2)
              app.kubernetes.io/component: ingester
          topologyKey: kubernetes.io/hostname
# ...

1	The stanza to define a required rule.
2	The key-value pair (label) that must be matched to apply the rule.

LokiStack behavior during cluster restarts

When an OKD cluster is restarted, LokiStack ingestion and the query path continue to operate within the available CPU and memory resources available for the node. This means that there is no downtime for the LokiStack during OKD cluster updates. This behavior is achieved by using PodDisruptionBudget resources. The Loki Operator provisions PodDisruptionBudget resources for Loki, which determine the minimum number of pods that must be available per component to ensure normal operations under certain conditions.

Advanced Deployment and Scalability

Specialized configurations for high availability, scalability, and error handling.

Zone aware data replication

The Loki Operator offers support for zone-aware data replication through pod topology spread constraints. Enabling this feature enhances reliability and safeguards against log loss in the event of a single zone failure. When configuring the deployment size as 1x.extra-small, 1x.small, or 1x.medium, the replication.factor field is automatically set to 2.

To ensure proper replication, you need to have at least as many availability zones as the replication factor specifies. While it is possible to have more availability zones than the replication factor, having fewer zones can lead to write failures. Each zone should host an equal number of instances for optimal operation.

Example LokiStack CR with zone replication enabled

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
 name: logging-loki
 namespace: openshift-logging
spec:
 replicationFactor: 2 (1)
 replication:
   factor: 2 (2)
   zones:
   -  maxSkew: 1 (3)
      topologyKey: topology.kubernetes.io/zone (4)

1	Deprecated field, values entered are overwritten by `replication.factor`.
2	This value is automatically set when deployment size is selected at setup.
3	The maximum difference in number of pods between any two topology domains. The default is 1, and you cannot specify a value of 0.
4	Defines zones in the form of a topology key that corresponds to a node label.

Recovering Loki pods from failed zones

In OKD a zone failure happens when specific availability zone resources become inaccessible. Availability zones are isolated areas within a cloud provider’s data center, aimed at enhancing redundancy and fault tolerance. If your OKD cluster is not configured to handle this, a zone failure can lead to service or data loss.

Loki pods are part of a StatefulSet, and they come with Persistent Volume Claims (PVCs) provisioned by a StorageClass object. Each Loki pod and its PVCs reside in the same zone. When a zone failure occurs in a cluster, the StatefulSet controller automatically attempts to recover the affected pods in the failed zone.

The following procedure will delete the PVCs in the failed zone, and all data contained therein. To avoid complete data loss the replication factor field of the LokiStack CR should always be set to a value greater than 1 to ensure that Loki is replicating.

Prerequisites

Verify your LokiStack CR has a replication factor greater than 1.
Zone failure detected by the control plane, and nodes in the failed zone are marked by cloud provider integration.

The StatefulSet controller automatically attempts to reschedule pods in a failed zone. Because the associated PVCs are also in the failed zone, automatic rescheduling to a different zone does not work. You must manually delete the PVCs in the failed zone to allow successful re-creation of the stateful Loki Pod and its provisioned PVC in the new zone.

Procedure

List the pods in Pending status by running the following command:

$ oc get pods --field-selector status.phase==Pending -n openshift-logging

Example oc get pods output

NAME                           READY   STATUS    RESTARTS   AGE (1)
logging-loki-index-gateway-1   0/1     Pending   0          17m
logging-loki-ingester-1        0/1     Pending   0          16m
logging-loki-ruler-1           0/1     Pending   0          16m

1	These pods are in `Pending` status because their corresponding PVCs are in the failed zone.

List the PVCs in Pending status by running the following command:

$ oc get pvc -o=json -n openshift-logging | jq '.items[] | select(.status.phase == "Pending") | .metadata.name' -r

Example oc get pvc output

storage-logging-loki-index-gateway-1
storage-logging-loki-ingester-1
wal-logging-loki-ingester-1
storage-logging-loki-ruler-1
wal-logging-loki-ruler-1

Delete the PVC(s) for a pod by running the following command:
```
$ oc delete pvc <pvc_name> -n openshift-logging
```
Delete the pod(s) by running the following command:
```
$ oc delete pod <pod_name> -n openshift-logging
```
Once these objects have been successfully deleted, they should automatically be rescheduled in an available zone.

Troubleshooting PVC in a terminating state

The PVCs might hang in the terminating state without being deleted, if PVC metadata finalizers are set to kubernetes.io/pv-protection. Removing the finalizers should allow the PVCs to delete successfully.

Remove the finalizer for each PVC by running the command below, then retry deletion.

$ oc patch pvc <pvc_name> -p '{"metadata":{"finalizers":null}}' -n openshift-logging

Troubleshooting Loki rate limit errors

If the Log Forwarder API forwards a large block of messages that exceeds the rate limit to Loki, Loki generates rate limit (429) errors.

These errors can occur during normal operation. For example, when adding the logging to a cluster that already has some logs, rate limit errors might occur while the logging tries to ingest all of the existing log entries. In this case, if the rate of addition of new logs is less than the total rate limit, the historical data is eventually ingested, and the rate limit errors are resolved without requiring user intervention.

In cases where the rate limit errors continue to occur, you can fix the issue by modifying the LokiStack custom resource (CR).

The LokiStack CR is not available on Grafana-hosted Loki. This topic does not apply to Grafana-hosted Loki servers.

Conditions

The Log Forwarder API is configured to forward logs to Loki.

Your system sends a block of messages that is larger than 2 MB to Loki. For example:

"values":[["1630410392689800468","{\"kind\":\"Event\",\"apiVersion\":\
.......
......
......
......
\"received_at\":\"2021-08-31T11:46:32.800278+00:00\",\"version\":\"1.7.4 1.6.0\"}},\"@timestamp\":\"2021-08-31T11:46:32.799692+00:00\",\"viaq_index_name\":\"audit-write\",\"viaq_msg_id\":\"MzFjYjJkZjItNjY0MC00YWU4LWIwMTEtNGNmM2E5ZmViMGU4\",\"log_type\":\"audit\"}"]]}]}

After you enter oc logs -n openshift-logging -l component=collector, the collector logs in your cluster show a line containing one of the following error messages:

429 Too Many Requests Ingestion rate limit exceeded

Example Vector error message

2023-08-25T16:08:49.301780Z  WARN sink{component_kind="sink" component_id=default_loki_infra component_type=loki component_name=default_loki_infra}: vector::sinks::util::retries: Retrying after error. error=Server responded with an error: 429 Too Many Requests internal_log_rate_limit=true

Example Fluentd error message

2023-08-30 14:52:15 +0000 [warn]: [default_loki_infra] failed to flush the buffer. retry_times=2 next_retry_time=2023-08-30 14:52:19 +0000 chunk="604251225bf5378ed1567231a1c03b8b" error_class=Fluent::Plugin::LokiOutput::LogPostError error="429 Too Many Requests Ingestion rate limit exceeded for user infrastructure (limit: 4194304 bytes/sec) while attempting to ingest '4082' lines totaling '7820025' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\n"

The error is also visible on the receiving end. For example, in the LokiStack ingester pod:

Example Loki ingester error message

level=warn ts=2023-08-30T14:57:34.155592243Z caller=grpc_logging.go:43 duration=1.434942ms method=/logproto.Pusher/Push err="rpc error: code = Code(429) desc = entry with timestamp 2023-08-30 14:57:32.012778399 +0000 UTC ignored, reason: 'Per stream rate limit exceeded (limit: 3MB/sec) while attempting to ingest for stream

Procedure

Update the ingestionBurstSize and ingestionRate fields in the LokiStack CR:

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  limits:
    global:
      ingestion:
        ingestionBurstSize: 16 (1)
        ingestionRate: 8 (2)
# ...

1	The `ingestionBurstSize` field defines the maximum local rate-limited sample size per distributor replica in MB. This value is a hard limit. Set this value to at least the maximum logs size expected in a single push request. Single requests that are larger than the `ingestionBurstSize` value are not permitted.
2	The `ingestionRate` field is a soft limit on the maximum amount of ingested samples per second in MB. Rate limit errors occur if the rate of logs exceeds the limit, but the collector retries sending the logs. As long as the total average is lower than the limit, the system recovers and errors are resolved without user intervention.

Storing logs with LokiStack

Prerequisites

Core Setup and Configuration

Loki deployment sizing

Authorizing LokiStack rules RBAC permissions

Examples

Creating a log-based alerting rule with Loki

Configuring Loki to tolerate memberlist creation failure

Enabling stream-based retention with Loki

Loki pod placement

Enhanced Reliability and Performance

Enabling authentication to cloud-based log stores using short-lived tokens

Configuring Loki to tolerate node failure

LokiStack behavior during cluster restarts

Advanced Deployment and Scalability

Zone aware data replication

Recovering Loki pods from failed zones

Troubleshooting PVC in a terminating state

Troubleshooting Loki rate limit errors