Before you get started with hosted control planes for OKD, you must properly label nodes so that the pods of hosted clusters can be scheduled into infrastructure nodes. Node labeling is also important for the following reasons:
To ensure high availability and proper workload deployment. For example, to avoid having the control plane workload count toward your OKD subscription, you can set the node-role.kubernetes.io/infra label.
To ensure that control plane workloads are separate from other workloads in the management cluster.
To ensure that control plane workloads are configured at the correct multi-tenancy distribution level for your deployment. The distribution levels are as follows:
Everything shared: Control planes for hosted clusters can run on any node that is designated for control planes.
Request serving isolation: Request-serving pods run on their own dedicated nodes.
Nothing shared: Every control plane has its own dedicated nodes.
For more information about dedicating a node to a single hosted cluster, see "Labeling management cluster nodes".
Do not use the management cluster for your workload. Workloads must not run on nodes where control planes run.
Proper node labeling is a prerequisite to deploying hosted control planes.
As a management cluster administrator, you use the following labels and taints in management cluster nodes to schedule a control plane workload:
hypershift.openshift.io/control-plane: true: Use this label and taint to dedicate a node to running hosted control plane workloads. By setting a value of true, you avoid sharing the control plane nodes with other components, for example, the infrastructure components of the management cluster or any other mistakenly deployed workload.
hypershift.openshift.io/cluster: ${HostedControlPlane Namespace}: Use this label and taint when you want to dedicate a node to a single hosted cluster.
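For example, to dedicate a node to hosted control plane workloads, you might apply the label and a matching taint as in the following sketch. The node name worker-1a is a placeholder, and the NoSchedule effect is an assumption; substitute your own node names and adjust the effect to your scheduling policy:
$ oc label node/worker-1a hypershift.openshift.io/control-plane=true
$ oc adm taint nodes worker-1a hypershift.openshift.io/control-plane=true:NoSchedule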
Apply the following labels on the nodes that host control-plane pods:
node-role.kubernetes.io/infra: Use this label to avoid having the control-plane workload count toward your subscription.
topology.kubernetes.io/zone: Use this label on the management cluster nodes to deploy highly available clusters across failure domains. The zone might be a location, rack name, or the hostname of the node where the zone is set. For example, a management cluster has the following nodes: worker-1a, worker-1b, worker-2a, and worker-2b. The worker-1a and worker-1b nodes are in rack1, and the worker-2a and worker-2b nodes are in rack2. To use each rack as an availability zone, enter the following commands:
$ oc label node/worker-1a node/worker-1b topology.kubernetes.io/zone=rack1
$ oc label node/worker-2a node/worker-2b topology.kubernetes.io/zone=rack2
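To confirm that the zone labels are applied, you might list the nodes with the label shown as an extra column. The -L flag adds a column for the specified label:
$ oc get nodes -L topology.kubernetes.io/zone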
Pods for a hosted cluster have tolerations, and the scheduler uses affinity rules to schedule them. Pods tolerate the control-plane and cluster taints. The scheduler prioritizes the scheduling of pods into nodes that are labeled with hypershift.openshift.io/control-plane and hypershift.openshift.io/cluster: ${HostedControlPlane Namespace}.
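As an illustration only, and not the exact manifest that the HyperShift Operator generates, the tolerations and node affinity on a hosted control plane pod take roughly the following shape. The clusters-example value is a placeholder for the hosted control plane namespace:
tolerations:
- key: hypershift.openshift.io/control-plane
  operator: Equal
  value: "true"
  effect: NoSchedule
- key: hypershift.openshift.io/cluster
  operator: Equal
  value: clusters-example   # placeholder hosted control plane namespace
  effect: NoSchedule
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      preference:
        matchExpressions:
        - key: hypershift.openshift.io/control-plane
          operator: In
          values:
          - "true"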
For the ControllerAvailabilityPolicy option, use HighlyAvailable, which is the default value that the hosted control planes command-line interface, hcp, deploys. When you use that option, you can schedule pods for each deployment within a hosted cluster across different failure domains by setting topology.kubernetes.io/zone as the topology key. Scheduling pods for a deployment within a hosted cluster across different failure domains is available only for highly available control planes.
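If you set the policy in the HostedCluster resource rather than through the hcp CLI, the setting looks like the following sketch, which assumes the spec.controllerAvailabilityPolicy field name:
spec:
  controllerAvailabilityPolicy: HighlyAvailable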
To enable a hosted cluster to require its pods to be scheduled into infrastructure nodes, set HostedCluster.spec.nodeSelector, as shown in the following example:
spec:
nodeSelector:
node-role.kubernetes.io/infra: ""
This way, hosted control planes for each hosted cluster qualify as infrastructure workloads, and you do not need to entitle the underlying OKD nodes.
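If the target nodes do not yet carry the infrastructure role, you might label them first. The node names are placeholders:
$ oc label node/worker-1a node/worker-1b node-role.kubernetes.io/infra=""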
Four built-in priority classes influence the priority and preemption of the hosted cluster pods. The pods are created in the management cluster with the following priority classes, listed from highest to lowest:
hypershift-operator: HyperShift Operator pods.
hypershift-etcd: Pods for etcd.
hypershift-api-critical: Pods that are required for API calls and resource admission to succeed. These pods include pods such as kube-apiserver, aggregated API servers, and webhooks.
hypershift-control-plane: Pods in the control plane that are not API-critical but still need elevated priority, such as the cluster version Operator.
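To review these priority classes and their values on the management cluster, you might run:
$ oc get priorityclass hypershift-operator hypershift-etcd hypershift-api-critical hypershift-control-plane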
By default, pods for a hosted cluster tolerate the control-plane and cluster taints. However, you can also use custom taints on nodes so that hosted clusters can tolerate those taints on a per-hosted-cluster basis by setting HostedCluster.spec.tolerations.
Passing tolerations for a hosted cluster is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
The following example sets tolerations in the HostedCluster specification:
spec:
tolerations:
- effect: NoSchedule
key: kubernetes.io/custom
operator: Exists
You can also set tolerations on the hosted cluster when you create the cluster by using the --tolerations hcp CLI argument, as in the following example:
--tolerations="key=kubernetes.io/custom,operator=Exists,effect=NoSchedule"
For fine-grained control of hosted cluster pod placement on a per-hosted-cluster basis, use custom tolerations with nodeSelector. You can co-locate groups of hosted clusters and isolate them from other hosted clusters. You can also place hosted clusters on infra and control plane nodes.
Tolerations on the hosted cluster apply only to the control plane pods. To configure other pods that run on the management cluster, and infrastructure-related pods such as the pods that run virtual machines, you must use a different process.
You can configure hosted control planes to isolate network traffic and control plane pods.
Each hosted control plane is assigned to run in a dedicated Kubernetes namespace. By default, the Kubernetes namespace denies all network traffic.
The following network traffic is allowed through the network policy that is enforced by the Kubernetes Container Network Interface (CNI):
Ingress pod-to-pod communication in the same namespace (intra-tenant)
Ingress on port 6443 to the hosted kube-apiserver pod for the tenant
Metric scraping for monitoring from the management cluster Kubernetes namespace that has the network.openshift.io/policy-group: monitoring label
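As an illustration of the first rule, a same-namespace ingress policy looks roughly like the following sketch. This is not the exact policy that is enforced, and the clusters-example namespace is a placeholder:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace
  namespace: clusters-example   # placeholder hosted control plane namespace
spec:
  podSelector: {}               # applies to all pods in the namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}           # allow ingress only from pods in the same namespace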
In addition to network policies, each hosted control plane pod runs with the restricted security context constraint (SCC). This policy denies access to all host features and requires pods to run with a UID and an SELinux context that are allocated uniquely to each namespace that hosts a customer control plane.
The policy ensures the following constraints:
Pods cannot run as privileged.
Pods cannot mount host directory volumes.
Pods must run as a user in a pre-allocated range of UIDs.
Pods must run with a pre-allocated MCS label.
Pods cannot access the host network namespace.
Pods cannot expose host network ports.
Pods cannot access the host PID namespace.
By default, pods drop the following Linux capabilities: KILL, MKNOD, SETUID, and SETGID.
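To review the rules that the restricted SCC enforces on your management cluster, you might inspect it directly. On newer clusters, the equivalent constraint might be named restricted-v2:
$ oc describe scc restricted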
The management components, such as kubelet and crio, on each management cluster worker node are protected by an SELinux label that is not accessible to the SELinux context for pods that support hosted control planes.
The following SELinux labels are used for key processes and sockets:
kubelet: system_u:system_r:unconfined_service_t:s0
crio: system_u:system_r:container_runtime_t:s0
crio.sock: system_u:object_r:container_var_run_t:s0
<example user container processes>: system_u:system_r:container_t:s0:c14,c24
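To verify these contexts on a management cluster node, you might open a debug session and list the process labels. The node name worker-1a is a placeholder:
$ oc debug node/worker-1a -- chroot /host ps -eZ | grep -E 'kubelet|crio'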