To provide more reliable scheduling and minimize node resource overcommitment, reserve a portion of the CPU and memory resources for use by the underlying node components such as kubelet, kube-proxy, and the container engine. The resources that you reserve are also used by the remaining system components such as sshd, NetworkManager, and so on. Specifying the resources to reserve provides the scheduler with more information about the remaining memory and CPU resources that a node has available for use by pods.
Resources are reserved for node components and system components in OKD by configuring the `system-reserved` node setting.
OKD does not use the `kube-reserved` setting. Documentation for Kubernetes and for some cloud vendors that provide a Kubernetes environment might suggest configuring `kube-reserved`. That information does not apply to an OKD cluster.
Use caution when you tune your cluster with resource limits and enforce limits with evictions. Enforcing `system-reserved` limits can prevent critical system services from receiving CPU time or can terminate critical system services when memory resources run low.
In most cases, tuning resource allocation is performed by making an adjustment and then monitoring the cluster performance with a production-like workload. That process is repeated until the cluster is stable and meets service-level agreements.
For more information on the effects of these settings, see Computing Allocated Resources.
Setting | Description
---|---
`kube-reserved` | This setting is not used with OKD. Add the CPU and memory resources that you planned to reserve to `system-reserved` instead.
`system-reserved` | Resources that are reserved for the node components and system components. Default is none.
View the services that are controlled by `system-reserved` with a tool such as `lscgroup` by running the following commands:
# yum install libcgroup-tools
$ lscgroup memory:/system.slice
Reserve resources in the `kubeletArguments` section of the node configuration map by adding a set of `<resource_type>=<resource_quantity>` pairs. For example, `cpu=500m,memory=1Gi` reserves 500 millicores of CPU and one gibibyte of memory.
kubeletArguments:
system-reserved:
- "cpu=500m,memory=1Gi"
Add the `system-reserved` field if it does not exist. Do not edit the node configuration file directly; make the change in the node configuration map.
To determine appropriate values for these settings, view the resource usage of a node by using the node summary API. For more information, see System Resources Reported by Node.
After you set `system-reserved`:
Monitor the memory usage of a node for high-water marks:
$ ps aux | grep <service-name>
For example:
$ ps aux | grep atomic-openshift-node
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 11089 11.5 0.3 112712 996 pts/1 R+ 16:23 0:00 grep --color=auto atomic-openshift-node
If this value is close to your `system-reserved` mark, you can increase the `system-reserved` value.
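As a rough illustration, the RSS column that `ps aux` reports (in KiB) can be extracted with `awk` and compared against the memory portion of your reservation. The sample line and process name below are hypothetical:

```shell
# Hypothetical sample line in `ps aux` format; the sixth column is RSS in KiB
sample='root 11089 11.5 0.3 112712 996 pts/1 R+ 16:23 0:00 openshift-node'

# Extract the resident set size so it can be compared against
# the memory portion of the system-reserved setting
rss_kib=$(echo "$sample" | awk '{print $6}')
echo "${rss_kib} KiB"
```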
Monitor the memory usage of system services with a tool such as `cgget` by running the following commands:
# yum install libcgroup-tools
$ cgget -g memory /system.slice | grep memory.usage_in_bytes
If this value is close to your `system-reserved` mark, you can increase the `system-reserved` value.
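Because `memory.usage_in_bytes` is reported in bytes, converting it to MiB makes the comparison against a `system-reserved` value such as `memory=1Gi` easier. A minimal sketch, using a hypothetical reading:

```shell
# Hypothetical reading taken from `cgget -g memory /system.slice`
usage_bytes=1397895168

# Convert bytes to MiB (integer division) for comparison against
# the memory portion of system-reserved (1 MiB = 1048576 bytes)
usage_mib=$((usage_bytes / 1024 / 1024))
echo "${usage_mib} MiB"
```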
Use the OKD cluster loader to measure performance metrics of your deployment at various cluster states.
An allocated amount of a resource is computed based on the following formula:
[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]
The withholding of `Hard-Eviction-Thresholds` from the allocatable amount improves system reliability, because the allocatable value is enforced for pods at the node level.
If `[Allocatable]` is negative, it is set to `0`.
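The formula can be worked through with shell arithmetic. The capacity, reservation, and eviction-threshold figures below are hypothetical, chosen only to show the computation and the clamp to zero:

```shell
# Hypothetical node: 8 GiB capacity, 1 GiB system-reserved,
# 100 MiB hard-eviction threshold (all values in MiB)
node_capacity=8192
system_reserved=1024
hard_eviction=100

# [Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]
allocatable=$((node_capacity - system_reserved - hard_eviction))

# If the result is negative, the node reports 0
if [ "$allocatable" -lt 0 ]; then
  allocatable=0
fi
echo "${allocatable} MiB allocatable"
```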
To view the current capacity and allocatable resources for a node, run the following command:
$ oc get node/<node_name> -o yaml
In the following partial output, the allocatable values are less than the capacity values. The difference is expected and matches a `cpu=500m,memory=1Gi` resource allocation for `system-reserved`.
status:
...
allocatable:
cpu: "3500m"
memory: 6857952Ki
pods: "110"
capacity:
cpu: "4"
memory: 8010948Ki
pods: "110"
...
The scheduler uses the values for `allocatable` to decide if a node is a candidate for pod scheduling.
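Using the figures from the sample output above, the gap between capacity and allocatable can be checked by subtraction. The CPU difference is exactly the 500m reservation; the memory difference is slightly more than 1Gi (1048576Ki) because eviction thresholds are also withheld:

```shell
# Values taken from the sample `oc get node` output above
capacity_cpu=4000        # 4 cores, expressed in millicores
allocatable_cpu=3500     # 3500m
capacity_mem_kib=8010948
allocatable_mem_kib=6857952

# The differences reflect system-reserved plus any eviction thresholds
cpu_reserved=$((capacity_cpu - allocatable_cpu))
mem_reserved_kib=$((capacity_mem_kib - allocatable_mem_kib))
echo "cpu reserved: ${cpu_reserved}m"
echo "memory reserved: ${mem_reserved_kib}Ki"
```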
Each node reports the system resources that are used by the container runtime and kubelet. To simplify configuring `system-reserved`, view the resource usage for the node by using the node summary API. The node summary is available at `<master>/api/v1/nodes/<node>/proxy/stats/summary`.
For example, to access the resources for the cluster.node22 node, run the following command:
$ curl <certificate details> https://<master>/api/v1/nodes/cluster.node22/proxy/stats/summary
The response includes information that is similar to the following:
{
"node": {
"nodeName": "cluster.node22",
"systemContainers": [
{
"cpu": {
"usageCoreNanoSeconds": 929684480915,
"usageNanoCores": 190998084
},
"memory": {
"rssBytes": 176726016,
"usageBytes": 1397895168,
"workingSetBytes": 1050509312
},
"name": "kubelet"
},
{
"cpu": {
"usageCoreNanoSeconds": 128521955903,
"usageNanoCores": 5928600
},
"memory": {
"rssBytes": 35958784,
"usageBytes": 129671168,
"workingSetBytes": 102416384
},
"name": "runtime"
}
]
}
}
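The summary reports CPU in `usageNanoCores`. Converting that figure to millicores makes it directly comparable to a `cpu=500m` style reservation. A small sketch using the kubelet value from the sample response:

```shell
# usageNanoCores value reported for the kubelet container in the sample above
usage_nano_cores=190998084

# 1 core = 1,000,000,000 nanocores, so 1 millicore = 1,000,000 nanocores
usage_millicores=$((usage_nano_cores / 1000000))
echo "kubelet CPU usage: ~${usage_millicores}m"
```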
See REST API Overview for more information about the certificate details.
The node is able to limit the total amount of resources that pods can consume based on the configured allocatable value. This feature significantly improves the reliability of the node by preventing pods from using CPU and memory resources that are needed by system services such as the container runtime and node agent. To improve node reliability, administrators should reserve resources based on a target for resource use.
The node enforces resource constraints using a new cgroup hierarchy that enforces quality of service. All pods are launched in a dedicated cgroup hierarchy that is separate from system daemons.
To configure node enforcement, use the following parameters in the appropriate node configuration map.
kubeletArguments:
cgroups-per-qos:
- "true" (1)
cgroup-driver:
- "systemd" (2)
enforce-node-allocatable:
- "pods" (3)
1 | Enable or disable the cgroup hierarchy for each quality of service. The cgroups are managed by the node, and any change to this setting requires a full drain of the node. This flag must be `true` for the node to enforce the node-allocatable resource constraints. The default value is `true`, and Red Hat does not recommend that customers change this value. |
2 | The cgroup driver that the node uses to manage the cgroup hierarchies. This value must match the driver that is associated with the container runtime. Valid values are `systemd` and `cgroupfs`, but Red Hat supports only `systemd`. |
3 | A comma-delimited list of scopes where the node enforces node resource constraints. The default value is `pods`, and Red Hat supports only `pods`. |
Administrators should treat system daemons like pods that have a guaranteed quality of service. System daemons can burst within their bounding control groups, and this behavior must be managed as part of cluster deployments. Reserve CPU and memory resources for system daemons by specifying those resources in `system-reserved`, as shown in Configuring Nodes for Allocated Resources.
To view the cgroup driver that is set, run the following command:
$ systemctl status atomic-openshift-node -l | grep cgroup-driver=
The output includes a response that is similar to the following:
--cgroup-driver=systemd
For more information on managing and troubleshooting cgroup drivers, see Introduction to Control Groups (Cgroups).
If a node is under memory pressure, the entire node and all pods running on it can be affected. For example, if a system daemon uses more than its reserved amount of memory, an out-of-memory event can occur that impacts the entire node and all pods running on the node. To avoid or reduce the probability of system out-of-memory events, the node provides out-of-resource handling.