This document describes how to configure clusters and virtual machines (VMs) to support high-performance, low-latency workloads with the computing efficiencies of non-uniform memory access (NUMA). It includes instructions for tuning Kubernetes settings on cluster nodes and for configuring VMs with NUMA affinity so that they are scheduled on, and take advantage of, the NUMA nodes.
With a NUMA-aware VM, all the communication within the VM is local to the NUMA node. The NUMA-aware VM avoids data transactions to and from remote resources that can degrade VM performance.
Configure nodes to use NUMA
The following sections describe how to configure the critical Kubernetes components to tune the node and make sure it can schedule NUMA-aware containers. These NUMA nodes are tuned to optimize CPU and memory performance. Follow the instructions for each node that you want to run NUMA-aware VMs.
Update the kubelet configuration
As part of the node configuration to support NUMA node affinity, you need to make the following changes in the kubelet configuration:
- Enable the CPU Manager with a static policy
- Enable the Memory Manager with a Static policy
- Enable the Topology Manager with a restricted policy
To configure kubelet on your worker node:
Locate the kubelet file on your worker node and open it for editing:
    edit /etc/default/kubelet
If you don't see the kubelet file, create it with the following command:
    echo "KUBELET_EXTRA_ARGS=\"\"" >> /etc/default/kubelet
This command creates the kubelet file with an empty KUBELET_EXTRA_ARGS="" section.
To enable the CPU Manager with a static policy, add the --cpu-manager-policy=static flag to the KUBELET_EXTRA_ARGS="" section of the file:
    KUBELET_EXTRA_ARGS="--cpu-manager-policy=static"
To enable the Memory Manager with a Static policy, add the --memory-manager-policy=Static flag to the KUBELET_EXTRA_ARGS="" section of the file:
    KUBELET_EXTRA_ARGS="--cpu-manager-policy=static --memory-manager-policy=Static"
To enable the Topology Manager with a restricted policy, add the --topology-manager-policy=restricted flag to the KUBELET_EXTRA_ARGS="" section of the file:
    KUBELET_EXTRA_ARGS="--cpu-manager-policy=static --memory-manager-policy=Static --topology-manager-policy=restricted"
Check the current amount of memory reserved by Google Distributed Cloud:
    cat /var/lib/kubelet/kubeadm-flags.env
The output should look like the following:
KUBELET_KUBEADM_ARGS="--anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --feature-gates=SeccompDefault=true --kube-reserved=cpu=100m,memory=3470Mi --max-pods=110 --node-ip=192.168.1.190 --node-labels=baremetal.cluster.gke.io/k8s-ip=192.168.1.190,baremetal.cluster.gke.io/namespace=cluster-user001,baremetal.cluster.gke.io/node-pool=node-pool-1,cloud.google.com/gke-nodepool=node-pool-1 --pod-infra-container-image=gcr.io/anthos-baremetal-release/pause-amd64:3.1-gke.5 --provider-id=baremetal://192.168.1.190 --read-only-port=10255 --rotate-server-certificates=true --seccomp-default=true"
The --kube-reserved=cpu=100m,memory=3470Mi setting indicates that Google Distributed Cloud has reserved 3,470 mebibytes of memory on the node.
Set the --reserved-memory flag in the KUBELET_EXTRA_ARGS section of the kubelet file to 100 mebibytes more than the current reserved memory to account for the eviction threshold. If there is no reserved memory, you can skip this step.
For example, with the reserved memory of 3470Mi from the example in the preceding step, you reserve 3570Mi of memory in the kubelet file:
    KUBELET_EXTRA_ARGS="--cpu-manager-policy=static --memory-manager-policy=Static --topology-manager-policy=restricted --reserved-memory=0:memory=3570Mi"
Remove the CPU and memory state files from the /var/lib/kubelet directory:
    rm /var/lib/kubelet/cpu_manager_state
    rm /var/lib/kubelet/memory_manager_state
Restart kubelet:
    systemctl restart kubelet
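The kubelet edits in the preceding steps can be sketched as a small script. This is a minimal illustration, not part of Google Distributed Cloud: add_kubelet_flag and reserved_memory_flag are hypothetical helper names, the script assumes GNU sed, and it assumes the reserved memory is expressed in Mi and applies to NUMA node 0, as in the example above. The demo writes to a scratch file rather than /etc/default/kubelet.

```shell
# Hypothetical helper: append a flag to KUBELET_EXTRA_ARGS in a kubelet
# defaults file, skipping flags that are already present.
add_kubelet_flag() {
  local file="$1" flag="$2"
  # Create an empty KUBELET_EXTRA_ARGS entry if the file does not have one.
  grep -q '^KUBELET_EXTRA_ARGS=' "$file" 2>/dev/null || echo 'KUBELET_EXTRA_ARGS=""' >> "$file"
  # Nothing to do if the exact flag is already set.
  grep -qF -- "$flag" "$file" && return 0
  # Insert the flag before the closing quote, then trim a leading space.
  sed -i -e "s|^KUBELET_EXTRA_ARGS=\"\(.*\)\"|KUBELET_EXTRA_ARGS=\"\1 ${flag}\"|" \
         -e 's|^KUBELET_EXTRA_ARGS=" |KUBELET_EXTRA_ARGS="|' "$file"
}

# Hypothetical helper: derive --reserved-memory from a --kube-reserved setting.
reserved_memory_flag() {
  local args="$1" current
  # Extract N from "--kube-reserved=...memory=<N>Mi".
  current="$(printf '%s\n' "$args" | sed -n 's/.*--kube-reserved=[^ ]*memory=\([0-9][0-9]*\)Mi.*/\1/p')"
  # No reserved memory found: the step can be skipped.
  [ -n "$current" ] || return 1
  # Add 100 Mi for the eviction threshold.
  printf '%s\n' "--reserved-memory=0:memory=$((current + 100))Mi"
}

# Demo against a scratch file.
demo_file="$(mktemp)"
add_kubelet_flag "$demo_file" "--cpu-manager-policy=static"
add_kubelet_flag "$demo_file" "--memory-manager-policy=Static"
add_kubelet_flag "$demo_file" "--topology-manager-policy=restricted"
if extra="$(reserved_memory_flag '--kube-reserved=cpu=100m,memory=3470Mi --max-pods=110')"; then
  add_kubelet_flag "$demo_file" "$extra"
fi
cat "$demo_file"
```

On a real node, you could load the KUBELET_KUBEADM_ARGS value from /var/lib/kubelet/kubeadm-flags.env and pass it to reserved_memory_flag instead of the sample string used here.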
For more information about these policy settings, see the Kubernetes documentation for the CPU Manager, the Memory Manager, and the Topology Manager.
Configure the node to use hugepages
Once you have enabled Memory Manager with the Static policy, you can add
hugepages to further improve container workload performance on your NUMA nodes.
Hugepages, as the name suggests, let you specify memory pages that are larger
than the standard 4 kibibytes (KiB). VM Runtime on GDC supports 2
mebibyte (MiB) and 1 gibibyte (GiB) hugepages. You can set hugepages for a node
at runtime or when the node machine boots. We recommend that you configure
hugepages on each node that you want to run NUMA-aware VMs.
To configure the number of hugepages of a specific size on your NUMA node at runtime, use the following command:
    echo HUGEPAGE_QTY > \
        /sys/devices/system/node/NUMA_NODE/hugepages/hugepages-HUGEPAGE_SIZEkB/nr_hugepages
Replace the following:
- HUGEPAGE_QTY: the number of hugepages to allocate of the specified size.
- NUMA_NODE: the NUMA node, such as node0, to which you're allocating hugepages.
- HUGEPAGE_SIZE: the size of the hugepages in kibibytes, 2048 (2 MiB) or 1048576 (1 GiB).
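As a worked example of filling in these placeholders, the following sketch builds the sysfs path and computes how many hugepages are needed to back a given amount of memory. The helper names hugepages_path and hugepages_needed are hypothetical, and the count is only a lower bound: it covers the stated memory and makes no allowance for any additional overhead.

```shell
# Hypothetical helper: print the nr_hugepages path for a NUMA node.
# $1: NUMA node such as node0; $2: hugepage size in KiB (2048 or 1048576).
hugepages_path() {
  printf '%s\n' "/sys/devices/system/node/${1}/hugepages/hugepages-${2}kB/nr_hugepages"
}

# Hypothetical helper: number of hugepages needed to back some memory.
# $1: memory in MiB; $2: hugepage size in KiB. Rounds up to a whole page.
hugepages_needed() {
  local mem_kib=$(( $1 * 1024 ))
  printf '%d\n' "$(( (mem_kib + $2 - 1) / $2 ))"
}

hugepages_path node0 2048
hugepages_needed 256 2048   # 256 MiB backed by 2 MiB pages
```

With these helpers, the runtime command for a 256 MiB VM on node0 would take the form echo "$(hugepages_needed 256 2048)" > "$(hugepages_path node0 2048)", run as root on the node.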
Configure a VM to use the NUMA node
Once your cluster nodes are tuned for NUMA, you can create NUMA-aware VMs. NUMA-aware VMs are scheduled on NUMA nodes.
To create a NUMA-aware VM:
Follow the instructions to create a VM from a manifest.
Use the following compute settings to configure your VM to be NUMA-aware:
- spec.compute.guaranteed: Set guaranteed to true. With this setting, the virt-launcher Pod is configured to be placed in the Kubernetes Guaranteed Quality of Service (QoS) class.
- spec.compute.advancedCompute:
  - dedicatedCPUPlacement: Set dedicatedCPUPlacement to true. This setting pins virtual CPUs to the physical CPUs of the node.
  - hugePageSize: Set hugePageSize to either 2Mi or 1Gi to specify the hugepages size for your VM to use, 2 mebibyte or 1 gibibyte.
  - numaGuestMappingPassthrough: Include an empty structure ({}) for this setting. This setting establishes NUMA affinity so that your VM is scheduled on NUMA nodes only.
The following example VirtualMachine manifest shows how a NUMA-aware VM configuration might look:
    apiVersion: vm.cluster.gke.io/v1
    kind: VirtualMachine
    metadata:
      name: vm1
    spec:
      compute:
        cpu:
          vcpus: 2
        guaranteed: true
        advancedCompute:
          dedicatedCPUPlacement: true
          hugePageSize: 2Mi
          numaGuestMappingPassthrough: {}
        memory:
          capacity: 256Mi
      interfaces:
      - name: eth0
        networkName: pod-network
        default: true
      disks:
      - virtualMachineDiskName: disk-from-gcs
        boot: true
        readOnly: true
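Before you apply a manifest, a quick pre-flight check can confirm that the four NUMA-related settings are present. The following is a rough sketch: check_numa_manifest is a hypothetical helper that matches lines as text with grep, so it does not parse YAML and does not verify that each setting is nested under spec.compute.

```shell
# Hypothetical helper: check a manifest file for the NUMA-related settings.
# Returns 0 if all four are found, 1 otherwise.
check_numa_manifest() {
  local file="$1" missing=0 pattern
  for pattern in \
    'guaranteed: true' \
    'dedicatedCPUPlacement: true' \
    'hugePageSize: (2Mi|1Gi)' \
    'numaGuestMappingPassthrough: \{\}'
  do
    # Match the setting on its own line, ignoring indentation.
    if ! grep -Eq "^ *${pattern} *\$" "$file"; then
      echo "missing or unexpected: ${pattern}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demo: write the compute section of the example manifest to a scratch file.
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
spec:
  compute:
    cpu:
      vcpus: 2
    guaranteed: true
    advancedCompute:
      dedicatedCPUPlacement: true
      hugePageSize: 2Mi
      numaGuestMappingPassthrough: {}
    memory:
      capacity: 256Mi
EOF
check_numa_manifest "$manifest" && echo "manifest has the NUMA settings"
```

A text match like this catches only omissions and typos. The cluster still decides whether the VM can be scheduled on a NUMA node.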