This document describes how to configure clusters and VMs to support high-performance and low-latency workloads with the computing efficiencies of non-uniform memory access (NUMA). It includes instructions for tuning Kubernetes settings on cluster nodes, and for configuring virtual machines (VMs) with NUMA affinity so that they are scheduled on, and take advantage of, NUMA nodes.
With a NUMA-aware VM, all the communication within the VM is local to the NUMA node. The NUMA-aware VM avoids data transactions to and from remote resources that can degrade VM performance.
Configure nodes to use NUMA
The following sections describe how to configure the critical Kubernetes components that tune the node and make sure it can schedule NUMA-aware containers. These NUMA nodes are tuned to optimize CPU and memory performance. Follow the instructions for each node on which you want to run NUMA-aware VMs.
Update the kubelet configuration
As part of the node configuration to support NUMA node affinity, you need to make the following changes in the kubelet configuration:
- Enable the CPU Manager with a static policy
- Enable the Memory Manager with a Static policy
- Enable the Topology Manager with a restricted policy
To configure kubelet on your worker node:
1. Locate the kubelet file on your worker node and open it for editing:

   edit /etc/default/kubelet

   If you don't see the kubelet file, create it with the following command:

   echo "KUBELET_EXTRA_ARGS=\"\"" >> /etc/default/kubelet

   This command creates the kubelet file with an empty KUBELET_EXTRA_ARGS="" section.

2. To enable the CPU Manager with a static policy, add the --cpu-manager-policy=static flag to the KUBELET_EXTRA_ARGS="" section of the file:

   KUBELET_EXTRA_ARGS="--cpu-manager-policy=static"
3. To enable the Memory Manager with a Static policy, add the --memory-manager-policy=Static flag to the KUBELET_EXTRA_ARGS="" section of the file:

   KUBELET_EXTRA_ARGS="--cpu-manager-policy=static --memory-manager-policy=Static"
4. To enable the Topology Manager with a restricted policy, add the --topology-manager-policy=restricted flag to the KUBELET_EXTRA_ARGS="" section of the file:

   KUBELET_EXTRA_ARGS="--cpu-manager-policy=static --memory-manager-policy=Static --topology-manager-policy=restricted"
5. Check the current amount of memory reserved by Google Distributed Cloud:

   cat /var/lib/kubelet/kubeadm-flags.env

   The output should look like the following:

   KUBELET_KUBEADM_ARGS="--anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --feature-gates=SeccompDefault=true --kube-reserved=cpu=100m,memory=3470Mi --max-pods=110 --node-ip=192.168.1.190 --node-labels=baremetal.cluster.gke.io/k8s-ip=192.168.1.190,baremetal.cluster.gke.io/namespace=cluster-user001,baremetal.cluster.gke.io/node-pool=node-pool-1,cloud.google.com/gke-nodepool=node-pool-1 --pod-infra-container-image=gcr.io/anthos-baremetal-release/pause-amd64:3.1-gke.5 --provider-id=baremetal://192.168.1.190 --read-only-port=10255 --rotate-server-certificates=true --seccomp-default=true"

   The --kube-reserved=cpu=100m,memory=3470Mi setting indicates that Google Distributed Cloud has reserved 3,470 mebibytes of memory on the node.

6. Set the --reserved-memory flag in the KUBELET_EXTRA_ARGS section of the kubelet file to 100 mebibytes more than the current reserved memory to account for the eviction threshold. If there is no reserved memory, you can skip this step.

   For example, with the reserved memory of 3470Mi from the example in the preceding step, you reserve 3570Mi of memory in the kubelet file:

   KUBELET_EXTRA_ARGS="--cpu-manager-policy=static --memory-manager-policy=Static --topology-manager-policy=restricted --reserved-memory=0:memory=3570Mi"
7. Remove the CPU and memory state files from the /var/lib directory:

   rm /var/lib/cpu_manager_state
   rm /var/lib/memory_manager_state

8. Restart the kubelet:

   systemctl restart kubelet
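After the restart, you can optionally confirm that the new policies took effect. The following commands are a minimal sketch, assuming the state file is regenerated at the same path used in the preceding steps and that the extra flags appear on the kubelet command line; the exact output format can vary by release:

# Check that the policy flags are present on the running kubelet command line.
ps -ef | grep [k]ubelet | tr ' ' '\n' | grep -E 'cpu-manager-policy|memory-manager-policy|topology-manager-policy'

# The regenerated CPU Manager state file should report the static policy,
# for example: {"policyName":"static", ... }
cat /var/lib/cpu_manager_state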
For more information about these policy settings, see the following Kubernetes documentation:

- Control CPU Management Policies on the Node: https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/
- Utilizing the NUMA-aware Memory Manager: https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/
- Control Topology Management Policies on a node: https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
Configure the node to use hugepages
Once you have enabled the Memory Manager with the Static policy, you can add
hugepages to further improve container workload performance on your NUMA nodes.
Hugepages, as the name suggests, let you specify memory pages that are larger
than the standard 4 kibibytes (KiB). VM Runtime on Google Distributed Cloud supports 2
mebibyte (MiB) and 1 gibibyte (GiB) hugepages. You can set hugepages for a node
at runtime or when the node machine boots. We recommend that you configure
hugepages on each node on which you want to run NUMA-aware VMs.
To configure the number of hugepages of a specific size on your NUMA node at runtime, use the following command:
echo HUGEPAGE_QTY > \
  /sys/devices/system/node/NUMA_NODE/hugepages/hugepages-HUGEPAGE_SIZEkB/nr_hugepages
Replace the following:
- HUGEPAGE_QTY: the number of hugepages to allocate of the specified size.
- NUMA_NODE: the NUMA node, such as node0, to which you're allocating hugepages.
- HUGEPAGE_SIZE: the size of the hugepages in kibibytes, 2048 (2 MiB) or 1048576 (1 GiB).
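For example, the following commands allocate 1,024 hugepages of 2 MiB on node0 and then read the value back; the quantity and node are arbitrary values chosen for illustration:

# Allocate 1024 hugepages of 2 MiB (2048 kB) on NUMA node 0.
echo 1024 > \
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

# Read the value back; the kernel may allocate fewer pages than requested
# if contiguous memory is fragmented.
cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages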
Configure a VM to use the NUMA node
Once your cluster nodes are tuned for NUMA, you can create NUMA-aware VMs. NUMA-aware VMs are scheduled on NUMA nodes.
To create a NUMA-aware VM:
1. Follow the instructions to create a VM from a manifest.

2. Use the following compute settings to configure your VM to be NUMA-aware:

   - spec.compute.guaranteed: Set guaranteed to true. With this setting, the virt-launcher Pod is configured to be placed in the Kubernetes Guaranteed Quality of Service (QoS) class.
   - spec.compute.advancedCompute:
     - dedicatedCPUPlacement: Set dedicatedCPUPlacement to true. This setting pins virtual CPUs to the physical CPUs of the node.
     - hugePageSize: Set hugePageSize to either 2Mi or 1Gi to specify the hugepage size for your VM to use, 2 mebibytes or 1 gibibyte.
     - numaGuestMappingPassthrough: Include an empty structure ({}) for this setting. This setting establishes NUMA affinity so that your VM is scheduled on NUMA nodes only.
The following example VirtualMachine manifest shows how a NUMA-aware VM configuration might look:
apiVersion: vm.cluster.gke.io/v1
kind: VirtualMachine
metadata:
  name: vm1
spec:
  compute:
    cpu:
      vcpus: 2
    guaranteed: true
    advancedCompute:
      dedicatedCPUPlacement: true
      hugePageSize: 2Mi
      numaGuestMappingPassthrough: {}
    memory:
      capacity: 256Mi
  interfaces:
    - name: eth0
      networkName: pod-network
      default: true
  disks:
    - virtualMachineDiskName: disk-from-gcs
      boot: true
      readOnly: true
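As a usage sketch, you could save the manifest to a file, apply it, and then confirm that the VM's virt-launcher Pod received the Guaranteed QoS class. The file name and the Pod-name placeholder below are illustrative assumptions; the fully qualified resource name comes from the apiVersion in the manifest:

# Apply the example manifest (the file name vm1.yaml is an assumption).
kubectl apply -f vm1.yaml

# Check that the VirtualMachine resource was created.
kubectl get virtualmachines.vm.cluster.gke.io vm1

# Find the VM's virt-launcher Pod, then confirm its QoS class is Guaranteed.
kubectl get pods | grep virt-launcher
kubectl get pod POD_NAME -o jsonpath='{.status.qosClass}'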