This page shows you how to maximize network bandwidth and throughput for high-performance GPU workloads in Google Kubernetes Engine (GKE) by using Standard clusters with GPUDirect-TCPXO, GPUDirect-TCPX, gVNIC, and multi-networking.
This page is intended for machine learning (ML) engineers and platform administrators who facilitate ML workloads. Before reading this page, ensure that you're familiar with networking technologies, such as network interface cards (NICs) and TCP, and with accelerator technologies like the NVIDIA Collective Communications Library (NCCL).
Artificial intelligence (AI), ML, and high performance computing (HPC) applications require powerful acceleration to optimize performance by reducing job completion times. For example, ML models that focus on conversational AI and image generation require high scalability and compute power.
About Google Cloud GPU supercomputers
Google Cloud has accelerator-optimized supercomputers that are built for scalable, massive models. These machines have the following benefits:
- Eight NVIDIA H100 GPUs per machine.
- Up to 200 Gbps bandwidth on the primary NIC.
- Secondary NICs (up to eight on A3 Mega machine types and up to four on A3 High machine types), each supporting up to 200 Gbps bandwidth for GPU data transfer.
For a full list of benefits, see A3 machine series in the Compute Engine documentation.
Your GKE workload must use all available GPUs and all available secondary NICs on a single node and use a significant portion of the available bandwidth. The solution described in this page is ideal for workloads that require high performance, high throughput, and low latency.
Required features and capabilities for maximized bandwidth
To maximize your network bandwidth in GPU supercomputer nodes, use all of the following features:
- GPUDirect networking stack: The A3 machine series supports two networking stacks for custom, remote direct memory access (RDMA):
  - On A3 High machine types, use GPUDirect-TCPX to reduce the overhead required to transfer packet payloads to and from GPUs, which significantly improves throughput at scale compared to GPUs that don't use GPUDirect.
  - On A3 Mega machine types, use GPUDirect-TCPXO, which further improves GPU-to-VM communication.
- gVNIC: Enable GPUDirect capabilities such as packet header splitting, flow steering, and buffer management. gVNIC is required to use GPUDirect-TCPX or GPUDirect-TCPXO. For details about gVNIC, see Increase network traffic speed for GPU nodes.
- Multi-networking: Add secondary NICs to the accelerator-optimized machine. Each NIC is associated with a separate subnet in its own VPC to avoid conflicts. For details about multi-network support, see Set up multi-network support for Pods.
- Placement policies: Use a resource placement policy to place all GPU nodes for a specific workload on physically close servers to minimize latency. For details, see Define compact placement for GKE nodes.
Procedure outline
To use all of these capabilities together, you'll do the following:
- Create Virtual Private Cloud (VPC) networks and subnets
- Create the GKE environment:
- Create a cluster with multi-networking enabled
- Create a node pool with the following characteristics:
- gVNIC enabled
- Multi-networking subnets specified for each secondary NIC
- A3 machine series with H100 GPUs backing the nodes
- Latest NVIDIA drivers installed
- Install the GPUDirect binary and the NCCL plugin
- Deploy the NRI device injector plugin
- Deploy a test workload to verify GPUDirect setup
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- Ensure that you have enough quota for H100 GPUs. To request more quota, see GPU quotas.
Requirements
The following requirements apply to both GPUDirect-TCPX and GPUDirect-TCPXO unless otherwise indicated.
- GPUDirect-TCPX is supported on GKE version 1.27 or later and requires:
  - An A3 High machine type (for example, a3-highgpu-8g).
  - For GKE version 1.27, use GKE patch version 1.27.7-gke.1121000 or later.
  - For GKE version 1.28, use GKE patch version 1.28.8-gke.1095000 or later.
  - For GKE version 1.29, use GKE patch version 1.29.3-gke.1093000 or later.
  - For GKE version 1.30, use GKE patch version 1.30.2-gke.1023000 or later.
- GPUDirect-TCPXO is supported on GKE version 1.28 or later and requires:
  - An A3 Mega machine type (for example, a3-megagpu-8g).
  - For GKE version 1.28, use GKE patch version 1.28.9-gke.1250000 or later.
  - For GKE version 1.29, use GKE patch version 1.29.4-gke.1542000 or later.
  - For GKE version 1.30, use GKE patch version 1.30.4-gke.1129000 or later.
- Your GPU nodes must use NVIDIA driver version 535 or later.
- You must use GKE Dataplane V2.
- The GKE node must use a Container-Optimized OS (COS) node image. Ubuntu and Windows node images are not supported.
Limitations
The following limitations apply:
- GPUDirect-TCPX and GPUDirect-TCPXO are not supported in Autopilot clusters.
- GPUDirect-TCPX and GPUDirect-TCPXO are not supported with multi-instance GPUs, GPU time-sharing, or NVIDIA MPS.
- You can't use NCCL FastSocket.
- Your GKE workload must use all available GPUs and all available secondary NICs on a single node. Multiple Pods can't use GPUDirect-TCPX or GPUDirect-TCPXO on a single node.
Create VPCs and subnets
Create separate VPC networks in your project for each virtual NIC that you'll add to your nodes. Each VPC network must have a subnet and a firewall rule that allows internal network traffic.
Create the VPC networks for GPUDirect in your project, each with a subnet and a firewall rule. Choose the GPUDirect-TCPX tab for A3 High machine types, or choose the GPUDirect-TCPXO tab for A3 Mega machine types, then complete the following instructions:
GPUDirect-TCPXO
To maximize your bandwidth, we recommend that you create eight new networks.
for N in $(seq 1 8); do
  gcloud compute networks create PROJECT_ID-net-$N \
      --subnet-mode=custom \
      --mtu=8244

  gcloud compute networks subnets create PROJECT_ID-sub-$N \
      --network=PROJECT_ID-net-$N \
      --region=REGION \
      --range=SUBNET_RANGE

  gcloud compute firewall-rules create PROJECT_ID-internal-$N \
      --network=PROJECT_ID-net-$N \
      --action=ALLOW \
      --rules=tcp:0-65535,udp:0-65535,icmp \
      --source-ranges=SOURCE_RANGE
done
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- REGION: the Compute Engine region for each subnet.
- SUBNET_RANGE: the IP address range of each subnet in CIDR notation. This example command iterates over eight subnets, so use a variable to change the IP address range for each subnet. For example, specify 192.168.$N.0/24 so that the first subnet uses 192.168.1.0/24, the second subnet uses 192.168.2.0/24, and so on.
- SOURCE_RANGE: the source IP address range for the firewall rule to allow ingress traffic, in CIDR notation. For example, 192.168.0.0/16.
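As a quick sanity check of how the $N variable expands before you create any resources, the following sketch (plain shell, no gcloud calls; the 192.168.$N.0/24 pattern is the example from the replacements above) prints the range each loop iteration would use:

```shell
# Print the subnet range each iteration of the creation loop would use.
# Uses the example pattern 192.168.$N.0/24 from the instructions above.
subnet_range() {
  N="$1"
  echo "192.168.${N}.0/24"
}

for N in $(seq 1 8); do
  echo "PROJECT_ID-sub-${N}: $(subnet_range "$N")"
done
```

Each subnet gets a distinct, non-overlapping /24, which is what lets all eight VPC networks coexist without address conflicts.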
GPUDirect-TCPX
To maximize your bandwidth, we recommend that you create four new networks.
for N in $(seq 1 4); do
  gcloud compute networks create PROJECT_ID-net-$N \
      --subnet-mode=custom \
      --mtu=8244

  gcloud compute networks subnets create PROJECT_ID-sub-$N \
      --network=PROJECT_ID-net-$N \
      --region=REGION \
      --range=SUBNET_RANGE

  gcloud compute firewall-rules create PROJECT_ID-internal-$N \
      --network=PROJECT_ID-net-$N \
      --action=ALLOW \
      --rules=tcp:0-65535,udp:0-65535,icmp \
      --source-ranges=SOURCE_RANGE
done
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- REGION: the Compute Engine region for each subnet.
- SUBNET_RANGE: the IP address range of each subnet in CIDR notation. This example command iterates over four subnets, so use a variable to change the IP address range for each subnet. For example, specify 192.168.$N.0/24 so that the first subnet uses 192.168.1.0/24, the second subnet uses 192.168.2.0/24, and so on.
- SOURCE_RANGE: the source IP address range for the firewall rule to allow ingress traffic, in CIDR notation. For example, 192.168.0.0/16.
Verify that the networks were created:
gcloud compute networks list
Create the GKE environment
Create a new GKE cluster that uses multi-networking (Preview) and create a GPU node pool that uses A3 machines with H100 GPUs and additional NICs. You can't update an existing cluster to use multi-networking.
GPUDirect-TCPXO
Choose an available GKE version that supports GPUDirect-TCPXO. To list the versions, run this command:
gcloud container get-server-config \
    --format="yaml(validMasterVersions)" \
    --zone=ZONE \
    --project=PROJECT_ID
Replace the following:
- ZONE: the compute zone for the cluster control plane.
- PROJECT_ID: your Google Cloud project ID.
Create a cluster:
gcloud beta container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --enable-dataplane-v2 \
    --enable-ip-alias \
    --zone=ZONE \
    --enable-multi-networking \
    --cluster-version=VERSION \
    --no-enable-autoupgrade
Replace the following:
- CLUSTER_NAME: the name of your new cluster.
- VERSION: a GKE version that supports GPUDirect-TCPXO, as described in Requirements.
- ZONE: the compute zone for the cluster.
- PROJECT_ID: your Google Cloud project ID.
Create Network and GKENetworkParamSet resources in the cluster that correspond to the VPC networks and subnetworks that you created:
kubectl apply -f - <<EOF
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc1
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc1
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc2
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc2
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc3
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc3
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc4
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc4
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc5
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc5
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc6
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc6
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc7
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc7
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc8
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc8
  type: Device
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc1
spec:
  vpc: PROJECT_ID-net-1
  vpcSubnet: PROJECT_ID-sub-1
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc2
spec:
  vpc: PROJECT_ID-net-2
  vpcSubnet: PROJECT_ID-sub-2
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc3
spec:
  vpc: PROJECT_ID-net-3
  vpcSubnet: PROJECT_ID-sub-3
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc4
spec:
  vpc: PROJECT_ID-net-4
  vpcSubnet: PROJECT_ID-sub-4
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc5
spec:
  vpc: PROJECT_ID-net-5
  vpcSubnet: PROJECT_ID-sub-5
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc6
spec:
  vpc: PROJECT_ID-net-6
  vpcSubnet: PROJECT_ID-sub-6
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc7
spec:
  vpc: PROJECT_ID-net-7
  vpcSubnet: PROJECT_ID-sub-7
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc8
spec:
  vpc: PROJECT_ID-net-8
  vpcSubnet: PROJECT_ID-sub-8
  deviceMode: NetDevice
EOF
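Writing eight nearly identical Network and GKENetworkParamSet resources by hand is error-prone. As an alternative, this sketch (an illustrative shell loop, not part of the GKE tooling; PROJECT_ID is a placeholder, as above) generates the equivalent manifest, which you can review and then pipe to kubectl apply -f -:

```shell
# Generate the eight Network/GKENetworkParamSet pairs with a loop.
# Review the output, then pipe it to `kubectl apply -f -`.
gen_gpu_networks() {
  for N in $(seq 1 8); do
    cat <<EOF
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc${N}
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc${N}
  type: Device
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc${N}
spec:
  vpc: PROJECT_ID-net-${N}
  vpcSubnet: PROJECT_ID-sub-${N}
  deviceMode: NetDevice
EOF
  done
}

gen_gpu_networks
```

For example: `gen_gpu_networks | kubectl apply -f -`. Generating the manifest keeps the vpcN, PROJECT_ID-net-N, and PROJECT_ID-sub-N names consistent across all eight resources.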
These resources tell GKE to configure the NICs for GPU traffic in passthrough mode. GKE doesn't apply built-in networking programming using eBPF to this traffic.
Create a node pool for the H100 GPUs:
gcloud beta container node-pools create NODE_POOL_NAME \
    --zone=ZONE \
    --cluster=CLUSTER_NAME \
    --project=PROJECT_ID \
    --accelerator=type=nvidia-h100-mega-80gb,count=8,gpu-driver-version=LATEST \
    --machine-type=a3-megagpu-8g \
    --num-nodes=2 \
    --additional-node-network network=PROJECT_ID-net-1,subnetwork=PROJECT_ID-sub-1 \
    --additional-node-network network=PROJECT_ID-net-2,subnetwork=PROJECT_ID-sub-2 \
    --additional-node-network network=PROJECT_ID-net-3,subnetwork=PROJECT_ID-sub-3 \
    --additional-node-network network=PROJECT_ID-net-4,subnetwork=PROJECT_ID-sub-4 \
    --additional-node-network network=PROJECT_ID-net-5,subnetwork=PROJECT_ID-sub-5 \
    --additional-node-network network=PROJECT_ID-net-6,subnetwork=PROJECT_ID-sub-6 \
    --additional-node-network network=PROJECT_ID-net-7,subnetwork=PROJECT_ID-sub-7 \
    --additional-node-network network=PROJECT_ID-net-8,subnetwork=PROJECT_ID-sub-8 \
    --enable-gvnic \
    --no-enable-autoupgrade \
    --scopes "https://www.googleapis.com/auth/cloud-platform" \
    [--placement-policy=POLICY_NAME \
    --reservation-affinity=specific \
    --reservation=RESERVATION_NAME \
    --host-maintenance-interval=PERIODIC]
Replace NODE_POOL_NAME with your node pool name.

In the example, the --scopes "https://www.googleapis.com/auth/cloud-platform" argument sets the node instance's scope to cloud-platform for testing convenience. For production, you might want to limit the scope to configure finer-grained credentials.

Use the --placement-policy, --reservation-affinity, and --reservation flags if you are using a reservation. Specify these flags to configure the policy name and reservation in the node pool.

If this command fails, you might not have enough H100 GPU quota in your project. Ensure that you have sufficient quota and retry the command.
Get a list of nodes in the cluster:
kubectl get nodes
Verify that each GPU node has eight GPUs:
kubectl describe node NODE_NAME
The output is similar to the following:
Capacity:
  ...
  nvidia.com/gpu: 8
Allocatable:
  ...
  nvidia.com/gpu: 8
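If the node pool has many nodes, checking each node's output by eye is tedious. The following sketch (an illustrative awk helper, not part of the GKE tooling) extracts the GPU count from the Capacity section of `kubectl describe node` output:

```shell
# Extract the nvidia.com/gpu value from the Capacity section of
# `kubectl describe node` output (illustrative helper, not a GKE tool).
gpu_capacity() {
  awk '/^Capacity:/ {in_cap=1}
       /^Allocatable:/ {in_cap=0}
       in_cap && $1 == "nvidia.com/gpu:" {print $2; exit}'
}
```

For example, `kubectl describe node NODE_NAME | gpu_capacity` should print 8 on a correctly provisioned A3 node.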
GPUDirect-TCPX
Create a Standard cluster:
gcloud container clusters create CLUSTER_NAME \
    --location=LOCATION \
    --cluster-version=VERSION \
    --enable-dataplane-v2 \
    --enable-ip-alias \
    --enable-multi-networking \
    --no-enable-autoupgrade
Replace the following:
- CLUSTER_NAME: the name of your new cluster.
- LOCATION: the Compute Engine region for the cluster.
- VERSION: the GKE version for the cluster. Use a supported version, as described in the Requirements section.
Create Network and GKENetworkParamSet resources in the cluster that correspond to the VPC networks and subnetworks that you created:
kubectl apply -f - <<EOF
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc1
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc1
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc2
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc2
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc3
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc3
  type: Device
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc4
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc4
  type: Device
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc1
spec:
  vpc: PROJECT_ID-net-1
  vpcSubnet: PROJECT_ID-sub-1
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc2
spec:
  vpc: PROJECT_ID-net-2
  vpcSubnet: PROJECT_ID-sub-2
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc3
spec:
  vpc: PROJECT_ID-net-3
  vpcSubnet: PROJECT_ID-sub-3
  deviceMode: NetDevice
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc4
spec:
  vpc: PROJECT_ID-net-4
  vpcSubnet: PROJECT_ID-sub-4
  deviceMode: NetDevice
EOF
These resources tell GKE to configure the NICs for GPU traffic in passthrough mode. GKE doesn't apply built-in networking programming using eBPF to this traffic.
Create a node pool for the H100 GPUs:
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=LATEST \
    --additional-node-network=network=PROJECT_ID-net-1,subnetwork=PROJECT_ID-sub-1 \
    --additional-node-network=network=PROJECT_ID-net-2,subnetwork=PROJECT_ID-sub-2 \
    --additional-node-network=network=PROJECT_ID-net-3,subnetwork=PROJECT_ID-sub-3 \
    --additional-node-network=network=PROJECT_ID-net-4,subnetwork=PROJECT_ID-sub-4 \
    --enable-gvnic \
    --no-enable-autoupgrade
Replace NODE_POOL_NAME with the name of the node pool.

If this command fails, you might not have enough H100 GPU quota in your project. Ensure that you have quota and retry the command.
Get a list of nodes in the cluster:
kubectl get nodes
Verify that each GPU node has eight GPUs:
kubectl describe node NODE_NAME
The output is similar to the following:
Capacity:
  ...
  nvidia.com/gpu: 8
Allocatable:
  ...
  nvidia.com/gpu: 8
Install the GPUDirect binary and configure NCCL
This section shows you how to use a DaemonSet to install the GPUDirect binary that matches your A3 machine type (GPUDirect-TCPX for A3 High, GPUDirect-TCPXO for A3 Mega), along with a specific NCCL library version.
GPUDirect-TCPXO
Review the nccl-tcpxo-installer.yaml DaemonSet manifest in GitHub. This DaemonSet does the following:
- Performs pre-installation setup of GPUDirect-TCPXO-related configurations.
- Installs the NCCL library and the GPUDirect-TCPXO binary on the node.
- Stores the library and the binary in the /home/kubernetes/bin/nvidia/lib64 directory on the VM. By default, GKE mounts this directory into the /usr/local/nvidia/lib64 path in GPU containers that need to use NCCL and GPUDirect-TCPXO.
Deploy the DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/nccl-tcpxo-installer.yaml
The NCCL plugin takes approximately two minutes to start running.
Verify the status of the DaemonSet Pods:
kubectl get pods -n=kube-system -l=name=nccl-tcpxo-installer
The output is similar to the following:
nccl-tcpxo-installer-6c2pv   1/1   Running   0   2m11s
nccl-tcpxo-installer-qgg82   1/1   Running   0   2m11s
GPUDirect-TCPX
Review the nccl-tcpx-installer.yaml DaemonSet manifest in GitHub. This DaemonSet does the following:
- Installs the NCCL library and the GPUDirect-TCPX binary on the node.
- Stores the library and the binary in the /home/kubernetes/bin/nvidia/lib64 directory on the VM. By default, GKE mounts this directory into the /usr/local/nvidia/lib64 path in GPU containers that need to use NCCL and GPUDirect-TCPX.
Deploy the DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-tcpx-installer.yaml
The NCCL plugin takes approximately two minutes to start running.
Verify the status of the DaemonSet Pods:
kubectl get pods -n=kube-system -l=name=nccl-tcpx-installer
The output is similar to the following:
nccl-tcpx-installer-6c2pv   1/1   Running   0   2m11s
nccl-tcpx-installer-qgg82   1/1   Running   0   2m11s
Deploy NRI device injector plugin
This section shows you how to install the NRI device injector by using a DaemonSet. Both A3 machine types (A3 High with GPUDirect-TCPX and A3 Mega with GPUDirect-TCPXO) use the same NRI device injector plugin.
Review the nri-device-injector.yaml DaemonSet manifest in GitHub. This DaemonSet does the following:
- Enables the Node Resource Interface (NRI) on nodes that have H100 GPUs.
- Deploys an NRI device injector plugin container that injects GPU devices into containers specified by Pod annotations.
Deploy the DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nri_device_injector/nri-device-injector.yaml
The device injector plugin takes approximately two minutes to start running.
Verify the status of the DaemonSet Pods:
kubectl get pods -n=kube-system -l=name=device-injector
The output is similar to the following:
device-injector-md6hb   1/1   Running   0   4h54m
device-injector-vh9bm   1/1   Running   0   4h54m
Deploy a test workload
In this section, you deploy a sample workload to verify that NCCL and GPUDirect-TCPX or GPUDirect-TCPXO work as expected.
GPUDirect-TCPXO
This workload includes a sidecar container named tcpxo-daemon, which runs a service that lets the Pod use GPUDirect-TCPXO. You must add this sidecar container to any Pods in your own environment that need to use GPUDirect-TCPXO. For a snippet of the required fields to add to your manifests, see Add GPUDirect to your manifests.
Review the nccl-test-latest.yaml manifest in GitHub. This manifest does the following:
- Deploys two Pods, each of which runs on a node that has H100 GPUs.
- Deploys a sidecar container named tcpxo-daemon in each Pod to let those Pods use GPUDirect-TCPXO.
Deploy two Pods with the test workload:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/nccl-test-latest.yaml
Run the following commands to trigger an NCCL all-gather test for the two nodes:
kubectl exec --stdin --tty --container=nccl-test nccl-test-host-1 -- /scripts/allgather.sh nccl-host-1 nccl-host-2
The output is similar to the following:
#                                                       out-of-place                       in-place
#        size        count    type  redop   root     time  algbw  busbw  #wrong     time  algbw  busbw  #wrong
#         (B)   (elements)                            (us) (GB/s) (GB/s)            (us) (GB/s) (GB/s)
      1048576        16384   float   none     -1   4654.5   0.23   0.21      0    3890.9   0.27   0.25      0
      2097152        32768   float   none     -1   4117.2   0.51   0.48      0    5153.5   0.41   0.38      0
      4194304        65536   float   none     -1   6417.4   0.65   0.61      0    7295.5   0.57   0.54      0
      8388608       131072   float   none     -1   7872.1   1.07   1.00      0    6451.4   1.30   1.22      0
     16777216       262144   float   none     -1   6990.7   2.40   2.25      0    5609.3   2.99   2.80      0
     33554432       524288   float   none     -1   8254.0   4.07   3.81      0    7415.1   4.53   4.24      0
     67108864      1048576   float   none     -1   5546.3  12.10  11.34      0    6484.0  10.35   9.70      0
    134217728      2097152   float   none     -1   6507.3  20.63  19.34      0    6015.4  22.31  20.92      0
    268435456      4194304   float   none     -1   6744.1  39.80  37.32      0    7023.1  38.22  35.83      0
    536870912      8388608   float   none     -1   8939.8  60.05  56.30      0     11706  45.86  43.00      0
   1073741824     16777216   float   none     -1   8241.7 130.28 122.14      0    8375.2 128.20 120.19      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 22.449
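The final "Avg bus bandwidth" line is the mean of the out-of-place and in-place busbw columns over all message sizes. You can reproduce it from the table with a short awk sketch (an illustrative helper; the column positions assume the standard nccl-tests output layout, where busbw is field 8 out-of-place and field 12 in-place):

```shell
# Average the out-of-place (field 8) and in-place (field 12) busbw
# columns across all non-comment rows of nccl-tests output.
avg_busbw() {
  awk '!/^#/ && NF >= 12 {sum += $8 + $12; n += 2}
       END {printf "%.3f\n", sum / n}'
}
```

Feeding it the table above prints 22.449, matching the reported average, so the summary line is just an arithmetic mean rather than a separate measurement.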
GPUDirect-TCPX
This workload includes a sidecar container named tcpx-daemon, which runs a service that lets the Pod use GPUDirect-TCPX. You must add this sidecar container to any Pods in your own environment that need to use GPUDirect-TCPX. For a snippet of the required fields to add to your manifests, see Add GPUDirect to your manifests.
- Review the nccl-config.yaml ConfigMap manifest in GitHub. This manifest deploys scripts that initialize an NCCL all-gather test and sets NCCL-specific configuration settings.
- Review the nccl-test-latest.yaml Deployment manifest in GitHub. This manifest does the following:
  - Deploys two Pods, each of which runs on a node that has H100 GPUs.
  - Deploys a sidecar container named tcpx-daemon in each Pod to let those Pods use GPUDirect-TCPX.
Deploy the ConfigMap and the test workload:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-config.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-test-latest.yaml
Run the following commands to trigger an NCCL all-gather test for the nodes:
kubectl exec \
    --stdin --tty --container=nccl-test nccl-test-host-1 \
    -- /configs/allgather.sh nccl-host-1 nccl-host-2
The output is similar to the following:
#                                                       out-of-place                       in-place
#        size        count    type  redop   root     time  algbw  busbw  #wrong     time  algbw  busbw  #wrong
#         (B)   (elements)                            (us) (GB/s) (GB/s)            (us) (GB/s) (GB/s)
      1048576        16384   float   none     -1    696.8   1.50   1.41      0     729.0   1.44   1.35      0
      2097152        32768   float   none     -1    776.4   2.70   2.53      0     726.7   2.89   2.71      0
      4194304        65536   float   none     -1    774.3   5.42   5.08      0     805.1   5.21   4.88      0
      8388608       131072   float   none     -1    812.1  10.33   9.68      0     817.6  10.26   9.62      0
     16777216       262144   float   none     -1   1035.2  16.21  15.19      0    1067.8  15.71  14.73      0
     33554432       524288   float   none     -1   1183.3  28.36  26.59      0    1211.8  27.69  25.96      0
     67108864      1048576   float   none     -1   1593.4  42.12  39.49      0    1510.5  44.43  41.65      0
    134217728      2097152   float   none     -1   2127.8  63.08  59.13      0    2312.7  58.03  54.41      0
    268435456      4194304   float   none     -1   3603.0  74.50  69.85      0    3586.2  74.85  70.17      0
    536870912      8388608   float   none     -1   7101.7  75.60  70.87      0    7060.9  76.03  71.28      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 29.8293
Use required NCCL configuration settings to improve performance
The following key-value pairs are the required NCCL configuration settings for GPUDirect-TCPX and GPUDirect-TCPXO. When deploying your workloads that use NCCL, set them as environment variables to optimize performance.
GPUDirect-TCPXO
## required
"LD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64\"",
"NCCL_FASTRAK_CTRL_DEV=eth0",
"NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8",
"NCCL_SOCKET_IFNAME=eth0",
"NCCL_CROSS_NIC=0",
"NCCL_ALGO=Ring,Tree",
"NCCL_PROTO=Simple",
"NCCL_MIN_NCHANNELS=4",
"NCCL_TUNER_PLUGIN=libnccl-tuner.so",
"NCCL_TUNER_CONFIG_PATH=/usr/local/nvidia/lib64/a3plus_tuner_config.textproto",
"NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/usr/local/nvidia/lib64/a3plus_guest_config.textproto",
"NCCL_DYNAMIC_CHUNK_SIZE=524288",
"NCCL_P2P_NET_CHUNKSIZE=524288",
"NCCL_P2P_PCI_CHUNKSIZE=524288",
"NCCL_P2P_NVL_CHUNKSIZE=1048576",
"NCCL_FASTRAK_NUM_FLOWS=2",
"NCCL_FASTRAK_USE_SNAP=1",
"NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS=600000",
"NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL=0",
"NCCL_BUFFSIZE=8388608",
"CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7",
"NCCL_NET_GDR_LEVEL=PIX",
"NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING=0",
"NCCL_FASTRAK_USE_LLCM=1",
"NCCL_NVLS_ENABLE=0"
## recommended, to log NCCL errors
"NCCL_DEBUG=WARN",
"NCCL_DEBUG_SUBSYS=INIT,NET,ENV,COLL,GRAPH"
Optionally, you can set all of the configurations at once by following these steps:

- Add the following key-value pair as an environment variable in your workload container manifest:

  NCCL_LIB_DIR="/usr/local/nvidia/lib64"

- Ensure that the nccl-env-profile.sh script is executed when your workload container starts. For example, you can do this in your Pod specification by overriding the container's command to include the following:

  source ${NCCL_LIB_DIR}/nccl-env-profile.sh
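Before launching a job, it can help to fail fast if a required variable is missing from the container environment. The following sketch (an illustrative helper, not part of the GPUDirect tooling; the variable list is a subset of the required TCPXO settings above, so adjust it for TCPX or for your full list) reports any unset variables:

```shell
# Fail fast if any listed GPUDirect-TCPXO NCCL variable is unset.
# The list here is a subset of the required settings above; extend as needed.
check_nccl_env() {
  missing=0
  for var in NCCL_FASTRAK_CTRL_DEV NCCL_FASTRAK_IFNAME NCCL_SOCKET_IFNAME \
             NCCL_CROSS_NIC NCCL_ALGO NCCL_PROTO; do
    if [ -z "$(eval echo "\${$var}")" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, running `check_nccl_env || exit 1` at the top of your workload's entrypoint surfaces a misconfigured environment before NCCL initialization fails with a harder-to-read error.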
GPUDirect-TCPX
## required
"LD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}:/usr/local/tcpx/lib64\"",
"NCCL_SOCKET_IFNAME=\"eth0\"",
"NCCL_ALGO=Ring",
"NCCL_PROTO=Simple",
"NCCL_CROSS_NIC=0",
"NCCL_NET_GDR_LEVEL=PIX",
"NCCL_P2P_PXN_LEVEL=0",
"NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4",
"NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0",
"NCCL_DYNAMIC_CHUNK_SIZE=524288",
"NCCL_P2P_NET_CHUNKSIZE=524288",
"NCCL_P2P_PCI_CHUNKSIZE=524288",
"NCCL_P2P_NVL_CHUNKSIZE=1048576",
"NCCL_BUFFSIZE=4194304",
"NCCL_NSOCKS_PERTHREAD=4",
"NCCL_SOCKET_NTHREADS=1",
"NCCL_GPUDIRECTTCPX_TX_BINDINGS=\"eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177\"",
"NCCL_GPUDIRECTTCPX_RX_BINDINGS=\"eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191\"",
"NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000",
"NCCL_NVLS_ENABLE=0"
Add GPUDirect to your manifests
This section shows the required fields that you must add to your Kubernetes manifests for your Pods to use GPUDirect.
GPUDirect-TCPXO
Add the following annotations to the Pod metadata. Without these annotations, hostNetwork: true is required for the Pod, and privileged: true is required for the tcpxo-daemon container.

metadata:
  annotations:
    devices.gke.io/container.tcpxo-daemon: |+
      - path: /dev/nvidia0
      - path: /dev/nvidia1
      - path: /dev/nvidia2
      - path: /dev/nvidia3
      - path: /dev/nvidia4
      - path: /dev/nvidia5
      - path: /dev/nvidia6
      - path: /dev/nvidia7
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
      - path: /dev/dmabuf_import_helper
    networking.gke.io/default-interface: 'eth0'
    networking.gke.io/interfaces: |
      [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"vpc1"},
        {"interfaceName":"eth2","network":"vpc2"},
        {"interfaceName":"eth3","network":"vpc3"},
        {"interfaceName":"eth4","network":"vpc4"},
        {"interfaceName":"eth5","network":"vpc5"},
        {"interfaceName":"eth6","network":"vpc6"},
        {"interfaceName":"eth7","network":"vpc7"},
        {"interfaceName":"eth8","network":"vpc8"}
      ]
Add the following fields to the Pod specification:
spec:
  volumes:
  - name: libraries
    hostPath:
      path: /home/kubernetes/bin/nvidia/lib64
  - name: sys
    hostPath:
      path: /sys
  - name: proc-sys
    hostPath:
      path: /proc/sys
  - name: aperture-devices
    hostPath:
      path: /dev/aperture_devices
Add the following container to the manifest to run the tcpxo-daemon service. Replace TCPXO_DAEMON_IMAGE with the latest image, us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.11:

- name: tcpxo-daemon
  image: TCPXO_DAEMON_IMAGE
  imagePullPolicy: Always
  command: ["/bin/sh", "-c"]
  args:
    - |
      set -ex
      chmod 755 /fts/entrypoint_rxdm_container.sh
      /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
  securityContext:
    capabilities:
      add:
        - NET_ADMIN
        - NET_BIND_SERVICE
  volumeMounts:
    - name: libraries
      mountPath: /usr/local/nvidia
    - name: sys
      mountPath: /hostsysfs
    - name: proc-sys
      mountPath: /hostprocsysfs
  env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
Add the following environment variable to every GPU container:
env:
  - name: LD_LIBRARY_PATH
    value: /usr/local/nvidia/lib64
  - name: NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY
    value: /dev/aperture_devices
Add the following volumeMounts to every GPU container. Without the aperture_devices setup, privileged: true is required for GPU containers:

volumeMounts:
  - name: aperture-devices
    mountPath: /dev/aperture_devices
Add environment variables to configure NCCL options. For details, see Use recommended NCCL configuration settings to improve performance.
A completed Pod specification looks like the following:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: a3plus-workloads
  annotations:
    devices.gke.io/container.tcpxo-daemon: |+
      - path: /dev/nvidia0
      - path: /dev/nvidia1
      - path: /dev/nvidia2
      - path: /dev/nvidia3
      - path: /dev/nvidia4
      - path: /dev/nvidia5
      - path: /dev/nvidia6
      - path: /dev/nvidia7
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
      - path: /dev/dmabuf_import_helper
    networking.gke.io/default-interface: 'eth0'
    networking.gke.io/interfaces: |
      [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"vpc1"},
        {"interfaceName":"eth2","network":"vpc2"},
        {"interfaceName":"eth3","network":"vpc3"},
        {"interfaceName":"eth4","network":"vpc4"},
        {"interfaceName":"eth5","network":"vpc5"},
        {"interfaceName":"eth6","network":"vpc6"},
        {"interfaceName":"eth7","network":"vpc7"},
        {"interfaceName":"eth8","network":"vpc8"}
      ]
spec:
  ...
  containers:
    - name: tcpxo-daemon
      image: TCPXO_DAEMON_IMAGE
      imagePullPolicy: Always
      command: ["/bin/sh", "-c"]
      args:
        - |
          set -ex
          chmod 755 /fts/entrypoint_rxdm_container.sh
          /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
      securityContext:
        capabilities:
          add:
            - NET_ADMIN
            - NET_BIND_SERVICE
      volumeMounts:
        - name: libraries
          mountPath: /usr/local/nvidia
        - name: sys
          mountPath: /hostsysfs
        - name: proc-sys
          mountPath: /hostprocsysfs
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
    - name: main-application-container
      ...
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
        - name: NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY
          value: /dev/aperture_devices
      securityContext:
      volumeMounts:
        - name: aperture-devices
          mountPath: /dev/aperture_devices
      resources:
        limits:
          nvidia.com/gpu: 8
  volumes:
    - name: libraries
      hostPath:
        path: /home/kubernetes/bin/nvidia
    - name: sys
      hostPath:
        path: /sys
    - name: proc-sys
      hostPath:
        path: /proc/sys
```
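The value of the `networking.gke.io/interfaces` annotation must be valid JSON, or GKE cannot attach the secondary interfaces. As a quick, illustrative sanity check (not part of any GKE tooling), you can parse the nine-interface A3 Mega layout before applying the manifest:

```python
import json

# The annotation value from the TCPXO Pod manifest above.
interfaces_annotation = """
[
  {"interfaceName":"eth0","network":"default"},
  {"interfaceName":"eth1","network":"vpc1"},
  {"interfaceName":"eth2","network":"vpc2"},
  {"interfaceName":"eth3","network":"vpc3"},
  {"interfaceName":"eth4","network":"vpc4"},
  {"interfaceName":"eth5","network":"vpc5"},
  {"interfaceName":"eth6","network":"vpc6"},
  {"interfaceName":"eth7","network":"vpc7"},
  {"interfaceName":"eth8","network":"vpc8"}
]
"""

# json.loads raises ValueError on malformed JSON (for example, a trailing comma).
interfaces = json.loads(interfaces_annotation)

# A3 Mega: one default interface plus eight GPU NICs.
assert len(interfaces) == 9
assert interfaces[0] == {"interfaceName": "eth0", "network": "default"}
assert all(i["interfaceName"] == f"eth{n}" for n, i in enumerate(interfaces))
print("interfaces annotation OK")
```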
GPUDirect-TCPX
Add the following annotations to the Pod metadata. Without these annotations, `hostNetwork:true` is required for the Pod, and `privileged:true` is required for the `tcpx-daemon` container.

```yaml
metadata:
  annotations:
    devices.gke.io/container.tcpx-daemon: |+
      - path: /dev/nvidia0
      - path: /dev/nvidia1
      - path: /dev/nvidia2
      - path: /dev/nvidia3
      - path: /dev/nvidia4
      - path: /dev/nvidia5
      - path: /dev/nvidia6
      - path: /dev/nvidia7
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
    networking.gke.io/default-interface: 'eth0'
    networking.gke.io/interfaces: |
      [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"vpc1"},
        {"interfaceName":"eth2","network":"vpc2"},
        {"interfaceName":"eth3","network":"vpc3"},
        {"interfaceName":"eth4","network":"vpc4"}
      ]
```
Add the following fields to the Pod specification:
```yaml
spec:
  volumes:
    - name: libraries
      hostPath:
        path: /home/kubernetes/bin/nvidia/lib64
    - name: sys
      hostPath:
        path: /sys
    - name: proc-sys
      hostPath:
        path: /proc/sys
```
Add the following container to the manifest to run the tcpx-daemon service:
```yaml
- name: tcpx-daemon
  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.9
  command:
    - /tcpgpudmarxd/build/app/tcpgpudmarxd
    - --gpu_nic_preset
    - a3vm
    - --gpu_shmem_type
    - fd
    - --uds_path
    - /run/tcpx
    - --setup_param
    - \"--verbose 128 2 0 \"
  securityContext:
    capabilities:
      add:
        - NET_ADMIN
  volumeMounts:
    - name: libraries
      mountPath: /usr/local/nvidia/lib64
    - name: tcpx-socket
      mountPath: /run/tcpx
    - name: sys
      mountPath: /hostsysfs
    - name: proc-sys
      mountPath: /hostprocsysfs
  env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
```
Add the following volume mounts to any containers that request GPUs:
```yaml
volumeMounts:
  - name: tcpx-socket
    mountPath: /tmp
  - name: libraries
    mountPath: /usr/local/nvidia/lib64
```
Add the following environment variable to every GPU container:
```yaml
env:
  - name: LD_LIBRARY_PATH
    value: /usr/local/nvidia/lib64
```
Add environment variables to configure NCCL options. For details, see the Use recommended NCCL configuration settings to improve performance section in this page.
A completed Pod specification looks like the following:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: a3-gpu-workloads-example
  labels:
    name: a3-gpu-workloads-example
  annotations:
    devices.gke.io/container.tcpx-daemon: |+
      - path: /dev/nvidia0
      - path: /dev/nvidia1
      - path: /dev/nvidia2
      - path: /dev/nvidia3
      - path: /dev/nvidia4
      - path: /dev/nvidia5
      - path: /dev/nvidia6
      - path: /dev/nvidia7
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
    networking.gke.io/default-interface: 'eth0'
    networking.gke.io/interfaces: |
      [
        {"interfaceName":"eth0","network":"default"},
        {"interfaceName":"eth1","network":"vpc1"},
        {"interfaceName":"eth2","network":"vpc2"},
        {"interfaceName":"eth3","network":"vpc3"},
        {"interfaceName":"eth4","network":"vpc4"}
      ]
spec:
  containers:
    - name: tcpx-daemon
      image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
      imagePullPolicy: Always
      command:
        - /tcpgpudmarxd/build/app/tcpgpudmarxd
        - --gpu_nic_preset
        - a3vm
        - --gpu_shmem_type
        - fd
        - --uds_path
        - /run/tcpx
        - --setup_param
        - \"--verbose 128 2 0 \"
      securityContext:
        capabilities:
          add:
            - NET_ADMIN
      volumeMounts:
        - name: libraries
          mountPath: /usr/local/nvidia/lib64
          readOnly: true
        - name: tcpx-socket
          mountPath: /run/tcpx
        - name: sys
          mountPath: /hostsysfs
        - name: proc-sys
          mountPath: /hostprocsysfs
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
    - name: a3-gpu-workloads-example
      ...
      volumeMounts:
        - name: tcpx-socket
          mountPath: /tmp
        - name: libraries
          mountPath: /usr/local/nvidia/lib64
          readOnly: true
      resources:
        limits:
          nvidia.com/gpu: 8
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
      ...
  volumes:
    - name: libraries
      hostPath:
        path: /home/kubernetes/bin/nvidia/lib64
    - name: tcpx-socket
      emptyDir: {}
    - name: sys
      hostPath:
        path: /sys
    - name: proc-sys
      hostPath:
        path: /proc/sys
```
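The device list in the `devices.gke.io/container.tcpx-daemon` annotation is also easy to sanity-check before applying the manifest. A small illustrative script (not part of GKE) that parses the annotation body and verifies the devices an A3 High node exposes:

```python
# Body of the devices.gke.io/container.tcpx-daemon annotation from the manifest above.
annotation = """\
- path: /dev/nvidia0
- path: /dev/nvidia1
- path: /dev/nvidia2
- path: /dev/nvidia3
- path: /dev/nvidia4
- path: /dev/nvidia5
- path: /dev/nvidia6
- path: /dev/nvidia7
- path: /dev/nvidiactl
- path: /dev/nvidia-uvm
"""

# Each entry has the form "- path: <device>"; extract the device paths.
paths = [line.split("path:", 1)[1].strip() for line in annotation.splitlines()]

# A3 High: eight GPU devices plus the control and unified-memory devices.
expected = [f"/dev/nvidia{i}" for i in range(8)] + ["/dev/nvidiactl", "/dev/nvidia-uvm"]
assert paths == expected, f"unexpected device list: {paths}"
print("tcpx-daemon device annotation OK")
```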
What's next
- Learn about best practices for GKE networking.
- Learn more about the NVIDIA GPUDirect family of technologies for data movement and access on NVIDIA GPUs.
- Learn about current GPU version availability and requesting GPUs in GKE.