This document outlines the deployment steps for provisioning A4 or A3 Ultra VMs that run on Hypercompute Cluster and use Slurm as an orchestrator. For more information about Hypercompute Cluster, see Hypercompute Cluster.
Before you begin
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Ensure that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity.
- To check quota, see View API-specific quota.
- If you don't have enough quota, request a quota increase.
Ensure that you have reserved A4 or A3 Ultra machines. To reserve resources, see Request capacity.
To provision Slurm clusters, you must use one of the following Cluster Toolkit versions:
- For A3 Ultra: v1.44.1 or later
- For A4: v1.47.0 or later
To install Cluster Toolkit, see Set up Cluster Toolkit.
Overview
To deploy the cluster and run a GPUDirect-RDMA performance test, complete the following steps:
- Set up a Cloud Storage bucket. See Set up Cloud Storage bucket.
- Switch to the Cluster Toolkit directory. See Switch to the Cluster Toolkit directory.
- Create a deployment file. See Create a deployment file.
- Provision the cluster. See Provision the Slurm cluster.
- Connect to the Slurm cluster. See Connect to the Slurm cluster.
- Test GPUDirect-RDMA performance on the cluster. See Test network performance on the Slurm cluster.
Set up a Cloud Storage bucket
To set up the storage bucket, follow the steps for your required machine type.
Cluster blueprints use Terraform modules to provision Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in a version-enabled file. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.
To create this bucket from the CLI, run the following command:
gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD --location=REGION \
    --uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following:
- BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
- PROJECT_ID: your project ID.
- REGION: any Google Cloud region of your choice.
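Before creating the bucket, it can help to check the chosen name locally. The following is a minimal sketch, not an official validator: the function name is_valid_bucket_name is hypothetical, and it checks only a subset of the bucket naming requirements (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit; not starting with "goog"). Names that pass can still be rejected by Cloud Storage for other reasons, such as already being taken.

```shell
# Hypothetical helper: partial check of a Cloud Storage bucket name.
# It covers only a subset of the naming rules and is not a full validator.
is_valid_bucket_name() {
  # Reject names that start with the reserved "goog" prefix.
  case "$1" in
    goog*) return 1 ;;
  esac
  # 3-63 characters, allowed characters only, letter or digit at both ends.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Example usage:
#   is_valid_bucket_name "my-terraform-state" && echo "name looks usable"
```

The function returns a zero exit status when the name passes these checks, so it can gate the gcloud storage buckets create command in a script.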
Switch to the Cluster Toolkit directory
After you have installed the Cluster Toolkit, ensure that you are in the Cluster Toolkit directory by running:
cd cluster-toolkit
This cluster deployment requires Cluster Toolkit v1.44.1 or later for A3 Ultra, or v1.47.0 or later for A4. To check your version, you can run the ./gcluster --version command.
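If you want to compare the installed version against the minimum in a script, the following sketch shows one way to do it. The function name version_at_least is hypothetical, and the sketch assumes GNU sort with the -V (version sort) option is available. It does not parse the output of ./gcluster --version; you pass in the version string yourself.

```shell
# Hypothetical helper: succeeds if INSTALLED is the same as or newer than REQUIRED.
# Usage: version_at_least INSTALLED REQUIRED   (a leading "v" is optional)
version_at_least() {
  # Version-sort both values; if REQUIRED sorts first, INSTALLED is new enough.
  [ "$(printf '%s\n%s\n' "${2#v}" "${1#v}" | sort -V | head -n 1)" = "${2#v}" ]
}

# Example usage for an A4 deployment:
#   version_at_least v1.50.0 v1.47.0 && echo "Cluster Toolkit is new enough"
```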
Create a deployment file
Create a deployment file that you can use to specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as project_id, region, and zone.
To create a deployment file, follow the steps for your required machine type.
A4 machines
To create your deployment file, use a text editor to create a YAML file named a4high-slurm-deployment.yaml and add the following content.
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a4h_reservation_name: RESERVATION_NAME
  a4h_cluster_size: NUMBER_OF_VMS
Replace the following:
- DEPLOYMENT_NAME: a name for your deployment. If you create multiple clusters, select a unique name for each one.
- BUCKET_NAME: the name of your Cloud Storage bucket, created in the previous section.
- PROJECT_ID: your project ID.
- REGION: the region that has the reserved machines.
- ZONE: the zone that has the reserved machines. The region and zone information was provided by your Technical Account Manager (TAM) when the capacity was delivered.
- RESERVATION_NAME: the name of your reservation.
- NUMBER_OF_VMS: the number of VMs needed for the cluster.
A3 Ultra machines
To create your deployment file, use a text editor to create a YAML file named a3ultra-slurm-deployment.yaml and add the following content.
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a3u_reservation_name: RESERVATION_NAME
  a3u_cluster_size: NUMBER_OF_VMS
Replace the following:
- DEPLOYMENT_NAME: a name for your deployment. If you create multiple clusters, select a unique name for each one.
- BUCKET_NAME: the name of your Cloud Storage bucket, created in the previous section.
- PROJECT_ID: your project ID.
- REGION: the region that has the reserved machines.
- ZONE: the zone that has the reserved machines. The region and zone information was provided by your Technical Account Manager (TAM) when the capacity was delivered.
- RESERVATION_NAME: the name of your reservation.
- NUMBER_OF_VMS: the number of VMs needed for the cluster.
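As an alternative to a text editor, the deployment file can be generated from shell variables. The following is a sketch under stated assumptions: the function name write_deployment_file is hypothetical, and it expects the placeholders described above to be set as shell variables with the same names. The first argument is the variable prefix used by the deployment file (a4h for A4, a3u for A3 Ultra).

```shell
# Hypothetical helper: write a deployment file from shell variables.
# Usage: write_deployment_file PREFIX OUTPUT_FILE
#   PREFIX is a4h (A4) or a3u (A3 Ultra).
# The body runs in a subshell so a missing variable aborts only this function.
write_deployment_file() (
  prefix="$1"
  output="$2"
  # Fail before writing anything if a required variable is unset or empty.
  : "${BUCKET_NAME:?}" "${DEPLOYMENT_NAME:?}" "${PROJECT_ID:?}" "${REGION:?}"
  : "${ZONE:?}" "${RESERVATION_NAME:?}" "${NUMBER_OF_VMS:?}"
  cat > "$output" <<EOF
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: ${BUCKET_NAME}

vars:
  deployment_name: ${DEPLOYMENT_NAME}
  project_id: ${PROJECT_ID}
  region: ${REGION}
  zone: ${ZONE}
  ${prefix}_reservation_name: ${RESERVATION_NAME}
  ${prefix}_cluster_size: ${NUMBER_OF_VMS}
EOF
)

# Example usage for A3 Ultra, after setting the variables:
#   write_deployment_file a3u a3ultra-slurm-deployment.yaml
```

Generating the file this way keeps the A4 and A3 Ultra variants in sync, because only the variable prefix and the output file name differ.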
Provision the Slurm cluster
To provision the cluster, run the command for your machine type from the Cluster Toolkit directory. This step takes approximately 20-30 minutes.
A4 machines
./gcluster deploy -d a4high-slurm-deployment.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml
A3 Ultra machines
./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml
Connect to the Slurm cluster
To access your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.
Console
Go to the Compute Engine > VM instances page.
Locate the login node. It should have a name similar to a3ultra-login-001 or a4high-login-001.
From the Connect column of the login node, click SSH.
gcloud
To connect to the login node, complete the following steps:
Identify the login node by using the gcloud compute instances list command.
gcloud compute instances list \
    --zone=ZONE \
    --filter="name ~ login" --format "value(name)"
If you have multiple Slurm clusters, you can identify each login node by using the DEPLOYMENT_NAME.
Use the gcloud compute ssh command to connect to the login node.
gcloud compute ssh LOGIN_NODE \
    --zone=ZONE
Replace the following:
- ZONE: the zone where your VMs are created.
- LOGIN_NODE: the name of the login node.
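When the list command returns several names, a small filter can pick the login node for one cluster. The following sketch assumes that login node names start with a cluster-specific prefix followed by -login-, as in the example names a3ultra-login-001 and a4high-login-001. The function name pick_login_node is hypothetical; it reads instance names on standard input and prints the first match.

```shell
# Hypothetical helper: print the first instance name that starts with
# "PREFIX-login-". Instance names are read from standard input, one per line.
# Usage: pick_login_node PREFIX
pick_login_node() {
  # index() does a literal prefix match, so PREFIX is not treated as a regex.
  awk -v p="$1-login-" 'index($0, p) == 1 { print; exit }'
}

# Example usage: pipe the output of the instances list command into the filter.
#   ... | pick_login_node a4high
```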
Redeploy the Slurm cluster
If you need to increase the number of compute nodes or add new partitions to your cluster, you might need to update configurations for your Slurm cluster by redeploying. Redeployment can be sped up by using an existing image from a previous deployment. To avoid creating new images during a redeploy, specify the --only flag.
To redeploy the cluster using an existing image, run the command for your required machine type.
A4 machines
./gcluster deploy -d a4high-slurm-deployment.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml --only cluster-env,cluster
A3 Ultra machines
./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml --only cluster-env,cluster
This command is only for redeployments where an image already exists. It redeploys only the cluster and its infrastructure.
Test network performance on the Slurm cluster
To test NCCL communication, complete the steps for your machine type.
A4 machines
The following test uses Ramble, which is an open-source, multi-platform experimentation framework written in Python that is used to coordinate the running of NCCL tests.
The run scripts used for this test are staged in the /opt/apps/system_benchmarks directory on the Slurm controller node and are available to all nodes in the cluster. Running this test installs Ramble to /opt/apps/ramble.
From the login node in the ${HOME} directory, run the following command. Because the test can take approximately 10 minutes, or longer if other jobs are in the queue, the following command uses nohup and redirects stdout and stderr to a log file.
nohup bash /opt/apps/system_benchmarks/run-nccl-tests-via-ramble.sh >& nccl.log &
This command creates a folder called nccl-tests_$(date +%s) that stores all of the test results. The date tag ensures that a unique folder is created based on each current timestamp.
For example, if your cluster has 16 nodes, then NCCL tests are run for all-gather, all-reduce, and reduce-scatter on 2, 4, 8, and 16 nodes.
Review the results. The nccl.log file contains the logs from setting up and running the test. To view the log, you can run:
tail -f nccl.log
You can use Ctrl-c to stop tailing the output at any time. At the end of the nccl.log file, your output should resemble the following:
...
---- SUMMARY for >1GB Message Sizes ----
workload        n_nodes  msg_size    busbw
all-gather      2        1073741824  XXX.XX
all-gather      2        2147483648  XXX.XX
all-gather      2        4294967296  XXX.XX
all-gather      2        8589934592  XXX.XX
...
all-reduce      2        1073741824  XXX.XX
...
reduce-scatter  2        1073741824  XXX.XX
...
-------- Benchmarking Complete -------
All of the Slurm job scripts and nccl-tests output logs are stored in the nccl-tests_$(date +%s)/experiments directory. A summary of the NCCL test performance is also stored in nccl-tests_$(date +%s)/summary.tsv.
Removing nccl-tests_$(date +%s)/ removes all of the files generated during these tests.
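The following sketch shows one way to check whether the run has finished and to pull one workload out of the summary file. Both function names are hypothetical. The completion check looks for the "Benchmarking Complete" marker shown in the sample output above. The filter assumes that summary.tsv is tab-separated with a header row and the same columns as the log summary (workload, n_nodes, msg_size, busbw); verify this against your own file before relying on it.

```shell
# Hypothetical helper: succeeds once the log contains the completion marker.
# Usage: nccl_run_finished LOG_FILE
nccl_run_finished() {
  grep -q "Benchmarking Complete" "$1"
}

# Hypothetical helper: print the header row plus all rows for one workload.
# Assumes a tab-separated file whose first column is the workload name.
# Usage: filter_summary SUMMARY_FILE WORKLOAD
filter_summary() {
  awk -F'\t' -v w="$2" 'NR == 1 || $1 == w' "$1"
}

# Example usage, where RESULTS_DIR is the timestamped results folder:
#   nccl_run_finished nccl.log && filter_summary "$RESULTS_DIR/summary.tsv" all-reduce
```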
A3 Ultra machines
From the shared directory of the login node, complete the following steps. The shared directory is usually located at ${HOME}.
Download the script needed to build the NCCL test.
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:
sbatch build-nccl-tests.sh
The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, into the /nccl directory within the container.
Verify that the NCCL test is built. To verify this, run the following command:
sacct -a
If successfully completed, the output is similar to the following:
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1            build-ncc+    a3ultra                   112  COMPLETED      0:0
If the build is successful, you should also have a file named nvidia+pytorch+24.09-py3.sqsh in the directory where you ran the command, along with a directory named nccl-tests.
Check that the nccl-tests/build folder contains several binaries, including all_gather_perf.
Download the NCCL test script.
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh
To run any job on an A3 Ultra cluster, several environment variables must be set in order to enable high performance networking with GPUDirect-RDMA. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment as opposed to the host environment. These variables can be inspected in the run-nccl-tests.sh script that you just downloaded.
Run the NCCL test script.
sbatch run-nccl-tests.sh
Review the results. The script outputs a slurm-XX.out file that contains the result of the NCCL all_gather_perf benchmark.
The output is similar to the following:
#
#                                                       out-of-place                       in-place
#       size         count    type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   268435456       4194304   float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
   536870912       8388608   float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
  1073741824      16777216   float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
  2147483648      33554432   float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
  4294967296      67108864   float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
  8589934592     134217728   float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : XXX.XX
#
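To compare runs without reading the whole table, you can extract the average bus bandwidth line from the output file. The following is a minimal sketch; the function name avg_bus_bandwidth is hypothetical, and it assumes the output contains a line that starts with "# Avg bus bandwidth" followed by a colon and the value, as in the sample above.

```shell
# Hypothetical helper: print the value from the "# Avg bus bandwidth" line.
# Usage: avg_bus_bandwidth SLURM_OUTPUT_FILE
avg_bus_bandwidth() {
  # Split on the colon, strip spaces and tabs from the value, and print it.
  awk -F':' '/^# Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Example usage, replacing slurm-XX.out with the name of your output file:
#   avg_bus_bandwidth slurm-XX.out
```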
What's next
- View VMs topology
- Learn how to manage host events
- Monitor VMs in your Slurm cluster
- Report faulty host