Create an AI-optimized Slurm cluster

This document describes how to create and deploy Slurm clusters that use A4 or A3 Ultra accelerator-optimized machine types.

Before you begin

Before creating a Slurm cluster, if you haven't already done so, complete the following steps:

Choose a consumption option: the option that you pick determines how you obtain and use GPU resources.

To learn more, see Choose a consumption option.

Obtain capacity: the steps to obtain capacity depend on your consumption option.

To learn more, see Capacity overview.

Ensure that you have enough Filestore quota: you need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity.
- To check quota, see View API-specific quota.
- If you don't have enough quota, request a quota increase.

Install Cluster Toolkit: to provision Slurm clusters, you must use Cluster Toolkit v1.51.1 or later.

To install Cluster Toolkit, see Set up Cluster Toolkit.

In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Set up a storage bucket

Cluster blueprints use Terraform modules to provision cloud infrastructure. A best practice when working with Terraform is to store the state remotely in a version-enabled location. On Google Cloud, you can use a Cloud Storage bucket that has versioning enabled for this purpose.

To create this bucket and enable versioning from the CLI, run the following commands:

gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD --location=BUCKET_REGION \
    --uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning

Replace the following:

BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
PROJECT_ID: your project ID.
BUCKET_REGION: any Google Cloud region of your choice.
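
Before you run the preceding commands, it can help to sanity-check the bucket name locally. The following is a minimal sketch, not an official validator: the function name is_valid_bucket_name is hypothetical, and it checks only a subset of the bucket naming requirements (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit).

```shell
# Hypothetical helper: checks a subset of the Cloud Storage bucket naming
# rules. It does not cover every rule (for example, reserved prefixes), so
# treat a pass as "probably fine", not as a guarantee.
is_valid_bucket_name() {
  # One leading and one trailing letter or digit, with 1 to 61 allowed
  # characters in between, gives a total length of 3 to 63 characters.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}
```

For example, `is_valid_bucket_name "my-slurm-state-bucket" && echo "name looks valid"` prints the message, while a name that contains uppercase letters fails the check.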

Open the Cluster Toolkit directory

To use Slurm with Google Cloud, you must install Cluster Toolkit. After you install the toolkit, ensure that you are in the Cluster Toolkit directory by running the following command:

cd cluster-toolkit

This cluster deployment requires Cluster Toolkit v1.51.1 or later. To check your version, you can run the following command:

./gcluster --version
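
If you want to script the version check, one possible approach is to compare version strings with a version-aware sort. The following sketch assumes that you pass the version string yourself (for example, the value that the preceding command reports); the function name version_at_least is hypothetical, and the comparison relies on `sort -V` from GNU coreutils.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on `sort -V` (version sort), as provided by GNU coreutils.
version_at_least() {
  have="${1#v}"   # strip an optional leading "v"
  need="${2#v}"
  # The smaller of the two versions sorts first; if that is the required
  # version, then the installed version is the same or newer.
  lowest="$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n 1)"
  [ "$lowest" = "$need" ]
}
```

For example, `version_at_least "v1.52.0" "v1.51.1"` succeeds, and `version_at_least "v1.9.0" "v1.51.1"` fails because 9 is lower than 51.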

Create a deployment file

Create a deployment file that you can use to specify the Cloud Storage bucket, set names for your network and subnetwork, and set deployment variables such as project ID, region, and zone.

To create a deployment file, follow the steps for your required machine type.

A4 machines

The parameters that you need to add to your deployment file depend on the consumption option that you're using for your deployment. Select the tab that corresponds to your consumption option's provisioning model.

Reservation-bound

To create your deployment file, use a text editor to create a YAML file named a4high-slurm-deployment.yaml and add the following content.


terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a4h_cluster_size: NUMBER_OF_VMS
  a4h_reservation_name: RESERVATION_NAME

Replace the following:

BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
PROJECT_ID: your project ID.
REGION: the region that has the reserved machines.
ZONE: the zone where you want to provision the cluster. If you're using a reservation-based consumption option, the region and zone information was provided by your Technical Account Manager (TAM) when the capacity was delivered.
NUMBER_OF_VMS: the number of VMs that you want for the cluster.
RESERVATION_NAME: the name of your reservation.
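
As an alternative to a text editor, the same file can be generated from the shell. The following is a sketch under stated assumptions: the function name write_a4_deployment and its argument order are hypothetical, and the file content mirrors the reservation-bound A4 example above.

```shell
# Hypothetical helper: writes a reservation-bound A4 deployment file.
# Arguments: bucket, deployment name, project ID, region, zone,
#            number of VMs, reservation name, output path (optional).
write_a4_deployment() {
  out="${8:-a4high-slurm-deployment.yaml}"
  cat > "$out" <<EOF
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: $1

vars:
  deployment_name: $2
  project_id: $3
  region: $4
  zone: $5
  a4h_cluster_size: $6
  a4h_reservation_name: $7
EOF
}
```

The other variants in this section differ only in the last variable and the a4h or a3u prefix, so the same pattern can be adapted for them.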

Flex-start

To create your deployment file, use a text editor to create a YAML file named a4high-slurm-deployment.yaml and add the following content.


terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a4h_cluster_size: NUMBER_OF_VMS
  a4h_dws_flex_enabled: true

Replace the following:

BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
PROJECT_ID: your project ID.
REGION: the region where you want to provision your cluster.
ZONE: the zone where you want to provision your cluster.
NUMBER_OF_VMS: the number of VMs that you want for the cluster.

This deployment provisions static compute nodes, which means that the cluster has a set number of nodes at all times. If you want your cluster to autoscale instead, edit the examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml file and set the values of node_count_static and node_count_dynamic_max to match the following:

      node_count_static: 0
      node_count_dynamic_max: $(vars.a4h_cluster_size)

Spot

To create your deployment file, use a text editor to create a YAML file named a4high-slurm-deployment.yaml and add the following content.


terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a4h_cluster_size: NUMBER_OF_VMS
  a4h_enable_spot_vm: true

Replace the following:

BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
PROJECT_ID: your project ID.
REGION: the region where you want to provision your cluster.
ZONE: the zone where you want to provision your cluster.
NUMBER_OF_VMS: the number of VMs that you want for the cluster.

A3 Ultra machines

Reservation-bound

To create your deployment file, use a text editor to create a YAML file named a3ultra-slurm-deployment.yaml and add the following content.


terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a3u_cluster_size: NUMBER_OF_VMS
  a3u_reservation_name: RESERVATION_NAME

Replace the following:

BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
PROJECT_ID: your project ID.
REGION: the region that has the reserved machines.
ZONE: the zone where you want to provision the cluster. If you're using a reservation-based consumption option, the region and zone information was provided by your Technical Account Manager (TAM) when the capacity was delivered.
NUMBER_OF_VMS: the number of VMs that you want for the cluster.
RESERVATION_NAME: the name of your reservation.

Flex-start

To create your deployment file, use a text editor to create a YAML file named a3ultra-slurm-deployment.yaml and add the following content.


terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a3u_cluster_size: NUMBER_OF_VMS
  a3u_dws_flex_enabled: true

Replace the following:

BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
PROJECT_ID: your project ID.
REGION: the region where you want to provision your cluster.
ZONE: the zone where you want to provision your cluster.
NUMBER_OF_VMS: the number of VMs that you want for the cluster.

This deployment provisions static compute nodes, which means that the cluster has a set number of nodes at all times. If you want your cluster to autoscale instead, edit the examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml file and set the values of node_count_static and node_count_dynamic_max to match the following:

      node_count_static: 0
      node_count_dynamic_max: $(vars.a3u_cluster_size)

Spot

To create your deployment file, use a text editor to create a YAML file named a3ultra-slurm-deployment.yaml and add the following content.


terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a3u_cluster_size: NUMBER_OF_VMS
  a3u_enable_spot_vm: true

Replace the following:

BUCKET_NAME: the name of your Cloud Storage bucket, which you created in the previous section.
DEPLOYMENT_NAME: a name for your deployment. If creating multiple clusters, ensure that you select a unique name for each one.
PROJECT_ID: your project ID.
REGION: the region where you want to provision your cluster.
ZONE: the zone where you want to provision your cluster.
NUMBER_OF_VMS: the number of VMs that you want for the cluster.

Provision a Slurm cluster

Cluster Toolkit provisions the cluster based on the deployment file that you created in the previous step and the default cluster blueprint. For more information about the software that the blueprint installs, including NVIDIA drivers and CUDA, see Slurm custom images.

To provision the cluster, run the command for your machine type from the Cluster Toolkit directory. This step takes approximately 20-30 minutes.

A4 machines

./gcluster deploy -d a4high-slurm-deployment.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml --auto-approve

A3 Ultra machines

./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml --auto-approve
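
Because provisioning takes a while, it can be worth confirming first that no placeholder was left unreplaced in the deployment file. The following pre-flight check is a sketch: the function name check_no_placeholders is hypothetical, and it looks only for the uppercase placeholder names used in this document.

```shell
# Hypothetical pre-flight check: fails if the deployment file still contains
# any of the uppercase placeholders used in this document. The match is
# case-sensitive and whole-word, so keys such as "region:" do not trigger it.
check_no_placeholders() {
  if grep -nwE 'BUCKET_NAME|DEPLOYMENT_NAME|PROJECT_ID|REGION|ZONE|NUMBER_OF_VMS|RESERVATION_NAME' "$1"; then
    echo "Replace the placeholders listed above in $1 before deploying." >&2
    return 1
  fi
}
```

For example, you might run `check_no_placeholders a4high-slurm-deployment.yaml` and continue to the deploy command only if it succeeds.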

Connect to the Slurm cluster

To access your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.

Console

Go to the Compute Engine > VM instances page.

Locate the login node. It should have a name with the pattern DEPLOYMENT_NAME +login-001.
From the Connect column of the login node, click SSH.

gcloud

To connect to the login node, complete the following steps:

Identify the login node by using the gcloud compute instances list command.
```
gcloud compute instances list \
  --zones=ZONE \
  --filter="name ~ login" --format "value(name)"
```
If the output lists multiple Slurm clusters, you can identify your login node by the DEPLOYMENT_NAME that you specified.
Use the gcloud compute ssh command to connect to the login node.
```
gcloud compute ssh LOGIN_NODE \
  --zone=ZONE --tunnel-through-iap
```
Replace the following:
- ZONE: the zone where the VMs for your cluster are located.
- LOGIN_NODE: the name of the login node, which you identified in the previous step.
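
The two steps above can be combined in a small script. The following is a sketch, not part of the official procedure: the function name pick_login_node is hypothetical, and it assumes that the login node name contains the text that you pass as the prefix, which might be a shortened form of your deployment name.

```shell
# Hypothetical helper: reads instance names on stdin (one per line) and prints
# the first one that contains both the given prefix and the word "login".
pick_login_node() {
  grep -F -- "$1" | grep -F -- 'login' | head -n 1
}

# Possible usage, with the commands and placeholders from the steps above:
#   LOGIN_NODE="$(gcloud compute instances list --zones=ZONE \
#     --filter="name ~ login" --format "value(name)" | pick_login_node DEPLOYMENT_NAME)"
#   gcloud compute ssh "$LOGIN_NODE" --zone=ZONE --tunnel-through-iap
```

If the function prints nothing, no instance name matched, so check the prefix against the list output before connecting.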

Test network performance on the Slurm cluster

To test NCCL communication, complete the steps for your machine type.

A4 machines

The following test uses Ramble, which is an open-source, multi-platform experimentation framework written in Python that is used to coordinate the running of NCCL tests.

The run scripts used for this test are staged in the /opt/apps/system_benchmarks directory on the Slurm controller node and are available to all nodes in the cluster. Running this test installs Ramble to /opt/apps/ramble.

From the login node in the ${HOME} directory, run the following command. Because the test can take approximately 10 minutes, or longer if other jobs are in the queue, the following command uses nohup and redirects stdout and stderr to a log file.
```
nohup bash /opt/apps/system_benchmarks/run-nccl-tests-via-ramble.sh >& nccl.log &
```
This command creates a folder called nccl-tests_$(date +%s) that stores all of the test results. The date tag ensures that a unique folder is created based on the current timestamp.

For example, if your cluster has 16 nodes, then NCCL tests are run for all-gather, all-reduce, and reduce-scatter on 2, 4, 8, and 16 nodes.
Review the results. The nccl.log contains the logs from setting up and running the test. To view, you can run:
```
tail -f nccl.log
```
You can press Ctrl+C to stop tailing the output at any time. At the end of the nccl.log, your output should resemble the following:
```
...
---- SUMMARY for >1GB Message Sizes ----
workload        n_nodes msg_size        busbw
all-gather      2       1073741824      XXX.XX
all-gather      2       2147483648      XXX.XX
all-gather      2       4294967296      XXX.XX
all-gather      2       8589934592      XXX.XX
...
all-reduce      2       1073741824      XXX.XX
...
reduce-scatter  2       1073741824      XXX.XX
...
-------- Benchmarking Complete -------
```
All of the Slurm job scripts and nccl-tests output logs are stored in the nccl-tests_$(date +%s)/experiments directory. A summary of the NCCL test performance is also stored in nccl-tests_$(date +%s)/summary.tsv.

Removing nccl-tests_$(date +%s)/ removes all of the files generated during these tests.
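
Because each run creates a new timestamped folder, you might want a quick way to find the most recent one. The following sketch uses a hypothetical function name, latest_nccl_results, and assumes that all timestamps have the same number of digits, so that a plain sort orders them chronologically.

```shell
# Hypothetical helper: prints the newest nccl-tests_<timestamp> directory
# under the given base directory (default: the current directory).
# The printed path ends with a slash. Prints nothing if no folder exists.
latest_nccl_results() {
  base="${1:-.}"
  ls -d "$base"/nccl-tests_*/ 2>/dev/null | sort | tail -n 1
}
```

For example, `cat "$(latest_nccl_results)summary.tsv"` would print the summary of the latest run, provided that at least one results folder exists.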

A3 Ultra machines

Download the script needed to build the NCCL test.

From the shared directory of the login node, complete the following steps. The shared directory is usually located at ${HOME}.
```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
```
After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:
```
sbatch build-nccl-tests.sh
```
The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, into the /nccl directory within the container.
Verify that the NCCL test is built. To verify this, run the following command:
```
sacct -a
```
If successfully completed, the output is similar to the following:
```
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1            build-ncc+    a3ultra                   112  COMPLETED      0:0
```
If the build is successful, you should also have a file named nvidia+pytorch+24.09-py3.sqsh in the directory where you ran the command, along with a directory named nccl-tests.
Check that the nccl-tests/build folder contains several binaries, including all_gather_perf.
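
The verification above can also be scripted. The following is a sketch under stated assumptions: the function name job_completed_ok is hypothetical, and it expects pipe-delimited sacct output as produced by the `-n` (no header), `-P` (parsable) and `-o` (format) options. The job name to pass is whatever sacct reports in full for your build job, which this document shows only in truncated form.

```shell
# Hypothetical helper: reads the output of
#   sacct -a -n -P -o JobName,State,ExitCode
# on stdin and succeeds only if the named job has a row whose state is
# COMPLETED and whose exit code is 0:0.
job_completed_ok() {
  awk -F'|' -v job="$1" '
    $1 == job && $2 == "COMPLETED" && $3 == "0:0" { ok = 1 }
    END { exit (ok ? 0 : 1) }
  '
}
```

For example, `sacct -a -n -P -o JobName,State,ExitCode | job_completed_ok JOB_NAME && ls nccl-tests/build` lists the built binaries only if the job finished cleanly. JOB_NAME is a placeholder for the full job name.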
Download the NCCL test script.
```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh
```
To run any job on an A3 Ultra cluster, several environment variables must be set to enable high performance networking with GPUDirect-RDMA. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment rather than in the host environment. You can inspect these variables in the run-nccl-tests.sh script that you just downloaded.
Run the NCCL test script. The test can take approximately 15 minutes, or longer.
```
sbatch run-nccl-tests.sh
```

Review the results. The script outputs a slurm-XX.out file that contains the result of the NCCL all_gather_perf benchmark.

The output is similar to the following:

#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    268435456       4194304     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
    536870912       8388608     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
   1073741824      16777216     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
   2147483648      33554432     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
   4294967296      67108864     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
   8589934592     134217728     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : XXX.XX
#
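
To pull the headline number out of the output file, you can filter for the summary line. The following is a minimal sketch: the function name avg_busbw is hypothetical, and it assumes that the file contains an "# Avg bus bandwidth" line in the format shown above.

```shell
# Hypothetical helper: prints the value from the "# Avg bus bandwidth" line
# of an nccl-tests output file, with surrounding whitespace removed.
avg_busbw() {
  awk -F':' '/^# Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }' "$1"
}
```

For example, `avg_busbw slurm-XX.out` prints only the bandwidth value, where slurm-XX.out is the file that your job produced.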

Redeploy the Slurm cluster

If you need to increase the number of compute nodes or add new partitions to your cluster, you might need to update configurations for your Slurm cluster by redeploying. Redeployment can be sped up by using an existing image from a previous deployment. To avoid creating new images during a redeploy, specify the --only flag.

To redeploy the cluster using an existing image, do the following:

Connect to the cluster

Run the command for your required machine type:

A4 machines

./gcluster deploy -d a4high-slurm-deployment.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml --only cluster-env,cluster --auto-approve -w

A3 Ultra machines

./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml --only cluster-env,cluster --auto-approve -w

This command is only for redeployments where an image already exists. It redeploys only the cluster and its infrastructure.

Destroy the Slurm cluster

By default, the A3 Ultra and A4 High blueprints enable deletion protection on the Filestore instance. For the Filestore instance to be deleted when you destroy the Slurm cluster, disable deletion protection before running the destroy command. To learn how, see Set or remove deletion protection on an existing instance.

Disconnect from the cluster if you haven't already.
Before running the destroy command, navigate to the root of the Cluster Toolkit directory. By default, DEPLOYMENT_FOLDER is located at the root of the Cluster Toolkit directory.
To destroy the cluster, run:

./gcluster destroy DEPLOYMENT_FOLDER --auto-approve

Replace the following:

DEPLOYMENT_FOLDER: the name of the deployment folder. It's typically the same as DEPLOYMENT_NAME.

When destruction is complete, you should see a message similar to the following:

  Destroy complete! Resources: xx destroyed.

To learn how to cleanly destroy infrastructure and for advanced manual deployment instructions, see the deployment folder located at the root of the Cluster Toolkit directory: DEPLOYMENT_FOLDER/instructions.txt

What's next

Verify reservation consumption
View VMs topology
Learn how to manage host events:
- Manage host events across VMs
- Manage host events across reservations
Monitor VMs in your Slurm cluster
Report faulty host
