Create a Slurm cluster

This document outlines the deployment steps for provisioning A4 or A3 Ultra VMs that run on Hypercompute Cluster and use Slurm as an orchestrator. For more information about Hypercompute Cluster, see Hypercompute Cluster.

Before you begin

  1. In the Google Cloud console, activate Cloud Shell.

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  2. Ensure that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity.

  3. Ensure that you have reserved A4 or A3 Ultra machines. To reserve resources, see Request capacity.

  4. To provision Slurm clusters, you must use one of the following Cluster Toolkit versions:

    • For A3 Ultra: v1.44.1 or later
    • For A4: v1.47.0 or later

    To install Cluster Toolkit, see Set up Cluster Toolkit.

Overview

To deploy the cluster and run a GPUDirect-RDMA performance test, complete the following steps:

  1. Set up a Cloud Storage bucket. See Set up Cloud Storage bucket.
  2. Switch to the Cluster Toolkit directory. See Switch to the Cluster Toolkit directory.
  3. Create a deployment file. See Create a deployment file.
  4. Provision the cluster. See Provision the cluster.
  5. Connect to the Slurm cluster. See Connect to the Slurm cluster.
  6. Test GPUDirect-RDMA performance on the cluster. See Test GPUDirect-RDMA performance on the cluster.

Set up a Cloud Storage bucket

To set up the storage bucket, follow the steps for your required machine type.

Cluster blueprints use Terraform modules to provision Cloud infrastructure. A best practice when working with Terraform is to store the state remotely in a version-enabled file. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.

To create this bucket from the CLI, run the following command:

gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --default-storage-class=STANDARD --location=REGION \
    --uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning

Replace the following:

  • BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements.
  • PROJECT_ID: your project ID.
  • REGION: any Google Cloud region of your choice.
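
Before you run the create command, it can help to check the bucket name locally. The following is a minimal sketch that covers only a subset of the bucket naming requirements (length, allowed characters, and the reserved "goog" and "google" strings). The valid_bucket_name helper is hypothetical and is not part of the Google Cloud CLI.

```shell
# valid_bucket_name NAME
# Hypothetical helper: succeeds if NAME passes a subset of the
# Cloud Storage bucket naming rules. It does not check global uniqueness
# or the rules for dotted (domain-style) names.
valid_bucket_name() {
  name="$1"
  # 3-63 characters; lowercase letters, digits, dashes, underscores, dots;
  # must start and end with a letter or digit.
  printf '%s' "$name" \
    | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$' || return 1
  # Names cannot begin with "goog" or contain "google".
  case "$name" in
    goog*|*google*) return 1 ;;
  esac
  return 0
}

# Example use (BUCKET_NAME is the placeholder from the command above):
#   valid_bucket_name "BUCKET_NAME" || echo "check the bucket name" >&2
```

A name that passes this check can still be rejected by Cloud Storage, for example if another project already uses it.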

Switch to the Cluster Toolkit directory

After you have installed the Cluster Toolkit, ensure that you are in the Cluster Toolkit directory by running:

cd cluster-toolkit

This cluster deployment requires Cluster Toolkit v1.44.1 or later for A3 Ultra machines, or v1.47.0 or later for A4 machines. To check your version, you can run the ./gcluster --version command.
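
If you want to script the version check, the following sketch compares a version string against the minimum for each machine type. The version_ge and min_toolkit_version helpers are hypothetical, and the comparison relies on sort -V, which is assumed to be available (it is part of GNU coreutils). Because the exact output format of ./gcluster --version is not described here, the sketch expects you to pass the version number yourself.

```shell
# version_ge A B
# Hypothetical helper: succeeds if version A is greater than or equal to
# version B. A leading "v" is ignored on both arguments.
version_ge() {
  a="${1#v}"
  b="${2#v}"
  # sort -V orders version strings; if B sorts first (or ties), A >= B.
  [ "$(printf '%s\n%s\n' "$b" "$a" | sort -V | head -n 1)" = "$b" ]
}

# min_toolkit_version MACHINE
# Prints the minimum Cluster Toolkit version stated in this guide.
min_toolkit_version() {
  case "$1" in
    a3ultra) echo "1.44.1" ;;
    a4)      echo "1.47.0" ;;
    *)       echo "unknown machine type: $1" >&2; return 1 ;;
  esac
}

# Example use, with INSTALLED_VERSION copied from ./gcluster --version:
#   version_ge "INSTALLED_VERSION" "$(min_toolkit_version a4)" \
#     || echo "Cluster Toolkit is too old for A4" >&2
```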

Create a deployment file

Create a deployment file that you can use to specify the Cloud Storage bucket and to set deployment variables such as project_id, region, and zone.

To create a deployment file, follow the steps for your required machine type.

A4 machines

To create your deployment file, use a text editor to create a YAML file named a4high-slurm-deployment.yaml and add the following content.

---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a4h_reservation_name: RESERVATION_NAME
  a4h_cluster_size: NUMBER_OF_VMS

Replace the following:

  • DEPLOYMENT_NAME: a name for your deployment. If you create multiple clusters, select a unique name for each.
  • BUCKET_NAME: the name of your Cloud Storage bucket, created in the previous section.
  • PROJECT_ID: your project ID.
  • REGION: the region that has the reserved machines.
  • ZONE: the zone that has the reserved machines. Your Technical Account Manager (TAM) provided the region and zone information when the capacity was delivered.
  • RESERVATION_NAME: the name of your reservation.
  • NUMBER_OF_VMS: the number of VMs needed for the cluster.

A3 Ultra machines

To create your deployment file, use a text editor to create a YAML file named a3ultra-slurm-deployment.yaml and add the following content.

---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME

vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: REGION
  zone: ZONE
  a3u_reservation_name: RESERVATION_NAME
  a3u_cluster_size: NUMBER_OF_VMS

Replace the following:

  • DEPLOYMENT_NAME: a name for your deployment. If you create multiple clusters, select a unique name for each.
  • BUCKET_NAME: the name of your Cloud Storage bucket, created in the previous section.
  • PROJECT_ID: your project ID.
  • REGION: the region that has the reserved machines.
  • ZONE: the zone that has the reserved machines. Your Technical Account Manager (TAM) provided the region and zone information when the capacity was delivered.
  • RESERVATION_NAME: the name of your reservation.
  • NUMBER_OF_VMS: the number of VMs needed for the cluster.
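
As an alternative to editing the YAML by hand, the two deployment files above can be generated from shell variables, because they differ only in the variable prefix (a4h or a3u). The following is a sketch under that assumption. The write_deployment_file helper is hypothetical, and reading the values from environment variables named after the placeholders is a choice made for this example, not a Cluster Toolkit convention.

```shell
# write_deployment_file PREFIX OUTFILE
# Hypothetical helper. PREFIX is a4h (A4) or a3u (A3 Ultra).
# Reads BUCKET_NAME, DEPLOYMENT_NAME, PROJECT_ID, REGION, ZONE,
# RESERVATION_NAME, and NUMBER_OF_VMS from the shell environment.
write_deployment_file() {
  prefix="$1"
  outfile="$2"
  case "$prefix" in
    a4h|a3u) ;;
    *) echo "unknown prefix: $prefix (expected a4h or a3u)" >&2; return 1 ;;
  esac
  # Refuse to write a file if any required value is empty or unset.
  for name in BUCKET_NAME DEPLOYMENT_NAME PROJECT_ID REGION ZONE \
              RESERVATION_NAME NUMBER_OF_VMS; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
  cat > "$outfile" <<EOF
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: ${BUCKET_NAME}

vars:
  deployment_name: ${DEPLOYMENT_NAME}
  project_id: ${PROJECT_ID}
  region: ${REGION}
  zone: ${ZONE}
  ${prefix}_reservation_name: ${RESERVATION_NAME}
  ${prefix}_cluster_size: ${NUMBER_OF_VMS}
EOF
}

# Example use:
#   write_deployment_file a4h a4high-slurm-deployment.yaml
#   write_deployment_file a3u a3ultra-slurm-deployment.yaml
```

Review the generated file before you deploy; the helper only checks that each value is set, not that it is valid.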

Provision the Slurm cluster

To provision the cluster, run the command for your machine type from the Cluster Toolkit directory. This step takes approximately 20-30 minutes.

A4 machines

./gcluster deploy -d a4high-slurm-deployment.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml

A3 Ultra machines

./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml

Connect to the Slurm cluster

To access your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.

Console

  1. Go to the Compute Engine > VM instances page.

  2. Locate the login node. It should have a name similar to a3ultra-login-001 or a4high-login-001.

  3. From the Connect column of the login node, click SSH.

gcloud

To connect to the login node, complete the following steps:

  1. Identify the login node by using the gcloud compute instances list command.

    gcloud compute instances list \
      --zone=ZONE \
      --filter="name ~ login" --format "value(name)"
    

    If you have multiple Slurm clusters, you can identify each login node by using the DEPLOYMENT_NAME.

  2. Use the gcloud compute ssh command to connect to the login node.

    gcloud compute ssh LOGIN_NODE \
      --zone=ZONE
    

    Replace the following:

    • ZONE: the zone where your VMs are created.
    • LOGIN_NODE: the name of the login node.
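
The two gcloud steps above can be combined into one helper. The following sketch lists the login nodes with the same command as step 1, picks the first name that contains an optional pattern, and connects to it. The list_login_nodes and connect_login_node helpers are hypothetical, and matching on part of the node name (for example, a3ultra or a4high) is an assumption about how your login nodes are named.

```shell
# list_login_nodes ZONE
# Prints the names of instances whose name matches "login",
# using the same gcloud command as step 1.
list_login_nodes() {
  gcloud compute instances list \
    --zone="$1" \
    --filter="name ~ login" --format "value(name)"
}

# connect_login_node ZONE [PATTERN]
# Hypothetical helper: connects to the first login node whose name
# contains PATTERN. Without PATTERN, it uses the first login node found.
connect_login_node() {
  zone="$1"
  pattern="${2:-login}"
  node="$(list_login_nodes "$zone" | grep -F -- "$pattern" | head -n 1)"
  if [ -z "$node" ]; then
    echo "no login node matching '$pattern' in zone $zone" >&2
    return 1
  fi
  gcloud compute ssh "$node" --zone="$zone"
}

# Example use:
#   connect_login_node ZONE a4high
```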

Redeploy the Slurm cluster

If you need to increase the number of compute nodes or add new partitions to your cluster, you might need to update configurations for your Slurm cluster by redeploying. Redeployment can be sped up by using an existing image from a previous deployment. To avoid creating new images during a redeploy, specify the --only flag.

To redeploy the cluster using an existing image, run the command for your required machine type.

A4 machines

./gcluster deploy -d a4high-slurm-deployment.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml --only cluster-env,cluster

A3 Ultra machines

./gcluster deploy -d a3ultra-slurm-deployment.yaml examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml --only cluster-env,cluster

This command is only for redeployments where an image already exists; it redeploys only the cluster and its infrastructure.
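
The deploy and redeploy commands in this guide differ only in the machine-specific file names and the --only flag. The following sketch builds the matching command line so that you can review it before running it. The gcluster_deploy_cmd helper is hypothetical; the file names and blueprint paths are the ones used in this guide.

```shell
# gcluster_deploy_cmd MACHINE [redeploy]
# Hypothetical helper: prints the gcluster command for MACHINE
# (a4 or a3ultra). With "redeploy" as the second argument, it appends
# the --only flag so that existing images are reused.
gcluster_deploy_cmd() {
  case "$1" in
    a4)
      deployment="a4high-slurm-deployment.yaml"
      blueprint="examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml"
      ;;
    a3ultra)
      deployment="a3ultra-slurm-deployment.yaml"
      blueprint="examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml"
      ;;
    *)
      echo "unknown machine type: $1 (expected a4 or a3ultra)" >&2
      return 1
      ;;
  esac
  cmd="./gcluster deploy -d $deployment $blueprint"
  if [ "${2:-}" = "redeploy" ]; then
    cmd="$cmd --only cluster-env,cluster"
  fi
  printf '%s\n' "$cmd"
}

# Example use, from the Cluster Toolkit directory:
#   gcluster_deploy_cmd a4 redeploy
```

Printing the command instead of running it keeps the sketch safe to try; copy the printed line into your shell when you are ready to deploy.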

Test network performance on the Slurm cluster

To test NCCL communication, complete the steps for your machine type.

A4 machines

The following test uses Ramble, which is an open-source, multi-platform experimentation framework written in Python that is used to coordinate the running of NCCL tests.

The run scripts used for this test are staged in the /opt/apps/system_benchmarks directory on the Slurm controller node and are available to all nodes in the cluster. Running this test installs Ramble to /opt/apps/ramble.

  1. From the login node in the ${HOME} directory, run the following command. Because the test can take approximately 10 minutes, or longer if other jobs are in the queue, the following command uses nohup and redirects stdout and stderr to a log file.

    nohup bash /opt/apps/system_benchmarks/run-nccl-tests-via-ramble.sh >& nccl.log &

    This command creates a folder called nccl-tests_$(date +%s) that stores all of the test results. The date tag ensures that a unique folder is created based on each current timestamp.

    For example, if your cluster has 16 nodes, then NCCL tests are run for all-gather, all-reduce, and reduce-scatter on 2, 4, 8, and 16 nodes.

  2. Review the results. The nccl.log contains the logs from setting up and running the test. To view, you can run:

    tail -f nccl.log

    To stop tailing the output at any time, press Ctrl-C. At the end of the nccl.log, your output should resemble the following:

    ...
    ---- SUMMARY for >1GB Message Sizes ----
    workload        n_nodes msg_size        busbw
    all-gather      2       1073741824      XXX.XX
    all-gather      2       2147483648      XXX.XX
    all-gather      2       4294967296      XXX.XX
    all-gather      2       8589934592      XXX.XX
    ...
    all-reduce      2       1073741824      XXX.XX
    ...
    reduce-scatter  2       1073741824      XXX.XX
    ...
    -------- Benchmarking Complete -------
    

    All of the Slurm job scripts and nccl-tests output logs are stored in the nccl-tests_$(date +%s)/experiments directory. A summary of the NCCL test performance is also stored in nccl-tests_$(date +%s)/summary.tsv.

    Removing nccl-tests_$(date +%s)/ removes all of the files generated during these tests.
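
To condense the summary file, the following sketch prints the highest busbw value for each workload and node count. It assumes that summary.tsv is tab-separated with a header row and the same four columns as the SUMMARY block above (workload, n_nodes, msg_size, busbw); check the file in your results folder before relying on it. The peak_busbw helper is hypothetical.

```shell
# peak_busbw FILE
# Hypothetical helper: for each (workload, n_nodes) pair in a
# tab-separated summary file, prints the largest busbw value.
# Assumes columns: workload, n_nodes, msg_size, busbw, with a header row.
peak_busbw() {
  awk -F '\t' '
    NR == 1 { next }                       # skip the header row
    NF >= 4 {
      key = $1 "\t" $2
      if (!(key in peak)) {                # first row for this pair
        order[++n] = key
        peak[key] = $4
      } else if ($4 + 0 > peak[key] + 0) { # numeric comparison
        peak[key] = $4
      }
    }
    END {
      # Print pairs in the order they first appeared in the file.
      for (i = 1; i <= n; i++) print order[i] "\t" peak[order[i]]
    }
  ' "$1"
}

# Example use (the folder name carries the timestamp of your run):
#   peak_busbw nccl-tests_*/summary.tsv
```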

A3 Ultra machines

  1. Download the script needed to build the NCCL test.

    From the shared directory of the login node, complete the following steps. The shared directory is usually located at ${HOME}.

    wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh

  2. After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:

    sbatch build-nccl-tests.sh

    The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, into the /nccl directory within the container.

  3. Verify that the NCCL test is built. To verify this, run the following command:

    sacct -a

    If successfully completed, the output is similar to the following:

    JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    1            build-ncc+    a3ultra                   112  COMPLETED      0:0
    

    If the build is successful, you should also have a file named nvidia+pytorch+24.09-py3.sqsh in the directory where you ran the command, along with a directory named nccl-tests.

  4. Check that the nccl-tests/build folder contains several binaries, including all_gather_perf.

  5. Download the NCCL test script.

    wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh

    To run any job on an A3 Ultra cluster, several environment variables must be set in order to enable high performance networking with GPUDirect-RDMA. Because we use enroot containers in this procedure to launch workloads, these variables must be set in the container environment as opposed to the host environment. These variables can be inspected in the run-nccl-tests.sh script that you just downloaded.

  6. Run the NCCL test script.

    sbatch run-nccl-tests.sh

  7. Review the results. The script outputs a slurm-XX.out file that contains the result of the NCCL all_gather_perf benchmark.

    The output is similar to the following:

    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        268435456       4194304     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
        536870912       8388608     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
      1073741824      16777216     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
      2147483648      33554432     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
      4294967296      67108864     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
      8589934592     134217728     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : XXX.XX
    #
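
The manual checks in the A3 Ultra steps can be scripted. The following sketch verifies the build artifacts named in steps 3 and 4 and extracts the average bus bandwidth line from a Slurm output file. The check_nccl_build and avg_busbw helpers are hypothetical, and the sketch assumes that the "Avg bus bandwidth" line in your output file has the same form as in the sample output above.

```shell
# check_nccl_build [DIR]
# Hypothetical helper: succeeds if DIR (default: current directory)
# contains the container image and the all_gather_perf binary that the
# build step is expected to produce.
check_nccl_build() {
  dir="${1:-.}"
  if [ ! -f "$dir/nvidia+pytorch+24.09-py3.sqsh" ]; then
    echo "missing $dir/nvidia+pytorch+24.09-py3.sqsh" >&2
    return 1
  fi
  binary="$dir/nccl-tests/build/all_gather_perf"
  if [ ! -f "$binary" ] || [ ! -x "$binary" ]; then
    echo "missing or non-executable $binary" >&2
    return 1
  fi
}

# avg_busbw FILE
# Hypothetical helper: prints the value from each
# "# Avg bus bandwidth : VALUE" line in a Slurm output file.
avg_busbw() {
  awk -F ':' '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Example use, from the directory where you ran sbatch:
#   check_nccl_build && avg_busbw slurm-XX.out
```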
    

What's next