Create an AI-optimized Slurm cluster with an A4 machine type

This page describes how to create an AI-optimized Slurm cluster using A4 high accelerator-optimized machine types with the gcloud CLI and Cluster Toolkit.

A4 accelerator-optimized machine types come with NVIDIA B200 GPUs attached and are specifically engineered for intensive AI computation, ensuring your Slurm cluster can efficiently handle large-scale model training and inference. For more information on A4 accelerator-optimized machine types on Google Cloud, see Create an A3 Ultra or A4 instance.

Tutorial overview

This tutorial describes the steps to set up an AI-optimized Slurm cluster using A4 accelerator-optimized machine types. Specifically, you set up a cluster with Compute Engine virtual machines, create a Cloud Storage bucket to store the necessary Terraform modules, and set up a Filestore instance to provision your Slurm cluster. To complete the steps in this tutorial, you follow this process:

  1. Set up your Google Cloud project with the required permissions and environmental variables.
  2. Set up a Cloud Storage bucket.
  3. Set up Cluster Toolkit.
  4. Switch to the Cluster Toolkit directory.
  5. Create a Slurm deployment YAML file.
  6. Provision a Slurm cluster using a blueprint.
  7. Connect to the Slurm cluster.

Before you begin

  1. Request a reserved capacity block for one a4-highgpu-8g machine. This machine type is required to complete this tutorial.
  2. Ensure that you have enough Filestore quota to provision the Slurm cluster. You need a minimum of 10,240 GiB of zonal capacity (also known as high scale SSD capacity).

    To check your Filestore quota, view Quotas & System limits in the Google Cloud console and filter the table to only show Filestore resources.

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Compute Engine, Filestore, Cloud Storage, Service Usage, and Cloud Resource Manager APIs:

    Enable the APIs
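If you prefer the CLI, you can also enable these APIs from Cloud Shell. The following sketch uses the standard service IDs for these products; confirm them against your project's needs before running the command.

```shell
# Enable the APIs used in this tutorial. The service IDs below are the
# standard ones for Compute Engine, Filestore, Cloud Storage, Service Usage,
# and Cloud Resource Manager.
gcloud services enable \
    compute.googleapis.com \
    file.googleapis.com \
    storage.googleapis.com \
    serviceusage.googleapis.com \
    cloudresourcemanager.googleapis.com
```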

Costs

The cost of running this tutorial varies depending on which sections you complete, such as setting up the tutorial or running jobs. You can estimate the cost by using the pricing calculator.

  • To estimate the cost for setting up this tutorial, use the following specifications:

    • Filestore (standard) capacity per region: 10,240 GiB.
    • Standard persistent disk: 50 GB pd-standard for the Slurm login node.
    • Performance (SSD) persistent disks: 50 GB pd-ssd for the Slurm controller.
    • VM instance: 1 a4-highgpu-8g.

Launch Cloud Shell

In this tutorial, you use Cloud Shell, which is a shell environment for managing resources hosted on Google Cloud.

Cloud Shell comes preinstalled with the Google Cloud CLI, which provides the primary command-line interface for Google Cloud. To launch Cloud Shell:

  1. Go to the Google Cloud console.

    Google Cloud console

  2. From the upper-right corner of the console, click Activate Cloud Shell.

A Cloud Shell session starts and displays a command-line prompt. You use this shell to run gcloud and Cluster Toolkit commands.

Set environment variables

In Cloud Shell, set the following environment variables to use for the remainder of the tutorial. These environment variables set placeholder values for the following tasks:

  • Configures your project with the relevant values to access your reserved a4-highgpu-8g machine.

  • Sets up a Cloud Storage bucket to store Cluster Toolkit modules.

Reservation capacity variables

export A4_RESERVATION_PROJECT_ID=A4_RESERVATION_PROJECT_ID
export A4_RESERVATION_NAME=A4_RESERVATION_NAME
export A4_DEPLOYMENT_NAME=A4_DEPLOYMENT_NAME
export A4_REGION=A4_REGION
export A4_ZONE=A4_ZONE
export A4_DEPLOYMENT_FILE_NAME=A4_DEPLOYMENT_FILE_NAME

Replace the following:

  • A4_RESERVATION_PROJECT_ID - the Google Cloud project ID that was granted the A4 machine type reservation block.
  • A4_RESERVATION_NAME - the name of your GPU reservation block, found in your project. For example, a4high-exr.
  • A4_DEPLOYMENT_NAME - a unique name for your Slurm cluster deployment. For example, my-slurm-cluster-deployment.
  • A4_REGION - the region that is running the reserved A4 machine reservation block. For example, us-central1.
  • A4_ZONE - the zone that contains the reserved machines. This string must contain both the region and zone. For example, us-central1-a.
  • A4_DEPLOYMENT_FILE_NAME - a unique name for your Slurm blueprint deployment YAML file. If you run through this tutorial more than once, choose a unique file name each time.
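As a concrete illustration, the filled-in exports might look like the following. All values here are hypothetical; substitute your own project, reservation, and location details. A small shell check confirms that the zone string belongs to the region:

```shell
# Hypothetical example values; replace with your own reservation details.
export A4_RESERVATION_PROJECT_ID=my-gpu-project
export A4_RESERVATION_NAME=a4high-exr
export A4_DEPLOYMENT_NAME=my-slurm-cluster-deployment
export A4_REGION=us-central1
export A4_ZONE=us-central1-a
export A4_DEPLOYMENT_FILE_NAME=a4-slurm-deployment

# Sanity check: the zone string must start with the region.
case "${A4_ZONE}" in
  "${A4_REGION}"-*) echo "zone/region OK" ;;
  *) echo "A4_ZONE does not belong to A4_REGION" >&2 ;;
esac
```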

Storage capacity variables

Create the environment variables for your Cloud Storage bucket.

Cluster Toolkit uses blueprints to define and deploy clusters of VMs. A blueprint defines one or more Terraform modules to provision Cloud infrastructure. This bucket is used to store these blueprints.

export GOOGLE_CLOUD_BUCKET_NAME=GOOGLE_CLOUD_BUCKET_NAME
export GOOGLE_CLOUD_BUCKET_LOCATION=GOOGLE_CLOUD_BUCKET_LOCATION

Replace the following:

  • GOOGLE_CLOUD_BUCKET_NAME - the name that you want to use for your Cloud Storage bucket. The name must meet the bucket naming requirements.
  • GOOGLE_CLOUD_BUCKET_LOCATION - the Google Cloud region where you want to host the bucket. For example, us-central1.
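Because bucket names must be globally unique and follow the naming requirements, it can help to validate the name before creating the bucket. The following sketch, with a hypothetical example value, checks a few of the basic rules (3-63 characters, starting and ending with a lowercase letter or number); it does not cover every naming requirement:

```shell
# Hypothetical example values; replace with your own bucket name and region.
export GOOGLE_CLOUD_BUCKET_NAME=my-cluster-toolkit-state
export GOOGLE_CLOUD_BUCKET_LOCATION=us-central1

# Rough sanity check against the basic bucket naming rules.
if printf '%s' "${GOOGLE_CLOUD_BUCKET_NAME}" \
    | grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'; then
  echo "bucket name looks valid"
else
  echo "bucket name violates basic naming rules" >&2
fi
```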

Switch to your A4-approved project

Run the following command to ensure that you are in the Google Cloud project that has the approved reservation block for the A4 machine type.

gcloud config set project ${A4_RESERVATION_PROJECT_ID}

Create a Cloud Storage bucket

A best practice when working with Terraform is to store the state remotely in a location with versioning enabled. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.

To create the bucket that stores your Terraform modules, run the following commands from Cloud Shell using your environment variables:

gcloud storage buckets create gs://${GOOGLE_CLOUD_BUCKET_NAME} \
    --project=${A4_RESERVATION_PROJECT_ID} \
    --default-storage-class=STANDARD \
    --location=${GOOGLE_CLOUD_BUCKET_LOCATION} \
    --uniform-bucket-level-access

gcloud storage buckets update gs://${GOOGLE_CLOUD_BUCKET_NAME} --versioning

Set up the Cluster Toolkit

To create a Slurm cluster in a Google Cloud project, you can use Cluster Toolkit to handle deploying and provisioning the cluster. Cluster Toolkit is open-source software offered by Google Cloud to simplify the process of deploying workloads on Google Cloud.

Use the following steps to set up Cluster Toolkit.

Clone the Cluster Toolkit GitHub repository

  1. In Cloud Shell, clone the GitHub repository:

    git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
  2. Go to the main working directory:

    cd cluster-toolkit/

Build the Cluster Toolkit binary

  1. In Cloud Shell, build the Cluster Toolkit binary from source by running the following command:

    make
  2. To verify the build, run the following command:

    ./gcluster --version

    To deploy an A4 high accelerator-optimized machine Slurm cluster, you must use version v1.47.0 or later of the Cluster Toolkit.

    After building the binary, you are ready to deploy clusters to run your jobs or workloads.
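If you want to check the version requirement in a script, you can compare the reported version against the required minimum. The output format assumed below is hypothetical; adjust the parsing to match what your build of `./gcluster --version` actually prints:

```shell
# Hypothetical version string; in practice, capture it with:
#   GCLUSTER_VERSION="$(./gcluster --version | head -n 1)"
GCLUSTER_VERSION="gcluster version v1.47.0"

# Extract the semantic version after the final "v" and verify that it is at
# least the required minimum, using version-aware sorting.
ver="${GCLUSTER_VERSION##*v}"
min="1.47.0"
if [ "$(printf '%s\n%s\n' "${min}" "${ver}" | sort -V | head -n 1)" = "${min}" ]; then
  echo "Cluster Toolkit version OK (${ver})"
else
  echo "Cluster Toolkit ${ver} is older than the required ${min}" >&2
fi
```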

Create a deployment file

  1. In the Cluster Toolkit directory, create a deployment YAML file:

    nano ${A4_DEPLOYMENT_FILE_NAME}.yaml
    

  2. Paste the following content into the YAML file. Replace each placeholder with the value of the corresponding environment variable that you set earlier.

    ---
    terraform_backend_defaults:
      type: gcs
      configuration:
        bucket: GOOGLE_CLOUD_BUCKET_NAME
    
    vars:
      deployment_name: A4_DEPLOYMENT_NAME
      project_id: A4_RESERVATION_PROJECT_ID
      region: A4_REGION
      zone: A4_ZONE
      a4h_reservation_name: A4_RESERVATION_NAME
      a4h_cluster_size: 1
    
  3. To save and exit the file, press Ctrl+O > Enter > Ctrl+X.
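If you prefer not to edit the file by hand, you can also generate it from the environment variables you set earlier. This sketch assumes the same variable names as above; the fallback default values are hypothetical and only make the snippet self-contained, because variables you have already exported take precedence:

```shell
# Hypothetical defaults so the snippet runs standalone; exported variables win.
: "${A4_DEPLOYMENT_FILE_NAME:=a4-slurm-deployment}"
: "${A4_DEPLOYMENT_NAME:=my-slurm-cluster-deployment}"
: "${A4_RESERVATION_PROJECT_ID:=my-gpu-project}"
: "${A4_RESERVATION_NAME:=a4high-exr}"
: "${A4_REGION:=us-central1}"
: "${A4_ZONE:=us-central1-a}"
: "${GOOGLE_CLOUD_BUCKET_NAME:=my-cluster-toolkit-state}"

# Write the deployment file; the shell substitutes the variables in the heredoc.
cat > "${A4_DEPLOYMENT_FILE_NAME}.yaml" <<EOF
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: ${GOOGLE_CLOUD_BUCKET_NAME}

vars:
  deployment_name: ${A4_DEPLOYMENT_NAME}
  project_id: ${A4_RESERVATION_PROJECT_ID}
  region: ${A4_REGION}
  zone: ${A4_ZONE}
  a4h_reservation_name: ${A4_RESERVATION_NAME}
  a4h_cluster_size: 1
EOF
```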

Provision the Slurm cluster

To provision the Slurm cluster, run the following deployment command. This command provisions the cluster by using the A4 Cluster Toolkit blueprint at examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml.

In Cloud Shell, start the cluster creation:

./gcluster deploy -d ${A4_DEPLOYMENT_FILE_NAME}.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml --auto-approve

Connect to the cluster

After the deployment completes, use the Google Cloud console to view your cluster and connect to the login node.

  1. Go to the Compute Engine > VM instances page in the Google Cloud console.

    Go to VM instances

  2. Locate the login node (a4high-login-001 or similar).

  3. Click SSH to connect.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Destroy the Slurm cluster

We recommend that you clean up your resources when they are no longer needed.

By default, the A4 High blueprint enables deletion protection on the Filestore instance. Before you run the destroy command, you must disable deletion protection.

Disable deletion protection

To disable deletion protection when you update an instance, use a command similar to the following:

  gcloud filestore instances update INSTANCE_NAME \
      --no-deletion-protection

Replace the following:

  • INSTANCE_NAME: the name of the instance you want to edit. For example, my-genomics-instance.

To find the INSTANCE_NAME, you can run gcloud filestore instances list. This command lists all the Filestore instances in your current Google Cloud project, including their names, locations (zones), tiers, capacity, and status.

Find the name of the Filestore instance that was created for the deployment running in this tutorial.

Destroy the Slurm cluster

  1. Before running the destroy command, navigate to the root of the Cluster Toolkit directory, where DEPLOYMENT_FOLDER is located by default.

  2. To destroy the cluster, run:

    ./gcluster destroy DEPLOYMENT_FOLDER --auto-approve

    DEPLOYMENT_FOLDER is the name of the deployment folder. It's typically the same as DEPLOYMENT_NAME.

When destruction is complete, you see a message similar to the following:

Destroy complete! Resources: xx destroyed.

Delete the storage bucket

After you confirm that the previous command completed without errors, delete the Cloud Storage bucket:

gcloud storage buckets delete gs://${GOOGLE_CLOUD_BUCKET_NAME}

Troubleshooting

  • Error: Cloud Shell can't provision the cluster because there is no storage left.

    You might see this error if you are a frequent user of Cloud Shell and you have run out of storage room.

    To resolve this issue, see Disable or reset Cloud Shell.

  • Error: Cluster or blueprint name already exists.

    You might see this error if your project has already used the exact file names used in this tutorial; for example, if someone else in your organization ran through this tutorial end-to-end.

    To resolve this issue, choose a unique name for the deployment file, and then rerun the command to provision the Slurm cluster with the new deployment file.

What's next