Create an AI-optimized Slurm cluster with an A4 machine type
This page describes how to create an AI-optimized Slurm cluster using A4 high accelerator-optimized machine types with the gcloud CLI and Cluster Toolkit.
A4 accelerator-optimized machine types come with NVIDIA B200 GPUs attached and are specifically engineered for intensive AI computation, ensuring your Slurm cluster can efficiently handle large-scale model training and inference. For more information on A4 accelerator-optimized machine types on Google Cloud, see Create an A3 Ultra or A4 instance.
Tutorial overview
This tutorial describes the steps to set up an AI-optimized Slurm cluster using A4 accelerator-optimized machine types. Specifically, you set up a cluster with Compute Engine virtual machines, create a Cloud Storage bucket to store the necessary Terraform modules, and set up a Filestore instance to provision your Slurm cluster. To complete the steps in this tutorial, you follow this process:
- Set up your Google Cloud project with the required permissions and environment variables.
- Set up a Cloud Storage bucket.
- Set up Cluster Toolkit.
- Switch to the Cluster Toolkit directory.
- Create a Slurm deployment YAML file.
- Provision a Slurm cluster using a blueprint.
- Connect to the Slurm cluster.
Before you begin
- Request a reserved capacity block for one a4-highgpu-8g machine. These machines are required for this tutorial.
- Ensure that you have enough Filestore quota to provision the Slurm cluster. You need a minimum of 10,240 GiB of zonal capacity (also known as high scale SSD capacity). To check your Filestore quota, view Quotas & System limits in the Google Cloud console and filter the table to show only Filestore resources.
- For detailed instructions on checking Filestore quotas, see View API-specific quota.
- If you don't have enough quota, request a quota increase.
Make sure that billing is enabled for your Google Cloud project.
Enable the Compute Engine, Filestore, Cloud Storage, Service Usage, and Cloud Resource Manager APIs:
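One way to enable these APIs is from Cloud Shell with the gcloud CLI:
gcloud services enable compute.googleapis.com \
    file.googleapis.com \
    storage.googleapis.com \
    serviceusage.googleapis.com \
    cloudresourcemanager.googleapis.com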
Costs
The cost of running this tutorial varies depending on which sections you complete, such as setting up the cluster or running jobs. You can calculate the cost by using the pricing calculator.
To estimate the cost for setting up this tutorial, use the following specifications:
- Filestore (standard) capacity per region: 10,240 GiB
- Standard persistent disk: 50 GB pd-standard for the Slurm login node
- Performance (SSD) persistent disk: 50 GB pd-ssd for the Slurm controller
- VM instance: 1 a4-highgpu-8g
Launch Cloud Shell
In this tutorial, you use Cloud Shell, a shell environment for managing resources hosted on Google Cloud.
Cloud Shell comes preinstalled with the Google Cloud CLI (gcloud CLI), which provides the primary command-line interface for Google Cloud. To launch Cloud Shell:
Go to the Google Cloud console.
From the upper-right corner of the console, click the Activate Cloud Shell button:
A Cloud Shell session starts and displays a command-line prompt.
You use this shell to run gcloud and Cluster Toolkit commands.
Set environment variables
In Cloud Shell, set the following environment variables to use for the remainder of the tutorial. These environment variables set placeholder values for the following tasks:
- Configure your project with the relevant values to access your reserved a4-highgpu-8g machine.
- Set up a Cloud Storage bucket to store Cluster Toolkit modules.
Reservation capacity variables
export A4_RESERVATION_PROJECT_ID=A4_RESERVATION_PROJECT_ID
export A4_RESERVATION_NAME=A4_RESERVATION_NAME
export A4_DEPLOYMENT_NAME=A4_DEPLOYMENT_NAME
export A4_REGION=A4_REGION
export A4_ZONE=A4_ZONE
export A4_DEPLOYMENT_FILE_NAME=A4_DEPLOYMENT_FILE_NAME
Replace the following:
- A4_RESERVATION_PROJECT_ID: the Google Cloud project ID that was granted the A4 machine type reservation block.
- A4_RESERVATION_NAME: the name of your GPU reservation block, found in your project. For example, a4high-exr.
- A4_DEPLOYMENT_NAME: a unique name for your Slurm cluster deployment. For example, my-slurm-cluster-deployment.
- A4_REGION: the region that contains the reserved A4 machine reservation block. For example, us-central1.
- A4_ZONE: the zone that contains the reserved machines. This string must contain both the region and zone. For example, us-central1-a.
- A4_DEPLOYMENT_FILE_NAME: a unique name for your Slurm blueprint deployment YAML file. If you run through this tutorial more than once, choose a unique file name each time.
Storage capacity variables
Create the environment variables for your Cloud Storage bucket.
Cluster Toolkit uses blueprints to define and deploy clusters of VMs. A blueprint defines one or more Terraform modules to provision Cloud infrastructure. This bucket is used to store these blueprints.
export GOOGLE_CLOUD_BUCKET_NAME=GOOGLE_CLOUD_BUCKET_NAME
export GOOGLE_CLOUD_BUCKET_LOCATION=GOOGLE_CLOUD_BUCKET_LOCATION
Replace the following:
- GOOGLE_CLOUD_BUCKET_NAME: the name that you want to use for your Cloud Storage bucket. The name must meet the bucket naming requirements.
- GOOGLE_CLOUD_BUCKET_LOCATION: any Google Cloud region of your choice, where the bucket is hosted. For example, us-central1.
Switch to your A4-approved project
Run the following command to ensure that you are in the Google Cloud project that has the approved reservation block for the A4 machine type.
gcloud config set project ${A4_RESERVATION_PROJECT_ID}
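You can confirm the active project with:
gcloud config get-value project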
Create a Cloud Storage bucket
A best practice when working with Terraform is to store the state remotely in a location with versioning enabled. On Google Cloud, you can create a Cloud Storage bucket that has versioning enabled.
To create the bucket that stores your Terraform modules and state, and to enable versioning on it, run the following commands from Cloud Shell using your environment variables:
gcloud storage buckets create gs://${GOOGLE_CLOUD_BUCKET_NAME} \
    --project=${A4_RESERVATION_PROJECT_ID} \
    --default-storage-class=STANDARD \
    --location=${GOOGLE_CLOUD_BUCKET_LOCATION} \
    --uniform-bucket-level-access

gcloud storage buckets update gs://${GOOGLE_CLOUD_BUCKET_NAME} --versioning
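Optionally, verify the bucket configuration; the output of the following command should show that versioning is enabled:
gcloud storage buckets describe gs://${GOOGLE_CLOUD_BUCKET_NAME}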
Set up the Cluster Toolkit
To create a Slurm cluster in a Google Cloud project, you can use Cluster Toolkit to handle deploying and provisioning the cluster. Cluster Toolkit is open-source software offered by Google Cloud to simplify the process of deploying workloads on Google Cloud.
Use the following steps to set up Cluster Toolkit.
Clone the Cluster Toolkit GitHub repository
In Cloud Shell, clone the GitHub repository:
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
Go to the main working directory:
cd cluster-toolkit/
Build the Cluster Toolkit binary
In Cloud Shell, build the Cluster Toolkit binary from source by running the following command:
make
To deploy an A4 high accelerator-optimized machine Slurm cluster, you must use version v1.47.0 or later of the Cluster Toolkit. To verify the build and check its version, run the following command:
./gcluster --version
After building the binary, you are now ready to deploy clusters to run your jobs or workloads.
Create a deployment file
In the Cluster Toolkit directory, create a deployment file named ${A4_DEPLOYMENT_FILE_NAME}.yaml. For example, using nano:
nano ${A4_DEPLOYMENT_FILE_NAME}.yaml
Paste the following content into the YAML file, replacing the placeholders with the values that you set in your environment variables:
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: GOOGLE_CLOUD_BUCKET_NAME

vars:
  deployment_name: A4_DEPLOYMENT_NAME
  project_id: A4_RESERVATION_PROJECT_ID
  region: A4_REGION
  zone: A4_ZONE
  a4h_reservation_name: A4_RESERVATION_NAME
  a4h_cluster_size: 1
To save and exit the file, press Ctrl+O > Enter > Ctrl+X.
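As an illustration, with the example values used earlier in this tutorial (plus a hypothetical project ID and bucket name), the completed file might look like the following:
---
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: my-toolkit-bucket

vars:
  deployment_name: my-slurm-cluster-deployment
  project_id: my-a4-project
  region: us-central1
  zone: us-central1-a
  a4h_reservation_name: a4high-exr
  a4h_cluster_size: 1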
Provision the Slurm cluster
To provision the Slurm cluster, run the following deployment command. This command provisions the Slurm cluster by using the A4 Cluster Toolkit blueprint at examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml.
In Cloud Shell, start the cluster creation.
./gcluster deploy -d ${A4_DEPLOYMENT_FILE_NAME}.yaml examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml --auto-approve
Connect to the cluster
After the deployment completes, you can view your cluster and connect to it from the Google Cloud console.
Go to the Compute Engine > VM instances page in the Google Cloud console.
Locate the login node (a4high-login-001 or similar).
Click SSH to connect.
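Alternatively, you can connect from Cloud Shell with the gcloud CLI. The node name here is the example name from the previous step; substitute the login node name shown in your VM list:
gcloud compute ssh a4high-login-001 --zone=${A4_ZONE}
After you connect, you can run a standard Slurm command such as sinfo to confirm that the controller and compute partition are available:
sinfo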
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
Destroy the Slurm cluster
We recommend that you clean up your resources when they are no longer needed.
By default, the A4 High blueprints enable deletion protection on the Filestore instance. When destroying the Slurm cluster, you must disable deletion protection before running the destroy command.
Disable deletion protection
To disable deletion protection, update the instance with a command similar to the following:
gcloud filestore instances update INSTANCE_NAME \
--no-deletion-protection
Replace the following:
INSTANCE_NAME: the name of the instance that you want to edit. For example, my-genomics-instance.
To find the INSTANCE_NAME, you can run gcloud filestore instances list. This command lists all the Filestore instances in your current Google Cloud project, including their names, locations (zones), tiers, capacity, and status. Find the name that matches the a4-highgpu-8g machine that's running in this tutorial.
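For example, from Cloud Shell (the --project flag is optional if you already set the default project earlier):
gcloud filestore instances list --project=${A4_RESERVATION_PROJECT_ID}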
Destroy the Slurm cluster
Before running the destroy command, navigate to the root of the Cluster Toolkit directory. By default, the deployment folder, DEPLOYMENT_FOLDER, is created at the root of the Cluster Toolkit directory.
To destroy the cluster, run:
./gcluster destroy DEPLOYMENT_FOLDER --auto-approve
Replace DEPLOYMENT_FOLDER with the name of the deployment folder. It's typically the same as your deployment name, A4_DEPLOYMENT_NAME.
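For example, assuming the deployment folder matches the deployment name that you set earlier:
./gcluster destroy ${A4_DEPLOYMENT_NAME} --auto-approve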
When destruction is complete, you see a message similar to the following:
Destroy complete! Resources: xx destroyed.
Delete the storage bucket
After you confirm that the previous command completed without errors, delete the Cloud Storage bucket:
gcloud storage buckets delete gs://${GOOGLE_CLOUD_BUCKET_NAME}
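If the bucket still contains objects, such as saved Terraform state, the delete command fails. In that case, you can remove the bucket and its contents in one step:
gcloud storage rm --recursive gs://${GOOGLE_CLOUD_BUCKET_NAME}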
Troubleshooting
Error: Cloud Shell can't provision the cluster because there is no storage left.
You might see this error if you are a frequent user of Cloud Shell and you have run out of storage space.
To resolve this issue, see Disable or reset Cloud Shell.
Error: Cluster or blueprint name already exists.
You might see this error if you are using a project that has already used the exact file names used in this tutorial. For example, if someone else in your organization ran through this tutorial end-to-end.
To resolve this issue, run through the tutorial again with a unique name for the deployment file, and then rerun the provisioning command with the new deployment file.
What's next
- Advanced Slurm tasks:
- Learn how to Redeploy the Slurm cluster
- Learn how to Test network performance on the Slurm cluster
- Learn how to manage host events:
- View VMs topology
- Monitor VMs in your Slurm cluster
- Report faulty host