This document provides information about how to use multiple reservations for an A3 Mega Slurm cluster.
As the jobs running on your A3 Mega cluster grow, you might need to span them across more than one reservation. To do this, you need to make a few minor changes to the following files:
- The cluster blueprint: a3mega-slurm-blueprint.yaml
- The cluster deployment file: a3mega-slurm-deployment.yaml
Overview
To update your cluster, we recommend creating a single Slurm partition with
multiple nodesets so that a single job can span across multiple reservations.
To do this, complete the following steps:
- In the deployment file, add a nodeset size variable for each reservation.
- In the cluster blueprint, create a nodeset for each reservation and add all nodesets to the A3 Mega partition.
- Deploy or redeploy the A3 Mega cluster.
Switch to the Cluster Toolkit directory
Ensure that you are in the Cluster Toolkit working directory. To switch to it, run the following command:
cd cluster-toolkit
Create one nodeset for each reservation
To create a nodeset for each reservation, update your
a3mega-slurm-deployment.yaml deployment file to add a nodeset
variable for each reservation. This deployment file is located in the A3 Mega
directory: cluster-toolkit/example/machine-learning/a3-megagpu-8g/.
The following example shows how to add three nodeset variables to the
a3mega-slurm-deployment.yaml deployment file. Replace each
NUMBER_OF_VMS_* placeholder with the number of VMs in the corresponding
reservation.
vars:
  project_id: customer-project
  region: customer-region
  zone: customer-zone
  ...
  a3mega_nodeset_a_size: NUMBER_OF_VMS_A
  a3mega_nodeset_b_size: NUMBER_OF_VMS_B
  a3mega_nodeset_c_size: NUMBER_OF_VMS_C
  ...
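A single job can span at most the total number of nodes that the nodesets provide, so it can help to confirm that the sizes in the deployment file add up to the node count you expect. The following is a minimal sketch, not part of Cluster Toolkit: the sum_nodeset_sizes helper is hypothetical, it assumes each a3mega_nodeset_*_size variable sits on its own line with a plain integer value, and the sizes 80, 64, and 72 are illustrative.

```shell
# Hypothetical helper: add up every a3mega_nodeset_*_size value in a
# deployment file. Assumes one "name: integer" pair per line.
sum_nodeset_sizes() {
  awk -F': *' '/^[[:space:]]*a3mega_nodeset_[a-z]+_size:/ { total += $2 } END { print total + 0 }' "$1"
}

# Demonstration on a trimmed sample file with illustrative sizes.
sample=$(mktemp)
cat > "$sample" <<'EOF'
vars:
  a3mega_nodeset_a_size: 80
  a3mega_nodeset_b_size: 64
  a3mega_nodeset_c_size: 72
EOF
sum_nodeset_sizes "$sample"   # prints 216
rm -f "$sample"
```

To check your own file, you could pass the path of a3mega-slurm-deployment.yaml to the helper after replacing the placeholders with real numbers.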
Add all nodesets to the A3 Mega partition
To add the nodesets to the A3 Mega partition, you need to update the
a3mega-slurm-blueprint.yaml cluster blueprint. This blueprint file is located
in the A3 Mega directory: cluster-toolkit/example/machine-learning/a3-megagpu-8g/.
To add the nodesets, complete the following steps in the
a3mega-slurm-blueprint.yaml blueprint:
- Locate the - id: a3mega_nodeset section. It should resemble the following:
  - id: a3mega_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use:
    - sysnet
    - gpunets
    settings:
      node_count_static: $(vars.a3mega_cluster_size)
      node_count_dynamic_max: 0
      ...
- Make a copy of the entire - id: a3mega_nodeset section for each of the new reservations. In each copy, give the nodeset a unique id and change the node_count_static setting to specify the nodeset variable created in the preceding step.
  For example, if you had created three nodesets, you would update as follows:
  - id: a3mega_nodes_a
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use:
    - sysnet
    - gpunets
    settings:
      node_count_static: $(vars.a3mega_nodeset_a_size)
      node_count_dynamic_max: 0
      ...
  - id: a3mega_nodes_b
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use:
    - sysnet
    - gpunets
    settings:
      node_count_static: $(vars.a3mega_nodeset_b_size)
      node_count_dynamic_max: 0
      ...
  - id: a3mega_nodes_c
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use:
    - sysnet
    - gpunets
    settings:
      node_count_static: $(vars.a3mega_nodeset_c_size)
      node_count_dynamic_max: 0
      ...
- Locate the - id: a3mega_partition section.
  - id: a3mega_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - a3mega_nodeset
    settings:
      ...
- Add the new nodesets to the use list, replacing the original a3mega_nodeset entry.
  - id: a3mega_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - a3mega_nodes_a
    - a3mega_nodes_b
    - a3mega_nodes_c
    settings:
      ...
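After renaming the nodesets, it is easy to leave a stale name in the partition. The following sketch shows one way to cross-check the blueprint before deploying. The undefined_nodesets helper is hypothetical and not part of Cluster Toolkit; it assumes block-style YAML with one - id: line per module and one entry per line under use:, and the sample blueprint is a trimmed illustration that deliberately keeps a stale a3mega_nodeset reference.

```shell
# Hypothetical helper: list module IDs that a partition references under
# "use:" but that have no "- id:" entry anywhere in the blueprint.
# This is a line-based check, not a general YAML parser.
# Usage: undefined_nodesets BLUEPRINT_FILE PARTITION_ID
undefined_nodesets() {
  awk -v part="$2" '
    /^[[:space:]]*- id:/        { id = $3; defined[id] = 1; in_use = 0; next }
    /^[[:space:]]*use:/         { in_use = (id == part); next }
    in_use && /^[[:space:]]*- / { used[$2] = 1; next }
                                { in_use = 0 }
    END { for (u in used) if (!(u in defined)) print u }
  ' "$1" | sort
}

# Demonstration on a trimmed sample with one stale reference.
sample=$(mktemp)
cat > "$sample" <<'EOF'
  - id: a3mega_nodes_a
    use:
    - sysnet
    - gpunets
  - id: a3mega_nodes_b
    use:
    - sysnet
    - gpunets
  - id: a3mega_partition
    use:
    - a3mega_nodes_a
    - a3mega_nodes_b
    - a3mega_nodeset
EOF
undefined_nodesets "$sample" a3mega_partition   # prints a3mega_nodeset
rm -f "$sample"
```

An empty result suggests that every entry in the partition's use list matches a defined module; any printed name is worth checking by hand.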
Deploy the A3 Mega cluster
- If you are deploying the cluster for the first time, continue with the deployment. To deploy the cluster, see Deploy an A3 Mega Slurm cluster for ML training.
- If you are updating an existing cluster, run the following command from the Cluster Toolkit directory. The -w flag specifies that you want to overwrite the previously deployed infrastructure.
  ./gcluster deploy -w \
    -d example/machine-learning/a3-megagpu-8g/a3mega-slurm-deployment.yaml \
    -b example/machine-learning/a3-megagpu-8g/a3mega-slurm-blueprint.yaml
  This process might take approximately 10-30 minutes to delete any existing nodes and create all of the new nodes.
Connect to the A3 Mega Slurm cluster
To log in to the cluster, you can use either the Google Cloud console or the Google Cloud CLI.
Console
- Go to the Compute Engine > VM instances page. 
- Locate the login node. It should have a name similar to a3mega-login-001.
- From the Connect column of the login node, click SSH. 
gcloud
To connect to the login node, use the
gcloud compute ssh command.
gcloud compute ssh $(gcloud compute instances list --filter "name ~ login" --format "value(name)") \
  --tunnel-through-iap \
  --zone ZONE
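The inner gcloud compute instances list command returns every instance whose name matches login, so the command substitution only works as intended when exactly one instance matches. As a hedged sketch, the hypothetical require_single helper below prints its input only when it receives exactly one non-empty line; the instance names in the demonstration are illustrative.

```shell
# Hypothetical helper: print the only non-empty line read from stdin and
# return a non-zero status if there are zero or several such lines.
require_single() {
  awk 'NF { n++; name = $0 } END { if (n != 1) exit 1; print name }'
}

# Demonstration with illustrative instance names.
printf 'a3mega-login-001\n' | require_single   # prints a3mega-login-001
printf 'a3mega-login-001\nother-login-001\n' | require_single \
  || echo "expected exactly one login node" >&2

# Possible use with the command above (not executed here):
# LOGIN_NODE=$(gcloud compute instances list --filter "name ~ login" \
#   --format "value(name)" | require_single) \
#   && gcloud compute ssh "$LOGIN_NODE" --tunnel-through-iap --zone ZONE
```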
Test your multi-nodeset partition
When you connect to the login or controller node, you might see the following:
*** Slurm is currently being configured in the background. ***
If you see this message, wait a few minutes until Slurm has finished configuring
and then reconnect to the cluster. Then you can run sinfo and scontrol
to examine your new partition.
- For the sinfo command, the output should resemble the following:
  PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
  a3mega*   up    infinite    216 idle  a3mega-a3meganodesa-[0-79],a3mega-a3meganodesb-[0-63],a3mega-a3meganodesc-[0-71]
  debug     up    infinite      4 idle~ a3mega-debugnodeset-[0-3]
- For the scontrol show partition a3mega command, the output should resemble the following:
  PartitionName=a3mega
     AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
     AllocNodes=ALL Default=YES QoS=N/A
     DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
     MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
     NodeSets=a3meganodesa,a3meganodesb,a3meganodesc
     Nodes=a3mega-a3meganodesa-[0-79],a3mega-a3meganodesb-[0-63],a3mega-a3meganodesc-[0-71]
     PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
     OverTimeLimit=NONE PreemptMode=OFF
     State=UP TotalCPUs=44928 TotalNodes=216 SelectTypeParameters=NONE
     JobDefaults=(null)
     DefMemPerCPU=8944 MaxMemPerNode=UNLIMITED
     TRES=cpu=44928,mem=392421G,node=216,billing=44928
     ResumeTimeout=900 SuspendTimeout=600 SuspendTime=300 PowerDownOnIdle=NO
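To confirm that each reservation contributed the expected number of nodes, you can break the node list down by nodeset. The following is a minimal sketch using the node list from the sample output. The count_per_nodeset helper is hypothetical, and it assumes that every entry is either a single node name or a single prefix-[lo-hi] range.

```shell
# Hypothetical helper: print "<prefix> <node count>" for each entry of a
# Slurm node list. Entries with several ranges in one bracket, such as
# "[0-3,7]", are not handled.
count_per_nodeset() {
  printf '%s\n' "$1" | tr ',' '\n' | awk '
    {
      pos = index($0, "-[")
      if (pos == 0) { print $0, 1; next }    # plain node name
      prefix = substr($0, 1, pos - 1)
      range = substr($0, pos + 2)            # for example "0-79]"
      sub(/\]$/, "", range)
      n = split(range, b, "-")
      print prefix, (n == 2 ? b[2] - b[1] + 1 : 1)
    }'
}

nodes='a3mega-a3meganodesa-[0-79],a3mega-a3meganodesb-[0-63],a3mega-a3meganodesc-[0-71]'
count_per_nodeset "$nodes"
# a3mega-a3meganodesa 80
# a3mega-a3meganodesb 64
# a3mega-a3meganodesc 72
count_per_nodeset "$nodes" | awk '{ total += $2 } END { print total }'   # prints 216
```

The per-nodeset counts should match the nodeset variables in the deployment file, and their sum should match TotalNodes. On the cluster itself, scontrol show hostnames can expand a node list into individual node names, which is more robust than this sketch for irregular lists.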