This document provides best practices for configuring static compute nodes for optimal usability and performance of high performance computing (HPC) and artificial intelligence (AI) workloads.
Use a reservation
To make sure that Compute Engine resources are available when you need them, you can use reservations. Reservations provide a very high level of assurance that capacity for Compute Engine zonal resources will be available.
For static clusters, we recommend that you use reservations. A reservation guarantees resource availability, but this guarantee comes with tradeoffs: you must create the reservation manually before you deploy the cluster, and after the reservation is created you are billed for the compute resources as if the VMs were in use, whether or not any VMs have actually been created using the reservation.
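Because you pay for reserved capacity whether or not it is consumed, it can be useful to check how much of a reservation is in use. The following is a minimal sketch using the gcloud CLI; `RESERVATION_NAME` and `ZONE` are placeholders for your own values, and the output fields assume a specifically targeted reservation:

```shell
# Show the total VM count in a reservation and how many VMs are currently
# consuming it. RESERVATION_NAME and ZONE are placeholders.
gcloud compute reservations describe RESERVATION_NAME \
    --zone=ZONE \
    --format="value(specificReservation.count, specificReservation.inUseCount)"
```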
Use compact placement policy
A compact placement policy specifies that your static compute nodes should be placed physically close to each other, which reduces network latency between nodes.
Unlike autoscaled nodes, for which Slurm might create a new compact placement policy for each job, the placement of static compute nodes is tied to the lifecycle of the nodeset, not to the lifecycle of individual jobs.
However, it is important to note that the topology of static compute nodes might be subject to placement changes when nodes are restarted, recreated, or migrated. This can happen because of explicit actions like updating images, or because of maintenance events.
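Because the topology can change after a restart, recreation, or migration, you might want to re-check where a node landed. As a sketch (the instance and zone names are placeholders, and the field is only populated for machine types that expose physical topology), you can read the physical host identifier that Compute Engine reports for a VM:

```shell
# Print the physical host topology for an instance. Instances that are
# physically closer share a longer common prefix in this value.
# INSTANCE_NAME and ZONE are placeholders.
gcloud compute instances describe INSTANCE_NAME \
    --zone=ZONE \
    --format="value(resourceStatus.physicalHost)"
```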
How to configure static compute nodes to use a reservation and compact placement policy
When you use a reservation with a placement policy, you must create the placement policy before the reservation and specify it when you create the reservation. When a reservation with placement is provided to Slurm, Slurm automatically uses the placement policy that is attached to the reservation.
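Before handing the reservation to Slurm, you can confirm that it actually carries the placement policy. This is a sketch; the names are placeholders, and the exact output format of the resource-policies field may vary:

```shell
# List the resource policies attached to a reservation. The compact
# placement policy created earlier should appear in the output.
# RESERVATION_NAME and ZONE are placeholders.
gcloud compute reservations describe RESERVATION_NAME \
    --zone=ZONE \
    --format="value(resourcePolicies)"
```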
To configure static compute nodes to use a compact placement policy and reservation, complete the following steps by using the Google Cloud CLI:
1. To create a compact placement policy, use the `gcloud compute resource-policies create group-placement` command with the `--collocation=COLLOCATED` flag:

```
gcloud compute resource-policies create group-placement PLACEMENT_POLICY_NAME \
    --collocation=COLLOCATED \
    --project=PROJECT_ID \
    --region=REGION
```
2. Use the `gcloud compute reservations create` command to create a reservation for a set of VMs, and specify the compact placement policy that you created in the previous step:

```
gcloud compute reservations create RESERVATION_NAME \
    --vm-count=VM_COUNT \
    --machine-type=MACHINE_TYPE \
    --require-specific-reservation \
    --project=PROJECT_ID \
    --zone=ZONE \
    --resource-policies=compact-placement=PLACEMENT_POLICY_NAME
```
3. Before you deploy the cluster, update the cluster blueprint to use the reservation for your static compute nodes. The `enable_placement` flag must be set to `false`. This indicates that Slurm doesn't create the placement policy; it comes from the reservation instead:

```
- id: static_nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use: [network]
  settings:
    node_count_static: VM_COUNT
    node_count_dynamic_max: 0
    enable_placement: false
    reservation_name: RESERVATION_NAME
    machine_type: MACHINE_TYPE

- id: static_partition
  source: community/modules/compute/schedmd-slurm-gcp-v6-partition
  use: [static_nodeset]
  settings:
    partition_name: static
    exclusive: false
```
Replace the following:

- `RESERVATION_NAME`: the name of your reservation.
- `VM_COUNT`: the number of VMs in the reservation.
- `MACHINE_TYPE`: a machine type from the compute-optimized or accelerator-optimized machine family.
- `PLACEMENT_POLICY_NAME`: the name of your placement policy.
- `PROJECT_ID`: your project ID.
- `REGION`: the region where your VMs are located.
- `ZONE`: the zone where your VMs are located.
Summary of best practices
The following is a summary of the recommended best practices for clusters that use static compute nodes.
| Requirement | Recommendation |
|---|---|
| VM availability | Use a reservation. |
| Reduce network latency | Use a compact placement policy. |
What's next
- Learn how to Manage static compute nodes.