This document provides an overview of the Slurm deployment of an A3 accelerator-optimized machine family cluster on Google Cloud. This solution uses the A3 Mega (a3-megagpu-8g) machine type. Each of these VMs has eight NVIDIA H100 GPUs, offers 80 GB of GPU memory per GPU, and can be configured to use GPUDirect-TCPXO. Clusters created by using these machine types are ideal for running large-scale artificial intelligence (AI) and machine learning (ML) training workloads.
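As a quick check before you deploy, you can inspect the machine type with gcloud. This is an illustrative sketch: the zone is a placeholder, and the listed output fields depend on your gcloud version.

    # Confirm the GPU count and type attached to a3-megagpu-8g.
    # Replace the zone with one where A3 Mega capacity is available to you.
    gcloud compute machine-types describe a3-megagpu-8g \
        --zone=us-central1-a \
        --format="yaml(name, guestCpus, memoryMb, accelerators)"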
Deployment architecture
This section provides an overview of the deployment architecture.
Cluster blueprints
The deployment of an A3 Mega Slurm cluster uses three cluster blueprints, one for each of the following tasks:
- Base (networking & home filesystem) setup: the slurm-a3mega-base.yaml blueprint provisions a Virtual Private Cloud network and one Filestore file system for mounting /home across the cluster.
- Image building: the slurm-a3mega-image.yaml blueprint builds a custom Debian 12 OS image that has Slurm pre-installed. This OS image also includes the latest kernel modules and configurations that are necessary to support the highest network performance.
- Cluster deployment: the slurm-a3mega-cluster.yaml blueprint provisions a Slurm cluster using the custom Debian 12 OS image. You can update and re-provision the cluster deployment blueprint as needed.

In addition to a system network card, each a3-megagpu-8g machine type has eight network interfaces (NICs) dedicated to GPU communication. The deployment also creates one Virtual Private Cloud network for each GPU and sets the MTU for each of these networks to 8244.
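Once a cluster is up, you can confirm the jumbo-frame setting from any a3-megagpu-8g compute node. This is an illustrative check; interface names vary by image.

    # List every NIC on the node together with its MTU. The eight GPU NICs
    # should report the 8244 MTU configured by the deployment.
    ip -o link show | awk '{print $2, $4, $5}'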
Deployment files
This solution uses two deployment files, deployment-base.yaml and deployment-image-cluster.yaml, to centralize the configuration needed across the three blueprints for each deployment, minimizing the number of changes needed in each individual blueprint file.
With this approach, the lifecycle of the Filestore instance and the lifecycle of the cluster are separated, which allows the cluster to be deleted while retaining access to data and home directories.
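The following commands sketch how the blueprints and deployment files fit together at deploy time. The gcluster binary name and its -d flag are assumptions based on recent Cluster Toolkit releases (older releases ship the CLI as ghpc), so adjust for the toolkit version you use.

    # Deploy the three blueprints in order, reusing the two deployment
    # files to supply the shared configuration values.
    ./gcluster deploy -d deployment-base.yaml slurm-a3mega-base.yaml
    ./gcluster deploy -d deployment-image-cluster.yaml slurm-a3mega-image.yaml
    ./gcluster deploy -d deployment-image-cluster.yaml slurm-a3mega-cluster.yaml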
Network performance components
The following components are used to optimize the network performance for your a3-megagpu-8g Slurm cluster. After deploying the cluster, see Enable GPUDirect-TCPXO optimized NCCL communication for an example of configuring a workload to use GPUDirect-TCPXO.
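As a minimal preview, the following job-script fragment shows one way a non-containerized workload can pick up the NCCL net plugin that the Prolog installs under /var/lib/tcpxo/lib64/ (described later in this section). The profile script name and the training command are assumptions; the complete set of NCCL environment variables is covered in the linked guide.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --gpus-per-node=8

    # Make the NCCL net plugin installed by the Prolog visible to NCCL.
    TCPXO_LIB_DIR=/var/lib/tcpxo/lib64
    export LD_LIBRARY_PATH="${TCPXO_LIB_DIR}:${LD_LIBRARY_PATH}"

    # Load TCPXO-tuned NCCL settings if a profile script is present
    # (the script name here is an assumption; see the linked guide).
    [ -f "${TCPXO_LIB_DIR}/nccl-env-profile.sh" ] && source "${TCPXO_LIB_DIR}/nccl-env-profile.sh"

    srun ./your_training_command   # placeholder for the actual workload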
- GPUDirect-TCPXO
GPUDirect-TCPXO is a custom, remote direct memory access (RDMA) networking stack that increases the network performance of your VMs by allowing data packet payloads to transfer directly from GPU memory to the network interface without having to go through the CPU and system memory. a3-megagpu-8g VMs can use GPUDirect-TCPXO combined with Google Virtual NIC (gVNIC) to deliver higher throughput between VMs in a cluster when compared to the A2 accelerator-optimized machine types on Google Cloud.
- The Receive Data Path Manager (RxDM)
To achieve optimal application performance, an additional service called the Receive Data Path Manager (RxDM) runs alongside the applications that use GPUDirect-TCPXO.
Additionally, an NCCL net plugin must be installed into the execution environment of the workload. Both the RxDM and the plugin are distributed by a Docker image.
- The cluster deployment blueprint
The slurm-a3mega-cluster.yaml blueprint includes Slurm Prolog and Epilog scripts that run before and after every job that runs on more than one a3-megagpu-8g compute node.
The Prolog performs the following actions:
- Ensures that the import-helper kernel module is loaded.
- Installs the NCCL net plugin into /var/lib/tcpxo/lib64/ of the host.
- Runs the RxDM service, which is a long-lived service that runs alongside the job. Starting the RxDM can take 10-20 seconds, which blocks the start of the job until the RxDM service is initialized. Because of this, you won't see the Slurm job output and error logs until the RxDM has started.
The Epilog performs the following actions:
- Stops the RxDM service.
- Prunes any stopped containers and frees up disk space.
For more information about Prologs and Epilogs, see the Slurm documentation.
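If you need to confirm that these steps completed for a running multi-node job, the following commands, run on one of the job's a3-megagpu-8g nodes, are one way to check. The RxDM container name filter is an assumption; adjust it to match the name used by your Prolog.

    lsmod | grep import_helper           # kernel module loaded by the Prolog (lsmod shows an underscore)
    ls /var/lib/tcpxo/lib64/             # NCCL net plugin installed by the Prolog
    sudo docker ps --filter "name=rxdm"  # long-lived RxDM service started by the Prolog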