This document summarizes how to create a cluster for your AI workloads on AI Hypercomputer. Specifically, it guides you through the process and the choices that you make when starting a cluster.
Before you begin
You must have an existing workload that you want to support.
You must be familiar with commonly used terminology for AI and ML workloads, such as model training and inference.
Start a cluster
Starting a cluster involves the following steps:
- Determine your workload and choose a machine type
- Choose a consumption option and obtain capacity
- Choose a deployment option
- Choose an orchestrator
- Choose the operating system and cluster image
- Create your cluster
Determine your workload and choose a machine type
Select a machine type for your AI workload. AI Hypercomputer supports cluster creation by using the A4X, A4, and A3 machine series. Consider the following recommendations for machine usage:
- For foundation model training and inference: A4X
- For large model training, fine-tuning, and inference: A4 or A3 Ultra
- For mainstream model inference and fine-tuning: A3 Mega or A3 High
- For serving inference: A3 Edge
For detailed information about each machine series, see GPU machine types. For detailed information about workload recommendations for each machine, see Recommended configurations.
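Before you settle on a machine series, you can confirm what's available in your target zone directly from the gcloud CLI. The following is a minimal sketch; the zone us-central1-b is only illustrative:

```sh
# List the A3 machine types offered in a given zone.
gcloud compute machine-types list \
    --zones=us-central1-b \
    --filter="name~'^a3'"

# List the GPU models available in that zone.
gcloud compute accelerator-types list \
    --filter="zone:us-central1-b"
```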
Choose a consumption option and obtain capacity
Select a consumption option for your GPU resources based on your workload's availability requirements and your chosen machine type. For example, to use the A4X machine type, you must use the future reservations consumption option to reserve capacity for a specific date and time. The following descriptions summarize the consumption options:
Future reservations: Available for A4X, A4, and A3 Ultra machine types, with dense resource allocation and up to 53% discount for vCPUs and GPUs. Future reservations are ideal for workloads that require stability for an extended period of time, such as pre-training foundation models or multi-host foundation model inference. To use this consumption option, you must request capacity through your account team for a future start date and time.
Future reservations in calendar mode: Available for A4 and A3 Ultra machine types, with dense resource allocation and up to 53% discount for vCPUs and GPUs. Future reservations in calendar mode help you reserve resources for workloads that run for up to 90 days and require stability, such as pre-training or fine-tuning models. However, to use this consumption option, you must create a reservation request to reserve resources at a future date and time, and Google Cloud must approve your request.
Flex-start: Available for all GPU machine types, except A4X. Flex-start lets you create short-lived, dense clusters that last up to seven days and come with discounts of up to 53% for vCPUs and GPUs for A2 and later machine types. You can create Flex-start clusters directly through Compute Engine, Google Kubernetes Engine, or Cluster Toolkit. However, clusters aren't available immediately; Google Cloud creates them automatically as soon as resources become available.
Spot: Available for all GPU machine types, except A4X. Spot VMs let you create compute resources immediately, based on availability; however, Compute Engine can preempt these VMs at any time. Spot VMs are offered at the deepest discounts available on Compute Engine (61-90%).
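Spot is the only option in this list that doesn't require a reservation or an approval step, so it's the quickest to try. The following is a minimal sketch of creating a single Spot VM; the VM name, zone, and image are illustrative, and A3 machine types typically require additional network configuration that's omitted here:

```sh
# Create one Spot VM on an A3 machine type. Compute Engine can preempt
# this VM at any time; DELETE removes it on preemption.
gcloud compute instances create spot-gpu-vm \
    --zone=us-central1-a \
    --machine-type=a3-highgpu-8g \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --provisioning-model=SPOT \
    --instance-termination-action=DELETE
```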
For more information about consumption options, see Comparison of consumption options.
Choose a deployment option
Depending on the level of control that you need over your cluster deployment, choose between a highly managed deployment and a less managed deployment that gives you more control over your infrastructure.
Highly managed
If you want Google to deploy and set up your infrastructure, then use Cluster Toolkit or GKE.
Cluster Toolkit: an open source tool offered by Google that simplifies cluster configuration and deployment for GKE or Compute Engine. You use predefined blueprints to deploy common configurations, such as A4 machine types with Slurm. You can modify blueprints to customize deployments and your software stack.
GKE: a managed Kubernetes service built on the open source container orchestration platform. GKE offers features like autoscaling and high availability, which make it a good fit for deploying and managing AI and ML workloads: it can orchestrate containerized applications, supports specialized hardware, and integrates with the Google Cloud ecosystem. You can deploy GKE clusters by using GKE directly or by using Cluster Toolkit, and you can choose between GKE Standard and Autopilot mode.
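For example, on the highly managed path, creating a minimal Standard-mode GKE cluster takes a single command. The cluster name and zone below are illustrative, and you'd typically add GPU node pools afterward (see the node pool example later in this document):

```sh
# Create a minimal Standard-mode GKE cluster; GPU node pools come later.
gcloud container clusters create my-ai-cluster \
    --zone=us-central1-a \
    --num-nodes=1
```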
Less managed, more control
For more granular control over your clusters and the software installed on them, create a Compute Engine cluster by using managed instance groups (MIGs) or by creating VMs in bulk. Then, manually install any key software you need on the VMs.
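The following is a minimal sketch of the bulk-creation path; the name pattern, count, zone, and image are illustrative, and you'd still need to install your drivers and software stack afterward:

```sh
# Create four identical VMs in one request; "#" in the name pattern is
# replaced with a sequence number (gpu-node-1, gpu-node-2, ...).
gcloud compute instances bulk create \
    --name-pattern="gpu-node-#" \
    --count=4 \
    --zone=us-central1-a \
    --machine-type=a3-highgpu-8g \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud
```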
Choose an orchestrator
An orchestrator automates the management of your clusters so that you don't have to manage each VM in the cluster yourself. An orchestrator, such as Slurm or GKE, handles tasks like job queueing, resource allocation, autoscaling (in the case of GKE), and other day-to-day cluster management tasks.
Slurm: Slurm is an open source orchestrator that's commonly used for HPC, AI, and ML workloads. To use Slurm, you can use Cluster Toolkit, which offers cluster blueprints that automatically install Slurm on your clusters, or you can manually install Slurm on a Compute Engine cluster. A sample batch script follows these options.
GKE: GKE is a managed service built on top of Kubernetes, an open-source container orchestration platform. GKE is ideal for deploying and managing AI or ML workloads, because of its ability to orchestrate containerized applications, its support of specialized hardware, and its place in the Google Cloud ecosystem. You can deploy GKE clusters by using GKE directly or by using Cluster Toolkit.
Bring your own orchestrator: If you want to use a different orchestrator, then you must run it on a Compute Engine cluster. However, creating a Compute Engine cluster is the least managed option offered on Google Cloud, which means that you're responsible for setting up, maintaining, and updating your VMs.
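If you choose Slurm, your workloads run as batch jobs. The following is an illustrative batch script that requests two nodes with eight GPUs each; the job name, resource counts, and train.py are placeholders for your own workload. You'd submit it with sbatch:

```sh
#!/bin/bash
# Request two nodes with eight GPUs each, running one task per node.
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1

# srun launches the command on every allocated node; train.py is a
# placeholder for your training entry point.
srun python train.py
```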
Choose the operating system and cluster image
Depending on whether you use GKE or Compute Engine, select an image that contains your chosen operating system, such as Container-Optimized OS for GKE clusters or an accelerator OS image for Compute Engine clusters. You can also select a Deep Learning Software Layer (DLSL) image for your containers.
For detailed information, review AI Hypercomputer images.
Images for GKE clusters
To create GKE clusters, we recommend that you use the default Container-Optimized OS node images in both Standard and Autopilot modes. However, in Standard mode, you can also choose other available images, such as Ubuntu.
If you use Cluster Toolkit to deploy your cluster, then you can use only Container-Optimized OS images, because these are the images built into the cluster blueprints. For more information about each node image, see Node images in the GKE documentation.
GKE also offers Deep Learning Software Layer (DLSL) container images that install packages such as NVIDIA CUDA and NCCL, as well as ML frameworks like PyTorch, providing a ready-to-use environment for deep learning workloads. These prebuilt DLSL container images are tested and verified to work on GKE clusters.
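In Standard mode, you set the node image per node pool with the --image-type flag. The following sketch adds an Ubuntu-based A3 High node pool to an existing cluster; the cluster and pool names, zone, and node count are illustrative:

```sh
# Add an Ubuntu-based GPU node pool to an existing Standard-mode cluster.
gcloud container node-pools create gpu-pool \
    --cluster=my-ai-cluster \
    --zone=us-central1-a \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8 \
    --image-type=UBUNTU_CONTAINERD \
    --num-nodes=2
```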
OS images for Compute Engine clusters
AI Hypercomputer offers OS images optimized for running AI and ML workloads on Compute Engine. Choose the OS that you're most familiar with:
- Rocky Linux 9 accelerator
- Rocky Linux 8 accelerator
- Ubuntu 24.04 LTS accelerator
- Ubuntu 22.04 LTS accelerator
If you use Cluster Toolkit, then you don't need to select one of these images yourself: the accelerator images are already bundled into Cluster Toolkit blueprints, which create custom images that extend the Ubuntu LTS accelerator OS images.
For more information about each OS image, see Operating system details in the Compute Engine documentation.
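The following sketch shows how you might find and use one of these images with the gcloud CLI. IMAGE_PROJECT and IMAGE_FAMILY are placeholders for the values listed on the AI Hypercomputer images page, and the VM name, zone, and machine type are illustrative:

```sh
# List the accelerator image families in the image project. Replace
# IMAGE_PROJECT with the project from the AI Hypercomputer images page.
gcloud compute images list \
    --project=IMAGE_PROJECT \
    --filter="family~'accelerator'"

# Create a VM from the chosen image family.
gcloud compute instances create my-gpu-vm \
    --zone=us-central1-a \
    --machine-type=a3-highgpu-8g \
    --image-family=IMAGE_FAMILY \
    --image-project=IMAGE_PROJECT
```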
Create your cluster
After you review the cluster creation process and make preliminary decisions for your workload, create your cluster by using one of the following options:
- Create a GKE cluster
- Create a Slurm cluster by using Cluster Toolkit
- Create a cluster with Compute Engine
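As an illustration of the Cluster Toolkit path, the following sketch clones the toolkit, builds the gcluster binary, and deploys one of the bundled example blueprints. The hpc-slurm.yaml blueprint and PROJECT_ID are placeholders; pick the blueprint that matches your machine series:

```sh
# Clone and build Cluster Toolkit (requires Go, Terraform, and Packer;
# see the Cluster Toolkit setup documentation).
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
make

# Deploy an example Slurm blueprint into your project.
./gcluster deploy examples/hpc-slurm.yaml \
    --vars project_id=PROJECT_ID
```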