This document outlines the deployment steps for provisioning an A3 Mega (a3-megagpu-8g) Google Kubernetes Engine (GKE) cluster that is ideal for running large-scale artificial intelligence (AI) and machine learning (ML) training workloads.
Before you begin
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
To identify the regions and zones where the a3-megagpu-8g machine type is available, run the following command:
gcloud compute machine-types list --filter="name=a3-megagpu-8g"
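If you only want the zone names, you can optionally narrow the output with a format expression; this is a convenience, not a required step:
# List only the zones that offer a3-megagpu-8g, deduplicated.
gcloud compute machine-types list \
    --filter="name=a3-megagpu-8g" \
    --format="value(zone)" | sort -u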
Ensure that you have enough GPU quota. Each a3-megagpu-8g machine has 8 H100 80GB GPUs attached, so you'll need at least 8 NVIDIA H100 80GB GPUs in your selected region.
- To view quotas, see View the quotas for your project. In the Filter field, select Dimensions (e.g. location) and specify gpu_family:NVIDIA_H100_MEGA.
- If you don't have enough quota, request a higher quota.
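If you prefer to check GPU quota from the CLI, region-level quota metrics can be inspected with gcloud. This is a rough sketch: the exact quota metric name for H100 Mega GPUs is an assumption here, so treat the console quota page described above as the authoritative view.
# REGION is your selected region; the H100 substring match is an assumption.
gcloud compute regions describe REGION \
    --flatten="quotas[]" \
    --filter="quotas.metric ~ H100" \
    --format="table(quotas.metric, quotas.limit, quotas.usage)"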
Ensure that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity. If you don't have enough quota, request a quota increase.
Overview
To deploy the cluster, you must complete the following:
- Install Cluster Toolkit
- Create a reservation or get a reservation name from your Technical Account Manager (TAM)
- Create a cluster
- Clean up resources created by Cluster Toolkit
Install Cluster Toolkit
From the CLI, complete the following steps:
- Install dependencies.
- Set up Cluster Toolkit.
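As a rough sketch of those two steps, assuming Git, Go, Terraform, and Packer are already installed (see the Cluster Toolkit documentation for the exact dependency versions), the setup typically looks like the following:
# Clone the Cluster Toolkit repository and build the gcluster binary.
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
make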
Create a reservation
If you don't have a reservation provided by a Technical Account Manager (TAM), we recommend creating a reservation. For more information, see Choose a reservation type.
Reservations incur ongoing costs even after the GKE cluster is destroyed. To manage your costs, we recommend the following options:
- Track spending by using budget alerts.
- Delete reservations when you're done with them. To delete a reservation, see delete your reservation.
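To set up the budget alerts mentioned above from the CLI, you can create a basic budget; this is a minimal sketch, where BILLING_ACCOUNT_ID, the display name, and the amount are placeholders to adjust for your environment:
# Alert at 90% of a monthly budget; values below are illustrative only.
gcloud billing budgets create \
    --billing-account=BILLING_ACCOUNT_ID \
    --display-name="a3-mega-reservation-budget" \
    --budget-amount=5000USD \
    --threshold-rule=percent=0.9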
To create a reservation, run the gcloud compute reservations create command and ensure that you specify the --require-specific-reservation flag.
gcloud compute reservations create RESERVATION_NAME \
    --require-specific-reservation \
    --project=PROJECT_ID \
    --machine-type=a3-megagpu-8g \
    --vm-count=NUMBER_OF_VMS \
    --zone=ZONE
Replace the following:
- RESERVATION_NAME: a name for your reservation.
- PROJECT_ID: your project ID.
- NUMBER_OF_VMS: the number of VMs needed for the cluster.
- ZONE: a zone that has a3-megagpu-8g machine types. To find supported zones for a specific VM machine type, see Regions and zones.
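Optionally, you can confirm that the reservation was created and shows the expected VM count before moving on:
# Describe the reservation you just created.
gcloud compute reservations describe RESERVATION_NAME \
    --zone=ZONE \
    --project=PROJECT_ID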
Create a cluster
Use the following instructions to create a cluster using Cluster Toolkit.
After you have installed the Cluster Toolkit, ensure that you are in the Cluster Toolkit directory. To go to the main Cluster Toolkit working directory, run the following command from the CLI.
cd cluster-toolkit
Create a Cloud Storage bucket to store the state of the Terraform deployment:
gcloud storage buckets create gs://BUCKET_NAME \
    --default-storage-class=STANDARD \
    --project=PROJECT_ID \
    --location=COMPUTE_REGION_TERRAFORM_STATE \
    --uniform-bucket-level-access
gcloud storage buckets update gs://BUCKET_NAME --versioning
Replace the following variables:
- BUCKET_NAME: the name of the new Cloud Storage bucket.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
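Optionally, verify that the bucket exists and that object versioning is turned on; the versioning setting appears in the describe output:
# Inspect the bucket configuration before relying on it for Terraform state.
gcloud storage buckets describe gs://BUCKET_NAME --project=PROJECT_ID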
Update the blueprint deployment file. In the examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml file, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:
- DEPLOYMENT_NAME: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.
- BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION: the compute region for the cluster.
- COMPUTE_ZONE: the compute zone for the node pool of A3 Mega machines.
- NODE_COUNT: the number of A3 Mega nodes in your cluster.
- IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect with the cluster. This CIDR block must include the IP address of the machine used to call Terraform. For more information, see How authorized networks work. To get the IP address for your host machine, run the following command:
curl ifconfig.me
For the extended_reservation field, use one of the following, depending on whether you want to target specific blocks in a reservation when provisioning the node pool:
- To place the node pool anywhere in the reservation, provide the name of your reservation (RESERVATION_NAME).
- To target a specific block within your reservation, use the reservation and block names in the following format: RESERVATION_NAME/reservationBlocks/BLOCK_NAME. If you don't know which blocks are available in your reservation, see View a reservation topology.
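For orientation, a filled-in deployment file might look roughly like the following sketch. The terraform_backend_defaults shape follows the standard Cluster Toolkit GCS backend layout; the variable names shown under vars (for example, authorized_cidr and static_node_count) are assumptions, so keep the keys that already exist in the file and only replace their values.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: BUCKET_NAME
vars:
  deployment_name: DEPLOYMENT_NAME
  project_id: PROJECT_ID
  region: COMPUTE_REGION
  zone: COMPUTE_ZONE
  # The key names below are illustrative; use the keys present in the file.
  authorized_cidr: IP_ADDRESS/SUFFIX
  extended_reservation: RESERVATION_NAME
  static_node_count: NODE_COUNT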
To modify advanced blueprint settings, edit examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml.
Deploy the blueprint to provision the GKE infrastructure using A3 Mega machine types:
cd ~/cluster-toolkit
./gcluster deploy -d \
    examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml \
    examples/gke-a3-megagpu/gke-a3-megagpu.yaml
When prompted, select (A)pply to deploy the blueprint.
The blueprint creates VPC networks, service accounts, a cluster, and a node pool.
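To optionally confirm that the cluster is reachable and that its nodes registered, you can fetch credentials and list the nodes. The cluster name is derived from your DEPLOYMENT_NAME, as noted in the clean-up section below; use --zone instead of --region if your cluster is zonal.
# Fetch kubeconfig credentials for the new cluster, then list its nodes.
gcloud container clusters get-credentials CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --project=PROJECT_ID
kubectl get nodes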
Clean up resources created by Cluster Toolkit
To avoid recurring charges for the resources used on this page, clean up the resources provisioned by Cluster Toolkit, including the VPC networks and GKE cluster:
cd ~/cluster-toolkit
./gcluster destroy CLUSTER_NAME/
Replace CLUSTER_NAME with the name of your cluster. For clusters created with Cluster Toolkit, the cluster name is based on the DEPLOYMENT_NAME.
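If you created a reservation and a Terraform state bucket solely for this deployment, you may also want to remove them once the cluster is destroyed. This is an optional sketch using the placeholders from the earlier steps; skip it if those resources are shared with other workloads.
# Delete the reservation to stop its ongoing charges.
gcloud compute reservations delete RESERVATION_NAME \
    --zone=ZONE \
    --project=PROJECT_ID
# Remove the Terraform state bucket and its contents.
gcloud storage rm --recursive gs://BUCKET_NAME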