Deploy an A3 Mega GKE cluster for ML training

This document describes how to deploy an A3 Mega (a3-megagpu-8g) Google Kubernetes Engine (GKE) cluster that is ideal for running large-scale artificial intelligence (AI) and machine learning (ML) training workloads.

Before you begin

  1. In the Google Cloud console, activate Cloud Shell.

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  2. To identify the regions and zones where the a3-megagpu-8g machine type is available, run the following command:

    gcloud compute machine-types list --filter="name=a3-megagpu-8g"
    
  3. Ensure that you have enough GPU quota. Each a3-megagpu-8g machine has 8 NVIDIA H100 80GB GPUs attached, so you need quota for at least 8 NVIDIA H100 80GB GPUs in your selected region.

    1. To view quotas, see View the quotas for your project. In the Filter field, select Dimensions (e.g. location) and specify gpu_family:NVIDIA_H100_MEGA. You can also check quota from the CLI, as shown in the sketch after this list.
    2. If you don't have enough quota, request a higher quota.
  4. Ensure that you have enough Filestore quota. You need a minimum of 10,240 GiB of zonal (also known as high scale SSD) capacity. If you don't have enough quota, request a quota increase.
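
If you prefer to check GPU quota from the CLI, the following is a minimal sketch. It assumes that the region's H100 Mega quota metric contains the string H100 (the exact metric name may differ), and us-central1 is only an illustrative region.

# Show H100-related quota metrics, limits, and usage for a region.
# The region and the metric-name pattern are illustrative assumptions.
gcloud compute regions describe us-central1 \
    --project=PROJECT_ID \
    --format=json | grep -B 2 -A 2 "H100"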

Overview

To deploy the cluster, you must complete the following:

  1. Install Cluster Toolkit
  2. Create a reservation or get a reservation name from your Technical Account Manager (TAM)
  3. Create a cluster
  4. Clean up resources created by Cluster Toolkit

Install Cluster Toolkit

From the CLI, complete the following steps:

  1. Install dependencies.
  2. Set up Cluster Toolkit.
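
As a minimal sketch, and assuming the dependencies (Git, Go, Terraform, and Packer) are already available in your environment, setting up Cluster Toolkit looks like the following; see the Cluster Toolkit documentation for the authoritative dependency list.

# Clone the public Cluster Toolkit repository and build the gcluster binary.
# Assumes Git, Go, Terraform, and Packer are already installed.
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
make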

Create a reservation

If you don't have a reservation provided by a Technical Account Manager (TAM), we recommend that you create one. For more information, see Choose a reservation type.

Reservations incur ongoing costs even after the GKE cluster is destroyed. To manage your costs, delete the reservation when you no longer need it, as described in Clean up resources created by Cluster Toolkit.

To create a reservation, run the gcloud compute reservations create command and ensure that you specify the --require-specific-reservation flag.

gcloud compute reservations create RESERVATION_NAME \
    --require-specific-reservation \
    --project=PROJECT_ID \
    --machine-type=a3-megagpu-8g \
    --vm-count=NUMBER_OF_VMS \
    --zone=ZONE

Replace the following:

  • RESERVATION_NAME: a name for your reservation.
  • PROJECT_ID: your project ID.
  • NUMBER_OF_VMS: the number of VMs needed for the cluster.
  • ZONE: a zone that has a3-megagpu-8g machine types. To find supported zones for a specific VM machine type, see Regions and zones.
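
For example, with illustrative values (a reservation named a3mega-reservation-0 with two VMs in us-central1-c in the project my-project-id), the command and a follow-up check might look like the following:

# Illustrative values only; substitute your own reservation name, project, VM count, and zone.
gcloud compute reservations create a3mega-reservation-0 \
    --require-specific-reservation \
    --project=my-project-id \
    --machine-type=a3-megagpu-8g \
    --vm-count=2 \
    --zone=us-central1-c

# Confirm that the reservation exists and view its properties.
gcloud compute reservations describe a3mega-reservation-0 \
    --project=my-project-id \
    --zone=us-central1-c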

Create a cluster

Use the following instructions to create a cluster with Cluster Toolkit.

  1. After you have installed Cluster Toolkit, ensure that you are in its main working directory. To go there, run the following command from the CLI.

    cd cluster-toolkit
  2. Create a Cloud Storage bucket to store the state of the Terraform deployment:

    gcloud storage buckets create gs://BUCKET_NAME \
          --default-storage-class=STANDARD \
          --project=PROJECT_ID \
          --location=COMPUTE_REGION_TERRAFORM_STATE \
          --uniform-bucket-level-access
    gcloud storage buckets update gs://BUCKET_NAME --versioning
    

    Replace the following variables:

    • BUCKET_NAME: the name of the new Cloud Storage bucket.
    • PROJECT_ID: your Google Cloud project ID.
    • COMPUTE_REGION_TERRAFORM_STATE: the compute region where you want to store the state of the Terraform deployment.
  3. Update the blueprint deployment file. In the examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml file, fill in the following settings in the terraform_backend_defaults and vars sections to match the specific values for your deployment:

    • DEPLOYMENT_NAME: a unique name for the deployment. If the deployment name isn't unique within a project, cluster creation fails.
    • BUCKET_NAME: the name of the Cloud Storage bucket you created in the previous step.
    • PROJECT_ID: your Google Cloud project ID.
    • COMPUTE_REGION: the compute region for the cluster.
    • COMPUTE_ZONE: the compute zone for the node pool of A3 Mega machines.
    • NODE_COUNT: the number of A3 Mega nodes in your cluster.
    • IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect to the cluster. This CIDR block must include the IP address of the machine from which you run Terraform. For more information, see How authorized networks work. To get the IP address for your host machine, run the following command.

      curl ifconfig.me
      
    • For the extended_reservation field, use one of the following, depending on whether you want to target specific blocks in a reservation when provisioning the node pool:

      • To place the node pool anywhere in the reservation, provide the name of your reservation (RESERVATION_NAME).
      • To target a specific block within your reservation, use the reservation and block names in the following format:

        RESERVATION_NAME/reservationBlocks/BLOCK_NAME
        

        If you don't know which blocks are available in your reservation, see View a reservation topology, or use the gcloud sketch after these steps.

  4. To modify advanced blueprint settings, edit examples/gke-a3-megagpu/gke-a3-megagpu.yaml.

  5. Deploy the blueprint to provision the GKE infrastructure using A3 Mega machine types:

    cd ~/cluster-toolkit
    ./gcluster deploy -d \
    examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml \
    examples/gke-a3-megagpu/gke-a3-megagpu.yaml
    
  6. When prompted, select (A)pply to deploy the blueprint.

    • The blueprint creates VPC networks, service accounts, a GKE cluster, and a node pool.
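
If you want to inspect which blocks a reservation contains before filling in the extended_reservation field, the following is a sketch that assumes your gcloud release includes the reservations blocks commands (available in the beta component at the time of writing):

# List the blocks in a reservation; RESERVATION_NAME and ZONE are the values
# you used when creating the reservation.
gcloud beta compute reservations blocks list RESERVATION_NAME \
    --zone=ZONE \
    --project=PROJECT_ID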
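
After the deployment completes, you can optionally verify the cluster from the CLI. This is a minimal sketch; it assumes the cluster name that Cluster Toolkit derived from DEPLOYMENT_NAME, and the accelerator label value nvidia-h100-mega-80gb is an assumption about how GKE labels A3 Mega nodes.

# Fetch credentials for the new cluster (the cluster name is derived from DEPLOYMENT_NAME).
gcloud container clusters get-credentials CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --project=PROJECT_ID

# List the GPU nodes; the label value is assumed and may differ in your environment.
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-h100-mega-80gb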

Clean up resources created by Cluster Toolkit

To avoid recurring charges for the resources used on this page, clean up the resources provisioned by Cluster Toolkit, including the VPC networks and GKE cluster:

   cd ~/cluster-toolkit
   ./gcluster destroy CLUSTER_NAME/

Replace CLUSTER_NAME with the name of your cluster. For clusters created with Cluster Toolkit, the cluster name is based on the DEPLOYMENT_NAME value.
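
Because a reservation continues to incur charges after the cluster is destroyed, also delete the reservation when you no longer need it. For example:

# Delete the reservation to stop reservation charges
# (use your own reservation name, project, and zone).
gcloud compute reservations delete RESERVATION_NAME \
    --project=PROJECT_ID \
    --zone=ZONE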