Create a Hypercompute Cluster with GKE with default configuration

This page shows you how to create your own Hypercompute Cluster with Google Kubernetes Engine (GKE) to support your AI and ML workloads, using A3 Ultra GPUs.

GKE is the open, portable, extensible, and highly scalable platform for Hypercompute Cluster. GKE provides a single platform surface to run a diverse set of workloads for your organization's needs. This includes high performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services. GKE reduces the operational burden of managing multiple platforms.

The following instructions use Cluster Toolkit, which lets you create your GKE cluster quickly while incorporating best practices. Through Cluster Toolkit, you have access to reference design blueprints that codify the Hypercompute Cluster environment on GKE including compute, storage, and networking resources. Additionally, Cluster Toolkit sets up the cluster to use GPUDirect RDMA-over-Converged-Ethernet (RoCE) for distributed AI workloads.

Alternatively, for greater flexibility in configuring your cluster based on the needs of your workload, you can choose to create your GKE cluster manually. To create a Hypercompute Cluster with GKE manually, see Create a custom Hypercompute Cluster with GKE.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
  • Ensure that you have enough quota for A3 Ultra GPUs. To request more quota, follow the instructions in GPU quota. To ensure that your cluster has capacity, you can follow the instructions to reserve capacity.

Requirements

The following requirements apply to GKE Hypercompute Cluster:

  • The H200 GPUs in A3 Ultra VMs require GPU driver version 550 or later, which is available in GKE 1.31 as the latest driver version. For A3 Ultra, you must set gpu-driver-version=latest with GKE 1.31.
  • For A3 Ultra node pools, you must set the disk type to hyperdisk-balanced.
  • To use GPUDirect RDMA, use GKE patch version 1.31.4-gke.1183000 or higher.
  • To use GPUDirect RDMA, the GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.

Reserve capacity

To ensure that your workloads have the A3 Ultra GPU resources required for these instructions, you can create a future reservation request. With this request, you can reserve blocks of capacity for a defined duration in the future. At that date and time in the future, Compute Engine automatically provisions the blocks of capacity by creating on-demand reservations that you can immediately consume by provisioning node pools for this cluster.

Additionally, as your reserved capacity might span multiple blocks, we recommend that you create GKE nodes on a specific block within your reservation.

Complete the following steps to request capacity and gather the information required to create nodes on a specific block within your reservation:

  1. Request capacity.

  2. Query the topology information for GPUs provisioned as dense reservations:

    gcloud beta compute reservations blocks list RESERVATION_NAME \
        --zone=COMPUTE_ZONE | grep "selfLink"
    

    Replace the following:

    • RESERVATION_NAME: the name of your reservation.
    • COMPUTE_ZONE: the compute zone of your reservation.
  3. Extract the string from the output that is in the following format:

    RESERVATION_NAME/reservationBlocks/BLOCK_NAME
    

    Use this string when provisioning GKE node pools to target specific blocks within a reservation.
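
Steps 2 and 3 can be combined in a small shell helper. The following is a minimal sketch, not part of gcloud or Cluster Toolkit: extract_block_path is a hypothetical function name, and it assumes that each selfLink value ends in /reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME. The sample URL is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gcloud or Cluster Toolkit).
# Reads lines containing selfLink URLs on stdin and prints the
# RESERVATION_NAME/reservationBlocks/BLOCK_NAME suffix of each one.
# Lines without that suffix are skipped.
extract_block_path() {
  sed -n 's|.*/reservations/\([^/]*/reservationBlocks/[^/[:space:]]*\).*|\1|p'
}

# Demo on a made-up selfLink line (the URL layout is an assumption).
sample='selfLink: https://www.googleapis.com/compute/beta/projects/my-project/zones/my-zone/reservations/my-reservation/reservationBlocks/my-block'
printf '%s\n' "$sample" | extract_block_path
# prints: my-reservation/reservationBlocks/my-block
```

With real output, you could pipe the command from step 2 into the helper instead of using grep: gcloud beta compute reservations blocks list RESERVATION_NAME --zone=COMPUTE_ZONE | extract_block_path.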

Create a cluster using Cluster Toolkit

This section guides you through the cluster creation process, ensuring that your project follows best practices and meets the requirements for GKE Hypercompute Cluster.

  1. Launch Cloud Shell. You can use a different environment; however, we recommend Cloud Shell because the dependencies for Cluster Toolkit are preinstalled. If you don't want to use Cloud Shell, follow the instructions to install dependencies to prepare a different environment.
  2. Clone the Cluster Toolkit from the git repository:

    cd ~
    git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
    
  3. Install the Cluster Toolkit:

    cd cluster-toolkit && git checkout main && make
    
  4. Create a Cloud Storage bucket to store the state of the Terraform deployment:

    gcloud storage buckets create gs://BUCKET_NAME \
     --default-storage-class=STANDARD \
     --location=COMPUTE_REGION \
     --uniform-bucket-level-access
    gcloud storage buckets update gs://BUCKET_NAME --versioning
    

    Replace the following variables:

    • BUCKET_NAME: the name of the new Cloud Storage bucket.
    • COMPUTE_REGION: the compute region where you want to store the state of the Terraform deployment.
  5. In the cluster-toolkit directory, fetch the branch that contains the A3 Ultra blueprint and associated files:

    git fetch origin a3ultra-preview
    
  6. Check out the blueprint directory from the a3ultra-preview branch into your working tree:

    git checkout origin/a3ultra-preview -- examples/gke-a3-ultragpu
    
  7. In the examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml file, replace the following variables in the terraform_backend_defaults and vars sections to match the specific values for your deployment:

    • BUCKET_NAME: the name of the Cloud Storage bucket that you created in step 4.
    • PROJECT_ID: your Google Cloud project ID.
    • COMPUTE_REGION: the compute region for the cluster.
    • COMPUTE_ZONE: the compute zone for the node pool of A3 Ultra machines.
    • IP_ADDRESS/SUFFIX: the IP address range that you want to allow to connect to the cluster. This CIDR block must include the IP address of the machine that runs Terraform.
    • RESERVATION_NAME: the name of your reservation.
    • BLOCK_NAME: the name of a specific block within the reservation.
    • NODE_COUNT: the number of A3 Ultra nodes in your cluster.

    To modify advanced settings, edit examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml.

  8. Generate Application Default Credentials (ADC) to provide access to Terraform, for example by running gcloud auth application-default login.

  9. Deploy the blueprint to provision the GKE infrastructure using A3 Ultra machine types:

    cd ~/cluster-toolkit
    ./gcluster deploy -d \
     examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml \
     examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml
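
Before you deploy, it can help to confirm that every placeholder from step 7 was replaced. The following is a minimal sketch under one assumption: that the placeholders appear in the deployment file as the literal tokens listed in step 7. The function name unreplaced_placeholders and the demo file contents are hypothetical and not part of Cluster Toolkit.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of Cluster Toolkit).
# Prints "LINE:TOKEN" for every placeholder from step 7 still in the file.
# Exits non-zero when none are found, following grep's convention.
unreplaced_placeholders() {
  grep -n -o -w -E \
    'BUCKET_NAME|PROJECT_ID|COMPUTE_REGION|COMPUTE_ZONE|IP_ADDRESS/SUFFIX|RESERVATION_NAME|BLOCK_NAME|NODE_COUNT' \
    "$1"
}

# Demo on a throwaway file with made-up keys; one value is still a placeholder.
demo_file=$(mktemp)
printf 'bucket: BUCKET_NAME\nproject: my-project\n' > "$demo_file"
unreplaced_placeholders "$demo_file"   # prints: 1:BUCKET_NAME
rm -f "$demo_file"
```

To check the real file, you could run unreplaced_placeholders examples/gke-a3-ultragpu/gke-a3-ultragpu-deployment.yaml from the cluster-toolkit directory; no output means that no listed token remains.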
    

Deploy and run a NCCL test

To validate the functionality of the provisioned cluster, you can run a NCCL test: either a basic two-node test or, if you have a large number of nodes, the recommended NCCL test with Topology Aware Scheduling (TAS).

Two-node test

  1. Connect to your cluster:

    gcloud container clusters get-credentials gke-a3ultra
    
  2. To deploy a NCCL test workload of two test pods running on two A3 Ultra nodes, apply the following manifest:

    kubectl apply -f ~/cluster-toolkit/examples/gke-a3-ultragpu/nccl-test.yaml
    
  3. Trigger a NCCL all-gather test for the A3 Ultra nodes:

    kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2
    

    The output should be similar to the following:

    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            1024            16     float    none      -1    56.00    0.02    0.02      0    55.59    0.02    0.02      0
            2048            32     float    none      -1    55.79    0.04    0.03      0    55.57    0.04    0.03      0
            4096            64     float    none      -1    56.29    0.07    0.07      0    57.35    0.07    0.07      0
            8192           128     float    none      -1    56.44    0.15    0.14      0    56.32    0.15    0.14      0
           16384           256     float    none      -1    57.57    0.28    0.27      0    57.60    0.28    0.27      0
           32768           512     float    none      -1    57.92    0.57    0.53      0    59.35    0.55    0.52      0
           65536          1024     float    none      -1    59.92    1.09    1.03      0    60.15    1.09    1.02      0
          131072          2048     float    none      -1    59.21    2.21    2.08      0    61.82    2.12    1.99      0
          262144          4096     float    none      -1    63.58    4.12    3.87      0    63.34    4.14    3.88      0
          524288          8192     float    none      -1    64.89    8.08    7.57      0    65.09    8.06    7.55      0
         1048576         16384     float    none      -1    80.90   12.96   12.15      0    77.49   13.53   12.69      0
         2097152         32768     float    none      -1    80.22   26.14   24.51      0    79.88   26.25   24.61      0
         4194304         65536     float    none      -1    82.86   50.62   47.45      0    82.47   50.86   47.68      0
         8388608        131072     float    none      -1    95.83   87.53   82.06      0    93.27   89.94   84.32      0
        16777216        262144     float    none      -1    122.8  136.58  128.04      0    121.7  137.86  129.24      0
        33554432        524288     float    none      -1    180.6  185.75  174.14      0    179.2  187.19  175.49      0
        67108864       1048576     float    none      -1    279.7  239.90  224.90      0    277.0  242.26  227.12      0
       134217728       2097152     float    none      -1    507.5  264.46  247.93      0    485.1  276.66  259.37      0
       268435456       4194304     float    none      -1    866.3  309.88  290.51      0    864.0  310.70  291.28      0
       536870912       8388608     float    none      -1   1576.1  340.62  319.33      0   1558.2  344.54  323.01      0
      1073741824      16777216     float    none      -1   3096.6  346.75  325.08      0   3047.5  352.33  330.31      0
      2147483648      33554432     float    none      -1   6148.0  349.30  327.47      0   6034.3  355.88  333.64      0
      4294967296      67108864     float    none      -1    12226  351.29  329.33      0    12000  357.92  335.55      0
      8589934592     134217728     float    none      -1    24391  352.17  330.16      0    23920  359.11  336.67      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 120.94
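
If you save the test output, you can pull out the summary value instead of reading the table by eye. The following is a minimal sketch: avg_busbw and busbw_at_least are hypothetical helper names, and the pass threshold is whatever you pass in, because this page does not define one.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the NCCL test scripts).

# Print the number from the "# Avg bus bandwidth" summary line on stdin.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Succeed only if the average bus bandwidth is at least $1.
# The threshold is chosen by the caller; the source does not define one.
busbw_at_least() {
  awk -F: -v min="$1" '
    /Avg bus bandwidth/ { found = 1; ok = ($2 + 0 >= min + 0) }
    END { exit ((found && ok) ? 0 : 1) }'
}

# Demo on the two summary lines from the sample output.
summary='# Out of bounds values : 0 OK
# Avg bus bandwidth    : 120.94'
printf '%s\n' "$summary" | avg_busbw   # prints: 120.94
```

For example, you could redirect the output of the command in step 3 to a file and then run busbw_at_least 100 < nccl.log, where both the file name and the value 100 are illustrative.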
    

Test with Topology Aware Scheduling (TAS)

If you have a larger number of nodes, we recommend using the following test, which uses TAS:

  1. Connect to your cluster:

    gcloud container clusters get-credentials gke-a3ultra
    
  2. Deploy an all-gather NCCL performance test with Topology Aware Scheduling enabled by using the nccl-jobset-example.yaml file.

    By default, this test uses four nodes. You can use a different number of nodes, with a minimum of three; however, powers of two are recommended. To change the number of nodes, modify the YAML file to change the following values from 4:

    • parallelism
    • completions
    • N_NODES

    Create the resources to run the test:

    kubectl create -f ~/cluster-toolkit/examples/gke-a3-ultragpu/nccl-jobset-example.yaml
    

    This command returns a JobSet name.

    The output should be similar to the following:

    jobset.jobset.x-k8s.io/all-gather8t7dt created
    
  3. To find the Pods that the test created, list the Pods in the current namespace:

    kubectl get pods
    

    The output should be similar to the following:

    NAME                          READY   STATUS      RESTARTS   AGE
    all-gather8t7dt-w-0-0-n9s6j   0/1     Completed   0          9m34s
    all-gather8t7dt-w-0-1-rsf7r   0/1     Completed   0          9m34s
    
  4. Find a Pod name matching the pattern jobset-name-w-0-0-*. The logs of this Pod contain the results of the NCCL test.

    To fetch the logs for this Pod, run this command:

    kubectl logs all-gather8t7dt-w-0-0-n9s6j
    

    The output should be similar to the following:

    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            1024            16     float    none      -1    54.07    0.02    0.02      0    55.80    0.02    0.02      0
            2048            32     float    none      -1    55.46    0.04    0.03      0    55.31    0.04    0.03      0
            4096            64     float    none      -1    55.59    0.07    0.07      0    55.38    0.07    0.07      0
            8192           128     float    none      -1    56.05    0.15    0.14      0    55.92    0.15    0.14      0
           16384           256     float    none      -1    57.08    0.29    0.27      0    57.75    0.28    0.27      0
           32768           512     float    none      -1    57.49    0.57    0.53      0    57.22    0.57    0.54      0
           65536          1024     float    none      -1    59.20    1.11    1.04      0    59.20    1.11    1.04      0
          131072          2048     float    none      -1    59.58    2.20    2.06      0    63.57    2.06    1.93      0
          262144          4096     float    none      -1    63.87    4.10    3.85      0    63.61    4.12    3.86      0
          524288          8192     float    none      -1    64.83    8.09    7.58      0    64.40    8.14    7.63      0
         1048576         16384     float    none      -1    79.74   13.15   12.33      0    76.66   13.68   12.82      0
         2097152         32768     float    none      -1    78.41   26.74   25.07      0    79.05   26.53   24.87      0
         4194304         65536     float    none      -1    83.21   50.41   47.26      0    81.25   51.62   48.39      0
         8388608        131072     float    none      -1    94.35   88.91   83.35      0    99.07   84.68   79.38      0
        16777216        262144     float    none      -1    122.9  136.55  128.02      0    121.7  137.83  129.21      0
        33554432        524288     float    none      -1    184.2  182.19  170.80      0    178.1  188.38  176.60      0
        67108864       1048576     float    none      -1    294.7  227.75  213.51      0    277.7  241.62  226.52      0
       134217728       2097152     float    none      -1    495.4  270.94  254.00      0    488.8  274.60  257.43      0
       268435456       4194304     float    none      -1    877.5  305.92  286.80      0    861.3  311.65  292.17      0
       536870912       8388608     float    none      -1   1589.8  337.71  316.60      0   1576.2  340.61  319.33      0
      1073741824      16777216     float    none      -1   3105.7  345.74  324.13      0   3069.2  349.85  327.98      0
      2147483648      33554432     float    none      -1   6161.7  348.52  326.74      0   6070.7  353.75  331.64      0
      4294967296      67108864     float    none      -1    12305  349.03  327.22      0    12053  356.35  334.08      0
      8589934592     134217728     float    none      -1    24489  350.77  328.85      0    23991  358.05  335.67      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 120.248
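
Selecting the Pod in step 4 can be scripted. The following is a minimal sketch: first_worker_pod is a hypothetical helper name, and it assumes the default kubectl get pods table layout, with the Pod name in the first column.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kubectl or JobSet).
# Reads `kubectl get pods` output on stdin and prints the first Pod whose
# name starts with JOBSET_NAME-w-0-0-, where JOBSET_NAME is argument $1.
first_worker_pod() {
  awk -v prefix="$1-w-0-0-" 'index($1, prefix) == 1 { print $1; exit }'
}

# Demo on the sample listing from step 3.
listing='NAME                          READY   STATUS      RESTARTS   AGE
all-gather8t7dt-w-0-0-n9s6j   0/1     Completed   0          9m34s
all-gather8t7dt-w-0-1-rsf7r   0/1     Completed   0          9m34s'
printf '%s\n' "$listing" | first_worker_pod all-gather8t7dt
# prints: all-gather8t7dt-w-0-0-n9s6j
```

Against a live cluster, this could be combined with the logs command: kubectl logs "$(kubectl get pods | first_worker_pod JOBSET_NAME)", where JOBSET_NAME is the name returned in step 2.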
    

Run reproducible benchmarks

You can reproduce pre-training benchmarks for large open machine learning models on A3 Ultra GPUs on GKE.

Each recipe provides you with the instructions to complete the following tasks:

  • Prepare your environment.
  • Run the benchmark.
  • Analyze the benchmark results, including detailed logs for further analysis.

To view all the recipes available, see the GPU recipes repository.

Models           Framework    Recipe
Llama-3.1-70B    MaxText      32 node workload
Llama-3.1-70B    NeMo         32 node workload
Mixtral-8x7B     MaxText      32 node workload
Mixtral-8x7B     NeMo         32 node workload

Clean up

To avoid recurring charges for the resources used on this page, clean up the resources provisioned by Cluster Toolkit, including the VPC networks and GKE cluster:

   ./gcluster destroy gke-a3-ultra/

What's next