Plan TPUs in GKE


This page describes how to plan your usage of Tensor Processing Units (TPUs) in Google Kubernetes Engine (GKE) to reduce the risk of TPU misconfiguration, non-availability errors, or out-of-quota interruptions.

Before you use TPUs in GKE, ensure that you are familiar with TPU definitions and terminology in GKE.

Plan your TPU configuration

To work with TPUs in GKE clusters, you must plan their configuration. We recommend that you follow these steps:

  1. Choose a GKE mode of operation: Run your workloads on TPUs in a GKE Autopilot or Standard cluster.

    Best practice:

    Use an Autopilot cluster for a fully managed Kubernetes experience.

  2. Choose the TPU version: Different TPU types have different capabilities like price-performance ratios, training throughput, and serving latency. The TPU types affect the available CPU and memory capacities.

  3. Validate TPU availability: TPUs are available in specific Google Cloud regions. To use a TPU type in your GKE workload, your cluster must be in a supported region for that type.

  4. Choose the TPU topology: The topology is the physical arrangement of the TPUs within a TPU slice. Select a topology that matches your model's parallelism requirements.

Use the reference tables on this page to identify if your node pools are single-host or multi-host TPU slice nodes.

Choose a GKE mode of operation

You can use TPUs in the available GKE modes of operation for clusters:

  • Autopilot mode (recommended): GKE manages the underlying infrastructure such as node configuration, autoscaling, auto-upgrades, baseline security configurations, and baseline networking configuration. In Autopilot, you choose a TPU type and topology, then specify them in your Kubernetes manifest. GKE manages provisioning nodes with TPUs and scheduling your workloads.
  • Standard mode: You manage the underlying infrastructure, including configuring the individual nodes.

To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Choose the TPU version

The VMs in a TPU slice have the following technical characteristics.

Autopilot

TPU version | Machine type | Number of vCPUs | Memory (GiB) | Number of NUMA nodes | Maximum TPU chips in a TPU slice node
TPU Trillium (v6e) (Preview) | tpu-v6e-slice | 44 to 180 | 176 to 1440 | 1 to 2 | 256
TPU v5p | tpu-v5p-slice | 208 | 448 | 2 | 6,144
TPU v5e | tpu-v5-lite-podslice | 24 to 224 | 48 to 384 | 1 | 256
TPU v5e (single-host only) | tpu-v5-lite-device | 24 to 224 | 48 to 384 | 1 to 2 | 8
TPU v4 | tpu-v4-podslice | 240 | 407 | 2 | 4,096
TPU v3 (single-host only) | tpu-v3-device | 96 | 340 | 2 | 8
TPU v3 | tpu-v3-slice | 48 | 340 | 1 | 256

Standard

TPU version | Machine type | Number of vCPUs | Memory (GiB) | Number of NUMA nodes | Likelihood of being preempted
TPU Trillium (v6e) (Preview) | ct6e-standard-1t | 44 | 448 | 2 | Higher
TPU v6e (Preview) | ct6e-standard-4t | 180 | 720 | 1 | Medium
TPU v6e (Preview) | ct6e-standard-8t | 180 | 1440 | 2 | Lower
TPU v5p | ct5p-hightpu-4t | 208 | 448 | 2 |
TPU v5e | ct5l-hightpu-1t | 24 | 48 | 1 | Higher
TPU v5e | ct5l-hightpu-4t | 112 | 192 | 1 | Medium
TPU v5e | ct5l-hightpu-8t | 224 | 384 | 2 | Lower
TPU v5e | ct5lp-hightpu-1t | 24 | 48 | 1 | Higher
TPU v5e | ct5lp-hightpu-4t | 112 | 192 | 1 | Medium
TPU v5e | ct5lp-hightpu-8t | 224 | 384 | 1 | Low
TPU v4 | ct4p-hightpu-4t | 240 | 407 | 2 |
TPU v3 (single-host only) | ct3-hightpu-4t | 96 | 340 | 2 | 8
TPU v3 | ct3p-hightpu-4t | 48 | 340 | 1 | 256

Consider the following configurations when evaluating which machine type to use based on your model:

  • ct5l- machine types are suitable for serving small-to-medium size models, and are less suitable for large models. The ct5l- machine types are single-host and therefore don't have any high-speed interconnect links between multiple hosts.
  • Multi-host ct5lp- machine types are more suitable for serving large models or training. Multi-host ct5lp- machines are interconnected with high-speed links.

Review the TPU specifications and pricing in the Cloud TPU pricing documentation to decide which TPU configuration to use.

Limitations

Consider these limitations when choosing the TPU to use:

  • TPU v6e is in Preview and available in the following versions:
    • Standard clusters in version 1.31.1-gke.1846000 and later.
    • Autopilot clusters in version 1.31.2-gke.1115000 and later.
  • TPU v6e doesn't support setting SMT to 2 on ct6e-standard-8t.
  • GKE cost allocation and usage metering don't include any data about the usage or costs of reserved TPU v4.
  • TPU v5p and v5e don't support riptide/image streaming in us-east5.
  • TPU v5p autoscaling is supported on GKE clusters with control planes running at least version 1.29.2-gke.1035000 or 1.28.7-gke.1020000.
  • For capacity reservations, use a specific reservation.
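
The version minimums in this list can be checked programmatically. The following Python sketch compares a cluster version string against the TPU v6e minimums quoted above. It assumes that GKE versions follow the MAJOR.MINOR.PATCH-gke.BUILD pattern and that comparing the four numbers in order matches how GKE orders its versions. The helper names are illustrative and are not part of any Google Cloud library.

```python
import re

# Minimum versions copied from the limitations list above.
V6E_MINIMUMS = {
    "standard": "1.31.1-gke.1846000",
    "autopilot": "1.31.2-gke.1115000",
}


def parse_gke_version(version: str) -> tuple:
    """Split a version such as '1.31.1-gke.1846000' into four integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version: str, minimum: str) -> bool:
    """Return True if `version` is the same as or newer than `minimum`."""
    return parse_gke_version(version) >= parse_gke_version(minimum)


print(meets_minimum("1.31.2-gke.1115000", V6E_MINIMUMS["standard"]))
```

For limitations that list one minimum per minor version, such as the TPU v5p autoscaling entry, a comparison like this would need to be applied to the minimum that matches the cluster's minor version.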

Validate TPU availability in GKE

TPUs are available in specific Google Cloud regions. To use a TPU type in your GKE cluster, your cluster must be in a supported region for that type.

Autopilot

See TPU regions and zones in the Cloud TPU documentation.

Standard

The following table lists the TPU availability for each TPU version and machine type:

TPU version | Machine type beginning with | Minimum GKE version | Availability | Zone
TPU v6e | ct6e- | 1.31.2-gke.1115000 | Preview | us-east5-b, europe-west4-a, us-east1-d, asia-northeast1-b, us-south1-a
TPU v5e | ct5l- | 1.27.2-gke.2100 | Generally Available | europe-west4-b, us-central1-a
TPU v5e | ct5lp- | 1.27.2-gke.2100 | Generally Available | europe-west4-a, us-central1-a, us-east1-c, us-east5-b, us-west1-c, us-west4-a, us-west4-b
TPU v5p | ct5p- | 1.28.3-gke.1024000 | Generally Available | us-east1-d, us-east5-a, us-east5-c
TPU v4 | ct4p- | 1.26.1-gke.1500 | Generally Available | us-central2-b
TPU v3 | ct3p- | 1.31.1-gke.1146000 | Generally Available | us-east1-d, europe-west4-a
TPU v3 | ct3- | 1.31.0-gke.1500 | Generally Available | us-east1-d, europe-west4-a, us-central1-a, us-central1-b, us-central1-f

Consider the following caveats when configuring a TPU:

  • In certain zones (europe-west4-a, us-east5-b, and us-west4-b), you can create a single-host TPU v5e node pool with a machine type beginning with ct5lp-, but not with a machine type beginning with ct5l-. In those zones, you can use ct5lp-hightpu-4t with a topology of 2x4 or larger.
  • To create a single-host TPU v5e in the us-west4 region, choose the zone us-west4-a and use machine types beginning with ct5lp-, such as ct5lp-hightpu-1t.
  • To create a single-host TPU v5e in the other regions listed in the preceding table, use machine types beginning with ct5l- (such as ct5l-hightpu-1t, ct5l-hightpu-4t, or ct5l-hightpu-8t).
  • Machine types beginning with ct5l- require different quota than machine types beginning with ct5lp-.
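
To make the zone constraints easier to check before you create a node pool, the following Python sketch encodes the Standard availability table as a lookup. The zone lists are a snapshot copied from the table on this page and might change, and the function names are hypothetical.

```python
# Zones per machine type prefix, copied from the Standard availability table.
ZONES_BY_PREFIX = {
    "ct6e-": ["us-east5-b", "europe-west4-a", "us-east1-d",
              "asia-northeast1-b", "us-south1-a"],
    "ct5l-": ["europe-west4-b", "us-central1-a"],
    "ct5lp-": ["europe-west4-a", "us-central1-a", "us-east1-c", "us-east5-b",
               "us-west1-c", "us-west4-a", "us-west4-b"],
    "ct5p-": ["us-east1-d", "us-east5-a", "us-east5-c"],
    "ct4p-": ["us-central2-b"],
    "ct3p-": ["us-east1-d", "europe-west4-a"],
    "ct3-": ["us-east1-d", "europe-west4-a", "us-central1-a",
             "us-central1-b", "us-central1-f"],
}


def zones_for(machine_type: str) -> list:
    """Return the zones listed for the family that `machine_type` belongs to."""
    for prefix, zones in ZONES_BY_PREFIX.items():
        # Each prefix ends with a hyphen, so 'ct5l-' never matches 'ct5lp-...'.
        if machine_type.startswith(prefix):
            return zones
    raise KeyError(f"no availability data for {machine_type!r}")


def is_available(machine_type: str, zone: str) -> bool:
    """Return True if the table lists `zone` for this machine type family."""
    return zone in zones_for(machine_type)


print(is_available("ct5lp-hightpu-1t", "us-west4-a"))
print(is_available("ct5l-hightpu-1t", "us-west4-a"))
```

The lookup reflects only the table. It doesn't model the additional single-host caveats in the preceding list or quota, which is tracked separately for ct5l- and ct5lp- machine types.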

Choose a topology

After you decide on a TPU version, select a topology that's supported by that TPU type. Depending on the TPU type, the topology is two- or three-dimensional. Your model's parallelism requirements help you to decide on a topology. You can identify the number of TPU chips in the slice by calculating the product of each size in the topology. For example:

  • 2x2x2 is an 8-chip multi-host TPU v4 slice
  • 2x2 is a 4-chip single-host TPU v5e slice
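
The chip count calculation described above can be sketched in a few lines of Python. The function name is illustrative only.

```python
from math import prod


def chips_in_topology(topology: str) -> int:
    """Multiply the sizes in a topology string such as '2x2x2'."""
    return prod(int(size) for size in topology.lower().split("x"))


for topology in ("2x2x2", "2x2", "16x16"):
    print(topology, chips_in_topology(topology))
```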

If a specific topology supports both single-host and multi-host TPU slice nodes, the number of TPU chips that your workload requests determines the host type.

For example, TPU v5e (tpu-v5-lite-podslice) supports the 2x4 topology as both single- and multi-host. If you:

  • Request 4 chips in your workload, you get a multi-host node that has 4 TPU chips.
  • Request 8 chips in your workload, you get a single-host node that has 8 TPU chips.

Use the following guidelines and tables to choose the TPU machine type and topology for your use case:

  • For small-scale model training or inference, use TPU v4 or TPU v5e with single-host TPU slice node pools.
  • For large-scale model training or inference, use TPU v4 or TPU v5e with multi-host TPU slice node pools.

Autopilot

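TPU version | Machine type | Topology | Number of TPU chips in a slice | Number of nodes | Node pool type
TPU v6e (Preview) | tpu-v6e-slice | 1x1 | 1 | 1 | Single-host
TPU v6e (Preview) | tpu-v6e-slice | 2x2 | 4 | 4 | Single-host
TPU v6e (Preview) | tpu-v6e-slice | 2x4 | 8 | 8 | Single-host
TPU v6e (Preview) | tpu-v6e-slice | 4x4 | 16 | 4 | Multi-host
TPU v6e (Preview) | tpu-v6e-slice | 4x8 | 32 | 8 | Multi-host
TPU v6e (Preview) | tpu-v6e-slice | 8x8 | 64 | 16 | Multi-host
TPU v6e (Preview) | tpu-v6e-slice | 8x16 | 128 | 32 | Multi-host
TPU v6e (Preview) | tpu-v6e-slice | 16x16 | 256 | 64 | Multi-host
TPU v5p | tpu-v5p-slice | 2x2x1 | 4 | 1 | Single-host
TPU v5p | tpu-v5p-slice | 2x2x2 | 8 | 2 | Multi-host
TPU v5p | tpu-v5p-slice | 2x2x4 | 16 | 4 | Multi-host
TPU v5p | tpu-v5p-slice | 2x4x4 | 32 | 8 | Multi-host
TPU v5p | tpu-v5p-slice | 4x4x4 | 64 | 16 | Multi-host
TPU v5p | tpu-v5p-slice | {A}x{B}x{C} | A*B*C | (A*B*C/4) [1] | Multi-host
TPU v5e | tpu-v5-lite-podslice [2] | 1x1 | 1 | 1 | Single-host
TPU v5e | tpu-v5-lite-podslice | 2x2 | 4 | 1 | Single-host
TPU v5e | tpu-v5-lite-podslice | 2x4 | 8 | 1 | Single-host
TPU v5e | tpu-v5-lite-podslice | 2x4 | 8 | 2 | Multi-host
TPU v5e | tpu-v5-lite-podslice | 4x4 | 16 | 4 | Multi-host
TPU v5e | tpu-v5-lite-podslice | 4x8 | 32 | 8 | Multi-host
TPU v5e | tpu-v5-lite-podslice | 8x8 | 64 | 16 | Multi-host
TPU v5e | tpu-v5-lite-podslice | 8x16 | 128 | 32 | Multi-host
TPU v5e | tpu-v5-lite-podslice | 16x16 | 256 | 64 | Multi-host
TPU v5e (single-host only) | tpu-v5-lite-device | 1x1 | 1 | 1 | Single-host
TPU v5e (single-host only) | tpu-v5-lite-device | 2x2 | 4 | 1 | Single-host
TPU v5e (single-host only) | tpu-v5-lite-device | 2x4 | 8 | 1 | Single-host
TPU v4 | tpu-v4-podslice [2] | 2x2x1 | 4 | 1 | Single-host
TPU v4 | tpu-v4-podslice | 2x2x2 | 8 | 2 | Multi-host
TPU v4 | tpu-v4-podslice | 2x2x4 | 16 | 4 | Multi-host
TPU v4 | tpu-v4-podslice | 2x4x4 | 32 | 8 | Multi-host
TPU v4 | tpu-v4-podslice | 4x4x4 | 64 | 16 | Multi-host
TPU v4 | tpu-v4-podslice | {A}x{B}x{C} | A*B*C | (A*B*C/4) [1] | Multi-host
TPU v3 | tpu-v3-slice | 4x4 | 16 | 2 | Multi-host
TPU v3 | tpu-v3-slice | 4x8 | 32 | 4 | Multi-host
TPU v3 | tpu-v3-slice | 8x8 | 64 | 8 | Multi-host
TPU v3 | tpu-v3-slice | 8x16 | 128 | 16 | Multi-host
TPU v3 | tpu-v3-slice | 16x16 | 256 | 32 | Multi-host
TPU v3 | tpu-v3-device | 2x2 | 4 | 1 | Single-host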
  1. Calculated by the topology product divided by four.

    Custom topologies for more than 64 chips are supported. The following conditions apply:

    • For more than 64 chips, {A}, {B}, and {C} must be multiples of 4
    • The largest topology is 16x16x24
    • The values must satisfy {A} ≤ {B} ≤ {C}, like 8x12x16.
  2. Custom topologies aren't supported.
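
The conditions in note 1 can be sketched as a small Python validator that also returns the node count (the topology product divided by four). This is an illustration under stated assumptions, not GKE's own validation: it assumes that the three values must be in non-decreasing order, as the 8x12x16 example suggests, and it reads "the largest topology is 16x16x24" as a per-dimension upper bound. The function name is hypothetical.

```python
from math import prod

# Assumption: 16x16x24 is treated as an upper bound on each dimension.
LARGEST_TOPOLOGY = (16, 16, 24)


def nodes_for_topology(topology: str) -> int:
    """Validate an {A}x{B}x{C} topology and return chips divided by four."""
    dims = tuple(int(size) for size in topology.lower().split("x"))
    if len(dims) != 3:
        raise ValueError("expected three dimensions, like 8x12x16")
    chips = prod(dims)
    if chips % 4:
        raise ValueError("the chip count must be divisible by four")
    if chips > 64:
        # The custom-topology conditions apply only above 64 chips.
        if any(size % 4 for size in dims):
            raise ValueError("each dimension must be a multiple of 4")
        if any(size > limit for size, limit in zip(dims, LARGEST_TOPOLOGY)):
            raise ValueError("topology exceeds 16x16x24")
        if list(dims) != sorted(dims):
            raise ValueError("dimensions must be in non-decreasing order")
    return chips // 4


print(nodes_for_topology("8x12x16"))
```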

After you choose a TPU type and topology, specify these in your workload manifest. For instructions, see Deploy TPU workloads on GKE Autopilot.
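
As a rough illustration of what such a manifest contains, the following Python sketch builds a Pod manifest as a dictionary and prints it as JSON, which kubectl also accepts. It assumes the node selector labels cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology and the google.com/tpu resource name, as described in the GKE TPU deployment guides. The Pod name, container name, and image are hypothetical placeholders. Treat the linked instructions as authoritative.

```python
import json


def tpu_pod_manifest(name: str, accelerator: str, topology: str,
                     chips: int, image: str) -> dict:
    """Build a Pod manifest that selects a TPU type and topology."""
    tpu_resources = {"google.com/tpu": chips}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumed label keys; verify them against the deployment guide.
            "nodeSelector": {
                "cloud.google.com/gke-tpu-accelerator": accelerator,
                "cloud.google.com/gke-tpu-topology": topology,
            },
            "containers": [{
                "name": "tpu-workload",  # hypothetical container name
                "image": image,
                "resources": {
                    "requests": dict(tpu_resources),
                    "limits": dict(tpu_resources),
                },
            }],
        },
    }


# Requesting 8 chips with a 2x4 topology corresponds to a single-host node.
manifest = tpu_pod_manifest(
    "tpu-example", "tpu-v5-lite-podslice", "2x4", 8,
    "example.com/trainer:latest",  # hypothetical image
)
print(json.dumps(manifest, indent=2))
```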

Standard

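TPU version | Machine type | Topology | Number of TPU chips | Number of VMs | Node pool type
TPU v6e (Preview) | ct6e-standard-1t | 1x1 | 1 | 1 | Single-host
TPU v6e (Preview) | ct6e-standard-8t | 2x4 | 8 | 1 | Single-host
TPU v6e (Preview) | ct6e-standard-4t | 2x2 | 4 | 1 | Single-host
TPU v6e (Preview) | ct6e-standard-4t | 2x4 | 8 | 2 | Multi-host
TPU v6e (Preview) | ct6e-standard-4t | 4x4 | 16 | 4 | Multi-host
TPU v6e (Preview) | ct6e-standard-4t | 4x8 | 32 | 8 | Multi-host
TPU v6e (Preview) | ct6e-standard-4t | 8x8 | 64 | 16 | Multi-host
TPU v6e (Preview) | ct6e-standard-4t | 8x16 | 128 | 32 | Multi-host
TPU v6e (Preview) | ct6e-standard-4t | 16x16 | 256 | 64 | Multi-host
TPU v5p | ct5p-hightpu-4t | 2x2x1 | 4 | 1 | Single-host
TPU v5p | ct5p-hightpu-4t | 2x2x2 | 8 | 2 | Multi-host
TPU v5p | ct5p-hightpu-4t | 2x2x4 | 16 | 4 | Multi-host
TPU v5p | ct5p-hightpu-4t | 2x4x4 | 32 | 8 | Multi-host
TPU v5p | ct5p-hightpu-4t | {A}x{B}x{C} | A*B*C | (A*B*C/4) [1] | Multi-host
TPU v5e | ct5l-hightpu-1t | 1x1 | 1 | 1 | Single-host
TPU v5e | ct5l-hightpu-4t | 2x2 | 4 | 1 | Single-host
TPU v5e | ct5l-hightpu-8t | 2x4 | 8 | 1 | Single-host
TPU v5e | ct5lp-hightpu-1t | 1x1 | 1 | 1 | Single-host
TPU v5e | ct5lp-hightpu-4t | 2x2 | 4 | 1 | Single-host
TPU v5e | ct5lp-hightpu-8t | 2x4 | 8 | 1 | Single-host
TPU v5e | ct5lp-hightpu-4t | 2x4 | 8 | 2 | Multi-host
TPU v5e | ct5lp-hightpu-4t | 4x4 | 16 | 4 | Multi-host
TPU v5e | ct5lp-hightpu-4t | 4x8 | 32 | 8 | Multi-host
TPU v5e | ct5lp-hightpu-4t | 8x8 | 64 | 16 | Multi-host
TPU v5e | ct5lp-hightpu-4t | 8x16 | 128 | 32 | Multi-host
TPU v5e | ct5lp-hightpu-4t | 16x16 | 256 | 64 | Multi-host
TPU v4 | ct4p-hightpu-4t | 2x2x1 | 4 | 1 | Single-host
TPU v4 | ct4p-hightpu-4t | 2x2x2 | 8 | 2 | Multi-host
TPU v4 | ct4p-hightpu-4t | 2x2x4 | 16 | 4 | Multi-host
TPU v4 | ct4p-hightpu-4t | 2x4x4 | 32 | 8 | Multi-host
TPU v4 | ct4p-hightpu-4t | {A}x{B}x{C} | A*B*C | (A*B*C/4) [1] | Multi-host
TPU v3 | ct3-hightpu-4t | 2x2 | 4 | 1 | Single-host
TPU v3 | ct3p-hightpu-4t | 4x4 | 16 | 4 | Multi-host
TPU v3 | ct3p-hightpu-4t | 4x8 | 32 | 8 | Multi-host
TPU v3 | ct3p-hightpu-4t | 8x8 | 64 | 16 | Multi-host
TPU v3 | ct3p-hightpu-4t | 8x16 | 128 | 32 | Multi-host
TPU v3 | ct3p-hightpu-4t | 16x16 | 256 | 64 | Multi-host
TPU v3 | ct3p-hightpu-4t | 16x32 | 512 | 128 | Multi-host
TPU v3 | ct3p-hightpu-4t | 32x32 | 1024 | 256 | Multi-host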
  1. Calculated by the topology product divided by four.

Advanced configurations

The following sections describe scheduling best practices for advanced TPU configurations.

TPU reservation

TPU reservations are available when purchasing a commitment. Any TPU reservation can be used with GKE.

When creating a TPU slice node pool, use the --reservation and --reservation-affinity=specific flags to consume a reserved TPU instance.
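
The following Python sketch assembles such a command as an argument list so that it can be reviewed or logged before it runs. The --reservation and --reservation-affinity=specific flags come from the text above. The other flags (--cluster, --location, --machine-type, --tpu-topology) are assumed from the gcloud container node-pools create reference and should be verified there. All resource names are hypothetical.

```python
import shlex
from typing import List, Optional


def reserved_node_pool_command(pool: str, cluster: str, location: str,
                               machine_type: str, reservation: str,
                               topology: Optional[str] = None) -> List[str]:
    """Build a gcloud command that creates a node pool from a reservation."""
    args = [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--location={location}",
        f"--machine-type={machine_type}",
        # Consume a specific reservation, as described above.
        "--reservation-affinity=specific",
        f"--reservation={reservation}",
    ]
    if topology is not None:
        args.append(f"--tpu-topology={topology}")
    return args


command = reserved_node_pool_command(
    "tpu-pool", "my-cluster", "us-central2-b",  # hypothetical names
    "ct4p-hightpu-4t", "my-tpu-reservation", topology="2x2x2",
)
print(shlex.join(command))
```

Building the command as a list avoids shell quoting problems if you later pass it to subprocess.run.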

Autoscaling TPUs in GKE

GKE supports TPUs to accelerate machine learning workloads. Both single-host and multi-host TPU slice node pools support autoscaling and auto-provisioning.

With the --enable-autoprovisioning flag on a GKE cluster, GKE creates or deletes single-host or multi-host TPU slice node pools with a TPU version and topology that meet the requirements of pending workloads.

When you use --enable-autoscaling, GKE scales the node pool based on its type, as follows:

  • Single-host TPU slice node pool: GKE adds or removes TPU nodes in the existing node pool. The node pool may contain any number of TPU nodes between zero and the maximum size of the node pool as determined by the --max-nodes and the --total-max-nodes flags. When the node pool scales, all the TPU nodes in the node pool have the same machine type and topology. To learn more about how to create a single-host TPU slice node pool, see Create a node pool.

  • Multi-host TPU slice node pool: GKE atomically scales up the node pool from zero to the number of nodes required to satisfy the TPU topology. For example, a TPU node pool with the machine type ct5lp-hightpu-4t and a topology of 16x16 contains 64 nodes. The GKE autoscaler ensures that this node pool has exactly 0 or 64 nodes. When scaling back down, GKE evicts all scheduled Pods and drains the entire node pool to zero. To learn more about how to create a multi-host TPU slice node pool, see Create a node pool.
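
The all-or-nothing sizing of a multi-host node pool follows from the topology arithmetic. The following Python sketch reproduces the 16x16 example, assuming four TPU chips per VM for ct5lp-hightpu-4t, as the Standard topology table implies. The function names are illustrative.

```python
from math import prod


def multi_host_node_count(topology: str, chips_per_vm: int = 4) -> int:
    """Return the number of nodes needed to satisfy a multi-host topology."""
    chips = prod(int(size) for size in topology.lower().split("x"))
    if chips % chips_per_vm:
        raise ValueError("the topology doesn't divide evenly across VMs")
    return chips // chips_per_vm


def allowed_pool_sizes(topology: str, chips_per_vm: int = 4) -> tuple:
    """A multi-host node pool is either empty or at its full size."""
    return (0, multi_host_node_count(topology, chips_per_vm))


print(allowed_pool_sizes("16x16"))
```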

CPU for Standard clusters

This section doesn't apply to Autopilot clusters because GKE places each TPU slice on its own node. To learn more, see How TPUs work in Autopilot mode.

For Standard clusters, consider the following scheduling best practices.

To schedule a non-TPU workload on a VM in a TPU slice node, ensure that your GKE Pod can tolerate the google.com/tpu taint. If you want the workload to be deployed to specific nodes, use node selectors.

Kubernetes resource management and priority treat VMs in TPUs the same as other VM types. To give scheduling priority to Pods that require TPUs over other Pods on the same nodes, request the maximum CPU or memory for those TPU slices. Low-priority TPU slices should do the following:

  1. Set low CPU and memory requests to ensure that the node has enough allocatable resources for the TPU workloads. To learn more, see How Kubernetes applies resource requests and limits.
  2. Set no CPU limit (unlimited) to ensure that Pods can burst to use all unused cycles.
  3. Set appropriate memory limits to ensure Pods can function correctly without risking node-pressure eviction.

If a Kubernetes Pod doesn't request CPU and memory (even if it requests TPUs), then Kubernetes considers it a best-effort Pod, and there is no guarantee that it receives any CPU or memory. Only Pods that explicitly request CPU and memory have such guarantees. For specific Kubernetes scheduling, configure the Pod with explicit CPU and memory requests. For more information, see Resource Management for Pods and Containers.
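
The guidance in this section can be sketched as a Pod manifest built in Python for a low-priority, non-TPU workload that shares a TPU slice node. The taint key google.com/tpu comes from the text above. The Pod name, image, and resource values are hypothetical and need tuning for a real workload.

```python
import json

helper_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "low-priority-helper"},  # hypothetical name
    "spec": {
        # Tolerate the TPU taint. Omitting "effect" matches every effect
        # that uses this key.
        "tolerations": [{"key": "google.com/tpu", "operator": "Exists"}],
        "containers": [{
            "name": "helper",
            "image": "example.com/helper:latest",  # hypothetical image
            "resources": {
                # Low requests leave allocatable resources for TPU workloads.
                "requests": {"cpu": "100m", "memory": "128Mi"},
                # A memory limit but no CPU limit: the Pod can burst into
                # unused CPU cycles without risking unbounded memory use.
                "limits": {"memory": "512Mi"},
            },
        }],
    },
}

print(json.dumps(helper_pod, indent=2))
```

Because the container sets explicit CPU and memory requests, Kubernetes doesn't classify this Pod as best-effort.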

To learn more best practices, see Kubernetes best practices: Resource requests and limits.

Reduce workload interruption

If you are using TPUs to train a machine learning model and your workload is interrupted, all work performed since the last checkpoint is lost. To decrease the probability that your workload is interrupted, do the following:

  • Set a higher priority for this Job than for all other Jobs: If resources are scarce, the GKE scheduler preempts lower priority Jobs to schedule a higher priority Job. This also ensures that your higher priority workload receives all the resources that it needs (up to the total resources available in the cluster). To learn more, see Pod priority and preemption.
  • Configure maintenance exclusion: A maintenance exclusion is a non-repeating window of time during which automatic maintenance is forbidden. To learn more, see Maintenance exclusions.
  • Use extended run time Pods in Autopilot: Use extended run time Pods for a grace period of up to seven days before GKE terminates your Pods for scale-downs or node upgrades.
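
The first recommendation can be sketched with two Kubernetes objects, built here as Python dictionaries: a PriorityClass and a Job that references it through priorityClassName. The object names, priority value, and image are hypothetical. The field names follow the Kubernetes scheduling.k8s.io/v1 and batch/v1 APIs.

```python
import json

priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "tpu-training-high"},  # hypothetical name
    "value": 1000000,  # higher values are scheduled first
    "globalDefault": False,
    "description": "Priority for long-running TPU training Jobs.",
}

training_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "tpu-training"},  # hypothetical name
    "spec": {
        "template": {
            "spec": {
                # Pods of this Job can preempt lower-priority Pods.
                "priorityClassName": priority_class["metadata"]["name"],
                "restartPolicy": "Never",
                "containers": [{
                    "name": "trainer",
                    "image": "example.com/trainer:latest",  # hypothetical
                }],
            },
        },
    },
}

for manifest in (priority_class, training_job):
    print(json.dumps(manifest, indent=2))
```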

These recommendations help to minimize interruptions, but not to prevent them. For example, a preemption due to a hardware failure or preemption for defragmentation can still occur. Similarly, setting a GKE maintenance exclusion doesn't prevent Compute Engine maintenance events.

Best practice:

Save checkpoints frequently and add code to your training script to start from the last checkpoint when resumed.
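
A minimal, framework-agnostic sketch of this practice follows. It stores the step counter and a small state dictionary as JSON and writes the file atomically so that an interruption during a save can't corrupt the last good checkpoint. The training step is a stand-in, and a real workload would use its framework's checkpoint utilities and durable storage instead.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, step: int, state: dict) -> None:
    """Write the checkpoint to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"step": step, "state": state}, handle)
    os.replace(tmp_path, path)  # atomic on the same file system


def load_checkpoint(path: str) -> tuple:
    """Return (step, state), or (0, {}) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as handle:
        data = json.load(handle)
    return data["step"], data["state"]


def train(path: str, total_steps: int, checkpoint_every: int = 10) -> int:
    """Resume from the last checkpoint and train up to `total_steps`."""
    step, state = load_checkpoint(path)
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        step += 1
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, step, state)
    return step
```

Calling train again with the same path after an interruption continues from the last saved step instead of from zero.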

Handle disruption due to node maintenance

The GKE nodes that host the TPUs are subject to maintenance events or other disruptions that might cause node shutdown. In GKE clusters with the control plane running version 1.29.1-gke.1425000 and later, you can reduce disruption to workloads by configuring GKE to terminate your workloads gracefully.
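
On the workload side, graceful termination usually means reacting to SIGTERM, the signal that Kubernetes sends to containers before it stops a Pod. The following Python sketch shows one way to do that: a handler sets a flag, and the training loop checks the flag, saves its progress, and exits. The function names and the save callback are hypothetical.

```python
import signal
import threading
from typing import Callable

stop_requested = threading.Event()


def request_stop(signum, frame) -> None:
    """Signal handler: record that the workload was asked to stop."""
    stop_requested.set()


def install_sigterm_handler() -> None:
    """Register the handler. Call this from the main thread at startup."""
    signal.signal(signal.SIGTERM, request_stop)


def run_until_stopped(total_steps: int, save: Callable[[int], None]) -> int:
    """Run steps until done or until a stop is requested, then save."""
    step = 0
    while step < total_steps and not stop_requested.is_set():
        step += 1  # stand-in for a real training step
    save(step)  # persist progress before the grace period ends
    return step
```

The save callback must finish within the Pod's termination grace period, so keep the final checkpoint write short.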

To understand, configure, and monitor disruption events that might occur on GKE nodes running AI/ML workloads, see Manage GKE node disruption for GPUs and TPUs.

Maximize TPU utilization

To maximize your investment in TPUs, schedule a mix of Job priorities and queue them to maximize the amount of time that your TPUs are operating. For Job-level scheduling and preemption, you need to use an add-on to Kubernetes that orchestrates Jobs into queues.

Best practice:

Use Kueue to orchestrate Jobs into queues.
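
As a rough sketch of what a queued Job looks like, the following Python snippet builds a Job manifest that targets a Kueue queue. It assumes that Kueue is installed and that a LocalQueue named tpu-queue exists in the namespace. The kueue.x-k8s.io/queue-name label and the suspended start state follow the Kueue documentation as I understand it, and the Job name, image, and queue name are hypothetical.

```python
import json

queued_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        "name": "tpu-batch-job",  # hypothetical name
        # Assumed label: tells Kueue which LocalQueue admits this Job.
        "labels": {"kueue.x-k8s.io/queue-name": "tpu-queue"},
    },
    "spec": {
        # The Job starts suspended; Kueue resumes it when quota is available.
        "suspend": True,
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "trainer",
                    "image": "example.com/trainer:latest",  # hypothetical
                }],
            },
        },
    },
}

print(json.dumps(queued_job, indent=2))
```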

What's next