This page introduces Cloud TPU with Google Kubernetes Engine (GKE). Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning (ML) workloads that use frameworks such as TensorFlow, PyTorch, and JAX.
Before you use TPUs in GKE, we recommend that you learn how machine learning accelerators work with the Introduction to Cloud TPU.
This page helps you understand the basics of Cloud TPU with Google Kubernetes Engine (GKE), including terminology, benefits of TPUs, and workload scheduling considerations.
To learn how to set up Cloud TPU in GKE, see the following resources:
Benefits of using TPUs in GKE
GKE provides full support for TPU node and node pool lifecycle management, including creating, configuring, and deleting TPU VMs. GKE also supports Spot VMs and Cloud TPU reservations. The benefits of using TPUs in GKE include:
- Consistent operational environment: You can use a single platform for all machine learning and other workloads.
- Automatic upgrades: GKE automates version updates which reduces operational overhead.
- Load balancing: GKE distributes the load, thus reducing latency and improving reliability.
- Responsive scaling: GKE automatically scales TPU resources to meet the needs of your workloads.
- Resource management: With Kueue, a Kubernetes-native job queuing system, you can manage resources across multiple tenants within your organization using queuing, preemption, prioritization, and fair sharing.
Benefits of using TPU Trillium (v6e)
TPU Trillium (v6e) is Cloud TPU's latest-generation AI accelerator. On technical surfaces such as the API and logs, and throughout the GKE documentation, v6e is used to refer to Trillium, Google's 6th generation of TPUs.
- TPU Trillium increases compute performance per chip compared to TPU v5e.
- TPU Trillium increases the High Bandwidth Memory (HBM) capacity and bandwidth, and also increases the Interchip Interconnect (ICI) bandwidth over TPU v5e.
- TPU Trillium is equipped with third-generation SparseCore, a specialized accelerator for processing ultra-large embeddings common in advanced ranking and recommendation workloads.
- TPU Trillium is over 67% more energy-efficient than TPU v5e.
- TPU Trillium can scale up to 256 TPUs in a single high-bandwidth, low-latency TPU slice.
- TPU Trillium supports collection scheduling. Collection scheduling lets you declare a group of TPUs (single-host and multi-host TPU slice node pools) to ensure high availability for the demands of your inference workloads.
To learn more about the benefits of TPU Trillium, read the TPU Trillium announcement blog post. To start your TPU setup, see Plan TPUs in GKE.
Terminology related to TPU in GKE
This page uses the following terminology related to TPUs:
- TPU type: the Cloud TPU type, like v5e.
- TPU slice node: a Kubernetes node represented by a single VM that has one or more interconnected TPU chips.
- TPU slice node pool: a group of Kubernetes nodes within a cluster that all have the same TPU configuration.
- TPU topology: the number and physical arrangement of the TPU chips in a TPU slice.
- Atomic: GKE treats all the interconnected nodes as a single unit. During scaling operations, GKE scales the entire set of nodes to 0 and creates new nodes. If a machine in the group fails or terminates, GKE recreates the entire set of nodes as a new unit.
- Immutable: You can't manually add new nodes to the set of interconnected nodes. You can, however, create a new node pool with the TPU topology you want and schedule workloads on the new node pool.
Types of TPU slice node pools
GKE supports two types of TPU slice node pools: single-host and multi-host. The TPU type and topology determine whether your TPU slice node is multi-host or single-host. We recommend the following:
- For large-scale models, use multi-host TPU slice nodes
- For small-scale models, use single-host TPU slice nodes
Multi-host TPU slice node pools
A multi-host TPU slice node pool is a node pool that contains two or more interconnected TPU VMs. Each VM has a TPU device connected to it. The TPUs in a multi-host TPU slice are connected over a high-speed interconnect (ICI). Once a multi-host TPU slice node pool is created, you can't add nodes to it. For example, you can't create a `v4-32` node pool and then later add an additional Kubernetes node (TPU VM) to the node pool. To add an additional TPU slice to a GKE cluster, you must create a new node pool.
The VMs in a multi-host TPU slice node pool are treated as a single atomic unit. If GKE can't deploy one node in the slice, no nodes in the TPU slice are deployed.
If a node within a multi-host TPU slice needs to be repaired, GKE shuts down all VMs in the TPU slice, forcing all Kubernetes Pods in the workload to be evicted. Once all VMs in the TPU slice are up and running, the Kubernetes Pods can be scheduled on the VMs in the new TPU slice.
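Workloads typically consume a multi-host slice by running one Kubernetes Pod on each VM in the slice. The following is a minimal sketch rather than a complete recipe: it assumes an Indexed Job with one completion per VM of a `v5litepod-16` slice (4 VMs, 4 chips each), the GKE node labels `cloud.google.com/gke-tpu-accelerator` and `cloud.google.com/gke-tpu-topology`, and a placeholder container image.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-multihost-example        # hypothetical name
spec:
  completionMode: Indexed
  completions: 4                     # one Pod per VM in the 4-VM slice
  parallelism: 4
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  # assumed value for a v5e slice
        cloud.google.com/gke-tpu-topology: 4x4                      # 16 chips across 4 VMs
      containers:
      - name: tpu-worker
        image: YOUR_TRAINING_IMAGE   # placeholder: your own training image
        resources:
          requests:
            google.com/tpu: 4        # all 4 chips on each VM
          limits:
            google.com/tpu: 4
```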
The following diagram shows a `v5litepod-16` (v5e) multi-host TPU slice. This TPU slice has four VMs. Each VM in the TPU slice has four TPU v5e chips connected with high-speed interconnects (ICI), and each TPU v5e chip has one TensorCore.

The following diagram shows a GKE cluster that contains one `v5litepod-16` (v5e) TPU slice (topology: `4x4`) and one `v5litepod-8` (v5e) TPU slice (topology: `2x4`):
Single-host TPU slice node pools
A single-host TPU slice node pool is a node pool that contains one or more independent TPU VMs. Each VM has a TPU device connected to it. While the VMs within a single-host slice node pool can communicate over the data center network (DCN), the TPUs attached to the VMs are not interconnected.
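Because the nodes in a single-host slice node pool are independent, you can scale Pod replicas across them without the atomic behavior of a multi-host slice. The following is a minimal sketch, assuming a node pool of `ct5lp-hightpu-1t` machines (one v5e chip per node), the `cloud.google.com/gke-tpu-accelerator` and `cloud.google.com/gke-tpu-topology` node labels, and a placeholder serving image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tpu-single-host-example      # hypothetical name
spec:
  replicas: 2                        # each replica lands on its own single-host TPU slice node
  selector:
    matchLabels:
      app: tpu-single-host-example
  template:
    metadata:
      labels:
        app: tpu-single-host-example
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  # assumed value for this node pool
        cloud.google.com/gke-tpu-topology: 1x1                      # one chip per node
      containers:
      - name: tpu-server
        image: YOUR_SERVING_IMAGE    # placeholder: your own serving image
        resources:
          requests:
            google.com/tpu: 1        # the single chip on the node
          limits:
            google.com/tpu: 1
```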
The following diagram shows an example of a single-host TPU slice that contains seven `v4-8` machines:
Characteristics of TPUs in GKE
TPUs have unique characteristics that require special planning and configuration.
Topology
The topology defines the physical arrangement of TPUs within a TPU slice. GKE provisions a TPU slice in two- or three-dimensional topologies, depending on the TPU version. You specify a topology as the number of TPU chips in each dimension, as follows:
For TPU v4 and v5p scheduled in multi-host TPU slice node pools, you define the topology in 3-tuples (`{A}x{B}x{C}`), for example `4x4x4`. The product of `{A}x{B}x{C}` defines the number of TPU chips in the node pool. For example, you can define small topologies that have fewer than 64 TPU chips with topology forms such as `2x2x2`, `2x2x4`, or `2x4x4`. If you use larger topologies that have more than 64 TPU chips, the values you assign to `{A}`, `{B}`, and `{C}` must meet the following conditions:

- `{A}`, `{B}`, and `{C}` must be multiples of four.
- The largest topology supported for v4 is `12x16x16`, and for v5p it is `16x16x24`.
- The assigned values must follow the `A ≤ B ≤ C` pattern, for example `4x4x8` or `8x8x8`.
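In a workload manifest, you target a particular topology with node selectors rather than in the TPU resource request itself. The following is a minimal sketch, assuming a v4 node pool with a `2x2x2` topology (8 chips, 4 per VM across 2 VMs), the node labels `cloud.google.com/gke-tpu-accelerator` (with the assumed value `tpu-v4-podslice`) and `cloud.google.com/gke-tpu-topology`, and a placeholder image. For a multi-host slice you would run one such Pod per VM, for example with an Indexed Job.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-v4-topology-example      # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice   # assumed label value for TPU v4
    cloud.google.com/gke-tpu-topology: 2x2x2                 # 8 chips: 4 per VM across 2 VMs
  containers:
  - name: tpu-worker
    image: YOUR_TRAINING_IMAGE       # placeholder: your own training image
    resources:
      requests:
        google.com/tpu: 4            # all 4 chips on the VM where this Pod lands
      limits:
        google.com/tpu: 4
```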
Machine type
Machine types that support TPU resources follow a naming convention that
includes the TPU version and the number of TPU chips per node slice, such as
ct<version>-hightpu-<node-chip-count>t
. For example, the machine
type ct5lp-hightpu-1t
supports TPU v5e and contains just one TPU chip.
Privileged mode
Privileged mode overrides many of the other security settings in the `securityContext`. To access TPUs, containers running on GKE nodes need the following configuration:

- Versions earlier than 1.28 must enable privileged mode.
- Versions 1.28 and later don't need privileged mode.
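For example, on nodes running versions earlier than 1.28, the container that accesses TPUs would carry a privileged security context. This is a minimal container-spec fragment with a placeholder image, not a complete manifest:

```yaml
# Container spec fragment for GKE nodes earlier than version 1.28,
# where privileged mode is required to access TPUs.
containers:
- name: tpu-worker
  image: YOUR_TRAINING_IMAGE   # placeholder: your own image
  securityContext:
    privileged: true           # not needed on GKE 1.28 and later
  resources:
    requests:
      google.com/tpu: 4
    limits:
      google.com/tpu: 4
```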
How TPUs in GKE work
Kubernetes resource management and prioritization treat VMs on TPUs the same as other VM types. To request TPU chips, use the resource name `google.com/tpu`:

```yaml
resources:
  requests:
    google.com/tpu: 4
  limits:
    google.com/tpu: 4
```
When you use TPUs in GKE, you must consider the following TPU characteristics:
- A VM can access up to 8 TPU chips.
- A TPU slice contains a fixed number of TPU chips, with the number depending on the TPU machine type you choose.
- The number of requested `google.com/tpu` chips must be equal to the total number of available TPU chips on the TPU slice node. Any container in a GKE Pod that requests TPUs must consume all the TPU chips in the node. Otherwise, your Deployment fails, because GKE can't partially consume TPU resources. Consider the following scenarios (an example manifest for the first scenario appears after this list of considerations):
  - The machine type `ct5l-hightpu-8t` has a single TPU slice node with 8 TPU chips, so on a node you:
    - Can deploy one GKE Pod that requires eight TPU chips.
    - Can't deploy two GKE Pods that require four TPU chips each.
  - The machine type `ct5lp-hightpu-4t` with a `2x4` topology contains two TPU slice nodes with four TPU chips each, for a total of eight TPU chips. With this machine type, you:
    - Can't deploy a GKE Pod that requires eight TPU chips on the nodes in this node pool.
    - Can deploy two Pods that require four TPU chips each, each Pod on one of the two nodes in this node pool.
  - TPU v5e with topology `4x4` has 16 TPU chips in four nodes. The GKE Autopilot workload that selects this configuration must request four TPU chips in each replica, for one to four replicas.
- In Standard clusters, multiple Kubernetes Pods can be scheduled on a VM, but only one container in each Pod can access the TPU chips.
- To create kube-system Pods, such as kube-dns, each Standard cluster must have at least one non-TPU slice node pool.
- By default, TPU slice nodes have the `google.com/tpu` taint, which prevents non-TPU workloads from being scheduled on the TPU slice nodes. Workloads that don't use TPUs are run on non-TPU nodes, freeing up compute on TPU slice nodes for code that uses TPUs. Note that the taint does not guarantee TPU resources are fully utilized.
- GKE collects the logs emitted by containers running on TPU slice nodes. To learn more, see Logging.
- TPU utilization metrics, such as runtime performance, are available in Cloud Monitoring. To learn more, see Observability and metrics.
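Putting these considerations together, the following minimal sketch shows the first scenario in the list: a single Pod that consumes all eight chips on a `ct5l-hightpu-8t` node. It assumes the node labels `cloud.google.com/gke-tpu-accelerator` (with the assumed value `tpu-v5-lite-device`) and `cloud.google.com/gke-tpu-topology`, and a placeholder image; the toleration for the `google.com/tpu` taint is shown explicitly for clarity, although GKE may add it automatically for Pods that request TPUs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-8-chip-example           # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-device   # assumed label value for ct5l (v5e) nodes
    cloud.google.com/gke-tpu-topology: 2x4                      # 8 chips on a single node
  tolerations:
  - key: google.com/tpu              # default taint on TPU slice nodes
    operator: Exists
    effect: NoSchedule
  containers:
  - name: tpu-worker
    image: YOUR_TRAINING_IMAGE       # placeholder: your own image
    resources:
      requests:
        google.com/tpu: 8            # must equal all 8 chips on the node
      limits:
        google.com/tpu: 8
```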
How collection scheduling works
In TPU Trillium (v6e), you can use collection scheduling to group TPU slice nodes. Grouping these TPU slice nodes makes it easier to adjust the number of replicas to meet the workload demand. Google Cloud controls software updates to ensure that sufficient slices within the collection are always available to serve traffic.
Collection scheduling has the following limitations:
- You can only schedule collections for TPU Trillium.
- You can define collections only during node pool creation.
- Spot VMs are not supported.
You can configure collection scheduling in the following scenarios:
- When creating a TPU slice node pool in GKE Standard
- When deploying workloads on GKE Autopilot
- When creating a cluster that enables node auto-provisioning
What's next
- Plan TPUs in GKE to start your TPU setup.
- Learn about best practices for using Cloud TPU for your machine learning tasks.
- Build large-scale machine learning on Cloud TPU with GKE.
- Serve Large Language Models with KubeRay on TPUs.