Deploy TPU workloads on GKE Autopilot

This page describes how to accelerate machine learning (ML) workloads by using Cloud TPU accelerators (TPUs) in Google Kubernetes Engine (GKE) Autopilot clusters. This guidance can help you to select the correct libraries for your ML application frameworks, set up your TPU workloads to run optimally on GKE, and monitor your workloads after deployment.

This page is for Platform admins and operators, Data and AI specialists, and Application developers who want to prepare and run ML workloads on TPUs. To learn more about the common roles, responsibilities, and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

Before reading this page, ensure that you're familiar with the following resources:

How TPUs work in Autopilot

To use TPUs in Autopilot workloads, you specify the following in your workload manifest:

The TPU version in the spec.nodeSelector field.
The TPU topology in the spec.nodeSelector field. The topology must be supported by the specified TPU version.
The number of TPU chips in the spec.containers.resources.requests and the spec.containers.resources.limits fields.

When you deploy the workload, GKE provisions nodes that have the requested TPU configuration and schedules your Pods on the nodes. GKE places each workload on its own node so that each Pod can access the full resources of the node with minimized risk of disruption.

TPUs in Autopilot are compatible with the following capabilities:

Plan your TPU configuration

Before you use this guide to deploy TPU workloads, plan your TPU configuration based on your model and how much memory it requires. For details, see Plan your TPU configuration.

Pricing

For pricing information, see Autopilot pricing.

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.


If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you use primarily zonal clusters, set the compute/zone instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.

Ensure that you have an Autopilot cluster running GKE version 1.32.3-gke.1927000 or later. For instructions, see Create an Autopilot cluster.
To use reserved TPUs, ensure that you have an existing specific capacity reservation. For instructions, see Consume a reservation.

Ensure quota for TPUs and other GKE resources

The following sections help you ensure that you have enough quota when using TPUs in GKE.

To create TPU slice nodes, you must have TPU quota available unless you're using an existing capacity reservation. If you're using reserved TPUs, skip this section.

Creating TPU slice nodes in GKE requires Compute Engine API quota (compute.googleapis.com), not Cloud TPU API quota (tpu.googleapis.com). The name of the quota is different in regular Autopilot Pods and in Spot Pods.

To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:

Go to the Quotas page in the Google Cloud console:

Go to Quotas

In the Filter box, do the following:

Use the following table to select and copy the property of the quota based on the TPU version and value in the cloud.google.com/gke-tpu-accelerator node selector. For example, if you plan to create on-demand TPU v5e nodes whose value in the cloud.google.com/gke-tpu-accelerator node selector is tpu-v5-lite-podslice, enter Name: TPU v5 Lite PodSlice chips.

TPU version, `cloud.google.com/gke-tpu-accelerator`	Property and name of the quota for on-demand instances	Property and name of the quota for Spot instances
TPU v3, `tpu-v3-device`	Dimensions (e.g. location): tpu_family:CT3	Not applicable
TPU v3, `tpu-v3-slice`	Dimensions (e.g. location): tpu_family:CT3P	Not applicable
TPU v4, `tpu-v4-podslice`	Name: TPU v4 PodSlice chips	Name: Preemptible TPU v4 PodSlice chips
TPU v5e, `tpu-v5-lite-podslice`	Name: TPU v5 Lite PodSlice chips	Name: Preemptible TPU v5 Lite Podslice chips
TPU v5p, `tpu-v5p-slice`	Name: TPU v5p chips	Name: Preemptible TPU v5p chips
TPU Trillium, `tpu-v6e-slice`	Dimensions (e.g. location): tpu_family:CT6E	Name: Preemptible TPU slices v6e

Select the Dimensions (e.g. locations) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU slice nodes in the zone us-west4-a. TPU quota is regional, so all zones within the same region consume the same TPU quota.

If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the region that you need, and you must request a TPU quota adjustment.

When a TPU reservation is created, both the limit and current use values for the corresponding quota increase by the number of chips in the TPU reservation. For example, when a reservation is created for 16 TPU v5e chips whose value in the cloud.google.com/gke-tpu-accelerator node selector is tpu-v5-lite-podslice, then both the Limit and Current usage for the TPU v5 Lite PodSlice chips quota in the relevant region increase by 16.

Quotas for additional GKE resources

You may need to increase the following GKE-related quotas in the regions where GKE creates your resources.

Persistent Disk SSD (GB) quota: The boot disk of each Kubernetes node requires 100GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes you anticipate creating and 100GB (nodes * 100GB).
In-use IP addresses quota: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes you anticipate creating.
Ensure that max-pods-per-node aligns with the subnet range: Each Kubernetes node uses secondary IP ranges for Pods. For example, a max-pods-per-node value of 32 requires 64 IP addresses, which translates to a /26 subnet per node. This range shouldn't be shared with any other cluster. To avoid exhausting the IP address range, use the --max-pods-per-node flag to limit the number of Pods that can be scheduled on a node. The quota for max-pods-per-node should be set at least as high as the maximum number of GKE nodes you anticipate creating.
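
The quota arithmetic in the preceding list can be sketched in a few lines of Python. This is a rough planning aid, not an official calculator: the function names are hypothetical, the 100 GB boot disk figure is the default quoted above, and the Pod range calculation assumes that each node reserves two addresses per Pod, rounded up to the next power of two, which generalizes the single 32-Pod example in the text.

```python
BOOT_DISK_GB = 100  # default boot disk size per node, as stated above


def estimate_quotas(max_nodes: int) -> dict:
    """Return the minimum SSD and in-use IP address quotas for max_nodes nodes."""
    return {
        "persistent_disk_ssd_gb": max_nodes * BOOT_DISK_GB,
        "in_use_ip_addresses": max_nodes,  # one IP address per node
    }


def pod_range_prefix(max_pods_per_node: int) -> int:
    """Return the per-node Pod range prefix length (assumption: two
    addresses per Pod, rounded up to the next power of two)."""
    addresses = 2 * max_pods_per_node
    host_bits = (addresses - 1).bit_length()  # integer ceil(log2(addresses))
    return 32 - host_bits


print(estimate_quotas(20))
print(pod_range_prefix(32))
```

With 32 Pods per node the sketch reproduces the /26 range from the example above.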

To request an increase in quota, see Request a quota adjustment.

Options for provisioning TPUs in GKE

GKE Autopilot lets you use TPUs directly in individual workloads by using Kubernetes nodeSelectors.

Alternatively, you can request TPUs by using custom compute classes. Custom compute classes let platform administrators define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware.

For instructions, see the Centrally provision TPUs with custom compute classes section.

Prepare your TPU application

TPU workloads have the following preparation requirements.

Frameworks like JAX, PyTorch, and TensorFlow access TPU VMs using the libtpu shared library. libtpu includes the XLA compiler, TPU runtime software, and the TPU driver. Each release of PyTorch and JAX requires a certain libtpu.so version. To use TPUs in GKE, ensure that you use the following versions:

TPU type	Recommended `jax[tpu]` version	Recommended `torch_xla[tpuvm]` version
TPU Trillium (v6e), `tpu-v6e-slice`	v0.4.9 or later	v2.1.0 or later
TPU v5e, `tpu-v5-lite-podslice`	v0.4.9 or later	v2.1.0 or later
TPU v5p, `tpu-v5p-slice`	v0.4.19 or later	A nightly build from October 23, 2023 is suggested
TPU v4, `tpu-v4-podslice`	v0.4.4 or later	v2.0.0 or later
TPU v3, `tpu-v3-slice` or `tpu-v3-device`	v0.4.4 or later	v2.0.0 or later

Set the following environment variables for the container requesting the TPU resources:
- TPU_WORKER_ID: A unique integer for each Pod. This ID denotes a unique worker-id in the TPU slice. The supported values for this field range from zero to the number of Pods minus one.
- TPU_WORKER_HOSTNAMES: A comma-separated list of TPU VM hostnames or IP addresses that need to communicate with each other within the slice. There should be a hostname or IP address for each TPU VM in the slice. The list is ordered and zero-indexed by TPU_WORKER_ID.
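
As an illustration of how the two variables relate, the following Python sketch derives both values for one worker. It assumes an Indexed Job whose Pods set a subdomain backed by a headless Service, so that each Pod is reachable as JOB_NAME-INDEX.SUBDOMAIN, and it reads the JOB_COMPLETION_INDEX variable that Kubernetes sets in Indexed Job Pods. The function name and the example names tpu-job and headless-svc are placeholders for this sketch.

```python
import os


def tpu_worker_env(job_name: str, subdomain: str, num_workers: int, index: int) -> dict:
    """Build the two TPU environment variables for the worker at `index`."""
    if not 0 <= index < num_workers:
        raise ValueError("index must be between 0 and num_workers - 1")
    # One hostname per worker, ordered so that position i belongs to worker i.
    hostnames = [f"{job_name}-{i}.{subdomain}" for i in range(num_workers)]
    return {
        "TPU_WORKER_ID": str(index),
        "TPU_WORKER_HOSTNAMES": ",".join(hostnames),
    }


# JOB_COMPLETION_INDEX is set by Kubernetes in Pods of an Indexed Job;
# default to 0 so the sketch also runs outside a cluster.
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
print(tpu_worker_env("tpu-job", "headless-svc", num_workers=4, index=index))
```

Every worker receives the same hostname list and only TPU_WORKER_ID differs, which is what keeps the list zero-indexed by worker ID.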

After you complete the workload preparation, you can run a Job that uses TPUs.

Request TPUs in a workload

This section shows you how to create a Job that requests TPUs in Autopilot. In any workload that needs TPUs, you must specify the following:

Node selectors for the TPU version and topology
The number of TPU chips for a container in your workload

For a list of supported TPU versions, topologies, and the corresponding number of TPU chips and nodes in a slice, see Choose the TPU version.

Considerations for TPU requests in workloads

Only one container in a Pod can use TPUs. The number of TPU chips that a container requests must be equal to the number of TPU chips attached to a node in the slice. For example, if you request TPU v5e (tpu-v5-lite-podslice) with a 2x4 topology, you can request any of the following:

4 chips, which creates two multi-host nodes with 4 TPU chips each
8 chips, which creates one single-host node with 8 TPU chips
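
The relationship between topology, chips per node, and node count can be checked with a small Python sketch. The helper names are hypothetical, and the sketch only does the arithmetic; it does not verify that GKE supports a given combination.

```python
import math


def chips_in_topology(topology: str) -> int:
    """Return the total number of TPU chips in a topology such as '2x4'."""
    return math.prod(int(dim) for dim in topology.split("x"))


def nodes_in_slice(topology: str, chips_per_node: int) -> int:
    """Return how many nodes a slice needs for the requested chips per node."""
    total = chips_in_topology(topology)
    if total % chips_per_node:
        raise ValueError("chips_per_node must evenly divide the topology")
    return total // chips_per_node


print(nodes_in_slice("2x4", 4))  # multi-host case from the list above
print(nodes_in_slice("2x4", 8))  # single-host case from the list above
```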

As a best practice to maximize your cost efficiency, always consume all of the TPU chips in the slice that you request. If you request a multi-host slice of two nodes with 4 TPU chips each, you should deploy a workload that runs on both nodes and consumes all 8 TPU chips in the slice.

Create a workload that requests TPUs

The following steps create a Job that requests TPUs. If you have workloads that run on multi-host TPU slices, you must also create a headless Service that selects your workload by name. This headless Service lets Pods on different nodes in the multi-host slice communicate with each other by updating the Kubernetes DNS configuration to point at the Pods in the workload.

Save the following manifest as tpu-autopilot.yaml:

apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None
  selector:
    job-name: tpu-job
---
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-job
spec:
  backoffLimit: 0
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      # Optional: Run in GKE Sandbox
      # runtimeClassName: gvisor
      subdomain: headless-svc
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: TPU_TYPE
        cloud.google.com/gke-tpu-topology: TOPOLOGY
      containers:
      - name: tpu-job
        image: python:3.10
        ports:
        - containerPort: 8471 # Default port using which TPU VMs communicate
        - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
        command:
        - bash
        - -c
        - |
          pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
          python -c 'import jax; print("TPU cores:", jax.device_count())'
        resources:
          requests:
            cpu: 10
            memory: MEMORY_SIZE
            google.com/tpu: NUMBER_OF_CHIPS
          limits:
            cpu: 10
            memory: MEMORY_SIZE
            google.com/tpu: NUMBER_OF_CHIPS

Replace the following:

TPU_TYPE: the TPU type to use, like tpu-v4-podslice. Must be a value supported by GKE.
TOPOLOGY: the arrangement of TPU chips in the slice, like 2x2x4. Must be a supported topology for the selected TPU type.
NUMBER_OF_CHIPS: the number of TPU chips for the container to use. Must be the same value for limits and requests.
MEMORY_SIZE: The maximum amount of memory that the TPU uses. Memory limits depend on the TPU version and topology that you use. To learn more, see Minimums and maximums for accelerators.
Optional runtimeClassName: gvisor: the setting that lets you run this Pod in GKE Sandbox. To use it, uncomment this line. GKE Sandbox supports TPU versions v4 and later. To learn more, see GKE Sandbox.

Deploy the Job:
```
kubectl create -f tpu-autopilot.yaml
```
When you create this Job, GKE automatically does the following:
1. Provisions nodes to run the Pods. Depending on the TPU type, topology, and resource requests that you specified, these nodes are either single-host slices or multi-host slices.
2. Adds taints to the nodes and tolerations to the Pods to prevent any of your other workloads from running on the same nodes as TPU workloads.
When you finish this section, you can avoid continued billing by deleting the workload you created:
```
kubectl delete -f tpu-autopilot.yaml
```

Create a workload that requests TPUs and collection scheduling

In TPU Trillium, you can use collection scheduling to group TPU slice nodes. Grouping these TPU slice nodes makes it easier to adjust the number of replicas to meet the workload demand. Google Cloud controls software updates to ensure that sufficient slices within the collection are always available to serve traffic.

TPU Trillium supports collection scheduling for single-host and multi-host node pools that run inference workloads. The following describes how collection scheduling behavior depends on the type of TPU slice that you use:

Multi-host TPU slice: GKE groups multi-host TPU slices to form a collection. Each GKE node pool is a replica within this collection. To define a collection, create a multi-host TPU slice and assign a unique name to the collection. To add more TPU slices to the collection, create another multi-host TPU slice node pool with the same collection name and workload type.
Single-host TPU slice: GKE considers the entire single-host TPU slice node pool as a collection. To add more TPU slices to the collection, you can resize the single-host TPU slice node pool.

To learn about the limitations of collection scheduling, see How collection scheduling works.

Use a multi-host TPU slice

Collection scheduling in multi-host TPU slice nodes is available for Autopilot clusters in version 1.31.2-gke.1537000 and later. Multi-host TPU slice nodes with a 2x4 topology are only supported in 1.31.2-gke.1115000 or later. To create multi-host TPU slice nodes and group them as a collection, add the following Kubernetes labels to your workload specification:

cloud.google.com/gke-nodepool-group-name: each collection should have a unique name at the cluster level. The value in the cloud.google.com/gke-nodepool-group-name label must adhere to requirements for cluster labels.

cloud.google.com/gke-workload-type: HIGH_AVAILABILITY

For example, the following code block defines a collection with a multi-host TPU slice:

  nodeSelector:
    cloud.google.com/gke-nodepool-group-name: ${COLLECTION_NAME}
    cloud.google.com/gke-workload-type: HIGH_AVAILABILITY
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
    cloud.google.com/gke-tpu-topology: 4x4
...

Use a single-host TPU slice

Collection scheduling in single-host TPU slice nodes is available for Autopilot clusters in version 1.31.2-gke.1088000 and later. To create single-host TPU slice nodes and group them as a collection, add the cloud.google.com/gke-workload-type: HIGH_AVAILABILITY label to your workload specification.

For example, the following code block defines a collection with a single-host TPU slice:

  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
    cloud.google.com/gke-tpu-topology: 2x2
    cloud.google.com/gke-workload-type: HIGH_AVAILABILITY
  ...
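
The two node selector variants above differ only in whether a collection name is present. The following Python sketch builds either variant as a dictionary. The function name is hypothetical, and the sketch does not validate the collection name against the cluster label requirements.

```python
from typing import Optional


def collection_node_selector(
    tpu_type: str, topology: str, collection_name: Optional[str] = None
) -> dict:
    """Build a nodeSelector for collection scheduling.

    Pass collection_name for a multi-host slice. Omit it for a single-host
    slice, where the node pool itself is treated as the collection.
    """
    selector = {
        "cloud.google.com/gke-tpu-accelerator": tpu_type,
        "cloud.google.com/gke-tpu-topology": topology,
        "cloud.google.com/gke-workload-type": "HIGH_AVAILABILITY",
    }
    if collection_name is not None:
        selector["cloud.google.com/gke-nodepool-group-name"] = collection_name
    return selector


print(collection_node_selector("tpu-v6e-slice", "4x4", "my-collection"))
print(collection_node_selector("tpu-v6e-slice", "2x2"))
```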

Use custom compute classes to deploy a collection

For more information about using custom compute classes to deploy a workload that requests TPUs and collection scheduling, see TPU multi-host collection and Define workload type for TPU SLO.

Centrally provision TPUs with custom compute classes

To provision TPUs with a custom compute class that follows the TPU rules and deploy the workload, complete the following steps:

Save the following manifest as tpu-compute-class.yaml:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: tpu-class
spec:
  priorities:
  - tpu:
      type: tpu-v5-lite-podslice
      count: 4
      topology: 2x4
  - spot: true
    tpu:
      type: tpu-v5-lite-podslice
      count: 4
      topology: 2x4
  - flexStart:
      enabled: true
    tpu:
      type: tpu-v6e-slice
      count: 4
      topology: 2x4
  nodePoolAutoCreation:
    enabled: true

Deploy the compute class:
```
kubectl apply -f tpu-compute-class.yaml
```
For more information about custom compute classes and TPUs, see TPU configuration.

Save the following manifest as tpu-job.yaml:

apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None
  selector:
    job-name: tpu-job
---
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-job
spec:
  backoffLimit: 0
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      subdomain: headless-svc
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/compute-class: tpu-class
      containers:
      - name: tpu-job
        image: python:3.10
        ports:
        - containerPort: 8471 # Default port using which TPU VMs communicate
        - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
        command:
        - bash
        - -c
        - |
          pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
          python -c 'import jax; print("TPU cores:", jax.device_count())'
        resources:
          requests:
            cpu: 10
            memory: MEMORY_SIZE
            google.com/tpu: NUMBER_OF_CHIPS
          limits:
            cpu: 10
            memory: MEMORY_SIZE
            google.com/tpu: NUMBER_OF_CHIPS

Replace the following:

NUMBER_OF_CHIPS: the number of TPU chips for the container to use. Must be the same value for limits and requests, equal to the value in the tpu.count field in the selected custom compute class.
MEMORY_SIZE: The maximum amount of memory that the TPU uses. Memory limits depend on the TPU version and topology that you use. To learn more, see Minimums and maximums for accelerators.

Deploy the Job:
```
kubectl create -f tpu-job.yaml
```
When you create this Job, GKE automatically does the following:
- Provisions nodes to run the Pods. Depending on the TPU type, topology, and resource requests that you specified, these nodes are either single-host slices or multi-host slices. Depending on the availability of TPU resources in the top priority, GKE might fall back to lower priorities to maximize obtainability.
- Adds taints to the nodes and tolerations to the Pods to prevent any of your other workloads from running on the same nodes as TPU workloads.
To learn more, see About custom compute classes.
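
The fallback behavior can be pictured as a first-match walk over the priorities list. The following Python sketch is a simplified model only: the name keys are labels added for illustration, is_available is a hypothetical stand-in for the capacity check that GKE performs, and the real node scaling logic is more involved.

```python
# Priority rules mirroring the tpu-class manifest. The "name" keys are
# hypothetical labels added only for this illustration.
PRIORITIES = [
    {"name": "on-demand-v5e",
     "tpu": {"type": "tpu-v5-lite-podslice", "count": 4, "topology": "2x4"}},
    {"name": "spot-v5e", "spot": True,
     "tpu": {"type": "tpu-v5-lite-podslice", "count": 4, "topology": "2x4"}},
    {"name": "flex-start-v6e", "flexStart": {"enabled": True},
     "tpu": {"type": "tpu-v6e-slice", "count": 4, "topology": "2x4"}},
]


def pick_priority(priorities, is_available):
    """Return the first rule for which is_available(rule) is true, else None."""
    for rule in priorities:
        if is_available(rule):
            return rule
    return None


# Pretend that on-demand capacity is exhausted but Spot capacity exists.
chosen = pick_priority(PRIORITIES, lambda rule: rule.get("spot", False))
print(chosen["name"])
```
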
When you finish this section, you can avoid continued billing by deleting the resources you created:
```
kubectl delete -f tpu-job.yaml
```

Example: Display the total TPU chips in a multi-host slice

The following workload returns the number of TPU chips across all of the nodes in a multi-host TPU slice. To create a multi-host slice, the workload has the following parameters:

TPU version: TPU v4
Topology: 2x2x4

This version and topology selection results in a multi-host slice.

Save the following manifest as available-chips-multihost.yaml:

apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None
  selector:
    job-name: tpu-available-chips
---
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-available-chips
spec:
  backoffLimit: 0
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      subdomain: headless-svc
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x4
      containers:
      - name: tpu-job
        image: python:3.10
        ports:
        - containerPort: 8471 # Default port using which TPU VMs communicate
        - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
        command:
        - bash
        - -c
        - |
          pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
          python -c 'import jax; print("TPU cores:", jax.device_count())'
        resources:
          requests:
            cpu: 10
            memory: 407Gi
            google.com/tpu: 4
          limits:
            cpu: 10
            memory: 407Gi
            google.com/tpu: 4

Deploy the manifest:
```
kubectl create -f available-chips-multihost.yaml
```
GKE runs a TPU v4 slice with four VMs (multi-host TPU slice). The slice has 16 interconnected TPU chips.

Verify that the Job created four Pods:

kubectl get pods

The output is similar to the following:

NAME                          READY   STATUS      RESTARTS   AGE
tpu-available-chips-0-5cd8r   0/1     Completed   0          97s
tpu-available-chips-1-lqqxt   0/1     Completed   0          97s
tpu-available-chips-2-f6kwh   0/1     Completed   0          97s
tpu-available-chips-3-m8b5c   0/1     Completed   0          97s

Get the logs of one of the Pods:
```
kubectl logs POD_NAME
```
Replace POD_NAME with the name of one of the created Pods. For example, tpu-available-chips-0-5cd8r.

The output is similar to the following:
```
TPU cores: 16
```

Optional: Remove the workload:

kubectl delete -f available-chips-multihost.yaml

Example: Display the TPU chips in a single node

The following workload is a standalone Pod that displays the number of TPU chips that are attached to a specific node. To create a single-host node, the workload has the following parameters:

TPU version: TPU v5e
Topology: 2x4

This version and topology selection results in a single-host slice.

Save the following manifest as available-chips-singlehost.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: tpu-job-jax-v5
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 2x4
  containers:
  - name: tpu-job
    image: python:3.10
    ports:
    - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
    command:
    - bash
    - -c
    - |
      pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
      python -c 'import jax; print("Total TPU chips:", jax.device_count())'
    resources:
      requests:
        google.com/tpu: 8
      limits:
        google.com/tpu: 8

Deploy the manifest:
```
kubectl create -f available-chips-singlehost.yaml
```
GKE provisions a single-host TPU slice node that uses TPU v5e. The node has eight TPU chips.
Get the logs of the Pod:
```
kubectl logs tpu-job-jax-v5
```
The output is similar to the following:
```
Total TPU chips: 8
```

Optional: Remove the workload:

  kubectl delete -f available-chips-singlehost.yaml

Observe and monitor TPUs

Dashboard

Node pool observability in the Google Cloud console is generally available. To view the status of your multi-host TPU node pools on GKE, go to the GKE TPU Node Pool Status dashboard provided by Cloud Monitoring:

Go to GKE TPU Node Pool Status

This dashboard gives you comprehensive insights into the health of your multi-host TPU node pools. For more information, see Monitor health metrics for TPU nodes and node pools.

In the Kubernetes Clusters page in the Google Cloud console, the Observability tab also displays TPU observability metrics, such as TPU usage, under the Accelerators > TPU heading. For more information, see View observability metrics.

The TPU dashboard is populated only if you have system metrics enabled in your GKE cluster.

Runtime metrics

In GKE version 1.27.4-gke.900 or later, TPU workloads that both use JAX version 0.4.14 or later and specify containerPort: 8431 export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU workload's runtime performance:

Duty cycle: percentage of time over the past sampling period (60 seconds) during which the TensorCores were actively processing on a TPU chip. A larger percentage means better TPU utilization.
Memory used: amount of accelerator memory allocated in bytes. Sampled every 60 seconds.
Memory total: total accelerator memory in bytes. Sampled every 60 seconds.

These metrics are located in the Kubernetes node (k8s_node) and Kubernetes container (k8s_container) schema.

Kubernetes container:

kubernetes.io/container/accelerator/duty_cycle
kubernetes.io/container/accelerator/memory_used
kubernetes.io/container/accelerator/memory_total

Kubernetes node:

kubernetes.io/node/accelerator/duty_cycle
kubernetes.io/node/accelerator/memory_used
kubernetes.io/node/accelerator/memory_total
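
To make the metric definitions concrete, the following Python sketch shows the arithmetic behind a duty cycle value and a memory usage percentage derived from the memory used and memory total metrics. The function names and input values are hypothetical. Cloud Monitoring reports the metrics directly, so this is only a way to reason about what the numbers mean.

```python
SAMPLE_PERIOD_SECONDS = 60  # sampling period stated in the metric descriptions


def duty_cycle_percent(busy_seconds: float,
                       period_seconds: float = SAMPLE_PERIOD_SECONDS) -> float:
    """Percentage of the sampling period in which the TensorCores were busy."""
    return 100.0 * busy_seconds / period_seconds


def memory_used_percent(memory_used_bytes: int, memory_total_bytes: int) -> float:
    """Share of accelerator memory that is allocated, as a percentage."""
    return 100.0 * memory_used_bytes / memory_total_bytes


print(duty_cycle_percent(45))
print(memory_used_percent(8 * 2**30, 16 * 2**30))
```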

Monitor health metrics for TPU nodes and node pools

When a training job has an error or terminates in failure, you can check metrics related to the underlying infrastructure to figure out if the interruption was caused by an issue with the underlying node or node pool.

Node status

In GKE version 1.32.1-gke.1357001 or later, the following GKE system metric exposes the condition of a GKE node:

kubernetes.io/node/status_condition

The condition field reports conditions on the node, such as Ready, DiskPressure, and MemoryPressure. The status field shows the reported status of the condition, which can be True, False, or Unknown. This is a metric with the k8s_node monitored resource type.

This PromQL query shows if a particular node is Ready:

kubernetes_io:node_status_condition{
    monitored_resource="k8s_node",
    cluster_name="CLUSTER_NAME",
    node_name="NODE_NAME",
    condition="Ready",
    status="True"}

To help troubleshoot issues in a cluster, you might want to look at nodes that have exhibited other conditions:

kubernetes_io:node_status_condition{
    monitored_resource="k8s_node",
    cluster_name="CLUSTER_NAME",
    condition!="Ready",
    status="True"}

You might want to specifically look at nodes that aren't Ready:

kubernetes_io:node_status_condition{
    monitored_resource="k8s_node",
    cluster_name="CLUSTER_NAME",
    condition="Ready",
    status="False"}

If there is no data, then the nodes are ready. The status condition is sampled every 60 seconds.

You can use the following query to understand the node status across the fleet:

avg by (condition,status)(
  avg_over_time(
    kubernetes_io:node_status_condition{monitored_resource="k8s_node"}[${__interval}]))
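
The node status queries above share one shape: a metric name followed by quoted label matchers. If you generate such queries from a script, a small helper like the following Python sketch keeps the quoting consistent. The function is hypothetical, handles only equality matchers (not != or =~), and assumes that label values contain no double quotes.

```python
def promql_selector(metric: str, **labels: str) -> str:
    """Format a PromQL selector with equality matchers in the given order."""
    matchers = ", ".join(f'{name}="{value}"' for name, value in labels.items())
    return f"{metric}{{{matchers}}}"


query = promql_selector(
    "kubernetes_io:node_status_condition",
    monitored_resource="k8s_node",
    cluster_name="CLUSTER_NAME",
    condition="Ready",
    status="False",
)
print(query)
```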

Node pool status

The following GKE system metric for the k8s_node_pool monitored resource exposes the status of a GKE node pool:

kubernetes.io/node_pool/status

This metric is reported only for multi-host TPU node pools.

The status field reports the status of the node pool, such as Provisioning, Running, Error, Reconciling, or Stopping. Status updates happen after GKE API operations complete.

To verify if a particular node pool has Running status, use the following PromQL query:

kubernetes_io:node_pool_status{
    monitored_resource="k8s_node_pool",
    cluster_name="CLUSTER_NAME",
    node_pool_name="NODE_POOL_NAME",
    status="Running"}

To monitor the number of node pools in your project grouped by their status, use the following PromQL query:

count by (status)(
  count_over_time(
    kubernetes_io:node_pool_status{monitored_resource="k8s_node_pool"}[${__interval}]))

Node pool availability

The following GKE system metric shows whether a multi-host TPU node pool is available:

kubernetes.io/node_pool/multi_host/available

The metric has a value of True if all of the nodes in the node pool are available, and False otherwise. The metric is sampled every 60 seconds.

To check the availability of multi-host TPU node pools in your project, use the following PromQL query:

avg by (node_pool_name)(
  avg_over_time(
    kubernetes_io:node_pool_multi_host_available{
      monitored_resource="k8s_node_pool",
      cluster_name="CLUSTER_NAME"}[${__interval}]))

Node interruption count

The following GKE system metric reports the count of interruptions for a GKE node since the last sample (the metric is sampled every 60 seconds):

kubernetes.io/node/interruption_count

The interruption_type (such as TerminationEvent, MaintenanceEvent, or PreemptionEvent) and interruption_reason (like HostError, Eviction, or AutoRepair) fields can help explain why a node was interrupted.

To get a breakdown of the interruptions and their causes in TPU nodes in the clusters in your project, use the following PromQL query:

  sum by (interruption_type,interruption_reason)(
    sum_over_time(
      kubernetes_io:node_interruption_count{monitored_resource="k8s_node"}[${__interval}]))

To only see the host maintenance events, update the query to filter the HW/SW Maintenance value for the interruption_reason. Use the following PromQL query:

  sum by (interruption_type,interruption_reason)(
    sum_over_time(
      kubernetes_io:node_interruption_count{monitored_resource="k8s_node", interruption_reason="HW/SW Maintenance"}[${__interval}]))

To see the interruption count aggregated by node pool, use the following PromQL query:

  sum by (node_pool_name,interruption_type,interruption_reason)(
    sum_over_time(
      kubernetes_io:node_pool_interruption_count{monitored_resource="k8s_node_pool", interruption_reason="HW/SW Maintenance", node_pool_name="NODE_POOL_NAME"}[${__interval}]))

Node pool times to recover (TTR)

The following GKE system metric reports the distribution of recovery period durations for GKE multi-host TPU node pools:

kubernetes.io/node_pool/accelerator/times_to_recover

Each sample recorded in this metric indicates a single recovery event for the node pool from a downtime period.

This metric is useful for tracking the multi-host TPU node pool time to recover and time between interruptions.

You can use the following PromQL query to calculate the mean time to recovery (MTTR) for the last 7 days in your cluster:

sum(sum_over_time(
  kubernetes_io:node_pool_accelerator_times_to_recover_sum{
    monitored_resource="k8s_node_pool", cluster_name="CLUSTER_NAME"}[7d]))
/
sum(sum_over_time(
  kubernetes_io:node_pool_accelerator_times_to_recover_count{
    monitored_resource="k8s_node_pool",cluster_name="CLUSTER_NAME"}[7d]))
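
The query above divides the summed recovery durations by the number of recovery events. The same calculation on raw samples looks like the following Python sketch. The sample durations are made up, and the sketch assumes that they are expressed in seconds; check the metric's unit in Cloud Monitoring before you interpret real values.

```python
from typing import Sequence


def mean_time_to_recover(recovery_durations: Sequence[float]) -> float:
    """Return the mean of the recorded recovery durations (sum / count)."""
    if not recovery_durations:
        raise ValueError("no recovery events in the selected window")
    return sum(recovery_durations) / len(recovery_durations)


# Three hypothetical recovery events within the 7-day window.
print(mean_time_to_recover([120.0, 300.0, 180.0]))
```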

Node pool times between interruptions (TBI)

Node pool times between interruptions measures how long your infrastructure runs before experiencing an interruption. It is computed as the average over a window of time, where the numerator measures the total time that your infrastructure was up and the denominator measures the total interruptions to your infrastructure.

The following PromQL example shows the 7-day mean time between interruptions (MTBI) for the given cluster:

sum(count_over_time(
  kubernetes_io:node_memory_total_bytes{
    monitored_resource="k8s_node", node_name=~"gke-tpu.*|gk3-tpu.*", cluster_name="CLUSTER_NAME"}[7d]))
/
sum(sum_over_time(
  kubernetes_io:node_interruption_count{
    monitored_resource="k8s_node", node_name=~"gke-tpu.*|gk3-tpu.*", cluster_name="CLUSTER_NAME"}[7d]))
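
The query above uses the number of node_memory_total_bytes samples as a proxy for uptime and divides it by the interruption count, so its result is in samples per interruption. The following Python sketch converts that ratio to hours. It assumes that the memory metric is sampled every 60 seconds, like the other metrics on this page; the function name and input values are hypothetical.

```python
SAMPLE_PERIOD_SECONDS = 60  # assumed sampling period of the uptime proxy metric


def mean_time_between_interruptions_hours(uptime_samples: int,
                                          interruptions: int) -> float:
    """Return MTBI in hours: total uptime divided by total interruptions."""
    if interruptions == 0:
        return float("inf")  # no interruptions in the window
    uptime_seconds = uptime_samples * SAMPLE_PERIOD_SECONDS
    return uptime_seconds / interruptions / 3600


# One node that was up for all 7 days (10080 samples) with 4 interruptions.
print(mean_time_between_interruptions_hours(10080, 4))
```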

Host metrics

In GKE version 1.28.1-gke.1066000 or later, VMs in a TPU slice export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU host's performance:

TensorCore utilization: current percentage of the TensorCore that is utilized. The TensorCore value equals the sum of the matrix-multiply units (MXUs) plus the vector unit. The TensorCore utilization value is the number of TensorCore operations that were performed over the past sample period (60 seconds) divided by the supported number of TensorCore operations over the same period. A larger value means better utilization.
Memory bandwidth utilization: current percentage of the accelerator memory bandwidth that is being used. Computed by dividing the memory bandwidth used over a sample period (60s) by the maximum supported bandwidth over the same sample period.
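
Both host metrics are ratios of work done to the maximum that the hardware supports over one sample period. The following Python sketch shows one way to read those definitions. The function names, the peak rates, and the choice to express the maximum as a per-second rate multiplied by the period are assumptions for illustration, not the exact internal computation.

```python
SAMPLE_PERIOD_SECONDS = 60  # sample period stated in the metric descriptions


def tensorcore_utilization_percent(ops_performed: float,
                                   peak_ops_per_second: float) -> float:
    """Operations performed divided by the operations supported in the period."""
    supported = peak_ops_per_second * SAMPLE_PERIOD_SECONDS
    return 100.0 * ops_performed / supported


def memory_bandwidth_utilization_percent(bytes_transferred: float,
                                         peak_bytes_per_second: float) -> float:
    """Bytes moved divided by the bytes the memory could move in the period."""
    supported = peak_bytes_per_second * SAMPLE_PERIOD_SECONDS
    return 100.0 * bytes_transferred / supported


print(tensorcore_utilization_percent(6_000, 200))
print(memory_bandwidth_utilization_percent(3_000, 100))
```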