This page shows you how to resolve issues related to TPUs in Google Kubernetes Engine (GKE).
If you need additional assistance, reach out to Cloud Customer Care.
Insufficient quota to satisfy the TPU request
An error similar to "Insufficient quota to satisfy the request" indicates that your Google Cloud project has insufficient quota available to satisfy the request.
To resolve this issue, check your project's quota limit and current usage. If needed, request an increase to your TPU quota.
Check quota limit and current usage
The following sections help you ensure that you have enough quota when using TPUs in GKE.
To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:
Go to the Quotas page in the Google Cloud console:
In the Filter box, do the following:
- Select the Service property, enter Compute Engine API, and press Enter.
- Select the Type property and choose Quota.
- Select the Name property and enter the name of the quota based on the TPU version, as listed in the following table. For example, for on-demand TPU v5e PodSlice nodes, enter TPU v5 Lite PodSlice chips.

| TPU version | Name of the quota for on-demand instances | Name of the quota for Spot instances |
| --- | --- | --- |
| TPU v3 | TPU v3 Device chips | Preemptible TPU v3 Device chips |
| TPU v3 | TPU v3 PodSlice chips | Preemptible TPU v3 PodSlice chips |
| TPU v4 | TPU v4 PodSlice chips | Preemptible TPU v4 PodSlice chips |
| TPU v5e | TPU v5 Lite Device chips | Preemptible TPU v5 Lite Device chips |
| TPU v5e | TPU v5 Lite PodSlice chips | Preemptible TPU v5 Lite PodSlice chips |
| TPU v5p | TPU v5p chips | Preemptible TPU v5p chips |
| TPU v6e (Preview) | TPU v6e Slice chips | Preemptible TPU v6e Lite PodSlice chips |
- Select the Dimensions (e.g. locations) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU slice nodes in the zone us-west4-a. TPU quota is regional, so all zones within the same region consume the same TPU quota.
If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the region that you need, and you must request a TPU quota increase.
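Because TPU quota is regional, the Dimensions filter value can be derived from the zone you plan to use. The following is a minimal sketch of that step. The function name region_dimension is hypothetical, and it assumes zone names always have the form REGION-LETTER, as in us-west4-a.

```python
def region_dimension(zone: str) -> str:
    """Return the Dimensions filter value for the region that contains `zone`.

    Assumption: a zone name is the region name plus a one-part suffix,
    for example "us-west4-a" belongs to region "us-west4".
    """
    region, separator, suffix = zone.rpartition("-")
    if not separator or not region or not suffix:
        raise ValueError(f"unexpected zone name: {zone!r}")
    return f"region:{region}"


# Every zone in a region maps to the same filter value, which mirrors the
# fact that all zones in the region consume the same TPU quota.
print(region_dimension("us-west4-a"))
print(region_dimension("us-west4-b"))
```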
When a TPU reservation is created, both the limit and current usage values for the corresponding quota increase by the number of chips in the TPU reservation. For example, when a reservation is created for 16 TPU v5e chips, both the Limit and Current usage for the TPU v5 Lite PodSlice chips quota in the relevant region increase by 16.
Quotas for additional GKE resources
You may need to increase the following GKE-related quotas in the regions where GKE creates your resources.
- Persistent Disk SSD (GB) quota: The boot disk of each Kubernetes node requires 100GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes you anticipate creating and 100GB (nodes * 100GB).
- In-use IP addresses quota: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes you anticipate creating.
- Ensure that max-pods-per-node aligns with the subnet range: Each Kubernetes node uses secondary IP ranges for Pods. For example, a max-pods-per-node of 32 requires 64 IP addresses, which translates to a /26 subnet per node. This range shouldn't be shared with any other cluster. To avoid exhausting the IP address range, use the --max-pods-per-node flag to limit the number of Pods that can be scheduled on a node. The quota for max-pods-per-node should be set at least as high as the maximum number of GKE nodes you anticipate creating.
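The sizing rules in this list can be sketched as a short calculation. The function names below are hypothetical. The per-node Pod range calculation assumes that the range holds twice the maximum number of Pods, rounded up to a power of two, which generalizes the single example given above (32 Pods, 64 addresses, /26); verify it against your cluster's actual configuration.

```python
BOOT_DISK_GB = 100  # default boot disk size per node, as stated above


def required_node_quotas(max_nodes: int) -> dict:
    """Minimum Persistent Disk SSD and in-use IP address quota for max_nodes."""
    return {
        "persistent_disk_ssd_gb": max_nodes * BOOT_DISK_GB,
        "in_use_ip_addresses": max_nodes,
    }


def pod_range_prefix(max_pods_per_node: int) -> int:
    """Return the prefix length of the per-node Pod range.

    Assumption: the range holds 2 * max_pods_per_node addresses,
    rounded up to the next power of two.
    """
    if max_pods_per_node < 1:
        raise ValueError("max_pods_per_node must be positive")
    addresses = 2 * max_pods_per_node
    host_bits = (addresses - 1).bit_length()  # smallest n with 2**n >= addresses
    return 32 - host_bits


print(required_node_quotas(50))
print(f"/{pod_range_prefix(32)}")
```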
To request an increase in quota, see Request higher quota.
Error when enabling node auto-provisioning in a TPU slice node pool
The following error occurs when you are enabling node auto-provisioning in a GKE cluster that doesn't support TPUs.
The error message is similar to the following:
ERROR: (gcloud.container.clusters.create) ResponseError: code=400,
message=Invalid resource: tpu-v4-podslice.
To resolve this issue, upgrade your GKE cluster to version 1.27.6 or later.
GKE doesn't automatically provision TPU slice nodes
The following sections describe the cases where GKE doesn't automatically provision TPU slice nodes and how to fix them.
Limit misconfiguration
GKE doesn't automatically provision TPU slice nodes if the auto-provisioning limits you defined for a cluster are too low. You may observe the following errors in such scenarios:
If a TPU slice node pool exists, but GKE can't scale up the nodes due to violating resource limits, you can see the following error message when running the kubectl get events command:
11s Normal NotTriggerScaleUp pod/tpu-workload-65b69f6c95-ccxwz pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 max cluster cpu, memory limit reached
Also, in this scenario, you can see warning messages similar to the following in the Google Cloud console:
"Your cluster has one or more unschedulable Pods"
When GKE attempts to auto-provision a TPU slice node pool that exceeds resource limits, the cluster autoscaler visibility logs will display the following error message:
messageId: "no.scale.up.nap.pod.zonal.resources.exceeded"
Also, in this scenario, you can see warning messages similar to the following in the Google Cloud console:
"Can't scale up because node auto-provisioning can't provision a node pool for the Pod if it would exceed resource limits"
To resolve these issues, increase the maximum number of TPU chips, CPU cores, and memory in the cluster.
To complete these steps:
- Calculate the resource requirements for a given TPU machine type and count. Note that you need to add resources for non-TPU slice node pools, like system workloads.
Obtain a description of the available TPU, CPU, and memory for a specific machine type and zone. Use the gcloud CLI:
gcloud compute machine-types describe MACHINE_TYPE \
    --zone COMPUTE_ZONE
Replace the following:
- MACHINE_TYPE: The type of machine to search.
- COMPUTE_ZONE: The name of the compute zone.
The output includes a description line similar to the following:
description: 240 vCPUs, 407 GB RAM, 4 Google TPUs
Calculate the total CPU and memory by multiplying these amounts by the required number of nodes. For example, the ct4p-hightpu-4t machine type uses 240 CPU cores and 407 GB RAM with 4 TPU chips. Assuming that you require 20 TPU chips, which corresponds to five nodes, you must define the following values:
- --max-accelerator=type=tpu-v4-podslice,count=20
- CPU = 1200 (240 times 5)
- memory = 2035 (407 times 5)
You should define the limits with some margin to accommodate non-TPU slice nodes such as system workloads.
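The calculation above can be sketched as follows. The function name and dictionary keys are hypothetical; the input numbers come from the ct4p-hightpu-4t example. The result covers only the TPU slice nodes, so add your own margin for non-TPU node pools.

```python
import math


def tpu_limits(chips_needed: int, chips_per_node: int,
               vcpus_per_node: int, memory_gb_per_node: int) -> dict:
    """Return the minimum auto-provisioning limits for the TPU slice nodes."""
    # Round up, because a partially used node still counts in full.
    nodes = math.ceil(chips_needed / chips_per_node)
    return {
        "nodes": nodes,
        "max_accelerator_count": nodes * chips_per_node,
        "max_cpu": nodes * vcpus_per_node,
        "max_memory_gb": nodes * memory_gb_per_node,
    }


# ct4p-hightpu-4t: 240 vCPUs, 407 GB RAM, 4 TPU chips per node.
limits = tpu_limits(chips_needed=20, chips_per_node=4,
                    vcpus_per_node=240, memory_gb_per_node=407)
print(limits)
```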
Update the cluster limits:
gcloud container clusters update CLUSTER_NAME \
    --max-accelerator type=TPU_ACCELERATOR,count=MAXIMUM_ACCELERATOR \
    --max-cpu=MAXIMUM_CPU \
    --max-memory=MAXIMUM_MEMORY
Replace the following:
- CLUSTER_NAME: The name of the cluster.
- TPU_ACCELERATOR: The name of the TPU accelerator.
- MAXIMUM_ACCELERATOR: The maximum number of TPU chips in the cluster.
- MAXIMUM_CPU: The maximum number of cores in the cluster.
- MAXIMUM_MEMORY: The maximum number of gigabytes of memory in the cluster.
Not all instances running
ERROR: nodes cannot be created due to lack of capacity. The missing nodes
will be created asynchronously once capacity is available. You can either
wait for the nodes to be up, or delete the node pool and try re-creating it
again later.
This error can appear when a GKE operation times out, or when a request to provision single-host or multi-host TPU node pools can't be fulfilled immediately and is queued instead. To mitigate capacity issues, you can use reservations or consider Spot VMs.
Workload misconfiguration
This error occurs due to misconfiguration of the workload. The following are some of the most common causes of the error:
- The cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology labels are incorrect or missing in the Pod spec. In that case, GKE won't provision TPU slice node pools and node auto-provisioning won't be able to scale up the cluster.
- The Pod spec doesn't specify google.com/tpu in its resource requirements.
To resolve this issue, do one of the following:
- Check that there are no unsupported labels in your workload node selector. For example, a node selector for the cloud.google.com/gke-nodepool label prevents GKE from creating additional node pools for your Pods.
- Ensure that the Pod template specification where your TPU workload runs includes the following values:
  - The cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology labels in its nodeSelector.
  - google.com/tpu in its request.
To learn how to deploy TPU workloads in GKE, see Run a workload that displays the number of available TPU chips in a TPU slice node pool.
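The checks in this section can be approximated with a small script that inspects a Pod spec loaded as a dictionary. This is a minimal sketch, not a GKE tool: the function name check_tpu_pod_spec is hypothetical, and accepting google.com/tpu under either requests or limits is an assumption.

```python
REQUIRED_SELECTORS = (
    "cloud.google.com/gke-tpu-accelerator",
    "cloud.google.com/gke-tpu-topology",
)
BLOCKING_SELECTOR = "cloud.google.com/gke-nodepool"
TPU_RESOURCE = "google.com/tpu"


def check_tpu_pod_spec(pod_spec: dict) -> list:
    """Return a list of problems found in a Pod spec dictionary."""
    problems = []
    selector = pod_spec.get("nodeSelector") or {}
    for label in REQUIRED_SELECTORS:
        if label not in selector:
            problems.append(f"nodeSelector is missing {label}")
    if BLOCKING_SELECTOR in selector:
        problems.append(f"nodeSelector uses {BLOCKING_SELECTOR}")
    # Assumption: a TPU entry under requests or limits counts as a request.
    asks_for_tpu = any(
        TPU_RESOURCE in ((container.get("resources") or {}).get(kind) or {})
        for container in pod_spec.get("containers") or []
        for kind in ("requests", "limits")
    )
    if not asks_for_tpu:
        problems.append(f"no container requests {TPU_RESOURCE}")
    return problems


good_spec = {
    "nodeSelector": {
        "cloud.google.com/gke-tpu-accelerator": "tpu-v5-lite-podslice",
        "cloud.google.com/gke-tpu-topology": "1x1",
    },
    "containers": [{"name": "worker",
                    "resources": {"requests": {"google.com/tpu": 1},
                                  "limits": {"google.com/tpu": 1}}}],
}
bad_spec = {
    "nodeSelector": {"cloud.google.com/gke-nodepool": "pool-1"},
    "containers": [{"name": "worker", "resources": {}}],
}
print(check_tpu_pod_spec(good_spec))
print(check_tpu_pod_spec(bad_spec))
```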
Scheduling errors when deploying Pods that consume TPUs in GKE
The following issue occurs when GKE can't schedule Pods requesting TPUs on TPU slice nodes. For example, this might occur if some non-TPU Pods were already scheduled on TPU nodes.
The error message, emitted as a FailedScheduling event on the Pod, is similar to the following:
Cannot schedule pods: Preemption is not helpful for scheduling.
Error message: 0/2 nodes are available: 2 node(s) had untolerated taint
{google.com/tpu: present}. preemption: 0/2 nodes are available: 2 Preemption is
not helpful for scheduling
To resolve this issue, do the following:
Check that you have at least one CPU node pool in your cluster so the system critical Pods can run in the non-TPU nodes. To learn more, see Deploy a Pod to a specific node pool.
Troubleshooting common issues with JobSets in GKE
For common issues with JobSet, and troubleshooting suggestions, see the JobSet Troubleshooting page. This page covers common issues such as the "Webhook not available" error, child Jobs or Pods that are not created, and problems resuming preempted workloads using JobSet and Kueue.
TPU initialization failed
The following issue occurs when GKE can't provision new TPU workloads due to lack of permission to access TPU devices.
The error message is similar to the following:
TPU platform initialization failed: FAILED_PRECONDITION: Couldn't mmap: Resource
temporarily unavailable.; Unable to create Node RegisterInterface for node 0,
config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: ""
dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true;
could not create driver instance
To resolve this issue, either run your TPU container in privileged mode or increase the ulimit inside your container.
Scheduling deadlock
Scheduling of two or more Jobs might fail in a deadlock. For example, consider a scenario where all of the following occur:
- You have two Jobs (Job A and Job B) with Pod affinity rules. GKE schedules the TPU slices for both Jobs with a TPU topology of v4-32.
- You have two v4-32 TPU slices in the cluster.
- Your cluster has ample capacity to schedule both Jobs and, in theory, each Job can be quickly scheduled on each TPU slice.
- The Kubernetes scheduler schedules one Pod from Job A on one slice, and then schedules one Pod from Job B on the same slice.
In this case, given the Pod affinity rules for Job A, the scheduler attempts to schedule all remaining Pods for Job A and for Job B, on a single TPU slice each. As a result, GKE won't be able to fully schedule either Job A or Job B. Hence, the status of both Jobs will remain Pending.
To resolve this issue, use Pod anti-affinity with cloud.google.com/gke-nodepool as the topologyKey, as shown in the following example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 2
  template:
    metadata:
      labels:
        job: pi
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: job
                operator: In
                values:
                - pi
            topologyKey: cloud.google.com/gke-nodepool
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: job
                operator: NotIn
                values:
                - pi
            topologyKey: cloud.google.com/gke-nodepool
            namespaceSelector:
              matchExpressions:
              - key: kubernetes.io/metadata.name
                operator: NotIn
                values:
                - kube-system
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["sleep", "60"]
      restartPolicy: Never
  backoffLimit: 4
Permission denied during cluster creation in us-central2
If you are attempting to create a cluster in us-central2
(the only region
where TPU v4 is available), then you may encounter an error message similar to
the following:
ERROR: (gcloud.container.clusters.create) ResponseError: code=403,
message=Permission denied on 'locations/us-central2' (or it may not exist).
This error is because the region us-central2
is a private region.
To resolve this issue, file a support case or reach out to your
account team to ask for us-central2
to be made visible within your
Google Cloud project.
Insufficient quota during TPU node pool creation in us-central2
If you are attempting to create a TPU slice node pool in us-central2
(the only
region where TPU v4 is available), then you may need to increase the following
GKE-related quotas when you first create TPU v4 node pools:
- Persistent Disk SSD (GB) quota in us-central2: The boot disk of each Kubernetes node requires 100 GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes you anticipate creating in us-central2 and 100 GB (maximum_nodes X 100 GB).
- In-use IP addresses quota in us-central2: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes you anticipate creating in us-central2.
Missing subnet during GKE cluster creation
If you are attempting to create a cluster in us-central2
(the only region
where TPU v4 is available), then you may encounter an error message similar to
the following:
ERROR: (gcloud.container.clusters.create) ResponseError: code=404,
message=Not found: project <PROJECT> does not have an auto-mode subnetwork
for network "default" in region <REGION>.
A subnet is required in your VPC network to provide connectivity to your GKE nodes. However, in certain regions such as us-central2, a default subnet may not be created, even when you use the default VPC network in auto-mode (for subnet creation).
To resolve this issue, ensure that you have created a custom subnet in the region before creating your GKE cluster. This subnet must not overlap with other subnets created in other regions in the same VPC network.
View GKE TPU logs
When GKE system and workload logging are enabled, Cloud Logging offers a centralized location to query all TPU-related logs for a specific workload. In Cloud Logging, logs are organized into log entries, and each individual log entry has a structured format. The following is an example of a TPU training job log entry.
{
  insertId: "gvqk7r5qc5hvogif"
  labels: {
    compute.googleapis.com/resource_name: "gke-tpu-9243ec28-wwf5"
    k8s-pod/batch_kubernetes_io/controller-uid: "443a3128-64f3-4f48-a4d3-69199f82b090"
    k8s-pod/batch_kubernetes_io/job-name: "mnist-training-job"
    k8s-pod/controller-uid: "443a3128-64f3-4f48-a4d3-69199f82b090"
    k8s-pod/job-name: "mnist-training-job"
  }
  logName: "projects/gke-tpu-demo-project/logs/stdout"
  receiveTimestamp: "2024-06-26T05:52:39.652122589Z"
  resource: {
    labels: {
      cluster_name: "tpu-test"
      container_name: "tensorflow"
      location: "us-central2-b"
      namespace_name: "default"
      pod_name: "mnist-training-job-l74l8"
      project_id: "gke-tpu-demo-project"
    }
    type: "k8s_container"
  }
  severity: "INFO"
  textPayload: "
1/938 [..............................] - ETA: 13:36 - loss: 2.3238 - accuracy: 0.0469
6/938 [..............................] - ETA: 9s - loss: 2.1227 - accuracy: 0.2995
13/938 [..............................] - ETA: 8s - loss: 1.7952 - accuracy: 0.4760
20/938 [..............................] - ETA: 7s - loss: 1.5536 - accuracy: 0.5539
27/938 [..............................] - ETA: 7s - loss: 1.3590 - accuracy: 0.6071
36/938 [>.............................] - ETA: 6s - loss: 1.1622 - accuracy: 0.6606
44/938 [>.............................] - ETA: 6s - loss: 1.0395 - accuracy: 0.6935
51/938 [>.............................] - ETA: 6s - loss: 0.9590 - accuracy: 0.7160
……
937/938 [============================>.] - ETA: 0s - loss: 0.2184 - accuracy: 0.9349"
  timestamp: "2024-06-26T05:52:38.962950115Z"
}
Each log entry from the TPU slice nodes has the label compute.googleapis.com/resource_name with the value set to the node name.
If you want to view the logs from a particular node and you know the node name, you can filter the logs by that node in your query. For example, the following query shows the logs from the TPU node gke-tpu-9243ec28-wwf5:
resource.type="k8s_container"
labels."compute.googleapis.com/resource_name" = "gke-tpu-9243ec28-wwf5"
GKE attaches the cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology labels to all nodes containing TPUs. So, if you are not sure about the node name or you want to list all the TPU slice nodes, you can run the following command:
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator
Sample output:
NAME STATUS ROLES AGE VERSION
gke-tpu-9243ec28-f2f1 Ready <none> 25m v1.30.1-gke.1156000
gke-tpu-9243ec28-wwf5 Ready <none> 7d22h v1.30.1-gke.1156000
You can do additional filtering based on the node labels and their values. For example, the following command lists TPU nodes with a specific type and topology:
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice,cloud.google.com/gke-tpu-topology=1x1
To view all the logs across the TPU slice nodes, you can use a query that matches the label against the node name prefix shared by the TPU slice nodes. For example, use the following query:
resource.type="k8s_container"
labels."compute.googleapis.com/resource_name" =~ "gke-tpu-9243ec28.*"
log_id("stdout")
To view the logs associated with a particular TPU workload using a Kubernetes Job, you can filter the logs using the batch.kubernetes.io/job-name label. For example, for the job mnist-training-job, you can run the following query for the STDOUT logs:
resource.type="k8s_container"
labels."k8s-pod/batch_kubernetes_io/job-name" = "mnist-training-job"
log_id("stdout")
To view the logs for a TPU workload using a Kubernetes JobSet,
you can filter the logs using the k8s-pod/jobset_sigs_k8s_io/jobset-name
label.
For example:
resource.type="k8s_container"
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="multislice-job"
To drill down further, you can filter based on the other workload labels. For example, to view the logs for a multislice workload from worker 0 and slice 1, you can filter based on the job-completion-index and job-index labels:
resource.type="k8s_container"
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="multislice-job"
labels."k8s-pod/batch_kubernetes_io/job-completion-index"="0"
labels."k8s-pod/jobset_sigs_k8s_io/job-index"="1"
You can also filter using the Pod name pattern:
resource.labels.pod_name:<jobSetName>-<replicateJobName>-<job-index>-<worker-index>
For example, in the following query the jobSetName is multislice-job, and the replicateJobName is slice. Both job-index and worker-index are 0:
resource.type="k8s_container"
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="multislice-job"
resource.labels.pod_name:"multislice-job-slice-0-0"
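The Pod name pattern can also be assembled programmatically. The following is a minimal sketch that only builds the filter text; it does not call Cloud Logging. The function names are hypothetical, and the naming pattern is the one described above.

```python
def jobset_pod_name(jobset_name: str, replicated_job_name: str,
                    job_index: int, worker_index: int) -> str:
    """Build the Pod name prefix for one worker of one slice."""
    return f"{jobset_name}-{replicated_job_name}-{job_index}-{worker_index}"


def jobset_pod_filter(jobset_name: str, replicated_job_name: str,
                      job_index: int, worker_index: int) -> str:
    """Build a Logging filter for one worker of a JobSet workload."""
    pod_name = jobset_pod_name(jobset_name, replicated_job_name,
                               job_index, worker_index)
    lines = [
        'resource.type="k8s_container"',
        f'labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="{jobset_name}"',
        # The ":" operator matches a substring, so the random Pod suffix
        # does not need to be known.
        f'resource.labels.pod_name:"{pod_name}"',
    ]
    return "\n".join(lines)


print(jobset_pod_filter("multislice-job", "slice", 0, 0))
```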
For other TPU workloads, such as a single GKE Pod workload, you can filter the logs by Pod name. For example:
resource.type="k8s_container"
resource.labels.pod_name="tpu-job-jax-demo"
If you want to check if the TPU device plugin is running correctly, you can use the following query to check its container logs:
resource.type="k8s_container"
labels."k8s-pod/k8s-app"="tpu-device-plugin"
resource.labels.namespace_name="kube-system"
Run the following query to check the related events:
jsonPayload.involvedObject.name=~"tpu-device-plugin.*"
log_id("events")
For all queries, you can add additional filters, such as cluster name, location, and project ID. You can also combine conditions to narrow down the results. For example:
resource.type="k8s_container" AND
resource.labels.project_id="gke-tpu-demo-project" AND
resource.labels.location="us-west1" AND
resource.labels.cluster_name="tpu-demo" AND
resource.labels.namespace_name="default" AND
labels."compute.googleapis.com/resource_name" =~ "gke-tpu-9243ec28.*" AND
labels."k8s-pod/batch_kubernetes_io/job-name" = "mnist-training-job" AND
log_id("stdout")
The AND operator is optional between comparisons and can be omitted. For more information about the query language, read the Logging query language specification. You can also read Kubernetes related log queries for more query examples.
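If you generate queries from scripts, combining conditions is a matter of joining them with AND. The following sketch shows one way to do that; the helper names are hypothetical and the code only produces the filter string, which you can paste into the Logs Explorer.

```python
def resource_label_conditions(**labels: str) -> list:
    """Turn keyword arguments into resource.labels comparisons."""
    return [f'resource.labels.{key}="{value}"' for key, value in labels.items()]


def combine_conditions(conditions) -> str:
    """Join non-empty conditions with an explicit AND, one per line."""
    cleaned = [c.strip() for c in conditions if c.strip()]
    return " AND\n".join(cleaned)


query = combine_conditions(
    ['resource.type="k8s_container"']
    + resource_label_conditions(cluster_name="tpu-demo", namespace_name="default")
    + ['log_id("stdout")']
)
print(query)
```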
If you prefer SQL using Log Analytics, you can find query examples at SQL query with Log Analytics. Alternatively, you can also run the queries using the Google Cloud CLI instead of in the Logs Explorer. For example:
gcloud logging read 'resource.type="k8s_container" labels."compute.googleapis.com/resource_name" =~ "gke-tpu-9243ec28.*" log_id("stdout")' --limit 10 --format json