Troubleshoot your Dataflow TPU job

If you run into problems running your Dataflow job with TPUs, use the following troubleshooting steps to resolve your issue.

Troubleshoot your container image

It can be helpful to debug your container and TPU software on a standalone VM. You can debug with a VM created by a GKE node pool, or you can debug on a running Dataflow worker VM.

Debug with a standalone VM

To debug your container on a standalone VM, you can create a GKE node pool that uses the same type of TPU VM for local experimentation. For example, creating a GKE node pool with one TPU V5 Lite (v5e) device in us-west1-c looks like the following:

  1. Create a GKE cluster.

    gcloud container clusters create TPU_CLUSTER_NAME \
      --project PROJECT_ID \
      --release-channel=stable \
      --scopes=cloud-platform \
      --enable-ip-alias \
      --location us-west1-c
    
  2. Create a GKE node pool.

    gcloud container node-pools create TPU_NODE_POOL_NAME \
      --project PROJECT_ID \
      --location=us-west1-c \
      --cluster=TPU_CLUSTER_NAME \
      --node-locations=us-west1-c \
      --machine-type=ct5lp-hightpu-1t \
      --num-nodes=1 \
      [ --reservation RESERVATION_NAME \
      --reservation-affinity=specific ]
    
  3. Find the VM name of the TPU node in the node pool by using the GKE page in the Google Cloud console or by running the following command.

    gcloud compute instances list --filter='metadata.kube-labels:"cloud.google.com/gke-nodepool=TPU_NODE_POOL_NAME"'
    
  4. Connect to a VM created by the GKE node pool using SSH:

    gcloud compute ssh --zone "us-west1-c" "VM_NAME" --project PROJECT_ID
    
  5. After connecting to a VM using SSH, configure Docker for the Artifact Registry repository that you use.

    docker-credential-gcr configure-docker --registries=us-west1-docker.pkg.dev
    
  6. Then, start a container from the image that you use.

    docker run --privileged --network=host -it --rm --entrypoint=/bin/bash IMAGE_NAME
    
  7. Inside the container, test that TPUs are accessible.

    For example, if you have an image that uses PyTorch to access TPUs, open a Python interpreter:

    python3
    

    Then, perform a computation on a TPU device:

    import torch
    import torch_xla.core.xla_model as xm
    dev = xm.xla_device()
    t1 = torch.randn(3,3,device=dev)
    t2 = torch.randn(3,3,device=dev)
    print(t1 + t2)
    

    Sample output:

    tensor([[ 0.3355, -1.4628, -3.2610],
            [-1.4656,  0.3196, -2.8766],
            [ 0.8667, -1.5060,  0.7125]], device='xla:0')
    
  8. If the computation fails, your image might not be properly configured.

    For example, you might need to set the required environment variables in the image Dockerfile. To confirm, retry the computation after setting the environment variables manually as follows:

    export TPU_SKIP_MDS_QUERY=1 # Don't query metadata
    export TPU_HOST_BOUNDS=1,1,1 # There's only one host
    export TPU_CHIPS_PER_HOST_BOUNDS=1,1,1 # 1 chip per host
    export TPU_WORKER_HOSTNAMES=localhost
    export TPU_WORKER_ID=0 # Always 0 for single-host TPUs
    export TPU_ACCELERATOR_TYPE=v5litepod-1 # Because we use a v5e 1x1 accelerator.
    

    If PyTorch or libtpu dependencies are missing, retry the computation after installing them by using the following command:

    # Install PyTorch with TPU support
    pip install torch torch_xla[tpu] torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
    

Debug by using a Dataflow VM

As an alternative, you can connect to a Dataflow worker VM instance by using SSH while a job is running. Because Dataflow worker VMs shut down after the pipeline completes, you might need to artificially extend the job's runtime, for example by adding a computation that waits for a prolonged period of time.

Because a TPU device can't be shared between multiple processes, you might need to run a pipeline that doesn't perform any computations on a TPU, so that the TPU stays free for your debugging session.
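
For example, the following is a minimal sketch of such a pipeline, assuming the Apache Beam Python SDK. It performs no TPU computation; it only sleeps inside a DoFn so that the worker VM stays up long enough for you to connect. The step names and wait duration are illustrative, and you launch it with your usual Dataflow pipeline options.

import logging
import time

import apache_beam as beam


class WaitFn(beam.DoFn):
    """Sleeps in short intervals so the worker VM stays alive for debugging."""

    def process(self, element, duration_sec=1800):
        deadline = time.time() + duration_sec
        while time.time() < deadline:
            logging.info("Keeping the worker alive for debugging...")
            time.sleep(60)
        yield element


with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "CreateOneElement" >> beam.Create([None])
        | "WaitForDebugging" >> beam.ParDo(WaitFn())
    )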

  1. Find a VM for the running TPU job by searching for the Dataflow job ID in the Google Cloud console search bar or by using the following gcloud command:

    gcloud compute instances list --project PROJECT_ID --filter "STATUS='RUNNING' AND description ~ 'Created for Dataflow job: JOB_ID'"
    
  2. After connecting to a VM with TPUs using SSH, start a container from the image that you use. For an example, see Debug with a standalone VM.

  3. Inside the container, reconfigure the TPU settings and install necessary libraries to test your setup. For an example, see Debug with a standalone VM.

Workers don't start

Before troubleshooting, verify that the following pipeline options are set correctly, as shown in the sketch after this list:

  • the --dataflow_service_options=worker_accelerator option
  • the --worker_zone option
  • the --machine_type option
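
For reference, the following is a minimal sketch of how these options might be passed when launching a Python pipeline. The project, region, zone, machine type, and accelerator values are illustrative; substitute values that match your reservation and TPU configuration.

from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative values only; replace them with your own project, zone,
# machine type, and accelerator configuration.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=PROJECT_ID",
    "--region=us-west1",
    "--worker_zone=us-west1-c",
    "--machine_type=ct5lp-hightpu-1t",
    "--dataflow_service_options=worker_accelerator=type:tpu-v5-lite-podslice;topology:1x1",
])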

Check if the console logs show that workers are starting, but the job fails with a message similar to the following:

  Workflow failed. Causes: The Dataflow job appears to be stuck because no worker
  activity has been seen in the last 25m.

The cause of these issues might be related to capacity or worker startup issues.

  • Capacity: If you use on-demand TPU capacity or a reservation that is exhausted, new pipelines might not start until capacity becomes available. If you use a reservation, check its remaining capacity on the Compute Reservations page in the Google Cloud console or with the following command:

    gcloud compute reservations describe RESERVATION_NAME --zone ZONE
    

    Check whether your job has started any worker VMs. When your job starts a worker, loggers such as worker, worker_startup, kubelet, and others generally provide output. Additionally, on the Job metrics page in the Google Cloud console, the number of current workers should be greater than zero.

  • Worker startup: Check the job-message and launcher logs. If your pipeline starts workers but they can't boot, you might have errors in your custom container.

  • Disk space: Verify that sufficient disk space is available for your job. To increase disk space, use the --disk_size_gb option.

Job fails with an error

Use the following troubleshooting advice when your job fails with an error.

Startup of worker pool failed

If you see the following error, verify that your pipeline specifies --worker_zone and that the zone matches the zone for your reservation.

JOB_MESSAGE_ERROR: Startup of the worker pool in zone ZONE failed to
bring up any of the desired 1 workers. [...] INVALID_FIELD_VALUE:
Instance 'INSTANCE_NAME' creation failed: Invalid value for field
'resource.reservationAffinity': '{ "consumeReservationType":
"SPECIFIC_ALLOCATION", "key":
"compute.googleapis.com/RESERVATION_NAME...'. Specified reservations
[RESERVATION_NAME] do not exist.

Managed instance groups don't support Cloud TPUs

If you see the following error, contact your account team to verify whether your project has been enrolled to use TPUs, or file a bug using the Google Issue Tracker.

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow
pipeline failed. State: FAILED, Error: Workflow failed. Causes: One or more
operations had an error [...]: [INVALID_FIELD_VALUE] 'Invalid value
for field 'resource.instanceTemplate': Managed Instance Groups do not support
Cloud TPUs. '.

Invalid value for field

If you see the following error, verify that your pipeline invocation sets the worker_accelerator Dataflow service option.

JOB_MESSAGE_ERROR: Workflow failed. Causes: One or more operations had an error:
'operation-[...]': [INVALID_FIELD_VALUE] 'Invalid value for field
'resource.instanceTemplate': 'projects/[...]-harness'. Regional
Managed Instance Groups do not support Cloud TPUs.'

Device or resource busy

If you see the following error, a Dataflow worker processing your pipeline is likely running more than one process that accesses the TPU at the same time, which isn't supported. For more information, see TPUs and worker parallelism.

RuntimeError: TPU initialization failed: open(/dev/vfio/0): Device or resource
busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0

If you see the preceding error while debugging your pipeline on a VM, you can inspect and terminate the process that is holding up the TPU by using the following commands:

apt update ; apt install lsof
lsof -w /dev/vfio/0
kill -9 PROCESS_ID    # to terminate the process.

Instances with guest accelerators do not support live migration

If you see the following error, the pipeline was likely launched with an explicitly set machine type that has accelerators, but the accelerator configuration wasn't specified correctly. Verify that your pipeline invocation sets the worker_accelerator Dataflow service option, and make sure that the option name doesn't contain typos.

JOB_MESSAGE_ERROR: Startup of the worker pool in zone ZONE failed to
bring up any of the desired 1 workers. [...] UNSUPPORTED_OPERATION:
Instance INSTANCE_ID creation failed: Instances with guest
accelerators do not support live migration.

The workflow was automatically rejected by the service

An error similar to the following might also appear if some of the required pipeline options are missing or incorrect:

The workflow was automatically rejected by the service. The requested
accelerator type tpu-v5-lite-podslice;topology:1x1 requires setting
the worker machine type to ct5lp-hightpu-1t. Learn more at:
https://cloud.google.com/dataflow/docs/guides/configure-worker-vm

Timed out waiting for an update from the worker

If you launch pipelines on TPU VMs that have many vCPUs and don't reduce the default number of worker threads, the job might encounter errors like the following:

Workflow failed. Causes WORK_ITEM failed.
The job failed because a work item has failed 4 times.
Root cause: Timed out waiting for an update from the worker.

To avoid this error, reduce the number of threads. For example, you could set: --number_of_worker_harness_threads=50.
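
If you construct your pipeline options in Python, you can add the flag alongside your other options, as in the following sketch. The value 50 is only an example; tune it for your workload.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    # ... your other pipeline options ...
    "--number_of_worker_harness_threads=50",  # Example value; adjust as needed.
])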

No TPU usage

If your pipeline runs successfully but TPU devices aren't used or aren't accessible, verify that the frameworks you are using, such as JAX or PyTorch, can access the attached devices. To troubleshoot your container image on a single VM, see Debug with a standalone VM.
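
For example, the following is a minimal sketch of a device-visibility check that you could run inside your container, assuming a PyTorch/XLA image like the one used earlier on this page. For JAX images, jax.devices() serves the same purpose.

import torch_xla.core.xla_model as xm

# Acquire the default XLA device. With a TPU attached, this prints a TPU-backed
# device such as xla:0. An error here suggests the framework can't see the TPU.
print(xm.xla_device())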