Run PyTorch code on TPU slices
==============================

Before running the commands in this document, make sure you have followed the
instructions in [Set up an account and Cloud TPU project](/tpu/docs/setup-gcp-account).

After you have your PyTorch code running on a single TPU VM, you can scale up
your code by running it on a [TPU slice](/tpu/docs/system-architecture-tpu-vm#slices).
TPU slices are multiple TPU boards connected to each other over dedicated
high-speed network connections. This document is an introduction to running
PyTorch code on TPU slices.
Create a Cloud TPU slice
------------------------
1. Define some environment variables to make the commands easier to use.
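
    ```bash
    export PROJECT_ID=your-project-id
    export TPU_NAME=your-tpu-name
    export ZONE=europe-west4-b
    export ACCELERATOR_TYPE=v5p-32
    export RUNTIME_VERSION=v2-alpha-tpuv5
    ```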

    Environment variable descriptions:

    PROJECT_ID
    Your Google Cloud project ID. Use an existing project or
    create a new one.

    TPU_NAME
    The name of the TPU.

    ZONE
    The zone in which to create the TPU VM. For more information about
    supported zones, see TPU regions and zones.

    ACCELERATOR_TYPE
    The accelerator type specifies the version and size of the Cloud TPU you
    want to create. For more information about supported accelerator types for
    each TPU version, see TPU versions.

    RUNTIME_VERSION
    The Cloud TPU software version.
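
2. Create your TPU VM by running the following command:

    ```bash
    $ gcloud compute tpus tpu-vm create ${TPU_NAME} \
        --zone=${ZONE} \
        --project=${PROJECT_ID} \
        --accelerator-type=${ACCELERATOR_TYPE} \
        --version=${RUNTIME_VERSION}
    ```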
Install PyTorch/XLA on your slice
---------------------------------

After creating the TPU slice, you must install PyTorch on all hosts in the
TPU slice. You can do this using the `gcloud compute tpus tpu-vm ssh` command with
the `--worker=all` and `--command` parameters.
If the following commands fail due to an SSH connection error, it might be
because the TPU VMs don't have external IP addresses. To access a TPU VM without
an external IP address, follow the instructions in [Connect to a TPU VM without
a public IP address](/tpu/docs/tpu-iap).
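
1. Install PyTorch/XLA on all TPU VM workers:

    ```bash
    gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
        --zone=${ZONE} \
        --project=${PROJECT_ID} \
        --worker=all \
        --command="pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html"
    ```

2. Clone XLA on all TPU VM workers:

    ```bash
    gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
        --zone=${ZONE} \
        --project=${PROJECT_ID} \
        --worker=all \
        --command="git clone https://github.com/pytorch/xla.git"
    ```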
Run a training script on your TPU slice
---------------------------------------

The training script uses a Single Program Multiple Data (SPMD) sharding
strategy. For more information on SPMD, see the
[PyTorch/XLA SPMD User Guide](https://pytorch.org/xla/release/r2.4/spmd.html).
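
In SPMD mode, a single program drives every TPU device in the slice, and
sharding is expressed by annotating tensors over a logical device mesh. The
following is a minimal sketch of that pattern, assuming the
`torch_xla.distributed.spmd` API from recent torch_xla 2.x releases; the test
script used below performs its own, more complete setup:

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

# Enable SPMD mode: the program gets a single logical view of all devices.
xr.use_spmd()

# Build a logical mesh over every TPU device in the slice, with a 'data'
# axis for batch-dimension sharding and a trivial 'model' axis.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices, 1), ('data', 'model'))

# Shard an input batch along the 'data' axis; the remaining dimensions
# are replicated across the mesh.
batch = torch.randn(128, 3, 224, 224).to(xm.xla_device())
xs.mark_sharding(batch, mesh, ('data', None, None, None))
```

Run the training script on all workers:

```bash
gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
    --zone=${ZONE} \
    --project=${PROJECT_ID} \
    --worker=all \
    --command="PJRT_DEVICE=TPU python3 ~/xla/test/spmd/test_train_spmd_imagenet.py \
    --fake_data \
    --model=resnet50 \
    --num_epochs=1 2>&1 | tee ~/logs.txt"
```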
The training takes about 15 minutes. When it completes, you should see a message
similar to the following:

```
Epoch 1 test end 23:49:15, Accuracy=100.00
     10.164.0.11 [0] Max Accuracy: 100.00%
```
Clean up
--------
When you are done with your TPU VM, follow these steps to clean up your resources.
1. Disconnect from the Cloud TPU instance, if you have not already
   done so:

    ```bash
    (vm)$ exit
    ```

    Your prompt should now be `username@projectname`, showing you are in the
    Cloud Shell.
2. Delete your Cloud TPU resources:

    ```bash
    $ gcloud compute tpus tpu-vm delete ${TPU_NAME} \
        --zone=${ZONE}
    ```
3. Verify the resources have been deleted by running `gcloud compute tpus tpu-vm list`. The
   deletion might take several minutes. The output from the following command
   shouldn't include any of the resources created in this tutorial:
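
    ```bash
    $ gcloud compute tpus tpu-vm list --zone=${ZONE}
    ```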
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-11 UTC."],[],[],null,["# Run PyTorch code on TPU slices\n==============================\n\nBefore running the commands in this document, make sure you have followed the\ninstructions in [Set up an account and Cloud TPU project](/tpu/docs/setup-gcp-account).\n\nAfter you have your PyTorch code running on a single TPU VM, you can scale up\nyour code by running it on a [TPU slice](/tpu/docs/system-architecture-tpu-vm#slices).\nTPU slices are multiple TPU boards connected to each other over dedicated\nhigh-speed network connections. This document is an introduction to running\nPyTorch code on TPU slices.\n\nCreate a Cloud TPU slice\n------------------------\n\n1. Define some environment variables to make the commands easier to use.\n\n\n ```bash\n export PROJECT_ID=your-project-id\n export TPU_NAME=your-tpu-name\n export ZONE=europe-west4-b\n export ACCELERATOR_TYPE=v5p-32\n export RUNTIME_VERSION=v2-alpha-tpuv5\n ``` \n\n #### Environment variable descriptions\n\n \u003cbr /\u003e\n\n2. Create your TPU VM by running the following command:\n\n ```bash\n $ gcloud compute tpus tpu-vm create ${TPU_NAME} \\\n --zone=${ZONE} \\\n --project=${PROJECT_ID} \\\n --accelerator-type=${ACCELERATOR_TYPE} \\\n --version=${RUNTIME_VERSION}\n ```\n\nInstall PyTorch/XLA on your slice\n---------------------------------\n\nAfter creating the TPU slice, you must install PyTorch on all hosts in the\nTPU slice. You can do this using the `gcloud compute tpus tpu-vm ssh` command using\nthe `--worker=all` and `--commamnd` parameters.\n\nIf the following commands fail due to an SSH connection error, it might be\nbecause the TPU VMs don't have external IP addresses. To access a TPU VM without\nan external IP address, follow the instructions in [Connect to a TPU VM without\na public IP address](/tpu/docs/tpu-iap).\n\n1. Install PyTorch/XLA on all TPU VM workers:\n\n ```bash\n gcloud compute tpus tpu-vm ssh ${TPU_NAME} \\\n --zone=${ZONE} \\\n --project=${PROJECT_ID} \\\n --worker=all \\\n --command=\"pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html\"\n ```\n2. Clone XLA on all TPU VM workers:\n\n ```bash\n gcloud compute tpus tpu-vm ssh ${TPU_NAME} \\\n --zone=${ZONE} \\\n --project=${PROJECT_ID} \\\n --worker=all \\\n --command=\"git clone https://github.com/pytorch/xla.git\"\n ```\n\nRun a training script on your TPU slice\n---------------------------------------\n\nRun the training script on all workers. The training script uses a Single Program\nMultiple Data (SPMD) sharding strategy. For more information on SPMD, see\n[PyTorch/XLA SPMD User Guide](https://pytorch.org/xla/release/r2.4/spmd.html). \n\n```bash\ngcloud compute tpus tpu-vm ssh ${TPU_NAME} \\\n --zone=${ZONE} \\\n --project=${PROJECT_ID} \\\n --worker=all \\\n --command=\"PJRT_DEVICE=TPU python3 ~/xla/test/spmd/test_train_spmd_imagenet.py \\\n --fake_data \\\n --model=resnet50 \\\n --num_epochs=1 2\u003e&1 | tee ~/logs.txt\"\n```\n\nThe training takes about 15 minutes. 
When it completes, you should see a message\nsimilar to the following: \n\n```\nEpoch 1 test end 23:49:15, Accuracy=100.00\n 10.164.0.11 [0] Max Accuracy: 100.00%\n```\n\nClean up\n--------\n\nWhen you are done with your TPU VM, follow these steps to clean up your resources.\n\n1. Disconnect from the Cloud TPU instance, if you have not already\n done so:\n\n ```bash\n (vm)$ exit\n ```\n\n Your prompt should now be `username@projectname`, showing you are in the\n Cloud Shell.\n2. Delete your Cloud TPU resources.\n\n ```bash\n $ gcloud compute tpus tpu-vm delete \\\n --zone=${ZONE}\n ```\n3. Verify the resources have been deleted by running `gcloud compute tpus tpu-vm list`. The\n deletion might take several minutes. The output from the following command\n shouldn't include any of the resources created in this tutorial:\n\n ```bash\n $ gcloud compute tpus tpu-vm list --zone=${ZONE}\n ```"]]