Run TensorFlow code on TPU Pod slices

This document shows you how to perform a calculation using TensorFlow on a TPU Pod. You will perform the following steps:

  1. Create a TPU Pod slice with TensorFlow software
  2. Connect to the TPU VM using SSH
  3. Create and run an example script

The TPU VM relies on a service account for permissions to call the Cloud TPU API. By default, your TPU VM uses the default Compute Engine service account, which includes all needed Cloud TPU permissions. If you use your own service account, you need to add the TPU Viewer role to it. You can specify your own service account using the --service-account flag when creating your TPU VM. For more information on Google Cloud roles, see Understanding roles.

Set up your environment

  1. In the Cloud Shell, run the following command to make sure you are running the current version of gcloud:

    $ gcloud components update

    If you need to install gcloud, use the following command:

    $ sudo apt install -y google-cloud-sdk
  2. Create some environment variables:

    $ export PROJECT_ID=project-id
    $ export TPU_NAME=tpu-name
    $ export ZONE=europe-west4-a
    $ export RUNTIME_VERSION=tpu-vm-tf-2.17.0-pod-pjrt
    $ export ACCELERATOR_TYPE=v3-32
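
Before creating the slice, it can help to confirm that each variable exported above is actually set in the current shell. The following is a minimal sketch in Python, not part of the official workflow; the helper name `missing_variables` is hypothetical, and the variable names are the ones used in this document.

```python
import os

# Variable names taken from the export commands in this document.
EXPECTED_VARS = ("PROJECT_ID", "TPU_NAME", "ZONE", "RUNTIME_VERSION", "ACCELERATOR_TYPE")


def missing_variables(env=None):
    """Return the expected variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Unset variables: " + ", ".join(missing))
    else:
        print("All expected variables are set.")
```

Passing a dictionary instead of relying on `os.environ` makes the check easy to try without touching the real environment.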

Create a v3-32 TPU Pod slice with the TensorFlow runtime

$ gcloud compute tpus tpu-vm create ${TPU_NAME} \
  --zone=${ZONE} \
  --accelerator-type=${ACCELERATOR_TYPE} \
  --version=${RUNTIME_VERSION}

Command flag descriptions

zone
The zone where you plan to create your Cloud TPU.
accelerator-type
The accelerator type specifies the version and size of the Cloud TPU you want to create. For more information about supported accelerator types for each TPU version, see TPU versions.
version
The Cloud TPU software version.
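
If you drive the create step from a script, the same command can be assembled as an argument list. The sketch below is illustrative only: `build_create_command` is a hypothetical helper, it uses just the flags named in this document (including the optional --service-account flag), and it prints the command for review rather than running it.

```python
import shlex


def build_create_command(tpu_name, zone, accelerator_type, runtime_version,
                         service_account=None):
    """Assemble the argument list for the create command (hypothetical helper)."""
    cmd = [
        "gcloud", "compute", "tpus", "tpu-vm", "create", tpu_name,
        f"--zone={zone}",
        f"--accelerator-type={accelerator_type}",
        f"--version={runtime_version}",
    ]
    if service_account:
        # Optional: use your own service account instead of the default one.
        cmd.append(f"--service-account={service_account}")
    return cmd


# Placeholder values copied from the environment variables in this document.
cmd = build_create_command("tpu-name", "europe-west4-a", "v3-32",
                           "tpu-vm-tf-2.17.0-pod-pjrt")
# Print a shell-quoted form for review; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building a list instead of a single string avoids quoting mistakes such as a stray brace in a variable reference.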

Connect to your Cloud TPU VM using SSH

$ gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
      --zone=${ZONE}

Create and run an example script

  1. Set the following environment variables.

    (vm)$ export TPU_NAME=tpu-name
    (vm)$ export TPU_LOAD_LIBRARY=0
  2. Create a file named tpu-test.py in the current directory and paste the following script into it.

    import tensorflow as tf
    print("TensorFlow version " + tf.__version__)
    
    # Locate the TPU cluster and list its worker addresses.
    cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', cluster_resolver.cluster_spec().as_dict()['worker'])
    
    # Connect to the TPU workers and initialize the TPU system before use.
    tf.config.experimental_connect_to_cluster(cluster_resolver)
    tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
    # tf.distribute.TPUStrategy replaces the deprecated experimental alias.
    strategy = tf.distribute.TPUStrategy(cluster_resolver)
    
    @tf.function
    def add_fn(x,y):
      z = x + y
      return z
    
    x = tf.constant(1.)
    y = tf.constant(1.)
    z = strategy.run(add_fn, args=(x,y))
    print(z)
    
  3. Run this script with the following command:

    (vm)$ python3 tpu-test.py

    This script performs a calculation on each TensorCore of a TPU Pod slice. The output will look similar to the following:

    PerReplica:{
      0: tf.Tensor(2.0, shape=(), dtype=float32),
      1: tf.Tensor(2.0, shape=(), dtype=float32),
      2: tf.Tensor(2.0, shape=(), dtype=float32),
      3: tf.Tensor(2.0, shape=(), dtype=float32),
      4: tf.Tensor(2.0, shape=(), dtype=float32),
      5: tf.Tensor(2.0, shape=(), dtype=float32),
      6: tf.Tensor(2.0, shape=(), dtype=float32),
      7: tf.Tensor(2.0, shape=(), dtype=float32),
      8: tf.Tensor(2.0, shape=(), dtype=float32),
      9: tf.Tensor(2.0, shape=(), dtype=float32),
      10: tf.Tensor(2.0, shape=(), dtype=float32),
      11: tf.Tensor(2.0, shape=(), dtype=float32),
      12: tf.Tensor(2.0, shape=(), dtype=float32),
      13: tf.Tensor(2.0, shape=(), dtype=float32),
      14: tf.Tensor(2.0, shape=(), dtype=float32),
      15: tf.Tensor(2.0, shape=(), dtype=float32),
      16: tf.Tensor(2.0, shape=(), dtype=float32),
      17: tf.Tensor(2.0, shape=(), dtype=float32),
      18: tf.Tensor(2.0, shape=(), dtype=float32),
      19: tf.Tensor(2.0, shape=(), dtype=float32),
      20: tf.Tensor(2.0, shape=(), dtype=float32),
      21: tf.Tensor(2.0, shape=(), dtype=float32),
      22: tf.Tensor(2.0, shape=(), dtype=float32),
      23: tf.Tensor(2.0, shape=(), dtype=float32),
      24: tf.Tensor(2.0, shape=(), dtype=float32),
      25: tf.Tensor(2.0, shape=(), dtype=float32),
      26: tf.Tensor(2.0, shape=(), dtype=float32),
      27: tf.Tensor(2.0, shape=(), dtype=float32),
      28: tf.Tensor(2.0, shape=(), dtype=float32),
      29: tf.Tensor(2.0, shape=(), dtype=float32),
      30: tf.Tensor(2.0, shape=(), dtype=float32),
      31: tf.Tensor(2.0, shape=(), dtype=float32)
    }
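
The output has one entry per replica because strategy.run executes add_fn once on every TensorCore in the slice. The sketch below imitates that behavior in plain Python, without TensorFlow or a TPU. `parse_accelerator_type` and `run_on_replicas` are hypothetical stand-ins, and the code assumes the numeric suffix of the accelerator type equals the number of TensorCores. That assumption matches the 32 entries shown for v3-32 here but may not hold for every TPU version.

```python
def parse_accelerator_type(accelerator_type):
    """Split e.g. 'v3-32' into ('v3', 32). Hypothetical helper."""
    version, _, size = accelerator_type.rpartition("-")
    return version, int(size)


def run_on_replicas(fn, args, num_replicas):
    """Call fn once per replica and collect results keyed by replica index.

    A plain-Python stand-in for strategy.run, not the TensorFlow implementation.
    """
    return {replica: fn(*args) for replica in range(num_replicas)}


def add_fn(x, y):
    return x + y


# Assumption: for v3-32 the suffix is the TensorCore (replica) count.
version, num_replicas = parse_accelerator_type("v3-32")
results = run_on_replicas(add_fn, (1.0, 1.0), num_replicas)
print(f"{version}: {len(results)} replicas, each returned {results[0]}")
```

Each replica receives the same inputs, so every entry holds the same sum, which mirrors the identical tensors in the output above.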
    

Clean up

When you are done with your TPU VM, follow these steps to clean up your resources.

  1. Disconnect from the TPU VM:

    (vm)$ exit
  2. Delete your Cloud TPU.

    $ gcloud compute tpus tpu-vm delete ${TPU_NAME} \
      --zone=${ZONE}
  3. Verify the resources have been deleted by running the following command. Make sure your TPU is no longer listed. The deletion might take several minutes.

    $ gcloud compute tpus tpu-vm list \
      --zone=${ZONE}
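
The verification step can also be scripted. The following is a minimal sketch, assuming you have captured the list output as text with one TPU name per line, for example by adding gcloud's --format="value(name)" option to the list command. Whether names come back short or as full resource paths is not confirmed here, so the hypothetical helper `tpu_still_listed` accepts both. The sample text is illustrative, not real command output.

```python
def short_name(resource_name):
    """Return the last path segment, e.g. 'projects/p/locations/z/nodes/n' -> 'n'."""
    return resource_name.strip().rsplit("/", 1)[-1]


def tpu_still_listed(list_output, tpu_name):
    """Return True if tpu_name appears in newline-separated list output.

    Assumes one TPU name per line, as produced by a names-only output format.
    """
    names = {short_name(line) for line in list_output.splitlines() if line.strip()}
    return tpu_name in names


# Sample text standing in for captured command output (illustrative values).
sample_output = "projects/project-id/locations/europe-west4-a/nodes/other-tpu\n"
print(tpu_still_listed(sample_output, "tpu-name"))
```

Because deletion might take several minutes, a script would typically repeat this check after a delay until it returns False.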