Switching software versions on your Cloud TPU

Overview

The software version of the framework running on your TPU must match the version running on your local VM. This software version can now be switched on a running Cloud TPU, without deleting and recreating the TPU. This also enables configuring the Cloud TPU with specific nightly versions of software frameworks. It is still recommended to select a supported version of these frameworks.

Usage

The recommended way to switch versions is to use the cloud-tpu-client Python library.

Example usage for TensorFlow:

from cloud_tpu_client import Client
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--tpu-name',
                    type=str,
                    required=True,
                    help='Name of the TPU Instance')
parser.add_argument('--target-version',
                    type=str,
                    required=True,
                    help='target TPU Runtime version')

args = parser.parse_args()

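# Connect to the named TPU.
c = Client(args.tpu_name)
# 'ifNeeded' restarts the TPU only if the version actually changes.
c.configure_tpu_version(args.target_version, restart_type='ifNeeded')
# Block until the TPU reports healthy again.
c.wait_for_healthy()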

This configures the Cloud TPU to run the version passed as --target-version, which should match the TensorFlow version running on your local VM. Both official releases and dated nightly builds are supported.

The restart_type parameter of the configure_tpu_version API defines the TPU restart behavior when switching versions. Options are 'always' (the default) and 'ifNeeded'.

  • 'always' can be used to recover a TPU that is, for example, in status UNHEALTHY_TENSORFLOW or returning Out of Memory (OOM) errors due to resources leaked by a previous run. With this option, the TPU is restarted even if no new framework version is installed.

  • 'ifNeeded' does not restart the runtime if it is already on the right version, so it adds no significant startup time to a training script. With this option, the TPU is restarted only if it does not already have the correct framework version installed.
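The choice between the two restart types can be sketched as a small helper. This is only an illustration, not part of the library: switch_version is a hypothetical name, and the sketch assumes the client exposes a health() method that returns a status string such as 'HEALTHY' or 'UNHEALTHY_TENSORFLOW'. The client is passed in as an argument so the helper does not depend on how it was created.

```python
def switch_version(client, target_version):
    """Switch the TPU runtime version, forcing a restart only if unhealthy.

    `client` is expected to behave like cloud_tpu_client.Client.
    Returns the restart_type that was used.
    """
    # Assumption: client.health() returns a status string such as
    # 'HEALTHY' or 'UNHEALTHY_TENSORFLOW'.
    if client.health() == 'HEALTHY':
        # Healthy TPU: restart only if the version actually changes.
        restart_type = 'ifNeeded'
    else:
        # Unhealthy TPU: restart even if the version is already correct.
        restart_type = 'always'
    client.configure_tpu_version(target_version, restart_type=restart_type)
    client.wait_for_healthy()
    return restart_type


# Usage (not run here; requires a reachable TPU, 'my-tpu' is a placeholder):
#   from cloud_tpu_client import Client
#   switch_version(Client('my-tpu'), 'pytorch-1.13')
```

Forcing 'always' only for an unhealthy TPU keeps the common path fast while still allowing recovery from a bad state.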

The library communicates directly with the Cloud TPU, so this code must run on a VM in the same network as the TPU. It is recommended to include it in the same code as the rest of your model.
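One way to fold the version switch into a model script is sketched below, assuming a TensorFlow model and a TPU named 'my-tpu' (a placeholder). ensure_tpu_version and main are hypothetical names introduced for this illustration. The imports of tensorflow and cloud_tpu_client are deferred into main so that the helper can be imported and tested on a machine without those packages.

```python
def ensure_tpu_version(client, framework_version):
    """Switch the TPU to framework_version, restarting only if needed."""
    client.configure_tpu_version(framework_version, restart_type='ifNeeded')
    # Block until the TPU is ready before any model code touches it.
    client.wait_for_healthy()


def main():
    # Deferred imports: only needed when actually running against a TPU.
    import tensorflow as tf
    from cloud_tpu_client import Client

    client = Client('my-tpu')  # placeholder TPU name
    # Match the TPU runtime to the TensorFlow version on this VM.
    ensure_tpu_version(client, tf.__version__)

    # ... build and train the model here ...
```

Call main() from the script's entry point. Because 'ifNeeded' skips the restart when the version already matches, running this at the start of every training job should add little overhead.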

Additional software options

TensorFlow includes a tf.__version__ string, which is the simplest way to specify the correct version. Other software options include:

  • PyTorch - pytorch-1.13, pytorch-nightly-dev20220930, pytorch-nightly
  • JAX - tpu_driver, tpu_driver0.1-dev20200320, tpu_driver_nightly

For example, to configure a TPU to run the latest nightly build of PyTorch:

from cloud_tpu_client import Client
c = Client()
c.configure_tpu_version('pytorch-nightly', restart_type='ifNeeded')
c.wait_for_healthy()