Google Cloud provides access to custom-designed machine learning accelerators called Tensor Processing Units (TPUs). TPUs are optimized to accelerate the training and inference of machine learning models, making them ideal for a variety of applications, including natural language processing, computer vision, and speech recognition.
This page describes how to deploy your models to a single-host Cloud TPU v5e or v6e for online inference in Vertex AI.
Only Cloud TPU versions v5e and v6e are supported. Other Cloud TPU generations are not supported.
To learn which locations Cloud TPU version v5e and v6e are available in, see locations.
Import your model
For deployment on Cloud TPUs, you must import your model to Vertex AI and configure it to use one of the following containers:
- prebuilt optimized TensorFlow runtime container, either the nightly version or version 2.15 or later
- prebuilt PyTorch TPU container, version 2.1 or later
- your own custom container that supports TPUs
Prebuilt optimized TensorFlow runtime container
To import and run a TensorFlow SavedModel on a Cloud TPU, the model must be TPU-optimized. If your TensorFlow SavedModel isn't already TPU-optimized, Vertex AI can optimize it automatically: when you import the model, Vertex AI applies an automatic partitioning algorithm to the unoptimized model. This optimization doesn't work on all models. If optimization fails, you must manually optimize your model.
The following sample code demonstrates how to use automatic model optimization with automatic partitioning:
from google.cloud import aiplatform

# Upload the SavedModel; Vertex AI attempts automatic partitioning on import.
model = aiplatform.Model.upload(
    display_name='TPU optimized model with automatic partitioning',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest",
    serving_container_args=[],
)
For more information on importing models, see importing models to Vertex AI.
Prebuilt PyTorch container
The instructions to import and run a PyTorch model on Cloud TPU are the same as the instructions to import and run a PyTorch model.
For example, see TorchServe for Cloud TPU v5e Inference. Then, upload the model artifacts to your Cloud Storage folder and upload your model as shown:
model = aiplatform.Model.upload(
    display_name='DenseNet TPU model from SDK PyTorch 2.1',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-tpu.2-1:latest",
    serving_container_args=[],
    serving_container_predict_route="/predictions/model",
    serving_container_health_route="/ping",
    serving_container_ports=[8080]
)
For more information, see export model artifacts for PyTorch and the tutorial notebook for Serve a PyTorch model using a prebuilt container.
Custom container
For custom containers, your model does not need to be a TensorFlow model, but it must be TPU-optimized. For information on producing a TPU-optimized model, see the following guides for common ML frameworks:
For information on serving models trained with JAX, TensorFlow, or PyTorch on Cloud TPU v5e, see Cloud TPU v5e Inference.
Make sure your custom container meets the custom container requirements.
You must raise the locked memory limit so the driver can communicate with the TPU chips over direct memory access (DMA). For example:
Command line
ulimit -l 68719476736
Python
import resource
resource.setrlimit(
    resource.RLIMIT_MEMLOCK,
    (
        68_719_476_736_000,  # soft limit
        68_719_476_736_000,  # hard limit
    ),
)
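Before starting the model server, it can help to confirm that the limit actually took effect. The following is a minimal sketch, not part of the official container requirements: it reads the current soft limit with the standard-library `resource` module and compares it against the value used in the Python example above. The helper name `memlock_limit_is_sufficient` and the constant `REQUIRED_MEMLOCK_BYTES` are illustrative assumptions.

```python
import resource

# Value taken from the setrlimit example above (assumed to be the target).
REQUIRED_MEMLOCK_BYTES = 68_719_476_736_000


def memlock_limit_is_sufficient(required_bytes: int = REQUIRED_MEMLOCK_BYTES) -> bool:
    """Return True if the current soft RLIMIT_MEMLOCK is at least required_bytes."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    # An unlimited soft limit always satisfies the requirement.
    return soft == resource.RLIM_INFINITY or soft >= required_bytes


if __name__ == "__main__":
    if not memlock_limit_is_sufficient():
        print("Locked memory limit is lower than expected; raise it before serving.")
```

Running a check like this at container startup surfaces a misconfigured limit early instead of at the first request.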
Then, see Use a custom container for inference for information on importing a model with a custom container. If you want to implement preprocessing or postprocessing logic, consider using Custom inference routines.
Create an endpoint
The instructions for creating an endpoint for Cloud TPUs are the same as the instructions for creating any endpoint.
For example, the following command creates an endpoint resource:
endpoint = aiplatform.Endpoint.create(display_name='My endpoint')
The response contains the new endpoint's ID, which you use in subsequent steps.
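If you only have the endpoint's full resource name, the ID is its last path segment. The sketch below assumes the resource name follows the `projects/PROJECT/locations/LOCATION/endpoints/ENDPOINT_ID` pattern; the function name `endpoint_id_from_resource_name` and the sample values are hypothetical.

```python
def endpoint_id_from_resource_name(resource_name: str) -> str:
    """Extract ENDPOINT_ID from a full Vertex AI endpoint resource name."""
    parts = resource_name.split("/")
    # Expected layout: projects/<p>/locations/<l>/endpoints/<id>
    if (
        len(parts) != 6
        or parts[0] != "projects"
        or parts[2] != "locations"
        or parts[4] != "endpoints"
        or not parts[5]
    ):
        raise ValueError(f"Unexpected endpoint resource name: {resource_name!r}")
    return parts[5]


# Hypothetical project, location, and ID, for illustration only.
example = "projects/my-project/locations/us-west1/endpoints/1234567890"
print(endpoint_id_from_resource_name(example))
```

In the Vertex AI SDK for Python, the endpoint object should expose the same information through its `resource_name` and `name` properties, so a helper like this is mainly useful when the name comes from logs or configuration.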
For more information on creating an endpoint, see deploy a model to an endpoint.
Deploy a model
The instructions for deploying a model to Cloud TPUs are the same as the instructions for deploying any model, except you specify one of the following supported Cloud TPU machine types:
| Machine Type | Number of TPU chips | 
|---|---|
| ct6e-standard-1t | 1 | 
| ct6e-standard-4t | 4 | 
| ct6e-standard-8t | 8 | 
| ct5lp-hightpu-1t | 1 | 
| ct5lp-hightpu-4t | 4 | 
| ct5lp-hightpu-8t | 8 | 
TPU accelerators are built into the machine type. You don't have to specify accelerator type or accelerator count.
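To catch an unsupported machine type before calling the deployment API, the table above can be encoded as a lookup. This is a small sketch; the names `TPU_CHIPS_BY_MACHINE_TYPE` and `tpu_chip_count` are illustrative and not part of the Vertex AI SDK.

```python
# Machine types and chip counts copied from the table above.
TPU_CHIPS_BY_MACHINE_TYPE = {
    "ct6e-standard-1t": 1,
    "ct6e-standard-4t": 4,
    "ct6e-standard-8t": 8,
    "ct5lp-hightpu-1t": 1,
    "ct5lp-hightpu-4t": 4,
    "ct5lp-hightpu-8t": 8,
}


def tpu_chip_count(machine_type: str) -> int:
    """Return the number of TPU chips for a supported machine type."""
    try:
        return TPU_CHIPS_BY_MACHINE_TYPE[machine_type]
    except KeyError:
        supported = ", ".join(sorted(TPU_CHIPS_BY_MACHINE_TYPE))
        raise ValueError(
            f"Unsupported TPU machine type {machine_type!r}; expected one of: {supported}"
        ) from None


print(tpu_chip_count("ct5lp-hightpu-4t"))
```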
For example, the following command deploys a model by calling deployModel:
machine_type = 'ct5lp-hightpu-1t'
deployed_model = model.deploy(
    endpoint=endpoint,
    deployed_model_display_name='My deployed model',
    machine_type=machine_type,
    traffic_percentage=100,
    min_replica_count=1,
    sync=True,
)
For more information, see deploy a model to an endpoint.
Get online inferences
The instructions for getting online inferences from a Cloud TPU are the same as the instructions for getting online inferences.
For example, the following command sends an online inference request by calling predict:
deployed_model.predict(...)
For custom containers, see the inference request and response requirements for custom containers.
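As a rough illustration of what a request carries, an online inference request body is a JSON object with an `instances` list and an optional `parameters` object. The sketch below builds such a body with the standard library only; the helper name `build_predict_request` and the sample instance are hypothetical, and the shape of each instance depends entirely on your model.

```python
import json
from typing import Any, Optional, Sequence


def build_predict_request(
    instances: Sequence[Any], parameters: Optional[dict] = None
) -> str:
    """Serialize instances (and optional parameters) into a JSON request body."""
    body = {"instances": list(instances)}
    if parameters is not None:
        body["parameters"] = parameters
    return json.dumps(body)


# Hypothetical input: one instance holding three numeric features.
print(build_predict_request([[0.1, 0.2, 0.3]]))
```

When you call `predict` through the SDK, you pass the instances directly and the client takes care of serialization; building the body by hand is mainly relevant for raw HTTP requests or custom containers.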
Securing capacity
For most regions, the TPU v5e and v6e cores per region quota for custom model serving is 0. In some regions, it is limited.
To request a quota increase, see Request a quota adjustment.
Pricing
TPU machine types are billed per hour, just like all other machine types in Vertex Prediction. For more information, see Prediction pricing.
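Because billing is hourly, a first-order cost estimate is a simple product. The sketch below assumes each replica is billed for every hour it runs; the function name `estimate_serving_cost` is hypothetical, and the hourly rate is a placeholder that you must take from the pricing page for your machine type and region.

```python
def estimate_serving_cost(hourly_rate: float, replica_count: int, hours: float) -> float:
    """Estimate cost as hourly rate x number of replicas x hours deployed.

    hourly_rate is a placeholder input, not a published price.
    """
    if hourly_rate < 0 or replica_count < 0 or hours < 0:
        raise ValueError("All inputs must be non-negative")
    return hourly_rate * replica_count * hours


# Placeholder numbers for illustration only: rate 2.0, 3 replicas, 10 hours.
print(estimate_serving_cost(2.0, 3, 10))
```

This ignores autoscaling, so treat the result as an upper or lower bound depending on whether you plug in the maximum or minimum replica count.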
What's next
- Learn how to get an online inference