Cloud TPU inference

Serving is the process of deploying a trained machine learning model to a production environment, where it is used for inference. Inference is supported on TPU v5e and newer TPU versions. Serving workloads prioritize latency service-level objectives (SLOs).

This document discusses serving a model on a single-host TPU. TPU slices with eight or fewer chips have one TPU VM, or host, and are called single-host TPUs. For information about multi-host inference, see Perform multihost inference using Pathways.

Get started

You need a Google Cloud account and project to use Cloud TPU. For more information, see Set up a Cloud TPU environment.

Ensure that you have sufficient quota for the number of TPU cores that you plan to use for inference. TPU v5e uses separate quotas for training and serving. The serving-specific quotas for TPU v5e are:

  • On-demand v5e resources: TPUv5 lite pod cores for serving per project per zone
  • Preemptible v5e resources: Preemptible TPU v5 lite pod cores for serving per project per zone

For other TPU versions, training and serving workloads use the same quota. For more information, see Cloud TPU quotas.
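After you confirm quota, you can create a single-host TPU slice for serving. The following is a minimal, hedged sketch using the google-cloud-tpu Python client; the project ID, zone, node ID, accelerator type (v5litepod-8), and runtime version are illustrative placeholders, and you can also create the slice with the gcloud CLI or the Google Cloud console.

    # Sketch: create a single-host v5e slice with the google-cloud-tpu Python
    # client. Project, zone, node ID, accelerator type, and runtime version are
    # placeholder assumptions; check v5e availability in your chosen zone.
    from google.cloud import tpu_v2

    client = tpu_v2.TpuClient()

    node = tpu_v2.Node(
        accelerator_type="v5litepod-8",          # example single-host v5e slice
        runtime_version="v2-alpha-tpuv5-lite",   # example runtime version
    )

    operation = client.create_node(
        parent="projects/PROJECT_ID/locations/ZONE",
        node_id="my-serving-tpu",
        node=node,
    )
    print(operation.result())  # blocks until the TPU VM is ready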

Serve LLMs with vLLM

vLLM is an open-source library designed for fast inference and serving of large language models (LLMs). Cloud TPU integrates with vLLM using the tpu-inference plugin, which supports JAX and PyTorch models. For more information, see the tpu-inference GitHub repository.
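To illustrate what offline inference with vLLM looks like, here is a minimal sketch. It assumes vLLM and the tpu-inference plugin are installed on the TPU VM; the model name is only an example placeholder that you would replace with a model you have access to.

    # Minimal offline-inference sketch with vLLM on a single-host TPU.
    # Assumes vLLM and the tpu-inference plugin are installed; the model
    # name below is an example placeholder.
    from vllm import LLM, SamplingParams

    prompts = [
        "What is a Cloud TPU?",
        "Explain the difference between training and serving.",
    ]
    sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

    # The TPU backend is selected through the installed plugin.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    for output in llm.generate(prompts, sampling_params):
        print(output.outputs[0].text)

For online serving, vLLM also provides an OpenAI-compatible server that you can start with the vllm serve command and query over HTTP.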

For examples of using vLLM to serve a model on TPUs, see the following:

Profiling

After setting up inference, you can use profilers to analyze the performance and TPU utilization of your serving workload. For more information about profiling, see:
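As one example of capturing a profile, the JAX profiler can record a trace of an inference step for inspection in TensorBoard or Perfetto. The snippet below is a self-contained sketch in which a toy jitted function stands in for your model; the trace directory is a placeholder.

    # Sketch: capture a TPU trace with the JAX profiler. The jitted matmul is a
    # stand-in for a real inference call; /tmp/tpu-profile is a placeholder
    # output directory that you can open in TensorBoard or Perfetto.
    import jax
    import jax.numpy as jnp

    @jax.jit
    def forward(x, w):
        return jnp.dot(x, w)

    x = jnp.ones((128, 1024))
    w = jnp.ones((1024, 1024))
    forward(x, w).block_until_ready()      # warm-up compile outside the trace

    jax.profiler.start_trace("/tmp/tpu-profile")
    forward(x, w).block_until_ready()      # traced inference step
    jax.profiler.stop_trace()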