Cloud TPU inference
Serving is the process of deploying a trained machine learning model to a production environment, where it is used for inference. Inference is supported on TPU v5e and newer TPU versions. Serving workloads typically prioritize latency service-level objectives (SLOs).
This document discusses serving a model on a single-host TPU. TPU slices with eight or fewer chips have one TPU VM, or host, and are called single-host TPUs. For information about multi-host inference, see Perform multihost inference using Pathways.
Get started
You need a Google Cloud account and project to use Cloud TPU. For more information, see Set up a Cloud TPU environment.
Ensure that you have sufficient quota for the number of TPU cores that you plan to use for inference. TPU v5e uses separate quotas for training and serving. The serving-specific quotas for TPU v5e are:
- On-demand v5e resources: TPUv5 lite pod cores for serving per project per zone
- Preemptible v5e resources: Preemptible TPU v5 lite pod cores for serving per project per zone
For other TPU versions, training and serving workloads use the same quota. For more information, see Cloud TPU quotas.
Serve LLMs with vLLM
vLLM is an open-source library designed for fast inference and serving of large language models (LLMs). Cloud TPU integrates with vLLM using the tpu-inference plugin, which supports JAX and PyTorch models. For more information, see the tpu-inference GitHub repository.
For examples of using vLLM to serve a model on TPUs, see the following:
- Get started with vLLM TPU
- Serve an LLM using TPU Trillium on GKE with vLLM
- Recipes for serving vLLM on Trillium TPUs (v6e)
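As a quick smoke test on a single-host TPU VM, the following minimal sketch uses vLLM's offline inference API. It assumes vLLM and the tpu-inference plugin are already installed on the TPU VM; the model name is a placeholder, and the model you choose must fit in the TPU slice's HBM.

```python
# Minimal vLLM offline-inference sketch (assumes vLLM and the tpu-inference
# plugin are installed on a single-host TPU VM; the model name below is a
# placeholder -- substitute a model you have access to and that fits in HBM).
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Write a haiku about TPUs:",
]

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

# Loading the model compiles it for the TPU; the first run can take a while.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", max_model_len=1024)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For a production endpoint, you would typically run vLLM's OpenAI-compatible server instead of the offline API; the guides linked above walk through end-to-end serving setups.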
Profiling
After setting up inference, you can use profilers to analyze performance and TPU utilization.
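For example, if you are serving a JAX-based model, you can capture a trace with the JAX profiler, as in the minimal sketch below; the log directory is a placeholder, and the resulting trace can be opened with the TensorBoard profile plugin (XProf).

```python
# Minimal JAX profiler sketch (assumes a JAX-based workload on the TPU VM;
# the trace directory is a placeholder). Open the captured trace with the
# TensorBoard profile plugin to inspect step time and TPU utilization.
import jax
import jax.numpy as jnp

@jax.jit
def matmul(a, b):
    return a @ b

a = jnp.ones((4096, 4096))
b = jnp.ones((4096, 4096))

# Warm up first so compilation time isn't mixed into the trace.
matmul(a, b).block_until_ready()

jax.profiler.start_trace("/tmp/tpu-profile")  # placeholder log directory
for _ in range(10):
    matmul(a, b).block_until_ready()
jax.profiler.stop_trace()
```

For more information about profiling, see: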