This guide shows you how to optimize Tensor Processing Unit (TPU) provisioning by using future reservation in calendar mode. Future reservation in calendar mode is a built-in calendar advisor and recommender that helps you locate TPU capacity and plan ahead. You can request capacity for a specific start time and a duration of between 1 and 90 days, and the recommender suggests available dates.
This guide is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities for running batch workloads. For more information about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
For more information, see About future reservation in calendar mode.
Use cases
Future reservation in calendar mode works best for workloads with scheduled, short-term, high-demand requests, such as model training or batch inference, that require high availability at the requested start time.
If your workload requires dynamically provisioned resources as needed, for up to 7 days without long-term reservations or complex quota management, consider using flex-start. For more information, see About GPU and TPU provisioning with flex-start.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- Ensure that you have either of the following:
- An existing Standard cluster that's running version 1.28.3-gke.1098000 or later.
- An existing Autopilot cluster that's running version 1.30.3-gke.1451000 or later.
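To confirm that your cluster meets the minimum version, you can check the control plane version with the gcloud CLI. This is a quick sketch; CLUSTER_NAME and LOCATION are placeholders for your own values, and node pools that were created separately can run older versions, so check those too if needed.
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --format="value(currentMasterVersion)"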
Request future reservation in calendar mode for TPUs
The process to request TPUs with future reservation in calendar mode involves the following steps:
- Ensure that you have sufficient quota for any resources that aren't part of a reservation, such as disks or IP addresses, which are consumed when VMs are created. Future reservation requests in calendar mode don't require Compute Engine quota.
- Complete the steps in Create a request in calendar mode. These steps include the following:
- View TPU future availability.
- Create and submit a future reservation request in calendar mode for TPUs.
- Wait for Google Cloud to approve your request.
- Create a TPU node pool that uses your reservation.
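After Google Cloud approves your request, it delivers the capacity as an automatically created reservation at the requested start time. Before you create the node pool, you can verify that the reservation exists. The following is a minimal sketch, assuming RESERVATION and ZONE are placeholders for your reservation name and zone:
gcloud compute reservations describe RESERVATION \
    --zone=ZONE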
Create a node pool
This section applies to Standard clusters only.
You can use your reservation when you create single-host or multi-host TPU slice node pools. For example, the following command creates a single-host TPU slice node pool by using the Google Cloud CLI:
gcloud container node-pools create NODE_POOL_NAME \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_ZONES \
    --machine-type=MACHINE_TYPE \
    --reservation-affinity=specific \
    --reservation=RESERVATION
The --reservation-affinity=specific flag is required so that the node pool consumes only the named reservation.
Replace the following:
- NODE_POOL_NAME: the name of the new node pool.
- LOCATION: the name of the zone based on the TPU version that you want to use. To identify an available location, see TPU availability in GKE.
- CLUSTER_NAME: the name of the cluster.
- NODE_ZONES: the comma-separated list of one or more zones where GKE creates the node pool.
- MACHINE_TYPE: the type of machine to use for nodes. For more information about TPU-compatible machine types, see the table in Choose the TPU version.
- RESERVATION: the name of the calendar reservation to consume.
For a full list of all the flags that you can specify, see the gcloud container node-pools create reference.
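For a multi-host TPU slice node pool, you also specify the slice topology and a matching node count. The following is a minimal sketch rather than a definitive command, assuming a TPU v5e slice with the ct5lp-hightpu-4t machine type and a 4x4 topology (16 chips across four nodes); adjust the values to match your reservation:
gcloud container node-pools create NODE_POOL_NAME \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_ZONES \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=4x4 \
    --num-nodes=4 \
    --reservation-affinity=specific \
    --reservation=RESERVATION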
After you create a node pool with the calendar reservation, you can deploy your workload like any other TPU node pool. For example, you can create a Job that specifies the TPU node pool that consumes the reserved TPUs.
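For example, the following Job manifest is a minimal sketch that assumes a single-host ct5lp-hightpu-4t node pool (TPU v5e, 2x2 topology, four chips per node); the IMAGE and COMMAND values are placeholders for your own workload:
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-job
spec:
  template:
    spec:
      # Schedule the Pod onto the TPU slice node pool that consumes the reservation.
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        cloud.google.com/gke-tpu-topology: 2x2
      containers:
      - name: tpu-workload
        image: IMAGE
        command: ["COMMAND"]
        resources:
          limits:
            google.com/tpu: 4   # request all four chips on the single-host v5e node
      restartPolicy: Never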
What's next
Try GKE deployment examples for generative AI models that use the TPU resources that you reserved:
- Serve an LLM using TPU Trillium on GKE with vLLM
- Serve an LLM using TPUs on GKE with KubeRay
- Serve an LLM using TPUs on GKE with JetStream and PyTorch
- Serve Gemma using TPUs on GKE with JetStream
- Serve Stable Diffusion XL (SDXL) using TPUs on GKE with MaxDiffusion
- Serve open source models using TPUs on GKE with Optimum TPU
Explore experimental samples for leveraging GKE to accelerate your AI/ML initiatives in GKE AI Labs.