This page describes how you can use GKE Inference Quickstart to simplify the deployment of AI/ML inference workloads on Google Kubernetes Engine (GKE). Inference Quickstart is a utility that lets you specify your inference business requirements and get optimized Kubernetes configurations based on best practices and Google's benchmarks for models, model servers, accelerators (GPUs, TPUs), and scaling. This helps you avoid the time-consuming process of manually adjusting and testing configurations.
This page is for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who want to understand how to efficiently manage and optimize GKE for AI/ML inference. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
To learn more about model serving concepts and terminology, and how GKE Gen AI capabilities can enhance and support your model serving performance, see About model inference on GKE.
Before reading this page, ensure you're familiar with Kubernetes, GKE, and model serving.
Using Inference Quickstart
The high-level steps to use Inference Quickstart are as follows; a condensed command-line sketch of the workflow follows this list. Click the links for detailed instructions.
- View tailored best practices: using the Google Cloud CLI in the terminal,
start by providing inputs such as your preferred
open model (for example, Llama, Gemma, or Mistral).
- You can specify your application's latency target, indicating whether it is latency-sensitive (like a chatbot) or throughput-sensitive (like batch analytics).
- Based on your requirements, Inference Quickstart provides accelerator choices, performance metrics, and Kubernetes manifests, which give you full control for deployment or further modifications. Generated manifests reference public model server images, so you don't have to create these images yourself.
- Deploy manifests: using the kubectl apply command, deploy the recommended manifests from the command line. Before you deploy, ensure that you have sufficient accelerator quota for the selected GPUs or TPUs in your Google Cloud project.
- Monitor performance: use Cloud Monitoring to monitor the workload performance metrics that GKE provides. You can view model server specific dashboards and fine-tune your deployment as needed.
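The following is a condensed sketch of this workflow. The commands are described in detail later on this page; the model, accelerator, and output path shown here are illustrative examples, so adjust them to your own requirements.

# Generate recommended manifests for an example model and accelerator.
gcloud alpha container ai recommender manifests create \
    --model=google/gemma-2-27b-it \
    --model-server=vllm \
    --accelerator-type=nvidia-l4 \
    --output=manifest \
    --output-path /tmp/manifests.yaml

# Deploy the generated manifests to your cluster.
kubectl apply -f /tmp/manifests.yaml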
Benefits
Inference Quickstart helps you save time and resources by providing optimized configurations. These optimizations improve performance and reduce infrastructure costs, in the following ways:
- You receive detailed, tailored best practices for accelerator (GPU and TPU), model server, and scaling configurations. GKE routinely updates the tool with the latest fixes, images, and performance benchmarks.
- You can specify your workload's latency and throughput requirements from a command-line interface, and get detailed, tailored best practices as Kubernetes deployment manifests.
Use cases
Inference Quickstart is suitable for scenarios like the following:
- Discover optimal GKE inference architectures: if you're transitioning from another environment, such as on-premises or a different cloud provider, and want the most up-to-date recommended inference architectures on GKE for your specific performance needs.
- Expedite AI/ML inference deployments: if you're an experienced Kubernetes user and want to quickly start deploying AI inference workloads, Inference Quickstart helps you discover and implement best practice deployments on GKE, with detailed YAML configurations based on best practices.
- Explore TPUs for enhanced performance: if you're already utilizing Kubernetes on GKE with GPUs, you can use Inference Quickstart to explore the benefits of using TPUs to potentially achieve better performance.
How it works
Inference Quickstart provides tailored best practices based on Google's exhaustive internal benchmarks of single-replica performance for model, model server, and accelerator topology combinations. These benchmarks graph latency versus throughput, including queue size and KV cache metrics, which map out performance curves for each combination.
How tailored best practices are generated
We measure latency in Normalized Time per Output Token (NTPOT) in milliseconds and throughput in output tokens per second, by saturating accelerators. To learn more about these performance metrics, see About model inference on GKE.
The following example latency profile illustrates the inflection point where throughput plateaus (green), the post-inflection point where latency worsens (red), and the ideal zone (blue) for optimal throughput at the latency target. Inference Quickstart provides performance data and configurations for this ideal zone.
Based on an inference application's latency requirements, Inference Quickstart identifies suitable combinations and determines the optimal operating point on the latency-throughput curve. This point sets the Horizontal Pod Autoscaler (HPA) threshold, with a buffer to account for scale-up latency. The overall threshold also informs the initial number of replicas needed, though the HPA dynamically adjusts this number based on workload.
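For example, when you generate manifests later on this page, you can pass your latency target with the --target-ntpot-milliseconds parameter, and Inference Quickstart derives the HPA threshold and scaling configuration from it. The model, accelerator, and target value here are illustrative:

gcloud alpha container ai recommender manifests create \
    --model=google/gemma-2-27b-it \
    --model-server=vllm \
    --accelerator-type=nvidia-l4 \
    --target-ntpot-milliseconds=200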
Benchmarking
The provided configurations and performance data are based on benchmarks that use the ShareGPT dataset to send traffic with the following input and output distribution.
| | Min | Median | Mean | P90 | P99 | Max |
|---|---|---|---|---|---|---|
| Input tokens | 4 | 108 | 226 | 635 | 887 | 1024 |
| Output tokens | 1 | 132 | 195 | 488 | 778 | 1024 |
To run the benchmark yourself, follow the instructions at AI-Hypercomputer/inference-benchmark. We provide different options that you can use during benchmarking to simulate load patterns that are representative of your workload.
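For example, assuming you have git installed, a minimal way to get started is to clone the benchmark repository referenced above and follow its README for the flags that control load patterns:

# Clone the benchmark tool referenced above (repository name from this page).
git clone https://github.com/AI-Hypercomputer/inference-benchmark.git
cd inference-benchmark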
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Ensure you have sufficient accelerator capacity for your project:
  - If you use GPUs: Check the Quotas page.
  - If you use TPUs: Refer to Ensure quota for TPUs and other GKE resources.
Generate a Hugging Face access token and a corresponding Kubernetes Secret, if you don't have one already. To create a Kubernetes Secret that contains the Hugging Face token, run the following command:
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=HUGGING_FACE_TOKEN \
    --namespace=NAMESPACE
Replace the following values:
- HUGGING_FACE_TOKEN: the Hugging Face token you created earlier.
- NAMESPACE: the Kubernetes namespace where you want to deploy your model server.
Some models might also require you to accept and sign their consent license agreement.
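If the namespace that you plan to use for the Secret doesn't exist yet, you can create it first; NAMESPACE is the same placeholder as in the previous command:

kubectl create namespace NAMESPACE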
If you use the gcloud CLI to run Inference Quickstart, you also need to run these additional commands:
Enable the gkerecommender.googleapis.com API:

gcloud services enable gkerecommender.googleapis.com
Enable command authentication (this step is required to set the billing and quota project that is used for API calls):
gcloud auth application-default login
Limitations
Be aware of the following limitations before you start using Inference Quickstart:
- Inference Quickstart does not provide profiles for all models supported by a given model server.
Get optimized configurations for model inference
Use the gcloud alpha container ai recommender
command to explore and
view optimized combinations of model, model server, model server version, and
accelerators:
Models
To explore and select a model, use the models
option.
gcloud alpha container ai recommender models list
The output looks similar to the following:
Supported models:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
- google/gemma-2-27b-it
- google/gemma-2-2b-it
- meta-llama/Llama-3.2-1B-Instruct
- meta-llama/Llama-3.3-70B-Instruct
- meta-llama/Meta-Llama-3-8B
- mistralai/Mixtral-8x22B-Instruct-v0.1
- mistralai/Mixtral-8x7B-Instruct-v0.1
Model servers
To explore recommended model servers for the model you are interested in, use the model-servers
option. For example:
gcloud alpha container ai recommender model-servers list \
--model=meta-llama/Meta-Llama-3-8B
The output looks similar to the following:
Supported model servers:
- vllm
Server versions
Optionally, to explore supported versions of the model server you are interested in,
use the model-server-versions
option. If you skip this step,
Inference Quickstart defaults to the latest version.
For example:
gcloud alpha container ai recommender model-server-versions list \
--model=meta-llama/Meta-Llama-3-8B \
--model_server=vllm
The output looks similar to the following:
Supported model server versions:
- e92694b6fe264a85371317295bca6643508034ef
- v0.7.2
Accelerators
To explore recommended accelerators for the model and model server
combination you are interested in, use the accelerators
option.
For example:
gcloud alpha container ai recommender accelerators list \
--model=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--model-server-version=v0.7.2
The output looks similar to the following:
Supported accelerators:
accelerator | model | model server | model server version | accelerator count | output tokens per second | ntpot ms
---------------------|-----------------------------------------|--------------|------------------------------------------|-------------------|--------------------------|---------
nvidia-tesla-a100 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | vllm | v0.7.2 | 1 | 3357 | 72
nvidia-h100-80gb | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | vllm | v0.7.2 | 1 | 6934 | 30
For more details on each accelerator, use the --format=yaml flag.
The output returns a list of accelerator types, and these metrics:
- Throughput, in output tokens per second
- Normalized time per output token (NTPOT), in milliseconds
The values represent the performance observed at the point where throughput stops increasing and latency starts dramatically increasing (that is, the inflection or saturation point) for a given profile with this accelerator type. To learn more about these performance metrics, see About model inference on GKE.
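For example, to see the full details for each accelerator in YAML format, you can add the --format=yaml flag to the same command shown earlier:

gcloud alpha container ai recommender accelerators list \
    --model=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --model-server-version=v0.7.2 \
    --format=yaml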
For additional options, see the Google Cloud CLI documentation.
After you choose a model, model server, model server version, and accelerator, you can proceed to create a deployment manifest.
Generate and deploy recommended configurations
This section describes how to generate and deploy configuration recommendations by using the gcloud CLI.
Generate manifests: In the terminal, use the manifests option to generate Deployment, Service, and PodMonitoring manifests:

gcloud alpha container ai recommender manifests create
Use the --model, --model-server, --model-server-version, and --accelerator parameters to customize your manifest.

Optionally, you can set these parameters:

- --target-ntpot-milliseconds: Set this parameter to specify your HPA threshold. This parameter lets you define a scaling threshold to keep the NTPOT (Normalized Time Per Output Token) P50 latency, which is measured at the 50th percentile, below the specified value. Choose a value above the minimum latency of your accelerator. If you specify an NTPOT value above the maximum latency of your accelerator, the HPA is configured for maximum throughput. Here's an example:

gcloud alpha container ai recommender manifests create \
    --model=google/gemma-2-27b-it \
    --model-server=vllm \
    --model-server-version=v0.7.2 \
    --accelerator-type=nvidia-l4 \
    --target-ntpot-milliseconds=200
- --output: Valid values include manifest, comments, and all. By default, this parameter is set to all. You can choose to output only the manifest for deploying workloads, or to output only the comments if you want to view instructions for enabling features.
- --output-path: If specified, the output is saved to the provided path rather than printed to the terminal, so that you can edit the output before deploying it. For example, you can use this parameter with the --output=manifest option if you want to save your manifest in a YAML file. Here's an example:

gcloud alpha container ai recommender manifests create \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --model-server vllm \
    --accelerator-type=nvidia-tesla-a100 \
    --output=manifest \
    --output-path /tmp/manifests.yaml
For additional options, see the Google Cloud CLI documentation.
Provision your infrastructure: ensure your infrastructure is correctly set up for model deployment, monitoring, and scaling by following the steps in Provision your infrastructure.
Deploy the manifests: run the kubectl apply command and pass in the YAML file for your manifests. For example:

kubectl apply -f ./manifests.yaml
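After the apply succeeds, you can check that the generated resources came up as expected. This is a generic verification sketch; NAMESPACE is the namespace you deployed into:

# Check that the Deployment, Service, and Pods from the generated manifests are running.
kubectl get deployments,services,pods --namespace=NAMESPACE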
Provision your infrastructure
Follow these steps to ensure your infrastructure is correctly set up for model deployment, monitoring, and scaling:
Create a cluster: You can serve your model on GKE Autopilot or Standard clusters. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that is the best fit for your workloads, see Choose a GKE mode of operation.
If you don't have an existing cluster, follow these steps:
Autopilot
Follow these instructions to create an Autopilot cluster. GKE handles provisioning the nodes with GPU or TPU capacity based on the deployment manifests, if you have the necessary quota in your project.
Standard
- Create a zonal or regional cluster.
Create a node pool with the appropriate accelerators. Follow these steps based on your chosen accelerator type:
- GPUs: First, check the Quotas page in the Google Cloud console to ensure you have sufficient GPU capacity. Then, follow the instructions in Create a GPU node pool; an example command follows these steps.
- TPUs: First, ensure you have sufficient TPU quota by following the instructions in Ensure quota for TPUs and other GKE resources. Then, proceed to Create a TPU node pool.
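For example, a hypothetical GPU node pool for the nvidia-l4 accelerator might be created with a command like the following. The pool name, machine type, accelerator count, and location are placeholders; align them with the accelerator recommendation you selected and with your available quota:

# Create an example node pool with one L4 GPU per node.
gcloud container node-pools create gpu-pool \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --machine-type=g2-standard-8 \
    --accelerator=type=nvidia-l4,count=1,gpu-driver-version=latest \
    --num-nodes=1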
(Optional, but recommended) Enable observability features: In the comments section of the generated manifest, additional commands are provided to enable suggested observability features. Enabling these features provides more insights to help you monitor the performance and status of workloads and the underlying infrastructure.
The following is an example of a command to enable observability features:
gcloud beta container clusters update $CLUSTER_NAME \
    --project=$PROJECT_ID \
    --location=$LOCATION \
    --enable-managed-prometheus \
    --logging=SYSTEM,WORKLOAD \
    --monitoring=SYSTEM,DEPLOYMENT,HPA,POD,DCGM \
    --auto-monitoring-scope=ALL

For more information, see Monitor your inference workloads.
(HPA only) Deploy a metrics adapter: A metrics adapter, such as the Custom Metrics Stackdriver Adapter, is necessary if HPA resources were generated in the deployment manifests. The metrics adapter enables the HPA to access model server metrics through the Kubernetes external metrics API. To deploy the adapter, refer to the adapter documentation on GitHub.
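For example, at the time of writing, the Custom Metrics Stackdriver Adapter can be installed by applying the manifest published in its repository. Check the adapter documentation for the current path and for any additional setup, such as Workload Identity configuration:

# Install the adapter from its published manifest (verify the path in the adapter docs).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml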
Test your deployment endpoints
The deployed service is exposed at the following endpoint:
http://model-model_server-service:port/
Test your service. In a separate terminal, set up port forwarding by running the following command:
kubectl port-forward service/model-model_server-service 8000:8000
For examples of how to build and send a request to your endpoint, see the vLLM documentation.
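For example, assuming the manifests deployed vLLM with its OpenAI-compatible API on the forwarded port, a minimal test request might look like the following. Replace the model name with the model you deployed:

# Send a small completion request through the forwarded port.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B",
      "prompt": "Why is the sky blue?",
      "max_tokens": 64
    }'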
Monitor your inference workloads
To monitor your deployed inference workloads, go to the Metrics Explorer in the Google Cloud console.
Enable auto-monitoring
GKE includes an auto-monitoring feature that is part of the broader observability features. This feature scans the cluster for workloads that run on supported model servers and deploys the PodMonitoring resources that enable these workload metrics to be visible in Cloud Monitoring. For more information about enabling and configuring auto-monitoring, see Configure automatic application monitoring for workloads.
After enabling the feature, GKE installs prebuilt dashboards for monitoring applications for supported workloads.
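For example, you can confirm that auto-monitoring created PodMonitoring resources for your workloads; this assumes the managed Prometheus custom resources are installed in your cluster:

# List PodMonitoring resources across namespaces.
kubectl get podmonitorings --all-namespaces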
Troubleshooting
- If you set the latency too low, Inference Quickstart might not generate a recommendation. To fix this issue, select a latency target between the minimum and maximum latency that was observed for your selected accelerators.
- Inference Quickstart exists independently of GKE components, so your cluster version is not directly relevant for using the service. However, we recommend using a fresh or up-to-date cluster to avoid any discrepancies in performance.
- If you get a PERMISSION_DENIED error for gkerecommender.googleapis.com commands that says a quota project is missing with Application Default Credentials (ADC), you need to set the quota project manually. To fix this, run gcloud config set billing/quota_project PROJECT_ID.
What's next
- Visit the AI/ML orchestration on GKE portal to explore our official guides, tutorials, and use cases for running AI/ML workloads on GKE.
- For more information about model serving optimization, see Best practices for optimizing large language model inference with GPUs. It covers best practices for LLM serving with GPUs on GKE, like quantization, tensor parallelism, and memory management.
- For more information about best practices for autoscaling, see these guides:
- Explore additional examples, best practices, and tools on the ai-on-gke GitHub repository.