This page describes how you can use GKE Inference Quickstart to simplify the deployment of AI/ML inference workloads on Google Kubernetes Engine (GKE). Inference Quickstart is a utility that lets you specify your inference business requirements and get optimized Kubernetes configurations based on best practices and Google's benchmarks for models, model servers, accelerators (GPUs, TPUs), and scaling. This helps you avoid the time-consuming process of manually adjusting and testing configurations.
This page is for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who want to understand how to efficiently manage and optimize GKE for AI/ML inference. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
To learn more about model serving concepts and terminology, and how GKE Gen AI capabilities can enhance and support your model serving performance, see About model inference on GKE.
Before reading this page, ensure you're familiar with Kubernetes, GKE, and model serving.
Using Inference Quickstart
The high-level steps to use Inference Quickstart are as follows; a condensed command-line sketch of the workflow follows this list. Click the links for detailed instructions.
- View tailored best practices: using the Google Cloud CLI in the terminal,
start by providing inputs such as your preferred
open model (for example, Llama, Gemma, or Mistral).
- You can specify your application's latency target, indicating whether it is latency-sensitive (like a chatbot) or throughput-sensitive (like batch analytics).
- Based on your requirements, Inference Quickstart provides accelerator choices, performance metrics, and Kubernetes manifests, which give you full control for deployment or further modifications. Generated manifests reference public model server images, so you don't have to create these images yourself.
- Deploy manifests: using the kubectl apply command, deploy the recommended manifests from the command line. Before you deploy, ensure that you have sufficient accelerator quota for the selected GPUs or TPUs in your Google Cloud project.
- Monitor performance: use Cloud Monitoring to monitor the workload performance metrics that GKE provides. You can view model server specific dashboards and fine-tune your deployment as needed.
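The following is a condensed sketch of this workflow. The commands are described in detail later on this page; the model, accelerator, and output path shown here are illustrative examples, so adjust them to your own requirements.

# Generate recommended manifests for an example model and accelerator.
gcloud alpha container ai recommender manifests create \
    --model=google/gemma-2-27b-it \
    --model-server=vllm \
    --accelerator-type=nvidia-l4 \
    --output=manifest \
    --output-path /tmp/manifests.yaml

# Deploy the generated manifests to your cluster.
kubectl apply -f /tmp/manifests.yaml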
Benefits
Inference Quickstart helps you save time and resources by providing optimized configurations. These optimizations improve performance and reduce infrastructure costs, in the following ways:
- You receive detailed, tailored best practices for accelerator (GPU and TPU), model server, and scaling configurations. GKE routinely updates the tool with the latest fixes, images, and performance benchmarks.
- You can specify your workload's latency and throughput requirements from a command-line interface, and get detailed, tailored best practices as Kubernetes deployment manifests.
Use cases
Inference Quickstart is suitable for scenarios like the following:
- Discover optimal GKE inference architectures: if you're transitioning from another environment, such as on-premises or a different cloud provider, and want the most up-to-date recommended inference architectures on GKE for your specific performance needs.
- Expedite AI/ML inference deployments: if you're an experienced Kubernetes user and want to quickly start deploying AI inference workloads, Inference Quickstart helps you discover and implement best practice deployments on GKE, with detailed YAML configurations based on best practices.
- Explore TPUs for enhanced performance: if you're already utilizing Kubernetes on GKE with GPUs, you can use Inference Quickstart to explore the benefits of using TPUs to potentially achieve better performance.
How it works
Inference Quickstart provides tailored best practices based on Google's exhaustive internal benchmarks of single-replica performance for model, model server, and accelerator topology combinations. These benchmarks graph latency versus throughput, including queue size and KV cache metrics, which map out performance curves for each combination.
How tailored best practices are generated
We measure latency in Normalized Time per Output Token (NTPOT) in milliseconds and throughput in output tokens per second, by saturating accelerators. To learn more about these performance metrics, see About model inference on GKE.
The following example latency profile illustrates the inflection point where throughput plateaus (green), the post-inflection point where latency worsens (red), and the ideal zone (blue) for optimal throughput at the latency target. Inference Quickstart provides performance data and configurations for this ideal zone.
Based on an inference application's latency requirements, Inference Quickstart identifies suitable combinations and determines the optimal operating point on the latency-throughput curve. This point sets the Horizontal Pod Autoscaler (HPA) threshold, with a buffer to account for scale-up latency. The overall threshold also informs the initial number of replicas needed, though the HPA dynamically adjusts this number based on workload.
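For example, when you generate manifests later on this page, you can pass your latency target with the --target-ntpot-milliseconds parameter, and Inference Quickstart derives the HPA threshold and scaling configuration from it. The model, accelerator, and target value here are illustrative:

gcloud alpha container ai recommender manifests create \
    --model=google/gemma-2-27b-it \
    --model-server=vllm \
    --accelerator-type=nvidia-l4 \
    --target-ntpot-milliseconds=200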
Benchmarking
The provided configurations and performance data are based on benchmarks that use the ShareGPT dataset to send traffic with the following input and output distribution.
| | Min | Median | Mean | P90 | P99 | Max |
|---|---|---|---|---|---|---|
| Input tokens | 4 | 108 | 226 | 635 | 887 | 1024 |
| Output tokens | 1 | 132 | 195 | 488 | 778 | 1024 |
To run the benchmark yourself, follow the instructions at AI-Hypercomputer/inference-benchmark. We provide different options that you can use during benchmarking to simulate load patterns that are representative of your workload.
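For example, assuming you have git installed, a minimal way to get started is to clone the benchmark repository referenced above and follow its README for the flags that control load patterns:

# Clone the benchmark tool referenced above (repository name from this page).
git clone https://github.com/AI-Hypercomputer/inference-benchmark.git
cd inference-benchmark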
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Ensure you have sufficient accelerator capacity for your project:
  - If you use GPUs: Check the Quotas page.
  - If you use TPUs: Refer to Ensure quota for TPUs and other GKE resources.
Generate a Hugging Face access token and a corresponding Kubernetes Secret, if you don't have one already. To create a Kubernetes Secret that contains the Hugging Face token, run the following command:
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=HUGGING_FACE_TOKEN \
    --namespace=NAMESPACE
Replace the following values:
- HUGGING_FACE_TOKEN: the Hugging Face token you created earlier.
- NAMESPACE: the Kubernetes namespace where you want to deploy your model server.
Some models might also require you to accept and sign their consent license agreement.
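If the namespace that you plan to use for the Secret doesn't exist yet, you can create it first; NAMESPACE is the same placeholder as in the previous command:

kubectl create namespace NAMESPACE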
If you use the gcloud CLI to run Inference Quickstart, you also need to run these additional commands:
Enable the gkerecommender.googleapis.com API:

gcloud services enable gkerecommender.googleapis.com
Enable command authentication (this step is required to set the billing and quota project that is used for API calls):
gcloud auth application-default login
Limitations
Be aware of the following limitations before you start using Inference Quickstart:
- Inference Quickstart does not provide profiles for all models supported by a given model server.
Get optimized configurations for model inference
Use the gcloud alpha container ai recommender
command to explore and
view optimized combinations of model, model server, model server version, and
accelerators:
Models
To explore and select a model, use the models
option.
gcloud alpha container ai recommender models list
The output looks similar to the following:
Supported models:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
- google/gemma-2-27b-it
- google/gemma-2-2b-it
- meta-llama/Llama-3.2-1B-Instruct
- meta-llama/Llama-3.3-70B-Instruct
- meta-llama/Meta-Llama-3-8B
- mistralai/Mixtral-8x22B-Instruct-v0.1
- mistralai/Mixtral-8x7B-Instruct-v0.1
Model servers
To explore recommended model servers for the model you are interested in, use the model-servers
option. For example:
gcloud alpha container ai recommender model-servers list \
--model=meta-llama/Meta-Llama-3-8B
The output looks similar to the following:
Supported model servers:
- vllm
Server versions
Optionally, to explore supported versions of the model server you are interested in,
use the model-server-versions
option. If you skip this step,
Inference Quickstart defaults to the latest version.
For example:
gcloud alpha container ai recommender model-server-versions list \
--model=meta-llama/Meta-Llama-3-8B \
--model_server=vllm
The output looks similar to the following:
Supported model server versions:
- e92694b6fe264a85371317295bca6643508034ef
- v0.7.2
Accelerators
To explore recommended accelerators for the model and model server
combination you are interested in, use the accelerators
option.
For example:
gcloud alpha container ai recommender accelerators list \
--model=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--model-server-version=v0.7.2
The output looks similar to the following:
Supported accelerators:
accelerator | model | model server | model server version | accelerator count | output tokens per second | ntpot ms
---------------------|-----------------------------------------|--------------|------------------------------------------|-------------------|--------------------------|---------
nvidia-tesla-a100 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | vllm | v0.7.2 | 1 | 3357 | 72
nvidia-h100-80gb | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | vllm | v0.7.2 | 1 | 6934 | 30
For more details on each accelerator, use the --format=yaml flag.
The output returns a list of accelerator types, and these metrics:
- Throughput, in output tokens per second
- Normalized time per output token (NTPOT), in milliseconds
The values represent the performance observed at the point where throughput stops increasing and latency starts dramatically increasing (that is, the inflection or saturation point) for a given profile with this accelerator type. To learn more about these performance metrics, see About model inference on GKE.
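For example, to see the full details for each accelerator in YAML format, you can add the --format=yaml flag to the same command shown earlier:

gcloud alpha container ai recommender accelerators list \
    --model=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --model-server-version=v0.7.2 \
    --format=yaml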
For additional options, see the Google Cloud CLI documentation.
After you choose a model, model server, model server version, and accelerator, you can proceed to create a deployment manifest.
Generate and deploy recommended configurations
This section describes how to generate and deploy configuration recommendations by using the gcloud CLI.
Generate manifests: In the terminal, use the manifests option to generate Deployment, Service, and PodMonitoring manifests:

gcloud alpha container ai recommender manifests create
Use the --model, --model-server, --model-server-version, and --accelerator parameters to customize your manifest.

Optionally, you can set these parameters:

- --target-ntpot-milliseconds: Set this parameter to specify your HPA threshold. This parameter lets you define a scaling threshold to keep the NTPOT (Normalized Time Per Output Token) P50 latency, which is measured at the 50th percentile, below the specified value. Choose a value above the minimum latency of your accelerator. If you specify an NTPOT value above the maximum latency of your accelerator, the HPA is configured for maximum throughput. Here's an example:

gcloud alpha container ai recommender manifests create \
    --model=google/gemma-2-27b-it \
    --model-server=vllm \
    --model-server-version=v0.7.2 \
    --accelerator-type=nvidia-l4 \
    --target-ntpot-milliseconds=200
- --output: Valid values include manifest, comments, and all. By default, this parameter is set to all. You can choose to output only the manifest for deploying workloads, or to output only the comments if you want to view instructions for enabling features.
- --output-path: If specified, the output is saved to the provided path rather than printed to the terminal, so that you can edit the output before deploying it. For example, you can use this parameter with the --output=manifest option if you want to save your manifest in a YAML file. Here's an example:

gcloud alpha container ai recommender manifests create \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --model-server vllm \
    --accelerator-type=nvidia-tesla-a100 \
    --output=manifest \
    --output-path /tmp/manifests.yaml
For additional options, see the Google Cloud CLI documentation.
Provision your infrastructure: ensure your infrastructure is correctly set up for model deployment, monitoring, and scaling by following the steps in Provision your infrastructure.
Deploy the manifests: run the kubectl apply command and pass in the YAML file for your manifests. For example:

kubectl apply -f ./manifests.yaml
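After the apply succeeds, you can check that the generated resources came up as expected. This is a generic verification sketch; NAMESPACE is the namespace you deployed into:

# Check that the Deployment, Service, and Pods from the generated manifests are running.
kubectl get deployments,services,pods --namespace=NAMESPACE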
Provision your infrastructure
Follow these steps to ensure your infrastructure is correctly set up for model deployment, monitoring, and scaling:
Create a cluster: You can serve your model on GKE Autopilot or Standard clusters. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that is the best fit for your workloads, see Choose a GKE mode of operation.
If you don't have an existing cluster, follow these steps:
Autopilot
Follow these instructions to create an Autopilot cluster. GKE handles provisioning the nodes with GPU or TPU capacity based on the deployment manifests, if you have the necessary quota in your project.
Standard
- Create a zonal or regional cluster.
Create a node pool with the appropriate accelerators. Follow these steps based on your chosen accelerator type:
- GPUs: First, check the Quotas page in the Google Cloud console to ensure you have sufficient GPU capacity. Then, follow the instructions in Create a GPU node pool; an example command follows these steps.
- TPUs: First, ensure you have sufficient TPU quota by following the instructions in Ensure quota for TPUs and other GKE resources. Then, proceed to Create a TPU node pool.
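For example, a hypothetical GPU node pool for the nvidia-l4 accelerator might be created with a command like the following. The pool name, machine type, accelerator count, and location are placeholders; align them with the accelerator recommendation you selected and with your available quota:

# Create an example node pool with one L4 GPU per node.
gcloud container node-pools create gpu-pool \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --machine-type=g2-standard-8 \
    --accelerator=type=nvidia-l4,count=1,gpu-driver-version=latest \
    --num-nodes=1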
(Optional, but recommended) Enable observability features: In the comments section of the generated manifest, additional commands are provided to enable suggested observability features. Enabling these features provides more insights to help you monitor the performance and status of workloads and the underlying infrastructure.
The following is an example of a command to enable observability features:
gcloud beta container clusters update $CLUSTER_NAME \
    --project=$PROJECT_ID \
    --location=$LOCATION \
    --enable-managed-prometheus \
    --logging=SYSTEM,WORKLOAD \
    --monitoring=SYSTEM,DEPLOYMENT,HPA,POD,DCGM \
    --auto-monitoring-scope=ALL

For more information, see Monitor your inference workloads.
(HPA only) Deploy a metrics adapter: A metrics adapter, such as the Custom Metrics Stackdriver Adapter, is necessary if HPA resources were generated in the deployment manifests. The metrics adapter enables the HPA to access model server metrics through the Kubernetes external metrics API. To deploy the adapter, refer to the adapter documentation on GitHub.
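For example, at the time of writing, the Custom Metrics Stackdriver Adapter can be installed by applying the manifest published in its repository. Check the adapter documentation for the current path and for any additional setup, such as Workload Identity configuration:

# Install the adapter from its published manifest (verify the path in the adapter docs).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml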
Test your deployment endpoints
The deployed service is exposed at the following endpoint:
http://model-model_server-service:port/
Test your service. In a separate terminal, set up port forwarding by running the following command:
kubectl port-forward service/model-model_server-service 8000:8000
For examples of how to build and send a request to your endpoint, see the vLLM documentation.
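For example, assuming the manifests deployed vLLM with its OpenAI-compatible API on the forwarded port, a minimal test request might look like the following. Replace the model name with the model you deployed:

# Send a small completion request through the forwarded port.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B",
      "prompt": "Why is the sky blue?",
      "max_tokens": 64
    }'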
Monitor your inference workloads
To monitor your deployed inference workloads, go to the Metrics Explorer in the Google Cloud console.
Enable auto-monitoring
GKE includes an auto-monitoring feature that is part of the broader observability features. This feature scans the cluster for workloads that run on supported model servers and deploys the PodMonitoring resources that enable these workload metrics to be visible in Cloud Monitoring. For more information about enabling and configuring auto-monitoring, see Configure automatic application monitoring for workloads.
After enabling the feature, GKE installs prebuilt dashboards for monitoring applications for supported workloads.
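For example, you can confirm that auto-monitoring created PodMonitoring resources for your workloads; this assumes the managed Prometheus custom resources are installed in your cluster:

# List PodMonitoring resources across namespaces.
kubectl get podmonitorings --all-namespaces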
Troubleshooting
- If you set the latency too low, Inference Quickstart might not generate a recommendation. To fix this issue, select a latency target between the minimum and maximum latency that was observed for your selected accelerators.
- Inference Quickstart exists independently of GKE components, so your cluster version is not directly relevant for using the service. However, we recommend using a fresh or up-to-date cluster to avoid any discrepancies in performance.
- If you get a PERMISSION_DENIED error for gkerecommender.googleapis.com commands that says a quota project is missing with Application Default Credentials (ADC), you need to set the quota project manually. To fix this, run gcloud config set billing/quota_project PROJECT_ID.
What's next
- Visit the AI/ML orchestration on GKE portal to explore our official guides, tutorials, and use cases for running AI/ML workloads on GKE.
- For more information about model serving optimization, see Best practices for optimizing large language model inference with GPUs. It covers best practices for LLM serving with GPUs on GKE, like quantization, tensor parallelism, and memory management.
- For more information about best practices for autoscaling, see these guides:
- Explore additional examples, best practices, and tools on the ai-on-gke GitHub repository.