Run best practice inference with GKE Inference Quickstart recipes


This page describes how you can use GKE Inference Quickstart to simplify the deployment of AI/ML inference workloads on Google Kubernetes Engine (GKE). Inference Quickstart is a utility that lets you specify your inference business requirements and get optimized Kubernetes configurations based on best practices and Google's benchmarks for models, model servers, accelerators (GPUs, TPUs), and scaling. This helps you avoid the time-consuming process of manually adjusting and testing configurations.

This page is for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who want to understand how to efficiently manage and optimize GKE for AI/ML inference. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.

To learn more about model serving concepts and terminology, and how GKE Gen AI capabilities can enhance and support your model serving performance, see About model inference on GKE.

Before reading this page, ensure you're familiar with Kubernetes, GKE, and model serving.

Using Inference Quickstart

The high-level steps to use Inference Quickstart are as follows. Click the links for detailed instructions.

  1. View tailored best practices: using the Google Cloud CLI in the terminal, start by providing inputs such as your preferred open model (for example, Llama, Gemma, or Mistral).
    • You can specify your application's latency target, indicating whether it is latency-sensitive (like a chatbot) or throughput-sensitive (like batch analytics).
    • Based on your requirements, Inference Quickstart provides accelerator choices, performance metrics, and Kubernetes manifests, which give you full control for deployment or further modifications. Generated manifests reference public model server images, so you don't have to create these images yourself.
  2. Deploy manifests: using the kubectl apply command, deploy the recommended manifests from the command line. Before you deploy, ensure that you have sufficient accelerator quota in your Google Cloud project for the selected GPUs or TPUs.
  3. Monitor performance: use Cloud Monitoring to monitor the metrics that GKE provides for your workload. You can view model server-specific dashboards and fine-tune your deployment as needed. An end-to-end example of these steps follows this list.
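
The following is a minimal sketch of this end-to-end flow using the gcloud CLI and kubectl. The model, accelerator, and file path shown here are examples; substitute the values that Inference Quickstart recommends for your workload.

  # List models that Inference Quickstart has profiles for.
  gcloud alpha container ai recommender models list

  # Generate manifests for a chosen model, model server, and accelerator,
  # and save them to a file.
  gcloud alpha container ai recommender manifests create \
      --model=google/gemma-2-27b-it \
      --model-server=vllm \
      --accelerator-type=nvidia-l4 \
      --output=manifest \
      --output-path=/tmp/manifests.yaml

  # Deploy the generated manifests to your cluster.
  kubectl apply -f /tmp/manifests.yaml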

Benefits

Inference Quickstart helps you save time and resources by providing optimized configurations. These optimizations improve performance and reduce infrastructure costs, in the following ways:

  • You receive detailed, tailored best practices for accelerator (GPU and TPU), model server, and scaling configurations. GKE routinely updates the tool with the latest fixes, images, and performance benchmarks.
  • You can specify your workload's latency and throughput requirements from a command-line interface, and get detailed tailored best practices as Kubernetes deployment manifests.

Use cases

Inference Quickstart is suitable for scenarios like the following:

  • Discover optimal GKE inference architectures: if you're transitioning from another environment, such as on-premises or a different cloud provider, and want the most up-to-date recommended inference architectures on GKE for your specific performance needs.
  • Expedite AI/ML inference deployments: if you're an experienced Kubernetes user and want to quickly start deploying AI inference workloads, Inference Quickstart helps you discover and implement best-practice deployments on GKE, with detailed YAML configurations.
  • Explore TPUs for enhanced performance: if you're already utilizing Kubernetes on GKE with GPUs, you can use Inference Quickstart to explore the benefits of using TPUs to potentially achieve better performance.

How it works

Inference Quickstart provides tailored best practices based on Google's exhaustive internal benchmarks of single-replica performance for model, model server, and accelerator topology combinations. These benchmarks graph latency versus throughput, including queue size and KV cache metrics, which map out performance curves for each combination.

How tailored best practices are generated

We measure latency by using Normalized Time Per Output Token (NTPOT) in milliseconds, and throughput in output tokens per second, by saturating accelerators. To learn more about these performance metrics, see About model inference on GKE.
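
For illustration (assuming NTPOT is computed as total request latency divided by the number of output tokens), a request that takes 6,000 ms end to end and returns 300 output tokens has an NTPOT of 6,000 / 300 = 20 ms per output token.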

The following example latency profile illustrates the inflection point where throughput plateaus (green), the post-inflection point where latency worsens (red), and the ideal zone (blue) for optimal throughput at the latency target. Inference Quickstart provides performance data and configurations for this ideal zone.

Figure: Latency profile, with a green marker below 2000 output tokens per second and a red marker above 2000 output tokens per second.

Based on an inference application's latency requirements, Inference Quickstart identifies suitable combinations and determines the optimal operating point on the latency-throughput curve. This point sets the Horizontal Pod Autoscaler (HPA) threshold, with a buffer to account for scale-up latency. The overall threshold also informs the initial number of replicas needed, though the HPA dynamically adjusts this number based on workload.
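
After you deploy the generated manifests (as described later on this page), you can inspect the resulting HPA to see the threshold and replica bounds that were set. In the following sketch, NAMESPACE and HPA_NAME are placeholders for the values in your generated manifests:

  # List HPAs and compare current metric values against their targets.
  kubectl get hpa --namespace=NAMESPACE

  # Show the full configuration of a specific HPA, including its scaling metric and target.
  kubectl describe hpa HPA_NAME --namespace=NAMESPACE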

Benchmarking

The provided configurations and performance data are based on benchmarks that use the ShareGPT dataset to send traffic with the following input and output distribution.

| Metric | Input tokens | Output tokens |
|--------|--------------|---------------|
| Min    | 4            | 1             |
| Median | 108          | 132           |
| Mean   | 226          | 195           |
| P90    | 635          | 488           |
| P99    | 887          | 778           |
| Max    | 1024         | 1024          |

To run the benchmark yourself, follow the instructions at AI-Hypercomputer/inference-benchmark. We provide different options that you can use during benchmarking to simulate load patterns that are representative of your workload.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
  • In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

  • Make sure that billing is enabled for your Google Cloud project.

  • Ensure that you have sufficient accelerator capacity (GPU or TPU quota) for your project.

  • Generate a Hugging Face access token and a corresponding Kubernetes Secret, if you don't have one already. To create a Kubernetes Secret that contains the Hugging Face token, run the following command (a command to verify the Secret follows this list):

    kubectl create secret generic hf-secret \
        --from-literal=hf_api_token=HUGGING_FACE_TOKEN \
        --namespace=NAMESPACE
    

    Replace the following values:

    • HUGGING_FACE_TOKEN: the Hugging Face token you created earlier.
    • NAMESPACE: the Kubernetes namespace where you want to deploy your model server.
  • Some models might also require you to accept and sign their consent license agreement.
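
To verify the Secret that you created earlier, you can optionally run the following check, where NAMESPACE is the same namespace you used when creating the Secret:

  kubectl get secret hf-secret --namespace=NAMESPACE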

If you use the gcloud CLI to run Inference Quickstart, you also need to run these additional commands:

  1. Enable the gkerecommender.googleapis.com API:

    gcloud services enable gkerecommender.googleapis.com
    
  2. Enable command authentication (this step is required to set the billing and quota project that is used for API calls):

    gcloud auth application-default login
    

Limitations

Be aware of the following limitations before you start using Inference Quickstart:

  • Inference Quickstart does not provide profiles for all models supported by a given model server.

Get optimized configurations for model inference

Use the gcloud alpha container ai recommender command to explore and view optimized combinations of model, model server, model server version, and accelerators:

Models

To explore and select a model, use the models option.

  gcloud alpha container ai recommender models list

The output looks similar to the following:

  Supported models:
  -  deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  -  google/gemma-2-27b-it
  -  google/gemma-2-2b-it
  -  meta-llama/Llama-3.2-1B-Instruct
  -  meta-llama/Llama-3.3-70B-Instruct
  -  meta-llama/Meta-Llama-3-8B
  -  mistralai/Mixtral-8x22B-Instruct-v0.1
  -  mistralai/Mixtral-8x7B-Instruct-v0.1

Model servers

To explore recommended model servers for the model you are interested in, use the model-servers option. For example:

  gcloud alpha container ai recommender model-servers list \
      --model=meta-llama/Meta-Llama-3-8B

The output looks similar to the following:

  Supported model servers:
  -  vllm

Server versions

Optionally, to explore supported versions of the model server you are interested in, use the model-server-versions option. If you skip this step, Inference Quickstart defaults to the latest version. For example:

  gcloud alpha container ai recommender model-server-versions list \
      --model=meta-llama/Meta-Llama-3-8B \
      --model-server=vllm

The output looks similar to the following:

  Supported model server versions:
  -  e92694b6fe264a85371317295bca6643508034ef
  -  v0.7.2

Accelerators

To explore recommended accelerators for the model and model server combination you are interested in, use the accelerators option. For example:

  gcloud alpha container ai recommender accelerators list \
      --model=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
      --model-server-version=v0.7.2

The output looks similar to the following:

  Supported accelerators:
  accelerator          | model                                   | model server | model server version                     | accelerator count | output tokens per second | ntpot ms
  ---------------------|-----------------------------------------|--------------|------------------------------------------|-------------------|--------------------------|---------
  nvidia-tesla-a100    | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | vllm         | v0.7.2                                   | 1                 | 3357                     | 72
  nvidia-h100-80gb     | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | vllm         | v0.7.2                                   | 1                 | 6934                     | 30

  For more details on each accelerator, use --format=yaml

The output lists accelerator types along with the following metrics:

  • Throughput, in output tokens per second
  • Normalized time per output token (NTPOT), in milliseconds

The values represent the performance observed at the point where throughput stops increasing and latency starts to increase dramatically (that is, the inflection or saturation point) for a given profile with this accelerator type. To learn more about these performance metrics, see About model inference on GKE.

For additional options, see the Google Cloud CLI documentation.

After you choose a model, model server, model server version, and accelerator, you can proceed to create a deployment manifest.

Generate and deploy recommended configurations

This section describes how to generate and deploy configuration recommendations by using the gcloud CLI.

  1. Generate manifests: In the terminal, use the manifests option to generate Deployment, Service, and PodMonitoring manifests:

    gcloud alpha container ai recommender manifests create
    

    Use the --model, --model-server, --model-server-version, and --accelerator-type parameters to customize your manifest.

    Optionally, you can set these parameters:

    • --target-ntpot-milliseconds: Set this parameter to specify your HPA threshold. This parameter lets you define a scaling threshold to keep the P50 NTPOT (Normalized Time Per Output Token) latency, which is measured at the 50th percentile, below the specified value. Choose a value above the minimum latency of your accelerator. If you specify an NTPOT value above the maximum latency of your accelerator, the HPA is configured for maximum throughput. Here's an example:

      gcloud alpha container ai recommender manifests create \
          --model=google/gemma-2-27b-it \
          --model-server=vllm \
          --model-server-version=v0.7.2 \
          --accelerator-type=nvidia-l4 \
          --target-ntpot-milliseconds=200
      
    • --output: Valid values include manifest, comments, and all. By default, this is set to all. You can choose to output only the manifest for deploying workloads, or you can choose to output only the comments if you want to view instructions for enabling features.

    • --output-path: If specified, the output will be saved to the provided path rather than printed to the terminal so you can edit the output before deploying it. For example, you can use this with the --output=manifest option if you want to save your manifest in a YAML file. Here's an example:

      gcloud alpha container ai recommender manifests create \
          --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
          --model-server vllm \
          --accelerator-type=nvidia-tesla-a100 \
          --output=manifest \
          --output-path /tmp/manifests.yaml
      

    For additional options, see the Google Cloud CLI documentation.

  2. Provision your infrastructure: ensure that your infrastructure is correctly set up for model deployment, monitoring, and scaling by following the steps in the Provision your infrastructure section of this page.

  3. Deploy the manifests: run the kubectl apply command and pass in the YAML file for your manifests (or pipe the manifest directly to kubectl, as shown in the example that follows). For example:

    kubectl apply -f ./manifests.yaml
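
    Alternatively, if you don't need to edit the output first, you can pipe the generated manifest directly to kubectl. This is a sketch; it assumes you pass --output=manifest so that only manifest YAML is printed:

    gcloud alpha container ai recommender manifests create \
        --model=google/gemma-2-27b-it \
        --model-server=vllm \
        --accelerator-type=nvidia-l4 \
        --output=manifest | kubectl apply -f -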
    

Provision your infrastructure

Follow these steps to ensure your infrastructure is correctly set up for model deployment, monitoring, and scaling:

  1. Create a cluster: You can serve your model on GKE Autopilot or Standard clusters. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that is the best fit for your workloads, see Choose a GKE mode of operation.

    If you don't have an existing cluster, follow these steps:

    Autopilot

    Follow these instructions to create an Autopilot cluster. GKE handles provisioning the nodes with GPU or TPU capacity based on the deployment manifests, if you have the necessary quota in your project.
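
    For example, you can create an Autopilot cluster with a command similar to the following, where CLUSTER_NAME, PROJECT_ID, and LOCATION are placeholders for your own values:

    gcloud container clusters create-auto CLUSTER_NAME \
        --project=PROJECT_ID \
        --location=LOCATION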

    Standard

    1. Create a zonal or regional cluster.
    2. Create a node pool with the appropriate accelerators by following the steps for your chosen accelerator type (GPU or TPU).
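
       For example, a node pool for NVIDIA L4 GPUs might be created with a command similar to the following. The pool name, machine type, and counts are illustrative; choose values that match the accelerator that Inference Quickstart recommended:

       gcloud container node-pools create gpu-pool \
           --cluster=CLUSTER_NAME \
           --location=LOCATION \
           --machine-type=g2-standard-8 \
           --accelerator=type=nvidia-l4,count=1,gpu-driver-version=latest \
           --num-nodes=1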

  2. (Optional, but recommended) Enable observability features: In the comments section of the generated manifest, additional commands are provided to enable suggested observability features. Enabling these features provides more insights to help you monitor the performance and status of workloads and the underlying infrastructure.

    The following is an example of a command to enable observability features:

    gcloud beta container clusters update $CLUSTER_NAME \
        --project=$PROJECT_ID \
        --location=$LOCATION \
        --enable-managed-prometheus \
        --logging=SYSTEM,WORKLOAD \
        --monitoring=SYSTEM,DEPLOYMENT,HPA,POD,DCGM \
        --auto-monitoring-scope=ALL
    

    For more information, see Monitor your inference workloads.

  3. (HPA only) Deploy a metrics adapter: A metrics adapter, such as the Custom Metrics Stackdriver Adapter, is necessary if HPA resources were generated in the deployment manifests. The metrics adapter enables the HPA to access model server metrics through the Kubernetes external metrics API. To deploy the adapter, refer to the adapter documentation on GitHub.
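
    For reference, at the time of writing, the Custom Metrics Stackdriver Adapter can typically be installed with a single command similar to the following. Check the adapter's documentation for the current manifest URL and for any IAM or Workload Identity configuration that your cluster requires:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml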

Test your deployment endpoints

The deployed service is exposed at the following endpoint, where model, model_server, and port correspond to the values in your generated manifest:

http://model-model_server-service:port/

Test your service. In a separate terminal, set up port forwarding by running the following command:

kubectl port-forward service/model-model_server-service 8000:8000

For examples of how to build and send a request to your endpoint, see the vLLM documentation.
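
While the port-forward is running, you can send a quick test request to the model server's OpenAI-compatible completions API. The following sketch assumes a vLLM deployment listening on port 8000; replace MODEL_ID with the model you deployed:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "MODEL_ID",
      "prompt": "San Francisco is a",
      "max_tokens": 32
    }'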

Monitor your inference workloads

To monitor your deployed inference workloads, go to the Metrics Explorer in the Google Cloud console.

Enable auto-monitoring

GKE includes an auto-monitoring feature that is part of the broader observability features. This feature scans the cluster for workloads that run on supported model servers and deploys the PodMonitoring resources that enable these workload metrics to be visible in Cloud Monitoring. For more information about enabling and configuring auto-monitoring, see Configure automatic application monitoring for workloads.

After you enable the feature, GKE installs prebuilt application monitoring dashboards for supported workloads.

Troubleshooting

  • If you set the latency target too low, Inference Quickstart might not generate a recommendation. To fix this issue, select a latency target between the minimum and maximum latency that was observed for your selected accelerators.
  • Inference Quickstart exists independently of GKE components, so your cluster version is not directly relevant for using the service. However, we recommend using a fresh or up-to-date cluster to avoid any discrepancies in performance.
  • If you get a PERMISSION_DENIED error for gkerecommender.googleapis.com commands that says a quota project is missing from your Application Default Credentials (ADC), you need to set the quota project manually. To fix this, run gcloud config set billing/quota_project PROJECT_ID.

What's next