About model inference on GKE


This page describes the key concepts, benefits, and steps for running generative AI model inference workloads on Google Kubernetes Engine (GKE), using GKE Gen AI capabilities.

Inference serving is critical in deploying your generative AI models to real-world applications. GKE provides a robust and scalable platform for managing your containerized workloads, making it a compelling choice for serving your models in development or production. With GKE, you can use Kubernetes' capabilities for orchestration, scaling, and high availability to efficiently deploy and manage your inference services.

Recognizing the specific demands of AI/ML inference, Google Cloud has introduced GKE Gen AI capabilities—a suite of features specifically designed to enhance and optimize inference serving on GKE. For more information about specific features, see GKE Gen AI capabilities.

Terminology

This page uses the following terminology related to inference on GKE:

  • Inference: the process of running a generative AI model, such as a large language model or diffusion model, within a GKE cluster to generate text, embeddings, or other outputs from input data. Model inference on GKE leverages accelerators to efficiently handle complex computations for real-time or batch processing.
  • Model: a generative AI model that has learned patterns from data and is used for inference. Models vary in size and architecture, from smaller domain-specific models to massive multi-billion parameter neural networks that are optimized for diverse language tasks.
  • Model server: a containerized service responsible for receiving inference requests and returning predictions. This service might be a Python app, or a more robust solution like vLLM, JetStream, TensorFlow Serving, or Triton Inference Server. The model server handles loading models into memory, and executes computations on accelerators to return predictions efficiently.
  • Accelerator: specialized hardware, such as Graphics Processing Units (GPUs) from NVIDIA and Tensor Processing Units (TPUs) from Google, that can be attached to GKE nodes to speed up computations, particularly for training and inference tasks.

Benefits of GKE for inference

Inference serving on GKE provides several benefits:

  • Efficient price-performance: get value and speed for your inference serving needs. GKE lets you choose from a range of powerful accelerators (GPUs and TPUs), so you only pay for the performance you need.
  • Faster deployment: accelerate your time to market with the tailored best practices and qualifications provided by GKE Gen AI capabilities.
  • Scalable performance: scale out performance by using Horizontal Pod Autoscaling (HPA), custom metrics, and prebuilt monitoring. You can run a range of pre-trained or custom models, from 8 billion to 671 billion parameters.
  • Full portability: benefit from full portability with open standards. Google contributes to key Kubernetes APIs, including Gateway and LeaderWorkerSet, and these APIs are portable across Kubernetes distributions.
  • Ecosystem support: build on GKE's robust ecosystem which supports tools like Kueue for advanced resource queuing and management, and Ray for distributed computing, to facilitate scalable and efficient model training and inference.

How inference on GKE works

This section describes, at a high level, the steps to use GKE for inference serving:

  1. Containerize your model: deploy a model by containerizing the model server (such as vLLM) and loading model weights from Cloud Storage or a repository like Hugging Face. When you use GKE Inference Quickstart, the container image is automatically managed in the manifest for you.

  2. Create a GKE cluster: create a GKE cluster to host your deployment. Choose Autopilot for a managed experience or Standard for customization. Configure the cluster size, node types, and accelerators. For an optimized configuration, use Inference Quickstart.

  3. Deploy your model as a Kubernetes Deployment: create a Kubernetes Deployment to manage your inference service. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster. Specify the Docker image, the number of replicas, and other settings. Kubernetes pulls the image and runs your containers on the GKE cluster nodes. Configure the Pods with your model server and model, including LoRA adapters if needed. A minimal example Deployment manifest is shown after this list.

  4. Expose your inference service: make your inference service accessible by creating a Kubernetes Service that provides a network endpoint for your Deployment (see the example Service manifest after this list).

  5. Handle inference requests: send data from your application clients to your Service's endpoint in the expected format (for example, JSON or gRPC). If you are using a load balancer, it distributes requests to model replicas. The model server processes the request, runs the model, and returns the prediction.

  6. Scale and monitor your inference deployment: scale inference with HPA to automatically adjust replicas based on metrics such as CPU utilization or latency (a minimal HPA manifest follows this list). Use Inference Quickstart to get auto-generated scaling recommendations. To track performance, use Cloud Monitoring and Cloud Logging with prebuilt observability, including dashboards for popular model servers like vLLM.
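
The following is a minimal Deployment sketch for step 3, assuming a vLLM model server. The image tag, model name, Secret name, and resource sizes are illustrative assumptions rather than a qualified configuration; adjust them for your model, accelerator, and cluster.

```yaml
# Illustrative Deployment for a vLLM model server. Names, image, and sizes are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # schedule onto nodes with NVIDIA L4 GPUs
      containers:
      - name: inference-server
        image: vllm/vllm-openai:latest                # example public vLLM image
        args:
        - --model=google/gemma-2b-it                  # example model loaded from Hugging Face
        env:
        - name: HUGGING_FACE_HUB_TOKEN                # assumes a Secret named hf-secret exists
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        ports:
        - containerPort: 8000                         # vLLM's default HTTP port
        resources:
          limits:
            nvidia.com/gpu: "1"                       # reserve one GPU for the model server
```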
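
For step 4, a ClusterIP Service can expose the Deployment inside the cluster; switch to a LoadBalancer Service or a Gateway if clients are outside the cluster. The names and ports below match the sketch above and are likewise assumptions.

```yaml
# Illustrative Service that exposes the vLLM Deployment above inside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
spec:
  type: ClusterIP            # use LoadBalancer or a Gateway for external traffic
  selector:
    app: vllm-inference      # matches the Deployment's Pod labels
  ports:
  - protocol: TCP
    port: 8000               # port that clients call
    targetPort: 8000         # containerPort from the Deployment
```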
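
For step 6, the sketch below shows a minimal HorizontalPodAutoscaler that scales on average CPU utilization. Production inference deployments often scale on custom or external metrics instead (for example, model server queue depth or latency exported through Cloud Monitoring), which requires additional metrics configuration.

```yaml
# Illustrative HPA that scales the vLLM Deployment on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # target average CPU utilization across replicas
```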

For detailed examples that use specific models, model servers, and accelerators, see Inference examples.

GKE Gen AI capabilities

You can use these capabilities together or individually to address key challenges in serving generative AI models and to improve resource utilization within your GKE environment, at no additional cost.

GKE Inference Quickstart (Preview)

Specify your business needs and get tailored best practices for the combination of accelerators, scaling configurations, and model servers that best meets your requirements. You can access this service with the gcloud CLI.

For more information, see Run best practice inference with GKE Inference Quickstart recipes.

Benefits:

  • Saves time by automating the initial steps of choosing and configuring your infrastructure.
  • Lets you maintain full control over your Kubernetes setup for further tuning.

Planning for inference

This section covers some of the key considerations that you should take into account for your inference workloads on GKE.

Cost-efficiency

Serving large generative AI models can be expensive due to the use of accelerators, so you should focus on efficient resource utilization. Selecting the right machine type and accelerator is crucial: match the accelerator memory to your model size and quantization level. For example, G2 instances with NVIDIA L4 GPUs can be cost-effective for smaller models, while A3 instances are better suited for larger ones.

Follow these tips and recommendations to maximize cost-efficiency:

Performance

To optimize inference performance on GKE, focus on the following benchmark metrics:

Latency metrics:

  • Time to First Token (TTFT) (ms): time it takes to generate the first token for a request.
  • Normalized Time Per Output Token (NTPOT) (ms): request latency normalized by the number of output tokens, measured as request_latency / total_output_tokens.
  • Time Per Output Token (TPOT) (ms): time it takes to generate one output token, measured as (request_latency - time_to_first_token) / (total_output_tokens - 1).
  • Inter-token latency (ITL) (ms): latency between two output token generations. Unlike TPOT, which measures latency across the entire request, ITL measures the time to generate each individual output token. These individual measurements are then aggregated to produce mean, median, and percentile values such as p90.
  • Request latency (ms): end-to-end time to complete a request.

Throughput metrics:

  • Requests per second: total number of requests that you serve per second. Note that this metric might not be a reliable way to measure LLM throughput because it can vary widely for different context lengths.
  • Output tokens per second: a common metric that is measured as total_output_tokens_generated_by_server / elapsed_time_in_seconds.
  • Input tokens per second: measured as total_input_tokens_generated_by_server / elapsed_time_in_seconds.
  • Tokens per second: measured as total_tokens_generated_by_server / elapsed_time_in_seconds. This metric counts both input and output tokens, helping you compare workloads with high prefill versus high decode times.

To benchmark the performance of your workloads, follow the instructions in AI-Hypercomputer/inference-benchmark.

Consider these additional tips and recommendations for performance:

  • To get the recommended accelerators based on your performance needs, use Inference Quickstart.
  • To boost performance, use model server optimization techniques like batching and PagedAttention, which are in our best practices guide. Additionally, prioritize efficient memory management and attention computation for consistently low inter-token latencies.
  • Use standardized metrics across model servers (such as Hugging Face TGI, vLLM, or NVIDIA Triton) to improve autoscaling and load balancing, which can help you achieve higher throughput at your chosen latency. GKE provides automatic application monitoring for several model servers.

Obtainability

Ensuring the obtainability of resources (CPUs, GPUs and TPUs) is crucial for maintaining the performance, availability, and cost-efficiency of your inference workloads. Inference workloads often exhibit bursty and unpredictable traffic patterns, which can challenge hardware capacity. GKE addresses these challenges with features like the following:

  • Resource consumption options: choose from options such as reservations for assured capacity, Dynamic Workload Scheduler for cost-effective scaling, Spot VMs for cost optimization, and on-demand access for immediate availability.
  • Resource rightsizing: for example, Google Cloud offers smaller A3 High VMs with NVIDIA H100 GPUs (1g, 2g, or 4g) that support Spot VMs, for cost-effective generative AI inference scaling.
  • Compute classes for accelerators: you can use custom compute classes for more granular control, to prevent over-provisioning and to maximize resource obtainability with automatic fallback options (see the sketch after this list).
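
The following sketch illustrates the custom compute class idea from the last bullet: prefer on-demand G2 capacity with L4 GPUs, and fall back to Spot VMs when on-demand capacity is unavailable. The exact ComputeClass fields available depend on your GKE version, so treat the schema below as an assumption and verify it against the custom compute class reference.

```yaml
# Illustrative custom compute class with fallback priorities. Verify field names for your GKE version.
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: l4-inference-class
spec:
  priorities:
  - machineFamily: g2        # first choice: on-demand G2 VMs with one NVIDIA L4 GPU
    gpu:
      type: nvidia-l4
      count: 1
  - machineFamily: g2        # fallback: Spot capacity with the same GPU shape
    spot: true
    gpu:
      type: nvidia-l4
      count: 1
  whenUnsatisfiable: ScaleUpAnyway
  nodePoolAutoCreation:
    enabled: true
```

Workloads then opt in to the compute class with a nodeSelector on the cloud.google.com/compute-class node label.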

Node upgrades

GKE automates much of the upgrade process, but you need to consider upgrade strategies, especially for compatibility and testing. For manual upgrades, you can choose between surge or blue-green upgrades based on your inference workload's tolerance for interruption. Surge upgrades are fast, but they can briefly impact services. Blue-green upgrades offer near-zero downtime, which is crucial for real-time inference. To learn more, see Node upgrade strategies.

GPUs and TPUs don't support live migration, so maintenance requires restarting Pods. Use GKE notifications to prepare for disruptions. We recommend using Pod Disruption Budgets (PDBs) to ensure that a minimum number of Pods remains available, and make sure that your Pods can gracefully handle termination. TPU slices can be disrupted by single-host events, so plan for redundancy. For more best practices, see Manage GKE node disruption for GPUs and TPUs.
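
As an example of that recommendation, the sketch below defines a PodDisruptionBudget that keeps at least one replica of the hypothetical vllm-inference Deployment from earlier on this page available during voluntary disruptions such as node upgrades; tune the selector and threshold to your own Deployment and replica count.

```yaml
# Illustrative PDB that keeps at least one inference Pod available during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-inference-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-inference    # matches the example Deployment's Pod labels
```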

Try inference examples

Find GKE deployment examples for generative AI models, accelerators, and model servers. If you are just getting started, we recommend exploring the Serve Gemma open models using GPUs on GKE with vLLM tutorial.

Or, find a tutorial for your accelerator and model server in the following table:

Accelerator Model Server Tutorial
GPUs vLLM Serve LLMs like DeepSeek-R1 671B or Llama 3.1 405B on GKE
GPUs vLLM Serve Gemma open models using GPUs on GKE with vLLM
GPUs NVIDIA Triton Serve a model with a single GPU in GKE
GPUs Ray Serve Serve an LLM on L4 GPUs with Ray
GPUs TGI Serve an LLM with multiple GPUs in GKE
GPUs NVIDIA Triton Serve Gemma open models using GPUs on GKE with Triton and TensorRT-LLM
GPUs Hugging Face TGI Serve Gemma open models using GPUs on GKE with Hugging Face TGI
GPUs TensorFlow Serving Serve a model with a single GPU in GKE
TPUs vLLM Serve an LLM using TPU Trillium on GKE with vLLM
TPUs vLLM Serve an LLM using TPUs on GKE with KubeRay
TPUs JetStream Serve an LLM using TPUs on GKE with JetStream and PyTorch
TPUs JetStream Serve Gemma using TPUs on GKE with JetStream
TPUs MaxDiffusion Serve Stable Diffusion XL (SDXL) using TPUs on GKE with MaxDiffusion
TPUs Optimum TPU Serve open source models using TPUs on GKE with Optimum TPU

What's next