About model inference on GKE


This page describes the key concepts, benefits, and steps for running generative AI model inference workloads on Google Kubernetes Engine (GKE), using GKE Gen AI capabilities.

Inference serving is critical in deploying your generative AI models to real-world applications. GKE provides a robust and scalable platform for managing your containerized workloads, making it a compelling choice for serving your models in development or production. With GKE, you can use Kubernetes' capabilities for orchestration, scaling, and high availability to efficiently deploy and manage your inference services.

Recognizing the specific demands of AI/ML inference, Google Cloud has introduced GKE Gen AI capabilities—a suite of features specifically designed to enhance and optimize inference serving on GKE. For more information about specific features, see GKE Gen AI capabilities.

Terminology

This page uses the following terminology related to inference on GKE:

  • Inference: the process of running a generative AI model, such as a large language model or diffusion model, within a GKE cluster to generate text, embeddings, or other outputs from input data. Model inference on GKE leverages accelerators to efficiently handle complex computations for real-time or batch processing.
  • Model: a generative AI model that has learned patterns from data and is used for inference. Models vary in size and architecture, from smaller domain-specific models to massive multi-billion parameter neural networks that are optimized for diverse language tasks.
  • Model server: a containerized service responsible for receiving inference requests and returning predictions. This service might be a Python app, or a more robust solution like vLLM, JetStream, TensorFlow Serving, or Triton Inference Server. The model server handles loading models into memory, and executes computations on accelerators to return predictions efficiently.
  • Accelerator: specialized hardware, such as Graphics Processing Units (GPUs) from NVIDIA and Tensor Processing Units (TPUs) from Google, that can be attached to GKE nodes to speed up computations, particularly for training and inference tasks.

Benefits of GKE for inference

Inference serving on GKE provides several benefits:

  • Efficient price-performance: get value and speed for your inference serving needs. GKE lets you choose from a range of powerful accelerators (GPUs and TPUs), so you only pay for the performance you need.
  • Faster deployment: accelerate your time to market with the tailored best practices and qualifications provided by GKE Gen AI capabilities.
  • Scalable performance: scale out performance by using Horizontal Pod Autoscaling (HPA), custom metrics, and prebuilt monitoring. You can run a range of pre-trained or custom models, from 8 billion to 671 billion parameters.
  • Full portability: benefit from full portability with open standards. Google contributes to key Kubernetes APIs, including Gateway and LeaderWorkerSet, and these APIs are portable across Kubernetes distributions.
  • Ecosystem support: build on GKE's robust ecosystem which supports tools like Kueue for advanced resource queuing and management, and Ray for distributed computing, to facilitate scalable and efficient model training and inference.

How inference on GKE works

This section describes, at a high level, the steps to use GKE for inference serving:

  1. Containerize your model: deploy a model by containerizing the model server (such as vLLM) and loading model weights from Cloud Storage or a repository like Hugging Face. When you use GKE Inference Quickstart, the container image is automatically managed in the manifest for you.

  2. Create a GKE cluster: create a GKE cluster to host your deployment. Choose Autopilot for a managed experience or Standard for customization. Configure the cluster size, node types, and accelerators. For an optimized configuration, use Inference Quickstart.

  3. Deploy your model as a Kubernetes Deployment: create a Kubernetes Deployment to manage your inference service. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster. Specify the Docker image, the number of replicas, and other settings. Kubernetes pulls the image and runs your containers on the GKE cluster nodes. Configure the Pods with your model server and model, including LoRA adapters if needed. A minimal example Deployment manifest is shown after this list.

  4. Expose your inference service: make your inference service accessible by creating a Kubernetes Service that provides a network endpoint for your Deployment (see the example Service manifest after this list).

  5. Handle inference requests: send data from your application clients to your Service's endpoint in the expected format (for example, JSON or gRPC). If you are using a load balancer, it distributes requests to model replicas. The model server processes the request, runs the model, and returns the prediction.

  6. Scale and monitor your inference deployment: scale inference with HPA to automatically adjust replicas based on metrics such as CPU utilization or latency (a minimal HPA manifest follows this list). Use Inference Quickstart to get auto-generated scaling recommendations. To track performance, use Cloud Monitoring and Cloud Logging with prebuilt observability, including dashboards for popular model servers like vLLM.
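
The following is a minimal Deployment sketch for step 3, assuming a vLLM model server. The image tag, model name, Secret name, and resource sizes are illustrative assumptions rather than a qualified configuration; adjust them for your model, accelerator, and cluster.

```yaml
# Illustrative Deployment for a vLLM model server. Names, image, and sizes are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # schedule onto nodes with NVIDIA L4 GPUs
      containers:
      - name: inference-server
        image: vllm/vllm-openai:latest                # example public vLLM image
        args:
        - --model=google/gemma-2b-it                  # example model loaded from Hugging Face
        env:
        - name: HUGGING_FACE_HUB_TOKEN                # assumes a Secret named hf-secret exists
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        ports:
        - containerPort: 8000                         # vLLM's default HTTP port
        resources:
          limits:
            nvidia.com/gpu: "1"                       # reserve one GPU for the model server
```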
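
For step 4, a ClusterIP Service can expose the Deployment inside the cluster; switch to a LoadBalancer Service or a Gateway if clients are outside the cluster. The names and ports below match the sketch above and are likewise assumptions.

```yaml
# Illustrative Service that exposes the vLLM Deployment above inside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
spec:
  type: ClusterIP            # use LoadBalancer or a Gateway for external traffic
  selector:
    app: vllm-inference      # matches the Deployment's Pod labels
  ports:
  - protocol: TCP
    port: 8000               # port that clients call
    targetPort: 8000         # containerPort from the Deployment
```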
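
For step 6, the sketch below shows a minimal HorizontalPodAutoscaler that scales on average CPU utilization. Production inference deployments often scale on custom or external metrics instead (for example, model server queue depth or latency exported through Cloud Monitoring), which requires additional metrics configuration.

```yaml
# Illustrative HPA that scales the vLLM Deployment on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # target average CPU utilization across replicas
```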

For detailed examples that use specific models, model servers, and accelerators, see Inference examples.

GKE Gen AI capabilities

You can use these capabilities together or individually to address key challenges in serving generative AI models and to improve resource utilization within your GKE environment, at no additional cost.

GKE Inference Quickstart (Preview)

Specify your business needs and get tailored best practices for the combination of accelerators, scaling configurations, and model servers that best meets your requirements. You can access this service with the gcloud CLI.

For more information, see Run best practice inference with GKE Inference Quickstart recipes.

Benefits:

  • Saves time by automating the initial steps of choosing and configuring your infrastructure.
  • Lets you maintain full control over your Kubernetes setup for further tuning.

Planning for inference

This section covers some of the key considerations that you should take into account for your inference workloads on GKE.

Cost-efficiency

Serving large generative AI models can be expensive due to the use of accelerators, so you should focus on efficient resource utilization. Selecting the right machine type and accelerator is crucial: match the accelerator memory to your model size and quantization level. For example, G2 instances with NVIDIA L4 GPUs can be cost-effective for smaller models, while A3 instances are better suited for larger ones.

Follow these tips and recommendations to maximize cost-efficiency:

Performance

To optimize inference performance on GKE, focus on the following benchmark metrics:

Latency metrics:

  • Time to First Token (TTFT) (ms): time it takes to generate the first token for a request.
  • Normalized Time Per Output Token (NTPOT) (ms): request latency normalized by the number of output tokens, measured as request_latency / total_output_tokens.
  • Time Per Output Token (TPOT) (ms): time it takes to generate one output token, measured as (request_latency - time_to_first_token) / (total_output_tokens - 1).
  • Inter-token latency (ITL) (ms): latency between two output token generations. Unlike TPOT, which measures latency across the entire request, ITL measures the time to generate each individual output token. These individual measurements are then aggregated to produce mean, median, and percentile values such as p90.
  • Request latency (ms): end-to-end time to complete a request.

Throughput metrics:

  • Requests per second: total number of requests that you serve per second. Note that this metric might not be a reliable way to measure LLM throughput because it can vary widely for different context lengths.
  • Output tokens per second: a common metric that is measured as total_output_tokens_generated_by_server / elapsed_time_in_seconds.
  • Input tokens per second: measured as total_input_tokens_generated_by_server / elapsed_time_in_seconds.
  • Tokens per second: measured as total_tokens_generated_by_server / elapsed_time_in_seconds. This metric counts both input and output tokens, helping you compare workloads with high prefill versus high decode times.

To benchmark the performance of your workloads, follow the instructions in AI-Hypercomputer/inference-benchmark.

Consider these additional tips and recommendations for performance:

  • To get the recommended accelerators based on your performance needs, use Inference Quickstart.
  • To boost performance, use model server optimization techniques like batching and PagedAttention, which are in our best practices guide. Additionally, prioritize efficient memory management and attention computation for consistently low inter-token latencies.
  • Use standardized metrics across model servers (such as Hugging Face TGI, vLLM, or NVIDIA Triton) to improve autoscaling and load balancing, which can help you achieve higher throughput at your chosen latency. GKE provides automatic application monitoring for several model servers.

Obtainability

Ensuring the obtainability of resources (CPUs, GPUs and TPUs) is crucial for maintaining the performance, availability, and cost-efficiency of your inference workloads. Inference workloads often exhibit bursty and unpredictable traffic patterns, which can challenge hardware capacity. GKE addresses these challenges with features like the following:

  • Resource consumption options: choose from options such as reservations for assured capacity, Dynamic Workload Scheduler for cost-effective scaling, Spot VMs for cost optimization, and on-demand access for immediate availability.
  • Resource rightsizing: for example, Google Cloud offers smaller A3 High VMs with NVIDIA H100 GPUs (1g, 2g, or 4g) that support Spot VMs, for cost-effective generative AI inference scaling.
  • Compute classes for accelerators: you can use custom compute classes for more granular control, to prevent over-provisioning and to maximize resource obtainability with automatic fallback options (see the sketch after this list).
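
The following sketch illustrates the custom compute class idea from the last bullet: prefer on-demand G2 capacity with L4 GPUs, and fall back to Spot VMs when on-demand capacity is unavailable. The exact ComputeClass fields available depend on your GKE version, so treat the schema below as an assumption and verify it against the custom compute class reference.

```yaml
# Illustrative custom compute class with fallback priorities. Verify field names for your GKE version.
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: l4-inference-class
spec:
  priorities:
  - machineFamily: g2        # first choice: on-demand G2 VMs with one NVIDIA L4 GPU
    gpu:
      type: nvidia-l4
      count: 1
  - machineFamily: g2        # fallback: Spot capacity with the same GPU shape
    spot: true
    gpu:
      type: nvidia-l4
      count: 1
  whenUnsatisfiable: ScaleUpAnyway
  nodePoolAutoCreation:
    enabled: true
```

Workloads then opt in to the compute class with a nodeSelector on the cloud.google.com/compute-class node label.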

Node upgrades

GKE automates much of the upgrade process, but you need to consider upgrade strategies, especially for compatibility and testing. For manual upgrades, you can choose between surge or blue-green upgrades based on your inference workload's tolerance for interruption. Surge upgrades are fast, but they can briefly impact services. Blue-green upgrades offer near-zero downtime, which is crucial for real-time inference. To learn more, see Node upgrade strategies.

GPUs and TPUs don't support live migration, so maintenance requires restarting Pods. Use GKE notifications to prepare for disruptions. We recommend using Pod Disruption Budgets (PDBs) to ensure that a minimum number of Pods remains available, and make sure that your Pods can gracefully handle termination. TPU slices can be disrupted by single-host events, so plan for redundancy. For more best practices, see Manage GKE node disruption for GPUs and TPUs.
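
As an example of that recommendation, the sketch below defines a PodDisruptionBudget that keeps at least one replica of the hypothetical vllm-inference Deployment from earlier on this page available during voluntary disruptions such as node upgrades; tune the selector and threshold to your own Deployment and replica count.

```yaml
# Illustrative PDB that keeps at least one inference Pod available during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-inference-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-inference    # matches the example Deployment's Pod labels
```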

Try inference examples

Find GKE deployment examples for generative AI models, accelerators, and model servers. If you are just getting started, we recommend exploring the Serve Gemma open models using GPUs on GKE with vLLM tutorial.

Or, find a tutorial for your accelerator and model server in the following table:

Accelerator Model Server Tutorial
GPUs vLLM Serve LLMs like DeepSeek-R1 671B or Llama 3.1 405B on GKE
GPUs vLLM Serve Gemma open models using GPUs on GKE with vLLM
GPUs NVIDIA Triton Serve a model with a single GPU in GKE
GPUs Ray Serve Serve an LLM on L4 GPUs with Ray
GPUs TGI Serve an LLM with multiple GPUs in GKE
GPUs NVIDIA Triton Serve Gemma open models using GPUs on GKE with Triton and TensorRT-LLM
GPUs Hugging Face TGI Serve Gemma open models using GPUs on GKE with Hugging Face TGI
GPUs TensorFlow Serving Serve a model with a single GPU in GKE
TPUs vLLM Serve an LLM using TPU Trillium on GKE with vLLM
TPUs vLLM Serve an LLM using TPUs on GKE with KubeRay
TPUs JetStream Serve an LLM using TPUs on GKE with JetStream and PyTorch
TPUs JetStream Serve Gemma using TPUs on GKE with JetStream
TPUs MaxDiffusion Serve Stable Diffusion XL (SDXL) using TPUs on GKE with MaxDiffusion
TPUs Optimum TPU Serve open source models using TPUs on GKE with Optimum TPU

What's next