Choose a load balancing strategy for AI/ML model inference on GKE


This page helps you choose the appropriate load balancing strategy for AI/ML model inference workloads on Google Kubernetes Engine (GKE).

This page is intended for the following personas:

  • Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving AI/ML workloads.
  • Cloud architects and Networking specialists who interact with Kubernetes networking.

To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

Before you read this page, ensure that you're familiar with the following:

When you deploy AI/ML model inference workloads on Google Kubernetes Engine (GKE), choose the right load balancing strategy to optimize performance, scalability, and cost efficiency. Google Cloud provides the following distinct solutions:

  • GKE Inference Gateway: a solution built for advanced AI/ML routing. For more information, see the GKE Inference Gateway documentation.
  • GKE Gateway with custom metrics: a general-purpose solution that uses Application Load Balancers and distributes traffic based on custom, application-reported metrics.

Combine load balancing solutions

You can use both GKE Inference Gateway and GKE Gateway with custom metrics together in some architectures. In these architectures, the Application Load Balancer works together with the GKE Gateway with custom metrics. For example, a global external Application Load Balancer directs traffic to the appropriate region based on factors such as geography and health checks. For more information, see Application Load Balancers. After traffic reaches a specific region, GKE Inference Gateway performs fine-grained, AI-aware load balancing to route requests to the optimal model server. For more information, see the GKE Inference Gateway documentation.

To choose the Google Cloud load balancing solution that best serves your inference applications on GKE, consider your workload characteristics, performance requirements, and operational model.

Overview of GKE Inference Gateway

To direct traffic to the most suitable and least-loaded model server replica, the GKE Inference Gateway's Endpoint Picker extension monitors critical AI-specific metrics. These metrics include model server KV cache utilization, pending request queue length, overall GPU or TPU load, LoRA adapter availability, and the computational cost of individual requests. In addition to sophisticated routing, GKE Inference Gateway provides request prioritization and optimized autoscaling for model servers.
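
For illustration, the following sketch shows how an InferencePool groups model server replicas and points the Gateway at an Endpoint Picker. The resource names, labels, and Endpoint Picker service are hypothetical, and the API version follows the open-source Gateway API Inference Extension; check the GKE Inference Gateway documentation for the exact schema that your GKE version supports.

```yaml
# Illustrative sketch: names and labels are hypothetical, and the API version
# follows the open-source Gateway API Inference Extension; check the GKE
# Inference Gateway documentation for the schema your GKE version supports.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-pool          # hypothetical pool of model server replicas
spec:
  targetPortNumber: 8000             # port that the model server containers listen on
  selector:
    app: vllm-llama3-8b              # label that selects the model server Pods
  extensionRef:
    name: vllm-llama3-8b-epp         # the Endpoint Picker that scores replicas
---
# Attaching the pool to a Gateway with an HTTPRoute lets the Endpoint Picker
# choose the best replica for each request.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway          # hypothetical Gateway name
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-pool
```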

Overview of GKE Gateway with custom metrics

Application Load Balancers, such as the global external Application Load Balancer and the regional external Application Load Balancer, are general-purpose load balancers that distribute traffic based on custom metrics your backend services report. This capability provides fine-grained control over load distribution based on application-specific performance indicators.

The GKE Gateway acts as a Kubernetes-native interface for provisioning and managing Application Load Balancers. Essentially, when you define a Gateway resource in your GKE cluster, the GKE Gateway controller automatically configures the underlying Application Load Balancer, providing a simplified way to manage external HTTP/HTTPS traffic to your GKE services directly from Kubernetes, while using Google Cloud's load balancing infrastructure.
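
As a minimal sketch of that pattern, the following manifests define a Gateway that uses the GKE-managed `gke-l7-global-external-managed` class, which provisions a global external Application Load Balancer, plus an HTTPRoute that exposes a hypothetical Service. The resource names, hostname, and Service are illustrative assumptions.

```yaml
# Minimal sketch: the Gateway requests a GKE-managed global external
# Application Load Balancer; the HTTPRoute exposes a hypothetical Service
# named store-v1 behind it.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-http
spec:
  gatewayClassName: gke-l7-global-external-managed   # GKE-managed GatewayClass
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route
spec:
  parentRefs:
  - name: external-http
  hostnames:
  - "store.example.com"
  rules:
  - backendRefs:
    - name: store-v1          # existing ClusterIP Service (hypothetical)
      port: 8080
```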

Compare load balancing solutions

The following table compares the features of GKE Inference Gateway and GKE Gateway with custom metrics.

| Feature | GKE Inference Gateway | GKE Gateway with custom metrics (by using Application Load Balancers) |
| --- | --- | --- |
| Primary use case | Optimizes Generative AI/ML inference workloads on Kubernetes, such as serving large language models (LLMs). Works well for serving multiple use cases on a single model, ensuring fair access to model resources, and optimizing latency-sensitive, GPU/TPU-based LLM workloads. | Provides general-purpose HTTP(S) load balancing for workloads that need accurate traffic distribution based on custom, application-reported metrics (load signals). Works well for latency-sensitive services, such as real-time gaming servers or high-frequency trading platforms, that report custom utilization data. |
| Base routing | Supports standard HTTP(S) routing by host and path, extending the GKE Gateway API. | Supports standard HTTP(S) routing by host and path, configured using the GKE Gateway API's standard resources. |
| Advanced routing logic | Performs model-aware routing (for example, body-based model name), traffic splitting, mirroring, and applies priority and criticality levels. | Balances traffic based on application-reported custom metrics by using the Open Request Cost Aggregation (ORCA) standard. This enables policies such as `WEIGHTED_ROUND_ROBIN` for endpoint weighting within a locality. |
| Supported metrics | Uses a suite of built-in, AI-specific signals that are ready to use, such as GPU/TPU utilization, `KV cache hits`, and `request queue length`. You can also configure it to use application-reported metrics sent using a standardized HTTP header mechanism. | Relies on application-reported metrics sent using a standardized HTTP header mechanism (this mechanism is known as _ORCA load reporting_). This format allows for reporting standard metrics, such as CPU and memory, or custom-named metrics for application-specific constrained resources. |
| Request handling | Handles non-uniform request costs, which are common in LLMs. Supports request [criticality levels](/kubernetes-engine/docs/concepts/about-gke-inference-gateway#traffic-distribution). | Optimized for relatively uniform request costs. Doesn't include built-in request prioritization. |
| LoRA adapter support | Provides native, affinity-based routing to the appropriate LoRA-equipped backends. | Doesn't provide native support. |
| Autoscaling integration | Optimizes scaling for model servers based on AI-specific metrics, such as `KV cache hits`. | The Horizontal Pod Autoscaler (HPA) can use custom metrics, but the setup is generic and based on the metrics reported for the Application Load Balancer (see the HPA sketch after this table). |
| Setup and configuration | Configure it with the GKE Gateway API. Extends the standard API with specialized InferencePool and InferenceModel Custom Resource Definitions (CRDs) to enable its AI-aware features. | Configure it with the GKE Gateway API's standard resources. The application must implement the HTTP header-based mechanism to report custom metrics. |
| Security | Provides AI-content filtering with Model Armor at the gateway. Leverages foundational GKE security features such as TLS, IAM, Role-Based Access Control (RBAC), and namespaces. | Uses the standard Application Load Balancer security stack, including TLS termination and IAM. Model Armor is supported by integrating it as a Service Extension. |
| Observability | Offers built-in observability into AI-specific metrics, including GPU or TPU utilization, `KV cache hits`, `request queue length`, and model latency. | Observability relies on the custom metrics that the application is configured to report, which can include standard or custom-named metrics. You can view these metrics in Cloud Monitoring. |
| Extensibility | Built on an extensible, open-source foundation that supports a user-managed Endpoint Picker algorithm. Extends the GKE Gateway API with specialized Custom Resource Definitions (InferencePool, InferenceModel) to simplify common AI/ML use cases. | Designed for flexibility, allowing load balancing to be extended with any custom metric (load signal) that the application can report by using the ORCA standard. |
| Launch stage | Preview | GA |
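
As a hedged illustration of the Autoscaling integration row, the following HorizontalPodAutoscaler scales a model server Deployment on an external custom metric. The Deployment name and the metric name are hypothetical placeholders for whatever signal your metrics adapter (for example, a Prometheus-based adapter) actually exposes.

```yaml
# Hedged sketch: the Deployment name and the external metric name are
# hypothetical; substitute the metric that your metrics adapter exposes
# (for example, a queue-length or KV-cache signal).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server                   # hypothetical model server Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: model_server_queue_length  # hypothetical metric name
      target:
        type: AverageValue
        averageValue: "10"               # scale out when the average queue exceeds 10
```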

When to use GKE Inference Gateway

Use GKE Inference Gateway to optimize sophisticated AI/ML inference workloads on GKE, especially for LLMs.

Choose GKE Inference Gateway when you need to do the following:

  • Model-aware routing: direct traffic based on LLM-specific states like KV cache hits or request queue length, or to specific LoRA adapters.
  • Cost-aware load balancing: efficiently handle inference requests with variable processing costs and prioritize them by criticality levels (critical, standard, or sheddable), as shown in the sketch after this list.
  • AI-specific autoscaling: dynamically scale model servers based on relevant AI metrics for optimal resource use.
  • Built-in AI safety and observability: use native Model Armor integration for AI safety checks and get ready-to-use insights into GPU/TPU utilization, KV cache hits, and request queue length.
  • Simplified GenAI deployment: benefit from a purpose-built, extensible solution that simplifies common GenAI deployment patterns on GKE while offering customization through its GKE Gateway API foundation.
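
For example, the criticality levels and model-aware routing in the preceding list are expressed through the InferenceModel resource. The following sketch uses hypothetical names (including the pool name from the earlier InferencePool sketch) and the API version from the open-source Gateway API Inference Extension; consult the GKE Inference Gateway documentation for the exact schema.

```yaml
# Illustrative sketch: the model and pool names are hypothetical, and the API
# version follows the open-source Gateway API Inference Extension; check the
# GKE Inference Gateway documentation for the exact schema.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-model
spec:
  modelName: chat-model              # model name matched from the request body
  criticality: Critical              # one of Critical, Standard, or Sheddable
  poolRef:
    name: vllm-llama3-8b-pool        # the InferencePool that serves this model
  targetModels:
  - name: chat-model-lora-v1         # for example, a specific LoRA adapter
    weight: 100
```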

When to use GKE Gateway with custom metrics

Use GKE Gateway with custom metrics for flexible, general-purpose load balancing that adapts to your application's unique performance indicators, including for some inference scenarios.

Choose GKE Gateway with custom metrics when you need to do the following:

  • Handle high traffic volumes with relatively uniform request costs.
  • Distribute load based on application-reported custom metrics using ORCA load reporting.
  • Operate without the AI/LLM-specific routing intelligence that the GKE Inference Gateway offers.
  • Prioritize consistency with existing Application Load Balancer deployments that meet your inference service's needs.

What's next