This page introduces Google Kubernetes Engine (GKE) Inference Gateway, an enhancement to the GKE Gateway for optimized serving of generative AI applications. It explains the key concepts, features, and how GKE Inference Gateway works.
This page is intended for the following personas:
- Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving AI/ML workloads.
- Cloud architects and Networking specialists who interact with Kubernetes networking.
Before you read this page, ensure that you're familiar with the following:
- AI/ML orchestration on GKE.
- Generative AI glossary.
- GKE networking concepts, including Services, and the GKE Gateway API.
- Load balancing in Google Cloud, especially how load balancers interact with GKE.
Overview
GKE Inference Gateway is an extension to the GKE Gateway that provides optimized routing and load balancing for serving generative Artificial Intelligence (AI) workloads. It simplifies the deployment, management, and observability of AI inference workloads.
Features and benefits
GKE Inference Gateway provides the following key capabilities to efficiently serve models for generative AI applications on GKE:
- Optimized load balancing for inference: distributes requests to optimize AI model serving performance. It uses metrics from model servers, such as `KVCache Utilization` and the queue length of pending requests, to use accelerators (such as GPUs and TPUs) more efficiently for generative AI workloads.
- Dynamic LoRA fine-tuned model serving: supports serving dynamic LoRA fine-tuned models on a common accelerator. This reduces the number of GPUs and TPUs required to serve models by multiplexing multiple LoRA fine-tuned models on a common base model and accelerator.
- Optimized autoscaling for inference: the GKE Horizontal Pod Autoscaler (HPA) uses model server metrics to autoscale, which helps ensure efficient compute resource use and optimized inference performance (see the autoscaler sketch after this list).
- Model-aware routing: routes inference requests based on the model names defined in the OpenAI API specifications within your GKE cluster. You can define Gateway routing policies, such as traffic splitting and request mirroring, to manage different model versions and simplify model rollouts. For example, you can route requests for a specific model name to different `InferencePool` objects, each serving a different version of the model.
- Model-specific serving `Criticality`: lets you specify the serving `Criticality` of AI models. Prioritize latency-sensitive requests over latency-tolerant batch inference jobs. For example, you can prioritize requests from latency-sensitive applications and drop less time-sensitive tasks when resources are constrained.
- Integrated AI safety: integrates with Google Cloud Model Armor, a service that applies AI safety checks to prompts and responses at the gateway. Model Armor provides logs of requests, responses, and processing for retrospective analysis and optimization. GKE Inference Gateway's open interfaces let third-party providers and developers integrate custom services into the inference request process.
- Inference observability: provides observability metrics for inference requests, such as request rate, latency, errors, and saturation. Monitor the performance and behavior of your inference services.
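For example, the following is a minimal autoscaling sketch, assuming the model server runs as a Deployment named `vllm-model-server` and a per-Pod queue-depth metric is already exposed to the cluster's custom metrics API (for example, through a metrics adapter). All names and values are placeholders, not a prescribed configuration.

```yaml
# Illustrative HPA that scales a model server Deployment on a per-Pod
# queue-depth metric. The Deployment name, metric name, and target value
# are placeholders; a custom metrics adapter must expose the metric.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-model-server              # placeholder model server Deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: model_server_queue_length  # placeholder queue-depth metric
      target:
        type: AverageValue
        averageValue: "10"               # scale out above ~10 queued requests per replica
```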
Understand key concepts
GKE Inference Gateway enhances the existing GKE Gateway that uses `GatewayClass` objects. GKE Inference Gateway introduces the following new Gateway API Custom Resource Definitions (CRDs), aligned with the OSS Kubernetes Gateway API extension for Inference:
- `InferencePool` object: represents a group of Pods (containers) that share the same compute configuration, accelerator type, base language model, and model server. This logically groups and manages your AI model serving resources. A single `InferencePool` object can span multiple Pods across different GKE nodes and provides scalability and high availability. For an example manifest, see the sketch after this list.
- `InferenceModel` object: specifies the serving model's name from the `InferencePool` according to the OpenAI API specification. The `InferenceModel` object also specifies the model's serving properties, such as the AI model's `Criticality`. GKE Inference Gateway gives preference to workloads classified as `Critical`. This lets you multiplex latency-critical and latency-tolerant AI workloads on a GKE cluster. You can also configure the `InferenceModel` object to serve LoRA fine-tuned models.
- `TargetModel` object: specifies the target model name and the `InferencePool` object that serves the model. This lets you define Gateway routing policies, such as traffic splitting and request mirroring, and simplify model version rollouts.
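The following is a minimal sketch of an `InferencePool` and an `InferenceModel`, assuming the `v1alpha2` OSS Gateway API inference extension CRDs. All names, labels, and the port are illustrative placeholders; check the CRD versions installed in your cluster for the exact schema.

```yaml
# Illustrative only: an InferencePool that groups vLLM model server Pods,
# and an InferenceModel that exposes the model name "chatbot" from that pool.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llm-pool
spec:
  selector:
    app: vllm-llama3-8b              # labels on the model server Pods
  targetPortNumber: 8000             # port the model server listens on
  extensionRef:
    name: llm-pool-endpoint-picker   # endpoint picker deployed for this pool
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot                 # model name clients send in the request body
  criticality: Critical              # prefer this workload under resource pressure
  poolRef:
    name: llm-pool                   # InferencePool that serves this model
```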
The following diagram illustrates GKE Inference Gateway and its integration with AI safety, observability, and model serving within a GKE cluster.

The following diagram illustrates the resource model that focuses on two new inference-focused personas and the resources they manage.

How GKE Inference Gateway works
GKE Inference Gateway uses Gateway API extensions and model-specific routing logic to handle client requests to an AI model. The following steps describe the request flow.
How the request flow works
GKE Inference Gateway routes each client request from initial receipt to a model instance. This section describes how GKE Inference Gateway handles requests. This request flow is common to all clients.
- The client sends a request, formatted as described in the OpenAI API specification, to the model running in GKE.
- GKE Inference Gateway processes the request using the following inference extensions:
  - Body-based routing extension: extracts the model identifier from the client request body and sends it to GKE Inference Gateway. GKE Inference Gateway then uses this identifier to route the request based on rules defined in the Gateway API `HTTPRoute` object (see the sketch after these steps). Request body routing is similar to routing based on the URL path. The difference is that request body routing uses data from the request body.
  - Security extension: uses Model Armor or supported third-party solutions to enforce model-specific security policies that include content filtering, threat detection, sanitization, and logging. The security extension applies these policies to both request and response processing paths. This enables the security extension to sanitize and log both requests and responses.
  - Endpoint picker extension: monitors key metrics from model servers within the `InferencePool`. It tracks the key-value cache (KV-cache) utilization, the queue length of pending requests, and the active LoRA adapters on each model server. It then routes the request to the optimal model replica based on these metrics to minimize latency and maximize throughput for AI inference.
- GKE Inference Gateway routes the request to the model replica returned by the endpoint picker extension.
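As a sketch of the routing rule referenced in these steps, an `HTTPRoute` can forward matching traffic to an `InferencePool` backend instead of a regular Service. The Gateway, route, and pool names below are placeholders:

```yaml
# Illustrative HTTPRoute that sends inference traffic to an InferencePool.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway          # placeholder Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool            # route to the pool instead of a Service
      name: llm-pool                 # placeholder pool name
```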
The following diagram illustrates the request flow from a client to a model instance through GKE Inference Gateway.

How traffic distribution works
GKE Inference Gateway dynamically distributes inference requests to model servers within the `InferencePool` object. This helps optimize resource utilization and maintain performance under varying load conditions.
GKE Inference Gateway uses the following two mechanisms to manage traffic distribution:
- Endpoint picking: dynamically selects the most suitable model server to handle an inference request. It monitors server load and availability, then makes routing decisions.
- Queueing and shedding: manages request flow and prevents traffic overload. GKE Inference Gateway stores incoming requests in a queue, prioritizes requests based on defined criteria, and drops requests when the system is overloaded.
GKE Inference Gateway supports the following `Criticality` levels:
- `Critical`: these workloads are prioritized. The system ensures these requests are served even under resource constraints.
- `Standard`: these workloads are served when resources are available. If resources are limited, these requests are dropped.
- `Sheddable`: these workloads are served opportunistically. If resources are scarce, these requests are dropped to protect `Critical` workloads.

When the system is under resource pressure, `Standard` and `Sheddable` requests are immediately dropped with a `429` error code to safeguard `Critical` workloads.
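You set the level on the `InferenceModel` object. The following is a minimal sketch, assuming the same `v1alpha2` CRD schema and placeholder names as the earlier examples:

```yaml
# Illustrative: a batch summarization model marked Sheddable so its requests
# are dropped first when resources are scarce.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: batch-summarizer
spec:
  modelName: batch-summarizer
  criticality: Sheddable             # Critical | Standard | Sheddable
  poolRef:
    name: llm-pool
```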
Streaming inference
GKE Inference Gateway supports streaming inference for applications like chatbots and live translation that require continuous or near-real-time updates. Streaming inference delivers responses in incremental chunks or segments, rather than as a single, complete output. If an error occurs during a streaming response, the stream terminates, and the client receives an error message. GKE Inference Gateway does not retry streaming responses.
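For example, a client opts into streaming through the OpenAI-style `stream` field in the request body. The body is shown here in YAML for consistency with the other examples; the wire format is JSON, and the model name is a placeholder:

```yaml
# OpenAI-style chat completion request that asks for a streamed response.
model: chatbot                       # placeholder model name
stream: true                         # deliver the response in incremental chunks
messages:
- role: user
  content: "Summarize the following meeting notes."
```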
Explore application examples
This section provides examples to address various generative AI application scenarios by using GKE Inference Gateway.
Example 1: Serve multiple generative AI models on a GKE cluster
A company wants to deploy multiple large language models (LLMs) to serve different workloads. For example, they might want to deploy a `Gemma3` model for a chatbot interface and a `Deepseek` model for a recommendation application. The company needs to ensure optimal serving performance for these LLMs.
Using GKE Inference Gateway, you can deploy these LLMs on your GKE cluster with your chosen accelerator configuration in an `InferencePool`. You can then route requests based on the model name (such as `chatbot` and `recommender`) and the `Criticality` property.
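A minimal sketch of the two `InferenceModel` objects for this scenario, assuming `InferencePool` objects named `gemma3-pool` and `deepseek-pool` already exist; all names are placeholders:

```yaml
# Illustrative: "chatbot" requests go to the Gemma3 pool as Critical,
# "recommender" requests go to the Deepseek pool as Standard.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot
  criticality: Critical
  poolRef:
    name: gemma3-pool
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: recommender
spec:
  modelName: recommender
  criticality: Standard
  poolRef:
    name: deepseek-pool
```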
The following diagram illustrates how GKE Inference Gateway routes requests to different models based on the model name and `Criticality`.

Example 2: Serve LoRA adapters on a shared accelerator
A company wants to serve LLMs for document analysis and focuses on audiences in multiple languages, such as English and Spanish. They have fine-tuned models for each language, but need to efficiently use their GPU and TPU capacity. You can use GKE Inference Gateway to deploy dynamic LoRA fine-tuned adapters for each language (for example, `english-bot` and `spanish-bot`) on a common base model (for example, `llm-base`) and accelerator. This lets you reduce the number of required accelerators by densely packing multiple models on a common accelerator.
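A minimal sketch for this scenario, assuming an `InferencePool` named `llm-base-pool` serves the shared base model and the LoRA adapters are already loaded on the model servers. The `targetModels` entries must match the adapter names, and all names are placeholders:

```yaml
# Illustrative: each InferenceModel maps a client-facing model name to a
# LoRA adapter served from the shared base model pool.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: english-bot
spec:
  modelName: english-bot
  criticality: Standard
  poolRef:
    name: llm-base-pool
  targetModels:
  - name: english-bot                # LoRA adapter registered on the model server
    weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: spanish-bot
spec:
  modelName: spanish-bot
  criticality: Standard
  poolRef:
    name: llm-base-pool
  targetModels:
  - name: spanish-bot
    weight: 100
```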
The following diagram illustrates how GKE Inference Gateway serves multiple LoRA adapters on a shared accelerator.

What's next
- Deploy GKE Inference Gateway
- Customize GKE Inference Gateway configuration
- Serve an LLM with GKE Inference Gateway