Learn how to configure Body-Based Routing, a capability of GKE Inference Gateway that lets you route inference requests by extracting the model name directly from the HTTP request body.
This is particularly useful for applications that adhere to specifications like the OpenAI API, where the model identifier is often embedded within the request payload rather than in headers or URL paths.
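For example, an OpenAI-style chat completions request embeds the model identifier in its JSON body rather than in a header or the URL path. The field values below are illustrative:

```json
{
  "model": "chatbot",
  "messages": [
    {"role": "user", "content": "Summarize this document."}
  ]
}
```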
How body-based routing works
GKE Inference Gateway implements Body-Based Routing as an ext_proc (external processing) extension of the Envoy proxy. The request flow integrates with your existing Gateway API configurations:
- Request reception: the Layer 7 load balancer receives an incoming inference request.
- Body parameter extraction: the Layer 7 load balancer forwards the request to the Body to Header extension, which extracts the standard `model` parameter from the HTTP request body.
- Header injection: the extracted `model` parameter's value is then injected as a new request header (with the key `X-Gateway-Model-Name`).
- Routing decision: with the model name now available in a request header, the GKE Inference Gateway can use existing Gateway API `HTTPRoute` constructs to make routing decisions. For example, `HTTPRoute` rules can match against the injected header to direct traffic to the appropriate `InferencePool`.
- Endpoint selection: the Layer 7 load balancer selects the appropriate `InferencePool` (BackendService) and its group of endpoints, then forwards the request and endpoint information to the Endpoint Picker extension for fine-grained endpoint selection within the chosen pool.
- Final routing: the request is routed to the specific model replica selected by the Endpoint Picker extension.
This process ensures that even when model information is deep within the request body, your GKE Inference Gateway can intelligently route traffic to the correct backend services.
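The extract-and-inject step can be sketched as follows. This is an illustrative simulation of what the Body to Header extension does, not the extension's actual code; the function and variable names are assumptions:

```python
import json

MODEL_HEADER = "X-Gateway-Model-Name"  # header key injected by the extension


def inject_model_header(headers: dict, body: bytes) -> dict:
    """Parse the JSON request body, extract the standard 'model'
    parameter, and return headers with X-Gateway-Model-Name added."""
    payload = json.loads(body)
    model = payload.get("model")
    if model is None:
        # No model in the body: leave the headers untouched, so the
        # gateway falls back to default routing (or fails closed).
        return headers
    return {**headers, MODEL_HEADER: model}


# An OpenAI-style request body with the model name embedded in the payload.
body = json.dumps(
    {"model": "chatbot", "messages": [{"role": "user", "content": "Hi"}]}
).encode()
headers = inject_model_header({"content-type": "application/json"}, body)
print(headers[MODEL_HEADER])  # -> chatbot
```

Downstream `HTTPRoute` rules then match on this injected header exactly as they would on any client-supplied header.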
Configure body-based routing
The Body-Based Routing (BBR) extension for GKE Inference Gateway runs as a service that you manage within your Kubernetes cluster. From the perspective of the Layer 7 load balancer, the BBR extension is an external gRPC server. When the load balancer needs to inspect a request body to determine the model name, it makes a gRPC call to the BBR service. The BBR service then processes the request and returns information to the load balancer, such as headers to be injected for routing.
To enable body-based routing, deploy the BBR extension as a Pod and integrate it using `GCPRoutingExtension` and `HTTPRoute` resources.
Prerequisites
- Ensure you have a GKE cluster (version 1.32 or later).
- Follow the main guide to configure GKE Inference Gateway.
Deploy the body-based router
The body-based routing extension is deployed as a Kubernetes Deployment and Service, along with a `GCPRoutingExtension` resource, within your cluster.
You can use Helm for a simplified installation.
To deploy the required resources for the body-based router, run the following command:
```shell
helm install body-based-router oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing \
  --set provider.name=gke \
  --set inferenceGateway.name=GATEWAY_NAME
```
Replace `GATEWAY_NAME` with the name of your Gateway resource.
This command deploys the following resources:
- A Service and a Deployment for the Body-Based Routing extension.
- A `GCPRoutingExtension` resource and a `GCPHealthCheckPolicy` resource to attach the Body-Based Routing extension to your GKE Gateway resource.
Configure `HTTPRoute` for model-aware routing
After the extension is deployed and configured, you can define `HTTPRoute` resources that use the injected header (`X-Gateway-Model-Name`) to make routing decisions.
The following is an example `HTTPRoute` manifest for model-aware routing:
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-to-llms
spec:
  parentRefs:
  - name: GATEWAY_NAME
  rules:
  - matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: chatbot # Matches the extracted model name
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: gemma # Target InferencePool for the 'chatbot' model
      kind: InferencePool
      group: "inference.networking.k8s.io"
  - matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: sentiment # Matches another extracted model name
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: llama # Target InferencePool for the 'sentiment' model
      kind: InferencePool
      group: "inference.networking.k8s.io"
```
To apply this manifest to your cluster, save it as `httproute-for-models.yaml` and run the following command:
```shell
kubectl apply -f httproute-for-models.yaml
```
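To see how the header matching in the manifest plays out, the following sketch simulates the two rules in plain Python. The rule table mirrors the example manifest; this is illustrative only, not how the load balancer is implemented:

```python
from typing import Optional

# Header-to-backend rules mirroring the example HTTPRoute manifest.
ROUTE_RULES = {
    "chatbot": "gemma",    # InferencePool serving the 'chatbot' model
    "sentiment": "llama",  # InferencePool serving the 'sentiment' model
}


def select_inference_pool(headers: dict) -> Optional[str]:
    """Return the InferencePool name matched by X-Gateway-Model-Name,
    or None when no rule matches (the gateway would then return an error)."""
    model = headers.get("X-Gateway-Model-Name")
    return ROUTE_RULES.get(model)


print(select_inference_pool({"X-Gateway-Model-Name": "chatbot"}))    # -> gemma
print(select_inference_pool({"X-Gateway-Model-Name": "sentiment"}))  # -> llama
print(select_inference_pool({}))                                     # -> None
```

Because the matching happens on an ordinary request header, no special machinery is needed in the `HTTPRoute` itself: any header-match rule the Gateway API supports works with the injected model name.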
Considerations and limitations
When planning your Body-Based Routing implementation, consider the following:
Fail-closed behavior: the Body-Based Routing extension is designed to operate in a "fail-closed" mode. If the extension is unavailable or fails to process a request, it results in an error (for example, a 404 or 503 response code if no default backend is configured) rather than routing incorrectly. Ensure your deployments are highly available to maintain service reliability.
Request body size and streaming: processing large HTTP request bodies, especially with streaming enabled, can introduce complexities. If the Envoy proxy is forced to stream the request body (typically for bodies larger than 250 KB), it might be unable to inject new headers. This can lead to routing failures (for example, a 404 error if the header matching rules cannot be applied).
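These two failure modes can be pictured with a small sketch. The threshold and status codes follow the caveats described above; the function is illustrative, not part of the extension:

```python
STREAMING_THRESHOLD = 250 * 1024  # bodies above ~250 KB may be streamed by Envoy


def routing_status(extension_available: bool, body_size: int) -> int:
    """Return the HTTP status a request would likely see, per the
    fail-closed and body-size caveats described above."""
    if not extension_available:
        # Fail closed: no header is injected, so no route matches.
        return 503
    if body_size > STREAMING_THRESHOLD:
        # Envoy streams the body and cannot inject the new header,
        # so header-match rules never apply.
        return 404
    return 200  # header injected; routing proceeds normally


print(routing_status(True, 10_000))   # -> 200
print(routing_status(False, 10_000))  # -> 503
print(routing_status(True, 300_000))  # -> 404
```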
Long-term maintenance: the Body-Based Routing extension runs as a component within your GKE clusters. You are responsible for its lifecycle management, including upgrades, security patches, and ensuring its continued operation.
What's next
- Learn how to customize GKE Inference Gateway.
- Read more about GKE Inference Gateway.
- Monitor your Gateway resources.
- Learn more about the Gateway API.