This page describes how to deploy GKE Inference Gateway.
This page is intended for Networking specialists responsible for managing GKE infrastructure and platform administrators who manage AI workloads.
Before reading this page, ensure that you're familiar with the following:
- About GKE Inference Gateway
- AI/ML orchestration on GKE
- Generative AI glossary
- Load balancing in Google Cloud, especially how load balancers interact with GKE
- GKE Service Extensions. For more information, read the GKE Gateway controller documentation.
- Customize GKE Gateway traffic using Service Extensions
GKE Inference Gateway enhances Google Kubernetes Engine (GKE) Gateway to optimize the serving of generative AI applications and workloads on GKE. It provides efficient management and scaling of AI workloads, enables workload-specific performance objectives such as latency, and enhances resource utilization, observability, and AI safety.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running `gcloud components update`.
- Enable the Compute Engine API, the Network Services API, and, if needed, the Model Armor API. Go to Enable access to APIs and follow the instructions.
- Make sure that you have the following roles on the project: `roles/container.admin` and `roles/iam.serviceAccountAdmin` (a sample grant command appears after this list).
- Ensure that your project has sufficient quota for H100 GPUs. To learn more, see Plan GPU quota and Allocation quotas.
- Create a Hugging Face account if you don't already have one. You need it to access the model resources for this tutorial.
- Request access to the Llama 3.1 model and generate an access token. Access to this model requires an approved request on Hugging Face, and the deployment fails if access hasn't been granted.
- Sign the license consent agreement: You must sign the consent agreement to use the Llama 3.1 model. Go to the model's page on Hugging Face, verify your account, and accept the terms.
- Generate an access token: To access the model, you need a Hugging Face token. In your Hugging Face account, go to Your Profile > Settings > Access Tokens, create a new token with at least Read permissions, and copy it to your clipboard.
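If you still need to grant the required IAM roles, a minimal sketch with the gcloud CLI might look like the following; the project ID and user email are placeholders, not values from this tutorial:

```bash
# Grant the required roles to your user account on the project (placeholders in caps).
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/container.admin"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/iam.serviceAccountAdmin"
```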
GKE Gateway controller requirements
- GKE version 1.32.3 or later.
- Google Cloud CLI version 407.0.0 or later.
- Gateway API is supported on VPC-native clusters only.
- You must enable a proxy-only subnet.
- Your cluster must have the `HttpLoadBalancing` add-on enabled.
- If you are using Istio, you must upgrade Istio to one of the following versions:
  - 1.15.2 or later
  - 1.14.5 or later
  - 1.13.9 or later
- If you are using Shared VPC, then in the host project, you need to assign the Compute Network User role to the GKE service account for the service project. A sample command follows this list.
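For example, assuming you know the host project ID and the service project number, the role could be granted with a command like the following sketch (both values are placeholders):

```bash
# Grant the Compute Network User role on the host project to the GKE
# service account of the service project (placeholders shown in caps).
gcloud projects add-iam-policy-binding HOST_PROJECT_ID \
    --member="serviceAccount:service-SERVICE_PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"
```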
Restrictions and limitations
The following restrictions and limitations apply:
- Multi-cluster Gateways are not supported.
- GKE Inference Gateway is only supported on the `gke-l7-regional-external-managed` and `gke-l7-rilb` GatewayClass resources.
- Cross-regional internal Application Load Balancers are not supported.
Configure GKE Inference Gateway
To configure GKE Inference Gateway, consider this example: a team runs vLLM with Llama3 models and actively experiments with two distinct LoRA fine-tuned adapters, "food-review" and "cad-fabricator".
The high-level workflow for configuring GKE Inference Gateway is as follows:
- Prepare your environment: set up the necessary infrastructure and components.
- Create an inference pool: define a pool of model servers using the `InferencePool` custom resource.
- Specify inference objectives: specify inference objectives using the `InferenceObjective` custom resource.
- Create the Gateway: expose the inference service using the Gateway API.
- Create the `HTTPRoute`: define how HTTP traffic is routed to the inference service.
- Send inference requests: make requests to the deployed model.
Prepare your environment
Install Helm.
Create a GKE cluster:
- Create a GKE Autopilot or Standard cluster with version 1.32.3 or later (a sample command follows this list). For instructions, see Create a GKE cluster.
- Configure the nodes with your preferred compute family and accelerator.
- Use GKE Inference Quickstart for pre-configured and tested deployment manifests, based on your selected accelerator, model, and performance needs.
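As a rough sketch, and assuming a Standard cluster with an H100 node pool, cluster creation might look like the following; the cluster name, region, machine type, and node-pool sizing are illustrative placeholders rather than recommended values:

```bash
# Create a Standard cluster with the Gateway API enabled (illustrative values).
gcloud container clusters create inference-cluster \
    --region=us-central1 \
    --cluster-version=CLUSTER_VERSION \
    --gateway-api=standard

# Add a GPU node pool for the model server (H100 example; adjust to your accelerator).
gcloud container node-pools create gpu-pool \
    --cluster=inference-cluster \
    --region=us-central1 \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8 \
    --num-nodes=1
```

Replace CLUSTER_VERSION with 1.32.3 or a later version available in your region.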
Install the needed Custom Resource Definitions (CRDs) in your GKE cluster:
- For GKE versions earlier than 1.34.0-gke.1626000, run the following command to install both the v1 `InferencePool` and the alpha `InferenceObjective` CRDs:

  kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml
- For GKE versions 1.34.0-gke.1626000 or later, run the following command to install the alpha `InferenceObjective` CRD only:

  kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml
- If you are using a GKE version earlier than v1.32.2-gke.1182001 and you want to use Model Armor with GKE Inference Gateway, you must install the traffic and routing extension CRDs:

  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/gke-gateway-api/refs/heads/main/config/crd/networking.gke.io_gcptrafficextensions.yaml
  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/gke-gateway-api/refs/heads/main/config/crd/networking.gke.io_gcproutingextensions.yaml
To set up authorization to scrape metrics, create the `inference-gateway-sa-metrics-reader-secret` secret:

```bash
kubectl apply -f - <<EOF
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-gateway-metrics-reader
rules:
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: inference-gateway-sa-metrics-reader
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: inference-gateway-sa-metrics-reader-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: inference-gateway-sa-metrics-reader
  namespace: default
roleRef:
  kind: ClusterRole
  name: inference-gateway-metrics-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Secret
metadata:
  name: inference-gateway-sa-metrics-reader-secret
  namespace: default
  annotations:
    kubernetes.io/service-account.name: inference-gateway-sa-metrics-reader
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-gateway-sa-metrics-reader-secret-read
rules:
- resources:
  - secrets
  apiGroups: [""]
  verbs: ["get", "list", "watch"]
  resourceNames: ["inference-gateway-sa-metrics-reader-secret"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gmp-system:collector:inference-gateway-sa-metrics-reader-secret-read
  namespace: default
roleRef:
  name: inference-gateway-sa-metrics-reader-secret-read
  kind: ClusterRole
  apiGroup: rbac.authorization.k8s.io
subjects:
- name: collector
  namespace: gmp-system
  kind: ServiceAccount
EOF
```
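To confirm that the authorization objects were created, you can run a few read-only checks like the following (optional; the names match the manifest above):

```bash
# Verify that the RBAC objects and the token secret exist.
kubectl get clusterrole inference-gateway-metrics-reader
kubectl get serviceaccount inference-gateway-sa-metrics-reader -n default
kubectl get secret inference-gateway-sa-metrics-reader-secret -n default
```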
Create a model server and model deployment
This section shows how to deploy a model server and model. The example uses a vLLM model server with a Llama3 model. The Deployment is labeled `app:vllm-llama3-8b-instruct`. The Deployment also uses two LoRA adapters named `food-review` and `cad-fabricator` from Hugging Face.
You can adapt this example with your own model server container and model, serving port, and deployment name. You can also configure LoRA adapters in the deployment, or deploy the base model. The following steps describe how to create the necessary Kubernetes resources.
Create a Kubernetes Secret to store your Hugging Face token. This token is used to access the base model and the LoRA adapters:

  kubectl create secret generic hf-token --from-literal=token=HF_TOKEN

Replace HF_TOKEN with your Hugging Face token.

Deploy the model server and model. The following command applies a manifest that defines a Kubernetes Deployment for a vLLM model server with a Llama3 model:

  kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/release-1.0/config/manifests/vllm/gpu-deployment.yaml
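Before you continue, it can help to confirm that the model server Pods reach the Running state; the label below matches the sample Deployment, so adjust it if you changed the deployment name:

```bash
# Watch the model server Pods until they are Running and Ready.
kubectl get pods -l app=vllm-llama3-8b-instruct --watch

# Optionally inspect recent logs to confirm the model and adapters loaded.
kubectl logs -l app=vllm-llama3-8b-instruct --tail=50
```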
Create an inference pool
The InferencePool
Kubernetes custom resource defines a group of Pods with a
common base large language model (LLM) and compute configuration. The
selector
field specifies which Pods belong to this pool. The labels in this
selector must exactly match the labels applied to your model server Pods. The
targetPort
field defines the ports that the model server uses within the Pods.
The extensionRef
field references an extension service that provides
additional capability for the inference pool. The InferencePool
enables
GKE Inference Gateway to route traffic to your model server
Pods.
Before you create the InferencePool
, ensure that the Pods that the
InferencePool
selects are already running.
To create an InferencePool
using Helm, perform the following steps:
helm install vllm-llama3-8b-instruct \
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
--set provider.name=gke \
--version v1.0.0 \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
Change the following field to match your Deployment:
- `inferencePool.modelServers.matchLabels.app`: the value of the `app` label that selects your model server Pods.

The Helm installation automatically installs the necessary timeout policy, the endpoint picker, and the Pods needed for observability.
This command creates an `InferencePool` object named `vllm-llama3-8b-instruct` that references the model endpoint services within the Pods. It also creates a Deployment of the endpoint picker, labeled `app:vllm-llama3-8b-instruct-epp`, for the created `InferencePool`.
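You can optionally verify that the pool and its endpoint picker exist; the resource names below follow the example above:

```bash
# Confirm the InferencePool and the endpoint picker Deployment were created.
kubectl get inferencepool vllm-llama3-8b-instruct
kubectl get deployments -l app=vllm-llama3-8b-instruct-epp
```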
Specify inference objectives
The `InferenceObjective` custom resource lets you specify the priority of requests.

The `metadata.name` field of the `InferenceObjective` resource specifies the name of the Inference Objective, the `priority` field specifies its serving criticality, and the `poolRef` field specifies the `InferencePool` on which the model is served.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: NAME
spec:
  priority: VALUE
  poolRef:
    name: INFERENCE_POOL_NAME
    group: "inference.networking.k8s.io"
```
Replace the following:
- `NAME`: the name of your Inference Objective. For example, `food-review`.
- `VALUE`: the priority for the Inference Objective. This is an integer where a higher value indicates a more critical request. For example, 10.
- `INFERENCE_POOL_NAME`: the name of the `InferencePool` you created in the previous step. For example, `vllm-llama3-8b-instruct`.
To create an InferenceObjective
, perform the following steps:
Save the following manifest as `inference-objectives.yaml`. This manifest creates two `InferenceObjective` resources. The first configures the `food-review` Inference Objective on the `vllm-llama3-8b-instruct` `InferencePool` with a priority of 10. The second configures the `llama3-base-model` Inference Objective to be served with a higher priority of 20.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: food-review
spec:
  priority: 10
  poolRef:
    name: vllm-llama3-8b-instruct
    group: "inference.networking.k8s.io"
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: llama3-base-model
spec:
  priority: 20 # Higher priority
  poolRef:
    name: vllm-llama3-8b-instruct
```
Apply the sample manifest to your cluster:
kubectl apply -f inference-objectives.yaml
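Optionally, you can list the objectives to confirm they were created with the expected priorities (this assumes the CRD's plural resource name, `inferenceobjectives`):

```bash
# List the InferenceObjective resources in the default namespace.
kubectl get inferenceobjectives
```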
Create the Gateway
The Gateway resource is the entry point for external traffic into your Kubernetes cluster. It defines the listeners that accept incoming connections.
The GKE Inference Gateway works with the following Gateway Classes:
- `gke-l7-rilb`: for regional internal Application Load Balancers.
- `gke-l7-regional-external-managed`: for regional external Application Load Balancers.
For more information, see Gateway Classes documentation.
To create a Gateway, perform the following steps:
Save the following sample manifest as `gateway.yaml`:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: GATEWAY_NAME
spec:
  gatewayClassName: GATEWAY_CLASS
  listeners:
    - protocol: HTTP
      port: 80
      name: http
```
Replace the following:
- `GATEWAY_NAME`: a unique name for your Gateway resource. For example, `inference-gateway`.
- `GATEWAY_CLASS`: the Gateway Class that you want to use. For example, `gke-l7-regional-external-managed`.
Apply the manifest to your cluster:
kubectl apply -f gateway.yaml
Note: For more information about configuring TLS to secure your Gateway with HTTPS, see the GKE documentation on TLS configuration.
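After you apply the manifest, you can watch the Gateway until GKE programs the load balancer and assigns an address; this read-only check assumes the example name `inference-gateway`:

```bash
# Wait until the Gateway reports a programmed status and an address.
kubectl get gateway inference-gateway --watch

# Inspect the conditions and the assigned IP address in detail.
kubectl describe gateway inference-gateway
```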
Create the HTTPRoute
The HTTPRoute
resource defines how the GKE Gateway routes
incoming HTTP requests to backend services, such as your InferencePool
. The
HTTPRoute
resource specifies matching rules (for example, headers or paths) and the
backend to which traffic should be forwarded.
To create an `HTTPRoute`, save the following sample manifest as `httproute.yaml`:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: HTTPROUTE_NAME
spec:
  parentRefs:
    - name: GATEWAY_NAME
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: PATH_PREFIX
      backendRefs:
        - name: INFERENCE_POOL_NAME
          group: "inference.networking.k8s.io"
          kind: InferencePool
```
Replace the following:
- `HTTPROUTE_NAME`: a unique name for your `HTTPRoute` resource. For example, `my-route`.
- `GATEWAY_NAME`: the name of the `Gateway` resource that you created. For example, `inference-gateway`.
- `PATH_PREFIX`: the path prefix that you use to match incoming requests. For example, `/` to match all.
- `INFERENCE_POOL_NAME`: the name of the `InferencePool` resource that you want to route traffic to. For example, `vllm-llama3-8b-instruct`.
Apply the manifest to your cluster:
kubectl apply -f httproute.yaml
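Before you send traffic, you can confirm that the route was accepted by its parent Gateway; this check assumes the example name `my-route`:

```bash
# Check that the HTTPRoute has been accepted and bound to the Gateway.
kubectl describe httproute my-route
```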
Send inference request
After you have configured GKE Inference Gateway, you can send inference requests to your deployed model. This lets you generate text based on your input prompt and specified parameters.
To send inference requests, perform the following steps:
Set the following environment variables:
```bash
export GATEWAY_NAME=GATEWAY_NAME
export PORT_NUMBER=PORT_NUMBER # Use 80 for HTTP
```
Replace the following:
GATEWAY_NAME
: the name of your Gateway resource.PORT_NUMBER
: the port number you configured in the Gateway.
To get the Gateway endpoint, run the following command:
echo "Waiting for the Gateway IP address..." IP="" while [ -z "$IP" ]; do IP=$(kubectl get gateway/${GATEWAY_NAME} -o jsonpath='{.status.addresses[0].value}' 2>/dev/null) if [ -z "$IP" ]; then echo "Gateway IP not found, waiting 5 seconds..." sleep 5 fi done echo "Gateway IP address is: $IP" PORT=${PORT_NUMBER}
To send a request to the `/v1/completions` endpoint using `curl`, run the following command. The Authorization header uses double quotes so that the token from `gcloud` is expanded:

```bash
curl -i -X POST ${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -d '{
    "model": "MODEL_NAME",
    "prompt": "PROMPT_TEXT",
    "max_tokens": MAX_TOKENS,
    "temperature": TEMPERATURE
  }'
```
Replace the following:
- `MODEL_NAME`: the name of the model or LoRA adapter to use.
- `PROMPT_TEXT`: the input prompt for the model.
- `MAX_TOKENS`: the maximum number of tokens to generate in the response.
- `TEMPERATURE`: controls the randomness of the output. Use the value `0` for deterministic output, or a higher number for more creative output.
The following example shows you how to send a sample request to GKE Inference Gateway:
```bash
curl -i -X POST ${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -d '{
    "model": "food-review-1",
    "prompt": "What is the best pizza in the world?",
    "max_tokens": 2048,
    "temperature": 0
  }'
```
Be aware of the following behaviors:
- Request body: the request body can include additional parameters such as `stop` and `top_p`. Refer to the OpenAI API specification for a complete list of options.
- Error handling: implement proper error handling in your client code to handle potential errors in the response. For example, check the HTTP status code in the `curl` response; a non-`200` status code generally indicates an error. A minimal check is sketched after this list.
- Authentication and authorization: for production deployments, secure your API endpoint with authentication and authorization mechanisms. Include the appropriate headers (for example, `Authorization`) in your requests.
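As a minimal sketch of such error handling in the shell, assuming the `IP` and `PORT` variables from the previous steps and the example model name, you could check the HTTP status code before using the response body:

```bash
# Send the request, save the body to a file, and capture only the HTTP status code.
STATUS=$(curl -s -o /tmp/response.json -w "%{http_code}" -X POST ${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "food-review-1", "prompt": "Hello", "max_tokens": 16}')

# Treat anything other than 200 as an error and print the body for debugging.
if [ "$STATUS" -ne 200 ]; then
  echo "Request failed with HTTP status $STATUS" >&2
  cat /tmp/response.json >&2
else
  cat /tmp/response.json
fi
```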
What's next
- Customize GKE Inference Gateway configuration
- Configure Body-Based Routing
- Serve an LLM with GKE Inference Gateway