Customize GKE Inference Gateway configuration


This page describes how to customize your GKE Inference Gateway deployment.

This page is for Networking specialists who are responsible for managing GKE infrastructure, and for platform administrators who manage AI workloads.

To manage and optimize inference workloads, you configure advanced features of GKE Inference Gateway.

Understand and configure the following advanced features:

  • AI security and safety checks with Model Armor
  • Apigee integration for authentication and API management
  • Observability, alerting, and logging
  • Autoscaling of InferencePool workloads

Configure AI security and safety checks

GKE Inference Gateway integrates with Model Armor to perform safety checks on prompts and responses for applications that use large language models (LLMs). This integration provides an additional layer of safety enforcement at the infrastructure level that complements application-level safety measures. This enables centralized policy application across all LLM traffic.

The following diagram illustrates Model Armor integration with GKE Inference Gateway on a GKE cluster:

Figure: Model Armor integration on a GKE cluster

To configure AI safety checks, perform the following steps:

  1. Prerequisites

    1. Enable the Model Armor service in your Google Cloud project.
    2. Create a Model Armor template by using the Model Armor console, the Google Cloud CLI, or the API. The following command creates a template named llm that logs sanitization operations and filters prompts and responses for prompt injection, jailbreak attempts, and harmful content.

      # Set environment variables
      PROJECT_ID=$(gcloud config get-value project)
      # Replace CLUSTER_LOCATION with the location of your GKE cluster, for example `us-central1`.
      LOCATION="CLUSTER_LOCATION"
      MODEL_ARMOR_TEMPLATE_NAME=llm
      
      # Set the regional API endpoint
      gcloud config set api_endpoint_overrides/modelarmor \
        "https://modelarmor.$LOCATION.rep.googleapis.com/"
      
      # Create the template
      gcloud model-armor templates create $MODEL_ARMOR_TEMPLATE_NAME \
        --location $LOCATION \
        --pi-and-jailbreak-filter-settings-enforcement=enabled \
        --pi-and-jailbreak-filter-settings-confidence-level=MEDIUM_AND_ABOVE \
        --rai-settings-filters='[{ "filterType": "HATE_SPEECH", "confidenceLevel": "MEDIUM_AND_ABOVE" },{ "filterType": "DANGEROUS", "confidenceLevel": "MEDIUM_AND_ABOVE" },{ "filterType": "HARASSMENT", "confidenceLevel": "MEDIUM_AND_ABOVE" },{ "filterType": "SEXUALLY_EXPLICIT", "confidenceLevel": "MEDIUM_AND_ABOVE" }]' \
        --template-metadata-log-sanitize-operations \
        --template-metadata-log-operations
      
  2. Grant IAM permissions

    The Service Extensions service account requires permissions to access the necessary resources. Grant the required roles by running the following commands:

    PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format 'get(projectNumber)')
    
    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
        --role=roles/container.admin
    
    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
        --role=roles/modelarmor.calloutUser
    
    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
        --role=roles/serviceusage.serviceUsageConsumer
    
    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
        --role=roles/modelarmor.user
    
  3. Configure the GCPTrafficExtension

    To apply the Model Armor policies to your Gateway, create a GCPTrafficExtension resource with the correct metadata format.

    1. Save the following sample manifest as gcp-traffic-extension.yaml:

      kind: GCPTrafficExtension
      apiVersion: networking.gke.io/v1
      metadata:
        name: my-model-armor-extension
      spec:
        targetRefs:
        - group: "gateway.networking.k8s.io"
          kind: Gateway
          name: GATEWAY_NAME
        extensionChains:
        - name: my-model-armor-chain1
          matchCondition:
            celExpressions:
              - celMatcher: request.path.startsWith("/")
          extensions:
          - name: my-model-armor-service
            supportedEvents:
            - RequestHeaders
            - RequestBody
            - RequestTrailers
            - ResponseHeaders
            - ResponseBody
            - ResponseTrailers
            timeout: 1s
            failOpen: false
            googleAPIServiceName: "modelarmor.${LOCATION}.rep.googleapis.com"
            metadata:
              model_armor_settings: '[{"model": "${MODEL}","model_response_template_id": "projects/${PROJECT_ID}/locations/${LOCATION}/templates/${MODEL_ARMOR_TEMPLATE_NAME}","user_prompt_template_id": "projects/${PROJECT_ID}/locations/${LOCATION}/templates/${MODEL_ARMOR_TEMPLATE_NAME}"}]'
      

      Replace the following:

      • GATEWAY_NAME: the name of the Gateway.
      • MODEL_ARMOR_TEMPLATE_NAME: the name of your Model Armor template.

      The gcp-traffic-extension.yaml file includes the following settings:

      • targetRefs: specifies the Gateway to which this extension applies.
      • extensionChains: defines a chain of extensions to be applied to the traffic.
      • matchCondition: defines the conditions under which the extensions are applied.
      • extensions: defines the extensions to be applied.
      • supportedEvents: specifies the events during which the extension is invoked.
      • timeout: specifies the timeout for the extension.
      • googleAPIServiceName: specifies the service name for the extension.
      • metadata: specifies the metadata for the extension. For Model Armor, this includes the model_armor_settings entry that references the templates used for prompt and response sanitization.
    2. Apply the sample manifest to your cluster:

      export GATEWAY_NAME="your-gateway-name"
      export MODEL="google/gemma-3-1b-it" # Or your specific model
      # Export the variables that you set earlier so that envsubst can substitute them.
      export PROJECT_ID LOCATION MODEL_ARMOR_TEMPLATE_NAME
      envsubst < gcp-traffic-extension.yaml | kubectl apply -f -
      

After you configure the AI safety checks and integrate them with your Gateway, Model Armor automatically filters prompts and responses based on the defined rules.
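
To confirm that the template exists and that the extension is attached to your Gateway, you can run a quick check similar to the following sketch. It assumes the variable names from the previous steps and that the GCPTrafficExtension reports a Programmed condition in its status, as shown for the Apigee extension later on this page.

    # Confirm that the Model Armor template exists in the selected location.
    gcloud model-armor templates describe $MODEL_ARMOR_TEMPLATE_NAME \
        --location $LOCATION

    # Wait for the traffic extension to be programmed on the Gateway.
    kubectl wait gcptrafficextension/my-model-armor-extension \
        --for=jsonpath='{.status.ancestors[0].conditions[?(@.type=="Programmed")].status}'=True \
        --timeout=5m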

Configure Apigee for authentication and API management

GKE Inference Gateway integrates with Apigee to provide authentication, authorization, and API management for your inference workloads. To learn more about the benefits of using Apigee, see Key benefits of using Apigee.

You can integrate GKE Inference Gateway with Apigee to enhance your GKE Inference Gateway with features such as API security, rate limiting, quotas, analytics, and monetization.

Prerequisites

Before you begin, ensure you have the following:

  • A GKE cluster running version 1.34.* or later.
  • A GKE cluster with GKE Inference Gateway deployed.
  • An Apigee instance created in the same region as your GKE cluster.
  • The Apigee APIM Operator and its CRDs installed in your GKE cluster. For instructions, see Install the Apigee APIM operator.
  • kubectl configured to connect to your GKE cluster.
  • Google Cloud CLI installed and authenticated.
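
To quickly check the cluster-side prerequisites, you can run commands similar to the following sketch. Replace CLUSTER_NAME and LOCATION with your cluster name and location; the grep pattern assumes that the operator CRDs use the apim.googleapis.com API group shown in the manifests that follow.

    # Check that the Apigee APIM operator CRDs are installed.
    kubectl get crds | grep apim.googleapis.com

    # Check that the GKE cluster runs version 1.34 or later.
    gcloud container clusters describe CLUSTER_NAME \
        --location LOCATION \
        --format="value(currentMasterVersion)"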

Create an ApigeeBackendService

First, create an ApigeeBackendService resource. GKE Inference Gateway uses this to create an Apigee Extension Processor.

  1. Save the following manifest as my-apigee-backend-service.yaml:

    apiVersion: apim.googleapis.com/v1
    kind: ApigeeBackendService
    metadata:
      name: my-apigee-backend-service
    spec:
      apigeeEnv: "APIGEE_ENVIRONMENT_NAME"  # optional field
      defaultSecurityEnabled: true # optional field
      locations:
        name: "LOCATION"
        network: "CLUSTER_NETWORK"
        subnetwork: "CLUSTER_SUBNETWORK"
    

    Replace the following:

    • APIGEE_ENVIRONMENT_NAME: The name of your Apigee environment. Note: You don't need to set this field if the apigee-apim-operator is installed with the generateEnv=TRUE flag. If not, create an Apigee environment by following the instructions in Create an environment.
    • LOCATION: The location of your Apigee instance.
    • CLUSTER_NETWORK: The network of your GKE cluster.
    • CLUSTER_SUBNETWORK: The subnetwork of your GKE cluster.
  2. Apply the manifest to your cluster:

    kubectl apply -f my-apigee-backend-service.yaml
    
  3. Verify that the status has become CREATED:

    kubectl wait --for=jsonpath='{.status.currentState}'="CREATED" -f my-apigee-backend-service.yaml --timeout=5m
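
    If the wait command times out, inspect the resource to check its current state and any reported conditions:

    kubectl describe -f my-apigee-backend-service.yaml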
    

Configure GKE Inference Gateway

Configure GKE Inference Gateway to enable the Apigee Extension Processor as a load balancer traffic extension.

  1. Save the following manifest as my-apigee-traffic-extension.yaml:

    kind: GCPTrafficExtension
    apiVersion: networking.gke.io/v1
    metadata:
      name: my-apigee-traffic-extension
    spec:
      targetRefs:
      - group: "gateway.networking.k8s.io"
        kind: Gateway
        name: GATEWAY_NAME
      extensionChains:
      - name: my-traffic-extension-chain
        matchCondition:
          celExpressions:
            - celMatcher: request.path.startsWith("/")
        extensions:
        - name: my-apigee-extension
          metadata:
            # The value for `apigee-extension-processor` must match the name of the `ApigeeBackendService` resource that was applied earlier.
            apigee-extension-processor: my-apigee-backend-service
          failOpen: false
          timeout: 1s
          supportedEvents:
          - RequestHeaders
          - ResponseHeaders
          - ResponseBody
          backendRef:
            group: apim.googleapis.com
            kind: ApigeeBackendService
            name: my-apigee-backend-service
            port: 443
    

    Replace GATEWAY_NAME with the name of your Gateway.

  2. Apply the manifest to your cluster:

    kubectl apply -f my-apigee-traffic-extension.yaml
    
  3. Wait for the GCPTrafficExtension status to become Programmed:

    kubectl wait --for=jsonpath='{.status.ancestors[0].conditions[?(@.type=="Programmed")].status}'=True -f my-apigee-traffic-extension.yaml --timeout=5m
    

Send authenticated requests using API keys

  1. To find the IP address of your GKE Inference Gateway, inspect the Gateway status:

    GW_IP=$(kubectl get gateway/GATEWAY_NAME -o jsonpath='{.status.addresses[0].value}')
    

    Replace GATEWAY_NAME with the name of your Gateway.

  2. Test a request without authentication. This request should be rejected:

    curl -i ${GW_IP}/v1/completions -H 'Content-Type: application/json' -d '{
    "model": "food-review",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
    }'
    

    The output is similar to the following, which indicates that the Apigee extension is enforcing authentication:

    {"fault":{"faultstring":"Raising fault. Fault name : RF-insufficient-request-raise-fault","detail":{"errorcode":"steps.raisefault.RaiseFault"}}}
    
  3. Access the Apigee UI and create an API key. For instructions, see Create an API key.

  4. Send the API Key in the HTTP request header:

    curl -i ${GW_IP}/v1/completions -H 'Content-Type: application/json' -H 'x-api-key: API_KEY' -d '{
    "model": "food-review",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
    }'
    

    Replace API_KEY with your API key.

For more detailed information on configuring Apigee policies, see Use API management policies with the Apigee APIM Operator for Kubernetes.
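
To compare the unauthenticated and authenticated behavior in one step, you can wrap the two test requests in a short script such as the following sketch. It reuses the GW_IP value from earlier and assumes that your API key is stored in an API_KEY environment variable. The first request should be rejected and the second should succeed.

    # Print only the HTTP status codes for requests without and with the API key.
    BODY='{"model": "food-review", "prompt": "Write as if you were a critic: San Francisco", "max_tokens": 100, "temperature": 0}'

    echo "Without API key:"
    curl -s -o /dev/null -w '%{http_code}\n' ${GW_IP}/v1/completions \
        -H 'Content-Type: application/json' -d "$BODY"

    echo "With API key:"
    curl -s -o /dev/null -w '%{http_code}\n' ${GW_IP}/v1/completions \
        -H 'Content-Type: application/json' -H "x-api-key: ${API_KEY}" -d "$BODY"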

Configure observability

GKE Inference Gateway provides insights into the health, performance, and behavior of your inference workloads. This helps you to identify and resolve issues, optimize resource utilization, and ensure the reliability of your applications.

Google Cloud provides the following Cloud Monitoring dashboards that offer inference observability for GKE Inference Gateway:

  • GKE Inference Gateway dashboard: provides golden metrics for LLM serving, such as request and token throughput, latency, errors, and cache utilization for the InferencePool. To see the complete list of available GKE Inference Gateway metrics, see Exposed metrics.
  • Model server dashboard: provides a dashboard for the golden signals of the model servers. This lets you monitor the load and performance of the model servers, such as KV cache utilization and queue length.
  • Load balancer dashboard: reports metrics from the load balancer, such as requests per second, end-to-end request serving latency, and request-response status codes. These metrics help you understand the performance of end-to-end request serving and identify errors.
  • Data Center GPU Manager (DCGM) metrics: provides metrics from NVIDIA GPUs, such as the performance and utilization of NVIDIA GPUs. You can configure NVIDIA Data Center GPU Manager (DCGM) metrics in Cloud Monitoring. For more information, see Collect and view DCGM metrics.

View the GKE Inference Gateway dashboard

To view the GKE Inference Gateway dashboard, perform the following steps:

  1. In the Google Cloud console, go to the Monitoring page.

    Go to Monitoring

  2. In the navigation pane, select Dashboards.

  3. In the Integrations section, select GMP.

  4. In the Cloud Monitoring Dashboard Templates page, search for "Gateway".

  5. View the GKE Inference Gateway dashboard.

Alternatively, you can follow the instructions in Monitoring dashboard.

Configure model server observability dashboard

To collect golden signals from each model server and understand what contributes to GKE Inference Gateway performance, configure auto-monitoring for your model servers.

To view the integration dashboards, first ensure you are collecting metrics from your model server. Then, perform the following steps:

  1. In the Google Cloud console, go to the Monitoring page.

    Go to Monitoring

  2. In the navigation pane, select Dashboards.

  3. Under Integrations, select GMP. The corresponding integration dashboards are displayed.

    A view of the integration dashboards
    Figure: Integration dashboards

For more information, see Customize monitoring for applications.
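
If metrics from your model servers or InferencePool are not collected automatically, you can scrape them with Google Cloud Managed Service for Prometheus by using a PodMonitoring resource. The following sketch is an illustration only: the namespace, label selector, port name, and scrape interval are assumptions that you must adjust to match your model server Deployment.

    # Sketch of a PodMonitoring resource. The selector label and port name are
    # assumptions; replace them with the labels and metrics port of your Pods.
    kubectl apply -n NAMESPACE_NAME -f - <<'EOF'
    apiVersion: monitoring.googleapis.com/v1
    kind: PodMonitoring
    metadata:
      name: model-server-monitoring
    spec:
      selector:
        matchLabels:
          app: model-server   # assumption: label on your model server Pods
      endpoints:
      - port: metrics         # assumption: name of the metrics port
        interval: 30s
        path: /metrics
    EOF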

Configure Cloud Monitoring alerts

To configure Cloud Monitoring alerts for GKE Inference Gateway, perform the following steps:

  1. Save the following sample manifest as alerts.yaml and modify the thresholds as needed:

    groups:
    - name: gateway-api-inference-extension
      rules:
      - alert: HighInferenceRequestLatencyP99
        annotations:
          title: 'High latency (P99) for model {{ $labels.model_name }}'
          description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
        expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0
        for: 5m
        labels:
          severity: 'warning'
      - alert: HighInferenceErrorRate
        annotations:
          title: 'High error rate for model {{ $labels.model_name }}'
          description: 'The error rate for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 5% for 5 minutes.'
        expr: sum by (model_name) (rate(inference_model_request_error_total[5m])) / sum by (model_name) (rate(inference_model_request_total[5m])) > 0.05
        for: 5m
        labels:
          severity: 'critical'
          impact: 'availability'
      - alert: HighInferencePoolAvgQueueSize
        annotations:
          title: 'High average queue size for inference pool {{ $labels.name }}'
          description: 'The average number of requests pending in the queue for inference pool {{ $labels.name }} has been consistently above 50 for 5 minutes.'
        expr: inference_pool_average_queue_size > 50
        for: 5m
        labels:
          severity: 'critical'
          impact: 'performance'
      - alert: HighInferencePoolAvgKVCacheUtilization
        annotations:
          title: 'High KV cache utilization for inference pool {{ $labels.name }}'
          description: 'The average KV cache utilization for inference pool {{ $labels.name }} has been consistently above 90% for 5 minutes, indicating potential resource exhaustion.'
        expr: inference_pool_average_kv_cache_utilization > 0.9
        for: 5m
        labels:
          severity: 'critical'
          impact: 'resource_exhaustion'
    
  2. To create alerting policies, run the following command:

    gcloud alpha monitoring policies migrate --policies-from-prometheus-alert-rules-yaml=alerts.yaml
    

    You see new alert policies in the Alerting page.
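
    You can also list the policies from the command line to confirm that they were created:

    gcloud alpha monitoring policies list --format="value(displayName)"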

Modify alerts

You can find the complete list of available metrics in the kubernetes-sigs/gateway-api-inference-extension GitHub repository, and you can append new alerts that use other metrics to the manifest.

To modify the sample alerts, consider the following example:

  - alert: HighInferenceRequestLatencyP99
    annotations:
      title: 'High latency (P99) for model {{ $labels.model_name }}'
      description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
    expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0
    for: 5m
    labels:
      severity: 'warning'

This alert fires if the 99th percentile of the request duration over 5 minutes exceeds 10 seconds. You can modify the expr section of the alert to adjust the threshold based on your requirements.

Configure logging for GKE Inference Gateway

Configuring logging for GKE Inference Gateway provides detailed information about requests and responses, which is useful for troubleshooting, auditing, and performance analysis. HTTP access logs record every request and response, including headers, status codes, and timestamps. This level of detail can help you identify issues, find errors, and understand the behavior of your inference workloads.

To configure logging for GKE Inference Gateway, enable HTTP access logging for each of your InferencePool objects.

  1. Save the following sample manifest as logging-backend-policy.yaml:

    apiVersion: networking.gke.io/v1
    kind: GCPBackendPolicy
    metadata:
      name: logging-backend-policy
      namespace: NAMESPACE_NAME
    spec:
      default:
        logging:
          enabled: true
          sampleRate: 500000  # 50% of requests; the value is a rate out of 1,000,000
      targetRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: INFERENCE_POOL_NAME
    

    Replace the following:

    • NAMESPACE_NAME: the name of the namespace where your InferencePool is deployed.
    • INFERENCE_POOL_NAME: the name of the InferencePool.
  2. Apply the sample manifest to your cluster:

    kubectl apply -f logging-backend-policy.yaml
    

After you apply this manifest, GKE Inference Gateway enables HTTP access logs for the specified InferencePool. You can view these logs in Cloud Logging. The logs include detailed information about each request and response, such as the request URL, headers, response status code, and latency.
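
To confirm that access logs are being written, you can read recent entries with the Google Cloud CLI. The following sketch uses the same MONITORED_RESOURCE placeholder as the next section; replace it with the monitored resource type of the load balancer that backs your Gateway class.

    # Read recent load balancer access log entries.
    gcloud logging read 'resource.type="MONITORED_RESOURCE"' \
        --limit=10 \
        --format=json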

Create logs-based metrics to view error details

You can use logs-based metrics to analyze your load balancing logs and extract error details. Each GKE Gateway class, such as the gke-l7-global-external-managed and gke-l7-regional-internal-managed Gateway classes, is backed by a different load balancer. For more information, see GatewayClass capabilities.

Each load balancer has a different monitored resource that you must use when you create a logs-based metric. For more information, see the logging documentation for the load balancer that backs your Gateway class.

To create a logs-based metric to view error details, do the following:

  1. Create a JSON file named error_detail_metric.json with the following LogMetric definition. This configuration creates a metric that extracts the proxyStatus field from your load balancer logs.

    {
      "description": "Metric to extract error details from load balancer logs.",
      "filter": "resource.type=\"MONITORED_RESOURCE\"",
      "metricDescriptor": {
        "metricKind": "DELTA",
        "valueType": "INT64",
        "labels": [
          {
            "key": "error_detail",
            "valueType": "STRING",
            "description": "The detailed error string from the load balancer."
          }
        ]
      },
      "labelExtractors": {
        "error_detail": "EXTRACT(jsonPayload.proxyStatus)"
      }
    }
    

    Replace MONITORED_RESOURCE with the monitored resource for your load balancer.

  2. Open Cloud Shell or your local terminal where the gcloud CLI is installed.

  3. To create the metric, run the gcloud logging metrics create command with the --config-from-file flag:

    gcloud logging metrics create error_detail_metric \
        --config-from-file=error_detail_metric.json
    

After the metric is created, you can use it in Cloud Monitoring to view the distribution of errors reported by the load balancer. For more information, see Create a logs-based metric.
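
To verify the metric definition after you create it, you can describe it:

    gcloud logging metrics describe error_detail_metric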

For more information about creating alerts from logs-based metrics, see Create an alerting policy on a counter metric.

Configure autoscaling

Autoscaling adjusts resource allocation in response to load variations, maintaining performance and resource efficiency by dynamically adding or removing Pods based on demand. For GKE Inference Gateway, this involves horizontal autoscaling of Pods in each InferencePool. The GKE Horizontal Pod Autoscaler (HPA) autoscales Pods based on model-server metrics such as KVCache Utilization. This ensures the inference service handles different workloads and query volumes while efficiently managing resource usage.

To configure InferencePool instances so they autoscale based on metrics produced by GKE Inference Gateway, perform the following steps:

  1. Deploy a PodMonitoring object in the cluster to collect metrics produced by GKE Inference Gateway. For more information, see Configure observability.

  2. Deploy the Custom Metrics Stackdriver Adapter to give HPA access to the metrics:

    1. Save the following sample manifest as adapter_new_resource_model.yaml:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: custom-metrics
      ---
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: custom-metrics:system:auth-delegator
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:auth-delegator
      subjects:
      - kind: ServiceAccount
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: custom-metrics-auth-reader
        namespace: kube-system
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: Role
        name: extension-apiserver-authentication-reader
      subjects:
      - kind: ServiceAccount
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: custom-metrics-resource-reader
        namespace: custom-metrics
      rules:
      - apiGroups:
        - ""
        resources:
        - pods
        - nodes
        - nodes/stats
        verbs:
        - get
        - list
        - watch
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: custom-metrics-resource-reader
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: custom-metrics-resource-reader
      subjects:
      - kind: ServiceAccount
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
        labels:
          run: custom-metrics-stackdriver-adapter
          k8s-app: custom-metrics-stackdriver-adapter
      spec:
        replicas: 1
        selector:
          matchLabels:
            run: custom-metrics-stackdriver-adapter
            k8s-app: custom-metrics-stackdriver-adapter
        template:
          metadata:
            labels:
              run: custom-metrics-stackdriver-adapter
              k8s-app: custom-metrics-stackdriver-adapter
              kubernetes.io/cluster-service: "true"
          spec:
            serviceAccountName: custom-metrics-stackdriver-adapter
            containers:
            - image: gcr.io/gke-release/custom-metrics-stackdriver-adapter:v0.15.2-gke.1
              imagePullPolicy: Always
              name: pod-custom-metrics-stackdriver-adapter
              command:
              - /adapter
              - --use-new-resource-model=true
              - --fallback-for-container-metrics=true
              resources:
                limits:
                  cpu: 250m
                  memory: 200Mi
                requests:
                  cpu: 250m
                  memory: 200Mi
      ---
      apiVersion: v1
      kind: Service
      metadata:
        labels:
          run: custom-metrics-stackdriver-adapter
          k8s-app: custom-metrics-stackdriver-adapter
          kubernetes.io/cluster-service: 'true'
          kubernetes.io/name: Adapter
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      spec:
        ports:
        - port: 443
          protocol: TCP
          targetPort: 443
        selector:
          run: custom-metrics-stackdriver-adapter
          k8s-app: custom-metrics-stackdriver-adapter
        type: ClusterIP
      ---
      apiVersion: apiregistration.k8s.io/v1
      kind: APIService
      metadata:
        name: v1beta1.custom.metrics.k8s.io
      spec:
        insecureSkipTLSVerify: true
        group: custom.metrics.k8s.io
        groupPriorityMinimum: 100
        versionPriority: 100
        service:
          name: custom-metrics-stackdriver-adapter
          namespace: custom-metrics
        version: v1beta1
      ---
      apiVersion: apiregistration.k8s.io/v1
      kind: APIService
      metadata:
        name: v1beta2.custom.metrics.k8s.io
      spec:
        insecureSkipTLSVerify: true
        group: custom.metrics.k8s.io
        groupPriorityMinimum: 100
        versionPriority: 200
        service:
          name: custom-metrics-stackdriver-adapter
          namespace: custom-metrics
        version: v1beta2
      ---
      apiVersion: apiregistration.k8s.io/v1
      kind: APIService
      metadata:
        name: v1beta1.external.metrics.k8s.io
      spec:
        insecureSkipTLSVerify: true
        group: external.metrics.k8s.io
        groupPriorityMinimum: 100
        versionPriority: 100
        service:
          name: custom-metrics-stackdriver-adapter
          namespace: custom-metrics
        version: v1beta1
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: external-metrics-reader
      rules:
      - apiGroups:
        - "external.metrics.k8s.io"
        resources:
        - "*"
        verbs:
        - list
        - get
        - watch
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: external-metrics-reader
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: external-metrics-reader
      subjects:
      - kind: ServiceAccount
        name: horizontal-pod-autoscaler
        namespace: kube-system
      
    2. Apply the sample manifest to your cluster:

      kubectl apply -f adapter_new_resource_model.yaml
      
  3. To grant the adapter permission to read metrics from the project, run the following commands:

    PROJECT_ID=PROJECT_ID
    PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
    gcloud projects add-iam-policy-binding projects/$PROJECT_ID \
      --role roles/monitoring.viewer \
      --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/custom-metrics/sa/custom-metrics-stackdriver-adapter
    

    Replace PROJECT_ID with your Google Cloud project ID.

  4. For each InferencePool, deploy one HPA that is similar to the following:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: INFERENCE_POOL_NAME
      namespace: INFERENCE_POOL_NAMESPACE
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: INFERENCE_POOL_NAME
      minReplicas: MIN_REPLICAS
      maxReplicas: MAX_REPLICAS
      metrics:
      - type: External
        external:
          metric:
            name: prometheus.googleapis.com|inference_pool_average_kv_cache_utilization|gauge
            selector:
              matchLabels:
                metric.labels.name: INFERENCE_POOL_NAME
                resource.labels.cluster: CLUSTER_NAME
                resource.labels.namespace: INFERENCE_POOL_NAMESPACE
          target:
            type: AverageValue
            averageValue: TARGET_VALUE
    

    Replace the following:

    • INFERENCE_POOL_NAME: the name of the InferencePool.
    • INFERENCE_POOL_NAMESPACE: the namespace of the InferencePool.
    • CLUSTER_NAME: the name of the cluster.
    • MIN_REPLICAS: the minimum availability of the InferencePool (baseline capacity). HPA keeps this number of replicas up when usage is below the HPA target threshold. Highly available workloads must set this to a value higher than 1 to ensure continued availability during Pod disruptions.
    • MAX_REPLICAS: the value that constrains the number of accelerators that must be assigned to the workloads hosted in the InferencePool. HPA won't increase the number of replicas beyond this value. During peak traffic times, monitor the number of replicas to ensure that the value of the MAX_REPLICAS field provides enough headroom so the workload can scale up to maintain the chosen workload performance characteristics.
    • TARGET_VALUE: the value that represents the chosen target KV-Cache Utilization per model server. This is a number between 0-100 and is highly dependent on the model server, model, accelerator, and incoming traffic characteristics. You can determine this target value experimentally through load testing and plotting a throughput versus latency graph. Select a chosen throughput and latency combination from the graph, and use the corresponding KV-Cache Utilization value as the HPA target. You must tweak and monitor this value closely to achieve chosen price-performance results. You can use GKE Inference Quickstart to determine this value automatically.
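
After the HorizontalPodAutoscaler is deployed, you can verify that the external metric is available to the cluster and that the HPA reads it. The following commands are a minimal check; replace the placeholders with the values that you used in the HPA manifest.

    # List the external metrics that the adapter exposes to the cluster.
    kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"

    # Inspect the HPA to confirm that it reports a current metric value and scaling events.
    kubectl describe hpa INFERENCE_POOL_NAME -n INFERENCE_POOL_NAMESPACE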

What's next