This page describes how to customize a GKE Inference Gateway deployment.
This page is for networking specialists who manage GKE infrastructure and for platform administrators who manage AI workloads.
To manage and optimize inference workloads, you configure advanced features of GKE Inference Gateway.
Understand and configure the following advanced features:
- To use Model Armor integration, configure AI security and safety checks.
- To view GKE Inference Gateway and model server metrics and dashboards, and to enable HTTP access logging for detailed request and response information, configure observability.
- To automatically scale your GKE Inference Gateway deployments, configure autoscaling.
Configure AI security and safety checks
GKE Inference Gateway integrates with Model Armor to perform safety checks on prompts and responses for applications that use large language models (LLMs). This integration provides an additional layer of safety enforcement at the infrastructure level that complements application-level safety measures. This enables centralized policy application across all LLM traffic.
The following diagram illustrates Model Armor integration with GKE Inference Gateway on a GKE cluster:

To configure AI safety checks, perform the following steps:
Ensure that the following prerequisites are met:
- Enable the Model Armor service in your Google Cloud project.
- Create the Model Armor templates using the Model Armor console, Google Cloud CLI, or API.
Ensure that you have already created a Model Armor template named my-model-armor-template-name-id.
To configure the GCPTrafficExtension, perform the following steps:
Save the following sample manifest as gcp-traffic-extension.yaml:

    kind: GCPTrafficExtension
    apiVersion: networking.gke.io/v1
    metadata:
      name: my-model-armor-extension
    spec:
      targetRefs:
      - group: "gateway.networking.k8s.io"
        kind: Gateway
        name: GATEWAY_NAME
      extensionChains:
      - name: my-model-armor-chain1
        matchCondition:
          celExpressions:
          - celMatcher: request.path.startsWith("/")
        extensions:
        - name: my-model-armor-service
          supportedEvents:
          - RequestHeaders
          timeout: 1s
          googleAPIServiceName: "modelarmor.us-central1.rep.googleapis.com"
          metadata:
            'extensionPolicy': MODEL_ARMOR_TEMPLATE_NAME
            'sanitizeUserPrompt': 'true'
            'sanitizeUserResponse': 'true'

Replace the following:
- GATEWAY_NAME: the name of the Gateway.
- MODEL_ARMOR_TEMPLATE_NAME: the name of your Model Armor template.
The gcp-traffic-extension.yaml file includes the following settings:
- targetRefs: specifies the Gateway to which this extension applies.
- extensionChains: defines a chain of extensions to be applied to the traffic.
- matchCondition: defines the conditions under which the extensions are applied.
- extensions: defines the extensions to be applied.
- supportedEvents: specifies the events during which the extension is invoked.
- timeout: specifies the timeout for the extension.
- googleAPIServiceName: specifies the service name for the extension.
- metadata: specifies the metadata for the extension, including the extensionPolicy and prompt or response sanitization settings.
Apply the sample manifest to your cluster:
    kubectl apply -f gcp-traffic-extension.yaml
After you configure the AI safety checks and integrate them with your Gateway, Model Armor automatically filters prompts and responses based on the defined rules.
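If you manage several Gateways, it can help to generate the extension manifest rather than edit placeholders by hand. The following is a minimal sketch, not an official tool: the function name `model_armor_extension` and its `location` parameter are hypothetical, and building the service hostname from a location is an assumption generalized from the single `us-central1` value in the sample. It emits JSON, which `kubectl apply -f` accepts in place of YAML.

```python
import json


def model_armor_extension(gateway_name, template_name, location="us-central1"):
    """Build the sample GCPTrafficExtension manifest as a plain dict."""
    return {
        "kind": "GCPTrafficExtension",
        "apiVersion": "networking.gke.io/v1",
        "metadata": {"name": "my-model-armor-extension"},
        "spec": {
            "targetRefs": [{
                "group": "gateway.networking.k8s.io",
                "kind": "Gateway",
                "name": gateway_name,
            }],
            "extensionChains": [{
                "name": "my-model-armor-chain1",
                "matchCondition": {
                    "celExpressions": [{"celMatcher": 'request.path.startsWith("/")'}],
                },
                "extensions": [{
                    "name": "my-model-armor-service",
                    "supportedEvents": ["RequestHeaders"],
                    "timeout": "1s",
                    # Assumption: the hostname follows the pattern of the sample value.
                    "googleAPIServiceName": f"modelarmor.{location}.rep.googleapis.com",
                    "metadata": {
                        "extensionPolicy": template_name,
                        "sanitizeUserPrompt": "true",
                        "sanitizeUserResponse": "true",
                    },
                }],
            }],
        },
    }


# Hypothetical Gateway name; the template name is the one used on this page.
manifest = model_armor_extension("my-gateway", "my-model-armor-template-name-id")
print(json.dumps(manifest, indent=2))
```

Redirect the output to a file such as gcp-traffic-extension.json and apply that file the same way as the YAML manifest.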
Configure observability
GKE Inference Gateway provides insights into the health, performance, and behavior of your inference workloads. This helps you to identify and resolve issues, optimize resource utilization, and ensure the reliability of your applications.
Google Cloud provides the following Cloud Monitoring dashboards that offer inference observability for GKE Inference Gateway:
- GKE Inference Gateway dashboard: provides golden metrics for LLM serving, such as request and token throughput, latency, errors, and cache utilization for the InferencePool. To see the complete list of available GKE Inference Gateway metrics, see Exposed metrics.
- Model server dashboard: provides golden signals for the model server. This lets you monitor the load and performance of the model servers, such as KVCache Utilization and Queue length.
- Load balancer dashboard: reports metrics from the load balancer, such as requests per second, end-to-end request serving latency, and request-response status codes. These metrics help you understand the performance of end-to-end request serving and identify errors.
- Data Center GPU Manager (DCGM) metrics: provides metrics from NVIDIA GPUs, such as the performance and utilization of NVIDIA GPUs. You can configure NVIDIA Data Center GPU Manager (DCGM) metrics in Cloud Monitoring. For more information, see Collect and view DCGM metrics.
View GKE Inference Gateway dashboard
To view the GKE Inference Gateway dashboard, perform the following steps:
In the Google Cloud console, go to the Monitoring page.
In the navigation pane, select Dashboards.
In the Integrations section, select GMP.
On the Cloud Monitoring Dashboard Templates page, search for "Gateway".
Open the GKE Inference Gateway dashboard.
Alternatively, you can follow the instructions in Monitoring dashboard.
Configure model server observability dashboard
To collect golden signals from each model server and understand what contributes to GKE Inference Gateway performance, you can configure auto-monitoring for your model servers. This includes model servers such as the following:
To view the integration dashboards, perform the following steps:
- Collect the metrics from your model server.
In the Google Cloud console, go to the Monitoring page.
In the navigation pane, select Dashboards.
Under Integrations, select GMP. The corresponding integration dashboards are displayed.
Figure: Integration dashboards
For more information, see Customize monitoring for applications.
Configure the load balancer observability dashboard
To use the Application Load Balancer with GKE Inference Gateway, import the dashboard by performing the following steps:
To create the load balancer dashboard, create the following file and save it as dashboard.json:
{ "displayName": "GKE Inference Gateway (Load Balancer) Prometheus Overview", "dashboardFilters": [ { "filterType": "RESOURCE_LABEL", "labelKey": "cluster", "templateVariable": "", "valueType": "STRING" }, { "filterType": "RESOURCE_LABEL", "labelKey": "location", "templateVariable": "", "valueType": "STRING" }, { "filterType": "RESOURCE_LABEL", "labelKey": "namespace", "templateVariable": "", "valueType": "STRING" }, { "filterType": "RESOURCE_LABEL", "labelKey": "forwarding_rule_name", "templateVariable": "", "valueType": "STRING" } ], "labels": {}, "mosaicLayout": { "columns": 48, "tiles": [ { "height": 8, "width": 48, "widget": { "title": "", "id": "", "text": { "content": "### Inference Gateway Metrics\n\nPlease refer to the [official documentation](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/site-src/guides/metrics.md) for more details of underlying metrics used in the dashboard.\n\n\n### External Application Load Balancer Metrics\n\nPlease refer to the [public page](/load-balancing/docs/metrics) for complete list of External Application Load Balancer metrics.\n\n### Model Server Metrics\n\nYou can redirect to the detail dashboard for model servers under the integration tab", "format": "MARKDOWN", "style": { "backgroundColor": "#FFFFFF", "fontSize": "FS_EXTRA_LARGE", "horizontalAlignment": "H_LEFT", "padding": "P_EXTRA_SMALL", "pointerLocation": "POINTER_LOCATION_UNSPECIFIED", "textColor": "#212121", "verticalAlignment": "V_TOP" } } } }, { "yPos": 8, "height": 4, "width": 48, "widget": { "title": "External Application Load Balancer", "id": "", "sectionHeader": { "dividerBelow": false, "subtitle": "" } } }, { "yPos": 12, "height": 15, "width": 24, "widget": { "title": "E2E Request Latency p99 (by code)", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "", "measures": [], "plotType": "LINE",
"targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.99, sum by(le, response_code) (rate(loadbalancing_googleapis_com:https_external_regional_total_latencies_bucket{monitored_resource=\"http_external_regional_lb_rule\",forwarding_rule_name=~\".*inference-gateway.*\"}[1m])))", "unitOverride": "ms" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 12, "height": 43, "width": 48, "widget": { "title": "Regional", "collapsibleGroup": { "collapsed": false }, "id": "" } }, { "yPos": 12, "xPos": 24, "height": 15, "width": 24, "widget": { "title": "E2E Request Latency p95 (by code)", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.95, sum by(le, response_code) (rate(loadbalancing_googleapis_com:https_external_regional_total_latencies_bucket{monitored_resource=\"http_external_regional_lb_rule\",forwarding_rule_name=~\".*inference-gateway.*\"}[1m])))", "unitOverride": "ms" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 27, "height": 15, "width": 24, "widget": { "title": "E2E Request Latency p90 (by code)", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.90, sum by(le, response_code) (rate(loadbalancing_googleapis_com:https_external_regional_total_latencies_bucket{monitored_resource=\"http_external_regional_lb_rule\",forwarding_rule_name=~\".*inference-gateway.*\"}[1m])))", "unitOverride": "ms" 
} } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 27, "xPos": 24, "height": 15, "width": 24, "widget": { "title": "E2E Request Latency p50 (by code)", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.50, sum by(le, response_code) (rate(loadbalancing_googleapis_com:https_external_regional_total_latencies_bucket{monitored_resource=\"http_external_regional_lb_rule\",forwarding_rule_name=~\".*inference-gateway.*\"}[1m])))", "unitOverride": "ms" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 42, "height": 13, "width": 48, "widget": { "title": "Request /s (by code)", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "sum by (response_code)(rate(loadbalancing_googleapis_com:https_external_regional_request_count{monitored_resource=\"http_external_regional_lb_rule\", forwarding_rule_name=~\".*inference-gateway.*\"}[1m]))", "unitOverride": "" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 55, "height": 4, "width": 48, "widget": { "title": "Inference Optimized Gateway", "id": "", "sectionHeader": { "dividerBelow": false, "subtitle": "" } } }, { "yPos": 59, "height": 17, "width": 48, "widget": { "title": "Request Latency", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "p95", "measures": [], "plotType": 
"LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[${__interval}])))", "unitOverride": "s" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p90", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[${__interval}])))", "unitOverride": "s" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p50", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[${__interval}])))", "unitOverride": "s" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 59, "height": 65, "width": 48, "widget": { "title": "Inference Model", "collapsibleGroup": { "collapsed": false }, "id": "" } }, { "yPos": 76, "height": 16, "width": 24, "widget": { "title": "Request / s", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "sum by(model_name, target_model_name) (rate(inference_model_request_total{}[${__interval}]))", "unitOverride": "" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 76, "xPos": 24, "height": 16, "width": 24, "widget": { "title": "Request Error / s", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "", "measures": [], "plotType": "LINE", "targetAxis": 
"Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "sum by (error_code,model_name,target_model_name) (rate(inference_model_request_error_total[${__interval}]))", "unitOverride": "" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 92, "height": 16, "width": 24, "widget": { "title": "Request Size", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "p95", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(inference_model_request_sizes_bucket{}[${__interval}])))", "unitOverride": "By" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p90", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(inference_model_request_sizes_bucket{}[${__interval}])))", "unitOverride": "By" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p50", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(inference_model_request_sizes_bucket{}[${__interval}])))", "unitOverride": "By" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 92, "xPos": 24, "height": 16, "width": 24, "widget": { "title": "Response Size", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "p95", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.95, sum by(le) 
(rate(inference_model_response_sizes_bucket{}[${__interval}])))", "unitOverride": "By" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p90", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(inference_model_response_sizes_bucket{}[${__interval}])))", "unitOverride": "By" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p50", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(inference_model_response_sizes_bucket{}[${__interval}])))", "unitOverride": "By" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 108, "height": 16, "width": 24, "widget": { "title": "Input Token Count", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "p95", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(inference_model_input_tokens_bucket{}[${__interval}])))", "unitOverride": "" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p90", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(inference_model_input_tokens_bucket{}[${__interval}])))", "unitOverride": "" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p50", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(inference_model_input_tokens_bucket{}[${__interval}])))", "unitOverride": "" } } ], "thresholds": [] } } }, { "yPos": 108, "xPos": 24, 
"height": 16, "width": 24, "widget": { "title": "Output Token Count", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "p95", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(inference_model_output_tokens_bucket{}[${__interval}])))", "unitOverride": "" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p90", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(inference_model_output_tokens_bucket{}[${__interval}])))", "unitOverride": "" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p50", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(inference_model_output_tokens_bucket{}[${__interval}])))", "unitOverride": "" } } ], "thresholds": [] } } }, { "yPos": 124, "height": 16, "width": 24, "widget": { "title": "Average KV Cache Utilization", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "sum by (name)(avg_over_time(inference_pool_average_kv_cache_utilization[${__interval}]))*100", "unitOverride": "%" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 124, "height": 16, "width": 48, "widget": { "title": "Inference Pool", "collapsibleGroup": { "collapsed": false }, "id": "" } }, { "yPos": 124, "xPos": 24, "height": 16, "width": 24, "widget": { "title": "Average Queue Size", "id": "", 
"xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "sum by (name) (avg_over_time(inference_pool_average_queue_size[${__interval}]))", "unitOverride": "" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 140, "height": 4, "width": 48, "widget": { "title": "Model Server", "id": "", "sectionHeader": { "dividerBelow": true, "subtitle": "The following charts will only be populated if model server is exporting metrics." } } }, { "yPos": 144, "height": 32, "width": 48, "widget": { "title": "vLLM", "collapsibleGroup": { "collapsed": false }, "id": "" } }, { "yPos": 144, "xPos": 1, "height": 16, "width": 24, "widget": { "title": "Token Throughput", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "Prompt Tokens/Sec", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "sum by(model_name) (rate(vllm:prompt_tokens_total[${__interval}]))", "unitOverride": "" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "Generation Tokens/Sec", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "sum by(model_name) (rate(vllm:generation_tokens_total[${__interval}]))", "unitOverride": "" } } ], "thresholds": [] } } }, { "yPos": 144, "xPos": 25, "height": 16, "width": 23, "widget": { "title": "Request Latency", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "p95", "measures": [], "plotType": "LINE", 
"targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[${__interval}])))", "unitOverride": "s" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p90", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[${__interval}])))", "unitOverride": "s" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p50", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[${__interval}])))", "unitOverride": "s" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 160, "xPos": 1, "height": 16, "width": 24, "widget": { "title": "Time Per Output Token Latency", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "p95", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[${__interval}])))", "unitOverride": "s" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p90", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[${__interval}])))", "unitOverride": "s" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p50", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.5, sum 
by(le) (rate(vllm:time_per_output_token_seconds_bucket[${__interval}])))", "unitOverride": "s" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } }, { "yPos": 160, "xPos": 25, "height": 16, "width": 23, "widget": { "title": "Time To First Token Latency", "id": "", "xyChart": { "chartOptions": { "displayHorizontal": false, "mode": "COLOR", "showLegend": false }, "dataSets": [ { "breakdowns": [], "dimensions": [], "legendTemplate": "p95", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[${__interval}])))", "unitOverride": "s" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p90", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[${__interval}])))", "unitOverride": "s" } }, { "breakdowns": [], "dimensions": [], "legendTemplate": "p50", "measures": [], "plotType": "LINE", "targetAxis": "Y1", "timeSeriesQuery": { "outputFullDuration": false, "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[${__interval}])))", "unitOverride": "s" } } ], "thresholds": [], "yAxis": { "label": "", "scale": "LINEAR" } } } } ] } }
To install the dashboard in your Google Cloud project, run the following command:
gcloud monitoring dashboards create --project $PROJECT_ID --config-from-file=dashboard.json
Open the Monitoring page in the Google Cloud console.
On the navigation menu, select Dashboards.
Select the GKE Inference Gateway (Load Balancer) Prometheus Overview dashboard from the list of custom dashboards. This is the displayName set in dashboard.json.
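Because dashboard.json is long and easy to damage while editing, a quick local check before you run the gcloud command can save a failed upload. The following is a sketch only: `check_dashboard` is a hypothetical helper, it checks just two things (a non-empty displayName and that each tile fits inside the declared number of columns), and the embedded sample is a trimmed stand-in for the real file.

```python
import json


def check_dashboard(dashboard):
    """Return a list of problems found in a mosaic-layout dashboard dict."""
    problems = []
    if not dashboard.get("displayName"):
        problems.append("missing displayName")
    layout = dashboard.get("mosaicLayout", {})
    columns = layout.get("columns", 0)
    for index, tile in enumerate(layout.get("tiles", [])):
        x_pos = tile.get("xPos", 0)  # the file omits xPos when it is 0
        width = tile.get("width", 0)
        if x_pos + width > columns:
            title = tile.get("widget", {}).get("title", "")
            problems.append(f"tile {index} ({title!r}) exceeds {columns} columns")
    return problems


# Trimmed, illustrative stand-in for dashboard.json (two tiles only).
sample = json.loads("""
{ "displayName": "GKE Inference Gateway (Load Balancer) Prometheus Overview",
  "mosaicLayout": { "columns": 48, "tiles": [
    { "yPos": 12, "height": 15, "width": 24,
      "widget": { "title": "E2E Request Latency p99 (by code)" } },
    { "yPos": 12, "xPos": 24, "height": 15, "width": 24,
      "widget": { "title": "E2E Request Latency p95 (by code)" } }
  ] } }
""")
print(check_dashboard(sample))  # []
```

To check the real file, load it with `json.load(open("dashboard.json"))` and pass the result to `check_dashboard`. A file that fails to parse at all raises `json.JSONDecodeError`, which is itself a useful signal.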
The External Application Load Balancer section displays the following load balancing metrics:
- E2E Request Latency p99 (by code): shows the ninety-ninth percentile of end-to-end request latency for requests that the load balancer serves, aggregated by returned status code.
- Request/s (by code): shows the number of requests that the load balancer serves, aggregated by returned status code.
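The latency panels are built on the PromQL `histogram_quantile` function applied to the load balancer's latency buckets. The following sketch shows the idea behind that estimate: find the bucket that contains the requested rank, then interpolate linearly inside it. It is a simplified illustration that assumes non-negative observations and already-aggregated cumulative bucket counts; it does not reproduce every edge case of the Prometheus implementation, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper bound and
    ending with the +Inf bucket, like a Prometheus *_bucket series.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # assumes non-negative observations
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            span = count - lower_count
            if span == 0:
                return lower_bound
            # Linear interpolation inside the matching bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Illustrative latency buckets in milliseconds: 100 requests in total.
latency_buckets = [(100.0, 50), (250.0, 90), (500.0, 99), (math.inf, 100)]
print(histogram_quantile(0.50, latency_buckets))  # 100.0
print(histogram_quantile(0.95, latency_buckets))
```

Because the estimate is interpolated, its precision depends on the bucket boundaries: a p99 that lands in a wide bucket can differ noticeably from the true value.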
Configure logging for GKE Inference Gateway
Configuring logging for GKE Inference Gateway provides detailed information about requests and responses, which is useful for troubleshooting, auditing, and performance analysis. HTTP access logs record every request and response, including headers, status codes, and timestamps. This level of detail can help you identify issues, find errors, and understand the behavior of your inference workloads.
To configure logging for GKE Inference Gateway, enable HTTP access logging for each of your InferencePool objects.
Save the following sample manifest as logging-backend-policy.yaml:

    apiVersion: networking.gke.io/v1
    kind: GCPBackendPolicy
    metadata:
      name: logging-backend-policy
      namespace: NAMESPACE_NAME
    spec:
      default:
        logging:
          enabled: true
          sampleRate: 500000
      targetRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: INFERENCE_POOL_NAME

Replace the following:
- NAMESPACE_NAME: the name of the namespace where your InferencePool is deployed.
- INFERENCE_POOL_NAME: the name of the InferencePool.
Apply the sample manifest to your cluster:
kubectl apply -f logging-backend-policy.yaml
After you apply this manifest, GKE Inference Gateway enables HTTP access logs for the specified InferencePool. You can view these logs in Cloud Logging. The logs include detailed information about each request and response, such as the request URL, headers, response status code, and latency.
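The sampleRate value in the manifest is an integer rather than a percentage. The sketch below assumes the value is expressed in parts per million, so that 1000000 would log every request and the sample's 500000 would log about half; that reading is an assumption to verify against the GCPBackendPolicy reference, and `to_sample_rate` is a hypothetical helper.

```python
def to_sample_rate(fraction):
    """Convert a logging fraction (0.0 to 1.0) to a parts-per-million integer."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    return round(fraction * 1_000_000)


print(to_sample_rate(0.5))   # 500000, the value used in the sample manifest
print(to_sample_rate(0.25))  # 250000
```

A lower sampling rate reduces log volume at the cost of missing individual requests, so pick it with your troubleshooting needs in mind.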
Configure autoscaling
Autoscaling adjusts resource allocation in response to load variations, maintaining performance and resource efficiency by dynamically adding or removing Pods based on demand. For GKE Inference Gateway, this involves horizontal autoscaling of Pods in each InferencePool. The GKE Horizontal Pod Autoscaler (HPA) autoscales Pods based on model-server metrics such as KVCache Utilization. This ensures the inference service handles different workloads and query volumes while efficiently managing resource usage.
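To build intuition for how an HPA reacts to a metric, the following sketch applies the general scaling rule from the Kubernetes HPA documentation: scale the current replica count by the ratio of the observed value to the target, round up, and clamp to the configured minimum and maximum. It is an illustration only. It assumes the observed and target values are per-replica figures in the same unit, uses the default 10% tolerance, and leaves out stabilization windows and the metric-type-specific handling that a real HPA performs.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA replica decision for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go below the minimum or above the maximum replica count.
    return max(min_replicas, min(max_replicas, desired))


# Illustrative values: 4 replicas at 90% KV cache utilization, target 60%.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=8))  # 6
```

The clamp in the last line of the function is why the minimum and maximum replica settings matter: they bound the result regardless of how far the metric drifts from its target.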
To configure InferencePool instances so they autoscale based on metrics produced by GKE Inference Gateway, perform the following steps:
Deploy a PodMonitoring object in the cluster to collect metrics produced by GKE Inference Gateway. For more information, see Configure observability.
Deploy the Custom Metrics Stackdriver Adapter to give HPA access to the metrics:
Save the following sample manifest as adapter_new_resource_model.yaml:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: custom-metrics
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: custom-metrics-stackdriver-adapter
      namespace: custom-metrics
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: custom-metrics:system:auth-delegator
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:auth-delegator
    subjects:
    - kind: ServiceAccount
      name: custom-metrics-stackdriver-adapter
      namespace: custom-metrics
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: custom-metrics-auth-reader
      namespace: kube-system
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: extension-apiserver-authentication-reader
    subjects:
    - kind: ServiceAccount
      name: custom-metrics-stackdriver-adapter
      namespace: custom-metrics
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: custom-metrics-resource-reader
      namespace: custom-metrics
    rules:
    - apiGroups:
      - ""
      resources:
      - pods
      - nodes
      - nodes/stats
      verbs:
      - get
      - list
      - watch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: custom-metrics-resource-reader
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: custom-metrics-resource-reader
    subjects:
    - kind: ServiceAccount
      name: custom-metrics-stackdriver-adapter
      namespace: custom-metrics
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: custom-metrics-stackdriver-adapter
      namespace: custom-metrics
      labels:
        run: custom-metrics-stackdriver-adapter
        k8s-app: custom-metrics-stackdriver-adapter
    spec:
      replicas: 1
      selector:
        matchLabels:
          run: custom-metrics-stackdriver-adapter
          k8s-app: custom-metrics-stackdriver-adapter
      template:
        metadata:
          labels:
            run: custom-metrics-stackdriver-adapter
            k8s-app: custom-metrics-stackdriver-adapter
            kubernetes.io/cluster-service: "true"
        spec:
          serviceAccountName: custom-metrics-stackdriver-adapter
          containers:
          - image: gcr.io/gke-release/custom-metrics-stackdriver-adapter:v0.15.2-gke.1
            imagePullPolicy: Always
            name: pod-custom-metrics-stackdriver-adapter
            command:
            - /adapter
            - --use-new-resource-model=true
            - --fallback-for-container-metrics=true
            resources:
              limits:
                cpu: 250m
                memory: 200Mi
              requests:
                cpu: 250m
                memory: 200Mi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        run: custom-metrics-stackdriver-adapter
        k8s-app: custom-metrics-stackdriver-adapter
        kubernetes.io/cluster-service: 'true'
        kubernetes.io/name: Adapter
      name: custom-metrics-stackdriver-adapter
      namespace: custom-metrics
    spec:
      ports:
      - port: 443
        protocol: TCP
        targetPort: 443
      selector:
        run: custom-metrics-stackdriver-adapter
        k8s-app: custom-metrics-stackdriver-adapter
      type: ClusterIP
    ---
    apiVersion: apiregistration.k8s.io/v1
    kind: APIService
    metadata:
      name: v1beta1.custom.metrics.k8s.io
    spec:
      insecureSkipTLSVerify: true
      group: custom.metrics.k8s.io
      groupPriorityMinimum: 100
      versionPriority: 100
      service:
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      version: v1beta1
    ---
    apiVersion: apiregistration.k8s.io/v1
    kind: APIService
    metadata:
      name: v1beta2.custom.metrics.k8s.io
    spec:
      insecureSkipTLSVerify: true
      group: custom.metrics.k8s.io
      groupPriorityMinimum: 100
      versionPriority: 200
      service:
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      version: v1beta2
    ---
    apiVersion: apiregistration.k8s.io/v1
    kind: APIService
    metadata:
      name: v1beta1.external.metrics.k8s.io
    spec:
      insecureSkipTLSVerify: true
      group: external.metrics.k8s.io
      groupPriorityMinimum: 100
      versionPriority: 100
      service:
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      version: v1beta1
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: external-metrics-reader
    rules:
    - apiGroups:
      - "external.metrics.k8s.io"
      resources:
      - "*"
      verbs:
      - list
      - get
      - watch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: external-metrics-reader
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: external-metrics-reader
    subjects:
    - kind: ServiceAccount
      name: horizontal-pod-autoscaler
      namespace: kube-system

Apply the sample manifest to your cluster:
kubectl apply -f adapter_new_resource_model.yaml
To give the adapter permission to read metrics from the project, run the following commands:

    PROJECT_ID=PROJECT_ID
    PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)")
    gcloud projects add-iam-policy-binding "projects/$PROJECT_ID" \
      --role roles/monitoring.viewer \
      --member="principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/custom-metrics/sa/custom-metrics-stackdriver-adapter"

Replace PROJECT_ID with your Google Cloud project ID.
For each InferencePool, deploy one HPA that is similar to the following:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: INFERENCE_POOL_NAME
      namespace: INFERENCE_POOL_NAMESPACE
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: INFERENCE_POOL_NAME
      minReplicas: MIN_REPLICAS
      maxReplicas: MAX_REPLICAS
      metrics:
      - type: External
        external:
          metric:
            name: prometheus.googleapis.com|inference_pool_average_kv_cache_utilization|gauge
            selector:
              matchLabels:
                metric.labels.name: INFERENCE_POOL_NAME
                resource.labels.cluster: CLUSTER_NAME
                resource.labels.namespace: INFERENCE_POOL_NAMESPACE
          target:
            type: AverageValue
            averageValue: TARGET_VALUE

Replace the following:
- INFERENCE_POOL_NAME: the name of the InferencePool.
- INFERENCE_POOL_NAMESPACE: the namespace of the InferencePool.
- CLUSTER_NAME: the name of the cluster.
- MIN_REPLICAS: the minimum availability of the InferencePool (baseline capacity). HPA keeps this number of replicas running when usage is below the HPA target threshold. Highly available workloads must set this to a value higher than 1 to ensure continued availability during Pod disruptions.
- MAX_REPLICAS: the value that constrains the number of accelerators that must be assigned to the workloads hosted in the InferencePool. HPA won't increase the number of replicas beyond this value. During peak traffic times, monitor the number of replicas to ensure that the value of the MAX_REPLICAS field provides enough headroom so the workload can scale up to maintain the chosen workload performance characteristics.
- TARGET_VALUE: the value that represents the chosen target KV-Cache Utilization per model server. This is a number between 0 and 100 and is highly dependent on the model server, model, accelerator, and incoming traffic characteristics. You can determine this target value experimentally through load testing and plotting a throughput versus latency graph. Select a throughput and latency combination from the graph, and use the corresponding KV-Cache Utilization value as the HPA target. You must tune and monitor this value closely to achieve your chosen price-performance results. Alternatively, you can use GKE Inference Recommendations to determine this value automatically.
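The experimental approach to choosing TARGET_VALUE can be sketched in a few lines. The example below assumes you have already run a load test and recorded, for each load level, the KV cache utilization, the throughput, and a latency percentile. The `LoadTestPoint` type, the `pick_hpa_target` helper, and all the numbers are hypothetical and for illustration only. The selection rule used here (highest throughput that still meets a latency budget) is one reasonable policy, not the only one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadTestPoint:
    kv_cache_utilization: float     # percent, 0 to 100
    throughput_tokens_per_s: float
    p95_latency_s: float


def pick_hpa_target(points, max_p95_latency_s):
    """Return the KV cache utilization of the best point within the latency budget."""
    eligible = [p for p in points if p.p95_latency_s <= max_p95_latency_s]
    if not eligible:
        raise ValueError("no load-test point meets the latency budget")
    best = max(eligible, key=lambda p: p.throughput_tokens_per_s)
    return best.kv_cache_utilization


# Made-up load-test results, ordered by increasing load.
points = [
    LoadTestPoint(30, 1200, 0.8),
    LoadTestPoint(50, 2100, 1.1),
    LoadTestPoint(70, 2900, 1.9),
    LoadTestPoint(85, 3200, 4.5),
]
target = pick_hpa_target(points, max_p95_latency_s=2.0)
print(target)  # 70
```

In this made-up data, pushing utilization from 70 to 85 adds little throughput but more than doubles latency, which is the kind of knee in the curve that the throughput versus latency graph is meant to reveal.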
What's next
- Read about GKE Inference Gateway.
- Read about deploying GKE Inference Gateway.
- Read about GKE Inference Gateway rollout operations.
- Read about serving with GKE Inference Gateway.