Customize GKE Inference Gateway configuration


This page describes how to customize your GKE Inference Gateway deployment.

This page is for Networking specialists responsible for managing GKE infrastructure and for platform administrators who manage AI workloads.

To manage and optimize your inference workloads, configure the following advanced features of GKE Inference Gateway:

  • AI security and safety checks with Model Armor
  • Observability dashboards for inference serving, model servers, and the load balancer
  • HTTP access logging
  • Autoscaling based on model server metrics

Configure AI security and safety checks

GKE Inference Gateway integrates with Model Armor to perform safety checks on prompts and responses for applications that use large language models (LLMs). This integration provides an additional layer of safety enforcement at the infrastructure level that complements application-level safety measures, and it enables centralized policy enforcement across all LLM traffic.

The following diagram illustrates Model Armor integration with GKE Inference Gateway on a GKE cluster:

Figure: Model Armor integration with GKE Inference Gateway on a GKE cluster

To configure AI safety checks, perform the following steps:

  1. Ensure that the following prerequisites are met:

    1. Enable the Model Armor service in your Google Cloud project.
    2. Create a Model Armor template by using the Model Armor console, the Google Cloud CLI, or the API, and note the template name for the steps that follow.

  2. To configure the GCPTrafficExtension, perform the following steps:

    1. Save the following sample manifest as gcp-traffic-extension.yaml:

      kind: GCPTrafficExtension
      apiVersion: networking.gke.io/v1
      metadata:
        name: my-model-armor-extension
      spec:
        targetRefs:
        - group: "gateway.networking.k8s.io"
          kind: Gateway
          name: GATEWAY_NAME
        extensionChains:
        - name: my-model-armor-chain1
          matchCondition:
            celExpressions:
            - celMatcher: request.path.startsWith("/")
          extensions:
          - name: my-model-armor-service
            supportedEvents:
            - RequestHeaders
            timeout: 1s
            googleAPIServiceName: "modelarmor.us-central1.rep.googleapis.com"
            metadata:
              'extensionPolicy': MODEL_ARMOR_TEMPLATE_NAME
              'sanitizeUserPrompt': 'true'
              'sanitizeUserResponse': 'true'
      

      Replace the following:

      • GATEWAY_NAME: the name of the Gateway.
      • MODEL_ARMOR_TEMPLATE_NAME: the name of your Model Armor template.

      The gcp-traffic-extension.yaml file includes the following settings:

      • targetRefs: specifies the Gateway to which this extension applies.
      • extensionChains: defines a chain of extensions to be applied to the traffic.
      • matchCondition: defines the conditions under which the extensions are applied.
      • extensions: defines the extensions to be applied.
      • supportedEvents: specifies the events during which the extension is invoked.
      • timeout: specifies the timeout for the extension.
      • googleAPIServiceName: specifies the service name for the extension.
      • metadata: specifies the metadata for the extension, including the extensionPolicy and prompt or response sanitization settings.
    2. Apply the sample manifest to your cluster:

      kubectl apply -f gcp-traffic-extension.yaml
      

After you configure the AI safety checks and integrate them with your Gateway, Model Armor automatically filters prompts and responses based on the defined rules.
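
To verify that the extension is attached, you can inspect the resource and its status conditions. The following is a minimal check, assuming kubectl is configured against your cluster and that the resource plural is gcptrafficextensions:

    # Inspect the extension resource; its status conditions indicate
    # whether the extension was accepted and attached to the Gateway.
    kubectl describe gcptrafficextensions.networking.gke.io my-model-armor-extension

    # Confirm that the targeted Gateway is still programmed.
    kubectl describe gateway GATEWAY_NAME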

Configure observability

GKE Inference Gateway provides insights into the health, performance, and behavior of your inference workloads. These insights help you identify and resolve issues, optimize resource utilization, and ensure the reliability of your applications.

Google Cloud provides the following Cloud Monitoring dashboards that offer inference observability for GKE Inference Gateway:

  • GKE Inference Gateway dashboard: provides golden metrics for LLM serving, such as request and token throughput, latency, errors, and cache utilization for the InferencePool. To see the complete list of available GKE Inference Gateway metrics, see Exposed metrics.
  • Model server dashboard: provides golden signals for the model servers, such as KV cache utilization and queue length, so that you can monitor their load and performance.
  • Load balancer dashboard: reports metrics from the load balancer, such as requests per second, end-to-end request serving latency, and request-response status codes. These metrics help you understand the performance of end-to-end request serving and identify errors.
  • Data Center GPU Manager (DCGM) metrics: reports NVIDIA GPU metrics, such as GPU performance and utilization. You can configure NVIDIA Data Center GPU Manager (DCGM) metrics in Cloud Monitoring. For more information, see Collect and view DCGM metrics.
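
For example, you can enable DCGM metrics as a GKE monitoring component from the CLI. The following is a sketch, assuming your cluster runs a GKE version that supports the DCGM component (see Collect and view DCGM metrics for requirements):

    # Enable the DCGM monitoring component on an existing cluster.
    # CLUSTER_NAME and LOCATION are placeholders for your cluster.
    # The --monitoring flag replaces the enabled component set, so
    # include SYSTEM and any other components you already use.
    gcloud container clusters update CLUSTER_NAME \
        --location LOCATION \
        --monitoring=SYSTEM,DCGM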

View GKE Inference Gateway dashboard

To view the GKE Inference Gateway dashboard, perform the following steps:

  1. In the Google Cloud console, go to the Monitoring page.

    Go to Monitoring

  2. In the navigation pane, select Dashboards.

  3. In the Integrations section, select GMP.

  4. On the Dashboard Templates page, search for "Gateway".

  5. Open the GKE Inference Gateway dashboard.

Alternatively, you can follow the instructions in Monitoring dashboard.

Configure model server observability dashboard

To collect golden signals from each model server and understand what contributes to GKE Inference Gateway performance, you can configure auto-monitoring for your model servers, such as vLLM.

To view the integration dashboards, perform the following steps:

  1. Collect the metrics from your model server.
  2. In the Google Cloud console, go to the Monitoring page.

    Go to Monitoring

  3. In the navigation pane, select Dashboards.

  4. Under Integrations, select GMP. The corresponding integration dashboards are displayed.

    Figure: Integration dashboards

For more information, see Customize monitoring for applications.
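
As an illustration, you can scrape model server metrics with Google Cloud Managed Service for Prometheus by deploying a PodMonitoring resource. The following is a minimal sketch, assuming your model server Pods carry the hypothetical label app: vllm and expose Prometheus metrics on port 8000; adjust the selector, namespace, and port to match your Deployment. Save it as vllm-podmonitoring.yaml and apply it:

    # The selector label (app: vllm), namespace, and port (8000) are
    # assumptions; adjust them to match your model server Deployment.
    apiVersion: monitoring.googleapis.com/v1
    kind: PodMonitoring
    metadata:
      name: vllm-metrics
      namespace: default
    spec:
      selector:
        matchLabels:
          app: vllm
      endpoints:
      - port: 8000
        path: /metrics
        interval: 30s

    kubectl apply -f vllm-podmonitoring.yaml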

Configure the load balancer observability dashboard

To use an external Application Load Balancer with GKE Inference Gateway, import the dashboard by performing the following steps:

  1. To create the load balancer dashboard, create the following file and save it as dashboard.json:

    
    {
        "displayName": "GKE Inference Gateway (Load Balancer) Prometheus Overview",
        "dashboardFilters": [
          {
            "filterType": "RESOURCE_LABEL",
            "labelKey": "cluster",
            "templateVariable": "",
            "valueType": "STRING"
          },
          {
            "filterType": "RESOURCE_LABEL",
            "labelKey": "location",
            "templateVariable": "",
            "valueType": "STRING"
          },
          {
            "filterType": "RESOURCE_LABEL",
            "labelKey": "namespace",
            "templateVariable": "",
            "valueType": "STRING"
          },
          {
            "filterType": "RESOURCE_LABEL",
            "labelKey": "forwarding_rule_name",
            "templateVariable": "",
            "valueType": "STRING"
          }
        ],
        "labels": {},
        "mosaicLayout": {
          "columns": 48,
          "tiles": [
            {
              "height": 8,
              "width": 48,
              "widget": {
                "title": "",
                "id": "",
                "text": {
                  "content": "### Inferece Gateway Metrics\n\nPlease refer to the [official documentation](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/site-src/guides/metrics.md) for more details of underlying metrics used in the dashboard.\n\n\n### External Application Load Balancer Metrics\n\nPlease refer to the [pubic page](/load-balancing/docs/metrics) for complete list of External Application Load Balancer metrics.\n\n### Model Server Metrics\n\nYou can redirect to the detail dashboard for model servers under the integration tab",
                  "format": "MARKDOWN",
                  "style": {
                    "backgroundColor": "#FFFFFF",
                    "fontSize": "FS_EXTRA_LARGE",
                    "horizontalAlignment": "H_LEFT",
                    "padding": "P_EXTRA_SMALL",
                    "pointerLocation": "POINTER_LOCATION_UNSPECIFIED",
                    "textColor": "#212121",
                    "verticalAlignment": "V_TOP"
                  }
                }
              }
            },
            {
              "yPos": 8,
              "height": 4,
              "width": 48,
              "widget": {
                "title": "External Application Load Balancer",
                "id": "",
                "sectionHeader": {
                  "dividerBelow": false,
                  "subtitle": ""
                }
              }
            },
            {
              "yPos": 12,
              "height": 15,
              "width": 24,
              "widget": {
                "title": "E2E Request Latency p99 (by code)",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.99, sum by(le, response_code) (rate(loadbalancing_googleapis_com:https_external_regional_total_latencies_bucket{monitored_resource=\"http_external_regional_lb_rule\",forwarding_rule_name=~\".*inference-gateway.*\"}[1m])))",
                        "unitOverride": "ms"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 12,
              "height": 43,
              "width": 48,
              "widget": {
                "title": "Regional",
                "collapsibleGroup": {
                  "collapsed": false
                },
                "id": ""
              }
            },
            {
              "yPos": 12,
              "xPos": 24,
              "height": 15,
              "width": 24,
              "widget": {
                "title": "E2E Request Latency p95 (by code)",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.95, sum by(le, response_code) (rate(loadbalancing_googleapis_com:https_external_regional_total_latencies_bucket{monitored_resource=\"http_external_regional_lb_rule\",forwarding_rule_name=~\".*inference-gateway.*\"}[1m])))",
                        "unitOverride": "ms"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 27,
              "height": 15,
              "width": 24,
              "widget": {
                "title": "E2E Request Latency p90 (by code)",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.90, sum by(le, response_code) (rate(loadbalancing_googleapis_com:https_external_regional_total_latencies_bucket{monitored_resource=\"http_external_regional_lb_rule\",forwarding_rule_name=~\".*inference-gateway.*\"}[1m])))",
                        "unitOverride": "ms"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 27,
              "xPos": 24,
              "height": 15,
              "width": 24,
              "widget": {
                "title": "E2E Request Latency p50 (by code)",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.50, sum by(le, response_code) (rate(loadbalancing_googleapis_com:https_external_regional_total_latencies_bucket{monitored_resource=\"http_external_regional_lb_rule\",forwarding_rule_name=~\".*inference-gateway.*\"}[1m])))",
                        "unitOverride": "ms"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 42,
              "height": 13,
              "width": 48,
              "widget": {
                "title": "Request /s (by code)",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "sum by (response_code)(rate(loadbalancing_googleapis_com:https_external_regional_request_count{monitored_resource=\"http_external_regional_lb_rule\", forwarding_rule_name=~\".*inference-gateway.*\"}[1m]))",
                        "unitOverride": ""
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 55,
              "height": 4,
              "width": 48,
              "widget": {
                "title": "Inference Optimized Gateway",
                "id": "",
                "sectionHeader": {
                  "dividerBelow": false,
                  "subtitle": ""
                }
              }
            },
            {
              "yPos": 59,
              "height": 17,
              "width": 48,
              "widget": {
                "title": "Request Latency",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p95",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[${__interval}])))",
                        "unitOverride": "s"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p90",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[${__interval}])))",
                        "unitOverride": "s"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p50",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[${__interval}])))",
                        "unitOverride": "s"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 59,
              "height": 65,
              "width": 48,
              "widget": {
                "title": "Inference Model",
                "collapsibleGroup": {
                  "collapsed": false
                },
                "id": ""
              }
            },
            {
              "yPos": 76,
              "height": 16,
              "width": 24,
              "widget": {
                "title": "Request / s",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "sum by(model_name, target_model_name) (rate(inference_model_request_total{}[${__interval}]))",
                        "unitOverride": ""
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 76,
              "xPos": 24,
              "height": 16,
              "width": 24,
              "widget": {
                "title": "Request Error / s",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "sum by (error_code,model_name,target_model_name) (rate(inference_model_request_error_total[${__interval}]))",
                        "unitOverride": ""
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 92,
              "height": 16,
              "width": 24,
              "widget": {
                "title": "Request Size",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p95",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(inference_model_request_sizes_bucket{}[${__interval}])))",
                        "unitOverride": "By"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p90",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(inference_model_request_sizes_bucket{}[${__interval}])))",
                        "unitOverride": "By"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p50",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(inference_model_request_sizes_bucket{}[${__interval}])))",
                        "unitOverride": "By"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 92,
              "xPos": 24,
              "height": 16,
              "width": 24,
              "widget": {
                "title": "Response Size",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p95",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(inference_model_response_sizes_bucket{}[${__interval}])))",
                        "unitOverride": "By"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p90",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(inference_model_response_sizes_bucket{}[${__interval}])))",
                        "unitOverride": "By"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p50",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(inference_model_response_sizes_bucket{}[${__interval}])))",
                        "unitOverride": "By"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 108,
              "height": 16,
              "width": 24,
              "widget": {
                "title": "Input Token Count",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p95",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(inference_model_input_tokens_bucket{}[${__interval}])))",
                        "unitOverride": ""
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p90",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(inference_model_input_tokens_bucket{}[${__interval}])))",
                        "unitOverride": ""
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p50",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(inference_model_input_tokens_bucket{}[${__interval}])))",
                        "unitOverride": ""
                      }
                    }
                  ],
                  "thresholds": []
                }
              }
            },
            {
              "yPos": 108,
              "xPos": 24,
              "height": 16,
              "width": 24,
              "widget": {
                "title": "Output Token Count",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p95",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(inference_model_output_tokens_bucket{}[${__interval}])))",
                        "unitOverride": ""
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p90",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(inference_model_output_tokens_bucket{}[${__interval}])))",
                        "unitOverride": ""
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p50",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(inference_model_output_tokens_bucket{}[${__interval}])))",
                        "unitOverride": ""
                      }
                    }
                  ],
                  "thresholds": []
                }
              }
            },
            {
              "yPos": 124,
              "height": 16,
              "width": 24,
              "widget": {
                "title": "Average KV Cache Utilization",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "sum by (name)(avg_over_time(inference_pool_average_kv_cache_utilization[${__interval}]))*100",
                        "unitOverride": "%"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 124,
              "height": 16,
              "width": 48,
              "widget": {
                "title": "Inference Pool",
                "collapsibleGroup": {
                  "collapsed": false
                },
                "id": ""
              }
            },
            {
              "yPos": 124,
              "xPos": 24,
              "height": 16,
              "width": 24,
              "widget": {
                "title": "Average Queue Size",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "sum by (name) (avg_over_time(inference_pool_average_queue_size[${__interval}]))",
                        "unitOverride": ""
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 140,
              "height": 4,
              "width": 48,
              "widget": {
                "title": "Model Server",
                "id": "",
                "sectionHeader": {
                  "dividerBelow": true,
                  "subtitle": "The following charts will only be populated if model server is exporting metrics."
                }
              }
            },
            {
              "yPos": 144,
              "height": 32,
              "width": 48,
              "widget": {
                "title": "vLLM",
                "collapsibleGroup": {
                  "collapsed": false
                },
                "id": ""
              }
            },
            {
              "yPos": 144,
              "xPos": 1,
              "height": 16,
              "width": 24,
              "widget": {
                "title": "Token Throughput",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "Prompt Tokens/Sec",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "sum by(model_name) (rate(vllm:prompt_tokens_total[${__interval}]))",
                        "unitOverride": ""
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "Generation Tokens/Sec",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "sum by(model_name) (rate(vllm:generation_tokens_total[${__interval}]))",
                        "unitOverride": ""
                      }
                    }
                  ],
                  "thresholds": []
                }
              }
            },
            {
              "yPos": 144,
              "xPos": 25,
              "height": 16,
              "width": 23,
              "widget": {
                "title": "Request Latency",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p95",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[${__interval}])))",
                        "unitOverride": "s"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p90",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[${__interval}])))",
                        "unitOverride": "s"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p50",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[${__interval}])))",
                        "unitOverride": "s"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 160,
              "xPos": 1,
              "height": 16,
              "width": 24,
              "widget": {
                "title": "Time Per Output Token Latency",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p95",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[${__interval}])))",
                        "unitOverride": "s"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p90",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[${__interval}])))",
                        "unitOverride": "s"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p50",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[${__interval}])))",
                        "unitOverride": "s"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            },
            {
              "yPos": 160,
              "xPos": 25,
              "height": 16,
              "width": 23,
              "widget": {
                "title": "Time To First Token Latency",
                "id": "",
                "xyChart": {
                  "chartOptions": {
                    "displayHorizontal": false,
                    "mode": "COLOR",
                    "showLegend": false
                  },
                  "dataSets": [
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p95",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[${__interval}])))",
                        "unitOverride": "s"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p90",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[${__interval}])))",
                        "unitOverride": "s"
                      }
                    },
                    {
                      "breakdowns": [],
                      "dimensions": [],
                      "legendTemplate": "p50",
                      "measures": [],
                      "plotType": "LINE",
                      "targetAxis": "Y1",
                      "timeSeriesQuery": {
                        "outputFullDuration": false,
                        "prometheusQuery": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[${__interval}])))",
                        "unitOverride": "s"
                      }
                    }
                  ],
                  "thresholds": [],
                  "yAxis": {
                    "label": "",
                    "scale": "LINEAR"
                  }
                }
              }
            }
          ]
        }
      }
    
  2. To import the dashboard into Cloud Monitoring, run the following command:

    gcloud monitoring dashboards create --project $PROJECT_ID --config-from-file=dashboard.json
    
  3. Open the Monitoring page in the Google Cloud console.

    Go to Monitoring

  4. On the navigation menu, select Dashboards.

  5. Select the GKE Inference Gateway (Load Balancer) Prometheus Overview dashboard from the list of custom dashboards.

  6. The External Application Load Balancer section displays load-balancing metrics such as the following:

    • E2E Request Latency p99 (by code): shows the ninety-ninth percentile of end-to-end request latency for requests that the load balancer serves, aggregated by returned status code.
    • Request/s (by code): shows the number of requests that the load balancer serves, aggregated by returned status code.
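
To confirm that the dashboard was imported, you can list your custom dashboards from the gcloud CLI. The following is a quick check, assuming the CLI is authenticated against the same project:

    # List dashboard display names and filter for the imported overview.
    gcloud monitoring dashboards list \
        --project $PROJECT_ID \
        --format="value(displayName)" | grep "GKE Inference Gateway"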

Configure logging for GKE Inference Gateway

Configuring logging for GKE Inference Gateway provides detailed information about requests and responses, which is useful for troubleshooting, auditing, and performance analysis. HTTP access logs record every request and response, including headers, status codes, and timestamps. This level of detail can help you identify issues, find errors, and understand the behavior of your inference workloads.

To configure logging for GKE Inference Gateway, enable HTTP access logging for each of your InferencePool objects.

  1. Save the following sample manifest as logging-backend-policy.yaml:

    apiVersion: networking.gke.io/v1
    kind: GCPBackendPolicy
    metadata:
      name: logging-backend-policy
      namespace: NAMESPACE_NAME
    spec:
      default:
        logging:
          enabled: true
          sampleRate: 500000
      targetRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: INFERENCE_POOL_NAME
    

    Replace the following:

    • NAMESPACE_NAME: the name of the namespace where your InferencePool is deployed.
    • INFERENCE_POOL_NAME: the name of the InferencePool.
  2. Apply the sample manifest to your cluster:

    kubectl apply -f logging-backend-policy.yaml
    

After you apply this manifest, GKE Inference Gateway enables HTTP access logs for the specified InferencePool. You can view these logs in Cloud Logging. The logs include detailed information about each request and response, such as the request URL, headers, response status code, and latency.
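
In the sample policy, sampleRate is expressed in parts per million, so 500000 logs roughly half of all requests; set it to 1000000 to log every request. To inspect the resulting access logs from the CLI, you can query Cloud Logging. The following is a sketch, assuming the regional external Application Load Balancer resource type that also appears in the dashboard queries on this page; verify the type for your Gateway class:

    # Read recent access log entries for the load balancer.
    gcloud logging read \
        'resource.type="http_external_regional_lb_rule"' \
        --project $PROJECT_ID \
        --limit 10 \
        --format json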

Configure autoscaling

Autoscaling adjusts resource allocation in response to load variations, maintaining performance and resource efficiency by dynamically adding or removing Pods based on demand. For GKE Inference Gateway, this involves horizontal autoscaling of Pods in each InferencePool. The GKE Horizontal Pod Autoscaler (HPA) autoscales Pods based on model server metrics such as KV cache utilization. This ensures that the inference service handles different workloads and query volumes while efficiently managing resource usage.

To configure InferencePool instances so they autoscale based on metrics produced by GKE Inference Gateway, perform the following steps:

  1. Deploy a PodMonitoring object in the cluster to collect metrics produced by GKE Inference Gateway. For more information, see Configure observability.

  2. Deploy the Custom Metrics Stackdriver Adapter to give HPA access to the metrics:

    1. Save the following sample manifest as adapter_new_resource_model.yaml:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: custom-metrics
      ---
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: custom-metrics:system:auth-delegator
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:auth-delegator
      subjects:
      - kind: ServiceAccount
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: custom-metrics-auth-reader
        namespace: kube-system
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: Role
        name: extension-apiserver-authentication-reader
      subjects:
      - kind: ServiceAccount
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: custom-metrics-resource-reader
      rules:
      - apiGroups:
        - ""
        resources:
        - pods
        - nodes
        - nodes/stats
        verbs:
        - get
        - list
        - watch
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: custom-metrics-resource-reader
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: custom-metrics-resource-reader
      subjects:
      - kind: ServiceAccount
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
        labels:
          run: custom-metrics-stackdriver-adapter
          k8s-app: custom-metrics-stackdriver-adapter
      spec:
        replicas: 1
        selector:
          matchLabels:
            run: custom-metrics-stackdriver-adapter
            k8s-app: custom-metrics-stackdriver-adapter
        template:
          metadata:
            labels:
              run: custom-metrics-stackdriver-adapter
              k8s-app: custom-metrics-stackdriver-adapter
              kubernetes.io/cluster-service: "true"
          spec:
            serviceAccountName: custom-metrics-stackdriver-adapter
            containers:
            - image: gcr.io/gke-release/custom-metrics-stackdriver-adapter:v0.15.2-gke.1
              imagePullPolicy: Always
              name: pod-custom-metrics-stackdriver-adapter
              command:
              - /adapter
              - --use-new-resource-model=true
              - --fallback-for-container-metrics=true
              resources:
                limits:
                  cpu: 250m
                  memory: 200Mi
                requests:
                  cpu: 250m
                  memory: 200Mi
      ---
      apiVersion: v1
      kind: Service
      metadata:
        labels:
          run: custom-metrics-stackdriver-adapter
          k8s-app: custom-metrics-stackdriver-adapter
          kubernetes.io/cluster-service: 'true'
          kubernetes.io/name: Adapter
        name: custom-metrics-stackdriver-adapter
        namespace: custom-metrics
      spec:
        ports:
        - port: 443
          protocol: TCP
          targetPort: 443
        selector:
          run: custom-metrics-stackdriver-adapter
          k8s-app: custom-metrics-stackdriver-adapter
        type: ClusterIP
      ---
      apiVersion: apiregistration.k8s.io/v1
      kind: APIService
      metadata:
        name: v1beta1.custom.metrics.k8s.io
      spec:
        insecureSkipTLSVerify: true
        group: custom.metrics.k8s.io
        groupPriorityMinimum: 100
        versionPriority: 100
        service:
          name: custom-metrics-stackdriver-adapter
          namespace: custom-metrics
        version: v1beta1
      ---
      apiVersion: apiregistration.k8s.io/v1
      kind: APIService
      metadata:
        name: v1beta2.custom.metrics.k8s.io
      spec:
        insecureSkipTLSVerify: true
        group: custom.metrics.k8s.io
        groupPriorityMinimum: 100
        versionPriority: 200
        service:
          name: custom-metrics-stackdriver-adapter
          namespace: custom-metrics
        version: v1beta2
      ---
      apiVersion: apiregistration.k8s.io/v1
      kind: APIService
      metadata:
        name: v1beta1.external.metrics.k8s.io
      spec:
        insecureSkipTLSVerify: true
        group: external.metrics.k8s.io
        groupPriorityMinimum: 100
        versionPriority: 100
        service:
          name: custom-metrics-stackdriver-adapter
          namespace: custom-metrics
        version: v1beta1
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: external-metrics-reader
      rules:
      - apiGroups:
        - "external.metrics.k8s.io"
        resources:
        - "*"
        verbs:
        - list
        - get
        - watch
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: external-metrics-reader
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: external-metrics-reader
      subjects:
      - kind: ServiceAccount
        name: horizontal-pod-autoscaler
        namespace: kube-system
      
    2. Apply the sample manifest to your cluster:

      kubectl apply -f adapter_new_resource_model.yaml
      
  3. To give the adapter permission to read metrics from your project, run the following commands:

    PROJECT_ID=PROJECT_ID
    PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
    gcloud projects add-iam-policy-binding projects/$PROJECT_ID \
      --role roles/monitoring.viewer \
      --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/custom-metrics/sa/custom-metrics-stackdriver-adapter
    

    Replace PROJECT_ID with your Google Cloud project ID.

  4. For each InferencePool, deploy an HPA resource similar to the following:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: INFERENCE_POOL_NAME
      namespace: INFERENCE_POOL_NAMESPACE
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: INFERENCE_POOL_NAME
      minReplicas: MIN_REPLICAS
      maxReplicas: MAX_REPLICAS
      metrics:
      - type: External
        external:
          metric:
            name: prometheus.googleapis.com|inference_pool_average_kv_cache_utilization|gauge
            selector:
              matchLabels:
                metric.labels.name: INFERENCE_POOL_NAME
                resource.labels.cluster: CLUSTER_NAME
                resource.labels.namespace: INFERENCE_POOL_NAMESPACE
          target:
            type: AverageValue
            averageValue: TARGET_VALUE
    

    Replace the following:

    • INFERENCE_POOL_NAME: the name of the InferencePool.
    • INFERENCE_POOL_NAMESPACE: the namespace of the InferencePool.
    • CLUSTER_NAME: the name of the cluster.
    • MIN_REPLICAS: the minimum availability of the InferencePool (baseline capacity). HPA keeps this number of replicas running when usage is below the HPA target threshold. For highly available workloads, set this to a value greater than 1 to maintain availability during Pod disruptions.
    • MAX_REPLICAS: the value that constrains the number of accelerators assigned to the workloads hosted in the InferencePool. HPA won't increase the number of replicas beyond this value. During peak traffic, monitor the number of replicas to make sure that MAX_REPLICAS leaves enough headroom for the workload to scale up while maintaining your target performance characteristics.
    • TARGET_VALUE: the target KV cache utilization per model server. This is a number between 0 and 100 and depends heavily on the model server, model, accelerator, and incoming traffic characteristics. You can determine this target experimentally through load testing: plot throughput against latency, pick the throughput and latency combination you want, and use the corresponding KV cache utilization value as the HPA target. Tune and monitor this value closely to achieve your target price-performance. You can use GKE Inference Recommendations to determine this value automatically.
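
After you deploy the adapter and the HPA, you can verify that the external metric is visible to the autoscaler. The following is a sketch, assuming the metrics pipeline configured in the preceding steps:

    # Confirm that the adapter is serving the external metrics API.
    kubectl get apiservices v1beta1.external.metrics.k8s.io

    # Check that the HPA reads the KV cache utilization metric and
    # reports current and target values in its status and events.
    kubectl describe hpa INFERENCE_POOL_NAME -n INFERENCE_POOL_NAMESPACE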

What's next