
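Customize GKE Inference Gateway configuration

Autopilot Standard

This page describes how to customize your GKE Inference Gateway deployment.

This page is for networking specialists who manage GKE infrastructure and for platform administrators who manage AI workloads.

To manage and optimize inference workloads, you configure the advanced features of GKE Inference Gateway.

Learn about and configure the following advanced features:

To use the Model Armor integration, configure AI security and safety checks.
To view GKE Inference Gateway and model server metrics and dashboards, and to enable HTTP access logging for detailed request and response information, configure observability.
To automatically scale your GKE Inference Gateway deployment, configure autoscaling.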

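Configure AI security and safety checks

GKE Inference Gateway integrates with Model Armor to run safety checks on prompts and responses for applications that use large language models (LLMs). This integration adds a layer of safety enforcement at the infrastructure level that complements application-level safety measures, so you can manage policies for all LLM traffic centrally.

The following diagram shows the Model Armor integration with GKE Inference Gateway on a GKE cluster:

Figure: Model Armor integration on a GKE cluster

To configure AI security and safety checks, perform the following steps:

Make sure that you meet the following prerequisites:
1. Enable the Model Armor service in your Google Cloud project.
2. Create a Model Armor template by using the Model Armor console, the Google Cloud CLI, or the API.
Make sure that you have created a Model Armor template named my-model-armor-template-name-id.

To configure the GCPTrafficExtension, perform the following steps:

Save the following sample manifest as gcp-traffic-extension.yaml: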

kind: GCPTrafficExtension
apiVersion: networking.gke.io/v1
metadata:
  name: my-model-armor-extension
spec:
  targetRefs:
  - group: "gateway.networking.k8s.io"
    kind: Gateway
    name: GATEWAY_NAME
  extensionChains:
  - name: my-model-armor-chain1
    matchCondition:
      celExpressions:
      - celMatcher: request.path.startsWith("/")
    extensions:
    - name: my-model-armor-service
      supportedEvents:
      - RequestHeaders
      timeout: 1s
      googleAPIServiceName: "modelarmor.us-central1.rep.googleapis.com"
      metadata:
        'extensionPolicy': MODEL_ARMOR_TEMPLATE_NAME
        'sanitizeUserPrompt': 'true'
        'sanitizeUserResponse': 'true'

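Replace the following:

GATEWAY_NAME: the name of the Gateway.
MODEL_ARMOR_TEMPLATE_NAME: the name of your Model Armor template.

The gcp-traffic-extension.yaml file includes the following settings:

targetRefs: specifies the Gateway that this extension applies to.
extensionChains: defines the chain of extensions to apply to traffic.
matchCondition: defines the conditions under which the extensions are applied.
extensions: defines the extensions to apply.
supportedEvents: specifies the events on which the extension is invoked.
timeout: specifies the timeout for the extension.
googleAPIServiceName: specifies the service name for the extension.
metadata: specifies metadata for the extension, including the extensionPolicy and the prompt and response sanitization settings.

Apply the sample manifest to your cluster:

kubectl apply -f gcp-traffic-extension.yaml

After you configure the AI security and safety checks and integrate them with your Gateway, Model Armor automatically filters prompts and responses based on the defined rules.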

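Configure observability

GKE Inference Gateway provides insight into the health, performance, and behavior of your inference workloads. This insight helps you identify and resolve issues, use resources more efficiently, and keep your applications reliable.

Google Cloud provides the following Cloud Monitoring dashboards, which offer inference observability for GKE Inference Gateway:

GKE Inference Gateway dashboard: provides golden metrics for LLM serving, such as request and token throughput, latency, errors, and cache utilization for the InferencePool. For the complete list of available GKE Inference Gateway metrics, see "Exposed metrics".
Model server dashboard: provides a dashboard of golden signals for the model server, such as KVCache Utilization and Queue length, so that you can monitor the load and performance of your model servers.
Load balancer dashboard: reports load balancer metrics, such as requests per second, end-to-end request serving latency, and request-response status codes. These metrics help you understand end-to-end request serving performance and identify errors.
Data Center GPU Manager (DCGM) metrics: provides metrics from NVIDIA GPUs, such as GPU performance and utilization. You can configure NVIDIA Data Center GPU Manager (DCGM) metrics in Cloud Monitoring. For more information, see "Collect and view DCGM metrics".

View the GKE Inference Gateway dashboard

To view the GKE Inference Gateway dashboard, perform the following steps:

In the Google Cloud console, go to the Monitoring page.
In the navigation pane, select Dashboards.
In the Integrations section, select GMP.
On the Cloud Monitoring Dashboard Templates page, search for "Gateway".
View the GKE Inference Gateway dashboard.

Alternatively, you can follow the instructions in "Monitoring dashboard".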

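Configure the model server observability dashboard

To collect golden signals from each model server and understand which factors affect GKE Inference Gateway performance, you can configure automatic monitoring for your model servers. This includes the following model servers:

To view the integration dashboards, perform the following steps:

Collect metrics from your model server.
In the Google Cloud console, go to the Monitoring page.
In the navigation pane, select Dashboards.
Under Integrations, select GMP. The corresponding integration dashboards are displayed.

Figure: Integration dashboards

For more information, see "Customize monitoring for applications".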

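Configure Cloud Monitoring alerts

To configure Cloud Monitoring alerts for GKE Inference Gateway, complete the following steps:

Modify the thresholds in the alerts as needed. Save the following sample manifest as alerts.yaml: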

groups:
- name: gateway-api-inference-extension
  rules:
  - alert: HighInferenceRequestLatencyP99
    annotations:
      title: 'High latency (P99) for model {{ $labels.model_name }}'
      description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
    expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0
    for: 5m
    labels:
      severity: 'warning'
  - alert: HighInferenceErrorRate
    annotations:
      title: 'High error rate for model {{ $labels.model_name }}'
      description: 'The error rate for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 5% for 5 minutes.'
    expr: sum by (model_name) (rate(inference_model_request_error_total[5m])) / sum by (model_name) (rate(inference_model_request_total[5m])) > 0.05
    for: 5m
    labels:
      severity: 'critical'
      impact: 'availability'
  - alert: HighInferencePoolAvgQueueSize
    annotations:
      title: 'High average queue size for inference pool {{ $labels.name }}'
      description: 'The average number of requests pending in the queue for inference pool {{ $labels.name }} has been consistently above 50 for 5 minutes.'
    expr: inference_pool_average_queue_size > 50
    for: 5m
    labels:
      severity: 'critical'
      impact: 'performance'
  - alert: HighInferencePoolAvgKVCacheUtilization
    annotations:
      title: 'High KV cache utilization for inference pool {{ $labels.name }}'
      description: 'The average KV cache utilization for inference pool {{ $labels.name }} has been consistently above 90% for 5 minutes, indicating potential resource exhaustion.'
    expr: inference_pool_average_kv_cache_utilization > 0.9
    for: 5m
    labels:
      severity: 'critical'
      impact: 'resource_exhaustion'

To create the alerting policies, run the following command:
```
gcloud alpha monitoring policies migrate --policies-from-prometheus-alert-rules-yaml=alerts.yaml
```
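The new alerting policies appear on the Alerting page.

Modify alerts

For the complete list of the latest metrics, see the kubernetes-sigs/gateway-api-inference-extension GitHub repository. You can append new alerts to the manifest by using other metrics.

To change the sample alerts, consider the following alert as an example: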

  - alert: HighInferenceRequestLatencyP99
    annotations:
      title: 'High latency (P99) for model {{ $labels.model_name }}'
      description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
    expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0
    for: 5m
    labels:
      severity: 'warning'

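This alert fires if the 99th percentile of request duration, computed over a 5-minute window, stays above 10 seconds for 5 minutes. You can modify the expr section of the alert to adjust the threshold to your requirements.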
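To make the expr threshold more concrete, the following Python sketch approximates how a Prometheus-style histogram_quantile estimates P99 from cumulative bucket counts, using linear interpolation inside the bucket that contains the target rank. The bucket boundaries and counts are made-up illustration values, not real GKE Inference Gateway data, and the function is a simplified stand-in rather than the actual Prometheus implementation.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, with the last upper bound equal to math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no meaningful quantile
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            span = cumulative - lower_count
            fraction = (rank - lower_count) / span if span > 0 else 0.0
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound

# Hypothetical cumulative bucket values (assumption, for illustration only).
example_buckets = [(1.0, 50.0), (5.0, 90.0), (10.0, 98.0), (20.0, 100.0), (math.inf, 100.0)]
p99 = estimate_quantile(0.99, example_buckets)
alert_condition = p99 > 10.0  # same comparison as the sample expr
print(f"estimated P99 = {p99:.1f}s, condition met = {alert_condition}")
```

With these sample buckets, the estimate lands between the 10-second and 20-second bounds, so the condition is met. In the real rule, the alert fires only after the condition has held for the duration given in the for field.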

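Configure logging for GKE Inference Gateway

Configuring logging for GKE Inference Gateway provides detailed information about requests and responses, which is useful for troubleshooting, auditing, and performance analysis. HTTP access logs record requests and responses, including headers, status codes, and timestamps. This level of detail helps you identify issues and errors and understand the behavior of your inference workloads.

To configure logging for GKE Inference Gateway, enable HTTP access logging for each InferencePool object.

Save the following sample manifest as logging-backend-policy.yaml: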

apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: logging-backend-policy
  namespace: NAMESPACE_NAME
spec:
  default:
    logging:
      enabled: true
      sampleRate: 500000
  targetRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: INFERENCE_POOL_NAME

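Replace the following:

NAMESPACE_NAME: the name of the namespace where the InferencePool is deployed.
INFERENCE_POOL_NAME: the name of the InferencePool.

Apply the sample manifest to your cluster:

kubectl apply -f logging-backend-policy.yaml

After you apply this manifest, GKE Inference Gateway enables HTTP access logging for the specified InferencePool. You can view these logs in Cloud Logging. The logs include detailed information about each request and response, such as the request URL, headers, response status code, and latency.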
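As a rough illustration of the sampleRate field, the following Python sketch converts the integer value into the fraction of requests that are logged. It assumes that sampleRate is expressed in parts per million, so that 1000000 means every request is logged and the 500000 in the sample manifest means about half of them. Confirm the range against the GCPBackendPolicy reference for your GKE version. The function names are hypothetical.

```python
MAX_SAMPLE_RATE = 1_000_000  # assumed upper bound: 1000000 = log 100% of requests

def logged_fraction(sample_rate: int) -> float:
    """Return the fraction of requests logged for a given sampleRate value."""
    if not 0 <= sample_rate <= MAX_SAMPLE_RATE:
        raise ValueError(f"sampleRate must be between 0 and {MAX_SAMPLE_RATE}")
    return sample_rate / MAX_SAMPLE_RATE

def expected_log_entries(requests: int, sample_rate: int) -> float:
    """Estimate how many access log entries a given request volume produces."""
    return requests * logged_fraction(sample_rate)

print(logged_fraction(500_000))               # 0.5
print(expected_log_entries(10_000, 500_000))  # 5000.0
```

A lower sampleRate reduces logging volume, at the cost of not seeing every request in Cloud Logging.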

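Configure autoscaling

Autoscaling adjusts resource allocation in response to load changes, dynamically adding or removing Pods based on demand to maintain performance and resource efficiency. For GKE Inference Gateway, this involves horizontal autoscaling of the Pods in each InferencePool. The GKE Horizontal Pod Autoscaler (HPA) autoscales Pods based on model server metrics such as KVCache Utilization. This ensures that the inference service can handle varying workloads and query volumes while managing resource usage efficiently.

To configure InferencePool instances to autoscale based on metrics produced by GKE Inference Gateway, perform the following steps:

Deploy a PodMonitoring object in the cluster to collect the metrics produced by GKE Inference Gateway. For more information, see "Configure observability".

Deploy the Custom Metrics Stackdriver Adapter to give the HPA access to the metrics:

Save the following sample manifest as adapter_new_resource_model.yaml: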

apiVersion: v1
kind: Namespace
metadata:
  name: custom-metrics
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-metrics-stackdriver-adapter
  namespace: custom-metrics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-metrics:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: custom-metrics-stackdriver-adapter
  namespace: custom-metrics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: custom-metrics-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: custom-metrics-stackdriver-adapter
  namespace: custom-metrics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-metrics-resource-reader
  namespace: custom-metrics
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - nodes/stats
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-metrics-resource-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: custom-metrics-resource-reader
subjects:
- kind: ServiceAccount
  name: custom-metrics-stackdriver-adapter
  namespace: custom-metrics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-metrics-stackdriver-adapter
  namespace: custom-metrics
  labels:
    run: custom-metrics-stackdriver-adapter
    k8s-app: custom-metrics-stackdriver-adapter
spec:
  replicas: 1
  selector:
    matchLabels:
      run: custom-metrics-stackdriver-adapter
      k8s-app: custom-metrics-stackdriver-adapter
  template:
    metadata:
      labels:
        run: custom-metrics-stackdriver-adapter
        k8s-app: custom-metrics-stackdriver-adapter
        kubernetes.io/cluster-service: "true"
    spec:
      serviceAccountName: custom-metrics-stackdriver-adapter
      containers:
      - image: gcr.io/gke-release/custom-metrics-stackdriver-adapter:v0.15.2-gke.1
        imagePullPolicy: Always
        name: pod-custom-metrics-stackdriver-adapter
        command:
        - /adapter
        - --use-new-resource-model=true
        - --fallback-for-container-metrics=true
        resources:
          limits:
            cpu: 250m
            memory: 200Mi
          requests:
            cpu: 250m
            memory: 200Mi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: custom-metrics-stackdriver-adapter
    k8s-app: custom-metrics-stackdriver-adapter
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: Adapter
  name: custom-metrics-stackdriver-adapter
  namespace: custom-metrics
spec:
  ports:
  - port: 443
    protocol: TCP
    targetPort: 443
  selector:
    run: custom-metrics-stackdriver-adapter
    k8s-app: custom-metrics-stackdriver-adapter
  type: ClusterIP
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  insecureSkipTLSVerify: true
  group: custom.metrics.k8s.io
  groupPriorityMinimum: 100
  versionPriority: 100
  service:
    name: custom-metrics-stackdriver-adapter
    namespace: custom-metrics
  version: v1beta1
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta2.custom.metrics.k8s.io
spec:
  insecureSkipTLSVerify: true
  group: custom.metrics.k8s.io
  groupPriorityMinimum: 100
  versionPriority: 200
  service:
    name: custom-metrics-stackdriver-adapter
    namespace: custom-metrics
  version: v1beta2
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  insecureSkipTLSVerify: true
  group: external.metrics.k8s.io
  groupPriorityMinimum: 100
  versionPriority: 100
  service:
    name: custom-metrics-stackdriver-adapter
    namespace: custom-metrics
  version: v1beta1
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-metrics-reader
rules:
- apiGroups:
  - "external.metrics.k8s.io"
  resources:
  - "*"
  verbs:
  - list
  - get
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-metrics-reader
subjects:
- kind: ServiceAccount
  name: horizontal-pod-autoscaler
  namespace: kube-system

Apply the sample manifest to your cluster:

kubectl apply -f adapter_new_resource_model.yaml

To grant the adapter permission to read metrics from your project, run the following commands:

PROJECT_ID=PROJECT_ID
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)")
gcloud projects add-iam-policy-binding "projects/$PROJECT_ID" \
  --role roles/monitoring.viewer \
  --member="principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/custom-metrics/sa/custom-metrics-stackdriver-adapter"

In the first command, replace the PROJECT_ID value with your Google Cloud project ID.
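The --member value in the preceding command follows a fixed pattern. The following Python sketch shows how the parts fit together. The function name and the example project values are hypothetical, and the default namespace and service account are the ones used in the adapter manifest.

```python
def adapter_principal(project_id: str, project_number: str,
                      namespace: str = "custom-metrics",
                      service_account: str = "custom-metrics-stackdriver-adapter") -> str:
    """Build the Workload Identity principal for a Kubernetes service account."""
    pool = f"{project_id}.svc.id.goog"  # workload identity pool of the project
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool}"
        f"/subject/ns/{namespace}/sa/{service_account}"
    )

# Hypothetical project values, for illustration only.
print(adapter_principal("example-project", "123456789012"))
```

Note that the project number appears in the resource path, while the project ID appears in the pool name.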

For each InferencePool, deploy an HPA similar to the following:
```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: INFERENCE_POOL_NAME
  namespace: INFERENCE_POOL_NAMESPACE
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: INFERENCE_POOL_NAME
  minReplicas: MIN_REPLICAS
  maxReplicas: MAX_REPLICAS
  metrics:
  - type: External
    external:
      metric:
        name: prometheus.googleapis.com|inference_pool_average_kv_cache_utilization|gauge
        selector:
          matchLabels:
            metric.labels.name: INFERENCE_POOL_NAME
            resource.labels.cluster: CLUSTER_NAME
            resource.labels.namespace: INFERENCE_POOL_NAMESPACE
      target:
        type: AverageValue
        averageValue: TARGET_VALUE
```
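Replace the following:
- INFERENCE_POOL_NAME: the name of the InferencePool.
- INFERENCE_POOL_NAMESPACE: the namespace of the InferencePool.
- CLUSTER_NAME: the name of the cluster.
- MIN_REPLICAS: the minimum availability (baseline capacity) of the InferencePool. The HPA keeps this number of replicas when usage is below the HPA target threshold. Highly available workloads must set this value to greater than 1 to ensure continued availability during Pod disruptions.
- MAX_REPLICAS: the value that constrains the number of accelerators that must be assigned to the workloads hosted in the InferencePool. The HPA doesn't increase the number of replicas beyond this value. During peak traffic, monitor the number of replicas to make sure that the value of the MAX_REPLICAS field leaves enough headroom for the workload to scale up and maintain its chosen performance characteristics.
- TARGET_VALUE: the value that represents the chosen target KV-Cache Utilization per model server. This is a number between 0 and 100, and it depends heavily on the model server, model, accelerator, and incoming traffic characteristics. You can determine this target value experimentally through load testing and by plotting a throughput versus latency graph. Select a throughput and latency combination from the graph, and use the corresponding KV-Cache Utilization value as the HPA target. You must tune and closely monitor this value to achieve your chosen price-performance results. You can use GKE Inference Recommendations to determine this value automatically.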
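
To build intuition for how MIN_REPLICAS, MAX_REPLICAS, and TARGET_VALUE interact, the following Python sketch mimics the Kubernetes HPA calculation for an External metric with an AverageValue target: the summed value of the matching time series is divided by the per-Pod target and rounded up, unless the current ratio is already within a tolerance. It is a simplified model that assumes the default 10% tolerance and ignores stabilization windows and scaling policies. The function name and the example numbers are hypothetical.

```python
import math

def desired_replicas(metric_value: float, target_average_value: float,
                     current_replicas: int, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Approximate the HPA replica count for an AverageValue target."""
    if target_average_value <= 0 or current_replicas <= 0:
        raise ValueError("target and current replica count must be positive")
    # Ratio of observed per-Pod usage to the per-Pod target.
    usage_ratio = metric_value / (target_average_value * current_replicas)
    if abs(1.0 - usage_ratio) <= tolerance:
        proposed = current_replicas  # close enough to the target: no change
    else:
        proposed = math.ceil(metric_value / target_average_value)
    # The replica count never leaves the configured bounds.
    return max(min_replicas, min(max_replicas, proposed))

# Hypothetical values, for illustration only.
print(desired_replicas(240, 60, current_replicas=3, min_replicas=2, max_replicas=8))  # 4
print(desired_replicas(185, 60, current_replicas=3, min_replicas=2, max_replicas=8))  # 3
print(desired_replicas(900, 60, current_replicas=3, min_replicas=2, max_replicas=8))  # 8
print(desired_replicas(30, 60, current_replicas=3, min_replicas=2, max_replicas=8))   # 2
```

The third and fourth calls show the clamping: a load spike cannot push the count above MAX_REPLICAS, and low usage cannot shrink it below MIN_REPLICAS.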
