
Configure autoscaling for LLM workloads on TPUs

Autopilot Standard

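This page shows how to set up your autoscaling infrastructure by using the GKE Horizontal Pod Autoscaler (HPA) to deploy the Gemma large language model (LLM) with single-host JetStream.

To learn more about selecting metrics for autoscaling, see Best practices for autoscaling LLM workloads with TPUs on GKE.

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.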


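If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update. Earlier gcloud CLI versions might not support the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set compute/zone instead. Setting a default location avoids the following gcloud CLI error: One of [--zone, --region] must be supplied: Please specify location. If your cluster is in a different location from the default you set, you might need to specify the location in certain commands.

Familiarize yourself with and complete the workflow in "Serve Gemma using TPUs on GKE with JetStream". Make sure that the PROMETHEUS_PORT argument is set in your JetStream deployment manifest.

Autoscale using metrics

You can use the workload-specific performance metrics emitted by the JetStream inference server, or TPU performance metrics, to direct autoscaling for your Pods.

To set up autoscaling with metrics, follow these steps:

Export the metrics from the JetStream server to Cloud Monitoring. You can use Google Cloud Managed Service for Prometheus, which simplifies deploying and configuring your Prometheus collector. Google Cloud Managed Service for Prometheus is enabled by default in your GKE cluster; you can also enable it manually.

The following example manifests show how to set up PodMonitoring resource definitions that direct Google Cloud Managed Service for Prometheus to scrape metrics from your Pods at 15-second intervals:

To scrape server metrics, use the following manifest. For server metrics, the shortest supported scrape interval is 5 seconds.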

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: jetstream-podmonitoring
spec:
  selector:
    matchLabels:
      app: maxengine-server
  endpoints:
  - interval: 15s
    path: "/"
    port: PROMETHEUS_PORT
  targetLabels:
    metadata:
    - pod
    - container
    - node

To scrape TPU metrics, use the following manifest. For system metrics, the shortest supported scrape interval is 15 seconds.

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: tpu-metrics-exporter
  namespace: kube-system
  labels:
    k8s-app: tpu-device-plugin
spec:
  endpoints:
    - port: 2112
      interval: 15s
  selector:
    matchLabels:
      k8s-app: tpu-device-plugin
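
As a minimal sketch of the interval limits stated above (5 seconds for server metrics, 15 seconds for TPU system metrics), the following Python helper checks a PodMonitoring interval string against those minimums. The function names and the simplified duration parser (whole seconds or minutes only) are illustrative assumptions, not part of any Google Cloud tooling.

```python
import re

# Minimum scrape intervals stated on this page, in seconds.
MIN_INTERVAL_SECONDS = {"server": 5, "tpu": 15}


def interval_seconds(interval: str) -> int:
    """Convert a simple duration such as '15s' or '1m' to seconds.

    Only whole seconds and minutes are handled; this is a simplification
    for illustration.
    """
    match = re.fullmatch(r"(\d+)([sm])", interval)
    if match is None:
        raise ValueError(f"unsupported interval format: {interval!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * (60 if unit == "m" else 1)


def is_supported_interval(interval: str, metric_kind: str) -> bool:
    """Return True if the interval is at or above the documented minimum."""
    return interval_seconds(interval) >= MIN_INTERVAL_SECONDS[metric_kind]


print(is_supported_interval("15s", "server"))  # 15s is above the 5s minimum
print(is_supported_interval("5s", "tpu"))      # 5s is below the 15s minimum
```

A check like this could run before you apply a manifest, so that an interval shorter than the supported minimum is caught early.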

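Install a metrics adapter. The adapter makes the server metrics that you exported to Monitoring visible to the HPA controller. For more details, see "Horizontal Pod autoscaling" in the Google Cloud Managed Service for Prometheus documentation.

- If you want JetStream to scale on individual metrics, use the Custom Metrics Stackdriver Adapter.
- If you want JetStream to scale on the value of an expression composed of multiple distinct metrics, use the third-party Prometheus Adapter.

Custom Metrics Stackdriver Adapter

The Custom Metrics Stackdriver Adapter supports querying metrics from Google Cloud Managed Service for Prometheus, starting with version v0.13.1 of the adapter.

To install the Custom Metrics Stackdriver Adapter, do the following:

Set up managed collection in your cluster.

Install the Custom Metrics Stackdriver Adapter in your cluster.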

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

If Workload Identity Federation for GKE is enabled on your cluster, you must also grant the Monitoring Viewer role to the service account that the adapter runs under. Replace PROJECT_ID with your project ID.

export PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format 'get(projectNumber)')
gcloud projects add-iam-policy-binding projects/PROJECT_ID \
  --role roles/monitoring.viewer \
  --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/custom-metrics/sa/custom-metrics-stackdriver-adapter
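
The --member value in the command above follows a fixed pattern. The sketch below assembles it in Python so that each placeholder is visible. The function name and the sample project values are hypothetical; the default namespace and service account names are the ones used in the command above.

```python
def adapter_principal(
    project_id: str,
    project_number: str,
    namespace: str = "custom-metrics",
    service_account: str = "custom-metrics-stackdriver-adapter",
) -> str:
    """Build the IAM principal for a Kubernetes service account.

    Mirrors the string passed to --member in the gcloud command above.
    """
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{project_id}.svc.id.goog"
        f"/subject/ns/{namespace}/sa/{service_account}"
    )


# Hypothetical project ID and project number, for illustration only.
print(adapter_principal("my-project", "123456789012"))
```

Note that the project number appears in the resource path, while the project ID appears in the workload identity pool name.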

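Prometheus Adapter

Be aware of these considerations when using prometheus-adapter to scale with Google Cloud Managed Service for Prometheus:

- Route queries through the Prometheus frontend UI proxy, just as you do when querying Google Cloud Managed Service for Prometheus with the Prometheus API or UI. This frontend is installed in a later step.
- By default, the prometheus-url argument of the prometheus-adapter Deployment is set to --prometheus-url=http://frontend.default.svc:9090/, where default is the namespace where you deployed the frontend. If you deployed the frontend in another namespace, configure this argument accordingly.
- In the .seriesQuery field of the rules configuration, you can't use a regular expression matcher on a metric name. Instead, fully specify metric names.
- Data can take slightly longer to become available in Google Cloud Managed Service for Prometheus than in upstream Prometheus, so overly eager autoscaling logic might cause unwanted behavior. Although data freshness is not guaranteed, data is typically available to query 3 to 7 seconds after it is sent to Google Cloud Managed Service for Prometheus, excluding any network latency.
- All queries issued by prometheus-adapter are global in scope. If you have applications in two namespaces that emit identically named metrics, an HPA configuration that uses that metric scales on data from both applications. To avoid scaling on incorrect data, always use namespace or cluster filters in your PromQL.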
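
To illustrate the last consideration, the following Python sketch builds a PromQL query that restricts a metric to one cluster and one namespace. The function name and the sample values are hypothetical; the label names cluster and namespace are the filters mentioned above.

```python
def scoped_avg_query(metric: str, cluster: str, namespace: str) -> str:
    """Return a PromQL average over one metric, limited to a cluster and namespace."""
    return f'avg({metric}{{cluster="{cluster}",namespace="{namespace}"}})'


# Hypothetical cluster name, for illustration only.
print(scoped_avg_query("jetstream_prefill_backlog_size", "my-cluster", "default"))
```

Without such filters, a second application that emits jetstream_prefill_backlog_size from another namespace would be averaged into the same value.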

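To set up an example HPA configuration that uses prometheus-adapter and managed collection, follow these steps:

Set up managed collection in your cluster.

Deploy the Prometheus frontend UI proxy in your cluster. Create the following manifest named prometheus-frontend.yaml: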

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: frontend
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: frontend
    template:
      metadata:
        labels:
          app: frontend
      spec:
        automountServiceAccountToken: true
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/arch
                  operator: In
                  values:
                  - arm64
                  - amd64
                - key: kubernetes.io/os
                  operator: In
                  values:
                  - linux
        containers:
        - name: frontend
          image: gke.gcr.io/prometheus-engine/frontend:v0.8.0-gke.4
          args:
          - "--web.listen-address=:9090"
          - "--query.project-id=PROJECT_ID"
          ports:
          - name: web
            containerPort: 9090
          readinessProbe:
            httpGet:
              path: /-/ready
              port: web
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - all
            privileged: false
            runAsGroup: 1000
            runAsNonRoot: true
            runAsUser: 1000
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: web
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: prometheus
  spec:
    clusterIP: None
    selector:
      app: frontend
    ports:
    - name: web
      port: 9090

Then, apply the manifest:

kubectl apply -f prometheus-frontend.yaml

Make sure that prometheus-adapter is installed in your cluster by installing the prometheus-community/prometheus-adapter Helm chart. Create the following values.yaml file:

rules:
  default: false
  external:
  - seriesQuery: 'jetstream_prefill_backlog_size'
    resources:
      template: <<.Resource>>
    name:
      matches: ""
      as: "jetstream_prefill_backlog_size"
    metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>,cluster="CLUSTER_NAME"})
  - seriesQuery: 'jetstream_slots_used_percentage'
    resources:
      template: <<.Resource>>
    name:
      matches: ""
      as: "jetstream_slots_used_percentage"
    metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>,cluster="CLUSTER_NAME"})
  - seriesQuery: 'memory_used'
    resources:
      template: <<.Resource>>
    name:
      matches: ""
      as: "memory_used_percentage"
    metricsQuery: avg(memory_used{cluster="CLUSTER_NAME",exported_namespace="default",container="jetstream-http"}) / avg(memory_total{cluster="CLUSTER_NAME",exported_namespace="default",container="jetstream-http"})

Then, use this file as the values file when you deploy the Helm chart:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update && helm install example-release prometheus-community/prometheus-adapter -f values.yaml
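
The metricsQuery entries in values.yaml are templates in which the adapter fills in the series name and label matchers. The Python sketch below only mimics that substitution with plain string replacement, to show roughly what PromQL results. It is a simplified stand-in, not the adapter's actual templating engine, and the label matcher string passed in is a hypothetical example.

```python
def render_metrics_query(template: str, series: str, label_matchers: str) -> str:
    """Substitute the two placeholders used in the values.yaml rules above."""
    return (
        template
        .replace("<<.Series>>", series)
        .replace("<<.LabelMatchers>>", label_matchers)
    )


template = 'avg(<<.Series>>{<<.LabelMatchers>>,cluster="CLUSTER_NAME"})'

# 'namespace="default"' is an assumed matcher, for illustration only.
print(render_metrics_query(template, "jetstream_prefill_backlog_size", 'namespace="default"'))
```

The cluster filter is written into the template itself, so it applies regardless of which matchers are filled in.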

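If you use Workload Identity Federation for GKE, you also need to configure and authorize a service account by running the following commands:

First, create the in-cluster and Google Cloud service accounts: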

gcloud iam service-accounts create prom-frontend-sa && kubectl create sa prom-frontend-sa

Then, bind the two service accounts. Make sure to replace PROJECT_ID with your project ID:

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[default/prom-frontend-sa]" \
  jetstream-iam-sa@PROJECT_ID.iam.gserviceaccount.com \
&&
kubectl annotate serviceaccount \
  --namespace default \
  prom-frontend-sa \
  iam.gke.io/gcp-service-account=jetstream-iam-sa@PROJECT_ID.iam.gserviceaccount.com

Next, grant the monitoring.viewer role to the Google Cloud service account:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=serviceAccount:jetstream-iam-sa@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/monitoring.viewer

Finally, set the frontend Deployment's service account to the new in-cluster service account:
```
kubectl set serviceaccount deployment frontend prom-frontend-sa
```

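Set up the metric-based HPA resource. Deploy an HPA resource that is based on your preferred server metric. For more details, see "Horizontal Pod autoscaling" in the Google Cloud Managed Service for Prometheus documentation. The specific HPA configuration depends on the type of metric (server or TPU) and on which metrics adapter is installed.

A few values are required in every HPA configuration and must be set before you can create an HPA resource:
- MIN_REPLICAS: the minimum number of JetStream Pod replicas allowed. If you did not modify the JetStream deployment manifest in the "Deploy JetStream" step, we recommend setting this value to 1.
- MAX_REPLICAS: the maximum number of JetStream Pod replicas allowed. The example JetStream deployment requires 8 chips per replica, and the node pool contains 16 chips. To keep scale-up latency low, set this value to 2. Larger values cause the cluster autoscaler to create new nodes in the node pool, which increases scale-up latency.
- TARGET: the targeted average value of this metric across all JetStream instances. To learn how the replica count is determined from this value, see the Kubernetes autoscaling documentation.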
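
The arithmetic behind MAX_REPLICAS and TARGET can be sketched in Python. The first function reproduces the chip calculation above (16 chips in the node pool, 8 per replica). The second follows the core formula from the Kubernetes autoscaling documentation, desiredReplicas = ceil(currentReplicas * currentValue / targetValue), with the default 10% tolerance. It is a simplified model under those assumptions: it ignores stabilization windows, Pod readiness, and missing metrics, and the function names are hypothetical.

```python
import math


def max_replicas_on_existing_nodes(chips_in_node_pool: int, chips_per_replica: int) -> int:
    """Number of replicas that fit without adding nodes to the node pool."""
    return chips_in_node_pool // chips_per_replica


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA calculation for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the MIN_REPLICAS..MAX_REPLICAS range.
    return max(min_replicas, min(max_replicas, desired))


print(max_replicas_on_existing_nodes(16, 8))
print(desired_replicas(1, current_value=25, target_value=10, min_replicas=1, max_replicas=2))
```

In the second call the raw result is 3 replicas, but MAX_REPLICAS caps it at 2, which matches the capacity of the example node pool.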
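
Custom Metrics Stackdriver Adapter

The Custom Metrics Stackdriver Adapter supports scaling your workload on the average value, across all Pods, of individual metric queries from Google Cloud Managed Service for Prometheus. When you use the Custom Metrics Stackdriver Adapter, we recommend scaling with the jetstream_prefill_backlog_size and jetstream_slots_used_percentage server metrics and the memory_used TPU metric.

To create an HPA manifest for scaling with server metrics, create the following hpa.yaml file: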
```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: jetstream-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: maxengine-server
  minReplicas: MIN_REPLICAS
  maxReplicas: MAX_REPLICAS
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|jetstream_METRIC|gauge
      target:
        type: AverageValue
        averageValue: TARGET
```
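
The metric name in the manifest above combines a fixed prefix, the Prometheus metric name, and the metric type, separated by pipe characters. The following Python sketch shows that pattern as it appears in the manifests on this page; the function name is hypothetical.

```python
def stackdriver_metric_name(prometheus_metric: str, metric_type: str = "gauge") -> str:
    """Build the metric name used in HPA manifests on this page."""
    return f"prometheus.googleapis.com|{prometheus_metric}|{metric_type}"


for metric in ("jetstream_prefill_backlog_size", "jetstream_slots_used_percentage"):
    print(stackdriver_metric_name(metric))
```

Replacing jetstream_METRIC in the manifest with one of the recommended server metrics yields names of this form.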
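
When you use the Custom Metrics Stackdriver Adapter with TPU metrics, we recommend scaling only on the kubernetes.io|node|accelerator|memory_used metric. To create an HPA manifest for scaling with this metric, create the following hpa.yaml file: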
```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: jetstream-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: maxengine-server
  minReplicas: MIN_REPLICAS
  maxReplicas: MAX_REPLICAS
  metrics:
  - type: External
    external:
      metric:
        name: prometheus.googleapis.com|memory_used|gauge
        selector:
          matchLabels:
            metric.labels.container: jetstream-http
            metric.labels.exported_namespace: default
      target:
        type: AverageValue
        averageValue: TARGET
```
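
Prometheus Adapter

The Prometheus Adapter supports scaling your workload on the value of PromQL queries from Google Cloud Managed Service for Prometheus. Earlier, you defined the jetstream_prefill_backlog_size and jetstream_slots_used_percentage server metrics, which represent the average value across all Pods.

To create an HPA manifest for scaling with server metrics, create the following hpa.yaml file: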
```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: jetstream-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: maxengine-server
  minReplicas: MIN_REPLICAS
  maxReplicas: MAX_REPLICAS
  metrics:
  - type: External
    external:
      metric:
        name: jetstream_METRIC
      target:
        type: AverageValue
        averageValue: TARGET
```
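
To create an HPA manifest for scaling with TPU metrics, we recommend using only the memory_used_percentage metric defined in the prometheus-adapter Helm values file. memory_used_percentage is the name given to the following PromQL query, which reflects the current average memory used across all accelerators: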
```
avg(kubernetes_io:node_accelerator_memory_used{cluster_name="CLUSTER_NAME"}) / avg(kubernetes_io:node_accelerator_memory_total{cluster_name="CLUSTER_NAME"})
```
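
As a worked example of what this query computes, the Python sketch below divides the average memory used by the average memory total over a set of accelerators. The sample numbers and the function name are made up for illustration. Note that the result of the query as written is a fraction between 0 and 1, so a TARGET for this metric would be expressed on the same scale (for example 0.75 rather than 75), assuming the query is used unchanged.

```python
from statistics import mean


def memory_used_fraction(used: list, total: list) -> float:
    """avg(used) / avg(total), mirroring the PromQL expression above."""
    return mean(used) / mean(total)


# Hypothetical per-accelerator readings in GiB: two accelerators with 16 GiB each.
print(memory_used_fraction(used=[8, 12], total=[16, 16]))
```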

To create an HPA manifest for scaling with memory_used_percentage, create the following hpa.yaml file:
```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: jetstream-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: maxengine-server
  minReplicas: MIN_REPLICAS
  maxReplicas: MAX_REPLICAS
  metrics:
  - type: External
    external:
      metric:
        name: memory_used_percentage
      target:
        type: AverageValue
        averageValue: TARGET
```

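Scale using multiple metrics

You can also configure scaling based on multiple metrics. To learn how the replica count is determined when multiple metrics are used, see the Kubernetes autoscaling documentation. To build this type of HPA manifest, collect all the entries from the spec.metrics field of each HPA resource into a single HPA resource. The following snippet shows an example of how to combine the HPA resources: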

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: jetstream-hpa-multiple-metrics
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: maxengine-server
  minReplicas: MIN_REPLICAS
  maxReplicas: MAX_REPLICAS
  metrics:
  - type: Pods
    pods:
      metric:
        name: jetstream_METRIC
      target:
        type: AverageValue
        averageValue: JETSTREAM_METRIC_TARGET
  - type: External
    external:
      metric:
        name: memory_used_percentage
      target:
        type: AverageValue
        averageValue: EXTERNAL_METRIC_TARGET
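
When an HPA lists several metrics, Kubernetes computes a desired replica count for each metric and uses the largest one. The Python sketch below illustrates that rule with the basic ratio formula. It leaves out tolerance, stabilization, and the MIN_REPLICAS and MAX_REPLICAS bounds, and the function names and sample values are hypothetical.

```python
import math


def desired_for_metric(current_replicas: int, current_value: float, target_value: float) -> int:
    """Desired replicas for one metric: ceil(replicas * current / target)."""
    return math.ceil(current_replicas * current_value / target_value)


def desired_for_all(current_replicas: int, metrics: list) -> int:
    """Take the largest per-metric proposal, as the HPA does with multiple metrics.

    `metrics` is a list of (current_value, target_value) pairs.
    """
    return max(
        desired_for_metric(current_replicas, current, target)
        for current, target in metrics
    )


# One replica; the backlog is below target, but memory use is above target.
print(desired_for_all(1, [(5, 10), (0.9, 0.6)]))
```

Here the backlog metric alone would keep one replica, while the memory metric asks for two, so the combined result is two.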

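Monitor and test autoscaling

You can observe how your JetStream workloads scale based on your HPA configuration.

To observe the replica count in real time, run the following command: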

kubectl get hpa --watch

The output of this command should be similar to the following:

NAME            REFERENCE                     TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
jetstream-hpa   Deployment/maxengine-server   0/10 (avg)   1         2         1          1m

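To test the scale-up behavior of your HPA, use the following command, which sends a burst of 100 requests to the model endpoint. This exhausts the available decode slots and causes a backlog of requests in the prefill queue, which triggers the HPA to increase the size of the model deployment.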

seq 100 | xargs -P 100 -n 1 curl --request POST --header "Content-type: application/json" -s localhost:8000/generate --data '{ "prompt": "Can you provide a comprehensive and detailed overview of the history and development of artificial intelligence.", "max_tokens": 200 }'
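
If you prefer to drive the same burst from a script, the following Python sketch is a rough equivalent of the command above, using only the standard library. It assumes the model endpoint is reachable at localhost:8000, as in the curl command; the function names are hypothetical, and a thread pool stands in for the parallelism that xargs -P 100 provides.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8000/generate"  # same endpoint as the curl command
PROMPT = (
    "Can you provide a comprehensive and detailed overview of the history "
    "and development of artificial intelligence."
)


def build_request(prompt: str = PROMPT, max_tokens: int = 200) -> urllib.request.Request:
    """Create the POST request that the curl command sends."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_one(_: int) -> int:
    """Send one request and return the HTTP status code.

    urlopen raises an exception on connection failures and HTTP error statuses.
    """
    with urllib.request.urlopen(build_request()) as response:
        return response.status


def run_load_test(total: int = 100, parallelism: int = 100) -> list:
    """Send `total` requests with up to `parallelism` in flight at once."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(send_one, range(total)))
```

Call run_load_test() while kubectl get hpa --watch is running in another terminal to see the replica count change.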

What's next

- Learn how to optimize Pod autoscaling based on metrics from Cloud Monitoring.
- Learn more about Horizontal Pod autoscaling in the open source Kubernetes documentation.