This tutorial describes how to deploy large language models (LLMs) on Google Kubernetes Engine (GKE) by using GKE Inference Gateway. It includes steps for cluster setup, model deployment, GKE Inference Gateway configuration, and handling LLM requests.
This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who want to deploy and manage LLM applications on GKE by using GKE Inference Gateway.
Before reading this page, ensure that you're familiar with the following concepts:
Background
This section describes the key technologies used in this tutorial. For more information about model serving concepts and terminology, and how GKE generative AI capabilities can enhance and support your model serving performance, see About model inference on GKE.
vLLM
vLLM is a highly optimized open source LLM serving framework that increases serving throughput on GPUs, with features such as the following:
- Optimized transformer implementation with PagedAttention
- Continuous batching to improve the overall serving throughput
- Tensor parallelism and distributed serving across multiple GPUs
For more information, see the vLLM documentation.
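If you want to try the same vLLM OpenAI-compatible server outside of GKE first, a minimal sketch is shown below. It assumes a machine with an NVIDIA GPU, Docker with the NVIDIA Container Toolkit, and a Hugging Face token exported as HF_TOKEN; the container image and model name match the ones used later in this tutorial.

docker run --gpus all -p 8000:8000 \
    -e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct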
GKE Inference Gateway
GKE Inference Gateway extends the capabilities of GKE for serving LLMs. It optimizes inference workloads with features such as the following:
- Inference-optimized load balancing based on load metrics.
- Dense multi-workload serving of LoRA adapters.
- Model-aware routing for simplified operations.
For more information, see About GKE Inference Gateway.
Objectives
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the required API.
- Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin.

Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account, if you don't already have one.
- Ensure your project has sufficient quota for H100 GPUs. To learn more, see Plan GPU quota and Allocation quotas.
Get access to the model
To deploy the Llama3.1 model to GKE, sign the license consent agreement and generate a Hugging Face access token.
Sign the license consent agreement
You must sign the consent agreement to use the Llama3.1 model. Follow these steps:
- Access the consent page and verify consent using your Hugging Face account.
- Accept the model terms.
Generate an access token
To access the model through Hugging Face, you need a Hugging Face token.
Follow these steps to generate a new token if you don't have one already:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a name of your choice and a role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
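Optionally, you can confirm that the token is valid before you use it in the cluster. The following sketch assumes the token is exported as HF_TOKEN and uses the public Hugging Face Hub whoami endpoint to return the account associated with the token:

curl -s -H "Authorization: Bearer ${HF_TOKEN}" https://huggingface.co/api/whoami-v2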
Prepare your environment
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HF_TOKEN=HF_TOKEN
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for H100 GPUs.
- CLUSTER_NAME: the name of your cluster.
- HF_TOKEN: the Hugging Face token you generated earlier.
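As a quick sanity check, you can confirm that the variables are set in your current Cloud Shell session before continuing, for example:

echo "Project: ${PROJECT_ID}  Region: ${REGION}  Cluster: ${CLUSTER_NAME}"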
Create and configure Google Cloud resources
To create the required resources, use these instructions.
Create a GKE cluster and node pool
Serve LLMs on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Autopilot
In Cloud Shell, run the following command:
gcloud container clusters create-auto CLUSTER_NAME \
--project=PROJECT_ID \
--region=REGION \
--release-channel=rapid
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for H100 GPUs.
- CLUSTER_NAME: the name of your cluster.
GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.
Standard
In Cloud Shell, run the following command to create a Standard cluster:
gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --release-channel=rapid \
    --num-nodes=1 \
    --enable-managed-prometheus \
    --monitoring=SYSTEM,DCGM
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for H100 GPUs.
- CLUSTER_NAME: the name of your cluster.
The cluster creation might take several minutes.
To create a node pool with the appropriate disk size for running the Llama-3.1-8B-Instruct model, run the following command:
gcloud container node-pools create gpupool \
    --accelerator type=nvidia-h100-80gb,count=2,gpu-driver-version=latest \
    --project=PROJECT_ID \
    --location=REGION \
    --node-locations=REGION-a \
    --cluster=CLUSTER_NAME \
    --machine-type=a3-highgpu-2g \
    --num-nodes=1 \
    --disk-type="pd-standard"
GKE creates a single node pool containing H100 GPUs.
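Optionally, you can verify that the node pool was created with the expected accelerator by describing it with gcloud; after you configure kubectl later in this tutorial, you can also list the GPU nodes by their accelerator label. Both commands below use the names from this tutorial:

gcloud container node-pools describe gpupool \
    --cluster=CLUSTER_NAME \
    --location=REGION
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-h100-80gb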
To set up authorization to scrape metrics, create the inference-gateway-sa-metrics-reader-secret secret:
kubectl apply -f - <<EOF
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-gateway-metrics-reader
rules:
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: inference-gateway-sa-metrics-reader
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: inference-gateway-sa-metrics-reader-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: inference-gateway-sa-metrics-reader
  namespace: default
roleRef:
  kind: ClusterRole
  name: inference-gateway-metrics-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Secret
metadata:
  name: inference-gateway-sa-metrics-reader-secret
  namespace: default
  annotations:
    kubernetes.io/service-account.name: inference-gateway-sa-metrics-reader
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-gateway-sa-metrics-reader-secret-read
rules:
- resources:
  - secrets
  apiGroups: [""]
  verbs: ["get", "list", "watch"]
  resourceNames: ["inference-gateway-sa-metrics-reader-secret"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gmp-system:collector:inference-gateway-sa-metrics-reader-secret-read
  namespace: default
roleRef:
  name: inference-gateway-sa-metrics-reader-secret-read
  kind: ClusterRole
  apiGroup: rbac.authorization.k8s.io
subjects:
- name: collector
  namespace: gmp-system
  kind: ServiceAccount
EOF
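Optionally, verify that the RBAC objects and the token Secret were created. The names below are the ones defined in the manifest you just applied:

kubectl get clusterrole inference-gateway-metrics-reader
kubectl get secret inference-gateway-sa-metrics-reader-secret -n default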
Create a Kubernetes secret for Hugging Face credentials
In Cloud Shell, do the following:
Configure kubectl to communicate with your cluster:
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=REGION
Replace the following values:
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for H100 GPUs.
- CLUSTER_NAME: the name of your cluster.
Create a Kubernetes Secret that contains the Hugging Face token:
kubectl create secret generic HF_SECRET \
    --from-literal=token=HF_TOKEN \
    --dry-run=client -o yaml | kubectl apply -f -
Replace the following:
- HF_TOKEN: the Hugging Face token you generated earlier.
- HF_SECRET: the name for your Kubernetes Secret. For example, hf-secret.
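Optionally, confirm that the Secret exists before you deploy the model server. Use the Secret name you chose; note that the example Deployment manifest later in this tutorial reads the token from a Secret named hf-token, so the name you create must match the name that the manifest references:

kubectl get secret HF_SECRET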
Install the InferenceModel and InferencePool CRDs
In this section, you install the necessary Custom Resource Definitions (CRDs) for GKE Inference Gateway.
CRDs extend the Kubernetes API, which lets you define new resource types. To use GKE Inference Gateway, install the InferencePool and InferenceModel CRDs in your GKE cluster by running the following command:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml
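To confirm that the CRDs were installed, you can list them by name. The names below follow the inference.networking.x-k8s.io API group used by the resources in this tutorial:

kubectl get crd inferencepools.inference.networking.x-k8s.io inferencemodels.inference.networking.x-k8s.io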
Deploy the model server
This example deploys a Llama3.1 model by using a vLLM model server. The deployment is labeled as app:vllm-llama3-8b-instruct. The deployment also uses two LoRA adapters named food-review and cad-fabricator from Hugging Face. You can update this deployment with your own model server and model container, serving port, and deployment name. You can optionally configure LoRA adapters in the deployment, or deploy the base model.
To deploy on an nvidia-h100-80gb accelerator type, save the following manifest as vllm-llama3-8b-instruct.yaml. This manifest defines a Kubernetes Deployment with your model and model server:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct
    spec:
      containers:
      - name: vllm
        image: "vllm/vllm-openai:latest"
        imagePullPolicy: Always
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--tensor-parallel-size"
        - "1"
        - "--port"
        - "8000"
        - "--enable-lora"
        - "--max-loras"
        - "2"
        - "--max-cpu-loras"
        - "12"
        env:
        # Enabling LoRA support temporarily disables automatic v1, we want to force it on
        # until 0.8.3 vLLM is released.
        - name: VLLM_USE_V1
          value: "1"
        - name: PORT
          value: "8000"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
          value: "true"
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        lifecycle:
          preStop:
            # vLLM stops accepting connections when it receives SIGTERM, so we need to sleep
            # to give upstream gateways a chance to take us out of rotation. The time we wait
            # is dependent on the time it takes for all upstreams to completely remove us from
            # rotation. Older or simpler load balancers might take upwards of 30s, but we expect
            # our deployment to run behind a modern gateway like Envoy which is designed to
            # probe for readiness aggressively.
            sleep:
              # Upstream gateway probers for health should be set on a low period, such as 5s,
              # and the shorter we can tighten that bound the faster that we release
              # accelerators during controlled shutdowns. However, we should expect variance,
              # as load balancers may have internal delays, and we don't want to drop requests
              # normally, so we're often aiming to set this value to a p99 propagation latency
              # of readiness -> load balancer taking backend out of rotation, not the average.
              #
              # This value is generally stable and must often be experimentally determined
              # for a given load balancer and health check period. We set the value here to
              # the highest value we observe on a supported load balancer, and we recommend
              # tuning this value down and verifying no requests are dropped.
              #
              # If this value is updated, be sure to update terminationGracePeriodSeconds.
              #
              seconds: 30
            #
            # IMPORTANT: preStop.sleep is beta as of Kubernetes 1.30 - for older versions
            # replace with this exec action.
            #exec:
            #  command:
            #  - /usr/bin/sleep
            #  - 30
        livenessProbe:
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          # vLLM's health check is simple, so we can more aggressively probe it. Liveness
          # check endpoints should always be suitable for aggressive probing.
          periodSeconds: 1
          successThreshold: 1
          # vLLM has a very simple health implementation, which means that any failure is
          # likely significant. However, any liveness triggered restart requires the very
          # large core model to be reloaded, and so we should bias towards ensuring the
          # server is definitely unhealthy vs immediately restarting. Use 5 attempts as
          # evidence of a serious problem.
          failureThreshold: 5
          timeoutSeconds: 1
        readinessProbe:
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          # vLLM's health check is simple, so we can more aggressively probe it. Readiness
          # check endpoints should always be suitable for aggressive probing, but may be
          # slightly more expensive than readiness probes.
          periodSeconds: 1
          successThreshold: 1
          # vLLM has a very simple health implementation, which means that any failure is
          # likely significant,
          failureThreshold: 1
          timeoutSeconds: 1
        # We set a startup probe so that we don't begin directing traffic or checking
        # liveness to this instance until the model is loaded.
        startupProbe:
          # Failure threshold is when we believe startup will not happen at all, and is set
          # to the maximum possible time we believe loading a model will take. In our
          # default configuration we are downloading a model from HuggingFace, which may
          # take a long time, then the model must load into the accelerator. We choose
          # 10 minutes as a reasonable maximum startup time before giving up and attempting
          # to restart the pod.
          #
          # IMPORTANT: If the core model takes more than 10 minutes to load, pods will crash
          # loop forever. Be sure to set this appropriately.
          failureThreshold: 3600
          # Set delay to start low so that if the base model changes to something smaller
          # or an optimization is deployed, we don't wait unnecessarily.
          initialDelaySeconds: 2
          # As a startup probe, this stops running and so we can more aggressively probe
          # even a moderately complex startup - this is a very important workload.
          periodSeconds: 1
          httpGet:
            # vLLM does not start the OpenAI server (and hence make /health available)
            # until models are loaded. This may not be true for all model servers.
            path: /health
            port: http
            scheme: HTTP
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /data
          name: data
        - mountPath: /dev/shm
          name: shm
        - name: adapters
          mountPath: "/adapters"
      initContainers:
      - name: lora-adapter-syncer
        tty: true
        stdin: true
        image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
        restartPolicy: Always
        imagePullPolicy: Always
        env:
        - name: DYNAMIC_LORA_ROLLOUT_CONFIG
          value: "/config/configmap.yaml"
        volumeMounts: # DO NOT USE subPath, dynamic configmap updates don't work on subPaths
        - name: config-volume
          mountPath: /config
      restartPolicy: Always
      # vLLM allows VLLM_PORT to be specified as an environment variable, but a user might
      # create a 'vllm' service in their namespace. That auto-injects VLLM_PORT in docker
      # compatible form as `tcp://<IP>:<PORT>` instead of the numeric value vLLM accepts
      # causing CrashLoopBackoff. Set service environment injection off by default.
      enableServiceLinks: false
      # Generally, the termination grace period needs to last longer than the slowest request
      # we expect to serve plus any extra time spent waiting for load balancers to take the
      # model server out of rotation.
      #
      # An easy starting point is the p99 or max request latency measured for your workload,
      # although LLM request latencies vary significantly if clients send longer inputs or
      # trigger longer outputs. Since steady state p99 will be higher than the latency
      # to drain a server, you may wish to adjust this value slightly either experimentally or
      # via the calculation below.
      #
      # For most models you can derive an upper bound for the maximum drain latency as
      # follows:
      #
      # 1. Identify the maximum context length the model was trained on, or the maximum
      #    allowed length of output tokens configured on vLLM (llama2-7b was trained to
      #    4k context length, while llama3-8b was trained to 128k).
      # 2. Output tokens are the more compute intensive to calculate and the accelerator
      #    will have a maximum concurrency (batch size) - the time per output token at
      #    maximum batch with no prompt tokens being processed is the slowest an output
      #    token can be generated (for this model it would be about 100ms TPOT at a max
      #    batch size around 50)
      # 3. Calculate the worst case request duration if a request starts immediately
      #    before the server stops accepting new connections - generally when it receives
      #    SIGTERM (for this model that is about 4096 / 10 ~ 40s)
      # 4. If there are any requests generating prompt tokens that will delay when those
      #    output tokens start, and prompt token generation is roughly 6x faster than
      #    compute-bound output token generation, so add 20% to the time from above (40s +
      #    16s ~ 55s)
      #
      # Thus we think it will take us at worst about 55s to complete the longest possible
      # request the model is likely to receive at maximum concurrency (highest latency)
      # once requests stop being sent.
      #
      # NOTE: This number will be lower than steady state p99 latency since we stop receiving
      #       new requests which require continuous prompt token computation.
      # NOTE: The max timeout for backend connections from gateway to model servers should
      #       be configured based on steady state p99 latency, not drain p99 latency
      #
      # 5. Add the time the pod takes in its preStop hook to allow the load balancers to
      #    stop sending us new requests (55s + 30s ~ 85s)
      #
      # Because the termination grace period controls when the Kubelet forcibly terminates a
      # stuck or hung process (a possibility due to a GPU crash), there is operational safety
      # in keeping the value roughly proportional to the time to finish serving. There is also
      # value in adding a bit of extra time to deal with unexpectedly long workloads.
      #
      # 6. Add a 50% safety buffer to this time since the operational impact should be low
      #    (85s * 1.5 ~ 130s)
      #
      # One additional source of drain latency is that some workloads may run close to
      # saturation and have queued requests on each server. Since traffic in excess of the
      # max sustainable QPS will result in timeouts as the queues grow, we assume that failure
      # to drain in time due to excess queues at the time of shutdown is an expected failure
      # mode of server overload. If your workload occasionally experiences high queue depths
      # due to periodic traffic, consider increasing the safety margin above to account for
      # time to drain queued requests.
      terminationGracePeriodSeconds: 130
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-h100-80gb"
      volumes:
      - name: data
        emptyDir: {}
      - name: shm
        emptyDir:
          medium: Memory
      - name: adapters
        emptyDir: {}
      - name: config-volume
        configMap:
          name: vllm-llama3-8b-adapters
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama3-8b-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama3.1-8b-instruct
      port: 8000
      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
      ensureExist:
        models:
        - id: food-review
          source: Kawon/llama3.1-food-finetune_v14_r8
        - id: cad-fabricator
          source: redcathode/fabricator
---
kind: HealthCheckPolicy
apiVersion: networking.gke.io/v1
metadata:
  name: health-check-policy
  namespace: default
spec:
  targetRef:
    group: "inference.networking.x-k8s.io"
    kind: InferencePool
    name: vllm-llama3-8b-instruct
  default:
    config:
      type: HTTP
      httpHealthCheck:
        requestPath: /health
        port: 8000
Apply the manifest to your cluster:
kubectl apply -f vllm-llama3-8b-instruct.yaml
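Rolling out the Deployment can take several minutes, because each replica downloads the model from Hugging Face and loads it onto the accelerator. To optionally watch the rollout, you can use the Deployment name and label from the manifest:

kubectl rollout status deployment/vllm-llama3-8b-instruct
kubectl get pods -l app=vllm-llama3-8b-instruct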
Create an InferencePool resource
The InferencePool Kubernetes custom resource defines a group of Pods with a common base LLM and compute configuration.
The InferencePool custom resource includes the following key fields:
- selector: specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods.
- targetPort: defines the port used by the model server within the Pods.
The InferencePool resource lets GKE Inference Gateway route traffic to your model server Pods.
To create an InferencePool by using Helm, perform the following steps:
helm install vllm-llama3-8b-instruct \
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
--set provider.name=gke \
--set healthCheckPolicy.create=false \
--version v0.3.0 \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
Change the following field to match your deployment:
- inferencePool.modelServers.matchLabels.app: the key of the label used to select your model server Pods.
This command creates an InferencePool object that logically represents your model server deployment, and references the model endpoint services within the Pods that the Selector selects.
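Optionally, verify that the chart installed and that the InferencePool exists. This sketch assumes the Helm release and the pool are both named vllm-llama3-8b-instruct, matching the command above:

helm status vllm-llama3-8b-instruct
kubectl get inferencepool vllm-llama3-8b-instruct -o yaml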
Create an InferenceModel resource with a serving criticality
The InferenceModel Kubernetes custom resource defines a specific model, including LoRA-tuned models, and its serving criticality.
The InferenceModel custom resource includes the following key fields:
- modelName: specifies the name of the base model or LoRA adapter.
- Criticality: specifies the serving criticality of the model.
- poolRef: references the InferencePool that the model is served on.
The InferenceModel lets GKE Inference Gateway route traffic to your model server Pods based on the model name and criticality.
To create an InferenceModel, perform the following steps:
Save the following sample manifest as inferencemodel.yaml:
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: MODEL_NAME
  criticality: CRITICALITY
  poolRef:
    name: INFERENCE_POOL_NAME
Replace the following:
- MODEL_NAME: the name of your base model or LoRA adapter. For example, food-review.
- CRITICALITY: the serving criticality you choose. Choose from Critical, Standard, or Sheddable. For example, Standard.
- INFERENCE_POOL_NAME: the name of the InferencePool you created in the previous step. For example, vllm-llama3-8b-instruct.
Apply the sample manifest to your cluster:
kubectl apply -f inferencemodel.yaml
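Optionally, confirm that the InferenceModel object was created. The resource name below comes from the sample manifest:

kubectl get inferencemodel inferencemodel-sample -o yaml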
The following example creates an InferenceModel object that configures the food-review LoRA model on the vllm-llama3-8b-instruct InferencePool with a Standard serving criticality. The InferenceModel object also configures the base model to be served with a Critical priority level.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct
  targetModels:
  - name: food-review
    weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama3-base-model
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  criticality: Critical
  poolRef:
    name: vllm-llama3-8b-instruct
Create the Gateway
The Gateway resource acts as the entry point for external traffic into your Kubernetes cluster. It defines the listeners that accept incoming connections.
GKE Inference Gateway supports the gke-l7-rilb and gke-l7-regional-external-managed Gateway classes. For more information, see Gateway classes in the GKE documentation.
To create a Gateway, perform the following steps:
Save the following sample manifest as gateway.yaml:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: GATEWAY_NAME
spec:
  gatewayClassName: gke-l7-regional-external-managed
  listeners:
  - protocol: HTTP # Or HTTPS for production
    port: 80 # Or 443 for HTTPS
    name: http
Replace GATEWAY_NAME with a unique name for your Gateway resource. For example, inference-gateway.
Apply the manifest to your cluster:
kubectl apply -f gateway.yaml
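Provisioning the load balancer can take several minutes. Optionally, you can wait for the Gateway to be programmed and then check that it has been assigned an address; the Programmed condition is part of the standard Gateway API status:

kubectl wait --for=condition=Programmed gateway/GATEWAY_NAME --timeout=15m
kubectl get gateway GATEWAY_NAME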
Create the HTTPRoute resource
In this section, you create an HTTPRoute resource that defines how the Gateway routes incoming HTTP requests to your InferencePool.
The HTTPRoute resource defines how the GKE Gateway routes incoming HTTP requests to backend services, which in this case is your InferencePool. It specifies matching rules (for example, headers or paths) and the backend to which traffic should be forwarded.
To create an HTTPRoute, perform the following steps:
Save the following sample manifest as httproute.yaml:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: HTTPROUTE_NAME
spec:
  parentRefs:
  - name: GATEWAY_NAME
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: PATH_PREFIX
    backendRefs:
    - name: INFERENCE_POOL_NAME
      group: inference.networking.x-k8s.io
      kind: InferencePool
Replace the following:
- HTTPROUTE_NAME: a unique name for your HTTPRoute resource. For example, my-route.
- GATEWAY_NAME: the name of the Gateway resource that you created. For example, inference-gateway.
- PATH_PREFIX: the path prefix used to match incoming requests. For example, / to match all requests.
- INFERENCE_POOL_NAME: the name of the InferencePool resource that you want to route traffic to. For example, vllm-llama3-8b-instruct.
Apply the manifest to your cluster:
kubectl apply -f httproute.yaml
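Optionally, check that the route was accepted by the Gateway and that its backend reference to the InferencePool resolved; the conditions reported in the status are standard Gateway API conditions:

kubectl get httproute HTTPROUTE_NAME -o yaml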
Send an inference request
After you configure GKE Inference Gateway, you can send inference requests to your deployed model.
To send inference requests, perform the following steps:
- Retrieve the Gateway endpoint.
- Construct a properly formatted JSON request.
- Use curl to send the request to the /v1/completions endpoint.
This generates text based on your input prompt and the specified parameters.
To get the Gateway endpoint, run the following commands:
IP=$(kubectl get gateway/GATEWAY_NAME -o jsonpath='{.status.addresses[0].value}')
PORT=PORT_NUMBER # Use 443 for HTTPS, or 80 for HTTP
Replace the following:
- GATEWAY_NAME: the name of your Gateway resource.
- PORT_NUMBER: the port number that you configured in the Gateway.
To send a request to the /v1/completions endpoint by using curl, run the following command:
curl -i -X POST https://${IP}:${PORT}/v1/completions \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -d '{
        "model": "MODEL_NAME",
        "prompt": "PROMPT_TEXT",
        "max_tokens": MAX_TOKENS,
        "temperature": TEMPERATURE
    }'
Replace the following:
- MODEL_NAME: the name of the model or LoRA adapter to use.
- PROMPT_TEXT: the input prompt for the model.
- MAX_TOKENS: the maximum number of tokens to generate in the response.
- TEMPERATURE: controls the randomness of the output. Use the value 0 for deterministic output, or a higher number for more creative output.
Be aware of the following behaviors:
- Request body: the request body can include additional parameters such as stop and top_p. For a complete list of options, see the OpenAI API specification.
- Error handling: implement proper error handling in your client code to handle potential errors in the response. For example, check the HTTP status code in the curl response. A non-200 status code generally indicates an error (see the sketch after this list).
- Authentication and authorization: for production deployments, secure your API endpoint with authentication and authorization mechanisms. Include the appropriate headers (for example, Authorization) in your requests.
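As a minimal sketch of the error handling described above, the following shell snippet separates the HTTP status code from the response body and branches on it. It reuses the IP and PORT variables from the earlier step; the model name and prompt shown here are example values (food-review is one of the LoRA adapters deployed in this tutorial):

STATUS=$(curl -s -o /tmp/completion.json -w "%{http_code}" -X POST https://${IP}:${PORT}/v1/completions \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -d '{"model": "food-review", "prompt": "Write a one-sentence review of a taco truck.", "max_tokens": 50, "temperature": 0}')
if [ "${STATUS}" -ne 200 ]; then
    echo "Request failed with HTTP status ${STATUS}:"
fi
cat /tmp/completion.json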
Configure observability for Inference Gateway
GKE Inference Gateway provides observability into the health, performance, and behavior of your inference workloads. This helps you identify and resolve issues, optimize resource utilization, and ensure the reliability of your applications. You can view these observability metrics in Cloud Monitoring through Metrics Explorer.
To configure observability for GKE Inference Gateway, see Configure observability.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:
gcloud container clusters delete CLUSTER_NAME \
--region=REGION
Replace the following values:
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for H100 GPUs.
- CLUSTER_NAME: the name of your cluster.