Control autoscaled node attributes with custom ComputeClasses

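This document shows you how to use custom ComputeClasses to control the compute infrastructure and autoscaling behavior of your Google Kubernetes Engine (GKE) clusters based on the specific needs of your workloads.

This document is intended for platform administrators who want to declaratively define autoscaling profiles for nodes, and for cluster operators who want to run their workloads on specific ComputeClasses.

About custom ComputeClasses

A custom ComputeClass is a Kubernetes custom resource that lets you define the priorities that GKE follows when it provisions nodes to run your workloads. You can use a custom ComputeClass to do the following:

- Give GKE a set of priorities to follow in order when provisioning nodes, each with specific parameters such as a Compute Engine machine series or a minimum resource capacity
- Define autoscaling thresholds and parameters for removing underutilized nodes and consolidating workloads efficiently on existing compute capacity
- Tell GKE to automatically replace less preferred node configurations with more preferred node configurations for optimized workload performance

For details about all of the configuration options, how they interact with each other, and how they interact with GKE Autopilot mode and GKE Standard mode, see About custom ComputeClasses.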

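Pricing

The ComputeClass custom resource is available in GKE at no extra cost. The following pricing considerations apply:

- GKE Autopilot mode: you're billed using the node-based billing mode. For details, see Autopilot mode pricing.
- GKE Standard mode: see Standard mode pricing.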

Limitations

The name of a ComputeClass can't start with gke or autopilot.

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

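If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set compute/zone instead. Setting a default location helps you avoid gcloud CLI errors such as the following: One of [--zone, --region] must be supplied: Please specify location. If your cluster is in a different location from the default that you set, you might need to specify the location in some commands.

Ensure that you have an existing GKE cluster that runs version 1.30.3-gke.1451000 or later. For more information, see Create an Autopilot cluster.
If you're using a Standard mode cluster, ensure that you meet one of the following requirements:
- Enable autoscaling on at least one node pool in the cluster.
- If your Standard cluster runs a version earlier than 1.33.3-gke.1136000 and isn't enrolled in the Rapid release channel, enable cluster-level node auto-provisioning.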
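
The version requirements above can be checked from a shell. The following is a minimal sketch, not an official tool: version_at_least is a hypothetical helper name, and it assumes a sort command that supports version sort (the -V option, as in GNU coreutils).

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Assumes that sort supports version sort (-V), as GNU coreutils does.
version_at_least() {
  lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
  [ "$lowest" = "$2" ]
}

# Example usage (CLUSTER_NAME and LOCATION are placeholders):
#   current="$(gcloud container clusters describe CLUSTER_NAME \
#       --location=LOCATION --format='value(currentMasterVersion)')"
#   version_at_least "$current" 1.30.3-gke.1451000 && echo "version requirement met"
```

The same helper can compare against 1.33.3-gke.1136000 to decide whether a Standard cluster might also need cluster-level node auto-provisioning.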

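Example scenario for ComputeClass

This document presents an example scenario for which you define a custom ComputeClass. In practice, you should consider the requirements of your specific workloads and organization, and define ComputeClasses that meet those requirements. For full descriptions of all ComputeClass options and for special considerations, see About custom ComputeClasses.

Consider the following example scenario:

- Your goal is to optimize running costs for your workloads
- Your workloads are fault-tolerant and don't require graceful shutdown or extended runtime
- Your workloads need at least 64 vCPUs to run optimally
- You're limited to using the N4 Compute Engine machine series

Based on the example scenario, you decide that you want a ComputeClass that does the following:

- Prioritizes N4 Spot nodes that have at least 64 vCPUs
- Lets GKE fall back to any N4 Spot node, regardless of compute capacity
- Lets GKE use on-demand N4 nodes if no N4 Spot nodes are available
- Tells GKE to move your workloads to Spot nodes whenever they're available again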
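
The fallback order in this list can be pictured with a toy model. The sketch below is only an illustration of "try each priority in order and take the first one that is available". It is not how GKE is implemented, and the function name and the three rule labels (n4-spot-64, n4-spot-any, n4-on-demand) are hypothetical names invented for this example.

```shell
# Toy model of priority fallback, not GKE's actual logic.
# $1 is a space-separated list of options that currently have capacity.
# Prints the first rule, in priority order, that appears in that list.
pick_priority() {
  available=" $1 "
  for rule in n4-spot-64 n4-spot-any n4-on-demand; do
    case "$available" in
      *" $rule "*) printf '%s\n' "$rule"; return 0 ;;
    esac
  done
  return 1   # nothing in the priority list is available
}

# Only smaller Spot nodes and on-demand nodes have capacity here,
# so the second priority wins.
pick_priority "n4-on-demand n4-spot-any"
```

In this model, the last bullet in the list corresponds to re-running the selection later and moving to a higher priority if one has become available.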

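Configure a ComputeClass in Autopilot mode

In GKE Autopilot, you define a ComputeClass, deploy it to the cluster, and request that ComputeClass in your workloads. GKE performs any node configuration steps for you, such as applying labels and taints.

Save the following manifest as compute-class.yaml: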

```
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: cost-optimized
spec:
  priorities:
  - machineFamily: n4
    spot: true
    minCores: 64
  - machineFamily: n4
    spot: true
  - machineFamily: n4
    spot: false
  activeMigration:
    optimizeRulePriority: true
  nodePoolAutoCreation:
    enabled: true
```

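Configure a ComputeClass in Standard mode

In GKE Standard mode clusters, you might have to perform manual configuration to ensure that your ComputeClass Pods schedule as expected. Whether manual configuration is needed depends on whether your node pools are created automatically, as follows:

- Automatically created node pools: no manual configuration is required. GKE performs the ComputeClass configuration steps for you. For details, see Node pool auto-creation and ComputeClasses.
- Manually created node pools: manual configuration is required. You must add node labels and node taints to your manually created node pools to associate the nodes with a specific ComputeClass. For details, see Configure manually created node pools for ComputeClass use.

To let GKE automatically create node pools for your ComputeClass, follow these steps:

For Standard mode clusters that run a version earlier than 1.33.3-gke.1136000 and aren't enrolled in the Rapid release channel, enable cluster-level node auto-provisioning.

Save the following example manifest as compute-class.yaml: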

```
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: cost-optimized
spec:
  priorities:
  - machineFamily: n4
    spot: true
    minCores: 64
  - machineFamily: n4
    spot: true
  - machineFamily: n4
    spot: false
  activeMigration:
    optimizeRulePriority: true
  nodePoolAutoCreation:
    enabled: true
```

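When you deploy Pods that request this example ComputeClass and new nodes need to be created, GKE creates nodes in the order of the items in the priorities field. If required, GKE creates new node pools that meet the hardware requirements of the ComputeClass.

You can also specify an exact custom machine type in your priorities. Using custom machine types requires GKE version 1.33.2-gke.1111000 or later. The following example configures a ComputeClass that prioritizes Spot VMs of the n4-custom-8-20480 custom machine type and falls back to on-demand VMs of the same type if Spot capacity is unavailable: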

```
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: custom-machine-type
spec:
  priorities:
  - machineType: n4-custom-8-20480
    spot: true
  - machineType: n4-custom-8-20480
    spot: false
  nodePoolAutoCreation:
    enabled: true
```
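
The name n4-custom-8-20480 follows the Compute Engine naming pattern for custom machine types: the machine family, the word custom, the number of vCPUs, and the memory in MB, so this type has 8 vCPUs and 20480 MB (20 GiB) of memory. As a small sketch, a name like this could be assembled as follows. The helper name custom_machine_type is hypothetical, and the helper doesn't validate the vCPU and memory combinations that Compute Engine allows.

```shell
# Hypothetical helper: build a custom machine type name from a machine
# family, a vCPU count, and a memory size in GiB (converted to MB).
# It does not check Compute Engine's limits on vCPU and memory ratios.
custom_machine_type() {
  printf '%s-custom-%d-%d\n' "$1" "$2" "$(( $3 * 1024 ))"
}

custom_machine_type n4 8 20   # prints n4-custom-8-20480
```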

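Use ComputeClasses with manually created node pools

This section shows you how to define a ComputeClass in a cluster that uses only manually created node pools.

Save the following manifest as compute-class.yaml: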

```
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: cost-optimized
spec:
  priorities:
  - machineFamily: n4
    spot: true
    minCores: 64
  - machineFamily: n4
    spot: true
  - machineFamily: n4
    spot: false
  activeMigration:
    optimizeRulePriority: true
```

Create a new autoscaled node pool that uses Spot VMs and associate it with the ComputeClass:

```
gcloud container node-pools create cost-optimized-pool \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --machine-type=n4-standard-64 \
    --spot \
    --enable-autoscaling \
    --max-nodes=9 \
    --node-labels="cloud.google.com/compute-class=cost-optimized" \
    --node-taints="cloud.google.com/compute-class=cost-optimized:NoSchedule"
```

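Replace the following:

- LOCATION: the location of your cluster.
- CLUSTER_NAME: the name of your existing cluster.

Create a new autoscaled node pool that uses on-demand VMs and associate it with the ComputeClass: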

```
gcloud container node-pools create on-demand-pool \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --machine-type=n4-standard-64 \
    --enable-autoscaling \
    --max-nodes=9 \
    --num-nodes=0 \
    --node-labels="cloud.google.com/compute-class=cost-optimized" \
    --node-taints="cloud.google.com/compute-class=cost-optimized:NoSchedule"
```

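When you deploy Pods that request this ComputeClass and new nodes need to be created, GKE prioritizes creating nodes in the cost-optimized-pool node pool. If new nodes can't be created there, GKE creates nodes in the on-demand-pool node pool.

For more details about how manually created node pools interact with custom ComputeClasses, see Configure manually created node pools for ComputeClass use.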
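
To see which nodes ended up associated with the ComputeClass, you can filter nodes by the label that the node pools were created with. The following is a minimal sketch: list_class_nodes is a hypothetical helper name, and the extra columns assume that GKE sets the cloud.google.com/gke-nodepool and cloud.google.com/gke-spot labels on the nodes.

```shell
# Hypothetical helper: list the nodes that carry the ComputeClass label for
# the class named in $1, with node pool and Spot labels shown as columns.
list_class_nodes() {
  kubectl get nodes \
    -l "cloud.google.com/compute-class=$1" \
    -L cloud.google.com/gke-nodepool,cloud.google.com/gke-spot
}

# Example usage:
#   list_class_nodes cost-optimized
```

An empty result is expected until Pods that request the ComputeClass cause the node pools to scale up.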

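Customize autoscaling thresholds for node consolidation

By default, GKE removes underutilized nodes and reschedules your workloads onto other available nodes. You can further customize the thresholds and the amount of time after which a node becomes a candidate for removal by using the autoscalingPolicy field in the ComputeClass definition, as in the following example: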

```
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: cost-optimized
spec:
  priorities:
  - machineFamily: n4
    spot: true
    minCores: 64
  - machineFamily: n4
    spot: true
  - machineFamily: n4
    spot: false
  activeMigration:
    optimizeRulePriority: true
  autoscalingPolicy:
    consolidationDelayMinutes: 5
    consolidationThreshold: 70
```

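In this example, a node becomes a candidate for removal if it's underutilized by 70% of its available CPU and memory capacity for more than five minutes. For a list of available parameters, see Set autoscaling parameters for node consolidation.

Deploy a ComputeClass in a cluster

After you define a ComputeClass, deploy it to the cluster: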

```
kubectl apply -f compute-class.yaml
```

The ComputeClass is now ready to use in the cluster. You can request the ComputeClass in Pod specifications or, optionally, set it as the default ComputeClass in a specific namespace.
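
As a sketch of both follow-up actions, the helpers below read the deployed ComputeClass back from the cluster and label a namespace so that the class applies by default. The helper names are hypothetical. The namespace label key cloud.google.com/default-compute-class is an assumption based on GKE's documentation for default ComputeClasses, so verify it against the documentation for your GKE version before relying on it.

```shell
# Hypothetical helper: print the deployed ComputeClass named in $1 as YAML.
show_class() {
  kubectl get computeclass "$1" -o yaml
}

# Hypothetical helper: make ComputeClass $2 the default in namespace $1.
# Assumption: the label key below is the one GKE uses for namespace defaults.
set_namespace_default_class() {
  kubectl label namespace "$1" \
    "cloud.google.com/default-compute-class=$2" --overwrite
}

# Example usage (NAMESPACE is a placeholder):
#   show_class cost-optimized
#   set_namespace_default_class NAMESPACE cost-optimized
```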

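Request a ComputeClass in a workload

To request a ComputeClass in a workload, add a node selector for that ComputeClass to your manifest, as in the following steps:

Save the following manifest as cc-workload.yaml: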

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: custom-workload
  template:
    metadata:
      labels:
        app: custom-workload
    spec:
      nodeSelector:
        cloud.google.com/compute-class: cost-optimized
      containers:
      - name: test
        image: gcr.io/google_containers/pause
        resources:
          requests:
            cpu: 1.5
            memory: "4Gi"
```

Deploy the workload:
```
kubectl apply -f cc-workload.yaml
```

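When you deploy this workload, GKE automatically adds a toleration to the Pods that corresponds to the node taint for the requested ComputeClass. This toleration ensures that only Pods that request the ComputeClass run on ComputeClass nodes.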
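
One way to observe this behavior is to list the Pods together with the node that each one landed on and the keys of their tolerations. The following is a minimal sketch: show_class_pods is a hypothetical helper name, and the column names are arbitrary labels chosen for this example.

```shell
# Hypothetical helper: for Pods labeled app=$1, show the Pod name, the node
# it is scheduled on, and the keys of the tolerations in its Pod spec.
show_class_pods() {
  kubectl get pods -l "app=$1" \
    -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName,TOLERATION_KEYS:.spec.tolerations[*].key'
}

# Example usage:
#   show_class_pods custom-workload
```

If the toleration was added as described, cloud.google.com/compute-class should appear among the toleration keys.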

Update a deployed ComputeClass

To update a deployed ComputeClass, modify the YAML manifest for that ComputeClass. Then, deploy the modified manifest by running the following command:

```
kubectl apply -f PATH_TO_FILE
```

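Replace PATH_TO_FILE with the path to your modified manifest. Ensure that the value in the name field remains unchanged.

When you deploy your updated ComputeClass, GKE uses the updated configuration to create new nodes. GKE doesn't modify any existing nodes to match your updated configuration.

Over time, GKE might move existing Pods to nodes that use your updated configuration if the ComputeClass uses active migration and if the existing Pods are eligible to migrate.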
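
Because an update changes how new nodes are created, it can help to preview the change before applying it. The following is a minimal sketch that wraps the apply command shown earlier with kubectl diff. The helper name update_class is hypothetical, and the sketch relies on kubectl diff returning exit code 0 for no differences, 1 for differences, and a higher code for errors.

```shell
# Hypothetical helper: show what would change for the manifest in $1,
# then apply it unless kubectl diff itself failed.
update_class() {
  manifest="$1"
  status=0
  kubectl diff -f "$manifest" || status=$?
  # Exit code 1 only means that differences were found; above 1 is an error.
  if [ "$status" -gt 1 ]; then
    echo "kubectl diff failed for $manifest" >&2
    return "$status"
  fi
  kubectl apply -f "$manifest"
}

# Example usage (PATH_TO_FILE is a placeholder):
#   update_class PATH_TO_FILE
```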