NVIDIA Data Center GPU Manager (DCGM)

This document describes how to configure your Google Kubernetes Engine deployment so that you can use Google Cloud Managed Service for Prometheus to collect metrics from NVIDIA Data Center GPU Manager. This page shows you how to do the following:

  • Set up an exporter for DCGM to report metrics.
  • Configure a PodMonitoring resource for Managed Service for Prometheus to collect the exported metrics.

These instructions apply only if you are using managed collection with Managed Service for Prometheus. If you are using self-deployed collection instead, see the source repository for the DCGM exporter for installation information.

These instructions are provided as an example and should work in most Kubernetes environments. For information about the managed DCGM offering, see Collect and view DCGM metrics.

If you are unable to install an application or exporter due to restrictive security or organizational policies, then we recommend consulting the open-source documentation for support.

For information about DCGM, see NVIDIA DCGM.

Prerequisites

To collect metrics from DCGM by using Managed Service for Prometheus and managed collection, your deployment must meet the following requirements:

  • Your cluster must be running Google Kubernetes Engine version 1.21.4-gke.300 or later.
  • You must be running Managed Service for Prometheus with managed collection enabled. For more information, see Get started with managed collection.

  • Verify that you have sufficient quota for NVIDIA GPUs.

  • To enumerate the GPU nodes in your GKE cluster and their GPU types, run the following command:

    kubectl get nodes -l cloud.google.com/gke-gpu -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.metadata.labels.cloud\.google\.com/gke-accelerator}{"\n"}{end}'
    
  • Note that you might need to install a compatible NVIDIA GPU driver on the nodes if automatic installation has been disabled or if your GKE version does not support automatic installation; an example installer command follows the verification step below. To verify that the NVIDIA GPU device plugin is running, run the following command:

    kubectl get pods -n kube-system | grep nvidia-gpu-device-plugin
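
If the device plugin is not running and automatic driver installation is unavailable, you might need to deploy NVIDIA's driver installer yourself. As a sketch, the following command is the one documented by GKE for Container-Optimized OS nodes; verify the manifest URL against the GKE GPU documentation for your node image and version before applying it:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml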
    

Install the DCGM exporter

We recommend that you install the DCGM exporter, DCGM-Exporter, by using the following configuration:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm
  namespace: gmp-public
  labels:
    app: nvidia-dcgm
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-dcgm
        app: nvidia-dcgm
    spec:
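      # Schedule only onto nodes that have a GPU attached (nodes carrying the gke-accelerator label).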
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - operator: "Exists"
      volumes:
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
          type: Directory
      containers:
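      # nv-hostengine runs the DCGM host engine in the foreground (-n) and binds to all interfaces (-b ALL); it listens on its default port, 5555.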
      - image: "nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04"
        command: ["nv-hostengine", "-n", "-b", "ALL"]
        ports:
        - containerPort: 5555
          hostPort: 5555
        name: nvidia-dcgm
        securityContext:
          privileged: true
        volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
  namespace: gmp-public
  labels:
    app.kubernetes.io/name: nvidia-dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-dcgm-exporter
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-dcgm-exporter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - operator: "Exists"
      volumes:
      - name: nvidia-dcgm-exporter-metrics
        configMap:
          name: nvidia-dcgm-exporter-metrics
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
          type: Directory
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
      containers:
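      # The exporter connects to the DCGM host engine on the same node (NODE_IP, port 5555) and reads the counter list from counters.csv.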
      - name: nvidia-dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
        command: ["/bin/bash", "-c"]
        args:
        - hostname $NODE_NAME; dcgm-exporter --remote-hostengine-info $(NODE_IP) --collectors /etc/dcgm-exporter/counters.csv
        ports:
        - name: metrics
          containerPort: 9400
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
          value: "device-name"
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: DCGM_EXPORTER_KUBERNETES
          value: 'true'
        - name: DCGM_EXPORTER_LISTEN
          value: ':9400'
        volumeMounts:
        - name: nvidia-dcgm-exporter-metrics
          mountPath: "/etc/dcgm-exporter"
          readOnly: true
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        - name: pod-resources
          mountPath: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-dcgm-exporter-metrics
  namespace: gmp-public
data:
  counters.csv: |
    # Utilization (the sample period varies depending on the product),,
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).

    # Temperature and power usage,,
    DCGM_FI_DEV_GPU_TEMP, gauge, Current temperature readings for the device in degrees C.
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature for the device.
    DCGM_FI_DEV_POWER_USAGE, gauge, Power usage for the device in Watts.

    # Utilization of IP blocks,,
    DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned
    DCGM_FI_PROF_SM_OCCUPANCY, gauge, The fraction of resident warps on a multiprocessor
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, The ratio of cycles the tensor (HMMA) pipe is active (off the peak sustained elapsed cycles)
    DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, The fraction of cycles the FP64 (double precision) pipe was active.
    DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, The fraction of cycles the FP32 (single precision) pipe was active.
    DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, The fraction of cycles the FP16 (half precision) pipe was active.

    # Memory usage,,
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    DCGM_FI_DEV_FB_TOTAL, gauge, Total Frame Buffer of the GPU in MB.

    # PCIE,,
    DCGM_FI_PROF_PCIE_TX_BYTES, gauge, Total number of bytes transmitted through PCIe TX
    DCGM_FI_PROF_PCIE_RX_BYTES, gauge, Total number of bytes received through PCIe RX

    # NVLink,,
    DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, The number of bytes of active NvLink tx (transmit) data including both header and payload.
    DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, The number of bytes of active NvLink rx (read) data including both header and payload.

To verify that the DCGM exporter is emitting metrics on the expected endpoints, do the following:

  1. Set up port forwarding by using the following command:

    kubectl -n gmp-public port-forward POD_NAME 9400
    
  2. Access the localhost:9400/metrics endpoint by using your browser or the curl utility in another terminal session.
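
For example, you can run the following from another terminal to fetch the metrics and filter for one of the configured counters; this minimal check assumes the port-forward above is still running:

    curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL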

You can customize the ConfigMap section to select which GPU metrics to emit.
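
For example, to also report clock frequencies, you could append entries such as the following to counters.csv. DCGM_FI_DEV_SM_CLOCK and DCGM_FI_DEV_MEM_CLOCK are standard DCGM field identifiers; check the DCGM field reference for the full list:

    # Clocks,,
    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).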

Alternatively, consider using the official Helm chart to install the DCGM exporter.
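
As a sketch, a Helm-based installation looks like the following; the repository URL and chart name here are taken from the DCGM-Exporter project's documentation, so verify them against that project's README before use:

    helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
    helm repo update
    helm install --generate-name gpu-helm-charts/dcgm-exporter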

To apply configuration changes from a local file, run the following command:

kubectl apply -n NAMESPACE_NAME -f FILE_NAME
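
For example, if you saved the manifests above to a file named dcgm-exporter.yaml (a hypothetical filename), the command would be:

    kubectl apply -n gmp-public -f dcgm-exporter.yaml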

You can also use Terraform to manage your configurations.

Define a PodMonitoring resource

For target discovery, the Managed Service for Prometheus Operator requires a PodMonitoring resource that corresponds to the DCGM exporter in the same namespace.

You can use the following PodMonitoring configuration:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: monitoring.googleapis.com/v1
kind: ClusterPodMonitoring
metadata:
  name: nvidia-dcgm-exporter
  labels:
    app.kubernetes.io/name: nvidia-dcgm-exporter
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 30s
  targetLabels:
    metadata: []

To apply configuration changes from a local file, run the following command:

kubectl apply -n NAMESPACE_NAME -f FILE_NAME

You can also use Terraform to manage your configurations.
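
To confirm that the resource was created, you can list it with kubectl. Note that the manifest above defines a ClusterPodMonitoring resource, which is cluster-scoped, so no namespace flag is needed:

    kubectl get clusterpodmonitoring nvidia-dcgm-exporter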

Verify the configuration

You can use Metrics Explorer to verify that you correctly configured the DCGM exporter. It might take one or two minutes for Cloud Monitoring to ingest your metrics.

To verify that the metrics are ingested, do the following:

  1. In the Google Cloud console, go to the Metrics Explorer page:

    Go to Metrics Explorer

    If you use the search bar to find this page, then select the result whose subheading is Monitoring.

  2. In the toolbar of the query-builder pane, select the button whose name is either MQL or PromQL.
  3. Verify that PromQL is selected in the Language toggle. The language toggle is in the same toolbar that lets you format your query.
  4. Enter and run the following query:
    DCGM_FI_DEV_GPU_UTIL{cluster="CLUSTER_NAME", namespace="gmp-public"}
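
As a further check, you can aggregate the same metric; for example, the following PromQL query averages utilization across all GPUs in the cluster:

    avg(DCGM_FI_DEV_GPU_UTIL{cluster="CLUSTER_NAME"})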
    

Troubleshooting

For information about troubleshooting metric ingestion problems, see Problems with collection from exporters in Troubleshooting ingestion-side problems.