Node Problem Detector

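Node Problem Detector is an open source library that monitors the health of nodes and detects common node problems, such as hardware, kernel, or container runtime issues. In Google Distributed Cloud, it runs as a systemd service on each node.

Starting with Google Distributed Cloud version 1.10.0, Node Problem Detector is enabled by default.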

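If you need additional assistance, contact Cloud Customer Care. You can also see Getting support for more information about support resources, including the following:

Requirements for opening a support case.
Tools to help you troubleshoot, such as your environment configuration, logs, and metrics.
Supported components.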

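What problems does it detect?

Node Problem Detector can detect the following kinds of problems:

Container runtime problems, such as an unresponsive runtime daemon
Hardware problems, such as CPU, memory, or disk failures
Kernel problems, such as kernel deadlock conditions or corrupted file systems

It runs on a node and reports problems to the Kubernetes API server as either a NodeCondition or an Event. A NodeCondition is a problem that makes a node unable to run pods. An Event is a temporary problem that has limited effect on pods but is still considered important enough to report.

The following table describes the NodeConditions that Node Problem Detector discovers and whether they can be repaired automatically: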

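Condition	Reason	Auto repair supported¹
`KernelDeadlock`	Kernel processes are stuck waiting for other kernel processes to release required resources.	No
`ReadonlyFilesystem`	The cluster is unable to write to the file system because of a problem such as a full disk.	No
`FrequentKubeletRestart`	The kubelet is restarting frequently, which prevents the node from running pods effectively.	No
`FrequentDockerRestart`	The Docker daemon has restarted more than 5 times in 20 minutes.	No
`FrequentContainerdRestart`	The container runtime has restarted more than 5 times in 20 minutes.	No
`FrequentUnregisterNetDevice`	The node is experiencing frequent unregistering of network devices.	No
`KubeletUnhealthy`	The node isn't functioning properly or isn't responding to the control plane.	No
`ContainerRuntimeUnhealthy`	The container runtime isn't functioning correctly, which prevents pods from running or being scheduled on the node.	No
`CorruptDockerOverlay2`	There are file system problems or inconsistencies in the Docker overlay2 storage driver directory.	No
`OrphanContainers`²	A pod specific to a container has been deleted, but the corresponding container still exists on the node.	No
`FailedCgroupRemoval`²	Some cgroups are in a frozen state.	Yes

¹ For versions 1.32 and higher, automatic repair of detected problems is supported for select conditions.

² Supported for versions 1.32 and higher.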

Some examples of the kinds of Events that Node Problem Detector reports:

Warning TaskHung node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: task docker:7 blocked for more than 300 seconds.
Warning KernelOops node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: BUG: unable to handle kernel NULL pointer dereference at 00x0.

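What problems does it repair?

Starting with version 1.32, when Node Problem Detector discovers certain NodeConditions, it can automatically repair the corresponding problem on the node. As of version 1.32, the only NodeCondition that supports automatic repair is FailedCgroupRemoval.

How to view detected problems

Run the following kubectl describe command to look for NodeConditions and Events: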

kubectl describe node NODE_NAME \
    --kubeconfig=KUBECONFIG

Replace the following:

NODE_NAME: the name of the node that you want to check.
KUBECONFIG: the path of the cluster kubeconfig file.
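
When you only need the condition list rather than the full describe output, a narrower query can help. The following is a minimal sketch, not part of the product documentation: the function names `node_conditions` and `problem_conditions` are hypothetical, and the filter assumes that a status of "True" signals a problem for every condition except Ready, where "True" means healthy.

```shell
# Print one "TYPE<TAB>STATUS<TAB>REASON" line per condition on a node.
# Usage: node_conditions NODE_NAME KUBECONFIG
node_conditions() {
  kubectl get node "$1" --kubeconfig="$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}

# Keep only the lines that signal a problem (assumption, see lead-in):
# Ready with a status other than "True", or any other condition with "True".
problem_conditions() {
  awk -F '\t' '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 == "True") {
    print $1 "\t" $2 "\t" $3
  }'
}

# Example (requires cluster access):
#   node_conditions NODE_NAME KUBECONFIG | problem_conditions
```

An empty result means no condition is currently flagged. Events are not covered by this sketch, so the describe command remains the way to see both in one place.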

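How to enable and disable Node Problem Detector

Node Problem Detector is enabled by default, but you can disable it in the node-problem-detector-config ConfigMap resource. Unless you explicitly disable it, Node Problem Detector continuously monitors nodes for specific conditions that indicate problems with the node.

To disable Node Problem Detector on a given cluster, use the following steps:

Edit the node-problem-detector-config ConfigMap resource: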
```
kubectl edit configmap node-problem-detector-config \
    --kubeconfig=KUBECONFIG \
    --namespace=CLUSTER_NAMESPACE
```
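Replace the following:
- KUBECONFIG: the path of the cluster kubeconfig file.
- CLUSTER_NAMESPACE: the namespace of the cluster whose Node Problem Detector setting you want to change.
This command automatically starts a text editor in which you can edit the node-problem-detector-config resource.
In the node-problem-detector-config resource definition, set data.enabled to "false".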
```
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: "2025-04-19T21:36:44Z"
  name: node-problem-detector-config
...
data:
  enabled: "false"
```
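Initially, the node-problem-detector-config ConfigMap doesn't have a data field, so you might need to add it.
To update the resource, save your changes and close the editor.

To re-enable Node Problem Detector, perform the preceding steps, but set data.enabled to "true" in the node-problem-detector-config resource definition.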
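
For scripted environments, the same change can be made without opening an editor. The following is a minimal sketch that uses `kubectl patch` with a JSON merge patch; the function names `npd_patch` and `npd_set` are hypothetical, and the sketch assumes that a patched ConfigMap is treated the same way as one changed through `kubectl edit`.

```shell
# Build a JSON merge patch that sets one key under .data to a string value.
# Usage: npd_patch KEY VALUE
npd_patch() {
  printf '{"data":{"%s":"%s"}}' "$1" "$2"
}

# Apply one setting to the ConfigMap. A merge patch creates the data
# field when the ConfigMap doesn't have one yet.
# Usage: npd_set KEY VALUE KUBECONFIG CLUSTER_NAMESPACE
npd_set() {
  kubectl patch configmap node-problem-detector-config \
    --kubeconfig="$3" \
    --namespace="$4" \
    --type=merge \
    --patch "$(npd_patch "$1" "$2")"
}

# Example (requires cluster access):
#   npd_set enabled false KUBECONFIG CLUSTER_NAMESPACE   # disable
#   npd_set enabled true  KUBECONFIG CLUSTER_NAMESPACE   # re-enable
```

The values are always quoted in the payload because ConfigMap data values must be strings.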

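How to enable and disable automatic repair

Starting with version 1.32, Node Problem Detector checks for specific NodeConditions and automatically repairs the corresponding problem on the node. By default, automatic repair is enabled for supported NodeConditions, but you can disable it in the node-problem-detector-config ConfigMap resource.

To disable automatic repair on a given cluster, use the following steps:

Edit the node-problem-detector-config ConfigMap resource: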
```
kubectl edit configmap node-problem-detector-config \
    --kubeconfig=KUBECONFIG \
    --namespace=CLUSTER_NAMESPACE
```
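Replace the following:
- KUBECONFIG: the path of the cluster kubeconfig file.
- CLUSTER_NAMESPACE: the namespace of the cluster whose Node Problem Detector setting you want to change.
This command automatically starts a text editor in which you can edit the node-problem-detector-config resource.
In the node-problem-detector-config resource definition, set data.check-only to "true".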
```
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: "2025-04-19T21:36:44Z"
  name: node-problem-detector-config
...
data:
  enabled: "true"
  check-only: "true"
```
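Initially, the node-problem-detector-config ConfigMap doesn't have a data field, so you might need to add it. Setting check-only to "true" disables automatic repair for all supported conditions.
To update the resource, save your changes and close the editor.

To re-enable automatic repair for all NodeConditions that support it, set data.check-only to "false" in the node-problem-detector-config ConfigMap.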
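
The two settings combine into three effective behaviors, which the following sketch spells out. It is illustrative only: the function name `npd_mode` and the mode labels are hypothetical, and the sketch assumes that check-only has no effect when enabled is "false" and that any value other than "true" counts as false.

```shell
# Report the effective behavior for a pair of ConfigMap values.
# Empty arguments fall back to the documented defaults:
# enabled "true" and check-only "false".
# Usage: npd_mode ENABLED CHECK_ONLY
npd_mode() {
  enabled="${1:-true}"
  check_only="${2:-false}"
  if [ "$enabled" != "true" ]; then
    echo "disabled"            # no detection, no repair
  elif [ "$check_only" = "true" ]; then
    echo "detect-only"         # problems reported, never repaired
  else
    echo "detect-and-repair"   # repair applies to supported conditions only
  fi
}
```

For example, `npd_mode true true` reports detect-only, which matches the ConfigMap shown in the preceding steps. Automatic repair itself is available only in version 1.32 and higher.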

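How to stop and restart Node Problem Detector

Node Problem Detector runs as a systemd service on each node. To manage Node Problem Detector for a given node, use SSH to access the node, and run the following systemctl commands.

To stop Node Problem Detector, run the following command: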
```
systemctl stop node-problem-detector
```
To restart Node Problem Detector, run the following command:
```
systemctl restart node-problem-detector
```
To check whether Node Problem Detector is running on a particular node, run the following command:
```
systemctl is-active node-problem-detector
```
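
To check several nodes in one pass, the is-active command can be wrapped in a loop. The following is a minimal sketch that assumes you have SSH access to each node under the names you pass in; the function name `npd_states` and the `NPD_SSH` variable are hypothetical.

```shell
# Command used to reach a node. Override NPD_SSH to substitute another
# transport or a stub for testing.
NPD_SSH="${NPD_SSH:-ssh}"

# Print "NODE<TAB>STATE" for each node given as an argument.
# Usage: npd_states NODE...
npd_states() {
  for node in "$@"; do
    # is-active prints the unit state and exits non-zero when the unit
    # isn't active, so the exit status is ignored and the text is kept.
    state="$($NPD_SSH "$node" systemctl is-active node-problem-detector 2>/dev/null)" || true
    printf '%s\t%s\n' "$node" "${state:-unknown}"
  done
}

# Example (requires SSH access to the nodes):
#   npd_states NODE_1 NODE_2
```

A state of unknown means that the command produced no output, for example because the node couldn't be reached.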

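Features not supported

Google Distributed Cloud doesn't support the following customizations of Node Problem Detector:

Exporting Node Problem Detector reports to other monitoring systems, such as Stackdriver or Prometheus.
Customizing which NodeConditions or Events to look for.
Running user-defined monitoring scripts.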
