Node Problem Detector

Node Problem Detector is an open source library that monitors the health of nodes and detects common node problems, such as hardware, kernel, or container runtime issues. In Google Distributed Cloud, it runs as a systemd service on each node.

Starting with Google Distributed Cloud release 1.10.0, Node Problem Detector is enabled by default.

What problems does it detect?

Node Problem Detector can detect the following kinds of issues:

  • Container runtime problems, such as unresponsive runtime daemons
  • Hardware problems, such as CPU, memory, or disk failures
  • Kernel problems, such as kernel deadlock conditions or corrupted file systems

It runs on each node and reports problems to the Kubernetes API server as either a NodeCondition or an Event. A NodeCondition is a problem that makes a node unable to run pods, whereas an Event is a temporary problem that has a limited effect on pods but is nonetheless important enough to report.
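
For example, a detected problem surfaces as an entry in the node's status.conditions list. The following sketch shows what such an entry might look like in kubectl get node -o yaml output; the reason, message, and timestamps are illustrative, not actual values reported by a cluster:

status:
  conditions:
  # Condition reported by Node Problem Detector; status "True" means
  # the problem is currently present on the node.
  - type: KernelDeadlock
    status: "True"
    reason: DockerHung
    message: "task docker:7 blocked for more than 300 seconds."
    lastHeartbeatTime: "2025-04-19T21:36:44Z"
    lastTransitionTime: "2025-04-19T21:36:44Z"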

The following table describes the NodeConditions discovered by Node Problem Detector and whether or not they can be repaired automatically:

| Condition | Reason | Auto repair supported¹ |
| --- | --- | --- |
| KernelDeadlock | Kernel processes are stuck waiting for other kernel processes to release required resources. | No |
| ReadonlyFilesystem | The cluster is unable to write to the file system due to a problem, such as the disk being full. | No |
| FrequentKubeletRestart | The kubelet is restarting frequently, which prevents the node from running pods effectively. | No |
| FrequentDockerRestart | The Docker daemon has restarted more than 5 times in 20 minutes. | No |
| FrequentContainerdRestart | The container runtime has restarted more than 5 times in 20 minutes. | No |
| FrequentUnregisterNetDevice | The node is experiencing frequent unregistration of network devices. | No |
| KubeletUnhealthy | The node isn't functioning properly or isn't responding to the control plane. | No |
| ContainerRuntimeUnhealthy | The container runtime isn't functioning correctly, preventing pods from running or scheduling on the node. | No |
| CorruptDockerOverlay2 | There are file system issues or inconsistencies within the Docker overlay2 storage driver directory. | No |
| OrphanContainers² | The pod that a container belongs to has been deleted, but the container still exists on the node. | No |
| FailedCgroupRemoval² | Some cgroups are in a frozen state. | Yes |

¹ For versions 1.32 and higher, automatic repair of detected problems is supported for select conditions.

² Supported for versions 1.32 and higher.

The following are examples of Events reported by Node Problem Detector:

  • Warning TaskHung node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: task docker:7 blocked for more than 300 seconds.
  • Warning KernelOops node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: BUG: unable to handle kernel NULL pointer dereference at 0x0.
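
Node Problem Detector reports these Events against the Node object, so you can list them with a field selector. The following command is a convenience sketch rather than part of the documented workflow; replace NODE_NAME and KUBECONFIG as described later in this document:

kubectl get events --all-namespaces \
    --field-selector involvedObject.kind=Node,involvedObject.name=NODE_NAME \
    --kubeconfig=KUBECONFIG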

What problems does it repair?

Starting in version 1.32, when Node Problem Detector discovers select NodeConditions, it can automatically repair the corresponding problem on the node. As of version 1.32, the only NodeCondition that supports automatic repair is FailedCgroupRemoval.
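
To check whether any node currently reports this condition, a JSONPath query such as the following can help. This is a convenience sketch; an empty value after a node name means the condition isn't set on that node:

kubectl get nodes --kubeconfig=KUBECONFIG \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions[?(@.type=="FailedCgroupRemoval")].status}{"\n"}{end}'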

How to view detected problems

Run the following kubectl describe command to look for NodeConditions and Events:

kubectl describe node NODE_NAME \
    --kubeconfig=KUBECONFIG

Replace the following:

  • NODE_NAME: the name of the node you're checking.

  • KUBECONFIG: the path of the cluster kubeconfig file.
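
If you're only interested in the conditions, you can also print them directly with a JSONPath query. This is a convenience sketch rather than part of the documented workflow:

kubectl get node NODE_NAME \
    --kubeconfig=KUBECONFIG \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'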

How to enable and disable Node Problem Detector

By default, Node Problem Detector is enabled, but you can disable it in the node-problem-detector-config ConfigMap resource. Unless you explicitly disable it, Node Problem Detector continuously monitors nodes for specific conditions that indicate problems.

To disable Node Problem Detector on a given cluster, use the following steps:

  1. Edit the node-problem-detector-config ConfigMap resource:

    kubectl edit configmap node-problem-detector-config \
        --kubeconfig=KUBECONFIG \
        --namespace=CLUSTER_NAMESPACE
    

    Replace the following:

    • KUBECONFIG: the path of the cluster kubeconfig file.

    • CLUSTER_NAMESPACE: the namespace of the cluster in which you want to configure Node Problem Detector.

    This command opens a text editor in which you can edit the node-problem-detector-config resource.

  2. Set data.enabled to false in the node-problem-detector-config resource definition.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      creationTimestamp: "2025-04-19T21:36:44Z"
      name: node-problem-detector-config
    ...
    data:
      enabled: "false"
    

    Initially, the node-problem-detector-config ConfigMap doesn't have a data field, so you may need to add it.

  3. To update the resource, save your changes and close the editor.

To re-enable Node Problem Detector, perform the preceding steps, but set data.enabled to true in the node-problem-detector-config resource definition.
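
If you prefer to make the change without opening an editor, the same setting can be applied with kubectl patch. The following is a minimal sketch of an alternative to the editor-based steps above; it disables Node Problem Detector, and setting "enabled" to "true" instead re-enables it:

kubectl patch configmap node-problem-detector-config \
    --kubeconfig=KUBECONFIG \
    --namespace=CLUSTER_NAMESPACE \
    --type=merge \
    --patch '{"data":{"enabled":"false"}}'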

How to enable and disable automatic repair

Starting in version 1.32, Node Problem Detector checks for specific NodeConditions and automatically repairs the corresponding problem on the node. By default, automatic repair is enabled for supported NodeConditions, but it can be disabled in the node-problem-detector-config ConfigMap resource.

To disable the automatic repair behavior on a given cluster, use the following steps:

  1. Edit the node-problem-detector-config ConfigMap resource:

    kubectl edit configmap node-problem-detector-config \
        --kubeconfig=KUBECONFIG \
        --namespace=CLUSTER_NAMESPACE
    

    Replace the following:

    • KUBECONFIG: the path of the cluster kubeconfig file.

    • CLUSTER_NAMESPACE: the namespace of the cluster in which you want to configure automatic repair.

    This command opens a text editor in which you can edit the node-problem-detector-config resource.

  2. Set data.check-only to true in the node-problem-detector-config resource definition.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      creationTimestamp: "2025-04-19T21:36:44Z"
      name: node-problem-detector-config
    ...
    data:
      enabled: "true"
      check-only: "true"
    

    Initially, the node-problem-detector-config ConfigMap doesn't have a data field, so you may need to add it. Setting check-only to "true" disables automatic repair for all supported conditions.

  3. To update the resource, save your changes and close the editor.

To re-enable automatic repair for all NodeConditions that support it, set data.check-only to "false" in the node-problem-detector-config ConfigMap.
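
To confirm the current settings without opening an editor, you can read the data field of the ConfigMap back. This is a convenience sketch:

kubectl get configmap node-problem-detector-config \
    --kubeconfig=KUBECONFIG \
    --namespace=CLUSTER_NAMESPACE \
    --output jsonpath='{.data}'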

How to stop and restart Node Problem Detector

Node Problem Detector runs as a systemd service on each node. To manage Node Problem Detector for a given node, use SSH to access the node, and run the following systemctl commands.

  • To stop Node Problem Detector, run the following command:

    systemctl stop node-problem-detector
    
  • To restart Node Problem Detector, run the following command:

    systemctl restart node-problem-detector
    
  • To check if Node Problem Detector is running on a particular node, run the following command:

    systemctl is-active node-problem-detector
    
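Because Node Problem Detector runs as a standard systemd service, you can also inspect its logs with journalctl while you're connected to the node. For example, to view entries from the last hour:

journalctl --unit=node-problem-detector --since="1 hour ago"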

Unsupported features

Google Distributed Cloud doesn't support the following customizations of Node Problem Detector:

  • Exporting Node Problem Detector reports to other monitoring systems, such as Stackdriver or Prometheus.
  • Customizing which NodeConditions or Events to look for.
  • Running user-defined monitoring scripts.

What's next

If you need additional assistance, reach out to Cloud Customer Care.