Node Problem Detector is an open source library that monitors the health of nodes and detects common node problems, such as hardware, kernel or container runtime issues. In Google Distributed Cloud, it runs as a systemd service on each node.
While in Preview, Node Problem Detector is disabled by default.
What problems does it detect?
Node Problem Detector can detect the following kinds of issues:
- Container runtime problems, such as unresponsive runtime daemons
- Hardware problems, such as CPU, memory, or disk failures
- Kernel problems, such as kernel deadlock conditions or corrupted file systems
It runs on a node and reports problems to the Kubernetes API
server as either a NodeCondition or an Event.
(A NodeCondition is a problem that makes a node unable to run pods, whereas an
Event is a temporary problem that has a limited effect on pods but is
nonetheless considered important enough to report.)
Some of the NodeConditions discovered by Node Problem Detector are:
- KernelDeadlock
- ReadonlyFilesystem
- FrequentKubeletRestart
- FrequentDockerRestart
- FrequentContainerdRestart
- FrequentUnregisterNetDevice
- KubeletUnhealthy
- ContainerRuntimeUnhealthy
- CorruptDockerOverlay2
Some examples of the kinds of Events reported by Node Problem Detector are:
- Warning TaskHung node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: task docker:7 blocked for more than 300 seconds.
- Warning KernelOops node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: BUG: unable to handle kernel NULL pointer dereference at 00x0.
How to view detected problems
Run the following kubectl describe command on a node to look for
NodeConditions and Events:
kubectl --kubeconfig=KUBECONFIG_PATH describe node NODE_NAME
In the command, replace the following entries with information specific to your environment:
- KUBECONFIG_PATH: the path to the target cluster kubeconfig file. (The path to the kubeconfig file is usually bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig.)
- NODE_NAME: the name of the node about which you want health information.
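The describe output is long, so it can help to pull out only the conditions and events. The following is a minimal sketch, not part of the product documentation: the function names node_conditions and node_events are hypothetical, and the arguments are the same KUBECONFIG_PATH and NODE_NAME placeholders described above. It relies only on the standard kubectl get options for JSONPath output and field selectors.

```shell
# Hypothetical helpers (names are illustrative, not part of the product).

# Print each node condition as TYPE<TAB>STATUS<TAB>REASON.
node_conditions() {
  local kubeconfig="$1" node="$2"
  kubectl --kubeconfig="$kubeconfig" get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}

# List events whose involved object is the given node.
node_events() {
  local kubeconfig="$1" node="$2"
  kubectl --kubeconfig="$kubeconfig" get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$node"
}
```

For example, after sourcing these functions, run node_conditions KUBECONFIG_PATH NODE_NAME to list the conditions, then node_events KUBECONFIG_PATH NODE_NAME to list the events for the same node.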
How to enable/disable Node Problem Detector
Here are the steps to take to enable Node Problem Detector on a given cluster:
1. Edit the cluster's ConfigMap file, which is called node-problem-detector-config:
   kubectl --kubeconfig=KUBECONFIG_PATH edit configmap \
       node-problem-detector-config --namespace=CLUSTER_NAMESPACE
   This command automatically starts up a text editor (such as vim or nano) in which you can edit the node-problem-detector-config file. In the command, replace the following entries with information specific to your cluster environment:
   - KUBECONFIG_PATH: the path to the admin cluster kubeconfig file. (The path to the kubeconfig file is usually bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig.)
   - CLUSTER_NAMESPACE: the namespace of the cluster in which you want to enable Node Problem Detector.
2. Initially, the node-problem-detector-config ConfigMap doesn't have a data field. Add the data field to the configuration map with the following key-value pair:
   data:
     enabled: "true"
To disable Node Problem Detector in a cluster namespace, perform the preceding
steps 1 and 2, but in step 2, change the value of the enabled key to
"false".
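The enable and disable steps can also be scripted instead of using an interactive editor. The following is a sketch under an assumption the documentation doesn't confirm: that setting the enabled key with kubectl patch has the same effect as setting it with kubectl edit. The function name set_npd_enabled is hypothetical; the ConfigMap name, the enabled key, and the placeholders come from the steps above.

```shell
# Hypothetical helper: set data.enabled on the node-problem-detector-config
# ConfigMap with a merge patch. Assumes a patch is handled like a manual edit.
set_npd_enabled() {
  local kubeconfig="$1" namespace="$2" value="$3"

  # ConfigMap values are strings, so accept only the two literal values.
  case "$value" in
    true|false) ;;
    *) echo "value must be true or false" >&2; return 1 ;;
  esac

  kubectl --kubeconfig="$kubeconfig" patch configmap node-problem-detector-config \
    --namespace="$namespace" --type merge \
    -p "{\"data\":{\"enabled\":\"$value\"}}"
}
```

For example, set_npd_enabled KUBECONFIG_PATH CLUSTER_NAMESPACE true adds the data field if it is missing and sets enabled to "true"; passing false as the last argument sets it to "false".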
How to stop/start Node Problem Detector
Node Problem Detector runs as a systemd service on each node. To manage Node Problem Detector for a given node, use SSH to access the node, and run the following systemctl commands.
To stop Node Problem Detector, run the following command:
systemctl stop node-problem-detector
To restart Node Problem Detector, run the following command:
systemctl restart node-problem-detector
To check if Node Problem Detector is running on a particular node, run the following command:
systemctl is-active node-problem-detector
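When checking the service from a script, the exit status of systemctl is-active is more convenient than its printed output. The following is a minimal sketch; the function name npd_status and the messages it prints are illustrative, while the service name comes from the commands above.

```shell
# Hypothetical helper: report whether the service is active on this node.
# "systemctl is-active --quiet" prints nothing and exits 0 only when active.
npd_status() {
  if systemctl is-active --quiet node-problem-detector; then
    echo "node-problem-detector is running"
  else
    echo "node-problem-detector is not running"
  fi
}
```

Run npd_status on the node after connecting over SSH, in the same way as the other systemctl commands.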