This page explains how to debug node issues on Google Distributed Cloud using a suite of preinstalled debugging tools.
Overview
Each Google Distributed Cloud cluster you create is composed of several
nodes. Each node includes a distribution of
CoreOS' toolbox, a shell
script that unpacks and runs a debugging container, debug-toolbox.
debug-toolbox is a container image that includes several useful debugging
tools.
If you encounter issues with a specific node, you can attempt debugging by
connecting to the affected node, running the toolbox script to unpack and run the
debug-toolbox container, and then running the tools included in the container.
Tools included in debug-toolbox container
The debug-toolbox container runs a Debian base image that includes the
following packages:
- bash
- curl
- dnsutils
- hping3
- iperf3
- lsof
- netcat
- mtr
- procps
- strace
- tcpdump
- traceroute
- util-linux
Since these tools are included in the container, they don't require an internet
connection. If you want to install additional debugging tools, you can use
apt-get, which does require an internet connection.
Using toolbox
- Run the toolbox command:
  sudo toolbox
  This command starts a debug-toolbox container.
- While inside the container, run one of the tools, for example tcpdump.
- When you're finished, exit the container and close the SSH connection to the node. 
Node Problem Detector
Beginning with Google Distributed Cloud version 1.4, Node Problem Detector
is enabled on all the nodes in a cluster and helps detect some common node
problems quickly. Node Problem Detector continuously checks for possible
problems and reports them as events and conditions on the node. If a node
misbehaves, you can check whether Node Problem Detector detected the problem by
running kubectl describe on the node and looking for the corresponding events
and conditions.
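The same check can be scripted when you need to inspect many nodes. The following is a minimal sketch, assuming the node object has been exported as JSON with kubectl get node NODE_NAME -o json. The function name reported_problems is illustrative and not part of Google Distributed Cloud. It relies on the Kubernetes convention that the Ready condition is healthy when its status is "True", while other condition types indicate a problem when their status is "True".

```python
import json


def reported_problems(node_json):
    """Return the node conditions that currently indicate a problem.

    node_json is the text printed by `kubectl get node NODE_NAME -o json`.
    """
    node = json.loads(node_json)
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        is_ready_type = cond.get("type") == "Ready"
        is_true = cond.get("status") == "True"
        # "Ready" is healthy when True; every other condition type
        # (for example KubeletUnhealthy) signals a problem when True.
        if is_true != is_ready_type:
            problems.append(cond)
    return problems
```

Each returned item is the unmodified condition object, so fields such as reason and message remain available if the node reports them.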
The monitors in Node Problem Detector generate several conditions on the node.
If the reported condition is KubeletUnhealthy or ContainerRuntimeUnhealthy,
restarting the corresponding systemd service (kubelet or docker) might make the
node healthy again.
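The mapping from condition to service described above can be sketched as follows. This is an illustration only: the helper name restart_command is hypothetical, and the function builds the systemctl command without running it, so you can review the command before executing it on the node.

```python
# Mapping taken from the paragraph above: each condition names the systemd
# service whose restart might help.
SERVICE_FOR_CONDITION = {
    "KubeletUnhealthy": "kubelet",
    "ContainerRuntimeUnhealthy": "docker",
}


def restart_command(condition_type):
    """Return the systemctl command for a condition, or None if no restart applies."""
    service = SERVICE_FOR_CONDITION.get(condition_type)
    if service is None:
        return None
    # The command is returned as an argument list and is not executed here.
    return ["sudo", "systemctl", "restart", service]
```

Returning None for any other condition keeps the helper from suggesting a restart that the text above does not recommend.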
Beginning with Google Distributed Cloud version 1.5, automatic repair of the
kubelet and docker systemd services is enabled in Node Problem Detector. If
Node Problem Detector detects a KubeletUnhealthy or
ContainerRuntimeUnhealthy condition on the node, it tries to restart the
kubelet or docker service automatically, provided that the time since the last
restart exceeds a certain threshold.
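The threshold rule amounts to a cooldown check. The sketch below shows the general idea only. It is not Node Problem Detector's implementation, and the five-minute value is a placeholder, because this page does not state the real threshold.

```python
from datetime import datetime, timedelta

# Hypothetical value: the actual threshold is not documented on this page.
ASSUMED_THRESHOLD = timedelta(minutes=5)


def should_attempt_restart(last_restart, now, threshold=ASSUMED_THRESHOLD):
    """Return True when more than `threshold` has passed since the last restart."""
    return now - last_restart > threshold
```

A cooldown of this kind prevents a service that fails immediately after starting from being restarted in a tight loop.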