Debugging node issues

This page explains how to debug node issues using a suite of preinstalled debugging tools.

Overview

Each GKE on-prem cluster you create consists of several nodes. Each node includes a distribution of CoreOS's toolbox, a shell script that unpacks and runs a debugging container named debug-toolbox. The debug-toolbox container image includes several useful debugging tools.

If you encounter issues with a specific node, you can debug it by connecting to the affected node, running the toolbox script to unpack and start the debug-toolbox container, and then running the tools included in the container.

Tools included in the debug-toolbox container

The debug-toolbox container runs a Debian base image that includes the following packages:

  • bash
  • curl
  • dnsutils
  • hping3
  • iperf3
  • lsof
  • netcat
  • mtr
  • procps
  • strace
  • tcpdump
  • traceroute
  • util-linux

Because these tools are included in the container image, using them doesn't require an internet connection. To install additional debugging tools, you can use apt-get, which does require an internet connection.
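
As a rough sketch of installing an extra tool, the following function could be pasted into the shell inside the debug-toolbox container. It assumes the node can reach a Debian package mirror. The function name install_debug_tools is hypothetical and is not part of toolbox.

```shell
# Hypothetical helper (not part of toolbox): install extra Debian packages
# inside the debug-toolbox container. Assumes a package mirror is reachable.
install_debug_tools() {
  if [ "$#" -eq 0 ]; then
    echo "usage: install_debug_tools PACKAGE..." >&2
    return 2
  fi
  # Refresh the package index first, then install the requested packages.
  apt-get update && apt-get install -y --no-install-recommends "$@"
}
```

For example, install_debug_tools nmap would attempt to install the nmap package. The package name here is only an illustration.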

Using toolbox

  1. SSH into the cluster node.

  2. Run the toolbox command:

    sudo toolbox

    This command starts a debug-toolbox container.

  3. While inside the container, run one of the tools. For example, run tcpdump to capture network traffic.

  4. When you're finished, exit the container and close the SSH connection to the node.
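
As an illustration of step 3, the following sketch wraps two of the bundled tools in small functions that could be pasted into the container's shell. The function names, the default port, and the default packet count are assumptions chosen for the example. Whether the capture sees the traffic you expect depends on how the container shares the node's network.

```shell
# Hypothetical helpers to paste into the debug-toolbox shell.

# Capture a bounded number of packets for one port on all interfaces.
# $1 = port (default 53), $2 = packet count (default 20)
capture_port() {
  # -n: don't resolve names; -i any: all interfaces; -c: stop after N packets
  tcpdump -n -i any -c "${2:-20}" port "${1:-53}"
}

# List processes that hold listening TCP sockets.
list_listeners() {
  # -n / -P: show numeric addresses and ports instead of resolving names
  lsof -nP -iTCP -sTCP:LISTEN
}
```

For example, capture_port 443 5 stops after five packets on port 443, which keeps the capture from running indefinitely.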

Node Problem Detector

Beginning with GKE on-prem version 1.4, Node Problem Detector is enabled on all nodes in a cluster to help detect some common node problems quickly. Node Problem Detector continuously checks for possible problems and reports them as events and conditions on the node. If a node misbehaves, you can check whether Node Problem Detector detected the problem by running kubectl describe on the node and looking for the corresponding events and conditions.
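
Besides kubectl describe node, the conditions and events can be narrowed down with kubectl get. The following is a minimal sketch, assuming kubectl is already configured for the cluster. The function names are hypothetical, and the node name is passed as the first argument.

```shell
# Hypothetical helpers; assume kubectl already points at the right cluster.

# Print each condition on a node as "TYPE=STATUS (REASON)", one per line.
# $1 = node name
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
}

# List events recorded against a node, searching all namespaces.
# $1 = node name
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1"
}
```

For example, node_conditions NODE_NAME lists every condition currently set on that node, including any that Node Problem Detector reported.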

Node Problem Detector's monitors generate several conditions on the node. If the reported condition is KubeletUnhealthy or ContainerRuntimeUnhealthy, restarting the corresponding systemd service (kubelet or Docker, respectively) might make the node healthy again.
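
The restart could be sketched as follows. This assumes the commands run in the node's own shell over SSH and that the systemd units are named kubelet and docker. Both the unit names and the function name restart_for_condition are assumptions, so verify them on your node first.

```shell
# Hypothetical helper: restart the service that matches an unhealthy condition.
# Assumes the systemd unit names are "kubelet" and "docker".
# $1 = condition type reported on the node
restart_for_condition() {
  case "$1" in
    KubeletUnhealthy)          svc=kubelet ;;
    ContainerRuntimeUnhealthy) svc=docker ;;
    *) echo "no restart mapped for condition: $1" >&2; return 1 ;;
  esac
  # Restart the unit, then report whether systemd considers it active.
  sudo systemctl restart "$svc" && sudo systemctl is-active "$svc"
}
```

For example, restart_for_condition KubeletUnhealthy restarts the kubelet unit and then reports its state. Any other condition is left alone, and the function returns a non-zero status.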