This document explains how to troubleshoot issues with your GPU clusters using the Cluster Health Scanner (CHS) tool.
The CHS tool checks the health of your GPU clusters, running tests to verify that the clusters are ready to run your workloads. You can use CHS to perform proactive health checks, or as a diagnostic tool when you encounter problems with a workload. In addition to checking the configuration of your cluster, you can perform the following tests:
- NCCL check: validates the network communication between GPUs using the NVIDIA Collective Communications Library (NCCL).
- GPU check: utilizes NVIDIA's Data Center GPU Manager (DCGM) tool to check the health of individual GPUs.
- Neper check: uses the Neper tool to assess network performance within the cluster.
- Straggler detection: runs a network traffic pattern between nodes that closely resembles the patterns seen during pipeline parallelism in LLM training workloads.
- Tinymax check: uses MaxText, an open source LLM framework, to assess ML training within the cluster.
You can only run CHS checks and tests on nodes that aren't running any jobs or workloads. If you try to run a check or a test on a busy node, the check or test fails.
The CHS tool is available for GPU clusters that are orchestrated by Google Kubernetes Engine (GKE) or Slurm, regardless of the provisioning model that you used to create the clusters. However, CHS is only available for the following machine types:
- A4
- A3 Ultra
- A3 Mega
- A3 High
The following sections describe how to install CHS, and then how to use it to perform health checks and check your configuration.
Install CHS
Use the following procedure to install CHS:
Go to the Compute Engine > VM instances page.
Locate the login node. It might have a name with the pattern DEPLOYMENT_NAME+login-001.
From the Connect column of the login node, click SSH.
Use the following command to clone the repository and move to the root directory for the repository:
git clone https://github.com/GoogleCloudPlatform/cluster-health-scanner && cd cluster-health-scanner
Use the following command to install dependencies for Google Cloud CLI:
pip3 install -r cli/requirements.txt
Optional: to let the configcheck command fetch configuration values from your cluster without needing to reauthenticate for each machine, use the following command to add your Google Cloud CLI SSH key to your local SSH agent:
ssh-add ~/.ssh/google_compute_engine
Use the following command to add the alias cluster_diag for cluster_diag.py:
alias cluster_diag="python3 cli/cluster_diag.py"
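For convenience, the following is a consolidated sketch of the preceding installation steps as a single shell session. The eval line that starts an SSH agent is an assumption: it's a standard OpenSSH command that's only needed if an agent isn't already running on the login node.
# Clone CHS and install the CLI dependencies.
git clone https://github.com/GoogleCloudPlatform/cluster-health-scanner && cd cluster-health-scanner
pip3 install -r cli/requirements.txt
# Optional: start an SSH agent if one isn't already running (assumption; standard
# OpenSSH, not part of CHS), then add your Google Cloud CLI SSH key for configcheck.
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/google_compute_engine
# Create the cluster_diag alias used in the commands that follow.
alias cluster_diag="python3 cli/cluster_diag.py"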
Perform a health check
After you've installed CHS, do the following to check the health of your GPU cluster:
Go to the Compute Engine > VM instances page.
Locate the login node. It might have a name with the pattern DEPLOYMENT_NAME+login-001.
From the Connect column of the login node, click SSH.
Verify that you're in the root directory for the repository.
Use the following command to check the current status of your cluster:
cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE status
Replace the following:
- ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
- GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
  - a4-highgpu-8g
  - a3-ultragpu-8g
  - a3-megagpu-8g
  - a3-highgpu-8g
  - a3-highgpu-4g
  - a3-highgpu-2g
  - a3-highgpu-1g
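For example, on a GKE cluster that uses a3-megagpu-8g machines, the status check would look like the following (the orchestrator and machine type here are illustrative; substitute your own values):
cluster_diag -o gke healthscan a3-megagpu-8g status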
Use the following command to check the health of individual GPUs within your cluster:
cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE gpu
Replace the following:
- ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
- GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
  - a4-highgpu-8g
  - a3-ultragpu-8g
  - a3-megagpu-8g
  - a3-highgpu-8g
  - a3-highgpu-4g
  - a3-highgpu-2g
  - a3-highgpu-1g
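For example, a GPU health check on a Slurm cluster that uses a3-ultragpu-8g machines would look like the following (illustrative values only):
cluster_diag -o slurm healthscan a3-ultragpu-8g gpu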
Optional: use the following template command to run additional checks. Consider adding the --run_only_on_available_nodes flag to skip unavailable nodes:
cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE CHECK
Replace the following:
- ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
- GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
  - a4-highgpu-8g
  - a3-ultragpu-8g
  - a3-megagpu-8g
  - a3-highgpu-8g
  - a3-highgpu-4g
  - a3-highgpu-2g
  - a3-highgpu-1g
- CHECK: the check that you want to run. Use one of the following options:
  - status
  - nccl
  - gpu
  - straggler
  - neper
  - tinymax
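For example, the following illustrative command runs the NCCL check on a GKE cluster of a3-megagpu-8g machines and skips nodes that aren't available; placing the flag at the end of the command is an assumption, so adjust it if your CLI version expects it elsewhere:
cluster_diag -o gke healthscan a3-megagpu-8g nccl --run_only_on_available_nodes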
Check your configuration
After you've installed CHS, do the following to check the configuration of your cluster:
Verify that you're in the root directory for the repository.
Use the following command to check the configuration of your cluster. By default, this command produces a diff; to skip the diff and just print the configuration, add the --no-diff flag:
cluster_diag -o ORCHESTRATOR configcheck GPU_TYPE
Replace the following:
- ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
- GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
  - a4-highgpu-8g
  - a3-ultragpu-8g
  - a3-megagpu-8g
  - a3-highgpu-8g
  - a3-highgpu-4g
  - a3-highgpu-2g
  - a3-highgpu-1g
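For example, to print the configuration of a GKE cluster of a3-megagpu-8g machines without producing a diff, the command would look like the following (illustrative values; the flag position is an assumption):
cluster_diag -o gke configcheck a3-megagpu-8g --no-diff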
[Screenshot: the result from a successful configuration check.]