This document explains how to test for issues with your GPU clusters using the cluster health scanner (CHS) tool.

The CHS tool checks the health of your GPU clusters, running tests to verify that the clusters are ready to run your workloads. You can use CHS to perform proactive health checks, or as a diagnostic tool when you encounter problems with a workload. In addition to checking the configuration of your cluster, you can perform the following tests:
- NCCL check: validates the network communication between GPUs using the NVIDIA Collective Communications Library (NCCL).
- GPU check: uses NVIDIA's Data Center GPU Manager (DCGM) tool to check the health of individual GPUs.
- Neper check: uses the Neper tool to assess network performance within the cluster.
- Straggler detection: runs a network traffic pattern between nodes that closely resembles the patterns seen during pipeline parallelism in LLM training workloads. Learn more about straggler detection.
- Tinymax check: uses Maxtext, an open source LLM framework, to assess ML training within the cluster.
You can only run CHS checks and tests on nodes that aren't running any jobs or
workloads. If you try to run a check or a test on a busy node, the check or
test fails.
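For example, before you start a scan you can confirm that the target nodes are idle. The following commands are a minimal sketch and aren't part of CHS; they assume a Slurm login node or a configured kubectl context, and NODE_NAME is a placeholder for the node that you want to scan:

```
# Slurm: list nodes that are currently idle and therefore safe to scan.
sinfo --states=idle --format="%N %T"

# GKE: list any pods still scheduled on the node that you want to scan.
# NODE_NAME is a placeholder; replace it with your node's name.
kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE_NAME
```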
The CHS tool is available for GPU clusters that are orchestrated by Google Kubernetes Engine (GKE) or Slurm, regardless of the provisioning model that you used to create the clusters. However, CHS is only available for the following machine types:
- A4
- A3 Ultra
- A3 Mega
- A3 High
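If you aren't sure which machine type a node uses, you can query it with the Google Cloud CLI. This is a general-purpose gcloud example rather than part of CHS; NODE_NAME and ZONE are placeholders:

```
# Print the machine type URI of a node; the machine type name is the
# last path segment, for example .../machineTypes/a3-megagpu-8g.
gcloud compute instances describe NODE_NAME --zone=ZONE --format="value(machineType)"
```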
The following sections describe how to install CHS, and then
how to use it to perform health checks and check your configuration.
Install CHS
Use the following procedure to install CHS:
1. In the Google Cloud console, go to the Compute Engine > VM instances page.
2. Locate the login node. It might have a name with the pattern DEPLOYMENT_NAME + login-001.
3. In the Connect column of the login node, click SSH.
4. Clone the repository and change to its root directory:

   git clone https://github.com/GoogleCloudPlatform/cluster-health-scanner && cd cluster-health-scanner

5. Install the dependencies for the Google Cloud CLI:

   pip3 install -r cli/requirements.txt

6. Optional: To let the configcheck command fetch configuration values from your cluster without reauthenticating for each machine, add your Google Cloud CLI SSH key to your local SSH agent:

   ssh-add ~/.ssh/google_compute_engine

7. Add the alias cluster_diag for cluster_diag.py:

   alias cluster_diag="python3 cli/cluster_diag.py"
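If you set up CHS on more than one login node, you can collect the preceding commands into a single file. The following is a convenience sketch of the same steps; it assumes that git, python3, and pip3 are available on the login node, and that you source the file (for example, a hypothetical setup_chs.sh) so that the alias persists in your current shell:

```
# Clone CHS and move into the repository root.
git clone https://github.com/GoogleCloudPlatform/cluster-health-scanner
cd cluster-health-scanner

# Install the CLI dependencies.
pip3 install -r cli/requirements.txt

# Optional: add your Google Cloud CLI SSH key to the local SSH agent so that
# configcheck doesn't prompt you to reauthenticate for each machine.
# This requires a running ssh-agent.
ssh-add ~/.ssh/google_compute_engine

# Define the cluster_diag alias. If you save these commands as a file,
# source it so that the alias takes effect in your current shell.
alias cluster_diag="python3 cli/cluster_diag.py"
```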
Perform a health check
After you've installed CHS, do the following to check the health of your
GPU cluster:
1. Verify that you're in the root directory for the repository.
2. Use the following command to check the current status of your cluster:

   cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE status

   Replace the following:
   - ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
   - GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
     - a4-highgpu-8g
     - a3-ultragpu-8g
     - a3-megagpu-8g
     - a3-highgpu-8g
     - a3-highgpu-4g
     - a3-highgpu-2g
     - a3-highgpu-1g
3. Optional: Use the following template command to run additional checks. Consider adding the --run_only_on_available_nodes flag to skip unavailable nodes:

   cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE CHECK

   Replace the following:
   - ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
   - GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
     - a4-highgpu-8g
     - a3-ultragpu-8g
     - a3-megagpu-8g
     - a3-highgpu-8g
     - a3-highgpu-4g
     - a3-highgpu-2g
     - a3-highgpu-1g
   - CHECK: the check that you want to run. Use one of the following options:
     - status
     - nccl
     - gpu
     - straggler
     - neper
     - tinymax
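For example, to run several checks back to back, you can loop over check names. The orchestrator (slurm), machine type (a3-megagpu-8g), and check list below are illustrative values, and the sketch assumes you run it in the shell where you defined the cluster_diag alias:

```
# Run a sequence of CHS checks on an a3-megagpu-8g Slurm cluster.
for check in status nccl gpu; do
  cluster_diag -o slurm healthscan a3-megagpu-8g "$check"
done
```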
Check your configuration
After you've installed CHS, do the following to check the configuration of your
cluster:
1. Verify that you're in the root directory for the repository.
2. Use the following command to check the configuration of your cluster. By default, this command produces a diff; to skip the diff and just print the configuration, add the --no-diff flag:

   cluster_diag -o ORCHESTRATOR configcheck GPU_TYPE

   Replace the following:
   - ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
   - GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
     - a4-highgpu-8g
     - a3-ultragpu-8g
     - a3-megagpu-8g
     - a3-highgpu-8g
     - a3-highgpu-4g
     - a3-highgpu-2g
     - a3-highgpu-1g
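For example, to print the configuration of a GKE cluster of a3-megagpu-8g machines without the diff, you might run the following. The orchestrator and machine type are illustrative, and the flag placement assumes the command template shown above:

```
# Print the cluster configuration without producing a diff.
cluster_diag -o gke configcheck a3-megagpu-8g --no-diff
```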
The following screenshot shows the result from a successful configuration check:
A successful configuration check result.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-25 UTC."],[],[],null,["# Test clusters\n\nThis document explains how to test for issues with your\n[GPU clusters](/ai-hypercomputer/docs/gpu) using the cluster health scanner\n(CHS) tool.\n\nThe\n[CHS](https://github.com/GoogleCloudPlatform/cluster-health-scanner)\ntool checks the health of your GPU clusters, running\ntests to verify that the clusters are ready to run your workloads. You can\nuse CHS to perform proactive health checks, or as a diagnostic tool when you\nencounter problems with a workload. In addition to checking the configuration\nof your cluster, you can perform the following tests:\n\n- **NCCL check**: validates the network communication between GPUs using the NVIDIA Collective Communications Library (NCCL).\n- **GPU check**: utilizes NVIDIA's Data Center GPU Manager (DCGM) tool to check the health of individual GPUs.\n- **Neper check**: uses the Neper tool to assess network performance within the cluster.\n- **Straggler detection** : runs a network traffic pattern between nodes that closely resemble patterns seen during LLM training workload pipeline parallelism. Learn more about [straggler detection](/ai-hypercomputer/docs/monitor).\n- **Tinymax check** : uses [Maxtext](https://github.com/AI-Hypercomputer/maxtext), an open source LLM framework, to assess ML training within the cluster.\n\nYou can only run CHS checks and tests on nodes that aren't running any jobs or\nworkloads. If you try to run a check or a test on a busy node, the check or\ntest fails.\n\nThe CHS tool is available for GPU clusters that are orchestrated by\nGoogle Kubernetes Engine (GKE) or Slurm, regardless of what provisioning model that\nyou used to create the clusters. However, CHS is only available for the\nfollowing machine types:\n\n- A4\n- A3 Ultra\n- A3 Mega\n- A3 High\n\nThe following sections describe how to install CHS, and then\nhow to use it to perform health checks and check your configuration.\n\nInstall CHS\n-----------\n\nUse the following procedure to install CHS:\n\n1. Go to the **Compute Engine** \\\u003e **VM instances** page.\n\n [Go to the VM instances page](https://console.cloud.google.com/compute/instances)\n2. Locate the login node. It might have a name with the pattern\n \u003cvar translate=\"no\"\u003eDEPLOYMENT_NAME\u003c/var\u003e +`login-001`.\n\n3. From the **Connect** column of the login node, click **SSH**.\n\n4. Use the following command to clone the repository and move to the root\n directory for the repository:\n\n ```\n git clone https://github.com/GoogleCloudPlatform/cluster-health-scanner && cd cluster-health-scanner\n ```\n5. Use the following command to install dependencies for Google Cloud CLI:\n\n ```\n pip3 install -r cli/requirements.txt\n ```\n6. Optional: to let the `configcheck` command fetch configuration values from\n your cluster without needing to reauthenticate for each machine, use the\n following command to add your Google Cloud CLI SSH key to your local SSH agent:\n\n ```\n ssh-add ~/.ssh/google_compute_engine\n ```\n7. 
Use the following command to add the alias `cluster_diag` for\n `cluster_diag.py`:\n\n ```\n alias cluster_diag=\"python3 cli/cluster_diag.py\"\n ```\n\nPerform a health check\n----------------------\n\nAfter you've installed CHS, do the following to check the health of your\nGPU cluster:\n\n1. Go to the **Compute Engine** \\\u003e **VM instances** page.\n\n [Go to the VM instances page](https://console.cloud.google.com/compute/instances)\n2. Locate the login node. It might have a name with the pattern\n \u003cvar translate=\"no\"\u003eDEPLOYMENT_NAME\u003c/var\u003e +`login-001`.\n\n3. From the **Connect** column of the login node, click **SSH**.\n\n4. Verify that you're in the root directory for the repository.\n\n5. Use the following command to check the current status of your\n cluster:\n\n ```\n cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE status\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eORCHESTRATOR\u003c/var\u003e: either `gke` or `slurm`, depending on which orchestrator you're using.\n - \u003cvar translate=\"no\"\u003eGPU_TYPE\u003c/var\u003e: the GPU machine type that you're using, which can be one of the following values:\n - `a4-highgpu-8g`\n - `a3-ultragpu-8g`\n - `a3-megagpu-8g`\n - `a3-highgpu-8g`\n - `a3-highgpu-4g`\n - `a3-highgpu-2g`\n - `a3-highgpu-1g`\n6. Use the following command to check the health of individual GPUs within your\n cluster:\n\n ```\n cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE gpu\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eORCHESTRATOR\u003c/var\u003e: either `gke` or `slurm`, depending on which orchestrator you're using.\n - \u003cvar translate=\"no\"\u003eGPU_TYPE\u003c/var\u003e: the GPU machine type that you're using, which can be one of the following values:\n - `a4-highgpu-8g`\n - `a3-ultragpu-8g`\n - `a3-megagpu-8g`\n - `a3-highgpu-8g`\n - `a3-highgpu-4g`\n - `a3-highgpu-2g`\n - `a3-highgpu-1g`\n7. Optional: use the following template command to run additional checks.\n Consider adding the `--run_only_on_available_nodes` flag to skip unavailable\n nodes:\n\n ```\n cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE CHECK\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eORCHESTRATOR\u003c/var\u003e: either `gke` or `slurm`, depending on which orchestrator you're using.\n - \u003cvar translate=\"no\"\u003eGPU_TYPE\u003c/var\u003e: the GPU machine type that you're using, which can be one of the following values:\n - `a4-highgpu-8g`\n - `a3-ultragpu-8g`\n - `a3-megagpu-8g`\n - `a3-highgpu-8g`\n - `a3-highgpu-4g`\n - `a3-highgpu-2g`\n - `a3-highgpu-1g`\n - \u003cvar translate=\"no\"\u003eCHECK\u003c/var\u003e: the check that you want to run. Use one of the following options:\n - status\n - nccl\n - gpu\n - straggler\n - neper\n - tinymax\n\nCheck your configuration\n------------------------\n\nAfter you've installed CHS, do the following to check the configuration of your\ncluster:\n\n1. Verify that you're in the root directory for the repository.\n2. Use the following command to check the configuration of your cluster. 
By\n default, this command produces a diff; to skip the diff and just print the\n configuration, add the `--no-diff` flag:\n\n ```\n cluster_diag -o ORCHESTRATOR configcheck GPU_TYPE\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eORCHESTRATOR\u003c/var\u003e: either `gke` or `slurm`, depending on which orchestrator you're using.\n - \u003cvar translate=\"no\"\u003eGPU_TYPE\u003c/var\u003e: the GPU machine type that you're using, which can be one of the following values:\n - `a4-highgpu-8g`\n - `a3-ultragpu-8g`\n - `a3-megagpu-8g`\n - `a3-highgpu-8g`\n - `a3-highgpu-4g`\n - `a3-highgpu-2g`\n - `a3-highgpu-1g`\n\nThe following screenshot shows the result from a successful configuration check:\n[](/static/ai-hypercomputer/images/chs-config-check.png) A successful configuration check result (click to enlarge).\n\nWhat's next\n-----------\n\n- [Monitor VMs and clusters](/ai-hypercomputer/docs/monitor)\n- [Troubleshoot slow performance](/ai-hypercomputer/docs/troubleshooting/troubleshoot-slow-performance)"]]