Identify and address any suspected stragglers for your workload:
Complete the following steps:
Check if you can use straggler detection for your workload. To review
the limitations and requirements for using straggler detection,
see Monitor VMs and Slurm clusters.
If you can't use straggler detection, then use other options for troubleshooting slow performance.
To check if any VMs for your workload are suspected stragglers, view
straggler detection metrics.
For example, to visualize all the suspected stragglers for your project
in Cloud Monitoring, complete the following steps:
Tip: Alternatively, if you want to filter the suspected stragglers for your
project, follow the instructions to view straggler detection logs and specify
the query for logs with suspected stragglers for specific VMs.
In the Google Cloud console, go to the Dashboards page
(https://console.cloud.google.com/monitoring/dashboards).
If you use the search bar to find this page, then select the result whose subheading is
Monitoring.
In the Type section of the filters pane, click Google Services.
In the Name column, click
Cluster Director Health Monitoring.
The details page for the dashboard opens.
Use the time-range selector in the toolbar to select the time range
of the slow performance. Straggler detection typically takes up to
10 minutes to report a straggler.
To check if any VMs for your workload are suspected stragglers,
review the Straggler Detection section and check whether the
Suspected Straggler Instances table lists any VMs
for your workload.
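If you copy or export the rows of the table, the same check can be done offline. The following is a minimal Python sketch and not an official API: the row format (instance name plus detection timestamp), the function name `suspected_in_workload`, and the VM names are all hypothetical. It pads the end of the slow period by 10 minutes, on the assumption that a straggler can be reported that long after the slowdown.

```python
from datetime import datetime, timedelta

# Assumption: detection can lag the slowdown by up to 10 minutes,
# as stated in the time-range step above.
DETECTION_DELAY = timedelta(minutes=10)


def suspected_in_workload(rows, workload_vms, slow_start, slow_end):
    """Return the workload VMs flagged inside the padded time window.

    rows: iterable of (instance_name, detected_at) tuples, a hypothetical
          export of the Suspected Straggler Instances table.
    """
    window_end = slow_end + DETECTION_DELAY
    wanted = set(workload_vms)
    flagged = {
        name
        for name, detected_at in rows
        if name in wanted and slow_start <= detected_at <= window_end
    }
    return sorted(flagged)


# Illustrative data only; these VM names are made up.
rows = [
    ("vm-a", datetime(2025, 8, 25, 12, 5)),
    ("vm-b", datetime(2025, 8, 25, 12, 38)),  # inside the 10-minute padding
    ("vm-z", datetime(2025, 8, 25, 12, 10)),  # not part of the workload
    ("vm-c", datetime(2025, 8, 25, 13, 30)),  # outside the window
]
result = suspected_in_workload(
    rows,
    workload_vms=["vm-a", "vm-b", "vm-c"],
    slow_start=datetime(2025, 8, 25, 12, 0),
    slow_end=datetime(2025, 8, 25, 12, 30),
)
print(result)  # ['vm-a', 'vm-b']
```

An empty result corresponds to the case where no VMs for your workload are suspected stragglers.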
Based on the number of VMs for your workload that are suspected
stragglers, proceed as follows:
If no VMs are suspected stragglers, then verify if straggler
detection is running correctly. To verify if the straggler detection
service is running for your project, follow the instructions to
view straggler detection logs and specify the
query for all straggler detection logs in your project.
Then, proceed as follows:
If your project has no straggler detection logs even though VMs
have been running for at least 10 minutes, then the straggler
detection service is not running for your project. To resolve
this, contact Cloud Customer Care or try
again later.
Otherwise, if you've verified that straggler detection is running
for your project and your workload supports straggler detection,
then the slow performance might be caused by a different issue.
Use other options for troubleshooting slow performance.
If a small number of VMs in your workload are reported as
suspected stragglers, test migrating your workload off the
suspected VMs. Then, proceed as follows:
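If migration restores performance for your workload, then
the suspected VMs might be faulty. For each of these VMs,
follow the steps to report a faulty host, setting FAULT_REASON to
PERFORMANCE and DESCRIPTION to straggler node.
If migration doesn't restore performance, then there might be
more suspected straggler VMs or the slow performance might be
caused by a different issue. You can check if more VMs for your
workload are suspected stragglers or use other options for
troubleshooting slow performance.
If a large number of VMs in your workload are reported as
suspected stragglers, then use other options for troubleshooting
slow performance.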
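The branching in this step can be summarized as a small decision function. This is only a sketch under stated assumptions: `next_step`, `fault_reports`, and the `small_fraction` cutoff are hypothetical names, and the 5% cutoff is an arbitrary placeholder because this document does not define what counts as a small or large number of VMs. The record built by `fault_reports` only collects the two values named above; it does not call any reporting API.

```python
def next_step(suspected, total_vms, logs_found, small_fraction=0.05):
    """Map the number of suspected stragglers to the action described above.

    suspected:      number of workload VMs reported as suspected stragglers.
    total_vms:      number of VMs in the workload.
    logs_found:     whether the project has any straggler detection logs
                    (assumes VMs have been running for at least 10 minutes).
    small_fraction: hypothetical cutoff between "small" and "large".
    """
    if total_vms <= 0:
        raise ValueError("total_vms must be positive")
    if suspected == 0:
        if not logs_found:
            # Detection service is not running for the project.
            return "contact support or retry later"
        return "use other troubleshooting options"
    if suspected / total_vms <= small_fraction:
        return "test migrating off suspected VMs"
    return "use other troubleshooting options"


def fault_reports(vm_names):
    """Build one record per VM with the values named in the text."""
    return [
        {"vm": name, "FAULT_REASON": "PERFORMANCE", "DESCRIPTION": "straggler node"}
        for name in vm_names
    ]


# Illustrative values only.
print(next_step(suspected=2, total_vms=64, logs_found=True))
print(fault_reports(["vm-a"]))
```

In practice the cutoff depends on the workload size and topology, so treat `small_fraction` as a value to tune and not as a recommendation.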
Use other options for troubleshooting slow performance: If the reported
list of suspected straggler VMs is large or if removing reported straggler
VMs doesn't restore performance, use other options to troubleshoot slow
performance, such as the following:
Test clusters using cluster health scanner.
Review other metrics for performance.
Review other troubleshooting documentation. For example, see Troubleshoot GPU VMs in the Compute Engine documentation.
Last updated 2025-08-25 UTC.