This document explains how to troubleshoot slow performance that you've identified for workloads that run on AI-optimized VMs or clusters.
To learn how to identify slow performance, see Monitor VMs and Slurm clusters.
Identify and address any suspected stragglers for your workload. To do so, complete the following steps:
Check if you can use straggler detection for your workload. To review the limitations and requirements for using straggler detection, see Monitor VMs and Slurm clusters.
If you can't use straggler detection, then use other options for troubleshooting slow performance.
To check if any VMs for your workload are suspected stragglers, view straggler detection metrics.
For example, to visualize all the suspected stragglers for your project in Cloud Monitoring, complete the following steps:
In the Google Cloud console, go to the Dashboards page.
If you use the search bar to find this page, then select the result whose subheading is Monitoring.
In the Type section of the filters pane, click Google Services.
In the Name column, click Cluster Director Health Monitoring.
The details page for the dashboard opens.
Use the time-range selector in the toolbar to select the time range of the slow performance. Straggler detection typically takes up to 10 minutes to report a straggler.
To check if any VMs for your workload are suspected stragglers, review the Straggler Detection section. Specifically, check to see if the Suspected Straggler Instances table lists any VMs for your workload.
Based on the number of VMs for your workload that are suspected stragglers, proceed as follows:
If no VMs are suspected stragglers, then verify if straggler detection is running correctly. To verify if the straggler detection service is running for your project, follow the instructions to view straggler detection logs and specify the query for all straggler detection logs in your project. Then, proceed as follows:
If your project doesn't have straggler detection logs while VMs are running for at least 10 minutes, then the straggler detection service is not running for your project. To resolve this, contact Cloud Customer Care or try again later.
Otherwise, if you've verified that straggler detection is running for your project and your workload supports straggler detection, then the slow performance might be caused by a different issue. Use other options for troubleshooting slow performance.
If a small number of VMs in your workload are reported as suspected stragglers, test migrating your workload off of the suspected VMs. Then, proceed as follows:
If migration restores performance for your workload, then the suspected VMs might be faulty. For each of these VMs, follow the steps to report a faulty host, and set the FAULT_REASON as "STRAGGLER".
If migration doesn't restore performance, then there might be more suspected straggler VMs or the slow performance might be caused by a different issue. You can check if more VMs for your workload are suspected stragglers or use other options for troubleshooting slow performance.
If a large number of VMs in your workload are reported as suspected stragglers, then use other options for troubleshooting slow performance.
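The decision steps above can be summarized as a small triage routine. The following Python is only a minimal sketch of that logic, not part of any Google Cloud tooling: the names `NextStep`, `triage`, `after_migration`, and `SMALL_FRACTION` are hypothetical, and the 5% cutoff between a "small" and a "large" number of suspected stragglers is an assumption, because this document doesn't define one. The `logs_found` argument is assumed to reflect a log check made after the VMs have been running for at least 10 minutes.

```python
from enum import Enum


class NextStep(Enum):
    """Hypothetical labels for the outcomes described in the steps above."""
    CONTACT_SUPPORT = "contact Cloud Customer Care or try again later"
    TEST_MIGRATION = "test migrating the workload off the suspected VMs"
    REPORT_FAULTY_HOST = 'report each suspected VM as a faulty host with FAULT_REASON "STRAGGLER"'
    RECHECK_OR_OTHER_OPTIONS = "check for more suspected stragglers or use other troubleshooting options"
    OTHER_OPTIONS = "use other options for troubleshooting slow performance"


# Assumption: the text only distinguishes "small" from "large";
# this cutoff is illustrative and should be tuned for your workload.
SMALL_FRACTION = 0.05


def triage(suspected: int, total_vms: int, logs_found: bool) -> NextStep:
    """Pick the next step from the number of suspected straggler VMs.

    logs_found: whether straggler detection logs exist for the project,
    checked after the VMs have run for at least 10 minutes. The sketch also
    assumes the workload supports straggler detection.
    """
    if total_vms <= 0 or not 0 <= suspected <= total_vms:
        raise ValueError("expected total_vms > 0 and 0 <= suspected <= total_vms")
    if suspected == 0:
        # No suspects: either the service isn't running or the cause lies elsewhere.
        return NextStep.OTHER_OPTIONS if logs_found else NextStep.CONTACT_SUPPORT
    if suspected / total_vms <= SMALL_FRACTION:
        return NextStep.TEST_MIGRATION
    return NextStep.OTHER_OPTIONS


def after_migration(performance_restored: bool) -> NextStep:
    """Follow-up step once the workload has been moved off the suspected VMs."""
    if performance_restored:
        return NextStep.REPORT_FAULTY_HOST
    return NextStep.RECHECK_OR_OTHER_OPTIONS


if __name__ == "__main__":
    step = triage(suspected=2, total_vms=64, logs_found=True)
    print(step.value)
```

Keeping the cutoff in a single constant makes the one undefined judgment call in the procedure explicit and easy to adjust.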
Use other options for troubleshooting slow performance: If the reported list of suspected straggler VMs is large or if removing reported straggler VMs doesn't restore performance, use other options to troubleshoot slow performance, such as the following:
- Test clusters using cluster health scanner.
- Review other metrics for performance.
- Review other troubleshooting documentation. For example, see Troubleshoot GPU VMs in the Compute Engine documentation.