Troubleshoot slow performance

This document explains how to troubleshoot slow performance that you've identified for workloads that run on AI-optimized VMs or clusters.

To learn how to identify slow performance, see Monitor VMs and Slurm clusters.

  1. Identify and address any suspected stragglers for your workload. Complete the following steps:

    1. Check if you can use straggler detection for your workload. To review the limitations and requirements for using straggler detection, see Monitor VMs and Slurm clusters.

      If you can't use straggler detection, then use other options for troubleshooting slow performance.

    2. To check if any VMs for your workload are suspected stragglers, view straggler detection metrics.

      For example, to visualize all the suspected stragglers for your project in Cloud Monitoring, complete the following steps:

      1. In the Google Cloud console, go to the Dashboards page:

        Go to Dashboards

        If you use the search bar to find this page, then select the result whose subheading is Monitoring.

      2. In the Type section of the filters pane, click Google Services.

      3. In the Name column, click Cluster Director Health Monitoring.

        The details page for the dashboard opens.

      4. Use the time-range selector in the toolbar to select the time range of the slow performance. Straggler detection typically takes up to 10 minutes to report a straggler.

      5. To check if any VMs for your workload are suspected stragglers, review the Straggler Detection section. Specifically, check whether the Suspected Straggler Instances table lists any VMs for your workload.
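When you choose the time range in the dashboard, it can help to account for the reporting delay mentioned in the time-range step. The following is a minimal sketch, not an official tool: the `dashboard_time_range` helper and the example timestamps are hypothetical, and extending the end of the window by the full 10 minutes is an assumption based on the "typically takes up to 10 minutes" figure.

```python
from datetime import datetime, timedelta, timezone

# Assumption: stragglers are typically reported within 10 minutes,
# so the window is extended by that amount past the end of the slowdown.
REPORTING_DELAY = timedelta(minutes=10)


def dashboard_time_range(slow_start, slow_end, delay=REPORTING_DELAY):
    """Return the (start, end) range to select in the dashboard.

    The end is pushed out by `delay` so that stragglers reported
    shortly after the slowdown still fall inside the selected range.
    """
    if slow_end < slow_start:
        raise ValueError("slow_end must not be earlier than slow_start")
    return slow_start, slow_end + delay


# Hypothetical slowdown observed between 09:00 and 09:30 UTC.
start, end = dashboard_time_range(
    datetime(2025, 1, 15, 9, 0, tzinfo=timezone.utc),
    datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(start.isoformat(), "to", end.isoformat())
```

You can then enter the resulting start and end times as a custom range in the time-range selector.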

    3. Based on the number of VMs for your workload that are suspected stragglers, proceed as follows:

  2. Use other options for troubleshooting slow performance. If the reported list of suspected straggler VMs is large, or if removing the reported straggler VMs doesn't restore performance, then use other options to troubleshoot slow performance, such as the following: