This document explains how to monitor A4 and A3 Ultra virtual machine (VM) instances and clusters that are deployed on Hypercompute Cluster. For more information about Hypercompute Cluster, see Hypercompute Cluster.
By using the available metrics in this document, you can create or use prebuilt Monitoring dashboards to monitor the following:
VM and GPU performance
Network transmission efficiency
Network efficiency among blocks and sub-blocks
Machine learning (ML) workload efficiency
Monitoring this data helps you identify and troubleshoot performance bottlenecks, as well as ensure the health and stability of your VMs and clusters. To learn more about Monitoring dashboards, see Dashboards overview.
Before you begin
- When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.
Required roles
To get the permissions that you need to view and create Monitoring dashboards, ask your administrator to grant you the Monitoring Editor (roles/monitoring.editor) IAM role on the project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to view and create Monitoring dashboards. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to view and create Monitoring dashboards:
- To view dashboards: monitoring.dashboards.get on the project
- To create dashboards: monitoring.dashboards.create on the project
You might also be able to get these permissions with custom roles or other predefined roles.
Available metrics
Depending on your use case, the following metrics are available for monitoring your VMs and clusters:
To monitor the health, performance, and network performance of the GPUs attached to your VMs, see GPU metrics.
To monitor the efficiency of the GPUs in your ML workloads, see ML workload metrics.
These metrics are available in prebuilt dashboards, and you can also use them to create custom dashboards.
GPU metrics
To monitor the health, performance, and network performance of the GPUs attached to your VMs, do one or more of the following:
To monitor the health of your GPUs, use the following metrics:
Name | Metric type | Description |
---|---|---|
NVSwitch Status | instance/gpu/nvswitch_status | Whether an NVLink Switch on an NVIDIA GPU attached to a VM is encountering issues. |
VM Infra Health | instance/gpu/infra_health | The health of the cluster, block, sub-block, and host on which your GPU VMs are running. If this metric shows that a VM's infrastructure is unhealthy, then the metric also outputs the reason. |
To monitor the performance of your GPUs, use the following metrics:
Name | Metric type | Description |
---|---|---|
GPU Power Consumption | instance/gpu/power_consumption | The power, in watts, consumed by individual GPUs on the host, reported as a double value. For VMs with multiple GPUs attached, the metric provides the power consumption separately for each GPU on the host. |
SM Utilization | instance/gpu/sm_utilization | A non-zero value indicates that the streaming multiprocessors (SMs) on your GPUs are actively being used. |
GPU Temperature | instance/gpu/temperature | The temperature, in degrees Celsius, of individual GPUs on the host, reported as a double value. For VMs with multiple GPUs attached, the metric provides the temperature separately for each GPU on the host. |
GPU Thermal Margin | instance/gpu/tlimit | The thermal headroom, in degrees Celsius, that individual GPUs have before they need to slow down due to high temperature, reported as a double value. For VMs with multiple GPUs attached, the metric provides the thermal headroom separately for each GPU on the host. |
To monitor the network performance across blocks and sub-blocks, use the following metrics:
Name | Metric type | Description |
---|---|---|
Network Traffic at Inter-Block | instance/gpu/network/inter_block_tx | The number of bytes of network traffic among blocks. |
Network Traffic at Inter-Subblock | instance/gpu/network/inter_subblock_tx | The number of bytes of network traffic among sub-blocks. |
Network Traffic at Intra-Subblock | instance/gpu/network/intra_subblock_tx | The number of bytes of network traffic within a single sub-block. |
To monitor the network performance of your GPUs, use the following metrics:
Name | Metric type | Description |
---|---|---|
Throughput Rx Bytes | instance/gpu/throughput_rx_bytes | The number of bytes received from network traffic. |
Throughput TX Bytes | instance/gpu/throughput_tx_bytes | The number of bytes transmitted to network traffic. |
For an overview of available metrics in Compute Engine, see Google Cloud metrics.
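When you chart one of these metrics outside the prebuilt dashboards, you typically select it with a Monitoring filter string. The following is a minimal sketch of building such a filter from the metric types listed above. It assumes, without confirmation from this document, that the metric types live under the `compute.googleapis.com` prefix and are reported against the `gce_instance` monitored resource; `build_filter` and `GPU_METRICS` are hypothetical names used only for illustration.

```python
# Assumption: the metric types in the tables above are published under the
# compute.googleapis.com prefix. Verify this against the Google Cloud metrics list.
METRIC_PREFIX = "compute.googleapis.com"

# A few metric types copied from the tables above, keyed by a short label.
GPU_METRICS = {
    "power": "instance/gpu/power_consumption",
    "sm_utilization": "instance/gpu/sm_utilization",
    "temperature": "instance/gpu/temperature",
    "thermal_margin": "instance/gpu/tlimit",
}


def build_filter(metric_type: str, resource_type: str = "gce_instance") -> str:
    """Return a Monitoring filter string that selects a single metric type.

    The default resource type is an assumption, not something this document states.
    """
    full_type = f"{METRIC_PREFIX}/{metric_type}"
    return f'metric.type = "{full_type}" AND resource.type = "{resource_type}"'


if __name__ == "__main__":
    for label, metric_type in GPU_METRICS.items():
        print(label, "->", build_filter(metric_type))
```

Keeping the prefix in one constant makes it easy to correct if the metrics turn out to be published under a different domain.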
ML workload metrics
Before you monitor ML workload metrics, you must set up monitoring for your workload.
To monitor the efficiency of the accelerators in your ML workloads, use the following metrics:
Name | Metric type | Description |
---|---|---|
Productive time | workload/goodput_time | The time, in seconds, the workload spends on goodput activities. These activities are core, useful tasks, such as a forward or backward pass during model training. |
Non-productive time | workload/badput_time | The time, in seconds, the workload spends on badput activities. These activities are overhead tasks, such as loading or preprocessing data for training. |
For an overview of available metrics in Compute Engine, see Google Cloud metrics.
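One way to turn the two workload metrics into a single efficiency figure is to divide productive time by total time. This ratio is not defined in this document; the sketch below is an illustrative derivation, and the function name `goodput_ratio` and the sample values are hypothetical.

```python
def goodput_ratio(goodput_seconds: float, badput_seconds: float) -> float:
    """Return the fraction of workload time spent on goodput activities.

    Both arguments are durations in seconds, matching the units of
    workload/goodput_time and workload/badput_time.
    """
    if goodput_seconds < 0 or badput_seconds < 0:
        raise ValueError("durations must be non-negative")
    total = goodput_seconds + badput_seconds
    if total == 0:
        raise ValueError("no workload time recorded")
    return goodput_seconds / total


if __name__ == "__main__":
    # Made-up sample: 5400 s of training passes and 600 s of data loading.
    print(f"{goodput_ratio(5400, 600):.0%}")
```

A ratio that trends downward over time suggests that overhead tasks, such as data loading, are taking a growing share of the workload's run time.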
Visualize metrics
To monitor metrics data of your VMs and clusters using Monitoring dashboards, use one of the following methods:
For a quick overview of health and performance, or to customize an existing dashboard, use prebuilt dashboards.
For specific monitoring needs, create custom dashboards.
If you encounter issues when using a dashboard, see Troubleshoot slow performance or errors in this document.
Use prebuilt dashboards
You can monitor your VMs and clusters using prebuilt Monitoring dashboards for Hypercompute Cluster. You can also create a copy of a prebuilt dashboard and modify it to fit your needs.
To use a prebuilt Monitoring dashboard, do the following:
- In the Google Cloud console, go to the Dashboards page.
  If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- In the Categories pane, click GCP.
- In the Name column, click one of the following dashboards based on the metrics that you want to monitor:
  - To monitor VM and GPU performance, click Hypercompute Cluster Health Monitoring.
  - To monitor network transmission efficiency, click Hypercompute Cluster Transmission Efficiency.
  - To monitor network efficiency among blocks and sub-blocks, click Hypercompute Cluster Block Network.
  The details page of your chosen dashboard opens.
- Optional: To create a copy of a dashboard and customize it to fit your needs, click Copy dashboard.
Create custom dashboards
To create a custom Monitoring dashboard, do the following:
Choose the metrics to monitor. If you haven't already, then see Available metrics in this document.
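After you choose metrics, a custom dashboard can also be described as a JSON definition. The following is a minimal sketch that builds such a definition with one line chart per metric, using the `gridLayout` and `xyChart` structures of the Cloud Monitoring dashboards API. It assumes the metric types are published under the `compute.googleapis.com` prefix, which this document does not confirm; the helper names, the dashboard title, and the 60-second mean alignment are illustrative choices only.

```python
import json

# Assumption: the GPU metric types are published under this prefix.
METRIC_PREFIX = "compute.googleapis.com"


def xy_chart_widget(title: str, metric_type: str) -> dict:
    """Return one line-chart widget that plots a single metric type."""
    return {
        "title": title,
        "xyChart": {
            "dataSets": [
                {
                    "timeSeriesQuery": {
                        "timeSeriesFilter": {
                            "filter": f'metric.type = "{METRIC_PREFIX}/{metric_type}"',
                            # Illustrative choice: average each series over 60 s windows.
                            "aggregation": {
                                "alignmentPeriod": "60s",
                                "perSeriesAligner": "ALIGN_MEAN",
                            },
                        }
                    }
                }
            ]
        },
    }


def build_dashboard(display_name: str, charts: list) -> dict:
    """Return a dashboard definition with one widget per (title, metric_type) pair."""
    return {
        "displayName": display_name,
        "gridLayout": {
            "columns": "2",
            "widgets": [xy_chart_widget(title, metric) for title, metric in charts],
        },
    }


if __name__ == "__main__":
    dashboard = build_dashboard(
        "GPU overview (example)",
        [
            ("GPU Temperature", "instance/gpu/temperature"),
            ("GPU Thermal Margin", "instance/gpu/tlimit"),
        ],
    )
    print(json.dumps(dashboard, indent=2))
```

Creating a dashboard from a definition like this requires the monitoring.dashboards.create permission listed earlier in this document.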
Troubleshoot slow performance or errors
If you experience slow performance or errors in your jobs or workloads, then you can troubleshoot them by doing the following:
- In the Google Cloud console, go to the Dashboards page.
  If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- In the Categories pane, click GCP.
- To learn more about the metrics, do the following:
  - For network efficiency among blocks and sub-blocks, click GCE Interactive Playbook - Hypercompute Cluster Block Network.
  - For VM and GPU performance, and network transmission efficiency, click GCE Interactive Playbook - Hypercompute Cluster Health Monitoring.
- Optional: To create a copy of a dashboard and customize it to fit your needs, click Copy dashboard.
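When a job is slow, a quick first pass over the GPU metrics can narrow down where to look in the playbooks. The following is a rough sketch of such a check, based only on the metric descriptions in this document: an SM utilization of zero means the SMs are not in use, and a small thermal margin means the GPU is close to slowing down. The `GpuSample` type, the `triage` function, the 5-degree threshold, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """Latest readings for one GPU (hypothetical container for illustration)."""
    gpu_id: str
    sm_utilization: float    # from instance/gpu/sm_utilization
    thermal_margin_c: float  # from instance/gpu/tlimit, in degrees Celsius


def triage(samples, min_margin_c: float = 5.0):
    """Return human-readable findings for GPUs that look idle or close to throttling.

    min_margin_c is an arbitrary example threshold, not a documented limit.
    """
    findings = []
    for s in samples:
        if s.sm_utilization == 0:
            findings.append(f"{s.gpu_id}: SMs idle (sm_utilization is 0)")
        if s.thermal_margin_c <= min_margin_c:
            findings.append(
                f"{s.gpu_id}: thermal margin {s.thermal_margin_c:.1f} C "
                f"is at or below {min_margin_c:.1f} C"
            )
    return findings


if __name__ == "__main__":
    # Made-up readings for three GPUs on one host.
    readings = [
        GpuSample("gpu0", 0.0, 20.0),
        GpuSample("gpu1", 75.0, 3.0),
        GpuSample("gpu2", 80.0, 25.0),
    ]
    for line in triage(readings):
        print(line)
```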