If you notice issues on a reserved A4X, A4, or A3 Ultra VM that you can't resolve otherwise, such as slower performance within a cluster or consistently high GPU temperatures, then we recommend that you report its host as faulty. When you report a host as faulty, Compute Engine marks the host as faulty and then automatically starts repairs by running host maintenance. For A4 and A3 Ultra VMs, Compute Engine attempts to migrate the VM to a different host when maintenance starts, which can help minimize the downtime for your workload.
This document explains how to report and repair faulty hosts for virtual machine (VM) instances that are part of a Slurm cluster or other VM-based clusters. For Google Kubernetes Engine (GKE) clusters, see Report faulty hosts through GKE instead.
Limitations
When you report a faulty host, the following limitations apply:
- You can report a faulty host only if the VM that runs on the host meets all of the following conditions:
  - The VM is running.
  - The VM uses an A4X, A4, or A3 Ultra machine type.
  - The VM uses the reservation-bound provisioning model.
- Google Cloud makes best-effort attempts to fulfill all your report faulty host requests. However, due to capacity constraints or rate limits, a request might not always be fulfilled.
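The conditions above can be checked before you send a report. The following is a minimal sketch, not an official client: the function name is hypothetical, the input is assumed to be a dict shaped like the Compute Engine instance resource (for example, parsed from `gcloud compute instances describe --format=json`), and the machine-type name prefixes and the `RESERVATION_BOUND` value are assumptions that you should verify against your own VMs.

```python
# Hypothetical pre-check that mirrors the documented conditions.
# Assumption: machine type names for A4X, A4, and A3 Ultra start with these prefixes.
SUPPORTED_PREFIXES = ("a4x-", "a4-", "a3-ultragpu-")


def ineligibility_reasons(vm: dict) -> list[str]:
    """Return the reasons a VM fails the documented conditions (empty if none)."""
    reasons = []
    if vm.get("status") != "RUNNING":
        reasons.append("VM is not running")
    # machineType is usually a URL; keep only the last path segment.
    machine_type = vm.get("machineType", "").rsplit("/", 1)[-1]
    if not machine_type.startswith(SUPPORTED_PREFIXES):
        reasons.append("machine type is not A4X, A4, or A3 Ultra")
    # Assumption: the provisioning model is exposed under scheduling.provisioningModel.
    model = vm.get("scheduling", {}).get("provisioningModel")
    if model != "RESERVATION_BOUND":
        reasons.append("VM does not use the reservation-bound provisioning model")
    return reasons
```

An empty list means that the VM passes all three checks. It doesn't guarantee that the request is fulfilled, because capacity constraints and rate limits still apply.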
Before you begin
Select the tab for how you plan to use the samples on this page:
Console
When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.
gcloud
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
REST
To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.
Install the Google Cloud CLI. After installation, initialize the Google Cloud CLI by running the following command:
gcloud init
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
For more information, see Authenticate for using REST in the Google Cloud authentication documentation.
Required roles
To get the permissions that you need to report a faulty host, ask your administrator to grant you the following IAM roles:
- Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1) on the VM or the project
- To view the state of a faulty host report operation by using Cloud Logging: Logs Viewer (roles/logging.viewer) on the project
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to report a faulty host. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to report a faulty host:
- To create a faulty host report: compute.instances.update on the VM
- To view a list of operations by using Logging: logging.operations.list on the project
- To view the details of an operation by using Logging: logging.operations.get on the project
- To view a list of operations in Compute Engine: compute.zoneOperations.list on the project
- To view the details of an operation in Compute Engine: compute.zoneOperations.describe on the project
You might also be able to get these permissions with custom roles or other predefined roles.
Understand the faulty host report process
After you report a faulty host for a VM, the time when the VM restarts varies based on the reservation operational mode that is specified in the reservation that the VM uses. To verify the reservation operational mode for a reservation, view the reservationOperationalMode field in the reservation.
The following table summarizes the faulty host process for the two available reservation operational modes: all capacity mode and managed mode.
| | All capacity mode (ALL_CAPACITY) | Managed mode (HIGHLY_AVAILABLE_CAPACITY) |
|---|---|---|
| Supported machine types | A4X | A4 and A3 Ultra |
| Faulty host report API rate limiting | No rate limits apply. | Calls to the API may be rate-limited. |
| Faulty host report process | When you report a faulty host for a VM that runs in the all capacity mode, the following occurs: | When you report a faulty host for a VM that runs in the managed mode, the following occurs: |
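If you script around these two modes, the first two table rows can be expressed as a small lookup. This is an illustrative sketch only: the dictionary and function names are hypothetical, and the lookup covers just the machine types and rate-limiting rows shown in the table.

```python
# Lookup keyed by the reservationOperationalMode value from the reservation.
MODE_SUMMARY = {
    "ALL_CAPACITY": {
        "name": "all capacity mode",
        "machine_types": ["A4X"],
        "rate_limited": False,
    },
    "HIGHLY_AVAILABLE_CAPACITY": {
        "name": "managed mode",
        "machine_types": ["A4", "A3 Ultra"],
        "rate_limited": True,  # the table says calls "may be" rate-limited
    },
}


def describe_mode(mode: str) -> str:
    """Return a one-line summary for a reservation operational mode."""
    try:
        info = MODE_SUMMARY[mode]
    except KeyError:
        raise ValueError(f"unknown reservation operational mode: {mode!r}") from None
    types = " and ".join(info["machine_types"])
    limit = "calls may be rate-limited" if info["rate_limited"] else "no rate limits apply"
    return f"{info['name']}: {types}; {limit}"
```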
Report a faulty host
To report a faulty host, complete the following steps:
Review the host on which your VM runs.
For instructions, see View VMs topology.
Optional: Back up Local SSD data. When the VM stops, Compute Engine automatically discards the data of any Local SSD disks that are attached to the VM. You can't recover Local SSD data after Compute Engine discards it.
For instructions on how to preserve Local SSD data, see Local SSD data backup.
Report the faulty host. To report a faulty host, select one of the following options. The host repair operation starts within a minute after the report faulty host operation completes. If the VM becomes unresponsive after you start the faulty host report operation, then we recommend that you wait at least 15 minutes and then restart the VM.
gcloud
To report a faulty host, use the following gcloud compute instances report-host-as-faulty command:
gcloud compute instances report-host-as-faulty VM_NAME \
    --async \
    --disruption-schedule=IMMEDIATE \
    --fault-reasons=behavior=FAULT_REASON,description=DESCRIPTION \
    --zone=ZONE
Replace the following:
- VM_NAME: the name of the VM.
- FAULT_REASON: a list of host issues that your VM encountered, separated by commas. For example, ISSUE_1,ISSUE_2. You can specify the following values:
  - PERFORMANCE: the GPUs that are attached to the VM have performance issues compared to other GPUs in the cluster, you see no XID errors in the logs, and Compute Engine detects no other usual failure patterns such as silent data corruption.
  - SILENT_DATA_CORRUPTION: you see data corruption in your VM, but the VM keeps running. Silent data corruption can be due to issues like vCPU defects, software bugs, or kernel issues.
  - UNRECOVERABLE_GPU_ERROR: you identified an unrecoverable GPU error with an XID.
  - BEHAVIOR_UNSPECIFIED: you aren't sure what the issue with your VM is.
- DESCRIPTION: a description of the issue that is affecting your VM, such as XID information or suspected performance problems.
- ZONE: the zone where the VM exists.
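To avoid typos in the fault reason when you script this command, you can assemble the argument list in code and validate the value first. The following is a minimal sketch under stated assumptions: `build_report_command` is a hypothetical helper, it handles a single fault reason, it only builds the argument list without running it, and rejecting commas in the description is a cautious assumption about how the key-value flag is parsed, not documented behavior.

```python
import shlex

# Fault reason values listed in this document.
FAULT_REASONS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_command(vm_name: str, zone: str, fault_reason: str, description: str) -> list[str]:
    """Build (but do not run) the report-host-as-faulty command as an argv list."""
    if fault_reason not in FAULT_REASONS:
        raise ValueError(f"unsupported fault reason: {fault_reason!r}")
    # Assumption: a comma inside the description could be read as a separator.
    if "," in description:
        raise ValueError("description must not contain commas")
    return [
        "gcloud", "compute", "instances", "report-host-as-faulty", vm_name,
        "--async",
        "--disruption-schedule=IMMEDIATE",
        f"--fault-reasons=behavior={fault_reason},description={description}",
        f"--zone={zone}",
    ]


if __name__ == "__main__":
    # Hypothetical VM name and zone, printed as a shell-safe command line.
    cmd = build_report_command("my-vm", "us-central1-a", "PERFORMANCE", "slow GPU 3")
    print(shlex.join(cmd))
```

You can pass the returned list to `subprocess.run` in an environment where the gcloud CLI is installed and authenticated.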
REST
To report a faulty host, make the following POST request to the instances.reportHostAsFaulty method. When you report a faulty host, you can specify multiple fault reasons at once. For example, to specify two fault reasons, make a request as follows:
POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty
{
  "disruptionSchedule": "IMMEDIATE",
  "faultReasons": [
    {
      "behavior": "FAULT_REASON_1",
      "description": "DESCRIPTION_1"
    },
    {
      "behavior": "FAULT_REASON_2",
      "description": "DESCRIPTION_2"
    }
  ]
}
Replace the following:
- PROJECT_ID: the ID of the project where the VM exists.
- ZONE: the zone where the VM exists.
- VM_NAME: the name of the VM.
- FAULT_REASON_1 and FAULT_REASON_2: each host issue that your VM encountered. You can specify the following values:
  - PERFORMANCE: the GPUs that are attached to the VM have performance issues compared to other GPUs in the cluster, you see no XID errors in the logs, and Compute Engine detects no other usual failure patterns such as silent data corruption.
  - SILENT_DATA_CORRUPTION: you see data corruption in your VM, but the VM keeps running. Silent data corruption can be due to issues like vCPU defects, software bugs, or kernel issues.
  - UNRECOVERABLE_GPU_ERROR: you identified an unrecoverable GPU error with an XID.
  - BEHAVIOR_UNSPECIFIED: you aren't sure what the issue with your VM is.
- DESCRIPTION_1 and DESCRIPTION_2: a description for each host issue that you specified, such as XID information or suspected performance problems.
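The request URL and body can also be assembled programmatically. The sketch below is one possible approach, not an official client: `build_report_request` is a hypothetical helper that returns the URL and JSON body without sending anything, and authentication (for example, an access token from the gcloud CLI) is left out.

```python
import json

# Fault reason values listed in this document.
FAULT_REASONS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_request(
    project_id: str, zone: str, vm_name: str, faults: list[tuple[str, str]]
) -> tuple[str, str]:
    """Return (url, json_body) for a reportHostAsFaulty POST request."""
    if not faults:
        raise ValueError("at least one fault reason is required")
    for behavior, _ in faults:
        if behavior not in FAULT_REASONS:
            raise ValueError(f"unsupported fault reason: {behavior!r}")
    url = (
        "https://compute.googleapis.com/compute/v1/"
        f"projects/{project_id}/zones/{zone}/instances/{vm_name}/reportHostAsFaulty"
    )
    body = {
        "disruptionSchedule": "IMMEDIATE",
        # One entry per (behavior, description) pair, in the order given.
        "faultReasons": [
            {"behavior": behavior, "description": description}
            for behavior, description in faults
        ],
    }
    return url, json.dumps(body, indent=2)
```

Passing two pairs, such as `("PERFORMANCE", "...")` and `("UNRECOVERABLE_GPU_ERROR", "...")`, produces a body with the same shape as the two-reason request shown above.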
Review report faulty host operations
After you report a faulty host, Compute Engine starts a series of operations to mark the host as faulty and prepare the host for repair. Specifically, during a report faulty host operation, the following process happens:
Mark the host as faulty. Compute Engine creates the report faulty host operation. The report faulty host operation then creates a sequence of sub-operations. These sub-operations mark the underlying host as faulty.
Prepare the host for repairs. After all sub-operations complete, the report faulty host operation starts. Compute Engine stops the VM and starts the repair faulty host operation. Based on the reservation operational mode that is specified in the reservation that the VM uses, and if healthy hosts are available, Compute Engine either keeps the VM stopped or attempts to automatically migrate and restart the VM.
Report completion and repair the host. Compute Engine completes the report faulty host operation, and the host repair operation runs.
To track the status of the report faulty host (compute.instances.reportHostAsFaulty) operations in your project, select one of the following options. For more information about other operations that you can use to track repairs, migration, and automatic restart, see Maintenance and restart behaviors and Monitor and plan for a host maintenance event in the Compute Engine documentation.
Console (VM operations)
In the Google Cloud console, go to the Operations page.
In the table that appears, locate the VM that you reported.
In the row that contains the VM, in the Status column, you can see the status of the report faulty host operation. When the operation completes, the value is Done.
Optional: To verify if Compute Engine has restarted the VM, view the details of the VM.
Console (VM logs)
In the Google Cloud console, go to the Logs Explorer page.
Verify that the Show query toggle is set to the on position.
In the query editor, enter the following query:
resource.type="gce_instance" AND protoPayload.methodName=~"compute\.instances\.reportHostAsFaulty"
Click Run query. The Query results pane displays the query results.
gcloud
To view the status of the report faulty host operations in your project, use the gcloud compute operations list command with the --filter flag set to operationType:compute.instances.reportHostAsFaulty:
gcloud compute operations list --filter="operationType:compute.instances.reportHostAsFaulty"
If you want to view the details of a specific faulty host operation, then use the gcloud compute operations describe command:
gcloud compute operations describe OPERATION_NAME \
    --zone="ZONE"
Replace the following:
- OPERATION_NAME: the name of the operation.
- ZONE: the zone where the operation exists.
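If you capture the list output as JSON, a short script can pick out the report operations that haven't finished. This sketch assumes that you ran the list command with `--format=json`, that the output is a JSON array of operation resources with `name`, `operationType`, and `status` fields, and that a completed operation has the status `DONE`. The function name is hypothetical.

```python
import json

REPORT_OPERATION_TYPE = "compute.instances.reportHostAsFaulty"


def unfinished_report_operations(operations_json: str) -> list[str]:
    """Return names of report faulty host operations whose status isn't DONE."""
    operations = json.loads(operations_json)
    return [
        op["name"]
        for op in operations
        if op.get("operationType") == REPORT_OPERATION_TYPE
        and op.get("status") != "DONE"
    ]
```

Each returned name can then be passed as OPERATION_NAME to the describe command to inspect that operation.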
REST
To view the status of the report faulty host operations in your project, make a GET request to the zoneOperations.list method. In the request URL, include the filter query parameter set to items.operationType:compute.instances.reportHostAsFaulty.
GET https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/operations?filter=items.operationType:compute.instances.reportHostAsFaulty
Replace the following:
- PROJECT_ID: the ID of the project where the operations exist.
- ZONE: the zone where the operations exist.
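When you build this URL in code, it's safer to let a library encode the query string than to concatenate it by hand. The following sketch uses Python's standard `urllib.parse.urlencode`, which joins the filter to the path with `?` and percent-encodes the colon. The function name is hypothetical, and the default filter expression is copied from the request above.

```python
from urllib.parse import urlencode

DEFAULT_FILTER = "items.operationType:compute.instances.reportHostAsFaulty"


def build_operations_list_url(project_id: str, zone: str, filter_expr: str = DEFAULT_FILTER) -> str:
    """Return the zoneOperations.list URL with an encoded filter query parameter."""
    base = (
        "https://compute.googleapis.com/compute/v1/"
        f"projects/{project_id}/zones/{zone}/operations"
    )
    # urlencode percent-encodes reserved characters such as ":" in the value.
    return f"{base}?{urlencode({'filter': filter_expr})}"
```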
What's next?
- If you encounter issues when reporting a faulty host, then see Troubleshoot faulty host API.