This document outlines how to report a faulty host machine that is running your artificial intelligence (AI), machine learning (ML) and high performance computing (HPC) workloads. A host, also known as a node, is the physical machine on which your virtual machine (VM) instances are running.
This document is for Slurm and other VM-based clusters. For Google Kubernetes Engine clusters, see Report a faulty host in the Manage AI-optimized GKE clusters document.
Before you begin
Select the tab for how you plan to use the samples on this page:
gcloud
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
REST
To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.
After installing the Google Cloud CLI, initialize it by running the following command:
gcloud init
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
For more information, see Authenticate for using REST in the Google Cloud authentication documentation.
Overview
To report a faulty host, the VM instance must meet the following requirements:
- It must be in a RUNNING state. If you try to report a faulty host after deleting the VM, an error message is returned, and the host machine won't be marked as faulty.
- The VM must be running on an A4 or A3 Ultra machine type.
- The VM must be a part of a reserved block of capacity. If you want to report a faulty host for A4 or A3 Ultra VMs that aren't part of a reserved block, contact your Google Cloud account team.
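The requirements above can be checked before you submit a report. The following is a minimal sketch, not an official client: the function name is hypothetical, and the machine type names `a4-highgpu-8g` and `a3-ultragpu-8g` are assumptions about how A4 and A3 Ultra VMs are named in your project.

```python
# Hypothetical pre-check for the requirements listed above.
# ASSUMPTION: these machine type names correspond to A4 and A3 Ultra VMs.
SUPPORTED_MACHINE_TYPES = {"a4-highgpu-8g", "a3-ultragpu-8g"}


def unmet_requirements(status: str, machine_type: str, in_reserved_block: bool) -> list[str]:
    """Return the requirements that the VM does not meet (empty if eligible)."""
    problems = []
    if status != "RUNNING":
        problems.append("VM must be in the RUNNING state")
    if machine_type not in SUPPORTED_MACHINE_TYPES:
        problems.append("VM must use an A4 or A3 Ultra machine type")
    if not in_reserved_block:
        problems.append("VM must be part of a reserved block of capacity")
    return problems
```

An empty list means that the VM meets all three requirements. A VM that isn't part of a reserved block still needs to go through your Google Cloud account team.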
Limitations
Google makes best-effort attempts to fulfill all requests to report faulty hosts. However, due to capacity constraints or rate limits, your request might not always be fulfilled.
How it works
When you report a host as faulty, the following takes place:
- The report faulty host operation starts on the VM instance. At this stage, the VM stays in a RUNNING state. This operation takes 10 to 12 minutes. For more details, see Review operations.
- The VM instance shuts down.
- Depending on the setting for the automatically restart (automaticRestart) host maintenance option, one of the following takes place:
  - If the VM isn't configured to automatically restart, then the VM stays shut down.
  - If the VM is configured to automatically restart, then the VM is restarted as follows:
    - If healthy hosts are available in your reserved block of capacity, the VM instance goes into the RUNNING state on a new host from your reserved block of capacity. In parallel, Compute Engine also attempts to update your reserved block of capacity by replacing your faulty host with a new host.
    - If your reserved block of capacity is depleted, the VM instance stays in the REPAIRING state until you obtain more capacity. For a VM instance in the REPAIRING state, you can either leave it in that state or shut down the VM. If the VM is powered on again, it might return a stockout error because of a lack of a host machine.
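The branching described in this section can be summarized as a small decision function. This is an illustrative sketch only: the function name is hypothetical, and it models the documented outcomes without calling any Google Cloud API.

```python
def expected_vm_state(automatic_restart: bool, healthy_host_available: bool) -> str:
    """Return the state a VM is expected to settle in after its host is reported."""
    if not automatic_restart:
        # Without automatic restart, the VM stays shut down (TERMINATED).
        return "TERMINATED"
    if healthy_host_available:
        # The VM restarts on a healthy host from the same reserved block.
        return "RUNNING"
    # The reserved block is depleted, so the VM waits for capacity.
    return "REPAIRING"
```

For example, `expected_vm_state(True, False)` returns `"REPAIRING"`, which corresponds to a depleted reserved block.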
Report a faulty host
To report a faulty host, complete the following steps:
- Ensure that you have thoroughly investigated your environment to identify the root cause of your issues. Keep in mind that this process terminates and migrates VMs, which disrupts your workloads.
- Take note of the physical host on which the VM is running. To do this, see Review the physical host information.
- Back up Local SSD data. When the VM instance shuts down and is moved to a new host machine, the Local SSD data is deleted. To back up your Local SSD data, see Local SSD data backup.
- Report the faulty host. To report a faulty host, select one of the following options:
gcloud
To report a faulty host, use the gcloud compute instances report-host-as-faulty command with the following flags.
gcloud compute instances report-host-as-faulty VM_NAME \
    --fault-reasons=behavior=FAULT_REASON,description=DESCRIPTION \
    --disruption-schedule=IMMEDIATE
Replace the following:
- VM_NAME: the name of the VM instance.
- FAULT_REASON: the issue with the host. You can specify one or more of the following values for the fault reason. If specifying multiple issues, use a comma-separated list:
  --fault-reasons=behavior=FAULT_REASON_1,FAULT_REASON_2
  - PERFORMANCE: use this value if GPUs on a VM are performing slower than other GPUs in the cluster, you don't see any XID errors in the logs, and none of the other usual failure patterns, such as silent data corruption, are detected.
  - SILENT_DATA_CORRUPTION: use this value if you see data corruption but no system crash. This data corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term is used to refer to hardware-induced defects.
  - UNRECOVERABLE_GPU_ERROR: use this value if you identified an unrecoverable GPU error with an XID for a VM.
  - BEHAVIOR_UNSPECIFIED: use this value if you are not sure what is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
- DESCRIPTION (optional): additional details on the failure, such as XID information or suspected performance problems.
- DISRUPTION_SCHEDULE: specifies when to replace the host. Only the value IMMEDIATE is supported.
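If you script this step, it can help to validate the fault reason before invoking the CLI. The following sketch assembles the argument list for the command shown above. The helper name is hypothetical, and it uses only the flags that appear in this document. It builds the command but doesn't run it.

```python
import shlex

# Fault reason values listed in this document.
VALID_BEHAVIORS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_command(vm_name: str, fault_reason: str, description: str = "") -> list[str]:
    """Build the gcloud argument list for reporting a faulty host (hypothetical helper)."""
    if fault_reason not in VALID_BEHAVIORS:
        raise ValueError(f"unsupported fault reason: {fault_reason}")
    fault = f"behavior={fault_reason}"
    if description:
        # ASSUMPTION: the description contains no commas, which would be
        # ambiguous inside the key=value flag syntax.
        fault += f",description={description}"
    return [
        "gcloud", "compute", "instances", "report-host-as-faulty", vm_name,
        f"--fault-reasons={fault}",
        "--disruption-schedule=IMMEDIATE",
    ]


# Print a shell-quoted command line for review before running it.
print(shlex.join(build_report_command("my-vm", "UNRECOVERABLE_GPU_ERROR", "XID 79 observed")))
```

Returning a list instead of a single string lets you pass the result directly to `subprocess.run` without shell quoting issues. The VM name `my-vm` and the description text are placeholders.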
REST
To report a faulty host, make a POST request to the instances.reportHostAsFaulty method.
POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty
{
  "faultReasons": [
    {
      "behavior": "FAULT_REASON",
      "description": "DESCRIPTION"
    }
  ],
  "disruptionSchedule": "IMMEDIATE"
}
Replace the following:
- PROJECT_ID: your project ID.
- VM_NAME: the name of the VM instance.
- ZONE: the zone where the VM is located.
- FAULT_REASON: the issue with the host. You can specify one or more of the following values for the fault reason:
  - PERFORMANCE: use this value if GPUs on a VM are performing slower than other GPUs in the cluster, you don't see any XID errors in the logs, and none of the other usual failure patterns, such as silent data corruption, are detected.
  - SILENT_DATA_CORRUPTION: use this value if you see data corruption but no system crash. This data corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term is used to refer to hardware-induced defects.
  - UNRECOVERABLE_GPU_ERROR: use this value if you identified an unrecoverable GPU error with an XID for a VM.
  - BEHAVIOR_UNSPECIFIED: use this value if you are not sure what is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
- DESCRIPTION (optional): additional details on the failure, such as XID information or suspected performance problems.
- DISRUPTION_SCHEDULE: specifies when to replace the host. Only the value IMMEDIATE is supported.
The output resembles the following:
Http Status 200 Created Header: Location:"/instances/VM_NAME"
When making a request, you can report multiple issues at a time as follows:
POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty
{
  "faultReasons": [
    {
      "behavior": "FAULT_REASON",
      "description": "DESCRIPTION"
    },
    {
      "behavior": "FAULT_REASON",
      "description": "DESCRIPTION"
    }
  ],
  "disruptionSchedule": "IMMEDIATE"
}
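To generate the request body programmatically, for example when reporting several issues at once, you could build the URL and JSON payload as in the following sketch. The function name and the sample project, zone, and VM values are hypothetical. The URL uses the `instances` path segment, as in the instances.get URL elsewhere on this page, and that choice is an assumption. The sketch only constructs the request and leaves authentication and sending to your HTTP client.

```python
import json

API_ROOT = "https://compute.googleapis.com/compute/v1"


def build_report_request(project_id: str, zone: str, vm_name: str,
                         faults: list[tuple[str, str]]) -> tuple[str, str]:
    """Return (url, json_body) for a reportHostAsFaulty request (hypothetical helper).

    `faults` is a list of (behavior, description) pairs.
    """
    if not faults:
        raise ValueError("at least one fault reason is required")
    url = (f"{API_ROOT}/projects/{project_id}/zones/{zone}"
           f"/instances/{vm_name}/reportHostAsFaulty")
    body = {
        "faultReasons": [
            {"behavior": behavior, "description": description}
            for behavior, description in faults
        ],
        # Only IMMEDIATE is supported according to this document.
        "disruptionSchedule": "IMMEDIATE",
    }
    return url, json.dumps(body, indent=2)


url, body = build_report_request(
    "my-project", "us-central1-a", "my-vm",
    [("PERFORMANCE", "GPU 3 slower than peers"),
     ("UNRECOVERABLE_GPU_ERROR", "XID error in logs")],
)
print(url)
print(body)
```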
Review operations
After you report a faulty host, the following sequence of operations takes place during the VM shutdown and restart.
The VM shutdown
- A reportHostAsFaulty operation is created.
- The reportHostAsFaulty operation creates a sequence of sub-operations that work to mark the underlying host machine as faulty. When all these sub-operations are complete, the reportHostAsFaulty operation goes into RUNNING mode.
- Once the reportHostAsFaulty operation goes into RUNNING mode, an upcomingMaintenance operation is then created to log the upcoming maintenance event.
- Then an instance terminated during maintenance operation is created as the VM is terminated. After this step, the reportHostAsFaulty operation completes.
It takes about 10-12 minutes for all these operations to take place. Throughout this time the VM is in the RUNNING state.
The VM restart
- Based on the VM's maintenance configuration for automaticRestart, one of the following occurs:
  - If automaticRestart is set to true, the Automatically restart an instance operation is created as Compute Engine attempts to restart the VM on another host machine.
  - If automaticRestart is set to false, the VM stays in the TERMINATED state. You can manually restart the VM. Compute Engine provisions the VM on a healthy machine within the same block.
- To confirm that the VM has moved to a different host, review the physicalHost value for the VM instance. To do this, see Review the physical host information.
To review operations, you can use one of the following options from the Google Cloud console.
VM operations
In the Google Cloud console, go to the Operations page.
Locate the VM that you reported.
If the VM is powered down and the host is reported, then the Status column shows Done for the VM. This page doesn't track whether the VM is restarted on a new host.
To see if the VM restarted, go to the VM instances page.
Cloud Logs
In the Google Cloud console, go to the Logs Explorer page.
If you use the search bar to find this page, select the result whose subheading is Logging. Your most recent logs are displayed in the Query results pane.
In the toolbar, ensure that Show query is enabled.
Copy and paste the following query into the query box:
resource.type="gce_instance" AND protoPayload.methodName=~"compute\.instances\.(reportHostAsFaulty|terminateOnHostMaintenance|upcomingMaintenance|automaticRestart)"
Click Run query. The results of the query are displayed in the Query results pane.
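If you only care about a subset of these operations, you can generate the same filter from a list of method names. The following sketch is illustrative and the function name is hypothetical. With the default list, it reproduces the query shown above.

```python
# Method name suffixes used in the query from this document.
DEFAULT_METHODS = [
    "reportHostAsFaulty",
    "terminateOnHostMaintenance",
    "upcomingMaintenance",
    "automaticRestart",
]


def build_log_filter(methods: list[str] = DEFAULT_METHODS) -> str:
    """Build a Logs Explorer filter that matches the given instance operations."""
    if not methods:
        raise ValueError("at least one method name is required")
    alternation = "|".join(methods)
    # The regular expression escapes the dots in "compute.instances.".
    return (
        'resource.type="gce_instance" AND '
        r'protoPayload.methodName=~"compute\.instances\.(' + alternation + ')"'
    )


print(build_log_filter(["reportHostAsFaulty", "automaticRestart"]))
```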
Review the physical host information
You can check the physical host for a VM by running the following command.
gcloud
After installing the Google Cloud CLI, initialize it by running the following command:
gcloud init
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
Use the gcloud compute instances describe command to view the physical host that a VM is running on.
gcloud beta compute instances describe VM_NAME \
    --zone=ZONE \
    --format="yaml(resourceStatus.physicalHost)"
Replace the following:
- VM_NAME: the name of the VM instance.
- ZONE: the zone where the VM is located.
REST
Use the instances.get method to view the physical host that a VM is running on.
GET https://compute.googleapis.com/compute/beta/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME
In the output, review the value of the physicalHost field.
Replace the following:
- PROJECT_ID: your project ID.
- VM_NAME: the name of the VM instance.
- ZONE: the zone where the VM is located.
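To confirm that a VM moved to a different host, you can compare the physicalHost value recorded before the report with the value after the restart. The following sketch parses the JSON returned by instances.get. The function names and sample host values are hypothetical. The `resourceStatus.physicalHost` path is taken from the `--format` expression in the gcloud sample above.

```python
import json
from typing import Optional


def physical_host(instance_json: str) -> Optional[str]:
    """Extract resourceStatus.physicalHost from an instances.get response body."""
    instance = json.loads(instance_json)
    return instance.get("resourceStatus", {}).get("physicalHost")


def moved_to_new_host(before: Optional[str], after: Optional[str]) -> bool:
    """Return True only if both values are known and they differ."""
    return before is not None and after is not None and before != after


# Hypothetical responses captured before the report and after the restart.
before = physical_host('{"name": "my-vm", "resourceStatus": {"physicalHost": "/block-a/sub-1/host-7"}}')
after = physical_host('{"name": "my-vm", "resourceStatus": {"physicalHost": "/block-a/sub-2/host-3"}}')
print(moved_to_new_host(before, after))
```

Record the value before you report the host, because the VM must be in the RUNNING state at that point and the earlier value can't be recovered afterward from this field.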
What's next?
- If you encounter issues when working with this API, see Troubleshoot faulty host API.