Manage GKE node disruption for GPUs and TPUs


During the lifecycle of a long-running GKE cluster, workloads experience periodic disruptions because of automatic maintenance events that Google Cloud issues for the Compute Engine resources that underlie the GKE infrastructure. This page helps you understand what node disruption means in GKE, monitor maintenance notifications, and minimize the impact of disruption on your GKE nodes with attached GPUs and TPUs.

This document is for Platform admins and operators who manage the lifecycle of the underlying tech infrastructure. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.

What does node disruption mean in GKE?

Your GKE clusters manage the lifecycle of the GKE nodes. These nodes are provisioned on Compute Engine VMs, which periodically experience host events that are caused by a variety of reasons, such as the following:

  • Hardware or software updates and maintenance. Automatic maintenance events are issued by Google Cloud.
  • Hardware failures.
  • Upgrades to Compute Engine VMs. These upgrades are different from upgrades to the GKE node or the node pool version.

Host events are issued for the underlying Google Cloud infrastructure and they bypass GKE maintenance policies and exclusions. During the lifecycle of a long-running GKE cluster, the nodes might experience periodic disruptions to training or serving workloads. When these disruptions affect your GKE nodes that run AI/ML workloads, GKE needs to restart both the running workloads and the underlying node.

Why GPUs and TPUs require disruption management

Most Compute Engine VMs, with some exceptions, have their host maintenance policy set to live migrate, which means that running workloads typically experience little to no disruption. However, certain classes of VMs don't support live migration, including VMs with attached GPUs and TPUs. When a host event happens to a VM within a TPU slice, the entire slice is interrupted and then rescheduled, because all maintenance events are coordinated at the slice level. So if you create a TPU slice that has hundreds of VMs, all of those VMs receive the same maintenance event schedule.

When a host event occurs, GKE terminates the node and its Pods. If the Pods are deployed as part of a larger workload, like a Job or Deployment, GKE restarts the Pods on the affected node.

It is up to you, or the frameworks that you use, to handle the workload configuration to react appropriately to maintenance events. For example, you can save the state of your AI training job to reduce data loss.

To manage disruption to AI/ML workloads, you can monitor maintenance notifications, minimize the impact of disruption, and configure GKE to terminate your workloads gracefully, as described in the following sections.

Monitor maintenance notifications

Compute Engine issues notifications when nodes and their underlying VMs are scheduled for disruptive host events, and when these events become active. The notifications include information about planned start time, the type of event, and other details.

On GKE version 1.31.1-gke.2008000 and later, you can monitor upcoming maintenance events, including the events that are described in this section.

Upcoming maintenance is scheduled but not active

Before a VM with attached GPUs or TPUs undergoes a scheduled maintenance event, Compute Engine pushes notifications to the affected VMs. These notifications report the start of the maintenance window. When upcoming maintenance is scheduled for the VM but not yet active, GKE adds the cloud.google.com/scheduled-maintenance-time label to the node.

To query these notifications at the node level, run the following command:

kubectl get nodes -l cloud.google.com/scheduled-maintenance-time \
    -L cloud.google.com/scheduled-maintenance-time

The output is similar to the following:

NAME                         STATUS    SCHEDULED-MAINTENANCE-TIME
<gke-accelerator-node-name>  Ready     1733083200
<gke-accelerator-node-name>  Ready     1733083200
[...]

The SCHEDULED-MAINTENANCE-TIME column shows the scheduled start time in seconds, displayed in Unix epoch time format.
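To convert this value to a human-readable date, you can use the date utility, for example with GNU date:

# Convert the Unix epoch timestamp from the SCHEDULED-MAINTENANCE-TIME column.
date -d @1733083200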

To query these notifications at the VM metadata level, check the instance metadata for a maintenance event notification.
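For example, the following sketch, run from the node or from a Pod that can reach the metadata server, reads the maintenance-event metadata key; the value is NONE when no maintenance event is pending:

# Read the maintenance-event metadata key from the Compute Engine metadata server.
curl -s "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event" \
    -H "Metadata-Flavor: Google"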

Scheduled maintenance starts

For accelerator-optimized machine families that support advanced maintenance, you can access the upcoming-maintenance endpoint, which provides information about scheduled and started maintenance events.

When scheduled maintenance starts, Compute Engine updates the metadata in the http://metadata.google.internal/computeMetadata/v1/instance/ directory. Compute Engine updates the metadata values as follows:

  • Sets maintenance-event to TERMINATE_ON_HOST_MAINTENANCE.
  • In upcoming-maintenance, sets maintenance_status to ONGOING.
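For example, the following sketch reads the upcoming-maintenance endpoint from the node to check fields such as maintenance_status:

# Read the upcoming-maintenance endpoint for details about a scheduled or
# started maintenance event.
curl -s "http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance" \
    -H "Metadata-Flavor: Google"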

GKE then gracefully evicts Pods and terminates workloads within the maximum predefined time of the maintenance notification window.

Minimize disruption impact

To minimize the impact of node disruption, you can manually start a host maintenance event.

If you don't manually start a maintenance event, Compute Engine completes the maintenance at its regularly scheduled time.

Manually start a host maintenance event

When Compute Engine issues a notification about a scheduled maintenance event, you can manually start maintenance at a time that aligns with your operational schedule, for example, during periods of reduced activity.

On a node in the node pool, set the node label cloud.google.com/perform-maintenance to true. For example:

kubectl label nodes <node-name> cloud.google.com/perform-maintenance=true

After you apply the label, GKE gracefully evicts Pods and terminates workloads, and then the perform-maintenance action starts the host maintenance event. The duration between applying the label and the start of maintenance varies.
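The following is a minimal sketch that labels every node currently reporting a scheduled maintenance window; adapt the node selection to your own operational schedule:

# Label each node that has a scheduled maintenance window so that
# maintenance starts now instead of at the scheduled time.
for node in $(kubectl get nodes -l cloud.google.com/scheduled-maintenance-time -o name); do
  kubectl label "$node" cloud.google.com/perform-maintenance=true
done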

Configure GKE to terminate your workloads gracefully

In this section, you configure GKE to manage your application lifecycle and minimize the disruption to your workload. If you don't configure a grace period, the grace period defaults to 30 seconds.
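To view the API documentation for this field, including its default value, you can run the following command:

kubectl explain pod.spec.terminationGracePeriodSeconds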

GKE makes a best effort to terminate these Pods gracefully and to execute the termination action that you define, for example, saving a training state. GKE sends a SIGTERM signal to Pods at the beginning of the grace period. If Pods don't exit by the end of the grace period, GKE sends a follow-up SIGKILL signal to any processes still running in any container in the Pod.

To configure the graceful termination period, set the termination grace period in seconds in the spec.terminationGracePeriodSeconds field of your Pod manifest. For example, to give your workloads 10 minutes to shut down, set the spec.terminationGracePeriodSeconds field in your Pod manifest to 600 seconds, as follows:

    spec:
      terminationGracePeriodSeconds: 600

We recommend that you set a termination grace period that is long enough for any ongoing tasks to finish within the notification timeframe. If your workload uses an ML framework such as MaxText, Pax, or JAX with Orbax, the workload can capture the SIGTERM shutdown signal and initiate a checkpointing process. To learn more, see TPU Autocheckpoint.
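If your workload doesn't use one of these frameworks, you can handle the signal yourself. The following container entrypoint is a minimal sketch; save_checkpoint.sh and train.py are hypothetical placeholders for your own checkpointing logic and training command:

#!/bin/bash
# Minimal entrypoint sketch: run the training process in the background,
# trap SIGTERM, and run a checkpoint step before exiting.

checkpoint_and_exit() {
  ./save_checkpoint.sh            # hypothetical: replace with your checkpointing logic
  kill -TERM "$TRAIN_PID" 2>/dev/null
  wait "$TRAIN_PID"
  exit 0
}

trap checkpoint_and_exit SIGTERM

python train.py &                 # hypothetical: replace with your training command
TRAIN_PID=$!
wait "$TRAIN_PID"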

Process of graceful termination

When a disruption event begins, whether you trigger it manually or Compute Engine triggers it automatically, Compute Engine signals the impending machine shutdown by updating the maintenance-event metadata key. In both cases of impending node shutdown, GKE starts graceful termination.

The following workflow shows how GKE executes graceful node termination when there is an impending node shutdown:

  1. Within 60 seconds, the following occurs:
    1. GKE applies the cloud.google.com/active-node-maintenance node label with the value ONGOING to indicate that workloads are being stopped.
    2. GKE applies a node taint to prevent new Pods from being scheduled on the node. The taint key is cloud.google.com/impending-node-termination, with the NoSchedule effect. We recommend that you modify your workloads to tolerate this taint because of the known termination that occurs; see the toleration sketch after this list.
  2. The maintenance-handler component begins to evict Pods, first evicting workload Pods and then evicting system Pods (for example, Pods in the kube-system namespace).
  3. GKE sends a SIGTERM shutdown signal to workload Pods that are running on the node to alert them of an imminent shutdown. Pods can use this alert to finish any ongoing tasks. GKE makes a best effort to terminate these Pods gracefully.
  4. After the eviction finishes, GKE updates the value of the cloud.google.com/active-node-maintenance label to terminating to indicate that the node is ready to terminate.
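As referenced in step 1, the following is a minimal sketch of a toleration that you can add to the Pod spec of workloads that should tolerate the impending-node-termination taint:

    # Toleration for the taint that GKE applies before node termination.
    tolerations:
    - key: "cloud.google.com/impending-node-termination"
      operator: "Exists"
      effect: "NoSchedule"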

Afterwards, the node terminates and a replacement node is allocated. GKE clears the labels and taints when the process finishes. To increase the termination window for workloads that use GPUs or TPUs, complete the steps in the Manually start a host maintenance event section.

Monitor the progress of an active graceful termination

You can filter the GKE logs by the following graceful termination events:

  • When the VM detects a disruption due to an impending node termination, such as a Compute Engine host maintenance event, GKE sets the cloud.google.com/active-node-maintenance label to ONGOING while workloads are being stopped, and to terminating when the workloads are finished and the node is ready to terminate.
  • When restricting new workloads from being scheduled, GKE applies the cloud.google.com/impending-node-termination:NoSchedule taint.
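To observe the same state directly on the node objects, you can watch the label and inspect the node's taints, for example:

# Watch the maintenance label as graceful termination progresses.
kubectl get nodes -L cloud.google.com/active-node-maintenance --watch

# Check whether the impending-node-termination taint is set on a node.
kubectl describe node <node-name> | grep -A 1 Taints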

What's next