During the lifecycle of a long-running GKE cluster, periodic disruptions to workloads occur due to automatic maintenance events issued by Google Cloud for the Compute Engine resources that underlie the GKE infrastructure. This page helps you understand what node disruption means in GKE, monitor maintenance notifications, and minimize the impact of disruption on GKE nodes with attached GPUs and TPUs.
This document is for Platform admins and operators who manage the lifecycle of the underlying technology infrastructure. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
What does node disruption mean in GKE?
Your GKE clusters manage the lifecycle of the GKE nodes. These nodes are provisioned on Compute Engine VMs, which periodically experience host events that are caused by a variety of reasons, such as the following:
- Hardware or software updates and maintenance. Automatic maintenance events are issued by Google Cloud.
- Hardware failures.
- Upgrades to Compute Engine VMs. These upgrades are different from upgrades to the GKE node or the node pool version.
Host events are issued for the underlying Google Cloud infrastructure and they bypass GKE maintenance policies and exclusions. During the lifecycle of a long-running GKE cluster, the nodes might experience periodic disruptions to training or serving workloads. When these disruptions affect your GKE nodes that run AI/ML workloads, GKE needs to restart both the running workloads and the underlying node.
Why GPUs and TPUs require disruption management
Most Compute Engine VMs, with some exceptions, have their host maintenance policy set to live migrate, which means that running workloads typically experience little to no disruption. However, certain classes of VMs don't support live migration, including VMs with attached GPUs and TPUs. When a host event happens on a VM within a TPU slice, the entire slice is interrupted and then rescheduled, because all maintenance events are coordinated at the slice level. For example, if you create a TPU slice that has hundreds of VMs, all of those VMs receive the same maintenance event schedule.
When a host event occurs, GKE terminates the node and its Pods. If the Pods are deployed as part of a larger workload, like a Job or Deployment, GKE restarts the Pods on the affected node.
It's up to you, or the frameworks that you use, to configure your workloads to react appropriately to maintenance events. For example, you can save the state of your AI training job to reduce data loss.
To manage disruption for AI/ML workloads, you can do the following:
- Monitor maintenance notifications
- Minimize disruption impact
- Configure GKE to terminate your workloads gracefully
Monitor maintenance notifications
Compute Engine issues notifications when nodes and their underlying VMs are scheduled for disruptive host events, and when these events become active. The notifications include information about planned start time, the type of event, and other details.
On GKE version 1.31.1-gke.2008000 and later, you can monitor upcoming maintenance events, including the events that are described in this section.
Upcoming maintenance is scheduled but not active
Before a VM with attached GPUs or TPUs has a scheduled maintenance event, Compute Engine pushes notifications out to all of its VMs. These notifications report the start of the maintenance window. When an upcoming maintenance event is scheduled for the VM but not yet active, GKE adds the scheduled-maintenance-time label to the node.
To query these notifications at the node level, run the following command:
kubectl get nodes -l cloud.google.com/scheduled-maintenance-time \
-L cloud.google.com/scheduled-maintenance-time
The output is similar to the following:
NAME                          STATUS   SCHEDULED-MAINTENANCE-TIME
<gke-accelerator-node-name>   Ready    1733083200
<gke-accelerator-node-name>   Ready    1733083200
[...]
The SCHEDULED-MAINTENANCE-TIME column shows the scheduled start time of the maintenance event, in seconds, displayed in Unix epoch time format.
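To convert this value to a human-readable timestamp, you can use the date utility (assuming GNU date, as found on most Linux distributions). For example:

date -u -d @1733083200

On macOS, use date -u -r 1733083200 instead.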
To query these notifications at the level of node metadata, check instances for a maintenance event notification.
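For example, from within the node's VM, you can query the maintenance-event metadata key through the Compute Engine metadata server:

curl "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event" -H "Metadata-Flavor: Google"

The response is NONE when no maintenance event is in progress.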
Scheduled maintenance starts
For accelerator-optimized machine families that support
advanced maintenance, you
can access the upcoming-maintenance
endpoint that provides information about
scheduled and started maintenance events.
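For example, from within the VM, you can read this endpoint through the metadata server; the response includes fields such as maintenance_status and the scheduled maintenance window:

curl "http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance" -H "Metadata-Flavor: Google"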
When scheduled maintenance starts, Compute Engine updates the metadata in the http://metadata.google.internal/computeMetadata/v1/instance/attributes/ directory. Compute Engine updates the metadata labels as follows:
- Sets maintenance-event to TERMINATE_ON_HOST_MAINTENANCE.
- In upcoming-maintenance, sets maintenance_status to ONGOING.
GKE then gracefully evicts Pods and terminates workloads within the maximum predefined time of the maintenance notification window.
Minimize disruption impact
To minimize the impact of node disruption, you can manually start a host maintenance event.
If you don't start a maintenance event, Compute Engine will complete the regularly scheduled maintenance.
Manually start a host maintenance event
When Compute Engine issues a notification about a scheduled maintenance event, you can manually start maintenance at a time that aligns with your operational schedule, for example, during periods of reduced activity.
On a node in the node pool, set the node label cloud.google.com/perform-maintenance to true. For example:
kubectl label nodes <node-name> cloud.google.com/perform-maintenance=true
With the perform-maintenance action, GKE gracefully evicts Pods and terminates workloads before the maintenance event starts. The duration between applying the label and the start of the maintenance event varies.
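Alternatively, if the underlying VM supports advanced maintenance, you can trigger the maintenance event directly on the Compute Engine instance. A minimal sketch, where the VM name and zone are placeholders:

gcloud compute instances perform-maintenance <vm-name> --zone=<zone>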
Configure GKE to terminate your workloads gracefully
In this section, you configure GKE to manage your application lifecycle and minimize the disruption to your workload. If you don't configure a grace period, the grace period defaults to 30 seconds.
GKE makes a best effort to terminate these Pods gracefully and to
execute the termination action that you define, for example, saving a training
state. GKE sends a SIGTERM
signal to Pods at the beginning of the grace
period. If Pods don't exit by the end of the grace period, GKE
sends a follow-up SIGKILL
signal to any processes still running in any
container in the Pod.
To configure the graceful termination period, set the termination grace period (in seconds) in the spec.terminationGracePeriodSeconds field of your Pod manifest. For example, to allow a notification time of 10 minutes, set the spec.terminationGracePeriodSeconds field in your Pod manifest to 600, as follows:
spec:
terminationGracePeriodSeconds: 600
We recommend that you set a termination grace period that is long enough for any ongoing
tasks to finish within the notification timeframe.
If your workload uses an ML framework such as MaxText, Pax, or JAX with Orbax, the workload can capture the shutdown SIGTERM signal and initiate a checkpointing process. To learn more, see TPU Autocheckpoint.
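The same pattern applies outside these frameworks: the container's main process traps SIGTERM and saves state before exiting. The following is a minimal sketch of a container entrypoint, where save_checkpoint and train.py are hypothetical placeholders for your own checkpoint logic and training script:

#!/bin/bash
# Hypothetical entrypoint: checkpoint on SIGTERM before the grace period ends.
save_checkpoint() {
  echo "Saving training state before shutdown..."  # replace with your framework's checkpoint call
}
trap 'save_checkpoint; kill -TERM "$child"; exit 0' SIGTERM
python train.py &   # run training in the background so the shell can handle signals
child=$!
wait "$child"

Because the shell runs as PID 1 in the container, it receives the SIGTERM that GKE sends at the beginning of the grace period.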
Process of graceful termination
When a disruption event begins, whether it's triggered manually or automatically
by the VM, Compute Engine signals the impending machine shutdown by
updating the maintenance-event
metadata key. In both cases of impending node
shutdown, GKE will start graceful termination.
The following workflow shows how GKE executes graceful node termination when there is an impending node shutdown:
- Within 60 seconds, the following occurs:
  - The system components apply the cloud.google.com/active-node-maintenance node label set to ONGOING to indicate that workloads are being stopped.
  - GKE applies a node taint to prevent new Pods from being scheduled on the node. The taint has the cloud.google.com/impending-node-termination:NoSchedule key. We recommend that you modify your workloads to tolerate this taint because of the known termination that occurs (see the example manifest after this workflow).
- The maintenance-handler component begins to evict Pods, evicting workload Pods first and then system Pods (for example, kube-system).
- GKE sends a SIGTERM shutdown signal to workload Pods that are running on the node to alert them of an imminent shutdown. Pods can use this alert to finish any ongoing tasks. GKE makes a best effort to terminate these Pods gracefully.
- After eviction finishes, GKE updates the value of the cloud.google.com/active-node-maintenance label to terminating to indicate that the node is ready to terminate.
Afterwards, the node termination occurs and a replacement node is allocated. GKE clears the labels and taints when the process is finished. To increase the termination window for your workloads using GPUs or TPUs, complete the steps in the Manually start a host maintenance event section.
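The following manifest is a sketch that combines the toleration recommended in the workflow above with the termination grace period from the previous section; the Pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  terminationGracePeriodSeconds: 600
  tolerations:
  - key: "cloud.google.com/impending-node-termination"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: <your-training-image>

The toleration follows the recommendation above, and the grace period gives in-flight tasks time to finish within the notification window.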
Monitor the progress of an active graceful termination
You can filter the GKE logs by the following graceful termination events:
- When the VM detects a disruption due to an impending node termination, such as a Compute Engine host maintenance event, GKE sets the cloud.google.com/active-node-maintenance label to ONGOING when workloads are being stopped, and to terminating when the workloads are finished and the node is ready to terminate.
- When restricting new workloads from being scheduled, GKE applies the cloud.google.com/impending-node-termination:NoSchedule taint.
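To check the current state directly on your nodes rather than in the logs, you can inspect the label and the taint with kubectl. For example:

kubectl get nodes -L cloud.google.com/active-node-maintenance
kubectl describe node <node-name> | grep -A1 Taints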