Manage cluster lifecycle changes to minimize disruption

Autopilot Standard

This page explains how you and Google Kubernetes Engine (GKE) manage changes during the lifecycle of a cluster to maximize performance and availability while minimizing workload disruption.

This page is intended for platform administrators who want to plan and optimize their cluster environment to minimize disruption for their workloads. You can read this page either before or after learning how to perform the basic cluster management tasks described in Managing clusters and Cluster administration overview.

A managed platform and shared responsibility

GKE is a Google-managed implementation of the Kubernetes open source container orchestration platform. As mentioned in How GKE works, a GKE cluster consists of a control plane, which includes management nodes running system components, and worker nodes, where you deploy workloads.

Creating an optimal cluster environment for your workloads to run, with maximum performance, availability, and minimal disruption, is a shared responsibility:

GKE's responsibility is to maintain a reliable, available, secure, and performant cluster environment. To do this, GKE manages the control plane, system components, and, for Autopilot mode, the worker nodes.
Your responsibility as a platform administrator is to configure your cluster and manage your workloads, including preparing them to handle disruption. With Standard mode, you also create and manage the worker nodes, which are grouped in node pools.

To learn more, see GKE shared responsibility.

How GKE manages changes during the lifecycle of a cluster

As an implementation of Kubernetes, a GKE cluster is a network of processes and systems acting together to maintain the optimal environment to run your workloads. To manage the cluster, GKE performs maintenance tasks, makes changes, initiates operations, updates components, and upgrades the version of the control plane and nodes.

Most of the day-to-day running of your application occurs quietly in the background, keeping your workloads running without disruption. Some critical changes, however, must be completed in ways that could temporarily disrupt your workloads, as described in the next section.

Some cluster changes can be disruptive to workloads

While GKE strives to keep your workloads running seamlessly, some essential types of changes can require temporary disruptions to your workloads—primarily changes that restart the nodes running your workloads. Using GKE and Kubernetes features, you can specify when and how you want disruption to take place, so that when it does, your workloads can gracefully handle the changes.

The following sections explain what types of changes GKE makes to clusters, what type of disruption they cause, and how you can prepare.

Upgrades and updates with GKE cluster lifecycle management

In GKE, cluster upgrades and cluster updates have related meanings.

In GKE, the term cluster upgrades—or just upgrades—refers to updating the Kubernetes version of the control plane (control plane upgrades) or nodes (node upgrades), or both. When using Standard clusters, node upgrades can also be referred to as node pool upgrades because GKE uses a single operation to upgrade a node pool of nodes.

The term cluster updates—or just updates—is a more general term referring to any type of control plane or node changes, including updating their versions. GKE actively manages your cluster environment by performing upgrades, other types of updates, and necessary maintenance operations. These actions ensure your cluster remains performant, secure, and up-to-date with the latest features and bug fixes. GKE uses tools like node upgrade strategies and maintenance policies to minimize disruption during these processes.

Planning for node update disruptions

Certain types of cluster changes—mostly changes to nodes—can cause disruption.

GKE uses node upgrade strategies to update nodes—both Autopilot nodes or Standard cluster node pools—in a way that's optimized for your workload's needs. These strategies apply to version upgrades and also to some other types of node changes. The strategies allow GKE to minimize disruption while performing node updates, which are important for keeping clusters functional and performant.

Best practice:

Use maintenance windows and exclusions to choose when some cluster maintenance does and doesn't occur, and, for Standard clusters, pick a node upgrade strategy that best fits your workload profile and resource constraints.

For both manual and automatically-initiated changes to nodes, GKE makes changes with the following general characteristics:

Changes typically respect maintenance policies: When GKE makes changes to the nodes, these changes generally respect GKE maintenance policies. Consider the following if you initiate manual changes that require all the nodes in a node pool to be recreated:
- For some changes, GKE respects maintenance policies and doesn't apply the change you submitted until there is maintenance availability. If GKE is waiting for maintenance availability, and the change is urgent, you can manually apply the changes to apply the new configuration immediately.
- For other manual changes including manual upgrades, GKE doesn't respect maintenance policies. For these manual changes, ensure that your workloads are prepared for immediate disruption.
Changes generally use node upgrade strategies: When GKE applies most automatic or manually-initiated changes to nodes—including node updates other than version upgrades—GKE chooses a node upgrade strategy: surge upgrades or blue-green upgrades. Autopilot always uses surge upgrades. Changes to Standard cluster node pools typically use surge upgrades, except when you've configured blue-green upgrades and do certain types of changes.
Changes require sufficient resources: When GKE applies a change using a node upgrade strategy, this change requires a certain amount of resources depending on the strategy and its configuration. Your cluster's project must have enough resource quota, resource availability, and reservation capacity (for node pools with specific reservation affinity). To learn more, see Ensure resources for node upgrades.

For a detailed list of specific changes and their characteristics, see Types of changes to a GKE cluster in this page.

Maximize workload availability by preparing for disruptive changes

To maximize the availability of your workloads running on a GKE cluster, we recommend that you take the actions described in the following sections:

Choose your cluster availability

If control plane availability is a priority, choose an Autopilot cluster or regional Standard cluster rather than a zonal Standard cluster. To learn more, see About cluster configuration choices.

Control upgrades using GKE tools

You can use the following tools to control when and how GKE upgrades your cluster, making it possible to implement the best practices:

Release channels: Choose a release channel to get cluster versions with your chosen balance of feature availability and stability.
Maintenance windows: Specify a recurring window of time when certain types of GKE cluster maintenance, such as upgrades, can occur.
Maintenance exclusions: Prevent cluster maintenance from occurring for a specific time period.
Node upgrade strategies: If using Standard clusters, choose how your nodes are updated–surge upgrades or blue-green upgrades–to minimize disruption to your workloads.
Rollout sequencing: Qualify upgrades in a pre-production environment before GKE upgrades your production clusters.
Manual upgrades: Manually upgrade your cluster, and perform such actions as canceling, resuming, rolling back, and completing automatic or manual in-progress upgrades.

Manage and monitor your cluster

To manage potential disruption to your clusters, continuously perform the following tasks:

Monitor your cluster with GKE's observability suite.
Follow the GKE release notes for announcements.
View cluster notifications, such as when upgrades start or complete, when new versions are available, security bulletins, and end of support dates.
Get visibility into cluster upgrades to understand the state of upgrades for your clusters.
Check the GKE release schedule for a best-case estimate of when minor versions are available for upgrades, and reach the end of support.
Use prescriptive guidance that identifies potential optimization opportunities and explains how to optimize your cluster usage, including guidance for feature and API deprecations.

Prepare your workloads

Manage disruption by making your workloads as resilient to disruption as possible:

Run replicas of your workloads to ensure redundancy and avoid a single point of failure.
Specify a disruption budget for your application using a Pod Disruption Budget.
Set a termination grace period of the right length for your workload to gracefully shut down.
If your workload is using GPUs or TPUs, follow the instructions to manage GKE node disruptions for GPUs and TPUs.
For stateful applications, which often need time to cleanly stop I/O and unmount from storage, follow the steps in Ensure stateful workloads are disruption-ready.

For a general discussion of these topics, see the Manage disruption section of the GKE best practices: Day 2 operations for business continuity blog post.

Types of changes to a GKE cluster

The following tables show the most common types of major changes to a cluster, including characteristics of these changes such as frequency and level of disruption.

Types of upgrades

Review the following table to understand how upgrades can disrupt a cluster environment.

Change	Automatic or manually initiated	Respects maintenance policies	Frequency	Type of disruption	Level of disruption
Control plane upgrade	Automatic or manual	Automatic upgrades respect maintenance policies until the end of support, except for extremely rare emergency fixes, as necessary. Manual upgrades aren't blocked by maintenance policies.	Patch upgrades, as often as every week, depending on the release channel. Minor upgrades approximately every four months. For Extended channel clusters, minor upgrades only when the minor version nears the end of support.	Control plane	For Autopilot and regional Standard clusters, the control plane remains available. For zonal Standard clusters, multiple minutes where you can't communicate with the control plane, meaning that you can't configure the cluster, nodes, and workloads during that time.
Node upgrade	Automatic or manual	Automatic upgrades respect maintenance policies until the end of support, except for extremely rare emergency fixes, as necessary. Manual upgrades aren't blocked by maintenance policies.	Typically the same as the control plane upgrades. If your cluster isn't enrolled in a release channel and you disable node auto-upgrades, you're responsible for manually upgrading your cluster's node pools.	All nodes for Autopilot clusters, or one or more Standard cluster node pools.	Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades for Autopilot, or the configured node upgrade strategy (surge or blue-green) for Standard clusters.

Change

Automatic or manually initiated

Respects maintenance policies

Frequency

Type of disruption

Level of disruption

Control plane upgrade

Automatic or manual

Automatic upgrades respect maintenance policies until the end of support, except for extremely rare emergency fixes, as necessary.

Manual upgrades aren't blocked by maintenance policies.

Patch upgrades, as often as every week, depending on the release channel.

Minor upgrades approximately every four months.

For Extended channel clusters, minor upgrades only when the minor version nears the end of support.

Control plane

For Autopilot and regional Standard clusters, the control plane remains available.

For zonal Standard clusters, multiple minutes where you can't communicate with the control plane, meaning that you can't configure the cluster, nodes, and workloads during that time.

Node upgrade

Automatic or manual

Automatic upgrades respect maintenance policies until the end of support, except for extremely rare emergency fixes, as necessary.

Manual upgrades aren't blocked by maintenance policies.

Typically the same as the control plane upgrades.

If your cluster isn't enrolled in a release channel and you disable node auto-upgrades, you're responsible for manually upgrading your cluster's node pools.

All nodes for Autopilot clusters, or one or more Standard cluster node pools.

Nodes must be shut down to be recreated, Pods must be replaced.

GKE uses surge upgrades for Autopilot, or the configured node upgrade strategy (surge or blue-green) for Standard clusters.

Manual changes that recreate the nodes using a node upgrade strategy and respecting maintenance policies

Review the following table to understand how these manual changes can disrupt a cluster environment. This list includes, amongst other changes, manual changes that respect GKE maintenance policies.

Change	Automatic or manually initiated	Respects maintenance policies	Frequency	Type of disruption	Level of disruption
Disabling the kubelet readonly port	Manually initiated	Doesn't respect maintenance policies, immediately makes changes.	Once per change of this type.	All nodes in an Autopilot cluster All nodes in a Standard cluster node pool.	Nodes must be shut down to be recreated. Pods must be replaced. GKE immediately uses surge upgrades to recreate the nodes, regardless of any active maintenance policies.
Rotating the cluster credentials	Automatic if cluster credentials are expiring within 30 days, can also be manually initiated.	Does respect maintenance policies. However, GKE might override maintenance policies within 30 days of credential expiration. Within 30 days, GKE ignores maintenance availability for the first step, which is starting the rotation. Also, if you manually trigger specific operations after the first step, that operation doesn't respect maintenance policies.	Once per manual change of this type, or depends on cluster credential lifetime for automatic initiation. You can manually invoke operations for specific steps in the rotation process.	For some steps, the control plane. For other steps, all nodes for Autopilot clusters, all nodes in each Standard cluster node pool.	When you initiate the rotation and complete the rotation, the level of disruption is the following: For Autopilot and regional Standard clusters, the control plane remains available. For zonal Standard clusters, both operations cause brief downtime, meaning multiple minutes where you can't communicate with the control plane to perform operations like configuring the cluster, nodes, and workloads. When the nodes are recreated, the level of disruption is as follows: Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades to recreate the nodes. Warning: If GKE can't complete the rotation of the cluster credentials before the cluster credentials expire—because of maintenance availability and related constraints—your cluster and workloads become unrecoverable.
Rotating the control plane's IP address	Manually initiated	Does respect maintenance policies, however if you manually trigger specific operations after the first step, that operation doesn't respect maintenance policies.	Once per manual change of this type. You can manually invoke operations for specific steps in the rotation process.	For some steps, the control plane. For other steps, all nodes for Autopilot clusters, all nodes in each Standard cluster node pool.	When you initiate the rotation and complete the rotation, the level of disruption is the following: For Autopilot and regional Standard clusters, the control plane remains available. For zonal Standard clusters, both operations cause brief downtime, meaning multiple minutes where you can't communicate with the control plane to perform operations like configuring the cluster, nodes, and workloads. When the nodes are recreated, the level of disruption is as follows: Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades to recreate the nodes.
Configuring shielded nodes	Manually initiated	Recreating the control plane doesn't respect maintenance policies, immediately makes the changes. Recreating the nodes does respect maintenance policies.	Once per change of this type	The control plane is updated. After the control plane is updated, all nodes in each Standard cluster node pool must be recreated.	When the control plane is recreated, the level of disruption is the following: For Autopilot and regional Standard clusters, the control plane remains available. For zonal Standard clusters, both operations cause brief downtime, meaning multiple minutes where you can't communicate with the control plane to perform operations like configuring the cluster, nodes, and workloads. When the nodes are recreated, the level of disruption is as follows: Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades to recreate the nodes. Warning: If your cluster doesn't have active maintenance availability when you initiate the change, the control plane will be recreated but the nodes won't yet be recreated. The delay between the two steps might cause failed node registrations and disruption to your cluster.
Configuring network policies	Manually initiated	Does respect maintenance policies	Once per change of this type	All nodes for Autopilot clusters, all nodes in each Standard cluster node pool.	Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades to recreate the nodes.
Configuring intranode visibility	Manually initiated	Does respect maintenance policies	Once per change of this type	All nodes for Autopilot clusters, all nodes in each Standard cluster node pool.	Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades to recreate the nodes.
Configuring NodeLocal DNSCache	Manually initiated	Does respect maintenance policies	Once per change of this type	All nodes in the Standard cluster node pool being updated must be updated.	Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades to recreate the nodes.
Enabling Image streaming	Manually initiated	When updating at the cluster level, respects maintenance policies. When updating individual node pools, doesn't respect maintenance policies.	Once per change of this type	If toggled at the node pool level, all nodes in the Standard cluster node pool. If toggled at the cluster level, nodes of any Standard cluster node pools where you haven't individually enabled or disabled the setting for the node pool.	GKE uses surge upgrades to recreate the nodes of a node pool.

Automatic maintenance that doesn't respect maintenance policies

Review the following table to understand how automatic maintenance that doesn't respect maintenance policies can disrupt a cluster environment.

Change	Automatic or manually initiated	Respects maintenance policies	Frequency	Type of disruption	Level of disruption
Control plane repair or resize	Automatic	Doesn't respect maintenance policies	Control plane repair frequency is random but has no impact for Autopilot and regional Standard clusters. Control plane resize is infrequent, but increases in frequency with cluster scaling events, and also has no impact for Autopilot and regional Standard clusters.	Control plane	For Autopilot and regional Standard clusters, the control plane remains available. For zonal Standard clusters, multiple minutes where you can't communicate with the control plane, meaning that you can't configure the cluster, nodes, and workloads during that time.
Host maintenance event	Automatic	Doesn't respect maintenance policies	Refer to Maintenance events for approximate frequency.	One node	For most types of nodes, minimal effect. Some nodes, including those with GPUs or TPUs, might experience greater disruption. To learn more, see Other Google Cloud maintenance.
Node auto-repair	Automatic	Doesn't respect maintenance policies	Node auto-repair frequency is random.	One node	The node is restarted, so any Pods running on the node are disrupted.
Reclaim Spot VMs and Preemptible VMs	Automatic	Doesn't respect maintenance policies	For preemptible VMs, at least once every 24 hours. For Spot VMs, when Compute Engine needs the resources elsewhere.	One node	See details about the termination and graceful shutdown of Spot VMs, and the termination and graceful shutdown of preemptible VMs.
Maintenance of the Spanner-based cluster state database	Automatic	Doesn't respect maintenance policies	Events are random and have no impact on clusters or workloads.	None. The Spanner-based database runs separately from the cluster control plane and nodes in Google's infrastructure.	None. The Spanner-based database is replicated for all types of clusters and remains available during maintenance.

Manual changes that recreate the nodes using a node upgrade strategy without respecting maintenance policies

Review the following table to understand how these manual changes can disrupt a cluster environment. This list includes changes from when GKE uses surge upgrades and when GKE uses blue-green upgrades that aren't included in the other section as they don't respect maintenance policies.

Change	Automatic or manually initiated	Respects maintenance policies	Frequency	Type of disruption	Level of disruption
Node pool label update	Manually initiated	Doesn't respect maintenance policies, immediately makes the changes.	Once per change of this type	All nodes in a Standard cluster node pool	GKE immediately uses surge upgrades to recreate the node pool when you update the node labels on an existing node pool, regardless of any active maintenance policies.
Vertically scaling the nodes by changing the node machine attributes	Manually initiated	Doesn't respect maintenance policies, immediately makes the changes.	Once per change of this type	All nodes in a Standard cluster node pool	GKE immediately uses surge upgrades to recreate the nodes on an existing node pool, regardless of any active maintenance policies.
Image type changes	Manually initiated	Doesn't respect maintenance policies, immediately makes the changes.	Once per change of this type	All nodes in a Standard cluster node pool	Nodes must be shut down to be recreated, Pods must be replaced. GKE uses the configured node upgrade strategy (surge or blue-green) for Standard clusters.
Add or replace storage pools in a Standard cluster node pool	Manually initiated	Doesn't respect maintenance policies, immediately makes the changes.	Once per change of this type	All nodes in a Standard cluster node pool	Nodes must be shut down to be recreated, Pods must be replaced. GKE uses the configured node upgrade strategy (surge or blue-green) for Standard clusters.
Enabling Image streaming	Manually initiated	When updating at the cluster level, respects maintenance policies. When updating individual node pools, doesn't respect maintenance policies.	Once per change of this type	If toggled at the node pool level, all nodes in the Standard cluster node pool. If toggled at the cluster level, nodes of any Standard cluster node pools where you haven't individually enabled or disabled the setting for the node pool.	GKE uses surge upgrades to recreate the nodes of a node pool.
Network performance configuration updates	Manually initiated	Doesn't respect maintenance policies, immediately makes the changes.	Once per change of this type	All nodes in a Standard cluster node pool	Nodes must be shut down to be recreated, Pods must be replaced. GKE immediately uses surge upgrades to recreate the nodes on an existing node pool, regardless of any active maintenance policies.
Enabling gVNIC	Manually initiated	Doesn't respect maintenance policies, immediately makes the changes.	Once per change of this type	All nodes in a Standard cluster node pool	Nodes must be shut down to be recreated, Pods must be replaced. GKE immediately uses surge upgrades to recreate the nodes on an existing node pool, regardless of any active maintenance policies.
Node system configuration changes	Manually initiated	Doesn't respect maintenance policies, immediately makes the changes.	Once per change of this type	All nodes in a Standard cluster node pool	Nodes must be shut down to be recreated, Pods must be replaced. GKE immediately uses surge upgrades to recreate the nodes on an existing node pool, regardless of any active maintenance policies.
Confidential nodes	Manually initiated	Doesn't respect maintenance policies, immediately makes the changes.	Once per change of this type	All nodes in a Standard cluster node pool	Nodes must be shut down to be recreated, Pods must be replaced. GKE immediately uses surge upgrades to recreate the nodes on an existing node pool, regardless of any active maintenance policies.

Changes that don't require recreating the nodes

Review the following table to understand which changes to the node configuration don't require recreating the nodes. These changes aren't disruptive, however disruption is still possible if the updated node configuration affects your workload.

Change	Automatic or manually initiated	Respects maintenance policies	Frequency	Type of disruption	Level of disruption
Update the following settings: Any types of tags, including Tags (firewall), Tags (non-firewall), and network tags Kubernetes taints Kubernetes labels Cluster autoscaler settings Node pool auto-upgrade enablement and disablement Node auto-repair enablement and disablement Node upgrade strategy settings	Manually initiated	Doesn't respect maintenance policies, immediately makes the changes.	Once per change of this type	All relevant nodes are updated.	Pods don't have to be replaced because the node configuration is updated without recreating the nodes.