Standard cluster upgrades

This document discusses how automatic and manual upgrades work for Google Kubernetes Engine (GKE) Standard clusters, including links to more information about related tasks and settings. You can use this information to keep your clusters updated for stability and security with minimal disruptions to your workloads.

For a general overview of cluster upgrades, see About GKE cluster upgrades. For information on how cluster upgrades work specifically for Autopilot clusters, see Autopilot cluster upgrades.

How cluster and node pool upgrades work

This section discusses what happens in your cluster during automatic or manual upgrades. GKE initiates auto-upgrades, and you initiate manual upgrades. GKE monitors both automatic and manual upgrades across all GKE clusters, and intervenes if it observes problems.

To upgrade a cluster, GKE updates the version the control plane and nodes are running. Clusters are upgraded to either a newer minor version (for example, 1.24 to 1.25) or newer patch version (for example, 1.24.2-gke.100 to 1.24.5-gke.200). For more information, see GKE versioning and support.
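To illustrate the difference between a minor and a patch upgrade, the following Python sketch parses version strings shaped like the examples above and classifies the change. The helpers `parse_gke_version` and `classify_upgrade` are hypothetical names, not part of any GKE API, and the parsing assumes the `MAJOR.MINOR.PATCH-gke.BUILD` shape; missing parts are treated as 0.

```python
import re
from typing import NamedTuple


class GkeVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    build: int


# Matches "1.24", "1.24.2", or "1.24.2-gke.100" (shape assumed from the examples).
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)(?:\.(\d+))?(?:-gke\.(\d+))?$")


def parse_gke_version(text: str) -> GkeVersion:
    """Parse a version string; parts that are absent default to 0."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch, build = (int(part) if part else 0 for part in match.groups())
    return GkeVersion(major, minor, patch, build)


def classify_upgrade(current: str, target: str) -> str:
    """Return "minor", "patch", or "not an upgrade"."""
    old, new = parse_gke_version(current), parse_gke_version(target)
    if new <= old:
        return "not an upgrade"
    # Any change in the MAJOR.MINOR pair is reported as a minor upgrade here.
    if (new.major, new.minor) != (old.major, old.minor):
        return "minor"
    return "patch"


print(classify_upgrade("1.24", "1.25"))
print(classify_upgrade("1.24.2-gke.100", "1.24.5-gke.200"))
```

The tuple comparison orders versions field by field, which is enough for this illustration but is not a statement about how GKE itself compares versions.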

If you enroll your cluster in a release channel, nodes run the same version of GKE as the cluster, except during a brief period (typically a few days, depending on the current release) between completing the cluster's control plane upgrade and starting the node pool upgrade, or if the control plane was manually upgraded. Check the release notes for more information.

Cluster upgrades

This section discusses what to expect when GKE auto-upgrades your cluster or you initiate a manual upgrade.

Zonal clusters have only a single control plane. During the upgrade, your workloads continue to run, but you cannot deploy new workloads, modify existing workloads, or make other changes to the cluster's configuration until the upgrade is complete.
Regional clusters have multiple replicas of the control plane, and only one replica is upgraded at a time, in an undefined order. During the upgrade, the cluster remains highly available, and each control plane replica is unavailable only while it is being upgraded.

If you configure a maintenance window or exclusion, it is honored if possible.

Node pool upgrades

This section discusses what to expect when GKE auto-upgrades your node pool or you initiate a manual node pool upgrade. This information is relevant to both Standard node pools and Autopilot-managed node pools.

GKE automatically upgrades one node pool at a time in a cluster. Alternatively, you can manually upgrade one or more node pools in parallel. By default, nodes within a node pool are upgraded one at a time in an arbitrary order. In a node pool spread across multiple zones, upgrades take place zone-by-zone. Within a zone, the nodes will be upgraded in an undefined order.
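The zone-by-zone ordering described above can be pictured with a small Python sketch. It is only a model: the function name `plan_upgrade_order` and the node and zone names are hypothetical, and the shuffle stands in for the undefined order within a zone.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def plan_upgrade_order(nodes: Dict[str, str], seed: Optional[int] = None) -> List[str]:
    """Return node names grouped zone by zone, in arbitrary order within a zone.

    `nodes` maps a node name to the zone it runs in.
    """
    rng = random.Random(seed)
    by_zone = defaultdict(list)
    for name, zone in nodes.items():
        by_zone[zone].append(name)

    order = []
    for zone, batch in by_zone.items():
        rng.shuffle(batch)   # the order within a zone is undefined
        order.extend(batch)  # finish one zone before starting the next
    return order


example_nodes = {
    "node-a1": "us-central1-a",
    "node-b1": "us-central1-b",
    "node-a2": "us-central1-a",
    "node-b2": "us-central1-b",
}
print(plan_upgrade_order(example_nodes, seed=0))
```

The sketch assumes the default of one node at a time. It does not model manually started parallel node pool upgrades.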

With GKE Standard node pool upgrades, you can choose between the following node upgrade strategies:

Surge upgrades
Blue-green upgrades
Autoscaled blue-green upgrades (Preview)

For each of these node upgrade strategies, you can tune the upgrade process based on your cluster environment's needs. Autopilot-managed node pools in Standard clusters always use surge upgrades, and you can't modify the configuration of the strategy or its settings.

During a node pool upgrade, you can't make changes to the cluster configuration unless you cancel the upgrade.

GKE honors maintenance windows and exclusions during automatic upgrades when possible. Manual upgrades bypass your configured maintenance windows and exclusions.

How nodes are upgraded

During a node pool upgrade, how the nodes are upgraded depends on the node pool upgrade strategy and, where the strategy is configurable, on how you configure it. However, the basic steps remain consistent. To upgrade a node, GKE removes Pods from the node so that it can be upgraded.

When a node is upgraded, the following happens with the Pods:

The node is cordoned so that Kubernetes does not schedule new Pods on it.
The node is then drained, meaning that the Pods are removed. For surge upgrades, GKE respects the Pod's PodDisruptionBudget and terminationGracePeriodSeconds settings for up to one hour. With blue-green upgrades, this can be extended if you configure a longer soaking time.
The control plane reschedules Pods managed by controllers onto other nodes. Pods that cannot be rescheduled stay in the Pending phase until they can be rescheduled.
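The following Python sketch is a simplified model of how the one-hour limit for surge upgrades bounds the time a Pod can hold up a drain. It treats each Pod independently, and the field `pdb_blocked_seconds` is a hypothetical stand-in for however long a PodDisruptionBudget delays the Pod's eviction; it is not a Kubernetes field.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

SURGE_DRAIN_LIMIT_SECONDS = 3600  # "up to one hour" for surge upgrades


@dataclass
class Pod:
    name: str
    termination_grace_period_seconds: int
    pdb_blocked_seconds: int = 0  # hypothetical: delay caused by a PodDisruptionBudget


def drain_outcomes(
    pods: Iterable[Pod], limit: int = SURGE_DRAIN_LIMIT_SECONDS
) -> Dict[str, Tuple[int, bool]]:
    """Map each Pod name to (seconds until removed, whether the limit cut it short)."""
    outcomes = {}
    for pod in pods:
        wanted = pod.pdb_blocked_seconds + pod.termination_grace_period_seconds
        outcomes[pod.name] = (min(wanted, limit), wanted > limit)
    return outcomes


pods = [
    Pod("web", termination_grace_period_seconds=30),
    Pod("db", termination_grace_period_seconds=60, pdb_blocked_seconds=3000),
    Pod("batch", termination_grace_period_seconds=7200),
]
for name, (seconds, cut_short) in drain_outcomes(pods).items():
    print(name, seconds, "cut short" if cut_short else "graceful")
```

In this model, a Pod whose settings ask for more than the limit is removed when the limit is reached, so long-running workloads might need a different strategy or a longer blue-green soak time.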

The node pool upgrade process may take up to a few hours depending on the upgrade strategy, the number of nodes, and their workload configurations.

Considerations affecting node upgrade duration

Configurations that can cause a node upgrade to take longer to complete include:

A high value of terminationGracePeriodSeconds in a Pod's configuration.
A conservative Pod Disruption Budget.
Node affinity interactions.
Attached PersistentVolumes.

Node upgrade strategies

GKE offers built-in, configurable strategies that determine how the node pool is upgraded. The following sections summarize each strategy.

Surge upgrades

By default, GKE uses the surge upgrade strategy for node pool upgrades, and always uses it for Autopilot-managed node pools. Surge upgrades use a rolling method, in which nodes are upgraded in a rolling window. This strategy is best for applications that can handle incremental, non-disruptive changes. For Standard node pools, you can change settings such as how many nodes can be upgraded at once and how disruptive the upgrades can be, to find the optimal balance of speed and disruption for your environment.

Blue-green upgrades

An alternative approach for Standard node pools is blue-green upgrades, where two sets of environments (the original and new environments) are maintained at once, making rolling back as straightforward as possible. Blue-green is more resource intensive and better for applications that are more sensitive to changes. With this strategy, workloads are gradually migrated from the original "blue" environment to the new "green" environment, and given soak time to validate them with the new configuration. If needed, the workloads can be quickly rolled back to the existing "blue" environment.

For more information about how the node upgrade strategies work, see Node upgrade strategies.

Autoscaled blue-green upgrades

Autoscaled blue-green upgrades (Preview) are a node upgrade strategy where workloads can run for longer, while you minimize cost from idle or underutilized nodes.

Resource requirements for node upgrade strategies

Surge upgrades create extra nodes if maxSurge is set to more than 0, and blue-green upgrades temporarily double the number of nodes in a node pool. These extra nodes require additional resources, which are subject to Compute Engine quota, resource availability, and reservation capacity. If your node pool doesn't have sufficient resources, upgrades can take longer or fail.
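As a rough planning aid, the following Python sketch estimates the peak number of extra nodes each strategy can require, based only on the description above. The function names are hypothetical, and the quota is simplified to a single count of nodes that you can still create; real Compute Engine quotas cover CPUs, addresses, disks, and other resources separately.

```python
def peak_extra_nodes(strategy: str, node_count: int, max_surge: int = 0) -> int:
    """Estimate the extra nodes that exist at the peak of an upgrade."""
    if strategy == "surge":
        # Surge upgrades add nodes only when maxSurge is greater than 0.
        return max_surge
    if strategy == "blue-green":
        # Blue-green temporarily doubles the node pool.
        return node_count
    raise ValueError(f"unknown strategy: {strategy!r}")


def fits_quota(extra_nodes: int, creatable_nodes: int) -> bool:
    """Check the estimate against a simplified, node-count-only quota."""
    return extra_nodes <= creatable_nodes


for strategy, surge in (("surge", 2), ("blue-green", 0)):
    extra = peak_extra_nodes(strategy, node_count=10, max_surge=surge)
    print(strategy, extra, fits_quota(extra, creatable_nodes=5))
```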

For more information about how to ensure your project has enough resources for node upgrades, and what to do if your environment is resource-constrained, see Ensure resources for node upgrades.

Upgrading automatically

When you create a Standard cluster, by default, auto-upgrade is enabled on the cluster and its node pools.

GKE is responsible for securing your cluster's control plane, and upgrades your clusters when a new GKE version is selected for auto-upgrade. Infrastructure security is a high priority for GKE, so control planes are upgraded regularly, and control plane auto-upgrades cannot be disabled. However, you can apply maintenance windows and exclusions to temporarily suspend upgrades for control planes and nodes.

As part of the GKE shared responsibility model, you are responsible for securing your nodes, containers, and Pods. Node auto-upgrade is enabled by default. Although it is not recommended, you can disable node auto-upgrade. Opting out of node auto-upgrades does not block your cluster's control plane upgrade. If you opt out of node auto-upgrades you are responsible for ensuring that the cluster's nodes run a version compatible with the cluster's version, and that the version adheres to the GKE version skew policy.

For more control over when an auto-upgrade can occur (or must not occur), you can configure maintenance windows and exclusions.

A cluster's node pools can be no more than two minor versions behind the control plane version, to maintain compatibility with the cluster API. The node pool version also determines the versions of software packages installed on each node. It is recommended to keep node pools updated to the cluster version.
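The two-minor-version rule can be checked with a few lines of Python. This is a minimal sketch: `within_version_skew` is a hypothetical helper, and the additional check that a node pool is not newer than the control plane is an assumption taken from the Kubernetes version skew policy rather than from this page.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as "1.27.3-gke.100" or "1.27"."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def within_version_skew(control_plane: str, node_pool: str, max_minor_behind: int = 2) -> bool:
    """Return True if the node pool is at most `max_minor_behind` minor versions behind."""
    cp_major, cp_minor = major_minor(control_plane)
    np_major, np_minor = major_minor(node_pool)
    if cp_major != np_major:
        return False
    behind = cp_minor - np_minor
    # Assumption: a node pool newer than the control plane (behind < 0) is not allowed.
    return 0 <= behind <= max_minor_behind


print(within_version_skew("1.27.3-gke.100", "1.25.8-gke.500"))
print(within_version_skew("1.27.3-gke.100", "1.24.9-gke.100"))
```

The version strings in the example are made up for illustration.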

If you enroll your cluster in a release channel, nodes always run the same version of GKE as the cluster itself, except during a brief period (typically a few days, depending on the current release) between completing the cluster's control plane upgrade and beginning to upgrade a given node pool. Check the release notes for more information.

How versions are selected for auto-upgrade

New GKE versions are released regularly, but a version is not selected for auto-upgrade right away. When a GKE version has accumulated enough cluster usage to prove stability over time, GKE selects it as an auto-upgrade target for clusters running a subset of earlier versions. To get auto-upgrade targets for a specific cluster, see Get information about a cluster's upgrades.

New auto-upgrade targets are announced in the release notes. Until an available version is selected for auto-upgrade, you can upgrade to it manually. Occasionally, a version is selected for cluster auto-upgrade and node auto-upgrade during different weeks.

Soon after a new minor version becomes generally available, the oldest available minor version typically becomes unsupported. Clusters running minor versions that become unsupported are automatically upgraded to the next minor version.

Within a minor version (such as v1.14.x), clusters can be automatically upgraded to a new patch release.

Release channels allow you to control your cluster and node pool version based on a version's stability rather than managing the version directly.

Factors that affect version rollout timing

To ensure the stability and reliability of clusters on new versions, GKE follows certain practices during version rollouts.

These practices include, but are not limited to:

GKE gradually rolls out changes across Google Cloud regions and zones.
GKE gradually rolls out patch versions across release channels. A patch is given soak time in the Rapid release channel, then the Regular release channel, before being promoted to the Stable release channel once it has accumulated usage and continued to demonstrate stability. If an issue is found with a patch version during the soaking time on a release channel, that version is not promoted to the next channel and the issue is fixed on a newer patch version.
GKE gradually rolls out minor versions, following a similar soaking process to patch versions. Minor versions have longer soaking periods as they introduce more significant changes.
GKE may delay automatic upgrades when a new version impacts a group of clusters. For example, GKE pauses automatic upgrades for clusters that it detects are exposed to a deprecated API or feature that will be removed in the next minor version.
GKE might delay the rollout of new versions during peak times (for example, major holidays) to ensure business continuity.
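The channel-by-channel promotion of a patch version can be modeled as a short Python sketch. The function `channels_reached` is hypothetical and ignores soak durations, regional rollout, and the other factors listed above; it only shows that a version stops at the channel where an issue is found.

```python
from typing import List, Optional

# Promotion order described above.
CHANNEL_ORDER = ["Rapid", "Regular", "Stable"]


def channels_reached(issue_found_in: Optional[str] = None) -> List[str]:
    """List the channels a patch version reaches before promotion stops."""
    reached = []
    for channel in CHANNEL_ORDER:
        reached.append(channel)
        if channel == issue_found_in:
            # Not promoted further; the fix ships in a newer patch version.
            break
    return reached


print(channels_reached())           # no issue found during soaking
print(channels_reached("Regular"))  # issue found while soaking in Regular
```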

Configuring when auto-upgrades can occur

By default, auto-upgrades can occur at any time to preserve infrastructure security. Auto-upgrades are minimally disruptive, especially for regional clusters. However, some workloads may require finer-grained control. You can configure maintenance windows and exclusions to manage when auto-upgrades can and must not occur.
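The interaction between maintenance windows, exclusions, and manual upgrades can be summarized in a small decision function. This Python sketch is a simplification: `upgrade_allowed` is a hypothetical helper, windows and exclusions are represented as explicit start and end times rather than recurring schedules, and the sketch does not model the cases where GKE cannot honor a window.

```python
from datetime import datetime, timezone
from typing import Iterable, Tuple

Interval = Tuple[datetime, datetime]  # (start, end), end exclusive


def upgrade_allowed(
    now: datetime,
    windows: Iterable[Interval] = (),
    exclusions: Iterable[Interval] = (),
    manual: bool = False,
) -> bool:
    """Decide whether an upgrade may start at `now` under the simplified rules."""
    if manual:
        return True  # manual upgrades bypass windows and exclusions
    if any(start <= now < end for start, end in exclusions):
        return False  # auto-upgrades must not occur during an exclusion
    windows = list(windows)
    if not windows:
        return True  # by default, auto-upgrades can occur at any time
    return any(start <= now < end for start, end in windows)


utc = timezone.utc
window = (datetime(2024, 1, 6, 2, 0, tzinfo=utc), datetime(2024, 1, 6, 6, 0, tzinfo=utc))
print(upgrade_allowed(datetime(2024, 1, 6, 3, 0, tzinfo=utc), windows=[window]))
print(upgrade_allowed(datetime(2024, 1, 7, 3, 0, tzinfo=utc), windows=[window]))
```

The dates are arbitrary example values.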

Upgrading manually

You can request to manually upgrade your cluster's control plane or its node pools to an available and compatible version at any time. In Standard clusters, you can manually upgrade both Standard node pools and Autopilot-managed node pools. Manual upgrades bypass any configured maintenance windows and maintenance exclusions.

When you manually upgrade a cluster, its availability depends on whether the cluster is regional or not:

For zonal clusters, the control plane is unavailable while it is being upgraded. For the most part, workloads run normally but cannot be modified during the upgrade.
For regional clusters, one replica of the control plane is unavailable at a time while it is upgraded, but the cluster remains highly available during the upgrade.

You can manually initiate a node upgrade to a version compatible with the control plane.

How GKE responds to auto-upgrade failure

Node pool auto-upgrades can fail because of issues with the underlying Compute Engine instances, or because of issues with Kubernetes. For example, auto-upgrades fail in the following situations:

Your configured maxSurge setting exceeds your Compute Engine resource quota.
New surge nodes didn't register with the cluster control plane.
Nodes took too long to drain, or took too long to delete.

When issues occur with individual node upgrades, GKE retries the upgrade a few times, with an increasing interval between retries. If nodes in the node pool fail to upgrade, GKE does not roll back the upgraded nodes. Instead, GKE tries the node pool auto-upgrade again until all the nodes are successfully upgraded.

If your node upgrades fail because your surge node requests exceed your Compute Engine quota, GKE reduces the number of concurrent surge nodes to attempt to meet the quota and continue the upgrade.
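The two behaviors described above, retrying with an increasing interval and reducing the number of concurrent surge nodes to fit the quota, can be sketched in Python as follows. All numbers are hypothetical: this page does not state how many times GKE retries, what intervals it uses, or how it computes the reduced surge, so the retry count, the starting delay, and the doubling factor are assumptions for illustration only.

```python
from typing import List


def retry_delays(attempts: int = 3, first_delay_seconds: int = 60, factor: int = 2) -> List[int]:
    """Return an increasing list of delays (hypothetical values, not GKE's)."""
    return [first_delay_seconds * factor ** i for i in range(attempts)]


def effective_surge(max_surge: int, creatable_nodes: int) -> int:
    """Reduce the concurrent surge nodes so that they fit the remaining quota."""
    return max(0, min(max_surge, creatable_nodes))


print(retry_delays())
print(effective_surge(max_surge=5, creatable_nodes=2))
```

With a reduced surge the upgrade can continue, but it proceeds more slowly because fewer nodes are upgraded at once.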

Receiving upgrade notifications

GKE publishes notifications about events relevant to your cluster, such as version upgrades and security bulletins, to Pub/Sub, providing you with a channel to receive information from GKE about your clusters.

For more information, see Receiving cluster notifications.

Check upgrade logs

GKE logs control plane and node pool upgrade events to Cloud Logging by default. The upgrade event log provides visibility into the upgrade process and includes information that is useful for troubleshooting.

Control plane upgrade logs

Cluster upgrade events can be queried using the following filter:

resource.type="gke_cluster"
protoPayload.metadata.operationType=~"(UPDATE_CLUSTER|UPGRADE_MASTER)"
resource.labels.cluster_name="CLUSTER_NAME"

These logs are recorded in a structured logging format. Use the following fields for details about the upgrade events:

Field	Description
protoPayload.metadata.operationType	The type of cluster upgrade event. `UPGRADE_MASTER`: an upgrade to the Kubernetes control plane version. `UPDATE_CLUSTER`: an update that doesn't change the Kubernetes control plane version. Both types can cause the loss of control plane availability for zonal clusters. For more information, see How cluster and node pool upgrades work.
protoPayload.methodName	The API that triggered the cluster upgrade. `google.container.v1.ClusterManager.UpdateCluster`: manual control plane upgrade. `google.container.internal.ClusterManagerInternal.UpdateClusterInternal`: automatic control plane upgrade. `google.container.v1.ClusterManager.PatchCluster`: cluster configuration change.
protoPayload.metadata.previousMasterVersion	This field is used only for the `UPGRADE_MASTER` operation type, and contains the control plane version used before the upgrade.
protoPayload.metadata.currentMasterVersion	This field is used only for the `UPGRADE_MASTER` operation type, and contains the control plane version used after the upgrade.
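To show how the fields in the table fit together, the following Python sketch summarizes one control plane log entry. It assumes that the entry has already been exported as a dictionary whose nesting mirrors the field paths in the table. The function `describe_cluster_event` is a hypothetical helper, and the sample entry is made up.

```python
# Method names from the table above, mapped to what triggered the event.
TRIGGER_BY_METHOD = {
    "google.container.v1.ClusterManager.UpdateCluster": "manual control plane upgrade",
    "google.container.internal.ClusterManagerInternal.UpdateClusterInternal": "automatic control plane upgrade",
    "google.container.v1.ClusterManager.PatchCluster": "cluster configuration change",
}


def describe_cluster_event(entry: dict) -> str:
    """Summarize a control plane log entry in one line."""
    payload = entry.get("protoPayload", {})
    metadata = payload.get("metadata", {})
    trigger = TRIGGER_BY_METHOD.get(payload.get("methodName"), "unknown trigger")
    if metadata.get("operationType") == "UPGRADE_MASTER":
        # The version fields are populated only for UPGRADE_MASTER.
        previous = metadata.get("previousMasterVersion")
        current = metadata.get("currentMasterVersion")
        return f"{trigger}: {previous} -> {current}"
    return f"{trigger}: control plane version unchanged"


sample_entry = {  # made-up entry for illustration
    "protoPayload": {
        "methodName": "google.container.v1.ClusterManager.UpdateCluster",
        "metadata": {
            "operationType": "UPGRADE_MASTER",
            "previousMasterVersion": "1.24.2-gke.100",
            "currentMasterVersion": "1.24.5-gke.200",
        },
    }
}
print(describe_cluster_event(sample_entry))
```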

Node pool upgrade logs

Use the following query to view node pool upgrade events:

resource.type="gke_nodepool"
protoPayload.metadata.operationType="UPGRADE_NODES"
resource.labels.cluster_name="CLUSTER_NAME"

Use the following field for details about the upgrade event:

The protoPayload.methodName field shows whether the upgrade was triggered manually or automatically:

google.container.v1.ClusterManager.UpdateNodePool: manual node pool upgrade
google.container.internal.ClusterManagerInternal.UpdateClusterInternal: automatic node pool upgrade
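If you export node pool upgrade entries for analysis, a short Python sketch like the following can tally how many upgrades were manual and how many were automatic. It assumes each entry is a dictionary with a nested `protoPayload.methodName` value, as in the field path above. The helper names and the sample entries are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Method names listed above, mapped to the kind of trigger.
NODE_POOL_TRIGGERS = {
    "google.container.v1.ClusterManager.UpdateNodePool": "manual",
    "google.container.internal.ClusterManagerInternal.UpdateClusterInternal": "automatic",
}


def node_pool_upgrade_trigger(entry: dict) -> str:
    """Return "manual", "automatic", or "unknown" for one log entry."""
    method = entry.get("protoPayload", {}).get("methodName")
    return NODE_POOL_TRIGGERS.get(method, "unknown")


def count_triggers(entries: Iterable[dict]) -> Counter:
    """Count node pool upgrade entries by trigger."""
    return Counter(node_pool_upgrade_trigger(entry) for entry in entries)


sample_entries = [  # made-up entries for illustration
    {"protoPayload": {"methodName": "google.container.v1.ClusterManager.UpdateNodePool"}},
    {"protoPayload": {"methodName": "google.container.internal.ClusterManagerInternal.UpdateClusterInternal"}},
    {"protoPayload": {"methodName": "google.container.internal.ClusterManagerInternal.UpdateClusterInternal"}},
]
print(count_triggers(sample_entries))
```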

Component upgrades

GKE runs system workloads on worker nodes to support specific capabilities for clusters. For example, the gke-metadata-server system workload supports Workload Identity Federation for GKE. GKE is responsible for the health of these workloads. To learn more about these components, refer to the documentation for the associated capabilities.

When new features or fixes become available for a component, GKE indicates the patch version in which they are included. To obtain the latest version of a component, refer to the associated documentation or release notes for instructions on upgrading your control plane or nodes to the appropriate version.

What's next

Learn more about cluster configuration choices.
Learn more about upgrading a cluster or its nodes.
Configure maintenance windows and exclusions.
Learn about managing automatic cluster upgrades across environments with rollout sequencing.
Best practices for upgrading clusters.
Watch GKE cluster upgrades: Best practices for GKE cluster stability, security, and performance