About cluster upgrades with rollout sequencing


You can manage the order of automatic cluster upgrades across Google Kubernetes Engine (GKE) clusters in multiple environments using rollout sequencing. For example, you can qualify a new version in pre-production clusters before upgrading production clusters. To use this feature, you should be familiar with cluster upgrades, release channels, and fleet management.

To get started, refer to Sequence the rollout of cluster upgrades.

Terminology

This document uses the term group to refer to either a fleet or a team scope, because you can organize a rollout sequence with either grouping method.

Qualify upgrades across environments

To automatically upgrade clusters with rollout sequencing, use fleets or team scopes where you've grouped your clusters with the same release channel and minor version into stages of deployment. Choose the sequence of fleets or sequence of team scopes and set how much soak testing time you want between each group of clusters. Then, when GKE selects a new version for automatic upgrades in the release channel, your groups of clusters are upgraded in the sequence you've defined, and you can validate that workloads run as expected with a new version before upgrades begin with your production clusters.

Fleet-based rollout sequence

The following diagram illustrates how GKE automatically upgrades clusters in a rollout sequence organized with fleets:

Fleet-based rollout sequence. You can organize clusters into fleets, or further subdivide them in fleets with scopes.

With a fleet-based sequence, when GKE makes a new upgrade target available in the release channel where all clusters in the sequence are enrolled, GKE upgrades the fleets in the order that you defined. Clusters in each upstream fleet qualify the new version for clusters in the downstream fleet. A sequence can contain up to three fleets. In a rollout sequence, upstream refers to the previous group, and downstream refers to the next group.
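
The upstream and downstream relationships can be pictured as positions in an ordered list. The following Python sketch is only an illustrative model, not a GKE API: the fleet names and the helper functions `upstream_of` and `downstream_of` are hypothetical.

```python
from typing import List, Optional

# Hypothetical fleet names, ordered from most-upstream to most-downstream.
SEQUENCE: List[str] = ["testing-fleet", "staging-fleet", "production-fleet"]


def upstream_of(group: str, sequence: List[str] = SEQUENCE) -> Optional[str]:
    """Return the previous group in the sequence, or None for the first group."""
    index = sequence.index(group)
    return sequence[index - 1] if index > 0 else None


def downstream_of(group: str, sequence: List[str] = SEQUENCE) -> Optional[str]:
    """Return the next group in the sequence, or None for the last group."""
    index = sequence.index(group)
    return sequence[index + 1] if index + 1 < len(sequence) else None


print(upstream_of("staging-fleet"))       # testing-fleet
print(downstream_of("production-fleet"))  # None
```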

During the configured soak time between fleets—after upgrades complete on the upstream fleet and before they begin on the downstream fleet—you can confirm that your workloads are running as expected on the upgraded clusters.

Team-based rollout sequence

If you've further subdivided the clusters in a fleet by team or application, you can create a rollout sequence between team scopes. Team scopes are an enterprise fleet-level construct for associating subsets of fleet clusters with specific application teams, and can be used to enable a range of team-based features, including access control and team-scoped observability as well as rollout sequencing.

Scope-based rollout sequences. You can organize clusters into fleets, or further subdivide them in fleets with scopes.

With team scopes, you can create multiple rollout sequences in a fleet, each with its own release channel, upgrade targets, and soak times. A team-based rollout sequence functions identically to a fleet-based rollout sequence, except that upgrades are qualified between a specific team's clusters in each fleet, instead of fleet-to-fleet. This is particularly useful for application operators who want to manage upgrades within their own team's clusters.

Team-based rollout sequencing is in Preview, while fleet-based rollout sequencing is generally available (GA).

How GKE upgrades clusters in a rollout sequence

When GKE upgrades a cluster, it upgrades the control plane first and then the nodes. In a rollout sequence, clusters are still upgraded using this process, but you also control the order in which groups (fleets or scopes) of clusters are upgraded, and you specify a soak time that determines how long GKE waits before upgrades proceed from one group to the next.

Cluster upgrades in a rollout sequence proceed with the following steps:

  1. GKE sets a new automatic upgrade target for clusters on a minor version in a specific release channel, with the release note mentioning something similar to the following message: "Control planes and nodes with auto-upgrade enabled in the Regular channel will be upgraded from version 1.21 to version 1.22.15-gke.1000 with this release."
  2. GKE begins upgrading cluster control planes to the new version in the first group of clusters. After GKE upgrades a cluster's control plane, GKE begins upgrading the cluster's nodes. GKE respects maintenance availability when upgrading clusters in a rollout sequence.
  3. GKE does the following next steps for control plane upgrades:
    1. GKE begins the soaking period for control plane upgrades after all cluster control plane upgrades in the first group are finished, or after 30 days have passed since control plane upgrades began, whichever comes first.
    2. After the soaking period for cluster control plane upgrades in the first group is complete, GKE begins control plane upgrades in the second group.
  4. In parallel to control plane upgrades, GKE does the following next steps for node upgrades:
    1. GKE begins the soaking period for node upgrades after all node upgrades for clusters in the first group are finished, or after 30 days have passed since node upgrades began, whichever comes first.
    2. After the soaking period for node upgrades in the first group is complete, GKE begins node upgrades in the second group for clusters where the control plane has already been upgraded.
  5. GKE repeats these steps from the second group to the third group, until clusters in all groups in the rollout sequence have been upgraded to the new upgrade target.
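
The timing rule in the steps above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not GKE's implementation: the function names `soak_start` and `next_group_start` are hypothetical, and the result is the earliest time at which upgrades could begin in the next group, before maintenance windows or exclusions are taken into account. The same calculation would apply separately to control plane upgrades and to node upgrades.

```python
from datetime import datetime, timedelta
from typing import Optional

# The steps above state that soaking begins after 30 days even if
# some upgrades in the group have not finished.
MAX_UPGRADE_PHASE = timedelta(days=30)


def soak_start(upgrades_began: datetime, all_finished: Optional[datetime]) -> datetime:
    """Soaking starts when every upgrade in the group is done, or at the 30-day cap."""
    cap = upgrades_began + MAX_UPGRADE_PHASE
    return cap if all_finished is None else min(all_finished, cap)


def next_group_start(
    upgrades_began: datetime,
    all_finished: Optional[datetime],
    soak_time: timedelta,
) -> datetime:
    """Earliest time at which upgrades can begin in the next group."""
    return soak_start(upgrades_began, all_finished) + soak_time


began = datetime(2024, 1, 1)
# All upgrades finished after four days, followed by a 14-day soak.
print(next_group_start(began, datetime(2024, 1, 5), timedelta(days=14)))  # 2024-01-19 00:00:00
# Upgrades never finished, so the 30-day cap applies before the soak.
print(next_group_start(began, None, timedelta(days=14)))  # 2024-02-14 00:00:00
```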

As clusters are upgraded in each group, verify during the soak time that your workloads run as expected with clusters running the new GKE version.

Clusters might also be prevented from upgrading due to maintenance windows or exclusions, deprecated API usage, or other reasons. To learn more, see How rollout sequencing works with other upgrade features.

How to control upgrades in a rollout sequence

With cluster upgrades in a rollout sequence, groups of clusters are upgraded in the order that you've defined, and each group soaks for the amount of time that you've chosen. While upgrades are in progress, you can check the status of a rollout sequence and manage the rollout sequence as needed. You can also control the process with soak time overrides, maintenance windows and exclusions, and node upgrade strategies.

To learn more, see how rollout sequencing works with other upgrade features.

Example: Community bank gradually rolls out changes from Testing to Production

As an example, the platform administrator at a community bank manages three main deployment environments, each a group of clusters organized in a fleet: Testing, Staging, and Production. As is required for rollout sequencing, the administrator has enrolled each cluster across all three fleets in the same release channel—in these fleets, the Regular channel—with all clusters running the same minor version.

The administrator uses rollout sequencing to define the order in which GKE upgrades clusters in these environments. Ordering the rollout gives the administrator the opportunity to verify that their workloads run as expected with clusters on a new version of GKE before the Production environment is upgraded to the new version. This sequence is illustrated by the fleet-based rollout sequence diagram.

The administrator uses the soak time between these fleets being upgraded to verify that their workloads run as expected with clusters on the new version of GKE. For the Testing fleet, the administrator sets the soak time to 14 days so that they have two full weeks to test out how the workloads run. For Staging, they set the soak time to 7 days as they don't need as much additional time after the workloads have already been running in Testing.

The administrator can also override the default soak time for upgrades to specific versions, which they might want to do in one of the following situations:

  • The administrator finishes qualifying the version before the soak time is complete and wants upgrades to proceed to the next fleet, so they set the soak time to zero.
  • The administrator needs more time to qualify the new version before upgrades proceed to the next fleet as they've noticed an issue with some of their workloads, so they set the soak time to the maximum 30 days.
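
The relationship between a default soak time and a per-version override can be modeled as a lookup with a fallback. The following Python sketch is a simplified illustration, not the GKE API: the function `effective_soak` and the way overrides are keyed by version string are assumptions made for this example. The 30-day maximum comes from the scenario above.

```python
from datetime import timedelta
from typing import Dict

MAX_SOAK = timedelta(days=30)  # maximum soak time mentioned in this document


def effective_soak(
    default_soak: timedelta,
    overrides: Dict[str, timedelta],
    version: str,
) -> timedelta:
    """Return the override for this version if one exists, else the default."""
    soak = overrides.get(version, default_soak)
    if soak < timedelta(0) or soak > MAX_SOAK:
        raise ValueError(f"soak time must be between 0 and 30 days, got {soak}")
    return soak


# Testing fleet: 14-day default, but one version was qualified early.
testing_overrides = {"1.23.8-gke.1900": timedelta(0)}
print(effective_soak(timedelta(days=14), testing_overrides, "1.23.8-gke.1900"))  # 0:00:00
print(effective_soak(timedelta(days=14), testing_overrides, "1.24.5-gke.600"))   # 14 days, 0:00:00
```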

The administrator uses maintenance windows and exclusions to ensure that GKE upgrades clusters when it is least disruptive for the bank. GKE respects maintenance availability for clusters upgraded in a rollout sequence.

  • The administrator has configured maintenance windows for their clusters to ensure that GKE only upgrades clusters after business hours.
  • The administrator also uses maintenance exclusions to temporarily prevent clusters from being upgraded if they detect issues with the cluster's workloads.

The administrator uses a mix of surge upgrades and blue-green upgrades for their nodes, balancing between speed and risk tolerance depending on the workloads running on those nodes.

Administrator switches to team-based rollout sequences

If the administrator decides that they need to further group clusters inside a fleet by application, and give their application team admins greater control over their cluster upgrades, they can use team scopes. With team scopes, application team admins can create independent rollout sequences with the groups of clusters assigned to their teams, potentially running on different release channels, or with different soak times.

For example, if the database team wants their clusters to use the Stable channel and longer soak times while the frontend website team's clusters use the Rapid channel and shorter soak times, they can use their team scopes to create separate rollout sequences. This type of sequence is illustrated by the team-based rollout sequence diagram. To do this for your environment, follow the instructions to switch between fleet-based and team-based rollout sequences.

Note that use of this feature requires single-tenancy clusters: in other words, each individual cluster is only associated with a single team. Shared clusters (which are supported in general fleet team management) are not supported for rollout sequencing. You can learn more about managing clusters for teams in Fleet team management.

Rollout eligibility

For clusters to be automatically upgraded with rollout sequencing, all clusters across all groups (fleets or scopes) in a rollout sequence must receive the same upgrade target. Clusters must be enrolled in the same release channel, and we recommend that clusters run the same minor version, because upgrade targets are set per minor version. However, for some releases, like the release in the following example, clusters on multiple minor versions received the same target, meaning that clusters running multiple minor versions could still be upgraded successfully in the rollout sequence.

You can check the status of a version rollout in a sequence to get more information about the status and to see whether version eligibility issues are preventing upgrades from proceeding. Depending on the version discrepancies, you might need to take actions such as manually upgrading a cluster or removing it from a group for cluster upgrades to proceed. If a cluster in a rollout sequence doesn't have an eligible upgrade target, GKE won't auto-upgrade the cluster until the cluster's existing minor version reaches end of life.

To troubleshoot rollout eligibility, see Troubleshoot rollout eligibility.

Example GKE release

As an example, the 2022-R25 release set an upgrade target for multiple minor versions in clusters enrolled in the Regular channel. An upgrade target can be a new minor version (1.20 to 1.21), or just a new patch version (1.21.x-gke.x to 1.21.14-gke.4300). In this release, in the Regular channel, the following new versions were made available for clusters on specific minor versions:

  • Clusters on 1.20 and 1.21 were upgraded to 1.21.14-gke.4300.
  • Clusters on 1.22 were upgraded to 1.23.8-gke.1900.
  • Clusters on 1.24 were upgraded to 1.24.5-gke.600.
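
To make the per-minor-version mapping concrete, the following Python sketch encodes the targets listed above as a lookup table. It is an illustration only: the table name and the helpers `minor_of` and `upgrade_target` are hypothetical, the table covers only the minor versions listed for this release, and the sketch doesn't check whether a cluster is already running the target version.

```python
from typing import Dict, Optional

# Upgrade targets from the example release (Regular channel), keyed by
# the minor version that a cluster is currently running.
REGULAR_2022_R25_TARGETS: Dict[str, str] = {
    "1.20": "1.21.14-gke.4300",
    "1.21": "1.21.14-gke.4300",
    "1.22": "1.23.8-gke.1900",
    "1.24": "1.24.5-gke.600",
}


def minor_of(version: str) -> str:
    """Extract the minor version, for example '1.21' from '1.21.14-gke.4300'."""
    major, minor = version.split(".")[:2]
    return f"{major}.{minor}"


def upgrade_target(current_version: str) -> Optional[str]:
    """Return the upgrade target for a cluster, or None if none is listed."""
    return REGULAR_2022_R25_TARGETS.get(minor_of(current_version))


print(upgrade_target("1.20.15-gke.100"))  # 1.21.14-gke.4300
print(upgrade_target("1.22.12-gke.300"))  # 1.23.8-gke.1900
```

The cluster versions passed to `upgrade_target` here are made-up inputs chosen to exercise the table.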

The most-upstream group receives all upgrade targets

For clusters in the first group in a sequence, which does not have an upstream group to qualify new versions, GKE upgrades any clusters with eligible upgrade targets, regardless of whether those upgrade targets differ from each other. For example, in the first group of a sequence, if some clusters were running 1.20, those clusters could be upgraded to 1.21.14-gke.4300, and clusters running 1.24 could be upgraded to 1.24.5-gke.600. This is because, for the first group in a sequence, GKE considers all upgrade targets to be qualified for these clusters as there is no upstream group to qualify a new version.

An upstream group must qualify only one version

In any downstream group, whether GKE can upgrade clusters depends on whether the upstream group qualified one upgrade target for which all clusters in this group are eligible. Typically, this means that all clusters start on the same minor version. However, from the example release, clusters on 1.20 and 1.21 had the same upgrade target, so clusters running both versions could, in the same group, qualify the upgrade to 1.21.14-gke.4300.

If all clusters in one group do not have the same upgrade target, this group cannot qualify one upgrade target for the next group. In this situation, GKE cannot automatically upgrade clusters in downstream groups. For example, if, in the first group, some clusters were upgraded to 1.21.14-gke.4300, and others to 1.23.8-gke.1900, the second group's clusters cannot be automatically upgraded as the group did not receive one qualified version. To advance the upgrades in this situation, see Fix eligibility in a group.
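
The "one qualified version" rule amounts to checking that every cluster in the upstream group ended up on the same version. The following Python sketch models that check; the function name `qualified_version` is hypothetical, and this is not how GKE exposes qualification status.

```python
from typing import Iterable, Optional


def qualified_version(upgraded_versions: Iterable[str]) -> Optional[str]:
    """Return the single version that a group qualified, or None if the
    group's clusters were upgraded to more than one version (or it is empty)."""
    distinct = set(upgraded_versions)
    return next(iter(distinct)) if len(distinct) == 1 else None


# Clusters that started on 1.20 and 1.21 all reached the same target.
print(qualified_version(["1.21.14-gke.4300", "1.21.14-gke.4300"]))  # 1.21.14-gke.4300
# Mixed targets: the group cannot qualify one version for the next group.
print(qualified_version(["1.21.14-gke.4300", "1.23.8-gke.1900"]))   # None
```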

An upstream group must qualify a version matching with the next group's clusters

If clusters in an upstream group qualified a different version than the one for which clusters in the next group were eligible, GKE also cannot automatically upgrade the clusters in any downstream groups.

For example, if all clusters in the first group were upgraded to 1.21.14-gke.4300, but the clusters in the second group were running 1.22 (where the upgrade target is 1.23.8-gke.1900), the second group's clusters would not be automatically upgraded. The first group qualified 1.21.14-gke.4300, but the clusters in the second group (currently on 1.22) are only eligible for the upgrade target 1.23.8-gke.1900, so GKE cannot automatically upgrade these clusters. To advance upgrades in this situation, see Fix eligibility between groups.
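
A mismatch between groups can be found by comparing the version that the upstream group qualified with each downstream cluster's own upgrade target. The following Python sketch is an illustrative model; the function `blocked_clusters` and the cluster names are hypothetical.

```python
from typing import Dict, List


def blocked_clusters(qualified: str, downstream_targets: Dict[str, str]) -> List[str]:
    """Return the downstream clusters whose upgrade target differs from the
    version that the upstream group qualified."""
    return sorted(
        name for name, target in downstream_targets.items() if target != qualified
    )


# The first group qualified 1.21.14-gke.4300, but one second-group cluster
# runs 1.22, where the upgrade target is 1.23.8-gke.1900.
second_group = {
    "prod-a": "1.23.8-gke.1900",
    "prod-b": "1.21.14-gke.4300",
}
print(blocked_clusters("1.21.14-gke.4300", second_group))  # ['prod-a']
```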

How rollout sequencing works with other upgrade features

Rollout sequencing is one feature in a collection of features that give you control over the upgrade aspect of the cluster lifecycle. This section explains how this feature works with some of the other available features related to cluster upgrades.

How rollout sequencing works with maintenance windows and exclusions

GKE respects maintenance windows and maintenance exclusions when upgrading clusters with rollout sequencing. GKE only starts a cluster upgrade within a cluster's maintenance window. You can use a maintenance exclusion to temporarily prevent a cluster from being upgraded. If GKE cannot upgrade a cluster due to a maintenance window or exclusion, this can prevent cluster upgrades from finishing in a group. If a cluster upgrade cannot be completed within 30 days due to maintenance windows or exclusions, the group will enter its soak phase regardless of whether all clusters have finished upgrading.
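
As a rough mental model, an upgrade can start only at a time that falls inside a maintenance window and outside every maintenance exclusion. The following Python sketch illustrates that rule under simplifying assumptions: real maintenance windows are recurring, whereas this example uses concrete time intervals, and it assumes that the cluster has at least one window configured. The function name `can_start_upgrade` is hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

Interval = Tuple[datetime, datetime]  # (start, end), end is exclusive


def can_start_upgrade(
    now: datetime,
    windows: List[Interval],
    exclusions: List[Interval],
) -> bool:
    """True if `now` is inside a maintenance window and outside all exclusions."""
    in_window = any(start <= now < end for start, end in windows)
    excluded = any(start <= now < end for start, end in exclusions)
    return in_window and not excluded


after_hours = [(datetime(2024, 1, 10, 22), datetime(2024, 1, 11, 2))]
freeze = [(datetime(2024, 1, 10), datetime(2024, 1, 12))]

print(can_start_upgrade(datetime(2024, 1, 10, 23), after_hours, []))      # True
print(can_start_upgrade(datetime(2024, 1, 10, 23), after_hours, freeze))  # False
print(can_start_upgrade(datetime(2024, 1, 10, 12), after_hours, []))      # False
```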

You can use maintenance exclusions as a temporary measure to prevent a sequence from completing a rollout to a group and moving on to the next group. To learn more, see Delay the completion of group's version rollout.

How rollout sequencing works with deprecation usage detection

GKE pauses cluster upgrades when it detects usage of certain deprecated APIs and features. Automatic upgrades are also paused for clusters in a group in a rollout sequence. To learn more, see How Kubernetes deprecations work with GKE.

How rollout sequencing works with node upgrade strategies

Nodes use their configured node upgrade strategy when they are upgraded in a rollout sequence. As with cluster upgrades without rollout sequencing, GKE uses surge upgrades for Autopilot nodes. For more information, see Automatic node upgrades.

If node upgrades cannot complete within 30 days, the group will enter its soak phase regardless of whether all clusters have finished upgrading. This can happen if the node upgrade strategy causes a Standard cluster's node upgrade to take longer to complete, especially if it is a large node pool. Maintenance windows that are too short for a node upgrade to complete can make this worse. To learn more, see Considerations when configuring a maintenance window.

How rollout sequencing works with release channels

Release channels are required to use rollout sequencing. All clusters in all groups in a rollout sequence must be on the same release channel.

Receiving multiple upgrades across a sequence

If a new version becomes an upgrade target on the release channel while cluster upgrades to a previous upgrade target are still proceeding in the rollout sequence, an upstream group can begin the rollout of a new version while a downstream group is still receiving the previous upgrade. For example, if the third group in a sequence is rolling out 1.24.2-gke.100, the first group in the sequence can concurrently be rolling out 1.24.3-gke.500.

Considerations when choosing rollout sequencing

Consider using rollout sequencing if you want to manage cluster upgrades by qualifying new versions in one environment before rolling them out to another.

However, this might not be the right choice for your environment if any of the following statements are true:

  • You have clusters that are not on the same release channel or minor version in the same production environment.
  • You need to automate upgrades that cannot be mapped to only three stages of deployment, as you can only create a rollout sequence with up to three groups of clusters. You cannot link groups in multiple rollout sequences to create a rollout sequence with more than three groups.
  • You cannot use fleet management.
  • You frequently perform manual upgrades that cause clusters in one group to have different automatic upgrade target versions.

To create team-based rollout sequences, you also must be able to enable GKE Enterprise in your fleet host projects.

Limitations

To successfully upgrade your clusters with rollout sequencing, you must adhere to the following limitations:

  • If you are using team-based rollout sequencing, enroll a cluster in only one team scope. If a cluster is enrolled in multiple team scopes, GKE cannot automatically upgrade the cluster in a team-based rollout sequence.
  • Creating a team-based rollout sequence with multiple team scopes within the same fleet is unsupported.
  • Create a linear rollout sequence without cycles (a group has a downstream group as its upstream group) or branches (a group has more than one downstream group).
  • Create rollout sequences between a team's scopes, or rollout sequences between fleets. You cannot create mixed sequences with both fleets and team scopes in the same sequence.
  • Ensure that your clusters in a rollout sequence are all enrolled in the same release channel, and are running the same minor version.
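
The structural limitations above (a linear sequence with no cycles or branches) together with the three-group maximum described earlier can be expressed as a small validation routine. The following Python sketch is a hypothetical pre-flight check, not a GKE tool: it assumes that a sequence is described as a mapping from each group to its upstream group, with `None` for the first group, and the function names are invented for this example.

```python
from typing import Dict, List, Optional

MAX_GROUPS = 3  # a rollout sequence can contain up to three groups


def has_cycle(upstream_by_group: Dict[str, Optional[str]]) -> bool:
    """Walk upstream from every group; revisiting a group means a cycle."""
    for group in upstream_by_group:
        seen = {group}
        current = upstream_by_group.get(group)
        while current is not None:
            if current in seen:
                return True
            seen.add(current)
            current = upstream_by_group.get(current)
    return False


def validate_sequence(upstream_by_group: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of problems; an empty list means the shape is valid."""
    problems = []
    if len(upstream_by_group) > MAX_GROUPS:
        problems.append("too many groups: the maximum is three")
    upstreams = [u for u in upstream_by_group.values() if u is not None]
    if len(set(upstreams)) != len(upstreams):
        problems.append("branch: a group has more than one downstream group")
    if has_cycle(upstream_by_group):
        problems.append("cycle: a group has a downstream group as its upstream")
    return problems


linear = {"testing": None, "staging": "testing", "production": "staging"}
print(validate_sequence(linear))  # []
```

This check covers only the shape of the sequence. It doesn't verify release channels, minor versions, or team scope membership.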

Known issues

  • If a group contains clusters from different locations, a cluster upgrade might temporarily only be available to some of the clusters due to the gradual rollout of the new version. This is more likely to happen to the first group of clusters and should resolve within a week.
  • If there is an empty group in a rollout sequence, how this affects version qualification depends on the following conditions:
    • If the empty group has no upstream group, then cluster upgrades do not proceed to the downstream group as the empty group cannot qualify versions.
    • If the empty group has an upstream group, all pending cluster upgrades enter the COMPLETE status and propagate to the downstream group.
  • Due to how GKE tracks patch and minor upgrades, you might see two upgrades of the same type and version but with different statuses when checking the status of the scope.
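
The two empty-group conditions described above can be summarized as a single decision. The following Python sketch is only a restatement of that known issue in code form; the function name `empty_group_effect` and the returned strings are hypothetical and don't correspond to GKE status output.

```python
def empty_group_effect(has_upstream_group: bool) -> str:
    """Describe how an empty group affects version qualification."""
    if has_upstream_group:
        # Pending upgrades are treated as finished and passed along.
        return "pending upgrades enter COMPLETE and propagate downstream"
    # With no upstream group, the empty group would have to qualify the
    # version itself, which it cannot do without clusters.
    return "upgrades do not proceed to the downstream group"


print(empty_group_effect(has_upstream_group=False))
print(empty_group_effect(has_upstream_group=True))
```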

What's next