Best practices for upgrading clusters

This page provides guidelines for keeping your Google Kubernetes Engine (GKE) cluster seamlessly up-to-date, and recommendations for creating an upgrade strategy that fits your needs and increases availability and reliability of your environments. You can use this information to keep your clusters updated for stability and security with minimal disruptions to your workloads.

Alternatively, to manage automatic cluster upgrades across production environments organized with fleets, see About cluster upgrades with rollout sequencing.

Set up multiple environments

As part of your workflow for delivering software updates, we recommend that you use multiple environments. Multiple environments help you minimize risk and unwanted downtime by testing software and infrastructure updates separately from your production environment. At minimum, you should have a production environment and a pre-production or test environment.

Consider the following recommended environments:

Environment	Description
Production	Used to serve live traffic to end users for mission critical business applications.
Staging	Used to ensure that all new changes deployed from previous environments are working as intended before the changes are deployed to production.
Testing	Used to benchmark performance and to test and QA workloads against the GKE release you will use in production. In this environment, you can test the upgrade of the control plane and nodes before doing so in production.
Development	Used for active development that relies on the same version running in production. In this environment, you create fixes and incremental changes to be deployed in production.
Canary	Used as a secondary development environment for testing newer Kubernetes releases, GKE features and APIs to gain better time to market once these releases are promoted and become auto-upgrade targets.

Enroll clusters in release channels

Kubernetes frequently releases updates to deliver security fixes, fix known issues, and introduce new features. GKE release channels let you balance the stability and the feature set of the version deployed in the cluster. When you enroll a new cluster in a release channel, Google automatically manages the version and upgrade cadence for the cluster and its node pools.

To keep clusters up-to-date with the latest GKE and Kubernetes updates, here are some recommended environments and the respective release channels the clusters should be enrolled in:

Environment	Release channel	Description
Production	Stable or Regular	For stability and version maturity, use the Stable or Regular channel for production workloads.
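Staging	Same as production	To ensure your tests are indicative of the version your production will be upgraded to, use the same release channel as production.
Testing	Same as production	To ensure your tests are indicative of the version your production will be upgraded to, use the same release channel as production.
Development	Same as production	To ensure your tests are indicative of the version your production will be upgraded to, use the same release channel as production.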
Canary	Rapid	To test the latest Kubernetes releases and to get ahead of the curve by testing new GKE features or APIs, use the Rapid channel. You can improve your time to market when the version in the Rapid channel is promoted to the channel you're using for production.
N/A	Extended	To keep your cluster on a minor version for longer while still receiving security patches past the end of standard support date, use the Extended channel. To learn more, see Use the Extended channel when you need long-term support.

Cluster control planes are always upgraded on a regular basis, regardless of whether your cluster is enrolled in a release channel or not.

Create a continuous upgrade strategy

After enrolling your cluster in a release channel, that cluster is regularly upgraded to the version that meets the quality and stability bar for the channel. These updates include security and bug fixes, applied with increasing scrutiny at each channel:

Patches are pushed out to control plane and nodes in all channels gradually, accumulating soak time in the Rapid and Regular channels before landing in the Stable channel.
The control plane is upgraded first, followed by nodes to comply with the Kubernetes OSS policy that the kubelet must not be newer than kube-apiserver.
GKE will automatically roll out patches to channels based on their criticality and importance.
The Stable channel receives only critical patches.
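The ordering rule in the list above (control plane first, then nodes) can be expressed as a small check. The following is a minimal sketch, not GKE's implementation: the function names are hypothetical, the version strings are made-up examples, and comparing only the major.minor part is an assumption about how the policy is applied.

```python
import re


def minor_version(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '1.30.5-gke.1000'."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def node_version_allowed(control_plane: str, node: str) -> bool:
    """True when the node (kubelet) minor version is not newer than the control plane's."""
    # Assumption: only major.minor is compared, not the patch level.
    return minor_version(node) <= minor_version(control_plane)


# Hypothetical versions: nodes one minor behind the control plane are acceptable.
print(node_version_allowed("1.31.1-gke.1000", "1.30.5-gke.1000"))
# Nodes ahead of the control plane are not, which is why the control plane goes first.
print(node_version_allowed("1.30.5-gke.1000", "1.31.1-gke.1000"))
```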

Receive updates about new GKE versions

Information about new versions is published to the main GKE release notes page, as well as to an RSS feed. Each release channel has a simplified and dedicated release notes page (example: Release notes for the Stable channel) with information about the recommended GKE version for that channel.

To proactively receive updates about GKE upgrades before the upgrades occur, use Pub/Sub and subscribe to upgrade notifications.

Once a new version becomes available, you should plan an upgrade before the version becomes the auto-upgrade target for the cluster. This approach provides more control and predictability when needed, because GKE doesn't auto-upgrade a cluster if the available auto-upgrade target is earlier than or the same as the version to which you already manually upgraded the cluster. To get auto-upgrade targets for a specific cluster, see Get information about a cluster's upgrades.
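The rule that a manually upgraded cluster is skipped when the auto-upgrade target is the same or earlier can be sketched as a version comparison. This is an illustration only: the helper names are hypothetical, and the code assumes version strings of the form MAJOR.MINOR.PATCH-gke.N that order numerically, which might not cover every version string you encounter.

```python
import re


def parse_version(version: str) -> tuple[int, ...]:
    """Split an assumed 'MAJOR.MINOR.PATCH-gke.N' string into comparable integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def auto_upgrade_applies(current: str, target: str) -> bool:
    """True only when the auto-upgrade target is strictly later than the current version."""
    return parse_version(target) > parse_version(current)


# Hypothetical example: the cluster was already manually upgraded to the target.
print(auto_upgrade_applies("1.30.5-gke.1000", "1.30.5-gke.1000"))
```

Under these assumptions, upgrading ahead of time to the version you have qualified means that the later auto-upgrade to that same target has nothing left to do.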

Test and verify new patch and minor versions

All releases pass internal testing regardless of the channel they are released in. However, given the frequent updates and patches from upstream Kubernetes and GKE, we highly recommend testing new releases in your testing or staging environments before the releases are rolled out into your production environment, especially Kubernetes minor version upgrades.

Each release channel offers multiple available versions, including a default version for cluster creation, and auto-upgrade targets:

New patch releases are available a week prior to becoming auto-upgrade targets.
New Kubernetes minor releases are available four weeks prior to becoming auto-upgrade targets.

GKE automatically upgrades clusters to newer versions. If more control over the upgrade process is necessary, we recommend upgrading ahead of time to an available version. GKE doesn't auto-upgrade manually-upgraded clusters to the same auto-upgrade target.

A recommended approach to automate and streamline upgrades would involve:

A pre-production environment using the available version.
Upgrade notifications set up on the cluster to inform your team about new available versions to test and certify.
A production environment subscribed to a release channel using a version that you've already tested in your pre-production environment.
Gradual rollout of new available versions to production clusters. For example, if there are multiple production clusters, a gradual upgrade plan would start by upgrading a portion of these clusters to the available version while keeping the others on the existing version, followed by additional small portion upgrades until 100% is upgraded.
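The gradual rollout in the last item above can be planned by splitting production clusters into small waves. The sketch below is one possible way to do that; the function name, the cluster names, and the fixed wave size are all hypothetical choices, not part of GKE.

```python
def rollout_waves(clusters: list[str], wave_size: int) -> list[list[str]]:
    """Split clusters into consecutive waves of at most wave_size clusters each."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    return [clusters[i:i + wave_size] for i in range(0, len(clusters), wave_size)]


# Hypothetical production clusters, upgraded two at a time.
production = ["prod-a", "prod-b", "prod-c", "prod-d", "prod-e"]
for number, wave in enumerate(rollout_waves(production, wave_size=2), start=1):
    print(f"wave {number}: {wave}")
```

Each wave is upgraded to the available version only after the previous wave has been verified, while the remaining clusters stay on the existing version.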

The following table summarizes the release events and recommended actions:

Event	Recommended action
New version X is made available in a channel.	Manually upgrade your testing cluster and qualify and test the new version.
Version X becomes an auto-upgrade target for the cluster's minor version.	GKE starts auto-upgrading to the auto-upgrade target. Consider upgrading production ahead of the fleet.
GKE starts auto-upgrading clusters.	Allow clusters to get auto-upgraded, or postpone the upgrade using maintenance exclusion windows.

Upgrade strategy for patch releases

Here's a recommended upgrade strategy for patch releases, using a scenario where:

All clusters are subscribed to the Stable channel.
New available versions are rolled out to the staging cluster first.
The production cluster is upgraded automatically to the new auto-upgrade target.
New available versions for GKE are monitored regularly.

Time	Event	What should I do?
T - 1 week	New patch version becomes available.	Upgrade staging environment.
T	Patch version becomes the auto-upgrade target.	Consider upgrading the production control plane ahead of time for better predictability.
T	GKE will start upgrading control planes to the auto-upgrade target.	Consider upgrading the production node pools ahead of time for better predictability.
T + 1 week	GKE will start upgrading cluster node pools to the auto-upgrade target.	GKE will auto-upgrade clusters, skipping the manually-upgraded clusters.

Upgrade strategy for new minor releases

Here's a recommended upgrade strategy for new minor releases:

Time	Event	What should I do?
T - 3 weeks	New minor version becomes available	Upgrade testing control plane
T - 2 weeks		Given a successful control plane upgrade, consider upgrading the production control plane ahead of time. Upgrade testing node pools.
T - 1 week		Given a successful upgrade, consider upgrading production node pools ahead of time.
T	Minor version becomes the auto-upgrade target.
T	GKE will start upgrading cluster control planes to the auto-upgrade target.	Create an exclusion window if more testing or mitigation is needed before production rollout.
T + 1 week	GKE will start upgrading cluster node pools to the auto-upgrade target.	GKE will auto-upgrade clusters, skipping the manually-upgraded clusters.
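The two timelines above can be turned into calendar dates once you know T, the day a version becomes the auto-upgrade target. The following is a rough sketch: the week offsets are copied from the tables, the action text is abbreviated, and the names and the example date are hypothetical.

```python
from datetime import date, timedelta

# (weeks relative to T, abbreviated action), taken from the tables above.
PATCH_SCHEDULE = [
    (-1, "Upgrade staging environment"),
    (0, "Consider upgrading production control plane and node pools ahead of time"),
    (1, "GKE auto-upgrades node pools, skipping manually upgraded clusters"),
]
MINOR_SCHEDULE = [
    (-3, "Upgrade testing control plane"),
    (-2, "Upgrade testing node pools; consider upgrading production control plane"),
    (-1, "Consider upgrading production node pools"),
    (0, "Create an exclusion window if more testing is needed"),
    (1, "GKE auto-upgrades node pools, skipping manually upgraded clusters"),
]


def plan(target_date: date, schedule: list[tuple[int, str]]) -> list[tuple[date, str]]:
    """Convert week offsets relative to T into concrete dates."""
    return [(target_date + timedelta(weeks=weeks), action) for weeks, action in schedule]


# Hypothetical T: 24 March 2025.
for day, action in plan(date(2025, 3, 24), MINOR_SCHEDULE):
    print(day.isoformat(), action)
```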

Reduce disruption to existing workloads during an upgrade

Keeping your clusters up-to-date with security patches and bug fixes is critical for ensuring the vitality of your clusters, and for business continuity. Regular updates protect your workloads from vulnerabilities and failures.

Schedule maintenance windows and exclusions

To increase upgrade predictability and to align upgrades with off-peak business hours, you can control automatic upgrades of both the control plane and nodes by creating a maintenance window. GKE respects maintenance windows. Namely, if the upgrade process runs beyond the defined maintenance window, GKE attempts to pause the operation, and resumes the operation during the next maintenance window.

GKE follows a multi-day rollout schedule for making new versions available, as well as auto-upgrading cluster control planes and nodes in different regions. The rollout generally spans four or more days, and includes a buffer of time to observe and monitor for problems. In a multi-cluster environment, you can use a separate maintenance window for each cluster to sequence the rollout across your clusters. For example, you might want to control when clusters in different regions receive maintenance by setting different maintenance windows for each cluster.

Another tool to reduce disruption, especially during high-demand business periods, is maintenance exclusions. Use maintenance exclusions to prevent automatic maintenance from occurring during these periods; maintenance exclusions can be set on new or existing clusters. You can also use exclusions in conjunction with your upgrade strategy. For example, you might want to postpone an upgrade to a production cluster if a testing or staging environment fails because of an upgrade.
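To reason about how windows and exclusions interact, it can help to model them. The sketch below is a simplified model under stated assumptions: it treats a maintenance window as a daily recurring time range and an exclusion as a fixed date range, uses naive datetimes, and uses hypothetical names. It is not the GKE API and does not reflect how GKE schedules maintenance internally.

```python
from datetime import datetime, time, timedelta


def in_daily_window(now: datetime, start: time, duration: timedelta) -> bool:
    """True when now falls inside a window that opens every day at start."""
    # Check today's window and yesterday's, because a window can cross midnight.
    for day_offset in (0, -1):
        opened = datetime.combine(now.date() + timedelta(days=day_offset), start)
        if opened <= now < opened + duration:
            return True
    return False


def maintenance_allowed(
    now: datetime,
    start: time,
    duration: timedelta,
    exclusions: list[tuple[datetime, datetime]],
) -> bool:
    """Automatic maintenance may run only inside the window and outside every exclusion."""
    excluded = any(begin <= now < end for begin, end in exclusions)
    return not excluded and in_daily_window(now, start, duration)


# Hypothetical setup: a 02:00-06:00 window and an exclusion for a high-demand period.
busy_period = [(datetime(2025, 11, 27), datetime(2025, 12, 1))]
print(maintenance_allowed(datetime(2025, 11, 20, 3, 0), time(2, 0), timedelta(hours=4), busy_period))
print(maintenance_allowed(datetime(2025, 11, 28, 3, 0), time(2, 0), timedelta(hours=4), busy_period))
```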

Set your tolerance for disruption

You might be familiar with the concept of replicas in Kubernetes. Replicas ensure redundancy of your workloads for better performance and responsiveness. When set, replicas govern the number of Pod replicas running at any given time. However, during maintenance, Kubernetes removes the underlying node VMs, which can reduce the number of replicas. To ensure your workloads have a sufficient number of replicas for your applications, even during maintenance, use a Pod Disruption Budget (PDB).

In a Pod Disruption Budget, you can define a number (or percentage) of Pods that can be terminated, even if terminating the Pods brings the current replica count below the desired value. This process may speed up the node drain by removing the need to wait for migrated Pods to become fully operational. Instead, the drain evicts Pods from a node following the PDB configuration, allowing the Deployment to create the missing Pods on other available nodes. Once the PDB is set, GKE won't shut down Pods in your application if the number of Pods is equal to or less than a configured limit. GKE follows a PDB for up to 60 minutes.
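The arithmetic behind a PDB can be illustrated with a short calculation. This sketch assumes the budget is expressed as an absolute minimum number of available Pods; percentages and the 60-minute limit are not modeled, and the function name is hypothetical.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of Pods that can be evicted right now without breaking the budget."""
    return max(0, healthy_pods - min_available)


# Hypothetical workload: 5 replicas with a budget that keeps at least 4 available.
print(allowed_disruptions(healthy_pods=5, min_available=4))
# After one eviction, no further eviction is allowed until a replacement Pod is healthy.
print(allowed_disruptions(healthy_pods=4, min_available=4))
```

In this model, a drain proceeds one Pod at a time for this workload, and a budget equal to the replica count would block evictions entirely.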

Control node pool upgrades

With GKE, you can choose a node upgrade strategy to determine how the nodes in your node pools are upgraded. By default, node pools use surge upgrades. With surge upgrades, the upgrade process for GKE node pools involves recreating every VM in the node pool. A new VM is created with the new version (upgraded image) in a rolling update fashion. In turn, that requires shutting down all the Pods running on the old node and shifting the Pods to the new node. If your workloads run with sufficient redundancy (replicas), you can rely on Kubernetes to move and restart Pods as needed. However, a temporarily reduced number of replicas can still be disruptive to your business, and might slow down workload performance until Kubernetes is able to meet the desired state again (that is, meet the minimum number of needed replicas). You can avoid such a disruption by configuring surge upgrades so that GKE adds upgraded nodes before it drains old ones.

During an upgrade with surge upgrade enabled, GKE first secures the resources (machines) needed for the upgrade, then creates a new upgraded node, and only then drains the old node, and finally shuts it down. This way, the expected capacity remains intact throughout the upgrade process.

For large clusters where the upgrade process might take longer, you can accelerate the upgrade completion time by concurrently upgrading multiple nodes at a time. Use surge upgrade with maxSurge=20, maxUnavailable=0 to instruct GKE to upgrade 20 nodes at a time, without using any existing capacity.
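The effect of the maxSurge and maxUnavailable settings can be estimated with simple arithmetic. The sketch below assumes that the number of nodes upgraded at the same time is the sum of the two settings and ignores any other limits GKE might apply; the function name and the node count are hypothetical.

```python
import math


def surge_plan(node_count: int, max_surge: int, max_unavailable: int) -> dict[str, int]:
    """Estimate batches and capacity bounds for a surge upgrade of one node pool."""
    parallel = max_surge + max_unavailable  # assumed number of nodes upgraded at once
    if parallel < 1:
        raise ValueError("max_surge + max_unavailable must be at least 1")
    return {
        "batches": math.ceil(node_count / parallel),
        "peak_nodes": node_count + max_surge,  # extra machines needed at the peak
        "min_available_nodes": node_count - max_unavailable,  # capacity floor
    }


# Hypothetical 100-node pool with the settings from the paragraph above.
print(surge_plan(node_count=100, max_surge=20, max_unavailable=0))
```

With maxUnavailable=0, the capacity floor equals the original node count, which is why existing capacity is not used; the trade-off is the additional quota needed for the surge nodes.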

Use the Extended channel when you need long-term support

If you want to keep your cluster on a minor version for longer, follow the best practice of enrolling your cluster in the Extended channel. With this channel, GKE supports a minor version for approximately 24 months. To learn more, see Get long-term support with the Extended channel.

To get the most benefit from the channel, we recommend that you adhere to the following best practices. Some of these best practices require taking some manual action, including manually upgrading a cluster and changing the release channel of a cluster. Review the following supported scenarios, as well as When not to use the Extended channel.

Temporarily stay on a minor version for longer

If you need to temporarily keep a cluster on a minor version for longer than the 14-month standard support period, for example, to mitigate the use of deprecated APIs removed in the next minor version, use the following process. You can temporarily move the cluster from another release channel to the Extended channel to continue to receive security patches while preparing to upgrade to the next minor version. When you're ready to upgrade to the next minor version, you manually upgrade the cluster, then move the cluster back to the original release channel.
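To see how much extra time the Extended channel provides, you can compare the 14-month standard support period with the approximately 24 months of Extended support. The following sketch makes several assumptions: both periods are counted from the same hypothetical start date, the 24-month figure is treated as exact, and the helper names are made up for this example.

```python
from datetime import date

STANDARD_SUPPORT_MONTHS = 14
EXTENDED_SUPPORT_MONTHS = 24  # the source says "approximately 24 months"


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to 28 so the result is always valid."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, min(start.day, 28))


def support_windows(support_start: date) -> dict[str, date]:
    """Approximate end dates of standard and Extended support for one minor version."""
    return {
        "standard_end": add_months(support_start, STANDARD_SUPPORT_MONTHS),
        "extended_end": add_months(support_start, EXTENDED_SUPPORT_MONTHS),
    }


# Hypothetical minor version whose support starts on 1 August 2024.
print(support_windows(date(2024, 8, 1)))
```

The gap between the two dates is the time you have to mitigate deprecated API usage before the cluster must move to the next minor version.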

Minor version upgrades 1-2 times per year

If you want minimal disruption for your cluster while still receiving some new features when your cluster is ready to be upgraded to a new minor version, do the following:

Enroll a cluster in the Extended channel.
Perform two successive minor version upgrades 1-2 times per year. For example, upgrade from 1.30 to 1.31 to 1.32.

This process ensures that the cluster stays on an available minor version and receives features from new minor versions, but only receives minor version upgrades when you decide the cluster is ready.

When not to use the Extended channel

To use the Extended channel for its intended purpose requires manual action. The following scenario illustrates the consequences of using the Extended channel without active management of your cluster's minor version.

Do nothing and receive minor upgrades with same frequency

Suppose you want to keep your cluster on a minor version forever, so you enroll your cluster in the Extended channel and take no further action. All minor versions eventually become unsupported, and GKE automatically upgrades clusters from unsupported minor versions. So, GKE upgrades this cluster from one unsupported minor version to a soon-to-be unsupported minor version, which averages out to approximately every 4 months. This means that the cluster receives minor version upgrades just as frequently as on other release channels, but receives new features later.

Checklist summary

The following table summarizes the tasks that are recommended for an upgrade strategy to keep your GKE clusters seamlessly up-to-date:

Best Practice	Tasks
Set up multiple environments	At minimum, create a production and pre-production environment.
Enroll clusters in release channels	Enroll production clusters in the Stable or Regular channel. Enroll pre-production clusters in the same channels as production. Enroll early development clusters (for example, canary) in the Rapid channel. For clusters where you need to keep running a minor version for longer, enroll clusters in the Extended channel.
Create a continuous upgrade strategy	Proactively receive updates about GKE upgrades and GKE versions. Test and verify new patch and minor versions.
Reduce disruption to existing workloads	Control timing of automatic upgrades by creating a maintenance window. Use maintenance exclusions to prevent automatic maintenance from occurring during high-demand business periods. Set the correct Pod Disruption Budget for your workloads. Use a strategy to control node pool upgrades.

What's next

Watch the Google Cloud Next 2020 video on Ensuring business continuity at times of uncertainty and digital-only business with GKE.
Watch Best practices for GKE upgrade.
Learn more about Release channels.
Learn about versioning and automatic upgrades in GKE.