This page explains how to upgrade GKE on-prem.
To upgrade GKE on-prem, you upgrade your admin workstation. Then, you upgrade your clusters.
Before you begin
- Be sure to follow these instructions from your local workstation or laptop. Don't follow these instructions from your existing admin workstation.
- Check your clusters' current version.
- Review the Release notes and known issues affecting upgrading.
- Review Versions.
Also, read through the following considerations:
About downtime during upgrades
Resource | Description |
---|---|
Admin cluster | When an admin cluster is down, user cluster control planes and workloads on user clusters continue to run, unless they were affected by a failure that caused the downtime. |
User cluster control plane | Typically, you should expect no noticeable downtime to user cluster control planes. However, long-running connections to the Kubernetes API server might break and would need to be re-established. In those cases, the API caller should retry until it establishes a connection. In the worst case, there can be up to one minute of downtime during an upgrade. |
User cluster nodes | If an upgrade requires a change to user cluster nodes, GKE on-prem recreates the nodes in a rolling fashion, and reschedules Pods running on these nodes. You can prevent impact to your workloads by configuring appropriate PodDisruptionBudgets and anti-affinity rules. |
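For example, the following is a minimal sketch of a PodDisruptionBudget for a hypothetical workload labeled app: my-workload; the name, namespace, label, and minAvailable value are placeholders, not values from this guide, and [USER_CLUSTER_KUBECONFIG] is your user cluster's kubeconfig file. It keeps at least two replicas available while nodes are drained during an upgrade:
# Minimal sketch: apply a PodDisruptionBudget to a user cluster. Adjust the
# selector and minAvailable to match your own workload and replica count.
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] apply -f - <<EOF
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-workload-pdb
  namespace: default
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-workload
EOF
Keep in mind the known issue described later on this page: a budget that allows zero disruptions can prevent nodes from draining during the upgrade.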
Sequential upgrading
GKE on-prem supports sequential upgrading, which means that the cluster you want to upgrade must be at the immediate previous patch version.
You can't upgrade your clusters directly to the latest version from a version that is more than one patch version behind. If your cluster is more than one patch version behind, you must sequentially upgrade that cluster through each patch version.
Example
Suppose you want to upgrade to the 1.1.0 version, and your admin workstation and user clusters are running the older 1.0.1 version. The available versions are:
- 1.0.1 (oldest version)
- 1.0.2
- 1.1.0 (latest version)
In this case, you must sequentially upgrade to the 1.0.2 version and then to the 1.1.0 version by performing the following steps:
- Upgrade your admin workstation from 1.0.1 to 1.0.2.
- Upgrade your clusters from 1.0.1 to 1.0.2.
- Upgrade your admin workstation from 1.0.2 to 1.1.0.
- Upgrade your clusters from 1.0.2 to 1.1.0.
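Before planning the sequence, confirm which version you're starting from. As a rough check, run the following from your current admin workstation, where [ADMIN_CLUSTER_KUBECONFIG] and [USER_CLUSTER_KUBECONFIG] are your clusters' kubeconfig files; compare the reported Kubernetes versions against the release notes for your GKE on-prem release:
# Version of the gkectl tool installed on the admin workstation.
gkectl version

# Kubernetes versions reported by the cluster nodes.
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes -o wide
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] get nodes -o wide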
Back up your GKE on-prem configuration file and kubeconfig files
When you upgrade your admin workstation, Terraform deletes the admin workstation VM and then replaces it with an upgraded admin workstation.
Before you perform the admin workstation upgrade, you must first back up your GKE on-prem configuration file and your clusters' kubeconfig files.
The gkectl create cluster command creates the kubeconfig files for your admin cluster ([ADMIN_CLUSTER_KUBECONFIG]) and user cluster ([USER_CLUSTER_KUBECONFIG]). See an example in the basic installation.
After your admin workstation is upgraded, you copy those same files to the upgraded admin workstation.
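For example, a minimal backup sketch run from your local workstation, assuming the files live in the ubuntu user's home directory on the admin workstation and that [IP_ADDRESS] is the admin workstation's address (the backup directory name is arbitrary):
# Copy the GKE on-prem configuration file and the cluster kubeconfig files
# from the current admin workstation to a local backup directory.
mkdir -p ~/gke-on-prem-backup
scp -i ~/.ssh/vsphere_workstation \
    "ubuntu@[IP_ADDRESS]:[CONFIG_FILE]" \
    "ubuntu@[IP_ADDRESS]:[ADMIN_CLUSTER_KUBECONFIG]" \
    "ubuntu@[IP_ADDRESS]:[USER_CLUSTER_KUBECONFIG]" \
    ~/gke-on-prem-backup/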
Upgrading admin workstation
You use Terraform to upgrade your admin workstation.
The upgraded admin workstation includes the following entities at the same version as the admin workstation's Open Virtualization Appliance (OVA) file:
- gkectl
- full bundle
Downloading the OVA
From Downloads, download the admin workstation OVA file for the version to which you're upgrading.
To download the latest OVA, run the following command:
gcloud storage cp gs://gke-on-prem-release/admin-appliance/1.3.2-gke.1/gke-on-prem-admin-appliance-vsphere-1.3.2-gke.1.{ova,ova.1.sig} ~/
Importing the OVA to vSphere and marking it as a VM template
In the following sections, you:
- Create some variables declaring elements of your vCenter Server and vSphere environment.
- Import the admin workstation OVA to vSphere and mark it as a VM template.
Creating variables for govc
Before you import the admin workstation OVA to vSphere, you need to provide govc with some variables declaring elements of your vCenter Server and vSphere environment:
export GOVC_URL=https://[VCENTER_SERVER_ADDRESS]/sdk
export GOVC_USERNAME=[VCENTER_SERVER_USERNAME]
export GOVC_PASSWORD=[VCENTER_SERVER_PASSWORD]
export GOVC_DATASTORE=[VSPHERE_DATASTORE]
export GOVC_DATACENTER=[VSPHERE_DATACENTER]
export GOVC_INSECURE=true
You can choose to use vSphere's default resource pool or create your own:
# If you want to use a resource pool you've configured yourself, export this variable:
export GOVC_RESOURCE_POOL=[VSPHERE_CLUSTER]/Resources/[VSPHERE_RESOURCE_POOL]

# If you want to use vSphere's default resource pool, export this variable instead:
export GOVC_RESOURCE_POOL=[VSPHERE_CLUSTER]/Resources
where:
- [VCENTER_SERVER_ADDRESS] is your vCenter Server's IP address or hostname.
- [VCENTER_SERVER_USERNAME] is the username of an account that holds the Administrator role or equivalent privileges in vCenter Server.
- [VCENTER_SERVER_PASSWORD] is the vCenter Server account's password.
- [VSPHERE_DATASTORE] is the name of the datastore you've configured in your vSphere environment.
- [VSPHERE_DATACENTER] is the name of the datacenter you've configured in your vSphere environment.
- [VSPHERE_CLUSTER] is the name of the cluster you've configured in your vSphere environment.
- [VSPHERE_RESOURCE_POOL] is the name of the resource pool you've configured in your vSphere environment. This applies only if you're using a non-default resource pool.
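After exporting these variables, you can optionally confirm that govc can reach your vCenter Server before importing the OVA; this sanity check is a suggestion, not a required step:
# Prints basic information about the vCenter Server if the connection and
# credentials are valid.
govc about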
Importing the OVA to vSphere: Standard switch
If you are using a vSphere Standard Switch, import the OVA to vSphere using this command:
govc import.ova -options - ~/gke-on-prem-admin-appliance-vsphere-1.3.2-gke.1.ova <<EOF
{
  "DiskProvisioning": "thin",
  "MarkAsTemplate": true
}
EOF
Importing the OVA to vSphere: Distributed switch
If you are using a vSphere Distributed Switch, import the OVA to vSphere using this command, where [YOUR_DISTRIBUTED_PORT_GROUP_NAME] is the name of your distributed port group:
govc import.ova -options - ~/gke-on-prem-admin-appliance-vsphere-1.3.2-gke.1.ova <<EOF
{
  "DiskProvisioning": "thin",
  "MarkAsTemplate": true,
  "NetworkMapping": [
    {
      "Name": "VM Network",
      "Network": "[YOUR_DISTRIBUTED_PORT_GROUP_NAME]"
    }
  ]
}
EOF
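After either import finishes, you can optionally confirm that the template is visible in your vSphere inventory (a suggested check, not a required step):
# Shows details for the imported admin workstation template.
govc vm.info gke-on-prem-admin-appliance-vsphere-1.3.2-gke.1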
Setting the Terraform template variable for the new admin workstation VM
In your admin workstation's TFVARS file, set vm_template to the version to which you're upgrading. The value of vm_template looks like this, where [VERSION] is the OVA's version:
gke-on-prem-admin-appliance-vsphere-[VERSION]
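For example, if your Terraform variables live in a file named terraform.tfvars (an assumed file name; use the TFVARS file you created during installation), one way to update the value in place is:
# Point vm_template at the new OVA version. Review the file afterward, or edit
# it by hand if you prefer.
sed -i 's/^vm_template.*/vm_template = "gke-on-prem-admin-appliance-vsphere-1.3.2-gke.1"/' terraform.tfvars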
Using Terraform to upgrade your admin workstation
To upgrade your admin workstation, run the following command. This command deletes the current admin workstation VM and replaces it with an upgraded VM:
terraform init && terraform apply -auto-approve -input=false
Connecting to your admin workstation
SSH in to your admin workstation by running the following command:
ssh -i ~/.ssh/vsphere_workstation ubuntu@[IP_ADDRESS]
Copy your backed up configuration and kubeconfig files
Earlier, you backed up your GKE on-prem configuration file and your clusters' kubeconfig files. Now, you should copy those files back to your upgraded admin workstation.
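Continuing the hedged backup sketch from earlier (the backup directory name is an assumption), the restore from your local workstation might look like this:
# Copy the configuration file and kubeconfig files from the local backup
# directory to the upgraded admin workstation's home directory.
scp -i ~/.ssh/vsphere_workstation \
    ~/gke-on-prem-backup/* \
    ubuntu@[IP_ADDRESS]:~/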
Upgrading clusters
After upgrading your admin workstation and connecting to it, perform the following steps:
Verify that enough IP addresses are available
Before you upgrade, be sure that you have enough IP addresses available for your clusters.
DHCP
During an upgrade, GKE on-prem creates one temporary node in the admin cluster and one temporary node in each associated user cluster. Make sure that your DHCP server can provide enough IP addresses for these temporary nodes. For more information, see IP addresses needed for admin and user clusters.
Static IPs
During an upgrade, GKE on-prem creates one temporary node in the admin cluster and one temporary node in each associated user cluster. For your admin cluster and each of your user clusters, verify that you have reserved enough IP addresses. For each cluster, you need to have reserved at least one more IP address than the number of cluster nodes. For more information, see Configuring static IP addresses.
Determine the number of nodes in your admin cluster:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes
where [ADMIN_CLUSTER_KUBECONFIG] is the path of your admin cluster's kubeconfig file.
Next, view the addresses reserved for your admin cluster:
kubectl get cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -o yaml
In the output, in the reservedAddresses field, you can see the number of IP addresses that are reserved for the admin cluster nodes. For example, the following output shows that there are five IP addresses reserved for the admin cluster nodes:
...
reservedAddresses:
- gateway: 21.0.135.254
  hostname: admin-node-1
  ip: 21.0.133.41
  netmask: 21
- gateway: 21.0.135.254
  hostname: admin-node-2
  ip: 21.0.133.50
  netmask: 21
- gateway: 21.0.135.254
  hostname: admin-node-3
  ip: 21.0.133.56
  netmask: 21
- gateway: 21.0.135.254
  hostname: admin-node-4
  ip: 21.0.133.47
  netmask: 21
- gateway: 21.0.135.254
  hostname: admin-node-5
  ip: 21.0.133.44
  netmask: 21
The number of reserved IP addresses should be at least one more than the number of nodes in the admin cluster. If this is not the case, you can reserve an additional address by editing the Cluster object.
Open the Cluster object for editing:
kubectl edit cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG]
Under reservedAddresses, add an additional block that has gateway, hostname, ip, and netmask.
Go through the same procedure for each of your user clusters.
To determine the number of nodes in a user cluster:
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] get nodes
where [USER_CLUSTER_KUBECONFIG] is the path of your user cluster's kubeconfig file.
To view the addresses reserved for a user cluster:
kubectl get cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] [USER_CLUSTER_NAME] -o yaml
where:
- [ADMIN_CLUSTER_KUBECONFIG] is the path of your admin cluster's kubeconfig file.
- [USER_CLUSTER_NAME] is the name of the user cluster.
To edit the Cluster object of a user cluster:
kubectl edit cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] [USER_CLUSTER_NAME]
Modifying the configuration file
On your admin workstation VM, edit the configuration file that you used to create your admin and user clusters. Set the value of bundlepath, where [VERSION] is the GKE on-prem version to which you're upgrading your clusters:
bundlepath: /var/lib/gke/bundles/gke-onprem-vsphere-[VERSION].tgz
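Before editing the file, you can optionally verify that the bundle for the target version is present on the upgraded admin workstation; this is a suggested check against the same directory that bundlepath references:
# The listed file names should include the gke-onprem-vsphere-[VERSION].tgz
# bundle that bundlepath points to.
ls -l /var/lib/gke/bundles/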
About automatically-enabled features
A new GKE on-prem version might include new features or support for specific VMware vSphere features. Sometimes, upgrading to a GKE on-prem version automatically enables such features. You learn about new features in GKE on-prem's Release notes. New features are sometimes surfaced in the GKE on-prem configuration file.
Disabling new features via the configuration file
If you need to disable a new feature that is automatically enabled in a new GKE on-prem version and driven by the configuration file, perform the following steps before you upgrade your cluster:
- From your upgraded admin workstation, create a new configuration file with a different name from your current configuration file:
gkectl create-config --config [CONFIG_NAME]
- Open the new configuration file and note the feature's field. Then close the file.
- Open your current configuration file and add the new feature's field in the appropriate specification. Provide the field a false or equivalent value.
- Save the configuration file, and then proceed with upgrading your clusters.
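One hedged way to spot fields that the new version introduces, assuming [CONFIG_NAME] is the freshly generated file and [CONFIG_FILE] is your current configuration, is to diff the two files; the output also shows your site-specific values, but new field names stand out:
# Compare the generated template with your existing configuration.
diff [CONFIG_NAME] [CONFIG_FILE]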
You should always review the Release notes before you upgrade your clusters. You cannot declaratively change an existing cluster's configuration after you upgrade it.
Running gkectl prepare
The gkectl prepare command performs the following tasks:
- If necessary, copies a new node OS image to your vSphere environment, and marks the OS image as a template.
- Pushes updated Docker images, specified in the new bundle, to your private Docker registry, if you have configured one.
To perform the preceding tasks, run the following command:
gkectl prepare --config [CONFIG_FILE] [FLAGS]
where:
- [CONFIG_FILE] is the GKE on-prem configuration file you're using to perform the upgrade.
- [FLAGS] is an optional set of flags. For example, you could include the --skip-validation-infra flag to skip checking of your vSphere infrastructure.
Upgrading your admin cluster
Run the following command:
gkectl upgrade admin \
    --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    --config [CONFIG_FILE] \
    [FLAGS]
where:
- [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
- [CONFIG_FILE] is the GKE on-prem configuration file you're using to perform the upgrade.
- [FLAGS] is an optional set of flags. For example, you could include the --skip-validation-infra flag to skip checking of your vSphere infrastructure.
Upgrading your user cluster
Before you upgrade a user cluster, your admin cluster must already be upgraded to the desired version or higher.
gkectl
From your admin workstation, run the following command:
gkectl upgrade cluster \
    --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    --config [CONFIG_FILE] \
    --cluster-name [CLUSTER_NAME] \
    [FLAGS]
where:
- [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
- [CLUSTER_NAME] is the name of the user cluster you're upgrading.
- [CONFIG_FILE] is the GKE on-prem configuration file you're using to perform the upgrade.
- [FLAGS] is an optional set of flags. For example, you could include the --skip-validation-infra flag to skip checking of your vSphere infrastructure.
Console
You can choose to register your user clusters with Google Cloud console during installation or after you've created them. You can view and log in to your registered GKE on-prem clusters and your Google Kubernetes Engine clusters from Google Cloud console's GKE menu.
When an upgrade becomes available for GKE on-prem user clusters, a notification appears in Google Cloud console. Clicking this notification displays a list of available versions and a gkectl command you can run to upgrade the cluster:
Visit the GKE menu in Google Cloud console.
Under the Notifications column for the user cluster, click Upgrade available, if available.
Copy the gkectl upgrade cluster command.
From your admin workstation, run the gkectl upgrade cluster command, where [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file, [CLUSTER_NAME] is the name of the user cluster you're upgrading, and [CONFIG_FILE] is the GKE on-prem configuration file you're using to perform the upgrade.
Resuming an upgrade
If a user cluster upgrade is interrupted after the admin cluster is successfully upgraded, you can resume the user cluster upgrade by running the same upgrade command with the --skip-validation-all flag:
gkectl upgrade cluster \
    --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    --config [CONFIG_FILE] \
    --cluster-name [CLUSTER_NAME] \
    --skip-validation-all
About resuming an admin cluster upgrade
You shouldn't interrupt an admin cluster upgrade. Currently, admin cluster upgrades aren't always resumable. If an admin cluster upgrade is interrupted for any reason, you should contact support for assistance.
Known issues
The following known issues affect upgrading clusters.
Version 1.1.0-gke.6, 1.2.0-gke.6: stackdriver.proxyconfigsecretname field removed
The stackdriver.proxyconfigsecretname field was removed in version 1.1.0-gke.6. GKE on-prem's preflight checks will return an error if the field is present in your configuration file.
To work around this, before you upgrade to 1.2.0-gke.6, delete the proxyconfigsecretname field from your configuration file.
Stackdriver references old version
Before version 1.2.0-gke.6, a known issue prevents Stackdriver from updating its configuration after cluster upgrades. Stackdriver still references an old version, which prevents Stackdriver from receiving the latest features of its telemetry pipeline. This issue can make it difficult for Google Support to troubleshoot clusters.
After you upgrade clusters to 1.2.0-gke.6, run the following command against admin and user clusters:
kubectl --kubeconfig=[KUBECONFIG] \
    -n kube-system --type=json patch stackdrivers stackdriver \
    -p '[{"op":"remove","path":"/spec/version"}]'
where [KUBECONFIG] is the path to the cluster's kubeconfig file.
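To confirm that the patch removed the field, you can read it back; an empty result means spec.version is no longer set (this check is a suggestion, not part of the documented procedure):
# Prints nothing if the version field was removed successfully.
kubectl --kubeconfig=[KUBECONFIG] -n kube-system get stackdrivers stackdriver \
    -o jsonpath='{.spec.version}'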
Disruption for workloads with PodDisruptionBudgets
Currently, upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (PDBs).
Version 1.2.0-gke.6: Prometheus and Grafana disabled after upgrading
In user clusters, Prometheus and Grafana get automatically disabled during upgrade. However, the configuration and metrics data are not lost. In admin clusters, Prometheus and Grafana stay enabled.
For instructions, refer to the GKE on-prem release notes.
Version 1.1.2-gke.0: Deleted user cluster nodes aren't removed from vSAN datastore
For instructions, refer to the GKE on-prem release notes.
Version 1.1.1-gke.2: Data disk in vSAN datastore folder can be deleted
If you're using a vSAN datastore, you need to create a folder in which to save the VMDK. A known issue requires that you provide the folder's universally unique identifier (UUID) path, rather than its file path, to vcenter.datadisk. This mismatch can cause upgrades to fail.
For instructions, refer to the GKE on-prem release notes.
Upgrading to version 1.1.0-gke.6 from version 1.0.2-gke.3: OIDC issue
Version 1.0.11, 1.0.1-gke.5, and 1.0.2-gke.3 clusters that have OpenID Connect (OIDC) configured cannot be upgraded to version 1.1.0-gke.6. This issue is fixed in version 1.1.1-gke.2.
If you configured a version 1.0.11, 1.0.1-gke.5, or 1.0.2-gke.3 cluster with OIDC during installation, you are not able to upgrade it. Instead, you should create new clusters.
Upgrading to version 1.0.2-gke.3 from version 1.0.11
Version 1.0.2-gke.3 introduces the following OIDC fields (usercluster.oidc). These fields enable logging in to a cluster from Google Cloud console:
- usercluster.oidc.kubectlredirecturl
- usercluster.oidc.clientsecret
- usercluster.oidc.usehttpproxy
If you want to use OIDC, the clientsecret field is required even if you don't want to log in to a cluster from Google Cloud console. To use OIDC, you might need to provide a placeholder value for clientsecret:
oidc:
  clientsecret: "secret"
Nodes fail to complete their upgrade process
If you have PodDisruptionBudget objects configured that are unable to allow any additional disruptions, node upgrades might fail to upgrade to the control plane version after repeated attempts. To prevent this failure, we recommend that you scale up the Deployment or HorizontalPodAutoscaler to allow the node to drain while still respecting the PodDisruptionBudget configuration.
To see all PodDisruptionBudget objects that do not allow any disruptions:
kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'
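If that query returns a budget that covers one of your workloads, one hedged remedy (the namespace, Deployment name, and replica count below are placeholders) is to temporarily add replicas so the budget can tolerate a disruption while the node drains:
# Temporarily increase the replica count; scale back down after the upgrade.
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] -n [NAMESPACE] \
    scale deployment [DEPLOYMENT_NAME] --replicas=[NEW_REPLICA_COUNT]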
Appendix
About VMware DRS rules enabled in version 1.1.0-gke.6
As of version 1.1.0-gke.6, GKE on-prem automatically creates VMware Distributed Resource Scheduler (DRS) anti-affinity rules for your user cluster's nodes, causing them to be spread across at least three physical hosts in your datacenter. This feature is automatically enabled for both new and existing clusters.
Before you upgrade, be sure that your vSphere environment meets the following conditions:
- VMware DRS is enabled. VMware DRS requires vSphere Enterprise Plus license edition. To learn how to enable DRS, see Enabling VMware DRS in a cluster
- The vSphere user account provided in the vcenter field has the Host.Inventory.EditCluster permission.
- There are at least three physical hosts available.
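To get a rough idea of whether the last condition is met, you can list the hosts in your vSphere cluster with govc, assuming the GOVC_* variables exported earlier on this page are still set (a suggested check, not a required step):
# Lists the objects under the vSphere cluster, including its physical hosts.
govc ls /[VSPHERE_DATACENTER]/host/[VSPHERE_CLUSTER]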
Disabling VMware DRS before upgrading to 1.1.0-gke.6
If you do not want to enable this feature for your existing user clusters—for example, if you don't have enough hosts to accommodate the feature—perform the following steps before you upgrade your user clusters:
- Open your existing GKE on-prem configuration file.
- Under the usercluster specification, add the antiaffinitygroups field:
  usercluster:
    ...
    antiaffinitygroups:
      enabled: false
- Save the file.
- Use the configuration file to upgrade. Your clusters are upgraded, but the feature is not enabled.
Alternate upgrade scenario
If security patches, such as fixes for Common Vulnerabilities and Exposures (CVEs), don't exist for your version of the admin workstation, you can use an alternative method to upgrade GKE on-prem. The alternative method upgrades only gkectl and your user clusters; you don't upgrade your admin workstation in this scenario. Therefore, you must only use this alternate method after all security updates are applied to your existing admin workstation.
- Update your Google Cloud CLI components by running gcloud components update.
- Download and install the latest gkectl tool (a hedged example follows this list).
- Download the bundle.
- Upgrade your user clusters.
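As a rough sketch of the first two download steps, the Cloud Storage paths below follow the same pattern as the admin workstation OVA path earlier on this page, but they are assumptions; use the Downloads page for the exact locations and version numbers:
# Assumed paths: confirm the exact bucket layout and version on the Downloads page.
gcloud storage cp gs://gke-on-prem-release/gkectl/1.3.2-gke.1/gkectl ~/
chmod +x ~/gkectl

gcloud storage cp gs://gke-on-prem-release/gke-onprem-bundle/1.3.2-gke.1/gke-onprem-vsphere-1.3.2-gke.1.tgz ~/
# Move the bundle to the directory referenced by bundlepath; this may require sudo.
sudo mv ~/gke-onprem-vsphere-1.3.2-gke.1.tgz /var/lib/gke/bundles/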
Troubleshooting
For more information, refer to Troubleshooting.
New nodes created but not healthy
- Symptoms
New nodes don't register themselves to the user cluster control plane when using manual load balancing mode.
- Possible causes
In-node ingress validation might be enabled, which blocks the nodes' boot process.
- Resolution
To disable the validation, run:
kubectl patch machinedeployment [MACHINE_DEPLOYMENT_NAME] \
    -p '{"spec":{"template":{"spec":{"providerSpec":{"value":{"machineVariables":{"net_validation_ports": null}}}}}}}' \
    --type=merge
Diagnosing cluster issues using gkectl
Use gkectl diagnose commands to identify cluster issues and share cluster information with Google. See Diagnosing cluster issues.
Running gkectl commands verbosely
To run gkectl commands with verbose logging, pass the -v5 flag.
Logging gkectl errors to stderr
To log gkectl errors to stderr, pass the --alsologtostderr flag.
Locating gkectl logs in the admin workstation
Even if you don't pass in its debugging flags, you can view gkectl logs in the following admin workstation directory:
/home/ubuntu/.config/gke-on-prem/logs
Locating Cluster API logs in the admin cluster
If a VM fails to start after the admin control plane has started, you can try debugging this by inspecting the Cluster API controllers' logs in the admin cluster:
Find the name of the Cluster API controllers Pod in the kube-system namespace, where [ADMIN_CLUSTER_KUBECONFIG] is the path to the admin cluster's kubeconfig file:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system get pods | grep clusterapi-controllers
Open the Pod's logs, where [POD_NAME] is the name of the Pod. Optionally, use grep or a similar tool to search for errors:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs [POD_NAME] vsphere-controller-manager