This page lists known issues for GKE. This information is for Admins and architects who manage the lifecycle of the underlying technology infrastructure, and who respond to alerts and pages when service level objectives (SLOs) aren't met or applications fail.
Category | Identified version(s) | Fixed version(s) | Issue and workaround |
---|---|---|---|
Operation | 1.30, 1.31 | |
Increased Pod eviction rates on GKE versions 1.30 and 1.31
Some versions of GKE 1.30 and GKE 1.31 that use COS 113 and COS 117, respectively, have kernels that were built with the Multi-Gen LRU option enabled. This kernel option changes how the kubelet calculates available memory, which can increase the Pod eviction rate. You might not always see an unusual Pod eviction rate because this issue depends on the workload's memory usage pattern. There is a higher risk of the kubelet evicting Pods for workloads that haven't set a memory limit in the resources field, because these workloads might request more memory than what the kubelet reports as available. If you see higher memory usage from an application after upgrading to the mentioned GKE versions without any other changes, then you might be affected by this kernel option.
To check if there are unusual Pod eviction rates, analyze the following metrics with Metrics Explorer:
You can use the following PromQL queries. Replace the REPLACE_cluster_name, REPLACE_namespace, REPLACE_controller_type, and REPLACE_controller_name values with your cluster name, namespace, and the top-level controller type and name of the workload:
max by (pod_name)(max_over_time(kubernetes_io:container_memory_used_bytes{monitored_resource="k8s_container",memory_type="non-evictable",cluster_name="REPLACE_cluster_name",namespace_name="REPLACE_namespace",metadata_system_top_level_controller_type="REPLACE_controller_type",metadata_system_top_level_controller_name="REPLACE_controller_name"}[${__interval}]))
sum by (pod_name)(avg_over_time(kubernetes_io:container_memory_request_bytes{monitored_resource="k8s_container",cluster_name="REPLACE_cluster_name",namespace_name="REPLACE_namespace",metadata_system_top_level_controller_type="REPLACE_controller_type",metadata_system_top_level_controller_name="REPLACE_controller_name"}[${__interval}]))
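For example, with illustrative values for a Deployment named frontend in the default namespace of a cluster named example-cluster (all of these values are placeholders), the first query looks like the following:

```
max by (pod_name)(max_over_time(kubernetes_io:container_memory_used_bytes{monitored_resource="k8s_container",memory_type="non-evictable",cluster_name="example-cluster",namespace_name="default",metadata_system_top_level_controller_type="Deployment",metadata_system_top_level_controller_name="frontend"}[${__interval}]))
```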
If you see unusual spikes in memory usage that go above the requested memory, the workload might be getting evicted more often.

Workaround: If you can't upgrade to the fixed versions, and if you're running in a GKE environment where you can deploy privileged Pods, you can disable the Multi-Gen LRU option by using a DaemonSet.
After the DaemonSet is running in all the selected node pools, the change is effective immediately and the kubelet memory usage calculation is back to normal. |
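The following manifest is a minimal sketch of such a DaemonSet, not an official workaround manifest. It assumes that the affected nodes expose the Multi-Gen LRU switch at /sys/kernel/mm/lru_gen/enabled, that writing n to that file disables the feature, and that privileged Pods are allowed. The names, the busybox image, and the cloud.google.com/gke-nodepool selector value are placeholders to adapt to your environment:

```
# Apply a privileged DaemonSet that disables Multi-Gen LRU on the selected node pool.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disable-mglru
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: disable-mglru
  template:
    metadata:
      labels:
        app: disable-mglru
    spec:
      nodeSelector:
        # Placeholder: target only the affected node pools.
        cloud.google.com/gke-nodepool: AFFECTED_NODE_POOL
      containers:
      - name: disable-mglru
        image: busybox:stable
        securityContext:
          privileged: true
        command:
        - /bin/sh
        - -c
        # Assumption: writing "n" to this sysfs file turns off Multi-Gen LRU.
        # The loop keeps the Pod running so the DaemonSet stays healthy.
        - |
          echo n > /sys/kernel/mm/lru_gen/enabled
          while true; do sleep 3600; done
EOF
```

This sketch omits tolerations and resource requests; add them as needed for your node pools.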
Operation | 1.28, 1.29, 1.30, 1.31 | |
Pods stuck in Terminating status
A bug in the container runtime (containerd) might cause Pods and containers to be stuck in Terminating status with errors similar to the following:
OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
If you are impacted by this issue, you can upgrade your nodes to a GKE version with a fixed version of containerd. |
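One generic way to spot Pods affected by this issue is to list Pods that report a Terminating status across all namespaces:

```
kubectl get pods --all-namespaces | grep Terminating
```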
Operation | 1.28, 1.29 | |
Image streaming fails because of symbolic links
A bug in the Image streaming feature might cause containers to fail to start. Containers running on a node with Image streaming enabled on specific GKE versions might fail to be created with the following error:
"CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd container: failed to mount [PATH]: too many levels of symbolic links"
If you are impacted by this issue, you can check for empty layers or duplicate layers. If you can't remove the empty layers or duplicate layers, then disable Image streaming. |
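If you decide to disable Image streaming and it was enabled at the cluster level, the update can look similar to the following sketch; check the gcloud reference for the exact flag before running it:

```
gcloud container clusters update CLUSTER_NAME \
    --no-enable-image-streaming
```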
Operation | 1.27, 1.28, 1.29 | |
Image streaming fails because of missing files
A bug in the Image streaming feature might cause containers to fail because of one or more missing files. Containers running on a node with Image streaming enabled on the identified versions might fail to start, or might run with errors indicating that certain files don't exist.
If you are impacted by this issue, you can disable Image streaming. |
Networking, Upgrades and updates | 1.28 | |
Gateway TLS configuration error
We've identified an issue with configuring TLS for Gateways in clusters running GKE version 1.28.4-gke.1083000. This affects TLS configurations that use either an SSLCertificate or a CertificateMap. If you're upgrading a cluster with existing Gateways, updates made to the Gateway will fail. For new Gateways, the load balancers won't be provisioned. This issue will be fixed in an upcoming GKE 1.28 patch version. |
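One way to see whether an existing Gateway is affected is to review its status conditions, assuming the GKE Gateway controller is in use so the Gateway API resources are installed:

```
kubectl describe gateways.gateway.networking.k8s.io GATEWAY_NAME -n NAMESPACE
```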
Upgrades and updates | 1.27 | 1.27.8 or later |
GPU device plugin issue
Clusters that are running GPUs and are upgraded from 1.26 to a 1.27 patch version earlier than 1.27.8 might experience issues with their nodes' GPU device plugins. To resolve the issue, upgrade the cluster to version 1.27.8 or later. |
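To check the GPU device plugin Pods on the affected nodes, you can inspect them in the kube-system namespace. The label selector shown here is an assumption about how GKE labels these Pods; adjust it if your cluster uses a different label:

```
kubectl get pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin -o wide
```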
Operation | 1.27, 1.28 | |
Autoscaling for all workloads stops
HorizontalPodAutoscaler (HPA) and VerticalPodAutoscaler (VPA) might stop autoscaling all workloads in a cluster if the cluster contains misconfigured autoscaling objects.
Workaround:
Correct the misconfigured autoscaling objects. For more details on how to configure these objects, see the GKE documentation. |
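A generic starting point for finding misconfigured objects is to list the autoscaler objects in the cluster and review their status conditions and events. The vpa resource name below assumes that vertical Pod autoscaling is enabled in the cluster:

```
kubectl get hpa --all-namespaces
kubectl get vpa --all-namespaces
kubectl describe hpa HPA_NAME -n NAMESPACE
```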
Operation | 1.28, 1.29 | |
Container Threat Detection fails to deploy
Container Threat Detection might fail to deploy on Autopilot clusters running the identified GKE versions. |
Networking, Upgrades and updates | 1.27, 1.28, 1.29, 1.30 | |
Connectivity issues for |
Operation | 1.29, 1.30, 1.31 | |
Incompatible Ray Operator and Cloud KMS database encryption
Some Ray Operator versions are incompatible with Cloud KMS database encryption.
Workaround: Upgrade the cluster control plane to a fixed version. |
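As a sketch of the workaround, a control plane upgrade can look like the following; replace the placeholders with your cluster name, location, and a fixed GKE version, and check the gcloud reference for the exact flags your gcloud version supports:

```
gcloud container clusters upgrade CLUSTER_NAME \
    --master \
    --cluster-version=VERSION \
    --location=LOCATION
```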