GKE known issues


This page lists known issues for GKE. It is intended for admins and architects who manage the lifecycle of the underlying technology infrastructure and who respond to alerts and pages when service level objectives (SLOs) aren't met or applications fail.


Each entry in the following list includes the issue category, the identified version(s), the fixed version(s), and a description of the issue with any workaround.
Operation 1.30.0 to 1.30.5-gke.1443001, 1.31.0 to 1.31.1-gke.1678000
  • 1.30.5-gke.1628000 and later
  • 1.31.1-gke.1846000 and later

Increased Pod eviction rates on GKE versions 1.30 and 1.31

Some versions of GKE 1.30 and GKE 1.31 that use COS 113 and COS 117, respectively, have kernels that were built with the option CONFIG_LRU_GEN_ENABLED=y. This option enables the kernel feature Multi-Gen LRU, which causes the kubelet to miscalculate memory usage and might lead to the kubelet evicting Pods.

The config option CONFIG_LRU_GEN_ENABLED is disabled in cos-113-18244-151-96 and cos-117-18613-0-76.

You might not always see an unusual Pod eviction rate because this issue depends on the workload's memory usage pattern. There is a higher risk of the kubelet evicting Pods for workloads that haven't set a memory limit in the resources field. This is because the workloads might request more memory than what the kubelet reports as available.

If you see higher memory usage for an application after upgrading to one of the affected GKE versions, without any other changes, then you might be affected by this kernel option.

To check if there are unusual Pod eviction rates, analyze the following metrics with Metrics Explorer:

  • kubernetes.io/container_memory_used_bytes
  • kubernetes.io/container_memory_request_bytes

You can use the following PromQL queries. Replace the values for cluster_name, namespace_name, metadata_system_top_level_controller_type, and metadata_system_top_level_controller_name with the cluster, namespace, and the controller type and name of the workload that you want to analyze:

max by (pod_name)(max_over_time(kubernetes_io:container_memory_used_bytes{monitored_resource="k8s_container",memory_type="non-evictable",cluster_name="REPLACE_cluster_name",namespace_name="REPLACE_namespace",metadata_system_top_level_controller_type="REPLACE_controller_type",metadata_system_top_level_controller_name="REPLACE_controller_name"}[${__interval}]))

sum by (pod_name)(avg_over_time(kubernetes_io:container_memory_request_bytes{monitored_resource="k8s_container",cluster_name="REPLACE_cluster_name",namespace_name="REPLACE_namespace",metadata_system_top_level_controller_type="REPLACE_controller_type",metadata_system_top_level_controller_name="REPLACE_controller_name"}[${__interval}]))

If you see unusual spikes in the memory usage that go above the requested memory, the workload might be getting evicted more often.

Workaround

If you can't upgrade to the fixed versions and if you're running in a GKE environment where you can deploy privileged Pods, you can disable the Multi-Gen LRU option by using a DaemonSet.

  1. Add a label to the GKE node pools where you want to run the DaemonSet, for example, disable-mglru: "true".
  2. Update the nodeSelector parameter in the DaemonSet manifest to match the label that you applied in the preceding step. For an example, see the disable-mglru.yaml file in the GoogleCloudPlatform/k8s-node-tools repository.
  3. Deploy the DaemonSet to your cluster.

After the DaemonSet is running on all the selected node pools, the change takes effect immediately and the kubelet's memory usage calculation returns to normal.
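The following manifest is a minimal sketch of such a DaemonSet, not the authoritative version from the repository. It assumes the node label disable-mglru: "true" from the preceding steps and that Multi-Gen LRU can be turned off by writing n to /sys/kernel/mm/lru_gen/enabled from a privileged container (sysfs kernel settings are global, so the write applies to the node):

# Sketch of a DaemonSet that disables Multi-Gen LRU on labeled nodes.
# See disable-mglru.yaml in GoogleCloudPlatform/k8s-node-tools for the authoritative manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disable-mglru
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: disable-mglru
  template:
    metadata:
      labels:
        app: disable-mglru
    spec:
      nodeSelector:
        disable-mglru: "true"    # must match the node pool label from step 1
      containers:
      - name: disable-mglru
        image: busybox           # any minimal image with a shell works
        securityContext:
          privileged: true       # required to write to /sys on the node
        command:
        - /bin/sh
        - -c
        # Disable Multi-Gen LRU, then sleep so the DaemonSet Pod stays Running.
        - "echo n > /sys/kernel/mm/lru_gen/enabled && while true; do sleep 3600; done"

You can then deploy the manifest with kubectl apply -f disable-mglru.yaml and verify that one Pod is running on each labeled node.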

Operation 1.28, 1.29, 1.30, 1.31
  • 1.28.14-gke.1175000 and later
  • 1.29.9-gke.1341000 and later
  • 1.30.5-gke.1355000 and later
  • 1.31.1-gke.1621000 and later

Pods stuck in Terminating status

A bug in the container runtime (containerd) might cause Pods and containers to be stuck in Terminating status with errors similar to the following:

OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown

If you are impacted by this issue, upgrade your nodes to a GKE version that includes a fixed version of containerd.
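To confirm which containerd version your nodes are running, you can list the nodes with wide output, which includes a CONTAINER-RUNTIME column:

# The CONTAINER-RUNTIME column shows the container runtime and version for each node,
# for example, containerd://1.7.x.
kubectl get nodes -o wide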

Operation 1.28, 1.29
  • 1.28.9-gke.1103000 and later
  • 1.29.4-gke.1202000 and later
  • 1.30: All versions

Containers fail to start because of an Image streaming bug

Containers running on a node with Image streaming enabled on specific GKE versions might fail to be created with the following error:

"CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd container: failed to mount [PATH]: too many levels of symbolic links"

If you are impacted by this issue, check for empty layers or duplicate layers. If you can't remove the empty or duplicate layers, disable Image streaming.
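As a sketch of that last step, Image streaming can be turned off at the cluster level with the gcloud CLI. The cluster name is a placeholder, and --no-enable-image-streaming is assumed here as the standard negative form of the documented --enable-image-streaming flag:

# Hypothetical example: disable Image streaming for the cluster.
gcloud container clusters update CLUSTER_NAME \
    --no-enable-image-streaming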

Operation 1.27, 1.28, 1.29
  • 1.28.9-gke.1103000 and later
  • 1.29.4-gke.1202000 and later
  • 1.30: All versions

Image streaming fails because of missing files

A bug in the Image streaming feature might cause containers to fail because of a missing file or files.

Containers running on a node with Image streaming enabled on the affected GKE versions might fail to start or run with errors indicating that certain files don't exist. The following are examples of such errors:

  • No such file or directory
  • Executable file not found in $PATH

If you are impacted by this issue, you can disable Image streaming.

Networking, Upgrades and updates 1.28

Gateway TLS configuration error

We've identified an issue with configuring TLS for Gateways in clusters running GKE version 1.28.4-gke.1083000. This affects TLS configurations using either an SSLCertificate or a CertificateMap. If you're upgrading a cluster with existing Gateways, updates made to the Gateway will fail. For new Gateways, the load balancers won't be provisioned. This issue will be fixed in an upcoming GKE 1.28 patch version.

Upgrades and updates 1.27
  • 1.27.8 or later

GPU device plugin issue

Clusters that run GPUs and are upgraded from 1.26 to a 1.27 patch version earlier than 1.27.8 might experience issues with their nodes' GPU device plugins (nvidia-gpu-device-plugin). Take the following actions depending on the state of your cluster:

  • If your cluster is running version 1.26 and has GPUs, don't manually upgrade your cluster until version 1.27.8 is available in your cluster's release channel.
  • If your cluster is running an earlier 1.27 patch version and the nodes are affected, restart the nodes or manually delete the nvidia-gpu-device-plugin Pod on the affected nodes; the add-on manager creates a new, working plugin Pod. See the example command after this list.
  • If your cluster uses auto-upgrade, this issue doesn't affect you, because automatic upgrades only move clusters to patch versions that include the fix.
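The following commands are a hedged sketch of deleting the plugin Pod on an affected node. They assume that the GKE-managed plugin runs in the kube-system namespace with the k8s-app=nvidia-gpu-device-plugin label; verify the namespace and label in your cluster first. NODE_NAME is a placeholder.

# Confirm the namespace and label of the GPU device plugin Pods (assumed values).
kubectl get pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin -o wide

# Delete the plugin Pod on the affected node; the add-on manager recreates it.
kubectl delete pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin \
    --field-selector spec.nodeName=NODE_NAME
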
Operation 1.27, 1.28
  • 1.27.5-gke.1300 and later
  • 1.28.1-gke.1400 and later

Autoscaling for all workloads stops

HorizontalPodAutoscaler (HPA) and VerticalPodAutoscaler (VPA) might stop autoscaling all workloads in a cluster if the cluster contains misconfigured autoscaling/v2 HPA objects. The issue impacts clusters running earlier patch versions of GKE 1.27 and 1.28 (for example, 1.27.3-gke.100).

Workaround:

Correct misconfigured autoscaling/v2 HPA objects by making sure that the fields in spec.metrics.resource.target are consistent with each other, for example:

  • When spec.metrics.resource.target.type is Utilization, the target value must be set in the averageUtilization field.
  • When spec.metrics.resource.target.type is AverageValue, the target value must be set in the averageValue field.

For more details on how to configure autoscaling/v2 HPA objects, see the HorizontalPodAutoscaler Kubernetes documentation.
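The following manifest is a minimal sketch of a correctly configured autoscaling/v2 HPA under these rules; the HPA name, the target Deployment, and the 60% CPU utilization target are placeholder values:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa             # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment   # placeholder workload
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization        # Utilization pairs with averageUtilization
        averageUtilization: 60   # not averageValue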

Operation 1.28, 1.29
  • 1.28.7-gke.1026000
  • 1.29.2-gke.1060000

Container Threat Detection fails to deploy

Container Threat Detection might fail to deploy on Autopilot clusters running the following GKE versions:

  • 1.28.6-gke.1095000 to 1.28.7-gke.1025000
  • 1.29.1-gke.1016000 to 1.29.1-gke.1781000
Networking, Upgrades 1.27, 1.28, 1.29, 1.30
  • 1.30.4-gke.1282000 or later
  • 1.29.8-gke.1157000 or later
  • 1.28.13-gke.1078000 or later
  • 1.27.16-gke.1342000 or later

Connectivity issues for hostPort Pods after control plane upgrade

Clusters with network policy enabled might experience connectivity issues with hostPort Pods. Additionally, newly created Pods might take an additional 30 to 60 seconds to be ready.

The issue is triggered when the GKE control plane of a cluster is upgraded to one of the following GKE versions:

  • 1.30 to 1.30.4-gke.1281999
  • 1.29.1-gke.1545000 to 1.29.8-gke.1156999
  • 1.28.7-gke.1042000 to 1.28.13-gke.1077999
  • 1.27.12-gke.1107000 to 1.27.16-gke.1341999

Workaround:

Upgrade or recreate nodes immediately after the GKE control plane upgrade.
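As a hedged sketch, you can upgrade a node pool with the gcloud CLI; by default, the command upgrades nodes to the control plane's current version. The cluster and node pool names are placeholders; add your usual --zone or --region flag if it isn't set in your gcloud configuration.

# Upgrade the node pool's nodes to the cluster's current control plane version.
gcloud container clusters upgrade CLUSTER_NAME \
    --node-pool=NODE_POOL_NAME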

Operation 1.29, 1.30, 1.31
  • 1.29.10-gke.1071000 or later
  • 1.30.5-gke.1723000 or later
  • 1.31.2-gke.1115000 or later

Incompatible Ray Operator and Cloud KMS database encryption

Some Ray Operator versions are incompatible with Cloud KMS database encryption.

Workaround:

Upgrade the cluster control plane to one of the fixed versions or later.
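As a hedged sketch with the gcloud CLI, where the cluster name and version are placeholders (the version must be one of the fixed versions listed for this issue, or later), and --zone or --region may be needed depending on your configuration:

# Upgrade only the control plane (not the nodes) to the specified version.
gcloud container clusters upgrade CLUSTER_NAME \
    --master \
    --cluster-version=VERSION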