GKE known issues


This page lists known issues for GKE. It is for admins and architects who manage the lifecycle of the underlying technology infrastructure, and who respond to alerts and pages when service level objectives (SLOs) aren't met or applications fail.

Each entry lists the category, the version(s) in which the issue was identified, the fixed version(s) if available, and the issue description with any workaround.
Category: Networking, Upgrades and updates
Identified version(s): 1.28
Fixed version(s): none yet

Gateway TLS configuration error

We've identified an issue with configuring TLS for Gateways in clusters running GKE version 1.28.4-gke.1083000. This affects TLS configurations using either an SSLCertificate or a CertificateMap. If you're upgrading a cluster with existing Gateways, updates made to the Gateway will fail. For new Gateways, the load balancers won't be provisioned. This issue will be fixed in an upcoming GKE 1.28 patch version.

Category: Upgrades and updates
Identified version(s): 1.27
Fixed version(s): 1.27.8 or later

GPU device plugin issue

Clusters that are running GPUs and are upgraded from 1.26 to a 1.27 patch version earlier than 1.27.8 might experience issues with their nodes' GPU device plugins (nvidia-gpu-device-plugin). Do the following steps depending on the state of your cluster:

  • If your cluster is running version 1.26 and has GPUs, don't manually upgrade your cluster until version 1.27.8 is available in your cluster's release channel.
  • If your cluster is running an earlier 1.27 patch version and the nodes are affected, restart the nodes or manually delete the nvidia-gpu-device-plugin Pod on the nodes (the add-on manager will create a new working plugin).
  • If your cluster is using auto-upgrades, this doesn't affect you as automatic upgrades will only move clusters to patch versions with the fix.
Category: Networking
Identified version(s): 1.26, 1.27, 1.28, 1.29
Fixed version(s):
  • 1.26.13-gke.1052000 and later
  • 1.27.10-gke.1055000 and later
  • 1.28.6-gke.1095000 and later
  • 1.29.1-gke.1016000 and later

Intermittent connection establishment failures

Clusters on control plane versions 1.26.6-gke.1900 and later might encounter intermittent connection establishment failures.

The likelihood of failure is low, and not all clusters are affected. The failures typically stop completely within a few days of symptom onset.

Category: Operation
Identified version(s): 1.27, 1.28
Fixed version(s):
  • 1.27.5-gke.1300 and later
  • 1.28.1-gke.1400 and later

Autoscaling for all workloads stops

HorizontalPodAutoscaler (HPA) and VerticalPodAutoscaler (VPA) might stop autoscaling all workloads in a cluster if it contains misconfigured autoscaling/v2 HPA objects. The issue impacts clusters running earlier patch versions of GKE version 1.27 and 1.28 (for example, 1.27.3-gke.100).

Workaround:

Correct misconfigured autoscaling/v2 HPA objects by making sure the value field in spec.metrics.resource.target matches the target type:

  • When spec.metrics.resource.target.type is Utilization, specify the target in the averageUtilization field.
  • When spec.metrics.resource.target.type is AverageValue, specify the target in the averageValue field.

For more details on how to configure autoscaling/v2 HPA objects, see the HorizontalPodAutoscaler Kubernetes documentation.
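As an illustration, a correctly configured autoscaling/v2 HPA targeting CPU utilization might look like the following sketch (the object and Deployment names are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment   # placeholder name
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization        # type is Utilization...
        averageUtilization: 60   # ...so the value goes in averageUtilization, not averageValue
```

If the target type were AverageValue instead, the value would go in the averageValue field and the averageUtilization field would be omitted.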

Category: Networking
Identified version(s): 1.27, 1.28, 1.29
Fixed version(s):
  • 1.27.11-gke.1118000 or later
  • 1.28.7-gke.1100000 or later
  • 1.29.2-gke.1217000 or later

DNS resolution issues with Container-Optimized OS

Workloads running on GKE clusters with Container-Optimized OS-based nodes might experience DNS resolution issues.

Category: Operation
Identified version(s): 1.28, 1.29
Fixed version(s):
  • 1.28.7-gke.1026000
  • 1.29.2-gke.1060000

Container Threat Detection fails to deploy

Container Threat Detection might fail to deploy on Autopilot clusters running the following GKE versions:

  • 1.28.6-gke.1095000 to 1.28.7-gke.1025000
  • 1.29.1-gke.1016000 to 1.29.1-gke.1781000
Category: Networking
Identified version(s): 1.28
Fixed version(s): 1.28.3-gke.1090000 or later

Network Policy drops a connection due to incorrect connection tracking lookup

For clusters with GKE Dataplane V2 enabled, when a client Pod connects to itself using a Service or the virtual IP address of an internal passthrough Network Load Balancer, the reply packet is not identified as a part of an existing connection due to incorrect conntrack lookup in the dataplane. This means that a Network Policy that restricts ingress traffic for the Pod is incorrectly enforced on the packet.

The impact of this issue depends on the number of backend Pods configured for the Service. For example, if the Service has one backend Pod, the connection always fails. If the Service has two backend Pods, the connection fails 50% of the time.

Workaround:

You can mitigate this issue by configuring the Service's port to the same value as the Pod's containerPort, so that no port translation occurs.
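As a sketch (names, labels, and port number are placeholders), a Service whose port matches the containerPort exposed by the backend Pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-service   # placeholder name
spec:
  selector:
    app: example-app      # placeholder label
  ports:
  - port: 8080            # Service port...
    targetPort: 8080      # ...matches the containerPort the Pod exposes,
                          # so no port translation occurs on the reply path
```

With no port translation, the conntrack lookup for the reply packet matches the existing connection, and the Network Policy is evaluated correctly.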

Category: Networking
Identified version(s): 1.27, 1.28
Fixed version(s):
  • 1.27.11-gke.1097000 or later
  • 1.28.3-gke.1090000 or later

Packet drops for hairpin connection flows

For clusters with GKE Dataplane V2 enabled, when a Pod creates a TCP connection to itself using a Service, such that the Pod is both the source and destination of the connection, GKE Dataplane V2 eBPF connection tracking incorrectly tracks the connection states, leading to leaked conntrack entries.

When a connection tuple (protocol, source/destination IP, and source/destination port) has been leaked, new connections using the same connection tuple might result in return packets being dropped.

Workaround:

Use one of the following workarounds:

  • Enable TCP reuse (keep-alives) for an application running in a Pod that might communicate with itself using a Service. This prevents the TCP FIN flag from being issued and avoids leaking the conntrack entry.
  • When using short-lived connections, expose the Service using a proxy load balancer, such as Gateway. This results in the destination of the connection request being set to the load balancer's IP address, which prevents GKE Dataplane V2 from performing SNAT to the loopback IP address.