Plan for large GKE clusters

This page describes best practices for planning and designing very large clusters.

Why plan for large GKE clusters

Every computer system, including Kubernetes, has architectural limits. Exceeding these limits may affect the performance of your cluster or, in some cases, even cause downtime. Follow the best practices and take the recommended actions to ensure that your clusters run your workloads reliably at scale.

Best practices for splitting workloads between multiple clusters

You can run your workloads on a single, large cluster. This approach is easier to manage, more cost efficient, and provides better resource utilization than multiple clusters. However, in some cases you need to consider splitting your workload into multiple clusters:

  • Review Multi-cluster use cases to learn more about general requirements and scenarios for using multiple clusters.
  • In addition, from a scalability point of view, split your cluster when it could exceed one of the limits described in the section below or one of the GKE quotas. Lowering the risk of reaching the GKE limits reduces the risk of downtime and other reliability issues.

If you decide to split your cluster, use Fleet management to simplify management of a multi-cluster fleet.

Limits and best practices

To ensure that your architecture supports large-scale GKE clusters, review the following limits and related best practices. Exceeding these limits may cause degradation of cluster performance or reliability issues.

These best practices apply to any default Kubernetes cluster with no extensions installed. Extending Kubernetes clusters with webhooks or custom resource definitions (CRDs) is common but can constrain your ability to scale the cluster.

The following table extends the main GKE quotas and limits. You should also familiarize yourself with the open-source Kubernetes limits for large-scale clusters.

The GKE version requirements mentioned in the table apply to both the nodes and the control plane.

GKE limit Description Best practices
The etcd database size

The maximum size of the etcd database is 6 GB. If you are running a very large cluster with tens of thousands of resources, your etcd instances might exceed this limit. You can check the utilization level for your clusters in the Google Cloud console. If you cross this limit, GKE might mark your etcd instances as unhealthy. This causes the cluster's control plane to become unresponsive.

If you cross this limit, contact Google Cloud support.

Total size of etcd objects per type

The total size of all objects of a given resource type should not exceed 800 MB. For example, you can create 750 MB of Pod instances and 750 MB of Secrets, but you cannot create 850 MB of Secrets. If you create more than 800 MB of objects of one type, Kubernetes or custom controllers might fail to initialize, which causes disruptions.

Keep the total size of all objects of each type stored in etcd below 800 MB. This is especially applicable to clusters using many large-sized Secrets or ConfigMaps, or a high volume of CRDs.
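
As a rough illustration of the per-type rule, the following Python sketch checks per-type totals against the 800 MB limit. The function name, the input dictionary, and the use of plain megabyte figures are assumptions made for illustration; how you measure the stored sizes is outside the scope of this sketch.

```python
# Hypothetical helper: flags resource types whose total stored size exceeds
# the per-type limit. Sizes are assumed to be given in megabytes.
PER_TYPE_LIMIT_MB = 800


def types_over_limit(total_mb_by_type, limit_mb=PER_TYPE_LIMIT_MB):
    """Return the sorted resource types whose total size exceeds limit_mb."""
    return sorted(t for t, mb in total_mb_by_type.items() if mb > limit_mb)


# Mirrors the example above: 750 MB of Pods and 750 MB of Secrets is fine,
# because the limit applies to each type separately.
print(types_over_limit({"pods": 750, "secrets": 750}))  # []
# 850 MB of Secrets exceeds the limit for that type.
print(types_over_limit({"pods": 750, "secrets": 850}))  # ['secrets']
```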

Number of Services for clusters where GKE Dataplane V2 is not enabled

The performance of iptables used by kube-proxy degrades if any of the following occurs:
  • There are too many Services.
  • The number of backends behind a Service is high.

This limit is eliminated when GKE Dataplane V2 is enabled.

Keep the number of Services in the cluster below 10,000.

To learn more, see Exposing applications using services.

Number of Services per namespace

The number of environment variables generated for Services might outgrow shell limits. This might cause Pods to crash on startup.

Keep the number of Services per namespace below 5,000.

You can opt out of having those environment variables populated. See the documentation for how to set enableServiceLinks in PodSpec to false.

To learn more, see Exposing applications using Services.
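
As a minimal sketch of the opt-out described above, the following Python snippet builds a Pod manifest with enableServiceLinks set to false and prints it as JSON. The Pod name, container name, and image are placeholders chosen for illustration, not values from this page.

```python
import json

# Minimal Pod manifest with Service environment variables disabled.
# "example-pod", "app", and the image are hypothetical placeholders.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-pod"},
    "spec": {
        # Stops Kubernetes from injecting per-Service environment variables.
        "enableServiceLinks": False,
        "containers": [{"name": "app", "image": "example.com/app:latest"}],
    },
}

print(json.dumps(pod_manifest, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, you can save the printed output to a file and apply it with kubectl apply -f.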

Number of Pods behind a single Service for clusters where GKE Dataplane V2 is not enabled

Every node runs a kube-proxy that uses watches for monitoring any Service change. The larger a cluster, the more change-related data the agent processes. This is especially visible in clusters with more than 500 nodes.

Information about the endpoints is split between separate EndpointSlices. This split reduces the amount of data transferred on each change.

Endpoints objects are still available for components, but any Endpoints object with more than 1,000 Pods is automatically truncated.

Keep the number of Pods behind a single Service lower than 10,000.

To learn more, see Exposing applications using Services.

Number of Pods behind a single Service for clusters where GKE Dataplane V2 is enabled

GKE Dataplane V2 contains limits on the number of Pods exposed by a single Service.

The same limit is applicable to Autopilot clusters as they use GKE Dataplane V2.

In GKE 1.23 and earlier, keep the number of Pods behind a single Service lower than 1,000.

In GKE 1.24 and later, keep the number of Pods behind a single Service lower than 10,000.

To learn more, see Exposing applications using services.

DNS records per headless Service

The number of DNS records per Headless Service is limited for both kube-dns and Cloud DNS.

Keep the number of DNS records per headless Service below 1,000 for kube-dns and 3,500/2,000 (IPv4/IPv6) for Cloud DNS.

Number of all Service endpoints

The number of endpoints across all Services may hit limits. This may increase programming latency or result in an inability to program new endpoints at all.

Keep the total number of endpoints across all Services below 64,000.

GKE Dataplane V2, which is the default dataplane for GKE, relies on eBPF maps that are currently limited to 64,000 endpoints across all Services.

Number of Horizontal Pod Autoscaler objects per cluster

Each Horizontal Pod Autoscaler (HPA) is processed every 15 seconds.

More than 300 HPA objects can cause linear degradation of performance.

Keep the number of HPA objects within this limit; otherwise, the frequency of HPA processing might degrade linearly. For example, in GKE 1.22 with 2,000 HPAs, a single HPA is reprocessed every 1 minute and 40 seconds.
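
The figures above are consistent with a simple linear model: a 15-second interval up to 300 HPA objects, scaled proportionally beyond that. The following Python sketch reproduces the 2,000-HPA example under that assumption. The formula and the function name are inferred from the numbers on this page for illustration; they are not a documented GKE formula.

```python
BASE_INTERVAL_S = 15   # each HPA is processed every 15 seconds
HPA_THRESHOLD = 300    # above this count, processing frequency degrades


def estimated_hpa_interval_s(hpa_count):
    """Estimate seconds between two evaluations of the same HPA.

    Assumes the interval grows linearly with the HPA count above the
    threshold (an assumption inferred from the example, not a GKE API).
    """
    return max(float(BASE_INTERVAL_S), BASE_INTERVAL_S * hpa_count / HPA_THRESHOLD)


print(estimated_hpa_interval_s(300))   # 15.0
print(estimated_hpa_interval_s(2000))  # 100.0, that is 1 minute and 40 seconds
```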

To learn more, see autoscaling based on resources utilization and horizontal pod autoscaling scalability.

Number of Pods per node

GKE has a hard limit of 256 Pods per node. This assumes an average of two or fewer containers per Pod. If you increase the number of containers per Pod, this limit might be lower because GKE allocates more resources per container.

We recommend that you use worker nodes with at least one vCPU for every 10 Pods.
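
The vCPU recommendation can be turned into a quick sizing check. The following Python sketch computes the minimum vCPU count for a planned Pod density; the function name is hypothetical, and rounding up to a whole vCPU is an assumption about how to read the recommendation.

```python
import math

MAX_PODS_PER_NODE = 256  # hard limit stated above
PODS_PER_VCPU = 10       # recommendation: at least one vCPU per 10 Pods


def min_vcpus_for_pods(pods_per_node):
    """Return the smallest whole vCPU count that satisfies the recommendation."""
    if not 0 < pods_per_node <= MAX_PODS_PER_NODE:
        raise ValueError("pods_per_node must be between 1 and 256")
    return math.ceil(pods_per_node / PODS_PER_VCPU)


print(min_vcpus_for_pods(110))  # 11
print(min_vcpus_for_pods(256))  # 26
```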

To learn more, see manually upgrading a cluster or node pool.

Rate of pod changes

Kubernetes has internal limits that affect the rate of creating or deleting Pods (Pod churn) in response to scaling requests. Additional factors, such as deleting a Pod that is part of a Service, can also affect this Pod churn rate.

For clusters with up to 500 nodes, you can expect an average rate of 20 Pods created per second and 20 Pods deleted per second.

For clusters larger than 500 nodes, you can expect an average rate of 100 Pods created per second and 100 Pods deleted per second.

Take the Pod creation and deletion rate limits into consideration when planning how to scale your workloads.
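
To see what these rates mean for planning, the following Python sketch estimates how long a scale-up takes at the average rates quoted above. The function names are hypothetical, and the result is only a rough lower bound: it ignores scheduling, image pulls, and node provisioning.

```python
def expected_pod_rate_per_s(node_count):
    """Average Pod creation (or deletion) rate quoted above for a cluster size."""
    return 20 if node_count <= 500 else 100


def estimated_churn_seconds(pod_count, node_count):
    """Rough time to create or delete pod_count Pods at the average rate."""
    return pod_count / expected_pod_rate_per_s(node_count)


print(estimated_churn_seconds(6000, 300))   # 300.0 seconds on a 300-node cluster
print(estimated_churn_seconds(6000, 1000))  # 60.0 seconds on a 1,000-node cluster
```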

Pods share the same deletion throughput with other resource types (for example, EndpointSlices). Deletion throughput can therefore be lower when Pods are part of a Service.

To allow the cluster autoscaler to effectively remove Pods from underutilized nodes, avoid overly restrictive PodDisruptionBudgets and long termination grace periods.

Wildcard tolerations are also discouraged, because they can cause workloads to be scheduled on nodes that are in the process of being removed.

Number of open watches

Nodes create a watch for every Secret and ConfigMap that you configure for Pods. The combined number of watches created by all nodes might generate substantial load on the cluster control plane.

Having more than 200,000 watches per cluster might affect the initialization time of the cluster. This issue can cause the control plane to frequently restart.

Define larger nodes to decrease the likelihood and severity of issues caused by a large number of watches. Higher pod density (fewer large-sized nodes) might reduce the number of watches and mitigate the severity of the issue.
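
The effect of Pod density on the watch count can be illustrated with a simplified model in which each node opens one watch per distinct Secret or ConfigMap referenced by the Pods it runs. This model, the function name, and the object names are assumptions made for illustration; the real watch count also depends on other components.

```python
def estimate_watches(pods):
    """Count distinct (node, object) pairs under the simplified model.

    pods is an iterable of (node_name, referenced_objects), where
    referenced_objects is an iterable of (kind, namespace, name) tuples.
    """
    watched = set()
    for node, refs in pods:
        for ref in refs:
            watched.add((node, ref))
    return len(watched)


# Hypothetical objects shared by every Pod of one workload.
cfg = ("ConfigMap", "default", "app-config")
sec = ("Secret", "default", "app-secret")

# Eight Pods that all reference the same ConfigMap and Secret.
spread = [(f"node-{i}", [cfg, sec]) for i in range(8)]       # one Pod per node
packed = [(f"node-{i // 4}", [cfg, sec]) for i in range(8)]  # four Pods per node

print(estimate_watches(spread))  # 16
print(estimate_watches(packed))  # 4
```

In this model, packing the same Pods onto fewer, larger nodes lowers the watch count because Pods on the same node share the watches for the objects they have in common.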

To learn more, see the machine series comparison.

Number of Secrets per cluster if application-layer secrets encryption is enabled

A cluster must decrypt all Secrets during cluster startup when application-layer secrets encryption is enabled. If you store more than 30,000 Secrets, your cluster might become unstable during startup or upgrades, causing workload outages.

Store fewer than 30,000 Secrets when using application-layer secrets encryption.

To learn more, see encrypt secrets at the application layer.

Log bandwidth per node

There is a limit on the maximum volume of logs that each node can send to the Cloud Logging API. The default limit varies between 100 Kbps and 500 Kbps depending on the load. You can lift the limit to 10 Mbps by deploying a high-throughput Logging agent configuration. Going beyond this limit may cause log entries to be dropped.

Configure your logging to stay within the default limits, or configure a high-throughput Logging agent.

To learn more, see Increasing Logging agent throughput.

Backup for GKE limits

You can use Backup for GKE to back up and restore your GKE workloads.

Backup for GKE is subject to limits that you need to keep in mind when defining your backup plans.

Review the limits of Backup for GKE.

If it's possible for your workload to exceed these limits, we recommend creating multiple backup plans to partition your backup and stay within the limits.

Config Connector limits

You can use Config Connector to manage Google Cloud resources through Kubernetes. Config Connector has two modes of operation:

  • Cluster mode, where there is a single Config Connector instance per GKE cluster.

    In this mode, a single Config Connector instance loads all the resources.

  • Namespace mode, where each namespace within a cluster has a separate Config Connector instance.

    In this mode, you can partition managed resources by namespace. This setup reduces the number of resources that a single Config Connector instance needs to manage, lowering its CPU and memory usage.

Each mode has different scalability characteristics and limitations.

Each Config Connector instance has a 0.1 CPU request and a 512 MB memory limit. Therefore, it does not scale well to a large number of managed resources. We recommend having no more than 25,000 Google Cloud resources per Config Connector instance. This limit is for reference purposes only, because the amount of memory used depends on the type of resource and the specific use case.

When managing a larger number of managed resources, we recommend using the namespace mode to limit the number of resources handled by each Config Connector instance.

We recommend using up to 500 namespaces with Config Connector in the namespace mode. Each Config Connector instance opens many watch connections to the kube-apiserver. A large number of these instances may overload the GKE cluster control plane, especially during control plane upgrades, when watch connections need to be reinitialized.
The limit of 500 namespaces might be lower in new GKE clusters, because the CPU and memory available to the cluster control plane are initially based on the number of nodes in the cluster. The cluster needs time to adjust the available CPU and memory based on utilization.
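
The two reference figures above (25,000 resources per instance and 500 namespaces per cluster) can be combined into a rough planning check. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that resources can be partitioned evenly across namespaces, which might not hold for real workloads.

```python
import math

MAX_RESOURCES_PER_INSTANCE = 25_000  # reference figure per Config Connector instance
MAX_NAMESPACES = 500                 # recommended ceiling in namespace mode


def plan_namespaces(total_resources):
    """Return (namespaces_needed, fits_in_one_cluster) for namespace mode."""
    needed = max(1, math.ceil(total_resources / MAX_RESOURCES_PER_INSTANCE))
    return needed, needed <= MAX_NAMESPACES


print(plan_namespaces(100_000))     # (4, True)
print(plan_namespaces(20_000_000))  # (800, False)
```

When the second value is False, the resources do not fit within the recommendations for a single cluster, even with an even split across namespaces.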

If your resources can't be divided to fit within the limits specified above, we recommend using multiple GKE clusters to manage them.

See Config Controller scalability guidelines for additional details.

What's next?