Plan for scalability

This page describes general best practices for architecting scalable GKE clusters. You can apply these recommendations on all clusters and workloads to achieve the optimal performance. These recommendations are especially important for clusters that you plan to largely scale. Best practices are intended for administrators responsible for provisioning the infrastructure and for developers who prepare kubernetes components and workloads.

What is scalability?

In a Kubernetes cluster, scalability refers to the ability of the cluster to grow while staying within its service-level objectives (SLOs). Kubernetes also has its own set of SLOs

Kubernetes is a complex system, and its ability to scale is determined by multiple factors. Some of these factors include the type and number of nodes in a node pool, the types and numbers of node pools, the number of Pods available, how resources are allocated to Pods, and the number of Services or backends behind a Service.

Best practices for availability

Choosing a regional or zonal control plane

Due to architectural differences, regional clusters are better suited for high availability. Regional clusters have multiple control planes across multiple compute zones in a region, while zonal clusters have one control plane in a single compute zone.

If a zonal cluster is upgraded, the control plane VM experiences downtime during which the Kubernetes API is not available until the upgrade is complete.

In regional clusters, the control plane remains available during cluster maintenance like rotating IPs, upgrading control plane VMs, or resizing clusters or node pools. When upgrading a regional cluster, two out of three control plane VMs are always running during the rolling upgrade, so the Kubernetes API is still available. Similarly, a single-zone outage won't cause any downtime in the regional control plane.

However, the more highly available regional clusters come with certain trade-offs:

  • Changes to the cluster's configuration take longer because they must propagate across all control planes in a regional cluster instead of the single control plane in zonal clusters.

  • You might not be able to create or upgrade regional clusters as often as zonal clusters. If VMs cannot be created in one of the zones, whether from a lack of capacity or other transient problem, clusters cannot be created or upgraded.

Due to these trade-offs, zonal and regional clusters have different use cases:

  • Use zonal clusters to create or upgrade clusters rapidly when availability is less of a concern.
  • Use regional clusters when availability is more important than flexibility.

Carefully select the cluster type when you create a cluster because you cannot change it after the cluster is created. Instead, you must create a new cluster then migrate traffic to it. Migrating production traffic between clusters is possible but difficult at scale.

Choosing multi-zonal or single-zone node pools

To achieve high availability, the Kubernetes control plane and its nodes need to be spread across different zones. GKE offers two types of node pools: single-zone and multi-zonal.

To deploy a highly available application, distribute your workload across multiple compute zones in a region by using multi-zonal node pools which distribute nodes uniformly across zones.

If all your nodes are in the same zone, you won't be able to schedule Pods if that zone becomes unreachable. Using multi-zonal node pools has certains trade-offs:

  • GPUs are available only in specific zones. It may not be possible to get them in all zones in the region.

  • Round-trip latency between zones within a single region might be higher than that between resources in a single zone. The difference should be immaterial for most workloads.

  • The price of egress traffic between zones in the same region is available on the Compute Engine pricing page.

Best practices for scale

Base infrastructure

Kubernetes workloads require networking, compute, and storage. You need to provide enough CPU and memory to run Pods. However, there are more parameters of underlying infrastructure that can influence performance and scalability of a GKE cluster.

Cluster networking

Using a VPC-native cluster is the networking default and the recommended choice for setting up new GKE clusters. VPC-native clusters allow for larger workloads, higher number of nodes, and some other advantages.

In this mode, the VPC network has a secondary range for all Pod IP addresses. Each node is then assigned a slice of the secondary range for its own Pod IP addresses. This allows the VPC network to natively understand how to route traffic to Pods without relying on custom routes. A single VPC network can have up to 15,000 VMs.

Another approach, which is deprecated and supports no more than 1,500 nodes, is to use a routes-based cluster. A routes-based cluster is not a good fit for large workloads. It consumes VPC routes quota and lacks other benefits of the VPC-native networking. It works by adding a new custom route to the routing table in the VPC network for each new node.

Private clusters

In regular GKE clusters, all nodes have public IP addresses. In private clusters, nodes only have internal IP addresses to isolate nodes from inbound and outbound connectivity to the Internet. GKE uses VPC network peering to connect VMs running the Kubernetes API server with the rest of the cluster. This allows higher throughput between GKE control planes and nodes, as traffic is routed using private IP addresses.

Using private clusters has the additional security benefit that nodes are not exposed to the Internet.

Cluster load balancing

GKE Ingress and Cloud Load Balancing configure and deploy load balancers to expose Kubernetes workloads outside the cluster and also to the public internet. The GKE Ingress and Service controllers deploy objects such as forwarding rules, URL maps, backend services, network endpoint groups, and more on behalf of GKE workloads. Each of these resources has inherent quotas and limits and these limits also apply in GKE. When any particular Cloud Load Balancing resource has reached its quota, it will prevent a given Ingress or Service from deploying correctly and errors will appear in the resource's events.

The following table describes the scaling limits when using GKE Ingress and Services:

Load balancer Node limit per cluster
Internal passthrough Network Load Balancer
External passthrough Network Load Balancer 1000 nodes per zone
External Application Load Balancer
Internal Application Load Balancer No node limit

If you need to scale further, contact your Google Cloud sales team to increase this limit.


Service discovery in GKE is provided through kube-dns which is a centralized resource to provide DNS resolution to Pods running inside the cluster. This can become a bottleneck on very large clusters or for workloads which have a high request load. GKE automatically autoscales kube-dns based on the size of the cluster to increase its capacity. When this capacity is still not enough, GKE offers distributed, local resolution of DNS queries on each node with NodeLocal DNSCache. This provides a local DNS cache on each GKE node which answers queries locally, distributing the load and providing faster response times.

Managing IP addresses in VPC-native clusters

A VPC-native cluster uses three IP address ranges:

  • Primary range for node subnet: Defaults to /20 (4092 IP addresses).
  • Secondary range for Pod subnet: Defaults to /14 (262,144 IP addresses). However, you can configure the Pod subnet.
  • Secondary range for Service subnet: Defaults to /20 (4096 addresses). However, you can't change this range after you create this Service subnet.

For more information, see IP address ranges for VPC-native clusters.

IP address limitations and recommendations:

  • Node limit: Node limit is determined by both the primary and Pod IP addresses per node. There must be enough addresses in both node and Pod IP address ranges to provision a new node. By default, you can create only 1024 nodes due to Pod IP address limitations.
  • Pod limit per node: By default, the Pod limit per node is 110 Pods. However, you can configure smaller Pod CIDRs for efficient use with fewer Pods per node.
  • Scaling beyond RFC 1918: If you require more IP addresses than available within the private space defined by RFC 1918, then we recommend using non-RFC 1918 private addresses or PUPIs for additional flexibility.
  • Secondary IP address range for Service and Pod: By default, you can configure 4096 Services. However, you can configure more Services by choosing the Service subnet range. You cannot modify secondary ranges after creation. When you create a cluster, ensure that you choose ranges large enough to accommodate anticipated growth. However, you can add more IP addresses for Pods later using discontiguous multi-Pod CIDR. For more information, see Not enough free IP address space for Pods.

For more information, see node limiting ranges and Plan IP addresses when migrating to GKE.

Configuring nodes for better performance

GKE nodes are regular Google Cloud virtual machines. Some of their parameters, for example the number of cores or size of disk, can influence how GKE clusters perform.

Reducing Pod initialization times

You can use Image streaming to stream data from eligible container images as your workloads request them, which leads to faster initialization times.

Egress traffic

In Google Cloud, the machine type and the number of cores allocated to the instance determine its network capacity. Maximum egress bandwidth varies from 1 to 32 Gbps, while the maximum egress bandwidth for default e2-medium-2 machines is 2 Gbps. For details on bandwidth limits, see Shared-core machine types.

IOPS and disk throughput

In Google Cloud, the size of persistent disks determines the IOPS and throughput of the disk. GKE typically uses Persistent Disks as boot disks and to back Kubernetes' Persistent Volumes. Increasing disk size increases both IOPS and throughput, up to certain limits.

Each persistent disk write operation contributes to your virtual machine instance's cumulative network egress cap. Thus IOPS performance of disks, especially SSDs, also depend on the number of vCPUs in the instance in addition to disk size. Lower core VMs have lower write IOPS limits due to network egress limitations on write throughput.

If your virtual machine instance has insufficient CPUs, your application won't be able to get close to IOPS limit. As a general rule, you should have one available CPU for every 2000-2500 IOPS of expected traffic.

Workloads that require high capacity or large numbers of disks need to consider the limits of how many PDs can be attached to a single VM. For regular VMs, that limit is 128 disks with a total size of 64 TB, while shared-core VMs have a limit of 16 PDs with a total size of 3 TB. Google Cloud enforces this limit, not Kubernetes.

Monitor control plane metrics

Use the available control plane metrics to configure your monitoring dashboards. You can use control plane metrics to observe the health of the cluster, observe the results of cluster configuration changes (for example, deploying additional workloads or third-party components) or when troubleshooting issues.

One of the most important metrics to monitor is the latency of the Kubernetes API. When latency increases, this indicates overloading the system. Keep in mind that LISTs calls transferring large amounts of data are expected to have much higher latency than smaller requests.

Increased Kubernetes API latency can also be caused by slow responsiveness of third party admission webhooks. You can use metrics to measure the latency of webhooks to detect such common issues.

Best practices for Kubernetes developers

Use list and watch pattern instead of periodic listing

As a Kubernetes developer, you might need to create a component having the following requirements:

  • Your component needs to retrieve the list of some Kubernetes objects periodically.
  • Your component needs to run in multiple instances (in case of DaemonSet, even on each node).

Such a component can generate load spikes on the kube-apiserver, even if the state of periodically retrieved objects is not changing.

The simplest approach is to use periodic LIST calls. However, this is an inefficient and expensive approach for both the caller and the server because all objects must be loaded to the memory, serialized, and transferred each time. The excessive use of LIST requests might overload the control plane or generate heavy throttling of such requests.

You can improve your component by setting resourceVersion=0 parameter on LIST calls. This allows kube-apiserver to use in-memory object cache and reduces the number of internal interactions between kube-apiserver and the etcd database and related processing.

We highly recommend avoiding repeatable LIST calls and replace them with the list and watch pattern. List the objects once and then use Watch API to get incremental changes of the state. This approach reduces processing time and minimizes traffic compared to periodic LIST calls. When objects do not change, no additional load is generated.

If you use Go language, then check SharedInformer and SharedInformerFactory for Go packages implementing this pattern.

Limit unnecessary traffic generated by watches and lists

Kubernetes internally uses watches to send notifications about object updates. Even with watches requiring much less resources than periodic LIST calls, processing of watches in large clusters can take a significant portion of cluster resources and affect the cluster performance. The biggest, negative impact is generated by creating watches that observe frequently changing objects from multiple places. For example, by observing data about all Pods from a component running on all nodes. If you install third-party code or extensions on your cluster, they can create such watches under the hood.

We recommend the following best practices:

  • Reduce unnecessary processing and traffic generated by watches and LIST calls.
  • Avoid creating watches that observe frequently changing objects from multiple places (for example DaemonSets).
  • (Highly recommended) Create a central controller that watches and processes required data on a single node.
  • Watch only a subset of the objects, for example, kubelet on each node observes only pods scheduled on the same node.
  • Avoid deploying third party components or extensions that could impact cluster performance by making a high volume of watches or LIST calls.

Limit Kubernetes object manifest size

If you need fast operations that require high Pod throughput, such as resizing or updating large workloads, make sure to keep Pod manifest sizes to a minimum, ideally below 10 KiB.

Kubernetes stores resource manifests in etcd. The entire manifest is sent every time the resource is retrieved, including when you use the list and watch pattern.

Manifest size has the following limitations:

  • Maximum size for each object in etcd: Approximately 1.5 MiB.
  • Total quota for all etcd objects in the cluster: The pre-configured quota size is 6 GiB. This includes a change log with all updates to all objects in the most recent 150 seconds of cluster history.
  • Control plane performance during high traffic periods: Larger manifest sizes increase the load on the API server.

For single, rarely processed objects, the manifest size is usually not a concern as long as it's below 1.5 MiB. However, manifest sizes above 10 KiB for numerous, frequently processed objects - such as pods in very large workloads - can cause increased API call latency and decreased overall performance. Lists and watches in particular can be significantly affected by large manifest sizes. You might also have issues with etcd quota, because the quantity of revisions in the last 150 seconds can accumulate quickly during periods of high API server traffic.

To reduce a pod's manifest size, Kubernetes ConfigMaps can be leveraged to store part of the configuration, particularly the part that is shared by multiple pods in the cluster. For example, environment variables are often shared by all pods in a workload.

Note that it's also possible to run into similar issues with ConfigMap objects if they are as numerous, as large, and processed as frequently as the pods. Extracting part of the configuration is most useful when it decreases overall traffic.

Disable automounting of default service account

If logic running in your Pods does not need to access Kubernetes API, you should disable the default, automatic mounting of service account to avoid creation of related Secrets and watches.

When you create a Pod without specifying a service account, Kubernetes automatically does the following actions:

  • Assigns the default service account to the Pod.
  • Mounts the service account credentials as a Secret for the Pod.
  • For every mounted Secret, the kubelet creates a watch to observe changes to that Secret on every node.

In large clusters, these actions represent thousands of unnecessary watches which may put a significant load on the kube-apiserver.

Use Protocol buffers instead of JSON for API requests

Use protocol buffers when implementing highly scalable components as described in Kubernetes API Concepts.

Kubernetes REST API supports JSON and protocol buffers as a serialization format for objects. JSON is used by default but protocol buffers are more efficient for performance at scale because it requires less CPU intensive processing and sends less data over the network. The overhead related to processing of JSON can cause timeouts when listing large-size data.

What's next?