Gateway traffic management

This page explains how Gateway traffic management works.


Google Kubernetes Engine (GKE) networking is built upon Cloud Load Balancing. With Cloud Load Balancing, a single anycast IP address delivers global traffic management. Google's traffic management combines global and regional load balancing, autoscaling, and capacity management to deliver equalized, stable, low-latency traffic distribution. Using the GKE Gateway controller, GKE users can use Google's global traffic management controls in a declarative, Kubernetes-native manner.

To try traffic spillover between clusters, see Deploying capacity-based load balancing. To try traffic-based autoscaling, see Autoscaling based on load balancer traffic.

Traffic management

Load balancing, autoscaling, and capacity management are the foundations of a traffic management system. They operate together to equalize and stabilize system load.

  • Load balancing distributes traffic across backend Pods according to location, health, and different load balancing algorithms.
  • Autoscaling scales workload replicas to create more capacity to absorb more traffic.
  • Capacity management monitors the utilization of Services so that traffic can overflow to backends with capacity rather than impacting application availability or performance.

These capabilities can be combined in different ways depending on your goals. For example:

  • If you want to take advantage of low-cost Spot VMs, you might want to optimize for evenly distributing traffic across Spot VMs at the cost of latency. Using load balancing and capacity management, GKE would overflow traffic between regions based on capacity so that Spot VMs are fully utilized wherever they are available.
  • If you want to optimize user latency at the cost of over-provisioning, you could deploy GKE clusters in many regions and increase capacity dynamically wherever load increases. Using load balancing and autoscaling, GKE would autoscale the number of Pods when traffic spikes so that traffic does not have to overflow to other regions. Regions would grow in capacity so that they can fully handle load as close as possible to users.

The following diagram shows load balancing, autoscaling, and capacity management operating together:

Load balancing, autoscaling and capacity management diagram

In the diagram, the workload in the gke-us cluster has failed. Load balancing and health checking drain active connections and redirect traffic to the next closest cluster. The workload in gke-asia receives more traffic than it has capacity for, so it sheds load to gke-eu. The gke-eu cluster receives more load than usual because of the events in gke-us and gke-asia, so it autoscales to increase its traffic capacity.

To learn more about how Cloud Load Balancing handles traffic management, see global capacity management.

Traffic management capabilities

Gateway, HTTPRoute, Service, and Policy resources provide the controls to manage traffic in GKE. The GKE Gateway controller is the control plane that monitors these resources.

The following traffic management capabilities are available when deploying Services in GKE:

  • Service capacity: the ability to specify the amount of traffic capacity that a Service can receive before Pods are autoscaled or traffic overflows to other available clusters.
  • Traffic-based autoscaling: autoscaling Pods within a Service based on HTTP requests received per second.
  • Multi-cluster load balancing: the ability to load balance to Services hosted across multiple GKE clusters or multiple regions.
  • Traffic splitting: explicit, weight-based traffic distribution across backends. Traffic splitting is supported with single-cluster Gateways in GA.

Traffic management support

The available traffic management capabilities depend on the GatewayClass that you deploy. For a complete list of feature support, see GatewayClass capabilities. The following table summarizes GatewayClass support for traffic management:

GatewayClass | Service capacity | Traffic autoscaling | Multi-cluster load balancing | Traffic splitting [1]
[1] Traffic splitting is supported with single-cluster Gateways in GA.

Global, regional, and zonal load balancing

Service capacity, location, and health all determine how much traffic the load balancer sends to a given backend. Load balancing decisions are made at the following levels, starting with global for global load balancers and regional for regional load balancers:

  • Global: traffic is sent to the Google Cloud region closest to the client that has healthy backends with capacity. As long as the region has capacity, it receives all traffic from the clients closest to it. If a region does not have capacity, excess traffic overflows to the next closest region with capacity. To learn more, see global load balancing.
  • Regional: traffic is sent by the load balancer to a specific region. The traffic is load balanced across zones in proportion to the zone's available serving capacity. To learn more, see regional load balancing.
  • Zonal: after traffic is determined for a specific zone, the load balancer distributes traffic evenly across backends within that zone. Existing TCP connections and session persistence settings are preserved, so that future requests go to the same backends, as long as the backend Pod is healthy. To learn more, see zonal load balancing.

Global load balancing and traffic overflow

To try the following concepts in your own cluster, see Capacity-based load balancing.

Under normal conditions, traffic is sent to the closest backend to the client. Traffic terminates at the closest Google point of presence (PoP) to the client and then traverses the Google backbone until it reaches the closest backend, as determined by network latency. When the backends in a region do not have remaining capacity, traffic overflows to the next closest cluster with healthy backends that have capacity. If less than 50% of backend Pods within a zone are healthy, then traffic gradually fails over to other zones or regions, independent of the configured capacity.

Traffic overflow only occurs under the following conditions:

  • You are using a multi-cluster Gateway.
  • You have the same Service deployed across multiple clusters, served by the multi-cluster Gateway.
  • You have Service capacities configured such that traffic exceeds service capacities in one cluster, but not others.

The following diagram demonstrates how global load balancing works with traffic overflow:

Global load balancing with traffic overflow

In the diagram:

  • A multi-cluster Gateway provides global internet load balancing for the store Service. The service is deployed across two GKE clusters, one in us-west1 and another in europe-west1. Each cluster is running 2 replicas.
  • Each Service is configured with max-rate-per-endpoint="10", which means that each Service has a total capacity of 2 replicas * 10 RPS = 20 RPS in each cluster.
  • Google PoPs in North America receive 6 RPS. All traffic is sent to the nearest healthy backend with capacity, the GKE cluster in us-west1.
  • European PoPs receive 30 cumulative RPS. The closest backends are in europe-west1, but they only have 20 RPS of capacity. Because the backends in us-west1 have excess capacity, 10 RPS overflows to us-west1 so that it receives 16 RPS in total and distributes 8 RPS to each Pod.
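
The arithmetic in this example can be reproduced with a small sketch. It is only an illustration of the overflow idea, not how Cloud Load Balancing is implemented: the function name `route_with_overflow` and the `preference` map (a stand-in for network proximity) are hypothetical, and the fallback for traffic that no region can absorb is a simplifying assumption.

```python
def route_with_overflow(demand, capacity, preference):
    """Serve each region's demand locally first, then overflow the excess.

    demand:     RPS arriving closest to each region
    capacity:   total Service capacity per region (replicas * max rate)
    preference: for each region, the other regions ordered nearest first
                (hypothetical stand-in for network latency)
    Returns the RPS that each region ends up serving.
    """
    served = {region: 0 for region in capacity}

    # Pass 1: every region serves its own closest traffic, up to capacity.
    excess = {}
    for region, rps in demand.items():
        local = min(rps, capacity[region])
        served[region] += local
        excess[region] = rps - local

    # Pass 2: excess traffic spills to the nearest region with spare capacity.
    for region, rps in excess.items():
        for target in preference[region]:
            if rps == 0:
                break
            spare = max(0, capacity[target] - served[target])
            moved = min(rps, spare)
            served[target] += moved
            rps -= moved
        # Simplifying assumption: traffic that no region can absorb
        # stays on the local backends.
        served[region] += rps

    return served


replicas, max_rate = 2, 10
capacity = {"us-west1": replicas * max_rate, "europe-west1": replicas * max_rate}
demand = {"us-west1": 6, "europe-west1": 30}
preference = {"us-west1": ["europe-west1"], "europe-west1": ["us-west1"]}

served = route_with_overflow(demand, capacity, preference)
print(served)  # {'us-west1': 16, 'europe-west1': 20}
```

With the values from the diagram, us-west1 serves its own 6 RPS plus 10 RPS of overflow, which is 8 RPS for each of its 2 Pods.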

Preventing traffic overflow

Traffic overflow helps prevent exceeding application capacity that can impact performance or availability.

However, you might not want to overflow traffic. Latency-sensitive applications, for example, might not benefit from traffic overflow to a much more distant backend.

You can use any of the following methods to prevent traffic overflow:

  • Use only single-cluster Gateways, which can host Services in only a single cluster.
  • If you use multi-cluster Gateways, deploy the replicas of an application in each cluster as separate Services. From the perspective of the Gateway, this still enables multi-cluster load balancing, but the endpoints of a Service are not aggregated across clusters.
  • Set Service capacities high enough that traffic does not realistically exceed them, so that overflow occurs only when absolutely necessary.

Load balancing within a region

Within a region, traffic is distributed across zones according to the available capacities of the backends. This is not using overflow, but rather load balancing in direct proportion to the Service capacities in each zone. Any individual flow or session is always sent to a single, consistent backend Pod and is not split.

The following diagram shows how traffic is distributed within a region:

Traffic distributed within a region

In the diagram:

  • A Service is deployed in a regional GKE cluster. The Service has 4 Pods which are deployed unevenly across zones. 3 Pods are in zone A, 1 Pod is in zone B, and 0 Pods are in zone C.
  • The Service is configured with max-rate-per-endpoint="10". Zone A has 30 RPS of total capacity, zone B has 10 RPS of total capacity, and zone C has 0 RPS of total capacity, because it has no Pods.
  • The Gateway receives a total of 16 RPS of traffic from different clients. This traffic is distributed across zones in proportion to the remaining capacity in each zone.
  • Traffic flow from any individual source or client is consistently load balanced to a single backend Pod according to the session persistence settings. The distribution of traffic splits across different source traffic flows so that any individual flows are never split. As a result, a minimum amount of source or client diversity is required to granularly distribute traffic across backends.

For example, if the incoming traffic spikes from 16 RPS to 60 RPS, either of the following scenarios would occur:

  • If using single-cluster Gateways, then there are no other clusters or regions for this traffic to overflow to. Traffic continues to be distributed according to the relative zonal capacities, even if incoming traffic exceeds the total capacity. As a result, zone A receives 45 RPS and zone B receives 15 RPS.
  • If using multi-cluster Gateways with Services distributed across multiple clusters, then traffic can overflow to other clusters and other regions as described in Global load balancing and traffic overflow. Zone A receives 30 RPS, zone B receives 10 RPS, and 20 RPS overflows to another cluster.
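
The zone-level numbers above can be checked with a short sketch. It assumes that all zones start idle, so "remaining capacity" equals configured capacity, and it models only the proportional split, not session persistence. The function name `distribute_by_capacity` and the `allow_overflow` switch are hypothetical names used for illustration.

```python
def distribute_by_capacity(total_rps, zone_capacity, allow_overflow):
    """Split traffic across zones in proportion to each zone's capacity.

    allow_overflow=False models a single-cluster Gateway: the zones absorb
    all traffic, even above their total capacity.
    allow_overflow=True models a multi-cluster Gateway: traffic above the
    total capacity leaves this cluster.
    Returns (RPS per zone, RPS that overflows elsewhere).
    """
    total_capacity = sum(zone_capacity.values())
    if total_capacity == 0:
        # No Pods in any zone: this cluster cannot receive traffic.
        return {zone: 0.0 for zone in zone_capacity}, total_rps

    overflow = 0
    if allow_overflow:
        overflow = max(0, total_rps - total_capacity)
    in_cluster = total_rps - overflow

    per_zone = {
        zone: in_cluster * cap / total_capacity
        for zone, cap in zone_capacity.items()
    }
    return per_zone, overflow


# 3 Pods in zone A, 1 Pod in zone B, 0 Pods in zone C, at 10 RPS per Pod.
zones = {"A": 30, "B": 10, "C": 0}

print(distribute_by_capacity(16, zones, allow_overflow=False))
# ({'A': 12.0, 'B': 4.0, 'C': 0.0}, 0)
print(distribute_by_capacity(60, zones, allow_overflow=False))
# ({'A': 45.0, 'B': 15.0, 'C': 0.0}, 0)
print(distribute_by_capacity(60, zones, allow_overflow=True))
# ({'A': 30.0, 'B': 10.0, 'C': 0.0}, 20)
```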

Load balancing within a zone

Once traffic has been sent to a zone, it is distributed evenly across all the backends within that zone. HTTP sessions are persistent depending on the session affinity setting. Unless the backend becomes unavailable, existing TCP connections never move to a different backend. This means that long-lived connections continue going to the same backend Pod even if new connections overflow because of limited capacity. The load balancer prioritizes maintaining existing connections over new ones.

Service capacity

With Service capacity, you can define a requests per second (RPS) value per Pod in a Service. This value represents the maximum average RPS per Pod that a Service can receive. It is configured on the Service and is used for both traffic-based autoscaling and capacity-based load balancing.


Service capacity has the following requirements and limitations:

  • Service capacity only impacts load balancing if you are using traffic-based autoscaling or multi-cluster Gateways. If you are not using these capabilities, Service capacity has no effect on network traffic.

Configure Service capacity

To configure Service capacity, create a Service using the networking.gke.io/max-rate-per-endpoint annotation. The following manifest describes a Service with a maximum RPS:

apiVersion: v1
kind: Service
metadata:
  name: store
  annotations:
    networking.gke.io/max-rate-per-endpoint: "RATE_PER_SECOND"
spec:
  ports:
  - port: 8080
    targetPort: 8080
    name: http
  selector:
    app: store
  type: ClusterIP

Replace RATE_PER_SECOND with the maximum HTTP/HTTPS requests per second that a single Pod in this Service should receive.

The max-rate-per-endpoint value creates a dynamic capacity for a Service based on the number of Pods in the Service. The total Service capacity value is calculated by multiplying the max-rate-per-endpoint value with the number of replicas, as described in the following formula:

Total Service capacity = max-rate-per-endpoint * number of replicas

If an autoscaler scales up the number of Pods within a Service, then the Service's total capacity is recomputed accordingly. If a Service is scaled down to zero Pods, then it has zero capacity and does not receive any traffic from the load balancer.
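
As a quick illustration of the formula, the following sketch computes the total capacity for a few replica counts. The function name `total_service_capacity` is hypothetical and the per-Pod rate of 10 RPS is only an example value.

```python
def total_service_capacity(max_rate_per_endpoint, replicas):
    """Total Service capacity = max-rate-per-endpoint * number of replicas."""
    return max_rate_per_endpoint * replicas


# Capacity follows the replica count; zero Pods means zero capacity.
capacities = {n: total_service_capacity(10, n) for n in (0, 2, 5)}
print(capacities)  # {0: 0, 2: 20, 5: 50}
```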

Service capacity and standalone NEGs

Service capacity can also be configured when using standalone NEGs; however, it does not use the max-rate-per-endpoint annotation. When using standalone NEGs, the max-rate-per-endpoint value is configured manually when adding the NEG to a Backend Service resource. With the gcloud compute backend-services add-backend command, the --max-rate-per-endpoint flag configures capacity for each NEG individually.

This can be useful for any of the following workflows:

There is no functional difference when configuring service capacity with standalone NEGs. Both traffic autoscaling and traffic spillover are supported.

Determine your Service's capacity

Determining the value for max-rate-per-endpoint requires an understanding of your application's performance characteristics and your load balancing goals. The following strategies can help you define your application's performance characteristics:

  • Observe your application in both test and production environments when configured without Service capacity.
  • Use Cloud Monitoring to create a correlation between traffic requests and your performance service level objectives (SLOs).
  • Define what your performance SLOs are for your application. They might be one or more of the following, depending on what you consider "bad" or "unstable" performance. All of the following can be gathered from Cloud Monitoring load balancer metrics:
    • Response error codes
    • Response or total latency
    • Backend unhealthiness or downtime
  • Observe your application under traffic load in both test and production environments. In test environments, stress your application under increasing request load so you can see how the different performance metrics are impacted as traffic increases. In production environments, observe realistic traffic patterns and levels.

Default Service capacity

All Services attached to GKE resources have a default Service capacity configured even if it isn't explicitly configured using the annotation. To learn more, see Default service capacity.

The following table describes the default capacities:

Load balancing resource type | Default max-rate-per-endpoint
Ingress (internal and external) | 1 RPS
Gateway (all GatewayClasses) | 100,000,000 RPS
MultiClusterIngress | 100,000,000 RPS

Traffic-based autoscaling

Traffic-based autoscaling is a capability of GKE that natively integrates traffic signals from load balancers to autoscale Pods. Traffic-based autoscaling is only supported for single-cluster Gateways.

To use traffic-based autoscaling, see Autoscaling based on load balancer traffic.

Traffic-based autoscaling provides the following benefits:

  • Applications which are not strictly CPU or memory bound might have capacity limits which are not reflected in their CPU or memory usage.
  • Traffic, or requests per second (RPS), is an easier metric to understand in some cases because it is more aligned with app usage and business metrics such as page views or daily active users (DAUs).
  • Traffic is a leading indicator that represents instantaneous demand, compared with CPU or memory, which are lagging indicators.
  • The combination of CPU, memory, and traffic autoscaling metrics provides a holistic way of autoscaling applications that uses multiple dimensions to ensure that capacity is appropriately provisioned.

The following diagram demonstrates how traffic-based autoscaling works:

Traffic-based autoscaling

In the diagram:

  • The Service owner configures Service capacity and a target utilization for the Deployment.
  • The Gateway receives traffic from clients going to the store Service. The Gateway sends utilization telemetry to the GKE Pod Autoscaler. Utilization is equal to the actual traffic received by an individual Pod divided by the Pod's configured capacity.
  • The GKE Pod Autoscaler scales Pods up or down according to the configured target utilization.

Autoscaling behavior

The following diagram shows how traffic-based autoscaling works on an application receiving 10 RPS through the load balancer:

Traffic-based autoscaling with 10 RPS

In the diagram, the Service owner has configured the capacity of the store Service to 10 RPS, which means that each Pod can receive a maximum of 10 RPS. The HorizontalPodAutoscaler is configured with averageUtilization set to 70, which means that the target utilization is 70% of 10 RPS per Pod.

The autoscaler attempts to scale replicas to achieve the following equation:

replicas = ceiling[ current traffic / ( averageUtilization * max-rate-per-endpoint) ]

In the diagram, this equation computes to:

ceiling[ 10 rps / (0.7 * 10 rps) ] = ceiling[ 1.4 ] = 2 replicas

10 RPS of traffic results in 2 replicas. Each replica receives 5 RPS, which is under the target utilization of 7 RPS.
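
The replica equation can be sketched as a small function. This is only an illustration of the formula, not the autoscaler's implementation. The function name `desired_replicas` is hypothetical, and the target utilization is passed as a whole-number percentage (70 means 70%) so that integer arithmetic avoids floating-point rounding at exact boundaries.

```python
def desired_replicas(traffic_rps, utilization_percent, max_rate_per_endpoint):
    """ceiling[ traffic / (utilization * max-rate-per-endpoint) ]."""
    numerator = traffic_rps * 100
    denominator = utilization_percent * max_rate_per_endpoint
    return -(-numerator // denominator)  # integer ceiling division


# The example from the diagram: 10 RPS, 70% target, 10 RPS per Pod.
print(desired_replicas(10, 70, 10))  # 2

# Each Pod targets 7 RPS, so 14 RPS still fits in 2 replicas
# and 15 RPS needs a third.
print(desired_replicas(14, 70, 10))  # 2
print(desired_replicas(15, 70, 10))  # 3
```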

Traffic splitting

Traffic splitting uses an explicit ratio, called a weight, that defines the proportion of HTTP requests that are sent to a Service. HTTPRoute resources let you configure weights on a list of Services. The relative weights between Services define the split of traffic between them. This is useful for splitting traffic during rollouts, canarying changes, or for emergencies.

The following diagram describes an example traffic splitting configuration:

Traffic splitting configuration

In the diagram:

  • The Service owner configures two services for a single route, with a rule splitting traffic 90% to store-v1 and 10% to store-v2.
  • The Gateway receives traffic from clients going to the URL of the store application and traffic is split according to the configured rule. 90% of traffic routes to store-v1 and 10% routes to store-v2.
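
The effect of relative weights can be sketched as follows. The function name `split_by_weight` is hypothetical, the result is the expected long-run distribution rather than a per-request decision, and the sketch assumes that at least one weight is positive.

```python
def split_by_weight(total_requests, weights):
    """Expected requests per backend for a weight-based split.

    Each backend's share is its weight divided by the sum of all weights,
    so only the ratio between weights matters. Integer division rounds
    any fractional request down.
    """
    total_weight = sum(weights.values())
    return {
        name: total_requests * weight // total_weight
        for name, weight in weights.items()
    }


print(split_by_weight(1000, {"store-v1": 90, "store-v2": 10}))
# {'store-v1': 900, 'store-v2': 100}

# Weights of 9 and 1 describe the same 90/10 split.
print(split_by_weight(1000, {"store-v1": 9, "store-v2": 1}))
# {'store-v1': 900, 'store-v2': 100}
```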

Traffic splitting is supported between Services in the same cluster and also between Services in different clusters:

  • Traffic splitting between Services: used for splitting traffic for application version rollouts. Using the traffic splitting example, you would have two separate Deployments, store-v1 and store-v2, which each have their own Service, store-v1 and store-v2. Weights are configured between the two Services to gradually shift traffic until store-v2 is fully rolled out.

  • Traffic splitting between ServiceImports: used for shifting traffic to or from specific clusters for maintenance, migration, or emergencies. ServiceImports represent multi-cluster Services and enable traffic splitting between different Services on different clusters. The exercise Blue-green, multi-cluster routing with Gateway demonstrates splitting traffic across clusters.

Weight vs capacity

Weights and capacities both control how much traffic is sent to different Services. While they have similar effects, they operate differently and have different use cases. They can and should be used together, though for different purposes.


Weight is an explicit control of traffic. It defines the exact proportions of traffic, independent of incoming traffic and backend utilization. In the traffic splitting example, if store-v2 was over-capacity, or if all of its replicas failed, 10% of the traffic would still be allocated to store-v2, potentially causing traffic to be dropped. That is because weight does not change the proportion of traffic based on utilization or health.

Weight is best suited for the following use cases:

  • Shifting traffic between different versions of a service for rollouts.
  • Manually onboarding services using explicit traffic splits.
  • Shifting traffic away from a set of backends for emergency or maintenance purposes.


Capacity is an implicit control of traffic. It defines the proportions of traffic indirectly as they depend on the amount of incoming traffic, backend utilization, and the source location of traffic. Capacity is an inherent property of a Service and is typically updated much less frequently.

Capacity is best suited for the following use cases:

  • Preventing backend over-utilization during traffic spikes.
  • Controlling the rate of autoscaling with respect to traffic.

Configuring Service capacity to overflow traffic may not always be a behavior that you want. Consider the global load balancing example. Service capacity protects backends from over-utilization by overflowing traffic, but this might result in extra latency for the requests that have overflowed, since those requests are traveling to a more remote region.

If your application is not very sensitive to overutilization then you might want to configure a very high Service capacity so that traffic is unlikely to ever overflow to another region. If your application's availability or latency is sensitive to overutilization, then overflowing traffic to other clusters or regions may be better than absorbing excess traffic on over-utilized backends. To learn more about how to configure Service capacity for your application, see Determine your Service's capacity.

What's next