This page explains how Gateway traffic management works.
Overview
Google Kubernetes Engine (GKE) networking is built on Cloud Load Balancing. With Cloud Load Balancing, a single anycast IP address delivers global traffic management. Google's traffic management provides global and regional load balancing, autoscaling, and capacity management to deliver equalized, stable, and low-latency traffic distribution. Using the GKE Gateway controller, GKE users can apply Google's global traffic management controls in a declarative and Kubernetes-native manner.
To try traffic spillover between clusters, see Deploying capacity-based load balancing. To try traffic-based autoscaling, see Autoscaling based on load balancer traffic.
Traffic management
Load balancing, autoscaling, and capacity management are the foundations of a traffic management system. They operate together to equalize and stabilize system load.
- Load balancing distributes traffic across backend Pods according to location, health, and different load balancing algorithms.
- Autoscaling scales workload replicas to create more capacity to absorb more traffic.
- Capacity management monitors the utilization of Services so that traffic can overflow to backends with capacity rather than impacting application availability or performance.
These capabilities can be combined in different ways depending on your goals. For example:
- If you want to take advantage of low-cost Spot VMs, you might want to optimize for evenly distributing traffic across Spot VMs at the cost of latency. Using load balancing and capacity management, GKE would overflow traffic between regions based on capacity so that Spot VMs are fully utilized wherever they are available.
- If you want to optimize user latency at the cost of over-provisioning, you could deploy GKE clusters in many regions and increase capacity dynamically wherever load increases. Using load balancing and autoscaling, GKE would autoscale the number of Pods when traffic spikes so that traffic does not have to overflow to other regions. Regions would grow in capacity so that they are able to fully handle load as close as possible to users.
The following diagram shows load balancing, autoscaling, and capacity management operating together:
In the diagram, the workload in the `gke-us` cluster has failed. Load balancing and health checking drain active connections and redirect traffic to the next closest cluster. The workload in `gke-asia` receives more traffic than it has capacity for, so it sheds load to `gke-eu`. `gke-eu` receives more load than typical because of the events in `gke-us` and `gke-asia`, and so it autoscales to increase its traffic capacity.
To learn more about how Cloud Load Balancing handles traffic management, see global capacity management.
Traffic management capabilities
Gateway, HTTPRoute, Service, and Policy resources provide the controls to manage traffic in GKE. The GKE Gateway controller is the control plane that monitors these resources.
The following traffic management capabilities are available when deploying Services in GKE:
- Service capacity: the ability to specify the amount of traffic capacity that a Service can receive before Pods are autoscaled or traffic overflows to other available clusters.
- Traffic-based autoscaling: autoscaling Pods within a Service based on HTTP requests received per second.
- Multi-cluster load balancing: the ability to load balance to Services hosted across multiple GKE clusters or multiple regions.
- Traffic splitting: explicit, weight-based traffic distribution across backends. Traffic splitting is supported with single-cluster Gateways in GA.
Traffic management support
The available traffic management capabilities depend on the GatewayClass that you deploy. For a complete list of feature support, see GatewayClass capabilities. The following table summarizes GatewayClass support for traffic management:
GatewayClass | Service capacity | Traffic autoscaling | Multi-cluster load balancing | Traffic splitting¹ |
---|---|---|---|---|
`gke-l7-global-external-managed` | Yes | Yes | No | Yes |
`gke-l7-regional-external-managed` | Yes | Yes | No | Yes |
`gke-l7-rilb` | Yes | Yes | No | Yes |
`gke-l7-gxlb` | Yes | Yes | No | Yes |
`gke-l7-global-external-managed-mc` | Yes | No | Yes | Yes |
`gke-l7-regional-external-managed-mc` | Yes | No | Yes | Yes |
`gke-l7-rilb-mc` | Yes | No | Yes | Yes |
`gke-l7-gxlb-mc` | Yes | No | Yes | Yes |

¹ Traffic splitting is supported with single-cluster Gateways in GA.
Global, regional, and zonal load balancing
Service capacity, location, and health all determine how much traffic the load balancer sends to a given backend. Load balancing decisions are made at the following levels, starting with global for global load balancers and regional for regional load balancers:
- Global: traffic is sent to the closest Google Cloud region to the client that has healthy backends with capacity. As long as the region has capacity, it receives all of its closest traffic. If a region does not have capacity, excess traffic overflows to the next closest region with capacity. To learn more, see global load balancing.
- Regional: traffic is sent by the load balancer to a specific region. The traffic is load balanced across zones in proportion to the zone's available serving capacity. To learn more, see regional load balancing.
- Zonal: after traffic is determined for a specific zone, the load balancer distributes traffic evenly across backends within that zone. Existing TCP connections and session persistence settings are preserved, so that future requests go to the same backends, as long as the backend Pod is healthy. To learn more, see zonal load balancing.
Global load balancing and traffic overflow
To try the following concepts in your own cluster, see Capacity-based load balancing.
Under normal conditions, traffic is sent to the closest backend to the client. Traffic terminates at the closest Google point of presence (PoP) to the client and then traverses the Google backbone until it reaches the closest backend, as determined by network latency. When the backends in a region do not have remaining capacity, traffic overflows to the next closest cluster with healthy backends that have capacity. If less than 50% of backend Pods within a zone are unhealthy, then traffic gradually fails over to other zones or regions, independent of the configured capacity.
Traffic overflow only occurs under the following conditions:
- You are using a multi-cluster Gateway.
- You have the same Service deployed across multiple clusters, served by the multi-cluster Gateway.
- You have Service capacities configured such that traffic exceeds service capacities in one cluster, but not others.
The following diagram demonstrates how global load balancing works with traffic overflow:
In the diagram:
- A multi-cluster Gateway provides global internet load balancing for the `store` Service. The Service is deployed across two GKE clusters, one in `us-west1` and another in `europe-west1`. Each cluster is running 2 replicas.
- Each Service is configured with `max-rate-per-endpoint="10"`, which means that each Service has a total capacity of 2 replicas * 10 RPS = 20 RPS in each cluster.
- Google PoPs in North America receive 6 RPS. All traffic is sent to the nearest healthy backend with capacity, the GKE cluster in `us-west1`.
- European PoPs receive 30 cumulative RPS. The closest backends are in `europe-west1`, but they only have 20 RPS of capacity. Because the backends in `us-west1` have excess capacity, 10 RPS overflows to `us-west1` so that it receives 16 RPS in total and distributes 8 RPS to each Pod.
Preventing traffic overflow
Traffic overflow helps prevent exceeding application capacity that can impact performance or availability.
However, you might not want to overflow traffic. Latency-sensitive applications, for example, might not benefit from traffic overflow to a much more distant backend.
You can use any of the following methods to prevent traffic overflow:
- Use only single-cluster Gateways which can host Services in only a single cluster.
- Even if using multi-cluster Gateways, replicas of an application deployed across multiple clusters can be deployed as separate Services. From the perspective of the Gateway, this enables multi-cluster load balancing, but does not aggregate all endpoints of a Service between clusters.
- Set Service capacities at a high enough level that traffic capacity is never realistically exceeded unless absolutely necessary.
Load balancing within a region
Within a region, traffic is distributed across zones according to the available capacities of the backends. This does not use overflow; rather, traffic is load balanced in direct proportion to the Service capacities in each zone. Any individual flow or session is always sent to a single, consistent backend Pod and is not split.
The following diagram shows how traffic is distributed within a region:
In the diagram:
- A Service is deployed in a regional GKE cluster. The Service has 4 Pods which are deployed unevenly across zones. 3 Pods are in zone A, 1 Pod is in zone B, and 0 Pods are in zone C.
- The Service is configured with `maxRatePerEndpoint="10"`. Zone A has 30 RPS of total capacity, zone B has 10 RPS of total capacity, and zone C has 0 RPS of total capacity because it has no Pods.
- The Gateway receives a total of 16 RPS of traffic from different clients. This traffic is distributed across zones in proportion to the remaining capacity in each zone.
- Traffic flow from any individual source or client is consistently load balanced to a single backend Pod according to the session persistence settings. Traffic distribution splits across different source traffic flows so that any individual flow is never split. As a result, a minimum amount of source or client diversity is required to distribute traffic granularly across backends.
For example, if the incoming traffic spikes from 16 RPS to 60 RPS, either of the following scenarios would occur:
- If using single-cluster Gateways, then there are no other clusters or regions for this traffic to overflow to. Traffic continues to be distributed according to the relative zonal capacities, even if incoming traffic exceeds the total capacity. As a result, zone A receives 45 RPS and zone B receives 15 RPS.
- If using multi-cluster Gateways with Services distributed across multiple clusters, then traffic can overflow to other clusters and other regions as described in Global load balancing and traffic overflow. Zone A receives 30 RPS, zone B receives 10 RPS, and 20 RPS overflows to another cluster.
Load balancing within a zone
Once traffic has been sent to a zone, it is distributed evenly across all the backends within that zone. HTTP sessions are persistent depending on the session affinity setting. Unless the backend becomes unavailable, existing TCP connections never move to a different backend. This means that long-lived connections continue going to the same backend Pod even if new connections overflow because of limited capacity. The load balancer prioritizes maintaining existing connections over new ones.
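If you need explicit control over the session affinity behavior described above, you can configure it on the Service's backends with a `GCPBackendPolicy`. The following manifest is a minimal sketch, assuming a Service named `store` and client-IP-based affinity; verify the exact fields supported by your GKE version:

```yaml
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: store-affinity
spec:
  default:
    sessionAffinity:
      type: CLIENT_IP   # pin each client IP to the same backend Pod
  targetRef:
    group: ""
    kind: Service
    name: store         # assumed Service name
```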
Service capacity
With Service capacity, you can define a requests per second (RPS) value per Pod in a Service. This value represents the maximum average RPS that each Pod in the Service can receive. This value is configurable on Services and is used to determine traffic-based autoscaling and capacity-based load balancing.
Requirements
Service capacity has the following requirements and limitations:
- Only supported with the GatewayClass resources and Ingress types defined in Traffic management support.
- Only impacts load balancing if you are using traffic-based autoscaling or multi-cluster Gateways. If you are not using these capabilities, Service capacity has no effect on network traffic.
Configure Service capacity
Single-cluster Gateways
Ensure that your GKE cluster is running version 1.31.1-gke.2008000 or later. Earlier versions can use the `networking.gke.io/max-rate-per-endpoint` annotation as described in the Multi-cluster Gateways tab.
To configure Service capacity with single-cluster Gateways, create a Service and an associated `GCPBackendPolicy`. Use the following manifest to create a Service:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: store
spec:
  ports:
  - port: 8080
    targetPort: 8080
    name: http
  selector:
    app: store
  type: ClusterIP
```
Configure the `GCPBackendPolicy` object using the `maxRatePerEndpoint` field with a maximum RPS. Use the following manifest to configure the `GCPBackendPolicy` object:
```yaml
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: store
spec:
  default:
    maxRatePerEndpoint: RATE_PER_SECOND
  targetRef:
    group: ""
    kind: Service
    name: store
```
Multi-cluster Gateways
To use multi-cluster Gateways to configure Service capacity, create a
Service using the networking.gke.io/max-rate-per-endpoint
annotation. Use the
following manifest to create a Service with a maximum RPS:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: store
  annotations:
    networking.gke.io/max-rate-per-endpoint: "RATE_PER_SECOND"
spec:
  ports:
  - port: 8080
    targetPort: 8080
    name: http
  selector:
    app: store
  type: ClusterIP
```
Replace RATE_PER_SECOND
with the maximum HTTP/HTTPS
requests per second that a single Pod in this Service should receive.
The maxRatePerEndpoint
value creates a dynamic capacity for a Service based
on the number of Pods in the Service. The total Service capacity value is
calculated by multiplying the maxRatePerEndpoint
value with the number of
replicas, as described in the following formula:
Total Service capacity = maxRatePerEndpoint * number of replicas
If an autoscaler scales up the number of Pods within a Service, then the Service's total capacity is computed accordingly. If a Service is scaled down to zero Pods, then it has zero capacity and does not receive any traffic from the load balancer.
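For example, assuming a hypothetical `maxRatePerEndpoint` of 10 RPS, the total Service capacity tracks the replica count as follows:

```
maxRatePerEndpoint = 10 RPS
 3 replicas  -> total Service capacity = 10 * 3  = 30 RPS
10 replicas  -> total Service capacity = 10 * 10 = 100 RPS
 0 replicas  -> total Service capacity = 0 RPS (the Service receives no traffic)
```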
Service capacity and standalone NEGs
Service capacity can also be configured when using standalone NEGs; however, it does not use the `maxRatePerEndpoint` setting. When using standalone NEGs, the maximum rate per endpoint is configured manually when adding the NEG to a backend service resource. Using the `gcloud compute backend-services add-backend` command, the `--max-rate-per-endpoint` flag can configure capacity for each NEG individually.
This can be useful for any of the following workflows:
- When deploying internal and external load balancers manually using standalone NEGs
- When deploying Cloud Service Mesh on GKE using standalone NEGs
There is no functional difference when configuring service capacity with standalone NEGs. Both traffic autoscaling and traffic spillover are supported.
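As a sketch of what this might look like, the following command attaches a standalone NEG to a backend service with a per-endpoint rate. The backend service name, NEG name, and zone (`store-backend-service`, `store-neg`, `us-central1-a`) are placeholder assumptions:

```
gcloud compute backend-services add-backend store-backend-service \
    --global \
    --network-endpoint-group=store-neg \
    --network-endpoint-group-zone=us-central1-a \
    --balancing-mode=RATE \
    --max-rate-per-endpoint=10
```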
Determine your Service's capacity
Determining the value for `maxRatePerEndpoint` requires an understanding of your application's performance characteristics and your load balancing goals. The following strategies can help you define your application's performance characteristics:
- Observe your application in both test and production environments when configured without Service capacity.
- Use Cloud Monitoring to create a correlation between traffic requests and your performance service level objectives (SLOs).
- Use load balancer metrics, such as the request count (`https/request_count`), to map RPS levels.
- Define what your performance SLOs are for your application. They might be one
or more of the following, depending on what you consider "bad" or "unstable"
performance. All of the following can be gathered from Cloud Monitoring
load balancer metrics:
- Response error codes
- Response or total latency
- Backend unhealthiness or downtime
- Observe your application under traffic load in both test and production environments. In test environments, stress your application under increasing request load so you can see how the different performance metrics are impacted as traffic increases. In production environments, observe realistic traffic patterns and levels.
Default Service capacity
All Services attached to GKE load balancing resources have a default Service capacity, even if it is not explicitly configured. To learn more, see Default service capacity.
The following table describes the default capacities:
Load balancing resource type | Default maxRatePerEndpoint |
---|---|
Ingress (internal and external) | 1 RPS |
Gateway (all GatewayClasses) | 100,000,000 RPS |
MultiClusterIngress | 100,000,000 RPS |
Traffic-based autoscaling
Traffic-based autoscaling is a capability of GKE that natively integrates traffic signals from load balancers to autoscale Pods. Traffic-based autoscaling is only supported for single-cluster Gateways.
To use traffic-based autoscaling, see Autoscaling based on load balancer traffic.
Traffic-based autoscaling provides the following benefits:
- Applications which are not strictly CPU or memory bound might have capacity limits which are not reflected in their CPU or memory usage.
- Traffic, or requests per second (RPS), is an easier metric to understand in some cases because it is more aligned with app usage and business metrics such as page views or daily active users (DAUs).
- Traffic is a leading indicator that represents instantaneous demand, compared with CPU or memory, which are lagging indicators.
- The combination of CPU, memory, and traffic autoscaling metrics provides a holistic way of autoscaling applications that uses multiple dimensions to ensure that capacity is appropriately provisioned.
The following diagram demonstrates how traffic-based autoscaling works:
In the diagram:
- The Service owner configures Service capacity and a target utilization for the Deployment.
- The Gateway receives traffic from clients going to the `store` Service. The Gateway sends utilization telemetry to the GKE Pod Autoscaler. Utilization is equal to the actual traffic received by an individual Pod divided by the Pod's configured capacity.
- The GKE Pod Autoscaler scales Pods up or down according to the configured target utilization.
Autoscaling behavior
The following diagram shows how traffic-based autoscaling works on an application receiving 10 RPS through the load balancer:
In the diagram, the Service owner has configured the capacity of the `store` Service to 10 RPS, which means that each Pod can receive a maximum of 10 RPS. The HorizontalPodAutoscaler is configured with `averageValue` set to `70`, which means that the target utilization is 70% of 10 RPS per Pod.
The autoscaler attempts to scale replicas to achieve the following equation:
replicas = ceiling[ current traffic / ( averageValue * maxRatePerEndpoint) ]
In the diagram, this equation computes to:
ceiling[ 10 rps / (0.7 * 10 rps) ] = ceiling[ 1.4 ] = 2 replicas
10 RPS of traffic results in 2 replicas. Each replica receives 5 RPS, which is under the target utilization of 7 RPS.
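As an illustrative sketch of this configuration, the following HorizontalPodAutoscaler targets 70% utilization of the load balancer capacity. The Deployment and Service names and the metric identifier are assumptions; see Autoscaling based on load balancer traffic for the authoritative manifest for your GKE version:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: store
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: store               # assumed Deployment backing the store Service
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      describedObject:
        kind: Service
        name: store           # Service whose load balancer traffic is measured
      metric:
        # Assumed identifier for the load balancer capacity utilization metric
        name: "autoscaling.googleapis.com|gclb-capacity-utilization"
      target:
        type: AverageValue
        averageValue: 70      # target 70% of maxRatePerEndpoint per Pod
```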
Traffic splitting
Traffic splitting uses an explicit ratio, called a weight, that defines the proportion of HTTP requests that are sent to a Service. HTTPRoute resources let you configure weights on a list of Services. The relative weights between Services define the split of traffic between them. This is useful for splitting traffic during rollouts, canarying changes, or for emergencies.
The following diagram describes an example traffic splitting configuration:
In the diagram:
- The Service owner configures two Services for a single route, with a rule splitting traffic 90% to `store-v1` and 10% to `store-v2` (a manifest sketch of this rule follows this list).
- The Gateway receives traffic from clients going to the URL of the store application, and traffic is split according to the configured rule. 90% of traffic routes to `store-v1` and 10% routes to `store-v2`.
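The following HTTPRoute manifest is a minimal sketch of this weighted split. The Gateway name (`external-http`) and hostname are assumptions; the `weight` fields carry the 90/10 ratio:

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: store
spec:
  parentRefs:
  - kind: Gateway
    name: external-http        # assumed Gateway name
  hostnames:
  - "store.example.com"        # assumed hostname for the store application
  rules:
  - backendRefs:
    - name: store-v1
      port: 8080
      weight: 90                # 90% of requests
    - name: store-v2
      port: 8080
      weight: 10                # 10% of requests
```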
Traffic splitting is supported between Services in the same cluster and also between Services in different clusters:

- Traffic splitting between Services: used for splitting traffic for application version rollouts. Using the traffic splitting example, you would have two separate Deployments, `store-v1` and `store-v2`, which each have their own Service, `store-v1` and `store-v2`. Weights are configured between the two Services to gradually shift traffic until `store-v2` is fully rolled out.
- Traffic splitting between ServiceImports: used for shifting traffic to or from specific clusters for maintenance, migration, or emergencies. ServiceImports represent multi-cluster Services and enable traffic splitting between different Services on different clusters (a manifest sketch follows this list). The exercise Blue-green, multi-cluster routing with Gateway demonstrates splitting traffic across clusters.
Weight vs capacity
Weights and capacities both control how much traffic is sent to different Services. While they have similar effects, they operate differently and have different use cases. They can and should be used together, though for different purposes.
Weight
Weight is an explicit control of traffic. It defines the exact proportions of traffic, independent of incoming traffic and backend utilization. In the traffic splitting example, if `store-v2` were over capacity, or if all of its replicas failed, 10% of the traffic would still be allocated to `store-v2`, potentially causing traffic to be dropped. That is because weight does not change the proportion of traffic based on utilization or health.
Weight is best suited for the following use cases:
- Shifting traffic between different versions of a service for rollouts.
- Manually onboarding services using explicit traffic splits.
- Shifting traffic away from a set of backends for emergency or maintenance purposes.
Capacity
Capacity is an implicit control of traffic. It defines the proportions of traffic indirectly as they depend on the amount of incoming traffic, backend utilization, and the source location of traffic. Capacity is an inherent property of a Service and is typically updated much less frequently.
Capacity is best suited for the following use cases:
- Preventing backend over-utilization during traffic spikes.
- Controlling the rate of autoscaling with respect to traffic.
Configuring Service capacity to overflow traffic may not always be a behavior that you want. Consider the global load balancing example. Service capacity protects backends from over-utilization by overflowing traffic, but this might result in extra latency for the requests that have overflowed, since those requests are traveling to a more remote region.
If your application is not very sensitive to overutilization then you might want to configure a very high Service capacity so that traffic is unlikely to ever overflow to another region. If your application's availability or latency is sensitive to overutilization, then overflowing traffic to other clusters or regions may be better than absorbing excess traffic on over-utilized backends. To learn more about how to configure Service capacity for your application, see Determine your Service's capacity.
What's next
- Learn about Deploying Gateways.
- Learn about Deploying multi-cluster Gateways.