Failover for internal passthrough Network Load Balancers

You can configure an internal passthrough Network Load Balancer to distribute connections among virtual machine (VM) instances in primary backends, and then switch, if needed, to using failover backends. Failover provides one method of increasing availability, while also giving you greater control over how to manage your workload when your primary backend VMs aren't healthy.

This page describes concepts and requirements specific to failover for internal passthrough Network Load Balancers. Make sure that you are familiar with the conceptual information in the following articles before you configure failover for internal passthrough Network Load Balancers:

These concepts are important to understand because configuring failover modifies the internal passthrough Network Load Balancer's standard traffic distribution algorithm.

By default, when you add a backend to an internal passthrough Network Load Balancer's backend service, that backend is a primary backend. You can designate a backend to be a failover backend when you add it to the load balancer's backend service, or by editing the backend service later. Failover backends only receive connections from the load balancer after a configurable ratio of primary VMs don't pass health checks.

Supported instance groups

Managed and unmanaged instance groups are supported as backends. For simplicity, the examples on this page show unmanaged instance groups.

Using managed instance groups with autoscaling and failover might cause the active pool to repeatedly fail over and fail back between the primary and failover backends. Google Cloud doesn't prevent you from configuring failover with managed instance groups because your deployment might benefit from this setup.

Architecture

The following simple example depicts an internal passthrough Network Load Balancer with one primary backend and one failover backend.

  • The primary backend is an unmanaged instance group in us-west1-a.
  • The failover backend is a different unmanaged instance group in us-west1-c.
Failover example for internal passthrough Network Load Balancers.

The next example depicts an internal passthrough Network Load Balancer with two primary backends and two failover backends, both distributed between two zones in the us-west1 region. This configuration increases reliability because it doesn't depend on a single zone for all primary or all failover backends.

For more information about region-specific considerations, see Geography and regions.

  • Primary backends are unmanaged instance groups ig-a and ig-d.
  • Failover backends are unmanaged instance groups ig-b and ig-c.
Multi-zone internal passthrough Network Load Balancer failover.

During failover, both primary backends become inactive, while the healthy VMs in both failover backends become active. For a full explanation of how failover works in this example, see the Failover example.

Backend instance groups and VMs

Unmanaged instance groups in internal passthrough Network Load Balancers are either primary backends or failover backends. You can designate a backend to be a failover backend at the time that you add it to the backend service or by editing the backend after you add it. Otherwise, unmanaged instance groups are primary by default.

You can configure multiple primary backends and multiple failover backends in a single internal passthrough Network Load Balancer by adding them to the load balancer's backend service.

A primary VM is a member of an instance group that you've defined to be a primary backend. The VMs in a primary backend participate in the load balancer's active pool (described in the next section), unless the load balancer switches to using its failover backends.

A backup VM is a member of an instance group that you've defined to be a failover backend. The VMs in a failover backend participate in the load balancer's active pool when primary VMs become unhealthy. The number of unhealthy VMs that triggers failover is a configurable percentage.
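
As a rough illustration, the primary or failover designation can be pictured as a per-backend flag. The sketch below uses plain Python dictionaries shaped like the `backends` list of a backend service, assuming each entry carries a `group` URL and an optional boolean `failover` field. The project name `example-project`, the group names `ig-primary` and `ig-failover`, and the helper `split_backends` are hypothetical and exist only for this example.

```python
# Sketch only: hypothetical project and instance group names.
# Assumption: a backend entry without "failover": True is a primary backend.
backends = [
    {"group": "projects/example-project/zones/us-west1-a/instanceGroups/ig-primary"},
    {
        "group": "projects/example-project/zones/us-west1-c/instanceGroups/ig-failover",
        "failover": True,
    },
]


def split_backends(backends):
    """Return (primary, failover) lists of instance group URLs."""
    primary = [b["group"] for b in backends if not b.get("failover", False)]
    failover = [b["group"] for b in backends if b.get("failover", False)]
    return primary, failover


primary_groups, failover_groups = split_backends(backends)
print("primary:", primary_groups)
print("failover:", failover_groups)
```

Leaving the flag unset mirrors the default described above: a backend is primary unless you mark it as a failover backend.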

Limits

  • VMs. You can have up to 250 VMs in the active pool. In other words, your primary backend instance groups can have a total of up to 250 primary VMs, and your failover backend instance groups can have a total of up to 250 backup VMs.

  • Unmanaged instance groups. You can have up to 50 primary backend instance groups and up to 50 failover backend instance groups.

As an example, a maximum deployment might have 5 primary backends and 5 failover backends, with each instance group containing 50 VMs.
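
The arithmetic behind these limits can be checked with a small sketch. The helper `within_limits` is a hypothetical name for this example. It assumes the limits apply separately to each role, as described above: once to the primary instance groups and once to the failover instance groups.

```python
# Limits taken from the list above, applied per role (primary or failover).
MAX_VMS_PER_ROLE = 250
MAX_GROUPS_PER_ROLE = 50


def within_limits(group_sizes):
    """group_sizes: VM counts, one per instance group of a single role."""
    too_many_groups = len(group_sizes) > MAX_GROUPS_PER_ROLE
    too_many_vms = sum(group_sizes) > MAX_VMS_PER_ROLE
    return not (too_many_groups or too_many_vms)


# 5 instance groups of 50 VMs each: 5 x 50 = 250 VMs, exactly at the VM limit.
print(within_limits([50] * 5))
# 6 instance groups of 50 VMs each: 300 VMs, over the VM limit.
print(within_limits([50] * 6))
```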

Active pool

The active pool is the collection of backend VMs to which an internal passthrough Network Load Balancer sends new connections. Membership of backend VMs in the active pool is computed automatically based on which backends are healthy and conditions that you can specify, as described in Failover ratio.

The active pool never combines primary VMs and backup VMs. The following examples clarify the membership possibilities. During failover, the active pool contains only backup VMs. During normal operation (failback), the active pool contains only primary VMs.

Active pool on failover and failback.

Failover and failback

Failover and failback are the automatic processes that switch backend VMs into or out of the load balancer's active pool. When Google Cloud removes primary VMs from the active pool and adds healthy backup VMs to the active pool, the process is called failover. When Google Cloud reverses this, the process is called failback.

Failover policy

A failover policy is a collection of parameters that Google Cloud uses for failover and failback. Each internal passthrough Network Load Balancer has one failover policy that has multiple settings:

  • Failover ratio
  • Dropping traffic when all backend VMs are unhealthy
  • Connection draining on failover and failback
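
One way to picture the three settings together is as a small record. The sketch below is illustrative only: the `FailoverPolicy` class is hypothetical, and the camelCase keys produced by `to_api_dict` are an assumption about how the `failoverPolicy` object of a backend service names these fields, so verify them against the API reference before relying on them. The defaults follow the behavior described on this page: a ratio of 0.0, no dropping of traffic, and connection draining left enabled.

```python
from dataclasses import dataclass


@dataclass
class FailoverPolicy:
    """Hypothetical container for the three failover policy settings."""

    failover_ratio: float = 0.0
    drop_traffic_if_unhealthy: bool = False
    disable_connection_drain_on_failover: bool = False

    def __post_init__(self):
        # The ratio can be from 0.0 to 1.0, inclusive.
        if not 0.0 <= self.failover_ratio <= 1.0:
            raise ValueError("failover ratio must be between 0.0 and 1.0")

    def to_api_dict(self):
        # Assumption: these key names match the backend service resource.
        return {
            "failoverRatio": self.failover_ratio,
            "dropTrafficIfUnhealthy": self.drop_traffic_if_unhealthy,
            "disableConnectionDrainOnFailover": self.disable_connection_drain_on_failover,
        }


policy = FailoverPolicy(failover_ratio=0.5)
print(policy.to_api_dict())
```

Validating the ratio when the object is created surfaces an out-of-range value early, before any request is built from it.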

Failover ratio

A configurable failover ratio determines when Google Cloud performs a failover or failback, changing membership in the active pool. The ratio can be from 0.0 to 1.0, inclusive. If you don't specify a failover ratio, Google Cloud uses a default value of 0.0. It's a best practice to set your failover ratio to a number that works for your use case rather than relying on this default.

Conditions and the resulting VMs in the active pool:

  • All healthy primary VMs are in the active pool when either of the following is true:
      1. The failover ratio (x) != 0.0, and the ratio of healthy primary VMs >= x.
      2. The failover ratio (x) = 0.0, and the number of healthy primary VMs > 0.
  • All healthy backup VMs are in the active pool when at least one backup VM is healthy and either of the following is true:
      1. The failover ratio (x) != 0.0, and the ratio of healthy primary VMs < x.
      2. The failover ratio (x) = 0.0, and the number of healthy primary VMs = 0.
  • All primary VMs are in the active pool, as a last resort, when all primary VMs and all backup VMs are unhealthy and you haven't configured your load balancer to drop traffic during this situation.

The following examples clarify membership in the active pool. For an example with calculations, see the Failover example.

  • A failover ratio of 1.0 requires that all primary VMs be healthy. When at least one primary VM becomes unhealthy, Google Cloud performs a failover, moving the backup VMs into the active pool.
  • A failover ratio of 0.1 requires that at least 10% of the primary VMs be healthy; otherwise, Google Cloud performs a failover.
  • A failover ratio of 0.0 means that Google Cloud performs a failover only when all the primary VMs are unhealthy. Failover doesn't happen if at least one primary VM is healthy.
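
The three examples above can be sketched as a single predicate. The function name `primaries_stay_active` is hypothetical, and the sketch evaluates only the primary-side condition. An actual failover also requires at least one healthy backup VM, which this function doesn't model.

```python
def primaries_stay_active(healthy_primary, total_primary, failover_ratio):
    """Return True if the healthy primary VMs satisfy the failover ratio.

    Assumes total_primary > 0 and 0.0 <= failover_ratio <= 1.0.
    """
    if failover_ratio == 0.0:
        # With a ratio of 0.0, a single healthy primary VM is enough.
        return healthy_primary > 0
    return healthy_primary / total_primary >= failover_ratio


# Ratio 1.0: one unhealthy primary VM out of four is enough to fail over.
print(primaries_stay_active(3, 4, 1.0))
# Ratio 0.1: at least 10% of primary VMs must be healthy.
print(primaries_stay_active(1, 10, 0.1))
# Ratio 0.0: fail over only when no primary VM is healthy.
print(primaries_stay_active(0, 10, 0.0))
```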

An internal passthrough Network Load Balancer distributes connections among VMs in the active pool according to the traffic distribution algorithm.

Dropping traffic when all backend VMs are unhealthy

By default, when all primary and backup VMs are unhealthy, Google Cloud distributes new connections among only the primary VMs. It does so as a last resort. The backup VMs are excluded from this last-resort distribution of connections.

If you prefer, you can configure your internal passthrough Network Load Balancer to drop new connections when all primary and backup VMs are unhealthy.
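
The difference between the two behaviors can be sketched as follows. The function `last_resort_targets` and the VM names are hypothetical. The sketch covers only the situation in which every primary and backup VM is unhealthy.

```python
def last_resort_targets(primary_vms, drop_traffic_if_unhealthy=False):
    """Targets for new connections when all primary and backup VMs are unhealthy."""
    if drop_traffic_if_unhealthy:
        # New connections are dropped: no VM receives them.
        return []
    # Default: distribute among the primary VMs only, never the backup VMs.
    return list(primary_vms)


primary_vms = ["primary-1", "primary-2"]  # hypothetical names, all unhealthy
print(last_resort_targets(primary_vms))
print(last_resort_targets(primary_vms, drop_traffic_if_unhealthy=True))
```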

Connection draining on failover and failback

Connection draining allows existing TCP sessions to remain active for up to a configurable time period even after backend VMs become unhealthy. If the protocol for your load balancer is TCP, the following is true:

  • By default, connection draining is enabled. Existing TCP sessions can persist on a backend VM for up to 300 seconds (5 minutes), even if the backend VM becomes unhealthy or isn't in the load balancer's active pool.

  • You can disable connection draining during failover and failback events. Disabling connection draining during failover and failback ensures that all TCP sessions, including established ones, are quickly terminated. Connections to backend VMs might be closed with a TCP reset packet.

Disabling connection draining on failover and failback is useful for scenarios such as the following:

  • Patching backend VMs. Prior to patching, configure your primary VMs to fail health checks so that the load balancer performs a failover. Disabling connection draining ensures that all connections are moved to the backup VMs quickly and in a planned fashion. This allows you to install updates and restart the primary VMs without existing connections persisting. After patching, Google Cloud can perform a failback when a sufficient number of primary VMs (as defined by the failover ratio) pass their health checks.

  • Single backend VM for data consistency. If you need to ensure that only one primary VM is the destination for all connections, disable connection draining so that switching from a primary to a backup VM does not allow existing connections to persist on both. This reduces the possibility of data inconsistencies by keeping just one backend VM active at any given time.

Failover example

The following example describes failover behavior for the multi-zone internal passthrough Network Load Balancer example presented in the architecture section.

Multi-zone internal passthrough Network Load Balancer failover.

The primary backends for this load balancer are the unmanaged instance groups ig-a in us-west1-a and ig-d in us-west1-c. Each instance group contains two VMs. All four VMs from both instance groups are primary VMs:

  • vm-a1 in ig-a
  • vm-a2 in ig-a
  • vm-d1 in ig-d
  • vm-d2 in ig-d

The failover backends for this load balancer are the unmanaged instance groups ig-b in us-west1-a and ig-c in us-west1-c. Each instance group contains two VMs. All four VMs from both instance groups are backup VMs:

  • vm-b1 in ig-b
  • vm-b2 in ig-b
  • vm-c1 in ig-c
  • vm-c2 in ig-c

Suppose you want to configure a failover policy for this load balancer such that new connections are delivered to backup VMs when the number of healthy primary VMs is fewer than two. To accomplish this, set the failover ratio to 0.5 (50%). Google Cloud uses the failover ratio to calculate the minimum number of primary VMs that must be healthy by multiplying the failover ratio by the number of primary VMs: 4 × 0.5 = 2.

When all four primary VMs are healthy, Google Cloud distributes new connections to all of them. When primary VMs fail health checks:

  • If vm-a1 and vm-d1 become unhealthy, Google Cloud distributes new connections between the remaining two healthy primary VMs, vm-a2 and vm-d2, because the number of healthy primary VMs is at least the minimum.

  • If vm-a2 also fails health checks, leaving only one healthy primary VM, vm-d2, Google Cloud recognizes that the number of healthy primary VMs is fewer than the minimum, so it performs a failover. The active pool is set to the four healthy backup VMs, and new connections are distributed among those four (in instance groups ig-b and ig-c). Even though vm-d2 remains healthy, it is removed from the active pool and does not receive new connections.

  • If vm-a2 recovers and passes its health check, Google Cloud recognizes that the number of healthy primary VMs is at least the minimum of two, so it performs a failback. The active pool is set to the two healthy primary VMs, vm-a2 and vm-d2, and new connections are distributed between them. All backup VMs are removed from the active pool.

  • As other primary VMs recover and pass their health checks, Google Cloud adds them to the active pool. For example, if vm-a1 becomes healthy, Google Cloud sets the active pool to the three healthy primary VMs, vm-a1, vm-a2, and vm-d2, and distributes new connections among them.
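
The sequence above can be replayed with a short simulation. This is a sketch under simplifying assumptions, not a description of how Google Cloud implements failover: the function `active_pool` is hypothetical, and all four backup VMs are assumed to stay healthy throughout, as they do in this example.

```python
PRIMARY = ["vm-a1", "vm-a2", "vm-d1", "vm-d2"]
BACKUP = ["vm-b1", "vm-b2", "vm-c1", "vm-c2"]
FAILOVER_RATIO = 0.5
MINIMUM_HEALTHY = FAILOVER_RATIO * len(PRIMARY)  # 4 x 0.5 = 2


def active_pool(unhealthy_primary):
    """Return the active pool, assuming every backup VM is healthy."""
    healthy_primary = [vm for vm in PRIMARY if vm not in unhealthy_primary]
    if len(healthy_primary) >= MINIMUM_HEALTHY:
        return healthy_primary
    return list(BACKUP)  # failover: only backup VMs are in the active pool


steps = [
    ("all primary VMs healthy", set()),
    ("vm-a1 and vm-d1 unhealthy", {"vm-a1", "vm-d1"}),
    ("vm-a2 also unhealthy", {"vm-a1", "vm-d1", "vm-a2"}),
    ("vm-a2 recovers", {"vm-a1", "vm-d1"}),
    ("vm-a1 recovers", {"vm-d1"}),
]

for description, unhealthy in steps:
    print(f"{description}: {active_pool(unhealthy)}")
```

In the third step, the pool switches to the four backup VMs even though vm-d2 is still healthy, which matches the failover described above.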

What's next