Troubleshoot external passthrough Network Load Balancers

This guide describes how to troubleshoot configuration issues for a Google Cloud external passthrough Network Load Balancer. Before investigating issues, familiarize yourself with the following pages:

Troubleshoot common issues with Network Analyzer

Network Analyzer automatically monitors your VPC network configuration and detects both suboptimal configurations and misconfigurations. It identifies network failures, provides root cause information, and suggests possible resolutions. To learn about the different misconfiguration scenarios that are automatically detected by Network Analyzer, see Load balancer insights in the Network Analyzer documentation.

Network Analyzer is available in the Google Cloud console as a part of Network Intelligence Center.

Troubleshoot setup issues

Backends have incompatible balancing modes

When creating a load balancer, you might see the error:

Validation failed for instance group INSTANCE_GROUP:

backend services 1 and 2 point to the same instance group
but the backends have incompatible balancing_mode. Values should be the same.

This happens when you try to use the same backend in two different load balancers, and the backends don't have compatible balancing modes.

For more information, see the following:

Troubleshoot general connectivity issues

If you can't connect to your external passthrough Network Load Balancer, check for the following common issues:

Verify firewall rules.
- Ensure that ingress allow firewall rules are defined to permit health checks to backend VMs.
- Ensure that ingress allow firewall rules allow traffic to the backend VMs from clients.
- Ensure that relevant firewall rules exist to allow traffic to reach the backend VMs on the ports being used by the load balancer.
- If you're using target tags for the firewall rules, make sure that the load balancer's backend VMs are tagged appropriately.
To learn how to configure firewall rules required by your external passthrough Network Load Balancer, see Configuring firewall rules.
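As a rough illustration of the firewall checks above, the following sketch wraps gcloud compute firewall-rules create in a small helper that creates one ingress allow rule for backend VMs that carry a given target tag. The helper name and every argument value are hypothetical. The source ranges in the sample call are the health check probe ranges documented for external passthrough Network Load Balancers; confirm them against Configuring firewall rules before relying on them.

```shell
# Hypothetical helper: create an ingress allow rule for one TCP port.
# Arguments: RULE_NAME NETWORK TARGET_TAG PORT SOURCE_RANGES (comma-separated)
allow_ingress_to_backends() {
  rule_name="$1"; network="$2"; target_tag="$3"; port="$4"; source_ranges="$5"
  gcloud compute firewall-rules create "$rule_name" \
    --network "$network" \
    --direction INGRESS \
    --action ALLOW \
    --rules "tcp:$port" \
    --source-ranges "$source_ranges" \
    --target-tags "$target_tag"
}

# Sample call (all values are placeholders):
#   allow_ingress_to_backends allow-lb-health-checks my-network lb-backend 80 \
#     35.191.0.0/16,209.85.152.0/22,209.85.204.0/22
```

A second call with the client source ranges (for example 0.0.0.0/0, if the service is public) would cover traffic from clients on the same port.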
Verify that the Google guest agent is running on the backend VM. If you can connect to a healthy backend VM, but you cannot connect to the load balancer, it might be that the Google guest agent (formerly, the Windows Guest Environment or Linux Guest Environment) on the VM is either not running or is unable to communicate with the metadata server (metadata.google.internal, 169.254.169.254).

Check for the following:
- Ensure that the Google guest agent is installed and running on the backend VM.
- Ensure that the firewall rules within the guest operating system of the backend VM (iptables or Windows Firewall) don't block access to the metadata server.
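On a Linux backend VM, the two checks above could be sketched as follows. This assumes a systemd-based image where the agent runs as a unit named google-guest-agent, and it uses the instance/id metadata path only as a convenient probe; both helper names are hypothetical.

```shell
# Returns 0 when the guest agent's systemd unit is active.
# Assumption: the unit is named google-guest-agent on this image.
guest_agent_active() {
  systemctl is-active --quiet google-guest-agent
}

# Returns 0 when the metadata server answers from inside the guest.
# A failure here can point to guest firewall rules that block 169.254.169.254.
metadata_server_reachable() {
  curl --silent --fail --max-time 5 \
    --header "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/id" > /dev/null
}

# Example use on the backend VM:
#   guest_agent_active || echo "guest agent is not running"
#   metadata_server_reachable || echo "metadata server is unreachable"
```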
Verify that backend VMs are accepting packets sent to the load balancer. Each backend VM must be configured to accept packets sent to the load balancer. That is, the destination of packets delivered to the backend VMs is the IP address of the load balancer. Under most circumstances, this is implemented with a local route.

For VMs created from Google Cloud images, the Google guest agent installs the local route for the load balancer's IP address. Google Kubernetes Engine instances based on Container-Optimized OS implement this by using iptables instead.

On a Linux backend VM, you can verify the presence of the local route by running the following command. Replace LOAD_BALANCER_IP with the load balancer's IP address:
```
sudo ip route list table local | grep LOAD_BALANCER_IP
```
Verify service IP address and port binding on the backend VMs. Packets sent to an external passthrough Network Load Balancer arrive at backend VMs with the destination IP address of the load balancer itself. This type of load balancer is not a proxy, and this is expected behavior.

To see the services listening on a port, run the following command. Replace PORT with a port that's included in the load balancer's forwarding rule:
```
netstat -nl | grep ':PORT'
```
The software running on the backend VM must be doing the following:
- Listening on (bound to) the load balancer's IP address or any IP address (0.0.0.0 or ::)
- Listening on (bound to) a port that's included in the load balancer's forwarding rule
To test this, connect to a backend VM by using either SSH or RDP. Then perform the following tests by using curl, telnet, or a similar tool:
- Attempt to reach the service by contacting it using the internal IP address of the backend VM itself, 127.0.0.1, or localhost.
- Attempt to reach the service by contacting it using the IP address of the load balancer's forwarding rule.
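The binding requirement described above can also be checked mechanically. The following sketch reads netstat -nl output on standard input and succeeds only if some socket is bound to the load balancer's IP address, 0.0.0.0, or :: on the given port. The helper name is hypothetical, and the sketch assumes the usual Linux netstat layout in which the local address is the fourth column.

```shell
# Hypothetical helper: succeed if a socket is bound to LB_IP or a wildcard
# address on PORT. Expects `netstat -nl` output on standard input.
listens_for_lb() {
  awk -v ip="$1" -v port="$2" '
    $4 == (ip ":" port) || $4 == ("0.0.0.0:" port) || $4 == (":::" port) { found = 1 }
    END { exit !found }
  '
}

# Example use on the backend VM (placeholders as in the text above):
#   netstat -nl | listens_for_lb LOAD_BALANCER_IP PORT && echo "binding looks correct"
```

A service bound only to 127.0.0.1 or to the VM's own internal IP address fails this check, which matches the symptom of reaching the service locally but not through the load balancer.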
Verify that health check traffic can reach backend VMs. To verify that health check traffic reaches your backend VMs, enable health check logging and search for successful log entries.
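A minimal sketch of that step, assuming a regional TCP health check and that health check log entries are written under a log name containing compute.googleapis.com%2Fhealthchecks. The helper names are hypothetical, and the argument values are placeholders.

```shell
# Hypothetical helper: turn on logging for a regional TCP health check.
# Arguments: HEALTH_CHECK_NAME REGION
enable_health_check_logging() {
  gcloud compute health-checks update tcp "$1" --region "$2" --enable-logging
}

# Hypothetical helper: list recent health check log entries for a project.
# Arguments: PROJECT_ID
recent_health_check_logs() {
  gcloud logging read 'logName:"compute.googleapis.com%2Fhealthchecks"' \
    --project "$1" --limit 20
}

# Example use:
#   enable_health_check_logging my-tcp-health-check us-central1
#   recent_health_check_logs my-project
```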

Troubleshoot Shared VPC issues

If you are using Shared VPC and you cannot create a new external passthrough Network Load Balancer in a particular subnet, an organization policy might be the cause. In the organization policy, add the subnet to the list of allowed subnets or contact your organization administrator. For more information, see the constraints/compute.restrictSharedVpcSubnetworks constraint.

Troubleshoot failover issues

If you've configured failover for an external passthrough Network Load Balancer, use the following steps to verify your configuration:

Make sure that you've designated at least one failover backend.
Verify your failover policy settings.
Make sure that you understand how membership in the active pool works, and when Google Cloud performs failover and failback.
Inspect your load balancer's configuration by doing the following:
- Use the Google Cloud console to check for the number of healthy backend VMs in each backend instance group. The Google Cloud console also shows you which VMs are in the active pool.
- Make sure that your load balancer's failover ratio is set appropriately. For example, if you have ten primary VMs and a failover ratio set to 0.2, this means that Google Cloud performs a failover when fewer than two (10 × 0.2 = 2) primary VMs are healthy. A failover ratio of 0.0 has a special meaning: Google Cloud performs a failover when no primary VMs are healthy.
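The failover ratio arithmetic above can be worked through with a small sketch. It only models the rule as stated in this section (fail over when the number of healthy primary VMs is below total × ratio, with 0.0 meaning fail over only when no primary VM is healthy). The helper name is hypothetical, and the sketch does not query Google Cloud.

```shell
# Hypothetical helper: print "failover" or "primary" for a given state.
# Arguments: HEALTHY_PRIMARY_VMS TOTAL_PRIMARY_VMS FAILOVER_RATIO
failover_decision() {
  awk -v healthy="$1" -v total="$2" -v ratio="$3" 'BEGIN {
    if (ratio + 0 == 0) {
      # Special case: fail over only when no primary VM is healthy.
      result = (healthy + 0 == 0) ? "failover" : "primary"
    } else {
      # General case: fail over when healthy VMs drop below total * ratio.
      result = (healthy + 0 < total * ratio) ? "failover" : "primary"
    }
    print result
  }'
}

failover_decision 1 10 0.2   # prints "failover": 1 is fewer than 10 * 0.2 = 2
failover_decision 2 10 0.2   # prints "primary": 2 is not fewer than 2
failover_decision 0 10 0.0   # prints "failover": no primary VM is healthy
```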

Other issues that can occur are the following:

The active pool is changing back and forth (flapping) between the primary and failover backends.

Using managed instance groups with autoscaling and failover might cause the active pool to repeatedly fail over and fail back between the primary and failover backends. Google Cloud doesn't prevent you from configuring failover with managed instance groups, because your deployment might benefit from this setup.

Disabling connection draining does not work.

Disabling connection draining only works if the backend service is set up with protocol TCP.

The following error message appears if you create a backend service with UDP while connection draining is disabled:

gcloud compute backend-services create my-failover-bs \
  --load-balancing-scheme external \
  --health-checks-region us-central1 \
  --health-checks my-tcp-health-check \
  --region us-central1 \
  --no-connection-drain-on-failover \
  --drop-traffic-if-unhealthy \
  --failover-ratio 0.5 \
  --protocol UDP
ERROR: (gcloud.compute.backend-services.create) Invalid value for
[--protocol]: can only specify --connection-drain-on-failover if the protocol is
TCP.

Existing connections are terminated during failover or failback.

Edit your backend service's failover policy. Ensure that connection draining on failover is enabled.
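One way to make that change from the command line is sketched below. It uses the --connection-drain-on-failover flag named in the error message earlier in this section. The helper name is hypothetical, and the backend service name and region are placeholders.

```shell
# Hypothetical helper: enable connection draining on failover for a
# regional backend service. Arguments: BACKEND_SERVICE_NAME REGION
enable_drain_on_failover() {
  gcloud compute backend-services update "$1" \
    --region "$2" \
    --connection-drain-on-failover
}

# Example use:
#   enable_drain_on_failover my-failover-bs us-central1
```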

Troubleshoot logging issues

If you configure logging for an external passthrough Network Load Balancer, the following issues might occur:

RTT measurements might be missing from some log entries if too few packets are sampled to capture RTT. This is more likely to happen for low-volume connections.
RTT values are available only for TCP flows.
Some packets are sent with no payload. If header-only packets are sampled, the bytes value is 0.