Troubleshoot Cloud NAT packet loss from a cluster


This page shows you how to resolve issues with Cloud NAT packet loss from a VPC-native Google Kubernetes Engine (GKE) private cluster.

Node VMs in VPC-native GKE private clusters don't have external IP addresses. This means that clients on the internet cannot connect to the IP addresses of the nodes. You can use Cloud NAT to allocate the external IP addresses and ports that allow private clusters to make public connections.

If a node VM runs out of its allocation of external ports and IP addresses from Cloud NAT, packets are dropped. To avoid this, you can reduce the outbound packet rate or increase the allocation of available Cloud NAT source IP addresses and ports. The following sections describe how to diagnose and troubleshoot packet loss from Cloud NAT in the context of GKE private clusters.
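To make the allocation limit concrete, the following Python sketch estimates how many node VMs one NAT IP address can serve under static port allocation. It assumes that each Cloud NAT IP address provides 64,512 usable source ports and that every VM reserves exactly its minimum port count; both are simplifying assumptions to verify against the Cloud NAT documentation, and the function names are hypothetical.

```python
# Assumption: one Cloud NAT IP address offers 64,512 usable source ports
# (65,536 minus the 1,024 well-known ports). Check the Cloud NAT docs.
PORTS_PER_NAT_IP = 64_512


def max_vms_per_nat_ip(min_ports_per_vm: int) -> int:
    """Return how many VMs one NAT IP can serve with static allocation."""
    if min_ports_per_vm <= 0:
        raise ValueError("min_ports_per_vm must be positive")
    return PORTS_PER_NAT_IP // min_ports_per_vm


def nat_ips_needed(vm_count: int, min_ports_per_vm: int) -> int:
    """Return the number of NAT IPs needed for vm_count VMs (rounded up)."""
    if vm_count < 0 or min_ports_per_vm <= 0:
        raise ValueError("vm_count must be >= 0 and min_ports_per_vm > 0")
    total_ports = vm_count * min_ports_per_vm
    return -(-total_ports // PORTS_PER_NAT_IP)  # ceiling division


if __name__ == "__main__":
    print(max_vms_per_nat_ip(64))    # VMs per NAT IP at 64 ports each
    print(nat_ips_needed(100, 1024))  # NAT IPs for 100 nodes at 1,024 ports
```

The sketch ignores any rounding that Cloud NAT applies to port counts, so treat the results as rough planning figures rather than exact limits.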

Diagnose packet loss

The following sections explain how to log dropped packets using Cloud Logging and how to diagnose the cause of dropped packets using Cloud Monitoring.

Log dropped packets

You can log dropped packets with the following query in Cloud Logging:

resource.type="nat_gateway"
resource.labels.region=REGION
resource.labels.gateway_name=GATEWAY_NAME
jsonPayload.allocation_status="DROPPED"

Replace the following:

  • REGION: the name of the region that the cluster is in.
  • GATEWAY_NAME: the name of the Cloud NAT gateway.

This query returns a list of all packets dropped by a Cloud NAT gateway, but does not identify the cause.
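If you run this query from scripts, a small helper can fill in the placeholders consistently. The following Python sketch builds the filter text shown above; the function name is hypothetical, and quoting the region and gateway name as strings is an assumption about how you want the values escaped. The resulting string can be pasted into Cloud Logging or passed as the filter argument to gcloud logging read.

```python
def build_dropped_packets_filter(region: str, gateway_name: str) -> str:
    """Build the Cloud Logging filter for packets dropped by a NAT gateway."""

    def quote(value: str) -> str:
        # Escape backslashes and double quotes, then wrap in double quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    return "\n".join(
        [
            'resource.type="nat_gateway"',
            f"resource.labels.region={quote(region)}",
            f"resource.labels.gateway_name={quote(gateway_name)}",
            'jsonPayload.allocation_status="DROPPED"',
        ]
    )


if __name__ == "__main__":
    # Example values only; replace with your own region and gateway name.
    print(build_dropped_packets_filter("us-central1", "my-nat-gateway"))
```

The query returns entries only if logging is enabled on the Cloud NAT gateway.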

Monitor causes for packet loss

To identify causes for dropped packets, query Metrics Explorer in Cloud Monitoring. Packets drop for one of three reasons:

  • OUT_OF_RESOURCES
  • ENDPOINT_INDEPENDENT_CONFLICT
  • NAT_ALLOCATION_FAILED

To identify packets dropped due to OUT_OF_RESOURCES or ENDPOINT_INDEPENDENT_CONFLICT error codes, use the following query:

fetch nat_gateway
  metric 'router.googleapis.com/nat/dropped_sent_packets_count'
  filter (resource.gateway_name == GATEWAY_NAME)
  align rate(1m)
  every 1m
  group_by [metric.reason],
    [value_dropped_sent_packets_count_aggregate:
       aggregate(value.dropped_sent_packets_count)]

If you identify packets that drop because of these reasons, see Packets dropped with reason: out of resources and Packets dropped with reason: endpoint independent conflict for troubleshooting advice.

To identify packets dropped due to the NAT_ALLOCATION_FAILED error code, use the following query:

fetch nat_gateway
  metric 'router.googleapis.com/nat/nat_allocation_failed'
  group_by 1m,
    [value_nat_allocation_failed_count_true:
       count_true(value.nat_allocation_failed)]
  every 1m

If you identify packets that dropped for this reason, see Need to allocate more IP addresses.
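Once you have results from both queries, the mapping from drop reason to troubleshooting section can be sketched as a small lookup. The following Python example is illustrative only: the triage function, the shape of its input (a mapping from reason to an observed rate or count), and the sample numbers are assumptions, not output from Cloud Monitoring.

```python
# Section names are taken from the troubleshooting references on this page.
GUIDANCE = {
    "OUT_OF_RESOURCES": "Packets dropped with reason: out of resources",
    "ENDPOINT_INDEPENDENT_CONFLICT": (
        "Packets dropped with reason: endpoint independent conflict"
    ),
    "NAT_ALLOCATION_FAILED": "Need to allocate more IP addresses",
}


def triage(drop_rates: dict[str, float]) -> list[tuple[str, str]]:
    """Return (reason, section) pairs for nonzero rates, highest rate first."""
    active = [(reason, rate) for reason, rate in drop_rates.items() if rate > 0]
    active.sort(key=lambda item: item[1], reverse=True)
    # Unknown reasons fall back to checking the Cloud NAT configuration.
    return [
        (reason, GUIDANCE.get(reason, "Investigate Cloud NAT configuration"))
        for reason, _ in active
    ]


if __name__ == "__main__":
    # Hypothetical sample values for illustration.
    sample = {
        "OUT_OF_RESOURCES": 12.0,
        "ENDPOINT_INDEPENDENT_CONFLICT": 0.0,
        "NAT_ALLOCATION_FAILED": 3.0,
    }
    for reason, section in triage(sample):
        print(f"{reason}: see '{section}'")
```

An empty result corresponds to the case where neither query reports drops, which is when checking the Cloud NAT configuration becomes the next step.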

Investigate Cloud NAT configuration

If the previous queries return empty results, and GKE Pods are unable to communicate to external IP addresses, use the following table to help you troubleshoot your configuration:

Configuration: Cloud NAT configured to apply only to the subnet's primary IP address range.

Troubleshooting: When Cloud NAT is configured only for the subnet's primary IP address range, packets sent from the cluster to external IP addresses must have a source node IP address. In this Cloud NAT configuration:
  • Pods can send packets to external IP addresses if those external IP address destinations are subject to IP masquerading. When deploying the ip-masq-agent, verify that the nonMasqueradeCIDRs list doesn't contain a range that includes the destination IP address. Packets sent to those destinations have their source converted to the node IP address before being processed by Cloud NAT.
  • To allow the Pods to connect to all external IP addresses with this Cloud NAT configuration, ensure the ip-masq-agent is deployed and that the nonMasqueradeCIDRs list contains only the node and Pod IP address ranges of the cluster. Packets sent to destinations outside of the cluster are first converted to source node IP addresses before being processed by Cloud NAT.
  • To prevent Pods from sending packets to some external IP addresses, you need to explicitly block those addresses so they are not masqueraded. When the ip-masq-agent is deployed, add the external IP addresses you want to block to the nonMasqueradeCIDRs list. Packets sent to those destinations leave the node with their original Pod IP address sources. The Pod IP addresses come from a secondary IP address range of the cluster's subnet. In this configuration, Cloud NAT won't operate on that secondary range.

Configuration: Cloud NAT configured to apply only to the subnet's secondary IP address range used for Pod IPs.

Troubleshooting: When Cloud NAT is configured only for the subnet's secondary IP address range used by the cluster's Pod IPs, packets sent from the cluster to external IP addresses must have a source Pod IP address. In this Cloud NAT configuration:

  • Using an IP masquerade agent causes packets to lose their source Pod IP address when processed by Cloud NAT. To keep the source Pod IP address, specify destination IP address ranges in a nonMasqueradeCIDRs list. With the ip-masq-agent deployed, any packets sent to destinations on the nonMasqueradeCIDRs list retain their source Pod IP addresses before being processed by Cloud NAT.
  • To allow the Pods to connect to all external IP addresses with this Cloud NAT configuration, ensure the ip-masq-agent is deployed and that the nonMasqueradeCIDRs list is as large as possible (0.0.0.0/0 specifies all IP address destinations). Packets sent to all destinations retain source Pod IP addresses before being processed by Cloud NAT.
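The interaction between the nonMasqueradeCIDRs list and the two Cloud NAT configurations above can be checked with a short Python sketch. It models only the rule described here: destinations inside the list keep the Pod source IP address, and all other destinations are masqueraded to the node IP address. The function names and the "primary" and "pod-secondary" labels are hypothetical, and the sketch ignores any other ip-masq-agent settings or defaults.

```python
import ipaddress


def egress_source(destination: str, non_masquerade_cidrs: list[str]) -> str:
    """Return "pod" if the Pod source IP is kept, otherwise "node"."""
    dest = ipaddress.ip_address(destination)
    for cidr in non_masquerade_cidrs:
        if dest in ipaddress.ip_network(cidr):
            return "pod"  # Destination is exempt from masquerading.
    return "node"  # Source is rewritten to the node IP address.


def handled_by_cloud_nat(
    destination: str, non_masquerade_cidrs: list[str], nat_range: str
) -> bool:
    """Check whether Cloud NAT would operate on the packet's source range.

    nat_range is "primary" (node IPs) or "pod-secondary" (Pod IPs).
    """
    source = egress_source(destination, non_masquerade_cidrs)
    if nat_range == "primary":
        return source == "node"
    if nat_range == "pod-secondary":
        return source == "pod"
    raise ValueError(f"unknown nat_range: {nat_range!r}")


if __name__ == "__main__":
    # Example addresses only.
    print(handled_by_cloud_nat("8.8.8.8", ["10.0.0.0/8"], "primary"))
    print(handled_by_cloud_nat("8.8.8.8", ["0.0.0.0/0"], "pod-secondary"))
    print(handled_by_cloud_nat("8.8.8.8", ["0.0.0.0/0"], "primary"))
```

A False result points to a mismatch between the masquerade list and the range that Cloud NAT is configured for, which matches the symptom of Pods being unable to reach external addresses while the drop metrics stay empty.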

Reduce packet loss

After you have diagnosed the cause of your packet loss, consider the following recommendations to reduce the likelihood of the issue recurring:

  • Configure the Cloud NAT gateway to use dynamic port allocation and increase the maximum number of ports per VM.

  • If you're using static port allocation, increase the minimum number of ports per VM.

  • Reduce your application's outbound packet rate. When an application makes multiple outbound connections to the same destination IP address and port, it can quickly consume all connections Cloud NAT can make to that destination using the number of allocated NAT source addresses and source port tuples.

    For details about how Cloud NAT uses NAT source addresses and source ports to make connections, including limits on the number of simultaneous connections to a destination, refer to Ports and connections.

    To reduce the rate of outbound connections from the application, reuse open connections. Common methods of reusing connections include connection pooling, multiplexing connections using protocols such as HTTP/2, and establishing persistent connections that are reused for multiple requests. For more information, see Ports and connections.
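The effect of reusing a persistent connection can be demonstrated locally. The following Python sketch starts a throwaway HTTP server on the loopback interface and counts how many TCP connections it accepts when a client sends five requests, first over one persistent connection and then over a new connection per request. The local server is only a stand-in for an external destination, and the class and function names are hypothetical; the point is that every new connection would need its own NAT source address and port tuple.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # Allows keep-alive on the server side.
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # One handler instance is created per accepted TCP connection.
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the demo output quiet.


def count_connections(reuse: bool, requests: int = 5) -> int:
    """Send `requests` GETs and return how many TCP connections were opened."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        if reuse:
            conn = http.client.HTTPConnection(host, port)
            for _ in range(requests):
                conn.request("GET", "/")
                conn.getresponse().read()  # Drain before the next request.
            conn.close()
        else:
            for _ in range(requests):
                conn = http.client.HTTPConnection(host, port)
                conn.request("GET", "/")
                conn.getresponse().read()
                conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("persistent connection:", count_connections(reuse=True))
    print("new connection per request:", count_connections(reuse=False))
```

In a real application, an HTTP client library with connection pooling typically provides the same reuse without managing connections by hand.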

What's next

If you need additional assistance, reach out to Cloud Customer Care.