This section provides guidance for resolving issues related to VPC-native clusters. You can also view GKE IP address utilization insights.
The default network resource is not ready
- Symptoms
You get an error message similar to the following:
projects/[PROJECT_NAME]/regions/XXX/subnetworks/default
- Potential causes
There are parallel operations on the same subnet. For example, another VPC-native cluster is being created, or a secondary range is being added or deleted on the subnet.
- Resolution
Retry the command.
Invalid value for IPCidrRange
- Symptoms
You get an error message similar to the following:
resource.secondaryIpRanges[1].ipCidrRange': 'XXX'. Invalid IPCidrRange: XXX conflicts with existing subnetwork 'default' in region 'XXX'
- Potential causes
Another VPC-native cluster is being created at the same time and is attempting to allocate the same ranges in the same VPC network.
The same secondary range is being added to the subnetwork in the same VPC network.
- Resolution
If this error is returned on cluster creation when no secondary ranges were specified, retry the cluster creation command.
Not enough free IP address space for Pods
- Symptoms
Cluster is stuck in a provisioning state for an extended period of time.
Cluster creation returns a Managed Instance Group (MIG) error.
When you add one or more nodes to a cluster, the following error appears:
[IP_SPACE_EXHAUSTED] Instance 'INSTANCE_NAME' creation failed: IP space of 'projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME-SECONDARY_RANGE_NAME' is exhausted.
- Potential causes
Node IP address exhaustion: The primary IP address range of the subnet assigned to your cluster runs out of available IP addresses. This typically happens when scaling node pools or creating large clusters.
Pod IP address exhaustion: The Pod CIDR range assigned to your cluster is full. This occurs when the number of Pods exceeds the capacity of the Pod CIDR, especially with high Pod density per node or large deployments.
Specific subnet naming conventions: The way a subnet is named in an error message can help you figure out if the problem is with the node IP address range (where the nodes themselves get their IP address) or the Pod IP address range (where the containers inside the Pods get their IP addresses).
Secondary range exhaustion (Autopilot): In Autopilot clusters, secondary ranges assigned for Pod IP addresses are exhausted due to scaling or high Pod density.
- Solution
Gather the following information about your cluster: name, control plane version, mode of operation, associated VPC name, and subnet name and CIDR. Additionally, note the default and any additional Cluster Pod IPv4 ranges (including names and CIDRs), whether VPC-native traffic routing is enabled, and the maximum Pods per node setting at both the cluster and node pool levels (if applicable). Note any impacted node pools and their specific IPv4 Pod IP address ranges and maximum Pods per node configurations if they differ from the cluster-wide settings. Also, record the default and custom (if any) configurations for maximum Pods per node in the node pool configuration.
Confirm IP address exhaustion issue
Network Intelligence Center: Check for high IP address allocation rates in the Pod IP address ranges in the Network Intelligence Center for your GKE cluster.
If you observe a high IP address allocation rate in the Pod ranges within Network Intelligence Center, then your Pod IP address range is exhausted.
If the Pod IP address ranges show normal allocation rates, but you are still experiencing IP address exhaustion, then it's likely your node IP address range is exhausted.
Audit logs: Examine the resourceName field in IP_SPACE_EXHAUSTED entries, comparing it to subnet names or secondary Pod IP address range names.
Check whether the exhausted IP address range is the node IP address range or the Pod IP address range
To verify which range is exhausted, check whether the value of resourceName in the ipSpaceExhausted portion of an IP_SPACE_EXHAUSTED log entry correlates with the subnet name or with the name of the secondary IPv4 address range for Pods used in the impacted GKE cluster.
If the value of resourceName is in the format "[Subnet_name]", then the node IP address range is exhausted. If the value of resourceName is in the format "[Subnet_name]-[Name_of_Secondary_IPv4_range_for_pods]-[HASH_8BYTES]", then the Pod IP address range is exhausted.
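The naming rule above can be sketched as a small shell helper. This is a minimal sketch and not part of any Google tooling: the function name classify_exhausted_range and its argument order are hypothetical, and it assumes only the two resourceName formats described above. It treats whatever follows the secondary range name as an opaque suffix and strips any leading projects/.../subnetworks/ path before comparing.

```shell
# classify_exhausted_range RESOURCE_NAME SUBNET_NAME POD_RANGE_NAME...
# Hypothetical helper: prints "node", "pod", or "unknown".
classify_exhausted_range() {
  # Keep only the last path segment, in case a full resource path is passed.
  resource_name=${1##*/}
  subnet_name=$2
  shift 2

  # An exact match on the subnet name points at the primary (node) range.
  if [ "$resource_name" = "$subnet_name" ]; then
    echo node
    return 0
  fi

  # "<subnet>-<secondary range>-<suffix>" points at a Pod range.
  for pod_range in "$@"; do
    case $resource_name in
      "${subnet_name}-${pod_range}-"?*)
        echo pod
        return 0
        ;;
    esac
  done

  # Anything else does not fit either documented format.
  echo unknown
}
```

For example, `classify_exhausted_range my-subnet-pods-1a2b3c4d my-subnet pods` prints pod, where my-subnet and pods are placeholder names for the cluster's subnet and secondary Pod range.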
Resolve Pod IP address exhaustion:
- Resize existing Pod CIDR: Increase the size of the current Pod IP address range. You can add Pod IP ranges to the cluster using discontiguous multi-Pod CIDR.
- Create additional subnets: Add subnets with dedicated Pod CIDRs to the cluster.
Reduce Pods per node to free up IP addresses:
- Create a new node pool with a smaller maximum number of Pods per node.
- Migrate workloads to that node pool, and then delete the previous node pool. Reducing the maximum number of Pods per node lets you support more nodes on a fixed secondary IP address range for Pods. Refer to Subnet secondary IP address range for Pods and Node limiting ranges for details about the calculations involved.
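As a rough illustration of that calculation, the following sketch estimates how many nodes a secondary Pod range can support. It assumes that GKE reserves for each node the smallest power-of-two block that holds at least twice the maximum number of Pods per node (for example, a /24 for 110 Pods); confirm this against the referenced documentation before relying on it. The function name pod_range_node_capacity is hypothetical.

```shell
# pod_range_node_capacity POD_RANGE_PREFIX_LENGTH MAX_PODS_PER_NODE
# Hypothetical helper: prints the estimated number of nodes the range supports.
pod_range_node_capacity() {
  range_prefix=$1
  max_pods=$2

  # Assumption: each node gets the smallest power-of-two block that holds
  # at least 2 * max_pods addresses.
  needed=$((max_pods * 2))
  block=1
  node_prefix=32
  while [ "$block" -lt "$needed" ]; do
    block=$((block * 2))
    node_prefix=$((node_prefix - 1))
  done

  # The Pod range is smaller than a single per-node block.
  if [ "$node_prefix" -lt "$range_prefix" ]; then
    echo 0
    return 0
  fi

  # Number of per-node blocks in the range: 2^(node_prefix - range_prefix).
  nodes=1
  i=$range_prefix
  while [ "$i" -lt "$node_prefix" ]; do
    nodes=$((nodes * 2))
    i=$((i + 1))
  done
  echo "$nodes"
}
```

Under these assumptions, `pod_range_node_capacity 14 110` prints 1024, and lowering the maximum to 32 Pods per node raises the result for the same /14 range to 4096.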
Address node IP address exhaustion:
- Review IP address planning: Ensure the node IP address range aligns with your scaling requirements.
- Create new cluster (if necessary): If the node IP address range is severely constrained, create a replacement cluster with appropriate IP address range sizing. Refer to IP ranges for VPC-native clusters and IP range planning.
Debug IP address exhaustion issues with gcpdiag
gcpdiag is an open source tool. It is not an officially supported Google Cloud product.
You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.
- Cluster status: Checks the cluster status if IP address exhaustion is reported.
- Network analyzer: Queries Stackdriver logs for network analyzer logs to confirm whether there is Pod or node IP address exhaustion.
- Cluster Type: Checks the cluster type and provides relevant recommendations based on the cluster type.
Google Cloud console
- Complete and then copy the following command.
- Open the Google Cloud console and activate Cloud Shell. Open Cloud console
- Paste the copied command.
- Run the gcpdiag command, which downloads the gcpdiag Docker image and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.
gcpdiag runbook gke/ip-exhaustion --project=PROJECT_ID \
    --parameter name=CLUSTER_NAME \
    --parameter location=LOCATION \
    --parameter start_time=yyyy-mm-ddThh:mm:ssZ \
    --parameter end_time=yyyy-mm-ddThh:mm:ssZ
Docker
You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.
- Copy and run the following command on your local workstation.
curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
- Execute the gcpdiag command.
./gcpdiag runbook gke/ip-exhaustion --project=PROJECT_ID \
    --parameter name=CLUSTER_NAME \
    --parameter location=LOCATION \
    --parameter start_time=yyyy-mm-ddThh:mm:ssZ \
    --parameter end_time=yyyy-mm-ddThh:mm:ssZ
View available parameters for this runbook.
Replace the following:
- PROJECT_ID: The ID of the project containing the resource.
- CLUSTER_NAME: The name of the target GKE cluster within your project.
- LOCATION: The zone or region in which your cluster is located.
- start_time: The time the issue started.
- end_time: The time the issue ended. Use the current time if the issue is ongoing.
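If the issue is ongoing, the current time can be produced in the yyyy-mm-ddThh:mm:ssZ form that the parameters above expect. The following is a minimal sketch that assumes the timestamps are in UTC, as the trailing Z suggests; the function name utc_timestamp and the variable end_time are illustrative only.

```shell
# Print the current time in UTC, formatted as yyyy-mm-ddThh:mm:ssZ.
utc_timestamp() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}

# Capture the value to pass as the end_time parameter.
end_time=$(utc_timestamp)
echo "end_time=${end_time}"
```

The captured value can then be substituted into the command as `--parameter end_time="${end_time}"`.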
Useful flags:
- --project: The PROJECT_ID
- --universe-domain: If applicable, the Trusted Partner Sovereign Cloud domain hosting the resource
- --parameter or -p: Runbook parameters
For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.
Confirm whether default SNAT is disabled
Use the following command to check the status of default SNAT:
gcloud container clusters describe CLUSTER_NAME
Replace CLUSTER_NAME with the name of your cluster.
The output is similar to the following:
networkConfig:
  disableDefaultSnat: true
  network: ...
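To check this field from a script instead of reading the output by eye, the describe output can be piped through a small filter. This is a minimal sketch that assumes the YAML layout shown above; the function name default_snat_disabled is hypothetical. It returns a non-zero status when the disableDefaultSnat line is missing or is not true.

```shell
# default_snat_disabled: reads cluster describe output (YAML) on stdin and
# succeeds only if it contains a "disableDefaultSnat: true" line.
# Hypothetical usage:
#   gcloud container clusters describe CLUSTER_NAME | default_snat_disabled
default_snat_disabled() {
  grep -Eq '^[[:space:]]*disableDefaultSnat:[[:space:]]*true[[:space:]]*$'
}
```

Because the function only matches text, it reports on the output that it receives and does not query the cluster itself.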
Cannot use --disable-default-snat without --enable-ip-alias
This error message, and the message must disable default sNAT (--disable-default-snat) before using public IP address privately in the cluster, mean that you should explicitly set the --disable-default-snat flag when creating the cluster, because you are using public IP addresses in your private cluster.
If you see error messages like cannot disable default sNAT ..., the default SNAT can't be disabled in your cluster. To resolve this issue, review your cluster configuration.
Debugging Cloud NAT with default SNAT disabled
If you have a private cluster created with the --disable-default-snat flag, have set up Cloud NAT for internet access, and aren't seeing internet-bound traffic from your Pods, make sure that the Pod range is included in the Cloud NAT configuration.
If there is a problem with Pod to Pod communication, examine the iptables rules on the nodes to verify that the Pod ranges are not masqueraded by iptables rules.
For more information, see the GKE IP masquerade documentation.
If you have not configured an IP masquerade agent for the cluster, GKE automatically ensures that Pod to Pod communication is not masqueraded. However, if an IP masquerade agent is configured, it overrides the default IP masquerade rules. Verify that additional rules are configured in the IP masquerade agent so that the Pod ranges are not masqueraded.
The dual-stack cluster network communication is not working as expected
- Potential causes
- The firewall rules created by the GKE cluster don't include the allocated IPv6 addresses.
- Resolution
- You can validate the firewall rule by following these steps:
Verify the firewall rule content:
gcloud compute firewall-rules describe FIREWALL_RULE_NAME
Replace FIREWALL_RULE_NAME with the name of the firewall rule.
Each dual-stack cluster creates a firewall rule that allows nodes and Pods to communicate with each other. The firewall rule content is similar to the following:
allowed:
- IPProtocol: esp
- IPProtocol: ah
- IPProtocol: sctp
- IPProtocol: tcp
- IPProtocol: udp
- IPProtocol: '58'
creationTimestamp: '2021-08-16T22:20:14.747-07:00'
description: ''
direction: INGRESS
disabled: false
enableLogging: false
id: '7326842601032055265'
kind: compute#firewall
logConfig:
  enable: false
name: gke-ipv6-4-3d8e9c78-ipv6-all
network: https://www.googleapis.com/compute/alpha/projects/my-project/global/networks/alphanet
priority: 1000
selfLink: https://www.googleapis.com/compute/alpha/projects/my-project/global/firewalls/gke-ipv6-4-3d8e9c78-ipv6-all
selfLinkWithId: https://www.googleapis.com/compute/alpha/projects/my-project/global/firewalls/7326842601032055265
sourceRanges:
- 2600:1900:4120:fabf::/64
targetTags:
- gke-ipv6-4-3d8e9c78-node
The sourceRanges value must be the same as the subnetIpv6CidrBlock. The targetTags value must be the same as the tags on the GKE nodes. To fix this issue, update the firewall rule with the cluster ipAllocationPolicy block information.
The Private Service Connect endpoint might leak during cluster deletion
- Symptoms
You cannot see a connected endpoint under Private Service Connect in your Private Service Connect-based cluster.
You can't delete the subnet or VPC network where the Private Service Connect endpoint is allocated. An error message similar to the following appears:
projects/<PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNET_NAME> is already being used by projects/<PROJECT_ID>/regions/<REGION>/addresses/gk3-<ID>
- Potential causes
On GKE clusters that use Private Service Connect, GKE deploys a Private Service Connect endpoint by using a forwarding rule that allocates an internal IP address to access the cluster's control plane in the control plane's network. To protect the communication between the control plane and the nodes by using Private Service Connect, GKE keeps the endpoint invisible, and you can't see it in the Google Cloud console or with the gcloud CLI.
- Resolution
To prevent leaking the Private Service Connect endpoint before cluster deletion, complete the following steps:
- Assign the Kubernetes Engine Service Agent role to the GKE service account.
- Ensure that the compute.forwardingRules.* and compute.addresses.* permissions are not explicitly denied from the GKE service account.
If you see that the Private Service Connect endpoint has leaked, contact support.
What's next
- For general information about diagnosing Kubernetes DNS issues, see Debugging DNS Resolution.