This page shows troubleshooting steps for problems using Google Kubernetes Engine (GKE).
If you need additional assistance, reach out to Cloud Customer Care.
Debug Kubernetes resources
For Kubernetes resources, if you're experiencing an issue with:
- Your cluster, refer to Troubleshooting Clusters in the Kubernetes documentation.
- Your application, its Pods, or its controller object, refer to Troubleshooting Applications.
Troubleshoot kubectl command issues
This section contains troubleshooting steps for several types of issues with the kubectl command.
Issue: The kubectl command isn't found
If you are receiving a message that the kubectl command isn't found, reinstall the kubectl binary and set your $PATH environment variable.
Install the kubectl binary by running the following command:
gcloud components update kubectl
Answer "yes" when the installer prompts you to modify your $PATH environment variable. Modifying this variable lets you use kubectl commands without typing their full path.
Alternatively, add the following line to wherever your shell stores environment variables, such as ~/.bashrc (or ~/.bash_profile on macOS):
export PATH=$PATH:/usr/local/share/google/google-cloud-sdk/bin/
Run the following command to load your updated file. The following example uses .bashrc:
source ~/.bashrc
If you are using macOS, use ~/.bash_profile instead of .bashrc.
Issue: kubectl commands return "connection refused" error
If kubectl commands return a "connection refused" error, then you need to set the cluster context with the following command:
gcloud container clusters get-credentials CLUSTER_NAME
If you are unsure of what to enter for CLUSTER_NAME, use the following command to list your clusters:
gcloud container clusters list
Error: kubectl command timed out
If you created a cluster and attempted to run the kubectl command against the cluster but the kubectl command times out, you'll see an error such as:
Unable to connect to the server: dial tcp IP_ADDRESS: connect: connection timed out
Unable to connect to the server: dial tcp IP_ADDRESS: i/o timeout
These errors indicate that kubectl is unable to communicate with the cluster control plane.
To resolve this issue, verify and set the context where the cluster is set, then ensure connectivity to the cluster:
Go to $HOME/.kube/config or run the command kubectl config view to verify that the config file contains the cluster context and the external IP address of the control plane.
Set the cluster credentials:
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=COMPUTE_LOCATION \
    --project=PROJECT_ID
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- COMPUTE_LOCATION: the Compute Engine location.
- PROJECT_ID: the ID of the project in which the GKE cluster was created.
If the cluster is a private GKE cluster, then ensure that its list of existing authorized networks includes the outgoing IP of the machine that you are attempting to connect from. You can find your existing authorized networks in the console or by running the following command:
gcloud container clusters describe CLUSTER_NAME \
    --location=COMPUTE_LOCATION \
    --project=PROJECT_ID \
    --format "flattened(masterAuthorizedNetworksConfig.cidrBlocks[])"
If the outgoing IP of the machine is not included in the list of authorized networks from the output of the preceding command, then:
- If you are using the console, follow the directions in Can't reach control plane of a private cluster
- If connecting from Cloud Shell, follow the directions in Using Cloud Shell to access a private cluster.
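Alternatively, if you manage authorized networks with the gcloud CLI, the following is a hedged sketch of adding your machine's outgoing IP. MACHINE_IP is a placeholder for your outgoing IP address, and note that the --master-authorized-networks flag replaces the entire list, so repeat any existing CIDR blocks you want to keep:
gcloud container clusters update CLUSTER_NAME \
    --location=COMPUTE_LOCATION \
    --enable-master-authorized-networks \
    --master-authorized-networks=EXISTING_CIDR_1,MACHINE_IP/32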
Error: kubectl commands return "failed to negotiate an api version" error
If kubectl commands return a "failed to negotiate an api version" error, then you need to ensure kubectl has authentication credentials:
gcloud auth application-default login
Issue: The kubectl logs, attach, exec, and port-forward commands stop responding
If the kubectl logs, attach, exec, or port-forward commands stop responding, typically the API server is unable to communicate with the nodes.
First, check whether your cluster has any nodes. If you've scaled down the number of nodes in your cluster to zero, the commands won't work. To resolve this issue, resize your cluster to have at least one node.
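For example, the following sketch resizes a node pool back to one node; the node pool name default-pool is an assumption, so substitute your own:
gcloud container clusters resize CLUSTER_NAME \
    --node-pool=default-pool \
    --num-nodes=1 \
    --location=COMPUTE_LOCATION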
If your cluster has at least one node, then check whether you are using SSH or Konnectivity proxy tunnels to enable secure communication. The following sections discuss the troubleshooting steps specific to each approach:
Troubleshoot SSH issues
If you are using SSH, GKE saves an SSH public key file in your Compute Engine project metadata. All Compute Engine VMs using Google-provided images regularly check their project's common metadata and their instance's metadata for SSH keys to add to the VM's list of authorized users. GKE also adds a firewall rule to your Compute Engine network for allowing SSH access from the control plane's IP address to each node in the cluster.
Issues with SSH can be caused by the following:
Your network's firewall rules don't allow for SSH access from the control plane.
All Compute Engine networks are created with a firewall rule called default-allow-ssh that allows SSH access from all IP addresses (requiring a valid private key). GKE also inserts an SSH rule for each public cluster of the form gke-CLUSTER_NAME-RANDOM_CHARACTERS-ssh that allows SSH access specifically from the cluster's control plane to the cluster's nodes. If neither of these rules exists, then the control plane can't open SSH tunnels.
To verify that this is the cause of the issue, check whether your configuration has these rules.
To resolve this issue, identify the tag that's on all of the cluster's nodes, then re-add a firewall rule allowing access to VMs with that tag from the control plane's IP address.
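As a sketch (the rule name, network, and tag values are placeholders you must adapt), you can look up a node's tag and re-create the rule as follows:
gcloud compute instances describe NODE_NAME --zone=ZONE_NAME \
    --format="value(tags.items)"
gcloud compute firewall-rules create RULE_NAME \
    --network=NETWORK_NAME \
    --direction=INGRESS \
    --source-ranges=CONTROL_PLANE_IP/32 \
    --target-tags=NODE_TAG \
    --allow=tcp:22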
Your project's common metadata entry for ssh-keys is full.
If the project's metadata entry named ssh-keys is close to the maximum size limit, then GKE isn't able to add its own SSH key for opening SSH tunnels.
To verify that this is the issue, check the length of the list of ssh-keys. You can see your project's metadata by running the following command, optionally including the --project flag:
gcloud compute project-info describe [--project=PROJECT_ID]
To resolve this issue, delete some of the SSH keys that are no longer needed.
You have set a metadata field with the key ssh-keys on the VMs in the cluster.
The node agent on VMs prefers per-instance ssh-keys to project-wide SSH keys, so if you've set any SSH keys specifically on the cluster's nodes, then the nodes won't respect the control plane's SSH key in the project metadata.
To verify that this is the issue, run gcloud compute instances describe VM_NAME and look for an ssh-keys field in the metadata.
To fix the issue, delete the per-instance SSH keys from the instance metadata.
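For example, assuming the key is set at the instance level, a command similar to the following removes it:
gcloud compute instances remove-metadata VM_NAME \
    --zone=ZONE_NAME \
    --keys=ssh-keys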
Troubleshoot Konnectivity proxy issues
You can determine whether your cluster uses the Konnectivity proxy by checking for the following system Deployment:
kubectl get deployments konnectivity-agent --namespace kube-system
Issues with the Konnectivity proxy can be caused by the following:
Your network's firewall rules don't allow Konnectivity agent access to the control plane.
On cluster creation, Konnectivity agent Pods establish and maintain a connection to the control plane on port 8132. When one of the kubectl commands is run, the API server uses this connection to communicate with the cluster. If your network's firewall rules contain egress deny rules, they can prevent the agent from connecting.
To verify that this is the cause of the issue, check your network's firewall rules to see whether they contain egress deny rules.
To resolve this issue, allow egress traffic to the cluster control plane on port 8132. (For comparison, the API server uses port 443.)
Your cluster's network policy blocks ingress from the kube-system namespace to the workload namespace.
These features are not required for the correct functioning of the cluster. If you prefer to keep your cluster's network locked down from all outside access, be aware that features like these won't work.
To verify that this is the cause of the issue, find the network policies in the affected namespace by running the following command:
kubectl get networkpolicy --namespace AFFECTED_NAMESPACE
To resolve this issue, add the following to the spec.ingress field of the network policies:
- from:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: kube-system
    podSelector:
      matchLabels:
        k8s-app: konnectivity-agent
Troubleshoot error 4xx issues
The following sections help you troubleshoot error 400, 401, 403, 404, and related authentication and authorization errors.
Issue: authentication and authorization errors when connecting to GKE clusters
When connecting to GKE clusters, you can get an authentication and authorization error, with HTTP status code 401 (Unauthorized). This issue might occur when you try to run a kubectl command in your GKE cluster from a local environment.
The cause of this issue might be one of the following:
- The gke-gcloud-auth-plugin authentication plugin is not correctly installed or configured.
- You lack the permissions to connect to the cluster API server and run kubectl commands.
To diagnose the cause, follow the steps in the following sections:
Connect to the cluster using curl
To diagnose the cause of the authentication and authorization error, connect to the cluster using curl. Using curl bypasses the kubectl CLI and the gke-gcloud-auth-plugin plugin.
Set environment variables:
APISERVER=https://$(gcloud container clusters describe CLUSTER_NAME --location=COMPUTE_LOCATION --format "value(endpoint)")
TOKEN=$(gcloud auth print-access-token)
Verify that your access token is valid:
curl https://oauth2.googleapis.com/tokeninfo?access_token=$TOKEN
Try to connect to the core API endpoint in the API server:
gcloud container clusters describe CLUSTER_NAME --location=COMPUTE_LOCATION --format "value(masterAuth.clusterCaCertificate)" | base64 -d > /tmp/ca.crt
curl -s -X GET "${APISERVER}/api/v1/namespaces" --header "Authorization: Bearer $TOKEN" --cacert /tmp/ca.crt
If the curl command succeeds, check whether the plugin is the cause using the steps in the Configure use of the plugin in kubeconfig section.
If the curl command fails with an output that is similar to the following, then you don't have the correct permissions to access the cluster:
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}
To resolve this issue, get the correct permissions to access the cluster.
Configure use of the plugin in kubeconfig
If you're getting authentication and authorization errors when connecting to your clusters but were able to connect to the cluster using curl, then ensure that you can access your cluster without needing the gke-gcloud-auth-plugin plugin.
To resolve this issue, configure your local environment to ignore the gke-gcloud-auth-plugin binary when authenticating to the cluster. In Kubernetes clients running version 1.25 and later, the gke-gcloud-auth-plugin binary is required, so you need to use kubectl CLI version 1.24 or earlier.
Follow these steps to access your cluster without needing the plugin:
Install kubectl CLI version 1.24 using curl:
curl -LO https://dl.k8s.io/release/v1.24.0/bin/linux/amd64/kubectl
You can use any kubectl CLI version 1.24 or earlier.
Open your shell startup script file, such as .bashrc for the Bash shell, in a text editor:
vi ~/.bashrc
If you are using macOS, use ~/.bash_profile instead of .bashrc in these instructions.
Add the following line to the file and save it:
export USE_GKE_GCLOUD_AUTH_PLUGIN=False
Run the startup script:
source ~/.bashrc
Get credentials for your cluster, which sets up your .kube/config file:
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=COMPUTE_LOCATION
Replace the following:
- CLUSTER_NAME: the name of the cluster.
- COMPUTE_LOCATION: the Compute Engine location.
Run a kubectl command:
kubectl cluster-info
If you get a 401 error or a similar authorization error running these commands, ensure that you have the correct permissions, then rerun the step that returned the error.
Error 400: Node pool requires recreation
An error 400, node pool requires recreation, looks similar to the following:
ERROR: (gcloud.container.clusters.update) ResponseError: code=400, message=Node pool "test-pool-1" requires recreation.
This error occurs when you try to perform an action that recreates your control plane and nodes. For example, this error can occur when you complete an ongoing credential rotation.
On the backend, node pools are marked for recreation, but the actual recreation operation might take some time to begin. As a result, the operation fails because GKE has not yet recreated one or more node pools in your cluster.
To resolve this issue, do one of the following:
- Wait for the recreation to happen. This might take hours, days, or weeks depending on factors such as existing maintenance windows and exclusions.
Manually start a recreation of the affected node pools by starting a version upgrade to the same version as the control plane. To start a recreation, run the following command:
gcloud container clusters upgrade CLUSTER_NAME \
    --node-pool=POOL_NAME
After the upgrade completes, try the operation again.
Error 403: Insufficient permissions
An error 403, insufficient permissions, looks similar to the following:
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=403, message=Required "container.clusters.get" permission(s) for "projects/<your-project>/locations/<region>/clusters/<your-cluster>".
This error occurs when you try to connect to a GKE cluster using gcloud container clusters get-credentials, but the account doesn't have permission to access the Kubernetes API server.
To resolve this issue, do the following:
Identify the account that has the access issue:
gcloud auth list
Grant the required access to the account using the instructions in Authenticating to the Kubernetes API server.
Error 403: Retry budget exhausted
The following error occurs when you try to create a GKE cluster:
Error: googleapi: Error 403: Retry budget exhausted: Google Compute Engine:
Required permission 'PERMISSION_NAME' for 'RESOURCE_NAME'.
In this error message, the following variables apply:
- PERMISSION_NAME: the name of a permission, like compute.regions.get.
- RESOURCE_NAME: the path to the Google Cloud resource that you were trying to access, like a Compute Engine region.
This error occurs if the IAM service account attached to the cluster doesn't have the minimum required permissions to create the cluster.
To resolve this issue, do the following:
- Create or modify an IAM service account to have all of the required permissions to run a GKE cluster. For instructions, see Use least privilege IAM service accounts.
- Specify the updated IAM service account in your cluster creation command by using the --service-account flag. For instructions, see Create an Autopilot cluster.
Alternatively, omit the --service-account flag to let GKE use the Compute Engine default service account in the project, which has the required permissions by default.
Error 404: Resource "not found" when calling gcloud container commands
If you get an error 404, resource not found, when calling gcloud container commands, fix the issue by re-authenticating to the Google Cloud CLI:
gcloud auth login
Error 400/403: Missing edit permissions on account
A missing edit permissions on account error (error 400 or 403) indicates that one of the following has been deleted or edited manually:
- Your Compute Engine default service account.
- The Google APIs Service Agent.
- The service account associated with GKE.
When you enable the Compute Engine or Kubernetes Engine API, Google Cloud creates the following service accounts and agents:
- Compute Engine default service account with edit permissions on your project.
- Google APIs Service Agent with edit permissions on your project.
- Google Kubernetes Engine service account with the Kubernetes Engine Service Agent role on your project.
Cluster creation and all management functionality will fail if at any point someone edits those permissions, removes the role bindings on the project, removes the service account entirely, or disables the API.
To verify whether the Google Kubernetes Engine service account has the Kubernetes Engine Service Agent role assigned on the project, do the following steps:
Use the following pattern to find the name of your Google Kubernetes Engine service account:
service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com
Replace PROJECT_NUMBER with your project number.
Verify whether your Google Kubernetes Engine service account has the Kubernetes Engine Service Agent role assigned on the project. In this command, replace PROJECT_ID with your project ID:
gcloud projects get-iam-policy PROJECT_ID
To fix the issue, do one of the following:
- If someone removed the Kubernetes Engine Service Agent role from your Google Kubernetes Engine service account, add it back.
- Otherwise, use the following instructions to re-enable the Kubernetes Engine API, which correctly restores your service accounts and permissions.
Console
Go to the APIs & Services page in the Google Cloud console.
Select your project.
Click Enable APIs and Services.
Search for Kubernetes, then select the API from the search results.
Click Enable. If you have previously enabled the API, you must first disable it and then enable it again. It can take several minutes for the API and related services to be enabled.
gcloud
Run the following command in the gcloud CLI:
PROJECT_NUMBER=$(gcloud projects describe "PROJECT_ID" \
    --format 'get(projectNumber)')
gcloud projects add-iam-policy-binding PROJECT_ID \
--member "serviceAccount:service-${PROJECT_NUMBER?}@container-engine-robot.iam.gserviceaccount.com" \
--role roles/container.serviceAgent
Troubleshoot issues with GKE cluster creation
Error CONDITION_NOT_MET: Constraint constraints/compute.vmExternalIpAccess violated
You have the organization policy constraint constraints/compute.vmExternalIpAccess configured to Deny All
or to restrict external IPs to specific VM instances at the organization, folder, or project level in which you are trying to create a public GKE cluster.
When you create public GKE clusters, the underlying Compute Engine VMs, which make up the worker nodes of this cluster, have external IP addresses assigned. If you configure the organization policy constraint constraints/compute.vmExternalIpAccess to Deny All
or to restrict external IPs to specific VM instances, then the policy prevents the GKE worker nodes from obtaining external IP addresses, which results in cluster creation failure.
To find the logs of the cluster creation operation, you can review the GKE Cluster Operations Audit Logs using Logs Explorer with a search query similar to the following:
resource.type="gke_cluster"
logName="projects/test-last-gke-sa/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName="google.container.v1beta1.ClusterManager.CreateCluster"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.project_id="PROJECT_ID"
To resolve this issue, ensure that the effective policy for the constraint constraints/compute.vmExternalIpAccess is Allow All on the project where you are trying to create a GKE public cluster. See Restricting external IP addresses to specific VM instances for information on working with this constraint. After setting the constraint to Allow All, delete the failed cluster and create a new cluster. This is required because repairing the failed cluster is not possible.
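To check the effective policy before retrying, you can use a command similar to the following sketch:
gcloud resource-manager org-policies describe compute.vmExternalIpAccess \
    --project=PROJECT_ID \
    --effective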
Troubleshoot issues with deployed workloads
GKE returns an error if there are issues with a workload's Pods.
You can check the status of a Pod using the kubectl command-line tool or the Google Cloud console.
kubectl
To see all Pods running in your cluster, run the following command:
kubectl get pods
Output:
NAME READY STATUS RESTARTS AGE
POD_NAME 0/1 CrashLoopBackOff 23 8d
To get more detailed information about a specific Pod, run the following command:
kubectl describe pod POD_NAME
Replace POD_NAME with the name of the desired Pod.
Console
Perform the following steps:
Go to the Workloads page in the Google Cloud console.
Select the desired workload. The Overview tab displays the status of the workload.
From the Managed Pods section, click the error status message.
The following sections explain some common errors returned by workloads and how to resolve them.
Error: CrashLoopBackOff
CrashLoopBackOff indicates that a container is repeatedly crashing after restarting. A container might crash for many reasons, and checking a Pod's logs might aid in troubleshooting the root cause.
By default, crashed containers restart with an exponential delay limited to five minutes. You can change this behavior by setting the restartPolicy field in the Deployment's Pod specification under spec: restartPolicy. The field's default value is Always.
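For illustration, a minimal Pod manifest sketch showing where the field lives (names are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  restartPolicy: Always  # default; also accepts OnFailure or Never
  containers:
  - name: app
    image: IMAGE_NAME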
You can troubleshoot CrashLoopBackOff errors using the Google Cloud console:
Go to the Crashlooping Pods Interactive Playbook:
For Cluster, enter the name of the cluster you want to troubleshoot.
For Namespace, enter the namespace you want to troubleshoot.
(Optional) Create an alert to notify you of future CrashLoopBackOff errors:
- In the Future Mitigation Tips section, select Create an Alert.
Inspect logs
You can find out why your Pod's container is crashing using the kubectl command-line tool or the Google Cloud console.
kubectl
To see all Pods running in your cluster, run the following command:
kubectl get pods
Look for the Pod with the CrashLoopBackOff error.
To get the Pod's logs, run the following command:
kubectl logs POD_NAME
Replace POD_NAME with the name of the problematic Pod.
You can also pass in the -p flag to get the logs for the previous instance of a Pod's container, if it exists.
Console
Perform the following steps:
Go to the Workloads page in the Google Cloud console.
Select the desired workload. The Overview tab displays the status of the workload.
From the Managed Pods section, click the problematic Pod.
From the Pod's menu, click the Logs tab.
Check "Exit Code" of the crashed container
You can find the exit code by performing the following tasks:
Run the following command:
kubectl describe pod POD_NAME
Replace POD_NAME with the name of the Pod.
Review the value in the containers: CONTAINER_NAME: last state: exit code field:
- If the exit code is 1, the container crashed because the application crashed.
- If the exit code is 0, check how long your app was running. Containers exit when your application's main process exits. If your app finishes execution very quickly, the container might continue to restart.
Connect to a running container
Open a shell to the Pod:
kubectl exec -it POD_NAME -- /bin/bash
If there is more than one container in your Pod, add -c CONTAINER_NAME.
Now, you can run bash commands from the container: you can test the network or check whether you have access to files or databases used by your application.
Errors ImagePullBackOff and ErrImagePull
ImagePullBackOff and ErrImagePull indicate that the image used by a container cannot be loaded from the image registry.
You can verify this issue using the Google Cloud console or the kubectl command-line tool.
kubectl
To get more information about a Pod's container image, run the following command:
kubectl describe pod POD_NAME
Console
Perform the following steps:
Go to the Workloads page in the Google Cloud console.
Select the desired workload. The Overview tab displays the status of the workload.
From the Managed Pods section, click the problematic Pod.
From the Pod's menu, click the Events tab.
Issue: The image is not found
If your image is not found:
- Verify that the image's name is correct.
- Verify that the image's tag is correct. (Try :latest or no tag to pull the latest image.)
- If the image has a full registry path, verify that it exists in the Docker registry you are using. If you provide only the image name, check the Docker Hub registry.
Try to pull the Docker image manually:
SSH into the node. For example, to SSH into a VM:
gcloud compute ssh VM_NAME --zone=ZONE_NAME
Replace the following:
- VM_NAME: the name of the VM.
- ZONE_NAME: a Compute Engine zone.
Run docker-credential-gcr configure-docker. This command generates a config file at /home/[USER]/.docker/config.json. Ensure that this file includes the registry of the image in the credHelpers field. For example, the following file includes authentication information for images hosted at asia.gcr.io, eu.gcr.io, gcr.io, marketplace.gcr.io, and us.gcr.io:
{
  "auths": {},
  "credHelpers": {
    "asia.gcr.io": "gcr",
    "eu.gcr.io": "gcr",
    "gcr.io": "gcr",
    "marketplace.gcr.io": "gcr",
    "us.gcr.io": "gcr"
  }
}
Run docker pull IMAGE_NAME.
If this option works, you probably need to specify ImagePullSecrets on a Pod. Pods can only reference image pull secrets in their own namespace, so this process needs to be done one time per namespace.
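As a hedged sketch, with regcred as a hypothetical secret name and the registry details as placeholders, you can create the secret and then reference it from the Pod:
kubectl create secret docker-registry regcred \
    --namespace=NAMESPACE \
    --docker-server=REGISTRY_HOST \
    --docker-username=USERNAME \
    --docker-password=PASSWORD
Then reference it in the Pod specification:
spec:
  imagePullSecrets:
  - name: regcred  # hypothetical secret name created above
  containers:
  - name: app
    image: REGISTRY_HOST/IMAGE:TAG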
Error: Permission denied
If you encounter a "permission denied" or "no pull access" error, verify that you are logged in and have access to the image. Try one of the following methods depending on the registry in which you host your images.
Artifact Registry
If your image is in Artifact Registry, your node pool's service account needs read access to the repository that contains the image.
Grant the artifactregistry.reader role to the service account:
gcloud artifacts repositories add-iam-policy-binding REPOSITORY_NAME \
--location=REPOSITORY_LOCATION \
--member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
--role="roles/artifactregistry.reader"
Replace the following:
- REPOSITORY_NAME: the name of your Artifact Registry repository.
- REPOSITORY_LOCATION: the region of your Artifact Registry repository.
- SERVICE_ACCOUNT_EMAIL: the email address of the IAM service account associated with your node pool.
Container Registry
If your image is in Container Registry, your node pool's service account needs read access to the Cloud Storage bucket that contains the image.
Grant the roles/storage.objectViewer role to the service account so that it can read from the bucket:
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
--member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
--role=roles/storage.objectViewer
Replace the following:
- SERVICE_ACCOUNT_EMAIL: the email of the service account associated with your node pool. You can list all the service accounts in your project using gcloud iam service-accounts list.
- BUCKET_NAME: the name of the Cloud Storage bucket that contains your images. You can list all the buckets in your project using gcloud storage ls.
If your registry administrator set up gcr.io repositories in Artifact Registry to store images for the gcr.io domain instead of Container Registry, you must grant read access to Artifact Registry instead of Container Registry.
Private registry
If your image is in a private registry, you might require keys to access the images. See Using private registries for more information.
Error 401 Unauthorized: Cannot pull images from private container registry repository
An error similar to the following might occur when you pull an image from a private Container Registry repository:
gcr.io/PROJECT_ID/IMAGE:TAG: rpc error: code = Unknown desc = failed to pull and
unpack image gcr.io/PROJECT_ID/IMAGE:TAG: failed to resolve reference
gcr.io/PROJECT_ID/IMAGE:TAG: unexpected status code [manifests 1.0]: 401 Unauthorized
Warning Failed 3m39s (x4 over 5m12s) kubelet Error: ErrImagePull
Warning Failed 3m9s (x6 over 5m12s) kubelet Error: ImagePullBackOff
Normal BackOff 2s (x18 over 5m12s) kubelet Back-off pulling image
Identify the node running the pod:
kubectl describe pod POD_NAME | grep "Node:"
Verify that the node has the storage scope:
gcloud compute instances describe NODE_NAME \
    --zone=COMPUTE_ZONE --format="flattened(serviceAccounts[].scopes)"
The node's access scope should contain at least one of the following:
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/devstorage.read_only
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/cloud-platform
Recreate the node pool that the node belongs to with sufficient scope. You cannot modify existing nodes; you must recreate the node with the correct scope.
Recommended: Create a new node pool with the gke-default scope:
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --zone=COMPUTE_ZONE \
    --scopes="gke-default"
Create a new node pool with only storage scope:
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --zone=COMPUTE_ZONE \
    --scopes="https://www.googleapis.com/auth/devstorage.read_only"
Error: Pod unschedulable
PodUnschedulable indicates that your Pod cannot be scheduled because of insufficient resources or some configuration error.
If you have configured your GKE cluster to send Kubernetes API server and Kubernetes scheduler metrics to Cloud Monitoring, you can find more information about these errors in scheduler metrics and API server metrics.
You can troubleshoot PodUnschedulable errors using the Google Cloud console:
Go to the Unschedulable Pods Interactive Playbook:
For Cluster, enter the name of the cluster you want to troubleshoot.
For Namespace, enter the namespace you want to troubleshoot.
(Optional) Create an alert to notify you of future PodUnschedulable errors:
- In the Future Mitigation Tips section, select Create an Alert.
Error: Insufficient resources
You might encounter an error indicating a lack of CPU, memory, or another resource. For example: "No nodes are available that match all of the predicates: Insufficient cpu (2)", which indicates that on two nodes there isn't enough CPU available to fulfill a Pod's requests.
If your Pod resource requests exceed that of a single node from any eligible node pools, GKE does not schedule the Pod and also does not trigger scale up to add a new node. For GKE to schedule the Pod, you must either request fewer resources for the Pod, or create a new node pool with sufficient resources.
You can also enable node auto-provisioning so that GKE can automatically create node pools with nodes where the unscheduled Pods can run.
The default CPU request is 100m, or 10% of one CPU core. If you want to request more or fewer resources, specify the value in the Pod specification under spec: containers: resources: requests.
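For example, a sketch of a container requesting half a core and 256 MiB of memory (the values are illustrative only):
spec:
  containers:
  - name: app
    image: IMAGE_NAME
    resources:
      requests:
        cpu: 500m
        memory: 256Mi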
Error: MatchNodeSelector
MatchNodeSelector indicates that there are no nodes that match the Pod's label selector.
To verify this, check the labels specified in the Pod specification's nodeSelector field, under spec: nodeSelector.
To see how nodes in your cluster are labelled, run the following command:
kubectl get nodes --show-labels
To attach a label to a node, run the following command:
kubectl label nodes NODE_NAME LABEL_KEY=LABEL_VALUE
Replace the following:
- NODE_NAME: the desired node.
- LABEL_KEY: the label's key.
- LABEL_VALUE: the label's value.
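For illustration, assuming you attached a hypothetical label disktype=ssd to a node, a Pod selects that node with:
spec:
  nodeSelector:
    disktype: ssd  # hypothetical label key and value
  containers:
  - name: app
    image: IMAGE_NAME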
For more information, refer to Assigning Pods to Nodes.
Error: PodToleratesNodeTaints
PodToleratesNodeTaints indicates that the Pod can't be scheduled to any node because the Pod doesn't have tolerations that match the existing node taints.
To verify that this is the case, run the following command:
kubectl describe nodes NODE_NAME
In the output, check the Taints field, which lists key-value pairs and scheduling effects. If the effect listed is NoSchedule, then no Pod can be scheduled on that node unless it has a matching toleration.
One way to resolve this issue is to remove the taint. For example, to remove a NoSchedule taint, run the following command:
kubectl taint nodes NODE_NAME key:NoSchedule-
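Alternatively, if the taint is intentional, add a matching toleration to the Pod instead. A sketch, assuming a hypothetical taint key=value with the NoSchedule effect:
spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"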
Error: PodFitsHostPorts
PodFitsHostPorts indicates that a port that a Pod is attempting to use is already in use.
To resolve this issue, check the Pod specification's hostPort value under spec: containers: ports: hostPort. You might need to change this value to another port.
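For example, a sketch of moving a conflicting hostPort (the port numbers are illustrative):
spec:
  containers:
  - name: app
    image: IMAGE_NAME
    ports:
    - containerPort: 8080
      hostPort: 8081  # pick a port that is free on the node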
Error: Does not have minimum availability
If a node has adequate resources but you still see the Does not have minimum availability message, check the Pod's status. If the status is SchedulingDisabled or Cordoned, the node cannot schedule new Pods. You can check the status of a node using the Google Cloud console or the kubectl command-line tool.
kubectl
To get statuses of your nodes, run the following command:
kubectl get nodes
To enable scheduling on the node, run:
kubectl uncordon NODE_NAME
Console
Perform the following steps:
Go to the Google Kubernetes Engine page in the Google Cloud console.
Select the desired cluster. The Nodes tab displays the Nodes and their status.
To enable scheduling on the Node, perform the following steps:
From the list, click the desired Node.
From the Node Details, click the Uncordon button.
Error: Maximum pods per node limit reached
If the Maximum Pods per node limit is reached by all nodes in the cluster, the Pods will be stuck in the Unschedulable state. Under the Pod's Events tab, you will see a message including the phrase Too many pods.
Check the Maximum pods per node configuration from the Nodes tab in GKE cluster details in the Google Cloud console.
Get a list of nodes:
kubectl get nodes
For each node, verify the number of Pods running on the node:
kubectl get pods -o wide | grep NODE_NAME | wc -l
If the limit is reached, add a new node pool or add additional nodes to the existing node pool.
Issue: Maximum node pool size reached with cluster autoscaler enabled
If the node pool has reached its maximum size according to its cluster autoscaler configuration, GKE does not trigger scale up for the Pod that would otherwise be scheduled with this node pool. If you want the Pod to be scheduled with this node pool, change the cluster autoscaler configuration.
Issue: Maximum node pool size reached with cluster autoscaler disabled
If the node pool has reached its maximum number of nodes, and cluster autoscaler is disabled, GKE cannot schedule the Pod with the node pool. Increase the size of your node pool or enable cluster autoscaler for GKE to resize your cluster automatically.
Error: Unbound PersistentVolumeClaims
Unbound PersistentVolumeClaims indicates that the Pod references a PersistentVolumeClaim that is not bound. This error might happen if your PersistentVolume failed to provision. You can verify that provisioning failed by getting the events for your PersistentVolumeClaim and examining them for failures.
To get events, run the following command:
kubectl describe pvc STATEFULSET_NAME-PVC_NAME-0
Replace the following:
- STATEFULSET_NAME: the name of the StatefulSet object.
- PVC_NAME: the name of the PersistentVolumeClaim object.
This may also happen if there was a configuration error during your manual pre-provisioning of a PersistentVolume and its binding to a PersistentVolumeClaim. You can try to pre-provision the volume again.
Error: Insufficient quota
Verify that your project has sufficient Compute Engine quota for GKE to scale up your cluster. If GKE attempts to add a node to your cluster to schedule the Pod, and scaling up would exceed your project's available quota, you receive the scale.up.error.quota.exceeded error message.
To learn more, see ScaleUp errors.
Issue: Deprecated APIs
Ensure that you are not using deprecated APIs that are removed with your cluster's minor version. To learn more, see GKE deprecations.
Error: "failed to allocate for range 0: no IP addresses in range set"
GKE version 1.18.17 and later fixed an issue where out-of-memory (OOM) events would result in incorrect Pod eviction if the Pod was deleted before its containers were started. This incorrect eviction could result in orphaned Pods that continued to have reserved IP addresses from the allocated node range. Over time, GKE ran out of IP addresses to allocate to new Pods because of the build-up of orphaned Pods. This led to the error message failed to allocate for range 0: no IP addresses in range set, because the allocated node range didn't have available IPs to assign to new Pods.
To resolve this issue, upgrade your cluster and node pools to GKE version 1.18.17 or later.
To prevent this issue and resolve it on clusters with GKE versions prior to 1.18.17, increase your resource limits to avoid OOM events in the future, and then reclaim the IP addresses by removing the orphaned Pods.
You can also view GKE IP address utilization insights.
Remove the orphaned Pods from affected nodes
You can remove the orphaned Pods by draining the node, upgrading the node pool, or moving the affected directories.
Draining the node (recommended)
Cordon the node to prevent new Pods from scheduling on it:
kubectl cordon NODE
Replace NODE with the name of the node you want to drain.
Drain the node. GKE automatically reschedules Pods managed by Deployments onto other nodes. Use the --force flag to drain orphaned Pods that don't have a managing resource:
kubectl drain NODE --force
Uncordon the node to allow GKE to schedule new Pods on it:
kubectl uncordon NODE
Moving affected directories
You can identify orphaned Pod directories in /var/lib/kubelet/pods and move them out of the main directory to allow GKE to terminate the Pods.
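As a rough sketch, run on the affected node (for example, over SSH); POD_UID is a placeholder for an orphaned Pod's directory, and the destination path is only an example:
ls /var/lib/kubelet/pods
sudo mkdir -p /tmp/orphaned-pods
sudo mv /var/lib/kubelet/pods/POD_UID /tmp/orphaned-pods/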
Troubleshoot issues with terminating resources
Issue: Namespace stuck in Terminating state
Debug GKE Cluster Autoscaler issues with gcpdiag
gcpdiag is an open source tool. It is not an officially supported Google Cloud product.
You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.
The gke/cluster-autoscaler runbook investigates the following cluster autoscaler error messages:
- scale.up.error.out.of.resources
- scale.up.error.quota.exceeded
- scale.up.error.waiting.for.instances.timeout
- scale.up.error.ip.space.exhausted
- scale.up.error.service.account.deleted
- scale.down.error.failed.to.evict.pods
- no.scale.down.node.node.group.min.size.reached
Google Cloud console
- Complete and then copy the following command.
- Open the Google Cloud console and activate Cloud Shell.
- Paste the copied command.
- Run the gcpdiag command, which downloads the gcpdiag docker image, and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.
gcpdiag runbook gke/cluster-autoscaler --project=PROJECT_ID \
--parameter name=CLUSTER_NAME \
--parameter location=LOCATION
Docker
You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.
- Copy and run the following command on your local workstation.
curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
- Execute the gcpdiag command:
./gcpdiag runbook gke/cluster-autoscaler --project=PROJECT_ID \
    --parameter name=CLUSTER_NAME \
    --parameter location=LOCATION
View available parameters for this runbook.
Replace the following:
- PROJECT_ID: The ID of the project containing the resource
- CLUSTER_NAME: The name of the target GKE Cluster within your project.
- LOCATION: The location of your target GKE cluster (the zone for a zonal cluster or the region for a regional cluster).
Useful flags:
- --project: The PROJECT_ID
- --universe-domain: If applicable, the Trusted Partner Sovereign Cloud domain hosting the resource
- --parameter or -p: Runbook parameters
For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.
Troubleshoot metrics from your cluster not appearing in Cloud Monitoring
Ensure that you've enabled the Monitoring API and the Logging API on your project. You should also confirm that you're able to view your project in the Cloud Monitoring overview in the Google Cloud console.
If the issue persists, check the following potential causes:
Ensure that you have enabled monitoring on your cluster.
Monitoring is enabled by default for clusters created from the Google Cloud console and from the Google Cloud CLI, but you can verify by running the following command or clicking into the cluster's details in the Google Cloud console:
gcloud container clusters describe CLUSTER_NAME
The output from this command should include SYSTEM_COMPONENTS in the list of enableComponents in the monitoringConfig section, similar to this:
monitoringConfig:
  componentConfig:
    enableComponents:
    - SYSTEM_COMPONENTS
If monitoring is not enabled, run the following command to enable it:
gcloud container clusters update CLUSTER_NAME --monitoring=SYSTEM
How long has it been since your cluster was created or had monitoring enabled?
It can take up to an hour for a new cluster's metrics to start appearing in Cloud Monitoring.
Is a heapster or gke-metrics-agent (the OpenTelemetry Collector) Pod running in your cluster in the kube-system namespace?
This Pod might be failing to schedule because your cluster is running low on resources. Check whether Heapster or OpenTelemetry is running by calling kubectl get pods --namespace=kube-system and checking for Pods with heapster or gke-metrics-agent in the name.
Is your cluster's control plane able to communicate with the nodes?
Cloud Monitoring relies on this communication. You can check whether this is the case by running the following command:
kubectl logs POD_NAME
If this command returns an error, then the SSH tunnels may be causing the issue. See the Troubleshoot SSH issues section for further information.
If you are having an issue related to the Cloud Logging agent, see its troubleshooting documentation.
For more information, refer to the Logging documentation.
Issue: cluster's root Certificate Authority expiring soon
Your cluster's root Certificate Authority is expiring soon. To prevent normal cluster operations from being interrupted, you must perform a credential rotation.
Error: "Instance 'Foo' does not contain 'instance-template' metadata"
You may see an error "Instance 'Foo' does not contain 'instance-template' metadata" as a status of a node pool that fails to upgrade, scale, or perform automatic node repair.
This message indicates that the metadata of VM instances, allocated by GKE, was corrupted. This typically happens when custom-authored automation or scripts attempt to add new instance metadata (like block-project-ssh-keys), and instead of just adding or updating values, they also delete existing metadata. You can read about VM instance metadata in Setting custom metadata.
If any of the critical metadata values (among others: instance-template, kube-labels, kubelet-config, kubeconfig, cluster-name, configure-sh, cluster-uid) were deleted, the node or entire node pool might render itself into an unstable state because these values are crucial for GKE operations.
If the instance metadata was corrupted, the best way to recover it is to re-create the node pool that contains the corrupted VM instances. To do so, add a node pool to your cluster and increase the node count on the new node pool while cordoning and removing nodes in the other. See the instructions to migrate workloads between node pools.
To find out who edited the instance metadata and when, you can review Compute Engine audit logging information or find logs by using Logs Explorer with a search query similar to the following:
resource.type="gce_instance_group_manager"
protoPayload.methodName="v1.compute.instanceGroupManagers.setInstanceTemplate"
In the logs, you may find the request originator's IP address and user agent:
requestMetadata: {
callerIp: "REDACTED"
callerSuppliedUserAgent: "google-api-go-client/0.5 GoogleContainerEngine/v1"
}
Issue: secrets encryption update failed
If the operation to enable, disable, or update the Cloud KMS key fails, see the Troubleshoot application-layer secrets encryption guide.
For more information about secrets in GKE, see Encrypt secrets at the application layer.