Troubleshoot issues with deployed workloads


This page shows you how to resolve errors with your deployed workloads in Google Kubernetes Engine (GKE).

For more general advice about troubleshooting your applications, see Troubleshooting Applications in the Kubernetes documentation.

All errors: Check Pod status

If there are issues with a workload's Pods, Kubernetes updates the Pod status with an error message. View these errors by checking the status of a Pod using the Google Cloud console or the kubectl command-line tool.

Console

Perform the following steps:

  1. In the Google Cloud console, go to the Workloads page.

    Go to Workloads

  2. Select the workload that you want to investigate. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click any error status message.

kubectl

To see all Pods running in your cluster, run the following command:

kubectl get pods

The output is similar to the following:

NAME       READY  STATUS             RESTARTS  AGE
POD_NAME   0/1    CrashLoopBackOff   23        8d

Potential errors are listed in the Status column.
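
As a rough sketch, you can filter that output so that only Pods in a potentially problematic state remain. The following shell function, unhealthy_pods (a hypothetical helper name, not part of kubectl), reads the table from standard input. It assumes the default column layout shown in the preceding output, where STATUS is the third column:

```shell
# Hypothetical helper: print the name and status of every Pod whose STATUS
# column is something other than Running or Completed.
# Assumes the default `kubectl get pods` layout (NAME READY STATUS ...).
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example usage against a live cluster:
#   kubectl get pods | unhealthy_pods
```

A Pod can report Running and still be unhealthy (for example, when READY shows 0/1), so treat this filter as a first pass only.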

To get more detailed information about a specific Pod, run the following command:

kubectl describe pod POD_NAME

Replace POD_NAME with the name of the Pod that you want to investigate.

In the output, the Events field shows more information about errors.

If you'd like more information, view the container logs:

kubectl logs POD_NAME

These logs can help you identify if a command or code in the container caused the Pod to crash.

After you identify the error, use the following sections to try to resolve the issue.

Error: CrashLoopBackOff

A status of CrashLoopBackOff doesn't point to one specific error. Instead, it indicates that a container is repeatedly crashing after restarting. When a container crashes or exits shortly after starting (CrashLoop), Kubernetes attempts to restart the container. With each failed restart, the delay (BackOff) before the next attempt increases exponentially (10s, 20s, 40s, and so on), up to a maximum of five minutes.
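
To make the back-off schedule concrete, the following sketch prints the approximate delay before each restart attempt. It assumes only the behavior described in the preceding paragraph (start at 10 seconds, double each time, cap at 300 seconds). The function name backoff_delays is hypothetical, and the real kubelet timing can differ slightly:

```shell
# Print the approximate restart delays, in seconds, for the first N failed
# restarts: start at 10s, double after each failure, and cap at 300s.
backoff_delays() {
  local attempts="$1" delay=10 cap=300 out=""
  while [ "$attempts" -gt 0 ]; do
    out="${out}${out:+ }${delay}"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay="$cap"
    fi
    attempts=$((attempts - 1))
  done
  echo "$out"
}

backoff_delays 7   # prints: 10 20 40 80 160 300 300
```

Under these assumptions, a Pod that shows a high restart count, such as the 23 restarts in the earlier sample output, has been waiting the full five minutes between attempts for some time.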

The following sections help you identify why your container might be crashing.

Use the Crashlooping Pods interactive playbook

Begin troubleshooting what's causing a CrashLoopBackOff status by using the interactive playbook in the Google Cloud console:

  1. Go to the Crashlooping Pods interactive playbook:

    Go to Playbook

  2. In the Cluster drop-down list, select the cluster that you want to troubleshoot. If you can't find your cluster, enter the name of the cluster in the Filter field.

  3. In the Namespace drop-down list, select the namespace that you want to troubleshoot. If you can't find your namespace, enter the namespace in the Filter field.

  4. Work through each of the sections to help you identify the cause:

    1. Identify Application Errors
    2. Investigate Out Of Memory Issues
    3. Investigate Node Disruptions
    4. Investigate Liveness Probe Failures
    5. Correlate Change Events
  5. Optional: To get notifications about future CrashLoopBackOff errors, in the Future Mitigation Tips section, select Create an Alert.

Inspect logs

A container might crash for many reasons, and checking a Pod's logs can aid you in troubleshooting the root cause.

You can check the logs with the Google Cloud console or the kubectl command-line tool.

Console

Perform the following steps:

  1. Go to the Workloads page in the Google Cloud console.

    Go to Workloads

  2. Select the workload that you want to investigate. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click the problematic Pod.

  4. From the Pod's menu, click the Logs tab.

kubectl

  1. View all Pods running in your cluster:

    kubectl get pods
    
  2. In the output of the preceding command, look for a Pod with the CrashLoopBackOff error in the Status column.

  3. Get the Pod's logs:

    kubectl logs POD_NAME
    

    Replace POD_NAME with the name of the problematic Pod.

    You can also pass in the -p flag to get the logs for the previous instance of a Pod's container, if it exists.

Check the exit code of the crashed container

To better understand why your container crashed, find the exit code:

  1. Describe the Pod:

    kubectl describe pod POD_NAME
    

    Replace POD_NAME with the name of the problematic Pod.

  2. Review the value in the containers: CONTAINER_NAME: last state: exit code field:

    • If the exit code is 1, the container crashed because the application crashed.
    • If the exit code is 0, check how long your app was running. Containers exit when your application's main process exits. If your app finishes execution very quickly, the container might continue to restart. If you experience this error, one solution is to set the restartPolicy field to OnFailure. After you make this change, the app only restarts when the exit code isn't 0.
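
The exit-code guidance above can be sketched as a small lookup. The function name explain_exit_code is hypothetical. The handling of codes 0 and 1 follows the preceding list. The rule for codes above 128 is the general shell convention that 128 + N means the process was terminated by signal N (for example, 137 corresponds to signal 9, SIGKILL, which is often seen when a container is killed for exceeding its memory limit):

```shell
# Map a container exit code to a short troubleshooting hint.
# Codes 0 and 1 follow the guidance in this section; the 128+N rule is the
# general convention for a process terminated by signal N.
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exited normally; consider restartPolicy: OnFailure"
  elif [ "$code" -eq 1 ]; then
    echo "application error; check the container logs"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "application-defined exit code; check the container logs"
  fi
}

# Example usage against a live cluster (first container in the Pod):
#   explain_exit_code "$(kubectl get pod POD_NAME \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```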

Connect to a running container

To run bash commands from the container so that you can test the network or check if you have access to files or databases used by your application, open a shell to the Pod:

kubectl exec -it POD_NAME -- /bin/bash

If there's more than one container in your Pod, add -c CONTAINER_NAME.

Errors: ImagePullBackOff and ErrImagePull

A status of ImagePullBackOff or ErrImagePull indicates that the image used by a container cannot be loaded from the image registry.

You can verify this issue using the Google Cloud console or the kubectl command-line tool.

Console

Perform the following steps:

  1. In the Google Cloud console, go to the Workloads page.

    Go to Workloads

  2. Select the workload that you want to investigate. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click the problematic Pod.

  4. From the Pod's menu, click the Events tab.

kubectl

To get more information about a Pod's container image, run the following command:

kubectl describe pod POD_NAME

Issue: The image isn't found

If your image is not found, complete the following steps:

  1. Verify that the name of the image is correct.
  2. Verify that the tag for the image is correct. To pull the latest image, try :latest or omit the tag.
  3. If the image has a full registry path, verify that it exists in the Docker registry that you are using. If you provide only the image name, check the Docker Hub registry.
  4. In GKE Standard clusters, try to pull the Docker image manually:

    1. Use SSH to connect to the node:

      For example, to use SSH to connect to a VM, run the following command:

      gcloud compute ssh VM_NAME --zone=ZONE_NAME
      

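      Replace the following:

      • VM_NAME: the name of the node's Compute Engine VM.
      • ZONE_NAME: the Compute Engine zone of the VM.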

    2. Generate a config file at /home/[USER]/.docker/config.json:

      docker-credential-gcr configure-docker
      

      Ensure that the config file at /home/[USER]/.docker/config.json includes the registry of the image in the credHelpers field. For example, the following file includes authentication information for images hosted at asia.gcr.io, eu.gcr.io, gcr.io, marketplace.gcr.io, and us.gcr.io:

      {
      "auths": {},
      "credHelpers": {
        "asia.gcr.io": "gcr",
        "eu.gcr.io": "gcr",
        "gcr.io": "gcr",
        "marketplace.gcr.io": "gcr",
        "us.gcr.io": "gcr"
      }
      }
      
    3. Try to pull the image:

      docker pull IMAGE_NAME
      

    If pulling the image manually works, you probably need to specify ImagePullSecrets on a Pod. Pods can only reference image pull Secrets in their own namespace, so this process needs to be done one time per namespace.
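
When you verify the image name, tag, and registry in steps 1 to 3, it can help to see what an image reference resolves to. The following sketch assumes the common container runtime convention: a reference whose first path component isn't a hostname resolves to Docker Hub (docker.io), and a missing tag means latest. The function names image_registry and image_tag are hypothetical, and the sketch doesn't validate the reference:

```shell
# Print the registry host that an image reference resolves to.
# Assumption: the first path component is a registry only if it contains
# a dot or a colon, or is exactly "localhost"; otherwise Docker Hub is used.
image_registry() {
  local ref="$1" first
  case "$ref" in
    */*) first="${ref%%/*}" ;;
    *)   first="" ;;
  esac
  case "$first" in
    *.*|*:*|localhost) echo "$first" ;;
    *)                 echo "docker.io" ;;
  esac
}

# Print the tag of an image reference, or "latest" if no tag is given.
image_tag() {
  local ref="${1%%@*}"      # drop any @sha256:... digest
  local last="${ref##*/}"   # keep only the final path component
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}

image_registry "gcr.io/PROJECT_ID/IMAGE:TAG"   # prints: gcr.io
image_tag "nginx"                              # prints: latest
```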

Error: Permission denied

If you encounter a "permission denied" or "no pull access" error, verify that you are logged in and have access to the image. Try one of the following methods depending on the registry in which you host your images.

Artifact Registry

If your image is in Artifact Registry, your node pool's service account needs read access to the repository that contains the image.

Grant the artifactregistry.reader role to the service account:

gcloud artifacts repositories add-iam-policy-binding REPOSITORY_NAME \
    --location=REPOSITORY_LOCATION \
    --member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
    --role="roles/artifactregistry.reader"

Replace the following:

  • REPOSITORY_NAME: the name of your Artifact Registry repository.
  • REPOSITORY_LOCATION: the region of your Artifact Registry repository.
  • SERVICE_ACCOUNT_EMAIL: the email address of the IAM service account associated with your node pool.

Container Registry

If your image is in Container Registry, your node pool's service account needs read access to the Cloud Storage bucket that contains the image.

Grant the roles/storage.objectViewer role to the service account so that it can read from the bucket:

gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
    --role=roles/storage.objectViewer

Replace the following:

  • SERVICE_ACCOUNT_EMAIL: the email of the service account associated with your node pool. You can list all the service accounts in your project using gcloud iam service-accounts list.
  • BUCKET_NAME: the name of the Cloud Storage bucket that contains your images. You can list all the buckets in your project using gcloud storage ls.

If your registry administrator set up gcr.io repositories in Artifact Registry to store images for the gcr.io domain instead of Container Registry, you must grant read access to Artifact Registry instead of Container Registry.

Private registry

If your image is in a private registry, you might require keys to access the images. For more information, see Using private registries in the Kubernetes documentation.

Error 401 Unauthorized: Cannot pull images from private container registry repository

An error similar to the following might occur when you pull an image from a private Container Registry repository:

gcr.io/PROJECT_ID/IMAGE:TAG: rpc error: code = Unknown desc = failed to pull and
unpack image gcr.io/PROJECT_ID/IMAGE:TAG: failed to resolve reference
gcr.io/PROJECT_ID/IMAGE:TAG: unexpected status code [manifests 1.0]: 401 Unauthorized

Warning  Failed     3m39s (x4 over 5m12s)  kubelet            Error: ErrImagePull
Warning  Failed     3m9s (x6 over 5m12s)   kubelet            Error: ImagePullBackOff
Normal   BackOff    2s (x18 over 5m12s)    kubelet            Back-off pulling image

To resolve the error, complete the following steps:

  1. Identify the node running the Pod:

    kubectl describe pod POD_NAME | grep "Node:"
    
  2. Verify that the node you identified in the previous step has the storage scope:

    gcloud compute instances describe NODE_NAME \
        --zone=COMPUTE_ZONE --format="flattened(serviceAccounts[].scopes)"
    

    The node's access scope should contain at least one of the following scopes:

    serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/devstorage.read_only
    serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/cloud-platform
    

    If the node doesn't have one of these scopes, continue to the next step.

  3. Recreate the node pool that the node belongs to with sufficient scope. You can't modify the scopes of existing nodes, so you must create a new node pool with the correct scope.

    • Recommended: Create a new node pool with the gke-default scope:

      gcloud container node-pools create NODE_POOL_NAME \
          --cluster=CLUSTER_NAME \
          --zone=COMPUTE_ZONE \
          --scopes="gke-default"
      
    • Create a new node pool with only storage scope:

      gcloud container node-pools create NODE_POOL_NAME \
          --cluster=CLUSTER_NAME \
          --zone=COMPUTE_ZONE \
          --scopes="https://www.googleapis.com/auth/devstorage.read_only"
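
The scope check in step 2 can be scripted. The following sketch reads the flattened scope list from standard input and succeeds only if it contains one of the two scopes named in that step. The function name has_pull_scope is hypothetical, and the sketch deliberately ignores any other scope:

```shell
# Hypothetical helper: exit with status 0 if the flattened scope list read
# from standard input contains the devstorage.read_only or cloud-platform
# scope, and with a non-zero status otherwise.
has_pull_scope() {
  grep -Eq 'auth/(devstorage\.read_only|cloud-platform)$'
}

# Example usage against a live project:
#   gcloud compute instances describe NODE_NAME \
#       --zone=COMPUTE_ZONE --format="flattened(serviceAccounts[].scopes)" \
#     | has_pull_scope || echo "node is missing a storage scope"
```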
      

Error: Pod unschedulable

A status of PodUnschedulable indicates that your Pod cannot be scheduled because of insufficient resources or some configuration error.

If you have configured control plane metrics, you can find more information about these errors in scheduler metrics and API server metrics.

Use the unschedulable Pods interactive playbook

You can troubleshoot PodUnschedulable errors using the interactive playbook in the Google Cloud console:

  1. Go to the unschedulable Pods interactive playbook:

    Go to Playbook

  2. In the Cluster drop-down list, select the cluster that you want to troubleshoot. If you can't find your cluster, enter the name of the cluster in the Filter field.

  3. In the Namespace drop-down list, select the namespace that you want to troubleshoot. If you can't find your namespace, enter the namespace in the Filter field.

  4. To help you identify the cause, work through each of the sections in the playbook:

    1. Investigate CPU and Memory
    2. Investigate Max Pods per Node
    3. Investigate Autoscaler Behavior
    4. Investigate Other Failure Modes
    5. Correlate Change Events
  5. Optional: To get notifications about future PodUnschedulable errors, in the Future Mitigation Tips section, select Create an Alert.

Error: Insufficient resources

You might encounter an error indicating a lack of CPU, memory, or another resource. For example, the message No nodes are available that match all of the predicates: Insufficient cpu (2) indicates that, on two nodes, there isn't enough CPU available to fulfill a Pod's requests.

If your Pod resource requests exceed that of a single node from any eligible node pools, GKE does not schedule the Pod and also does not trigger scale up to add a new node. For GKE to schedule the Pod, you must either request fewer resources for the Pod, or create a new node pool with sufficient resources.

You can also enable node auto-provisioning so that GKE can automatically create node pools with nodes where the unscheduled Pods can run.

The default CPU request is 100m or 10% of a CPU (or one core). If you want to request more or fewer resources, specify the value in the Pod specification under spec: containers: resources: requests.
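
CPU quantities can be written in millicores (100m) or in cores (0.1), which makes requests hard to compare by eye. The following sketch normalizes a CPU quantity to millicores and checks whether a request fits within a node's allocatable CPU. The function names to_millicores and request_fits are hypothetical, and the sketch handles only plain numbers and the m suffix:

```shell
# Convert a Kubernetes CPU quantity to millicores.
# "100m" -> 100, "2" -> 2000, "0.5" -> 500
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Succeed if the requested CPU ($1) is no larger than the allocatable CPU ($2).
request_fits() {
  [ "$(to_millicores "$1")" -le "$(to_millicores "$2")" ]
}

request_fits 500m 2 && echo "request fits on the node"
```

If the request doesn't fit on a single node in any eligible node pool, lowering the request or adding a larger node pool are the options described earlier in this section.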

Error: MatchNodeSelector

MatchNodeSelector indicates that there are no nodes that match the Pod's label selector.

To verify this, check the labels specified in the Pod specification's nodeSelector field, under spec: nodeSelector.

To see how nodes in your cluster are labeled, run the following command:

kubectl get nodes --show-labels

To attach a label to a node, run the following command:

kubectl label nodes NODE_NAME LABEL_KEY=LABEL_VALUE

Replace the following:

  • NODE_NAME: the node that you want to add a label to.
  • LABEL_KEY: the label's key.
  • LABEL_VALUE: the label's value.

For more information, refer to Assigning Pods to Nodes in the Kubernetes documentation.
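
To check quickly whether any node carries the label that a nodeSelector expects, you can filter the --show-labels output. The following sketch assumes the default column layout, in which the comma-separated labels are the last column. The function name nodes_with_label is hypothetical:

```shell
# Hypothetical helper: print the name of every node whose label list
# contains exactly the given KEY=VALUE pair.
# Reads the output of `kubectl get nodes --show-labels` from standard input.
nodes_with_label() {
  awk -v want="$1" '
    NR > 1 {
      n = split($NF, labels, ",")
      for (i = 1; i <= n; i++) {
        if (labels[i] == want) { print $1; break }
      }
    }'
}

# Example usage against a live cluster:
#   kubectl get nodes --show-labels | nodes_with_label LABEL_KEY=LABEL_VALUE
```

If the function prints nothing, no node matches the selector. The built-in equivalent is kubectl get nodes -l LABEL_KEY=LABEL_VALUE.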

Error: PodToleratesNodeTaints

PodToleratesNodeTaints indicates that the Pod can't be scheduled to any node because the Pod doesn't have tolerations that correspond to existing node taints.

To verify that this is the case, run the following command:

kubectl describe nodes NODE_NAME

In the output, check the Taints field, which lists key-value pairs and scheduling effects.

If the effect listed is NoSchedule, then no Pod can be scheduled on that node unless it has a matching toleration.

One way to resolve this issue is to remove the taint. For example, to remove a NoSchedule taint, run the following command:

kubectl taint nodes NODE_NAME key:NoSchedule-

Error: PodFitsHostPorts

PodFitsHostPorts indicates that a host port that the Pod requests is already in use on the node.

To resolve this issue, check the Pod specification's hostPort value under spec: containers: ports: hostPort. You might need to change this value to another port.

Error: Does not have minimum availability

If a node has adequate resources but you still see the Does not have minimum availability message, check the node's status. If the status is SchedulingDisabled or Cordoned, new Pods can't be scheduled on the node. You can check the status of a node using the Google Cloud console or the kubectl command-line tool.

Console

Perform the following steps:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. Select the cluster that you want to investigate. The Nodes tab displays the Nodes and their status.

To enable scheduling on the node, perform the following steps:

  1. From the list, click the node that you want to investigate.

  2. From the Node Details section, click Uncordon.

kubectl

To get statuses of your nodes, run the following command:

kubectl get nodes

To enable scheduling on the node, run:

kubectl uncordon NODE_NAME
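
In a large cluster, it can help to list only the cordoned nodes. The following sketch assumes the default kubectl get nodes layout, in which STATUS is the second column and a cordoned node shows a value such as Ready,SchedulingDisabled. The function name cordoned_nodes is hypothetical:

```shell
# Hypothetical helper: print the name of every node whose STATUS column
# contains SchedulingDisabled.
# Reads the output of `kubectl get nodes` from standard input.
cordoned_nodes() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Example usage against a live cluster:
#   kubectl get nodes | cordoned_nodes
```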

Error: Maximum Pods per node limit reached

If the Maximum Pods per node limit is reached by all nodes in the cluster, the Pods will be stuck in Unschedulable state. Under the Pod Events tab, you see a message including the phrase Too many pods.

To resolve this error, complete the following steps:

  1. Check the Maximum pods per node configuration from the Nodes tab in GKE cluster details in the Google Cloud console.

  2. Get a list of nodes:

    kubectl get nodes
    
  3. For each node, verify the number of Pods running on the node:

    kubectl get pods --all-namespaces -o wide | grep NODE_NAME | wc -l
    
  4. If the limit is reached, add a new node pool or add additional nodes to the existing node pool.
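
Instead of repeating step 3 for each node, you can count Pods for all nodes in one pass. The following sketch reads one node name per line from standard input, which is what kubectl prints when you request only the .spec.nodeName field, and prints a count per node. The function name pods_per_node is hypothetical. Pods that aren't yet assigned to a node appear as <none> and are skipped:

```shell
# Hypothetical helper: count how many Pods are assigned to each node.
# Reads one node name per line and prints "NODE COUNT" pairs sorted by name.
pods_per_node() {
  awk '$1 != "<none>" { count[$1]++ }
       END { for (n in count) print n, count[n] }' | sort
}

# Example usage against a live cluster:
#   kubectl get pods --all-namespaces --no-headers \
#       -o custom-columns=NODE:.spec.nodeName | pods_per_node
```

Compare each count with the Maximum pods per node value from step 1.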

Issue: Maximum node pool size reached with cluster autoscaler enabled

If the node pool has reached its maximum size according to its cluster autoscaler configuration, GKE does not trigger scale up for the Pod that would otherwise be scheduled with this node pool. If you want the Pod to be scheduled with this node pool, change the cluster autoscaler configuration.

Issue: Maximum node pool size reached with cluster autoscaler disabled

If the node pool has reached its maximum number of nodes, and cluster autoscaler is disabled, GKE cannot schedule the Pod with the node pool. Increase the size of your node pool or enable cluster autoscaler for GKE to resize your cluster automatically.

Error: Unbound PersistentVolumeClaims

Unbound PersistentVolumeClaims indicates that the Pod references a PersistentVolumeClaim that is not bound. This error might happen if your PersistentVolume failed to provision. You can verify that provisioning failed by getting the events for your PersistentVolumeClaim and examining them for failures.

To get events, run the following command:

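kubectl describe pvc PVC_NAME-STATEFULSET_NAME-0

Replace the following:

  • PVC_NAME: the name of the volume claim template that the StatefulSet defines. Kubernetes names each claim that a StatefulSet creates by joining the template name, the StatefulSet name, and the Pod's ordinal.
  • STATEFULSET_NAME: the name of the StatefulSet object.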

This can also happen if there was a configuration error during your manual pre-provisioning of a PersistentVolume and its binding to a PersistentVolumeClaim.

To resolve this error, try to pre-provision the volume again.

Error: Insufficient quota

Verify that your project has sufficient Compute Engine quota for GKE to scale up your cluster. If GKE attempts to add a node to your cluster to schedule the Pod, and scaling up would exceed your project's available quota, you receive the scale.up.error.quota.exceeded error message.

To learn more, see ScaleUp errors.

Issue: Deprecated APIs

Ensure that you are not using deprecated APIs that are removed with your cluster's minor version. To learn more, see GKE deprecations.

What's next

If you need additional assistance, reach out to Cloud Customer Care.