Troubleshoot issues with deployed workloads


This page shows you how to resolve errors with your deployed workloads in Google Kubernetes Engine (GKE).

For more general advice about troubleshooting your applications, see Troubleshooting Applications in the Kubernetes documentation.

All errors: Check Pod status

If there are issues with a workload's Pods, Kubernetes updates the Pod status with an error message. View these errors by checking the status of a Pod using the Google Cloud console or the kubectl command-line tool.

Console

Perform the following steps:

  1. In the Google Cloud console, go to the Workloads page.

    Go to Workloads

  2. Select the workload that you want to investigate. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click any error status message.

kubectl

To see all Pods running in your cluster, run the following command:

kubectl get pods

The output is similar to the following:

NAME       READY  STATUS             RESTARTS  AGE
POD_NAME   0/1    CrashLoopBackOff   23        8d

Potential errors are listed in the Status column.
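In a cluster with many Pods, it can help to filter the list down to the Pods that need attention. The following is a minimal sketch, not an official tool: unhealthy_pods is a hypothetical helper name, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE) with the header row suppressed.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name and status of every Pod whose STATUS column (field 3)
# is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example usage (requires access to a cluster):
#   kubectl get pods --no-headers | unhealthy_pods
```

This sketch only inspects the STATUS column. A Pod that reports Running but isn't ready (for example, READY 0/1) is not flagged.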

To get more detailed information about a specific Pod, run the following command:

kubectl describe pod POD_NAME

Replace POD_NAME with the name of the Pod that you want to investigate.

In the output, the Events field shows more information about errors.

If you'd like more information, view the container logs:

kubectl logs POD_NAME

These logs can help you identify if a command or code in the container caused the Pod to crash.

After you identify the error, use the following sections to try to resolve the issue.

Error: CrashLoopBackOff

A status of CrashLoopBackOff doesn't point to a specific error. Instead, it indicates that a container is repeatedly crashing after restarting. When a container crashes or exits shortly after starting (CrashLoop), Kubernetes attempts to restart the container. With each failed restart, the delay (BackOff) before the next attempt increases exponentially (10s, 20s, 40s, and so on), up to a maximum of five minutes.
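To illustrate how the delay grows, the following sketch prints the wait before each restart attempt. It assumes only the schedule described above (start at 10 seconds, double each time, cap at 300 seconds); backoff_delays is a hypothetical helper name, and the kubelet's actual timing can differ slightly.

```shell
# Hypothetical helper: print the restart delays, in seconds, for the first
# N restart attempts. Starts at 10s, doubles each time, capped at 300s.
backoff_delays() {
  attempts="$1"
  delay=10
  out=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    out="${out:+$out }$delay"      # append the current delay to the list
    delay=$((delay * 2))           # double the delay for the next attempt
    if [ "$delay" -gt 300 ]; then  # cap at five minutes
      delay=300
    fi
    i=$((i + 1))
  done
  echo "$out"
}

backoff_delays 7
```

Under these assumptions, the cap is reached at the sixth restart, so a Pod with a high restart count is retried about once every five minutes.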

The following sections help you identify why your container might be crashing.

Use the Crashlooping Pods interactive playbook

Begin troubleshooting what's causing a CrashLoopBackOff status by using the interactive playbook in the Google Cloud console:

  1. Go to the Crashlooping Pods interactive playbook:

    Go to Playbook

  2. In the Cluster drop-down list, select the cluster that you want to troubleshoot. If you can't find your cluster, enter the name of the cluster in the Filter field.

  3. In the Namespace drop-down list, select the namespace that you want to troubleshoot. If you can't find your namespace, enter the namespace in the Filter field.

  4. Work through each of the sections to help you identify the cause:

    1. Identify Application Errors
    2. Investigate Out Of Memory Issues
    3. Investigate Node Disruptions
    4. Investigate Liveness Probe Failures
    5. Correlate Change Events
  5. Optional: To get notifications about future CrashLoopBackOff errors, in the Future Mitigation Tips section, select Create an Alert.

Inspect logs

A container might crash for many reasons, and checking a Pod's logs can aid you in troubleshooting the root cause.

You can check the logs with the Google Cloud console or the kubectl command-line tool.

Console

Perform the following steps:

  1. Go to the Workloads page in the Google Cloud console.

    Go to Workloads

  2. Select the workload that you want to investigate. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click the problematic Pod.

  4. From the Pod's menu, click the Logs tab.

kubectl

  1. View all Pods running in your cluster:

    kubectl get pods
    
  2. In the output of the preceding command, look for a Pod with the CrashLoopBackOff error in the Status column.

  3. Get the Pod's logs:

    kubectl logs POD_NAME
    

    Replace POD_NAME with the name of the problematic Pod.

    You can also pass in the -p flag to get the logs for the previous instance of a Pod's container, if it exists.

Check the exit code of the crashed container

To better understand why your container crashed, find the exit code:

  1. Describe the Pod:

    kubectl describe pod POD_NAME
    

    Replace POD_NAME with the name of the problematic Pod.

  2. Review the value in the containers: CONTAINER_NAME: last state: exit code field:

    • If the exit code is 1, the container crashed because the application crashed.
    • If the exit code is 0, check how long your app was running. Containers exit when your application's main process exits. If your app finishes execution very quickly, the container might continue to restart. If you experience this error, one solution is to set the restartPolicy field to OnFailure. After you make this change, the app only restarts when the exit code isn't 0.
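If you prefer to read the exit code without scanning the full describe output, the following sketch shows one possible approach. Both function names are hypothetical. The first function queries the lastState.terminated.exitCode field of each container status. The second applies a common convention that this page doesn't cover: exit codes above 128 usually mean that the process was terminated by a signal numbered (code - 128), so 137 corresponds to signal 9 (SIGKILL).

```shell
# Hypothetical helper: print each container's name and the exit code of its
# previous run. Requires access to a cluster.
last_exit_codes() {
  kubectl get pod "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Hypothetical helper: give a rough interpretation of an exit code.
# Assumes the shell convention that codes above 128 mean "killed by signal".
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "application error"
  fi
}

# Example usage:
#   last_exit_codes POD_NAME
#   explain_exit_code 137
```

Treat the interpretation as a starting point only. An application can return any exit code it chooses, so confirm the cause in the container logs.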

Connect to a running container

To run bash commands from the container so that you can test the network or check if you have access to files or databases used by your application, open a shell to the Pod:

kubectl exec -it POD_NAME -- /bin/bash

If there's more than one container in your Pod, add -c CONTAINER_NAME.

Errors: ImagePullBackOff and ErrImagePull

A status of ImagePullBackOff or ErrImagePull indicates that the image used by a container cannot be loaded from the image registry.

For guidance on troubleshooting these statuses, see Troubleshoot image pulls.

Error: Pod unschedulable

A status of PodUnschedulable indicates that your Pod cannot be scheduled because of insufficient resources or some configuration error.

If you have configured control plane metrics, you can find more information about these errors in scheduler metrics and API server metrics.

Use the unschedulable Pods interactive playbook

You can troubleshoot PodUnschedulable errors using the interactive playbook in the Google Cloud console:

  1. Go to the unschedulable Pods interactive playbook:

    Go to Playbook

  2. In the Cluster drop-down list, select the cluster that you want to troubleshoot. If you can't find your cluster, enter the name of the cluster in the Filter field.

  3. In the Namespace drop-down list, select the namespace that you want to troubleshoot. If you can't find your namespace, enter the namespace in the Filter field.

  4. To help you identify the cause, work through each of the sections in the playbook:

    1. Investigate CPU and Memory
    2. Investigate Max Pods per Node
    3. Investigate Autoscaler Behavior
    4. Investigate Other Failure Modes
    5. Correlate Change Events
  5. Optional: To get notifications about future PodUnschedulable errors, in the Future Mitigation Tips section, select Create an Alert.

Error: Insufficient resources

You might encounter an error indicating a lack of CPU, memory, or another resource. For example, the message No nodes are available that match all of the predicates: Insufficient cpu (2) indicates that, on two nodes, there isn't enough CPU available to fulfill a Pod's requests.

If your Pod's resource requests exceed the capacity of a single node in any eligible node pool, GKE doesn't schedule the Pod and also doesn't trigger a scale-up to add a new node. For GKE to schedule the Pod, you must either request fewer resources for the Pod, or create a new node pool with sufficient resources.

You can also enable node auto-provisioning so that GKE can automatically create node pools with nodes where the unscheduled Pods can run.

The default CPU request is 100m, which is 10% of one CPU core. If you want to request more or fewer resources, specify the value in the Pod specification under spec: containers: resources: requests.
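CPU quantities can be written either in millicores (for example, 100m) or in cores (for example, 0.1), which makes requests hard to compare at a glance. The following sketch normalizes both notations to millicores. The cpu_to_millicores name is hypothetical, and the function handles only these two notations.

```shell
# Hypothetical helper: convert a Kubernetes CPU quantity to millicores.
# Accepts millicore notation ("100m") or core notation ("0.1", "2").
cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;   # already millicores: strip the suffix
    *)  awk -v cores="$1" 'BEGIN { printf "%d\n", cores * 1000 + 0.5 }' ;;  # cores x 1000, rounded
  esac
}

cpu_to_millicores 100m
cpu_to_millicores 0.1
```

Both calls print 100, which shows that a request of 100m and a request of 0.1 describe the same amount of CPU.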

Error: MatchNodeSelector

MatchNodeSelector indicates that there are no nodes that match the Pod's label selector.

To verify this, check the labels specified in the Pod specification's nodeSelector field, under spec: nodeSelector.

To see how nodes in your cluster are labeled, run the following command:

kubectl get nodes --show-labels

To attach a label to a node, run the following command:

kubectl label nodes NODE_NAME LABEL_KEY=LABEL_VALUE

Replace the following:

  • NODE_NAME: the node that you want to add a label to.
  • LABEL_KEY: the label's key.
  • LABEL_VALUE: the label's value.

For more information, refer to Assigning Pods to Nodes in the Kubernetes documentation.
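The --show-labels output can be long, so the following sketch filters it down to the nodes that carry one specific label. The nodes_with_label name is hypothetical. The function assumes the default output layout, where the first row is a header and LABELS is the last column, formatted as a comma-separated list of key=value pairs.

```shell
# Hypothetical helper: reads `kubectl get nodes --show-labels` output on stdin
# and prints the names of nodes that carry the label given as $1 (key=value).
nodes_with_label() {
  awk -v want="$1" 'NR > 1 {
    n = split($NF, labels, ",")          # LABELS is the last column
    for (i = 1; i <= n; i++) {
      if (labels[i] == want) { print $1; break }
    }
  }'
}

# Example usage (requires access to a cluster):
#   kubectl get nodes --show-labels | nodes_with_label LABEL_KEY=LABEL_VALUE
```

If the function prints nothing for a label that appears in the Pod's nodeSelector field, no node currently satisfies that selector.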

Error: PodToleratesNodeTaints

PodToleratesNodeTaints indicates that the Pod can't be scheduled to any node because the Pod doesn't have tolerations that correspond to existing node taints.

To verify that this is the case, run the following command:

kubectl describe nodes NODE_NAME

In the output, check the Taints field, which lists key-value pairs and scheduling effects.

If the effect listed is NoSchedule, then no Pod can be scheduled on that node unless it has a matching toleration.

One way to resolve this issue is to remove the taint. For example, to remove a NoSchedule taint, run the following command:

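kubectl taint nodes NODE_NAME TAINT_KEY:NoSchedule-

Replace TAINT_KEY with the key of the taint that you want to remove, as shown in the Taints field.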

Error: PodFitsHostPorts

The PodFitsHostPorts error means that a Pod is requesting a host port that's already in use on the node.

To resolve the issue, consider following Kubernetes best practices and use a NodePort instead of a hostPort.

If you must use a hostPort, check the manifests of the Pods and make sure that all Pods on the same node have unique values defined for hostPort.

Error: Does not have minimum availability

If a node has adequate resources but you still see the Does not have minimum availability message, check the node's status. If the status is SchedulingDisabled or Cordoned, new Pods can't be scheduled on the node. You can check the status of a node using the Google Cloud console or the kubectl command-line tool.

Console

Perform the following steps:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. Select the cluster that you want to investigate. The Nodes tab displays the Nodes and their status.

To enable scheduling on the node, perform the following steps:

  1. From the list, click the node that you want to investigate.

  2. From the Node Details section, click Uncordon.

kubectl

To get statuses of your nodes, run the following command:

kubectl get nodes

To enable scheduling on the node, run:

kubectl uncordon NODE_NAME
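To find cordoned nodes quickly in a large cluster, you can filter the node list by status. The following is a minimal sketch: cordoned_nodes is a hypothetical helper name, and it assumes the default column layout of kubectl get nodes, where STATUS is the second column and a cordoned node reports a value such as Ready,SchedulingDisabled.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the names of nodes whose STATUS includes SchedulingDisabled.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Example usage (requires access to a cluster):
#   kubectl get nodes --no-headers | cordoned_nodes
```

Each node that the function prints is a candidate for the kubectl uncordon command, provided that the node wasn't cordoned on purpose (for example, for maintenance).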

Error: Maximum Pods per node limit reached

If all nodes in the cluster have reached the Maximum Pods per node limit, new Pods are stuck in the Unschedulable state. On the Pod's Events tab, you see a message that includes the phrase Too many pods.

To resolve this error, complete the following steps:

  1. Check the Maximum pods per node configuration from the Nodes tab in GKE cluster details in the Google Cloud console.

  2. Get a list of nodes:

    kubectl get nodes
    
  3. For each node, count the Pods scheduled on the node across all namespaces, because Pods in every namespace count toward the limit:

    kubectl get pods --all-namespaces -o wide | grep NODE_NAME | wc -l
    
  4. If the limit is reached, add a new node pool or add additional nodes to the existing node pool.
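Instead of repeating the count for every node, you can tally all nodes in one pass. The following sketch is one way to do that. The pods_per_node name is hypothetical, and the function expects one node name per line on stdin, which the custom-columns output format of kubectl can produce from the spec.nodeName field.

```shell
# Hypothetical helper: reads one node name per line on stdin and prints
# "NODE COUNT" for each node, sorted by node name.
pods_per_node() {
  awk 'NF { count[$1]++ } END { for (node in count) print node, count[node] }' | sort
}

# Example usage (requires access to a cluster):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | pods_per_node
```

Compare each count with the Maximum Pods per node value from step 1. Pods that haven't been scheduled yet appear under the placeholder that kubectl prints for an empty node name, not under a real node.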

Issue: Maximum node pool size reached with cluster autoscaler enabled

If the node pool has reached its maximum size according to its cluster autoscaler configuration, GKE does not trigger scale up for the Pod that would otherwise be scheduled with this node pool. If you want the Pod to be scheduled with this node pool, change the cluster autoscaler configuration.

Issue: Maximum node pool size reached with cluster autoscaler disabled

If the node pool has reached its maximum number of nodes, and cluster autoscaler is disabled, GKE cannot schedule the Pod with the node pool. Increase the size of your node pool or enable cluster autoscaler for GKE to resize your cluster automatically.

Error: Unbound PersistentVolumeClaims

Unbound PersistentVolumeClaims indicates that the Pod references a PersistentVolumeClaim that is not bound. This error might happen if your PersistentVolume failed to provision. You can verify that provisioning failed by getting the events for your PersistentVolumeClaim and examining them for failures.

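To get events, run the following command:

kubectl describe pvc PVC_NAME

Replace PVC_NAME with the name of the PersistentVolumeClaim object. For a claim that a StatefulSet creates from its volumeClaimTemplates field, the name has the form TEMPLATE_NAME-STATEFULSET_NAME-ORDINAL, where TEMPLATE_NAME is the name of the volume claim template, STATEFULSET_NAME is the name of the StatefulSet object, and ORDINAL is the index of the Pod, starting at 0.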

This can also happen if there was a configuration error during your manual pre-provisioning of a PersistentVolume and its binding to a PersistentVolumeClaim.

To resolve this error, try to pre-provision the volume again.

Error: Insufficient quota

Verify that your project has sufficient Compute Engine quota for GKE to scale up your cluster. If GKE attempts to add a node to your cluster to schedule the Pod, and scaling up would exceed your project's available quota, you receive the scale.up.error.quota.exceeded error message.

To learn more, see ScaleUp errors.

Issue: Deprecated APIs

Ensure that you are not using deprecated APIs that are removed with your cluster's minor version. To learn more, see GKE deprecations.

Error: Didn't have free ports for the requested Pod ports

If you see an error similar to the following, you likely have multiple Pods on the same node with the same value defined in the hostPort field:

0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Binding a Pod to a hostPort limits where GKE can schedule the Pod because each hostIP, hostPort, and protocol combination must be unique.

To resolve the issue, consider following Kubernetes best practices and using a NodePort instead of a hostPort.

If you must use a hostPort, check the manifests of the Pods and make sure that all Pods on the same node have unique values defined for hostPort.

What's next

If you need additional assistance, reach out to Cloud Customer Care.