This page helps you resolve issues with Pods experiencing CrashLoopBackOff
events in Google Kubernetes Engine (GKE).
This page is for Application developers who want to identify app-level issues, such as configuration errors or code-related bugs, that cause their containers to crash. It is also for Platform admins and operators who need to identify platform-level root causes for container restarts, such as resource exhaustion, node disruptions, or misconfigured liveness probes. For more information about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
Understand a CrashLoopBackOff event
When your Pod is stuck in a CrashLoopBackOff state, a container within it is repeatedly starting and crashing or exiting. This CrashLoop triggers Kubernetes to attempt restarting the container by adhering to its restartPolicy. With each failed restart, the BackOff delay before the next attempt increases exponentially (for example, 10s, 20s, 40s), up to a maximum of five minutes.
Although this event indicates a problem within your container, it's also a
valuable diagnostic signal. A CrashLoopBackOff
event confirms that many
foundational steps of Pod creation, such as assignment to a node and pulling the
container image, have already completed. This knowledge lets you focus your
investigation on the container's app or configuration, rather than the cluster
infrastructure.
The CrashLoopBackOff
state occurs because of how Kubernetes, specifically the
kubelet, handles container termination based on the Pod's
restart policy.
The cycle typically follows this pattern:
- The container starts.
- The container exits.
- The kubelet observes the stopped container and restarts it according to the Pod's restartPolicy.
- This cycle repeats, with the container restarted after an increasing exponential back-off delay.
The Pod's restartPolicy is the key to this behavior. The default policy, Always, is the most common cause of this loop because it restarts a container if it exits for any reason, even after a successful exit. The OnFailure policy is less likely to cause a loop because it only restarts on non-zero exit codes, and the Never policy avoids a restart entirely.
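The restartPolicy field is set at the Pod level. The following manifest is a minimal sketch with placeholder names that shows where the field lives; note that Pods managed by a Deployment, StatefulSet, or ReplicaSet only support the Always policy:

apiVersion: v1
kind: Pod
metadata:
  name: example-app             # placeholder name
spec:
  restartPolicy: Always         # default value; restarts the container after any exit, even exit code 0
  containers:
  - name: app
    image: example-image:v1.2.3   # placeholder image reference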
Identify symptoms of a CrashLoopBackOff event
A Pod with the CrashLoopBackOff
status is the primary indication of a
CrashLoopBackOff
event.
However, you might experience some less obvious symptoms of a CrashLoopBackOff
event:
- Zero healthy replicas for a workload.
- A sharp decrease in healthy replicas.
- Workloads with horizontal Pod autoscaling enabled are scaling slowly or failing to scale.
If a system
workload (for example, a logging or metrics agent) has the
CrashLoopBackOff
status, you might also notice the following symptoms:
- Some GKE metrics aren't reported.
- Some GKE dashboards and graphs have gaps.
- Connectivity issues with Pod-level networking.
If you observe any of these less obvious symptoms, your next step should be to
confirm if a CrashLoopBackOff
event occurred.
Confirm a CrashLoopBackOff event
To confirm and investigate a CrashLoopBackOff
event, gather evidence from
Kubernetes events and the container's app logs. These two sources provide
different, but complementary views of the problem:
- Kubernetes events confirm that a Pod is crashing.
- The container's app logs can show you why the process inside the container is failing.
To view this information, select one of the following options:
Console
To view Kubernetes events and app logs, do the following:
In the Google Cloud console, go to the Workloads page.
Select the workload that you want to investigate. The Overview or Details tab displays more information about the status of the workload.
From the Managed Pods section, click the name of the problematic Pod.
On the Pod details page, investigate the following:
- To see details about Kubernetes events, go to the Events tab.
- To view the container's app logs, go to the Logs tab. This page is where you find app-specific error messages or stack traces.
kubectl
To view Kubernetes events and app logs, do the following:
View the status of all Pods running in your cluster:
kubectl get pods
The output is similar to the following:
NAME       READY   STATUS             RESTARTS   AGE
POD_NAME   0/1     CrashLoopBackOff   23         8d
In the output, review the following columns:
- Ready: review how many containers are ready. In this example, 0/1 indicates that zero out of one expected container is in a ready state. This value is a clear sign of a problem.
- Status: look for Pods with a status of CrashLoopBackOff.
- Restarts: a high value indicates that Kubernetes is repeatedly trying and failing to start the container.
After you identify a failing Pod, describe it to see cluster-level events that are related to the Pod's state:
kubectl describe pod POD_NAME -n NAMESPACE_NAME
Replace the following:
POD_NAME
: the name of the Pod that you identified in the output of thekubectl get
command.NAMESPACE_NAME
: the namespace of the Pod.
The output is similar to the following:
Containers:
  container-name:
    ...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: context deadline exceeded: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 00:00:00 +0000
      Finished:     Fri, 27 Jun 2025 16:20:03 +0000
    Ready:          False
    Restart Count:  3459
    ...
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
...
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  Failed   12m (x216 over 25h)     kubelet  Error: context deadline exceeded
  Warning  Failed   8m34s (x216 over 25h)   kubelet  Error: context deadline exceeded
  Warning  BackOff  4m24s (x3134 over 25h)  kubelet  Back-off restarting failed container container-name in pod failing-pod(11111111-2222-3333-4444-555555555555)
In the output, review the following fields for signs of a CrashLoopBackOff event:

- State: the state of the container likely shows Waiting with the reason CrashLoopBackOff.
- Last State: the state of the previously terminated container. Look for a Terminated status and review the exit code to see if there was a crash (non-zero exit code) or an unexpected successful exit (zero exit code).
- Events: actions taken by the cluster itself. Look for messages about the container being started, followed by liveness probe failures or back-off warnings like Back-off restarting failed container.
To learn more about why the Pod failed, view its app logs:
kubectl logs POD_NAME --previous
The --previous flag retrieves logs from the prior, terminated container, which is where you can find the specific stack trace or error message that reveals the cause of the crash. The current container might be too new to have recorded any logs.

In the output, look for app-specific errors that would cause the process to exit. If you use a custom-made app, the developers who wrote it are best equipped to interpret these error messages. If you use a prebuilt app, these apps often provide their own debugging instructions.
Use the Crashlooping Pods interactive playbook
After you confirm a CrashLoopBackOff
event, begin troubleshooting with
the interactive playbook:
In the Google Cloud console, go to the GKE Interactive Playbook - Crashlooping Pods page.
In the Cluster list, select the cluster that you want to troubleshoot. If you can't find your cluster, enter the name of the cluster in the Filter field.
In the Namespace list, select the namespace that you want to troubleshoot. If you can't find your namespace, enter the namespace in the Filter field.
Work through each section to help you answer the following questions:
- Identify App Errors: which containers are restarting?
- Investigate Out Of Memory Issues: is there a misconfiguration or an error related to the app?
- Investigate Node Disruptions: are disruptions on the node resource causing container restarts?
- Investigate Liveness Probe Failures: are liveness probes stopping your containers?
- Correlate Change Events: what happened around the time the containers started crashing?
Optional: To get notifications about future
CrashLoopBackOff
events, in the Future Mitigation Tips section, select Create an Alert.
If your problem persists after using the playbook, read the rest of the guide
for more information about resolving CrashLoopBackOff
events.
Resolve a CrashLoopBackOff event
The following sections help you resolve the most common causes of
CrashLoopBackOff
events:
Resolve resource exhaustion
A CrashLoopBackOff event is often caused by an Out of Memory (OOM) issue. You can confirm that this is the cause if the kubectl describe output shows the following:
Last State: Terminated
Reason: OOMKilled
For information about how to diagnose and resolve OOM events, see Troubleshoot OOM events.
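If the container is being OOMKilled, one common mitigation, alongside fixing any memory leaks in the app, is to raise the container's memory request and limit. The following snippet is a minimal sketch with placeholder values; size them based on the app's observed usage:

resources:
  requests:
    memory: "512Mi"   # placeholder; used for scheduling decisions
  limits:
    memory: "1Gi"     # placeholder; the container is OOMKilled if it exceeds this value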
Resolve liveness probe failures
A liveness probe is a periodic health check performed by the kubelet. If the probe fails a specified number of times (the default number is three), the kubelet restarts the container, potentially causing a CrashLoopBackOff event if the probe failures continue.
Confirm if a liveness probe is the cause
To confirm if liveness probe failures are triggering the CrashLoopBackOff
event, query your kubelet
logs. These logs often contain explicit messages
indicating probe failures and subsequent restarts.
In the Google Cloud console, go to the Logs Explorer page.
In the query pane, filter for any liveness-probe-related restarts by entering the following query:
resource.type="k8s_node" log_id("kubelet") jsonPayload.MESSAGE:"failed liveness probe, will be restarted" resource.labels.cluster_name="CLUSTER_NAME"
Replace
CLUSTER_NAME
with the name of your cluster.Review the output. If a liveness probe failure is the cause of your
CrashLoopBackOff
events, the query returns log messages similar to the following:Container probe failed liveness probe, will be restarted
After you confirm that liveness probes are the cause of the CrashLoopBackOff
event, proceed to troubleshoot common causes:
- Review liveness probe configuration.
- Inspect CPU and disk I/O utilization.
- Address large deployments.
- Address transient errors.
- Address probe resource consumption.
Review liveness probe configuration
Misconfigured probes are a frequent cause of CrashLoopBackOff
events. Check
the following settings in the manifest of your probe:
- Verify probe type: your probe's configuration must match how your app reports its health. For example, if your app has a health check URL (like /healthz), use the httpGet probe type. If its health is determined by running a command, use the exec probe type. To check only whether a network port is open and listening, use the tcpSocket probe type.
- Check probe parameters:
  - Path (for the httpGet probe type): make sure the HTTP path is correct and that your app serves health checks on it.
  - Port: verify that the port configured in the probe is actually used and exposed by the app.
  - Command (for the exec probe type): make sure the command exists within the container, returns an exit code of 0 for success, and completes within the configured timeoutSeconds period.
  - Timeout: make sure that the timeoutSeconds value is sufficient for the app to respond, especially during startup or under load.
  - Initial delay (initialDelaySeconds): check if the initial delay is sufficient for the app to start before probes begin.
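The following snippet is a minimal sketch of an httpGet liveness probe that shows these parameters together; the path, port, and timing values are placeholders that you should adapt to your app:

livenessProbe:
  httpGet:
    path: /healthz           # must match the path where your app serves health checks
    port: 8080               # must be a port that the app listens on
  initialDelaySeconds: 15    # time for the app to start before the first probe
  periodSeconds: 10          # how often the kubelet runs the probe
  timeoutSeconds: 5          # how long the kubelet waits for a response
  failureThreshold: 3        # consecutive failures before the kubelet restarts the container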
For more information, see Configure Liveness, Readiness and Startup Probes in the Kubernetes documentation.
Inspect CPU and disk I/O utilization
Resource contention results in probe timeouts, which is a major cause of liveness probe failures. To see if resource usage is the cause of the liveness probe failure, try the following solutions:
- Analyze CPU usage: monitor the CPU utilization of the affected container and the node it's running on during the probe intervals. A key metric to track is kubernetes.io/container/cpu/core_usage_time. High CPU usage on the container or the node can prevent the app from responding to the probe in time.
- Monitor disk I/O: check disk I/O metrics for the node. You can use the compute.googleapis.com/guest/disk/operation_time metric to assess the amount of time spent on disk operations, which are categorized by reads and writes. High disk I/O can significantly slow down container startup, app initialization, or overall app performance, leading to probe timeouts.
Address large deployments
In scenarios where a large number of Pods are deployed simultaneously (for example, by a CI/CD tool like ArgoCD), a sudden surge of new Pods can overwhelm cluster resources, leading to control plane resource exhaustion. This lack of resources delays app startup and can cause liveness probes to fail repeatedly before the apps are ready.
To resolve this issue, try the following solutions:
- Implement staggered deployments: implement strategies to deploy Pods in batches or over a longer period to avoid overwhelming node resources (a rollout strategy sketch follows this list).
- Reconfigure or scale nodes: if staggered deployments aren't feasible, consider upgrading nodes with faster or larger disks, or Persistent Volume Claims, to better handle increased I/O demand. Ensure your cluster autoscaling is configured appropriately.
- Wait and observe: in some cases, if the cluster is not severely under-resourced, workloads might eventually deploy after a significant delay (sometimes 30 minutes or more).
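One way to stagger Pod creation for an existing Deployment is to tune its rolling update strategy so that only a small batch of Pods is replaced at a time. The following snippet is a minimal sketch with illustrative values; it applies to updates of an existing Deployment, so the initial rollout of a brand-new workload might still need to be batched by your CI/CD tool:

spec:
  replicas: 50
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 5          # create at most 5 extra Pods above the desired count at a time
      maxUnavailable: 0    # keep existing Pods serving while the new Pods start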
Address transient errors
The app might experience temporary errors or slowdowns during startup or
initialization that cause the probe to fail initially. If the app eventually
recovers, consider increasing the values defined in the initialDelaySeconds
or
failureThreshold
fields in the manifest of your liveness probe.
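For example, the following sketch shows a probe that tolerates a slower startup; the values are illustrative and should reflect how long your app actually needs:

livenessProbe:
  httpGet:
    path: /healthz           # placeholder path
    port: 8080               # placeholder port
  initialDelaySeconds: 60    # wait longer before the first probe
  failureThreshold: 6        # tolerate more consecutive failures before a restart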
Address probe resource consumption
In rare cases, the liveness probe's execution itself might consume significant resources, which could trigger resource constraints that potentially lead to the container being terminated due to an OOM kill. Ensure your probe commands are lightweight. A lightweight probe is more likely to execute quickly and reliably, giving it higher fidelity in accurately reporting your app's true health.
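For example, if an exec probe runs a heavyweight command (such as starting an interpreter or a client binary on every check), a cheaper alternative, assuming the app already listens on a TCP port, is a connection-only check:

livenessProbe:
  tcpSocket:
    port: 8080        # placeholder; the port your app listens on
  periodSeconds: 10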
Resolve app misconfigurations
App misconfigurations cause many CrashLoopBackOff
events. To
understand why your app is stopping, the first step is to examine its exit code.
This code determines your troubleshooting path:
- Exit code 0 indicates a successful exit, which is unexpected for a long-running service and points to issues with the container's entry point or app design.
- A non-zero exit code signals an app crash, directing your focus toward configuration errors, dependency issues, or bugs in the code.
Find the exit code
To find the exit code of your app, do the following:
Describe the Pod:
kubectl describe pod POD_NAME -n NAMESPACE_NAME
Replace the following:
POD_NAME
: the name of the problematic Pod.NAMESPACE_NAME
: the namespace of the Pod.
In the output, review the Exit Code field located under the Last State section for the relevant container. If the exit code is 0, see Troubleshoot successful exits (exit code 0). If the exit code is a number other than 0, see Troubleshoot app crashes (non-zero exit code).
Troubleshoot successful exits (exit code 0)
An exit code of 0
typically means the container's process finished successfully.
Although this is the outcome that you want for a task-based Job, it can signal a
problem for a long-running controller like a Deployment, StatefulSet, or
ReplicaSet.
These controllers work to ensure a Pod is always running, so they treat any exit
as a failure to be corrected. The kubelet
enforces this behavior by adhering
to the Pod's restartPolicy
(which defaults to Always
), restarting the
container even after a successful exit. This action creates a loop, which
ultimately triggers the CrashLoopBackOff
status.
The most common reasons for unexpected successful exits are the following:
Container command doesn't start a persistent process: a container remains running only as long as its initial process (command or entrypoint) does. If this process isn't a long-running service, the container exits as soon as the command completes. For example, a command like ["/bin/bash"] exits immediately because it has no script to run. To resolve this issue, ensure that your container's command starts a process that runs continuously, as shown in the sketch after these examples.

Worker app exits when a work queue is empty: many worker apps are designed to check a queue for a task and exit cleanly if the queue is empty. To resolve this issue, you can either use a Job controller (which is designed for tasks that run to completion) or modify the app's logic to run as a persistent service.

App exits due to missing or invalid configuration: your app might exit immediately if it's missing required startup instructions, such as command-line arguments, environment variables, or a critical configuration file.

To resolve this issue, first inspect your app's logs for specific error messages related to configuration loading or missing parameters. Then, verify the following:

- App arguments or environment: ensure that all necessary command-line arguments and environment variables are correctly passed to the container as expected by your app.
- Configuration file presence: confirm that any required configuration files are present at the expected paths within the container.
- Configuration file content: validate the content and format of your configuration files for syntax errors, missing mandatory fields, or incorrect values.

A common example of this issue is when an app is configured to read from a file mounted with a ConfigMap volume. If the ConfigMap isn't attached, is empty, or has misnamed keys, an app designed to exit when its configuration is missing might stop with an exit code of 0. In such cases, verify the following settings:

- The ConfigMap name in your Pod's volume definition matches its actual name.
- The keys within the ConfigMap match what your app expects to find as filenames in the mounted volume.
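The following container spec is a minimal sketch that addresses the first and third cases: the command starts a long-running server process instead of a bare shell, and the ConfigMap volume is mounted at the path the app reads from. The image, binary path, and ConfigMap name are placeholders:

containers:
- name: app
  image: example-image:v1.2.3                  # placeholder image reference
  command: ["/app/server"]                     # hypothetical long-running binary, not a bare shell
  args: ["--config", "/etc/app/config.yaml"]   # path must match the volume mount below
  volumeMounts:
  - name: app-config
    mountPath: /etc/app
volumes:
- name: app-config
  configMap:
    name: app-config                           # must match the actual ConfigMap name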
Troubleshoot app crashes (non-zero exit code)
When a container exits with a non-zero code, Kubernetes restarts it. If the
underlying issue that caused the error is persistent, the app crashes again and
the cycle repeats, culminating in a CrashLoopBackOff
state.
The non-zero exit code is a clear signal that an error occurred within the app itself, which directs your debugging efforts toward its internal workings and environment. The following issues often cause this termination:
Configuration errors: a non-zero exit code often points to problems with the app's configuration or the environment it's running in. Check your app for these common issues:
- Missing configuration file: the app might not be able to locate or access a required configuration file.
- Invalid configuration: the configuration file might contain syntax errors, incorrect values, or incompatible settings, causing the app to crash.
- Permissions issues: the app could lack the necessary permissions to read or write the configuration file.
- Environment variables: incorrect or missing environment variables can cause the app to malfunction or fail to start.
- Invalid entrypoint or command: the command specified in the container's entrypoint or command field might be incorrect. This issue can happen with newly deployed images where the path to the executable is wrong or the file itself is not present in the container image. This misconfiguration often results in the 128 exit code.

Uncontrolled image updates (:latest tag): if your workload images use the :latest tag, new Pods might pull an updated image version that introduces breaking changes.

To help ensure consistency and reproducibility, always use specific, immutable image tags (for example, v1.2.3) or SHA digests (for example, sha256:45b23dee08...) in production environments. This practice helps ensure that the exact same image content is pulled every time (a sketch of a pinned image reference follows this list).
Dependency issues: your app might crash if it can't connect to the other services it depends on, or if it fails to authenticate or has insufficient permissions to access them.
External service unavailable: the app might depend on external services (for example, databases or APIs) that are unreachable due to network connectivity problems or service outages. To troubleshoot this issue, connect to the Pod. For more information, see Debug Running Pods in the Kubernetes documentation.
After you connect to the Pod, you can run commands to check for access to files or databases, or to test the network. For example, you can use a tool like curl to try to reach a service's URL. This action helps you determine if a problem is caused by network policies, DNS, or the service itself.

Authentication failures: the app might be unable to authenticate with external services due to incorrect credentials. Inspect the container's logs for messages like 401 Unauthorized (bad credentials) or 403 Forbidden (insufficient permissions), which often indicate that the service account for the Pod lacks the necessary IAM roles to make external Google Cloud service calls.

If you use GKE Workload Identity Federation, verify that the principal identifier has the permissions required for the task. For more information about granting IAM roles to principals by using GKE Workload Identity Federation, see Configure authorization and principals. You should also verify that the resource usage of GKE Metadata Server hasn't exceeded its limits.
Timeouts: the app might experience timeouts when waiting for responses from external services, leading to crashes.
App-specific errors: if configuration and external dependencies seem correct, the error might be within the app's code. Inspect the app logs for these common internal errors:
- Unhandled exceptions: the app logs might contain stack traces or error messages indicating unhandled exceptions or other code-related bugs.
- Deadlocks or livelocks: the app might be stuck in a deadlock, where multiple processes are waiting for each other to complete. In this scenario, the app might not exit, but it stops responding indefinitely.
- Port conflicts: the app might fail to start if it attempts to bind to a port that is already in use by another process.
- Incompatible libraries: the app might depend on libraries or dependencies that are missing or incompatible with the runtime environment.
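The image-pinning recommendation from the configuration errors list can be expressed directly in the container spec. The following lines are a minimal sketch; the repository path, tag, and digest are placeholders:

containers:
- name: app
  # Pin to a specific, immutable tag:
  image: us-docker.pkg.dev/PROJECT_ID/REPO/app:v1.2.3
  # Or pin to a digest so that the exact same image content is pulled every time:
  # image: us-docker.pkg.dev/PROJECT_ID/REPO/app@sha256:45b23dee08...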
To find the root cause, inspect the container's logs for a specific error message or stack trace. This information helps you decide whether to fix the app code, adjust resource limits, or correct the environment's configuration. For more information about logs, see About GKE logs.
What's next
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on StackOverflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.