This page introduces you to fundamental troubleshooting techniques for Google Kubernetes Engine (GKE). It is intended for users who are new to Kubernetes and GKE and who want to learn effective troubleshooting practices.
This page provides an overview of the following tools and techniques for monitoring, diagnosing, and resolving issues with GKE:
- Review cluster and workload health in the Google Cloud console: get a high-level overview to quickly identify potential issues with your clusters and apps.
- Investigate the cluster's state with the `kubectl` command-line tool: use these commands to view the live status of resources such as nodes and Pods.
- Conduct historical analysis with Cloud Logging: query historical log data and examine events to identify the root cause of failures.
- Perform proactive monitoring with Cloud Monitoring: track performance metrics over time, visualize trends, and create alerts to detect and respond to issues before they affect users.
- Accelerate diagnosis with Gemini Cloud Assist: analyze complex error messages, get step-by-step troubleshooting guidance, and investigate issues automatically.
- Put it all together: Example troubleshooting scenario: see how to use these tools together in a step-by-step walkthrough to diagnose and resolve a common real-world app failure.
Understand core concepts
If you're new to Kubernetes and GKE, understanding core concepts, like cluster architecture and the relationship between Pods and nodes, is essential before you start to troubleshoot. If you want to learn more, see Start learning about GKE.
It's also helpful to understand which parts of GKE you're responsible for maintaining and which parts Google Cloud is responsible for maintaining. For more information, see GKE shared responsibility.
Review cluster and workload health in the Google Cloud console
The Google Cloud console is a good starting point for troubleshooting because it provides a quick view of the health of your clusters and workloads. Cluster health refers to the health of the underlying GKE infrastructure like nodes and networking, while workload health refers to the status and performance of your apps running on the cluster.
The following sections describe the cluster and workload pages. To provide a complete picture of your app's health, the Google Cloud console also gives you access to powerful logging and monitoring tools, letting you investigate the root cause of past failures and proactively prevent future ones. For more information about these tools, see the Conduct historical analysis with Cloud Logging and Perform proactive monitoring with Cloud Monitoring sections.
Find cluster issues
The Kubernetes clusters page provides you with an overview of the health of your clusters. To identify problems with any of your clusters, start on this page.
To get started, in the Google Cloud console, go to the Kubernetes clusters page.
Here are some examples of how you can use this page for troubleshooting:
- For advice on improving the health of your cluster, your upgrade strategy, and cost optimization, click View recommendations.
- To identify unhealthy clusters, review the Status column. Any cluster that doesn't have a green checkmark needs attention.
- To see potential issues, review the Notifications column. Click any notification messages for more information.
Investigate a specific cluster
After you discover a problem with a cluster, explore the cluster's Details page for in-depth information that helps you troubleshoot your cluster and understand its configuration.
To go to a cluster's Details page, do the following:
Go to the Kubernetes clusters page.
Review the Name column and click the name of the cluster that you want to investigate.
Here are some examples of how to use the cluster Details page to troubleshoot your cluster:
For general health checks, try the following options:
To view cluster-level dashboards, go to the Observability tab. By default, GKE enables Cloud Monitoring when you create a cluster. When Cloud Monitoring is enabled, GKE automatically sets up the dashboards on this page. Here are some of the views you might find most useful for troubleshooting:
- Overview: view a high-level summary of your cluster's health, resource utilization, and key events. This dashboard helps you quickly assess the overall state of your cluster and identify potential issues.
- Traffic metrics: view node-based networking metrics for insights into the traffic between your Kubernetes workloads.
- Workload state: view the state of Deployments, Pods, and containers. Identify failing or unhealthy instances, and detect resource constraints.
- Control plane: view the control plane's health and performance. This dashboard lets you monitor key metrics of components such as `kube-apiserver` and `etcd`, identify performance bottlenecks, and detect component failures.
To view recent app errors, go to the App errors tab. The information on this tab can help you prioritize and resolve errors by showing the number of occurrences, when an error first appeared, and when it last happened.
To investigate an error further, click the error message to view a detailed error report, including links to relevant logs.
If you're troubleshooting issues after a recent upgrade or change, check the Cluster basics section in the cluster Details tab. Confirm that the version listed in the Version field is what you expect. For further investigation, click Show upgrade history in the Upgrades section.
If you're using a Standard cluster and your Pods are stuck in a `Pending` state, or you suspect that nodes are overloaded, check the Nodes tab. The Nodes tab isn't available for Autopilot clusters because GKE manages nodes for you.
- In the Node Pools section, check that autoscaling is configured correctly and that the machine type is appropriate for your workloads.
- In the Nodes section, look for any node with a status other than `Ready`. A `NotReady` status indicates a problem with the node itself, such as resource pressure or an issue with the kubelet (the kubelet is the agent that runs on each node to manage containers).
Find workload issues
When you suspect that there's a problem with a specific app, like a failed Deployment, go to the Workloads page in the Google Cloud console. This page provides a centralized view of all of the apps that run within your clusters.
To get started, in the Google Cloud console, go to the Workloads page.
Here are some examples of how you can use this page for troubleshooting:
- To identify unhealthy workloads, review the Status column. Any workload that doesn't have a green checkmark needs attention.
- If an app is unresponsive, review the Pods column. For example, a status like 1/3 means only one of three app replicas is running, indicating a problem.
Investigate a specific workload
After you identify a problematic workload from the overview, explore the workload Details page to begin to isolate the root cause.
To go to a workload's Details page, do the following:
Go to the Workloads page.
View the Name column and click the name of the workload that you want to investigate.
Here are some examples of how to use the workload Details page to troubleshoot your workloads:
To check the workload's configuration, use the workload Overview and Details tabs. You can use this information to verify events such as whether the correct container image tag was deployed or check the workload's resource requests and limits.
To find the name of a specific crashing Pod, go to the Managed Pods section. You might need this information for `kubectl` commands. This section lists all the Pods controlled by the workload, along with their statuses.

To see a history of recent changes to a workload, go to the Revision history tab. If you notice performance issues after a new deployment, use this section to identify which revision is active. You can then compare the configurations of the current revision with previous ones to pinpoint the source of the problem. If this tab isn't visible, the workload is either a type that doesn't use revisions or it hasn't yet had any updates.

If a Deployment seems to have failed, go to the Events tab. This page is often the most valuable source of information because it shows Kubernetes-level events.
To look at your app's logs, click the Logs tab. This page helps you understand what's happening inside your cluster. Look here for error messages and stack traces that can help you diagnose issues.
To confirm exactly what was deployed, view the YAML tab. This page shows the live YAML manifest for the workload as it exists on the cluster. This information is useful for finding any discrepancies from your source-controlled manifests. If you're viewing a single Pod's YAML manifest, this tab also shows you the status of the Pod, which provides insights about Pod-level failures.
Investigate the cluster's state with the kubectl command-line tool
Although the Google Cloud console helps you understand if there's a problem,
the kubectl
command-line tool is essential for discovering why. By
communicating directly with the Kubernetes control plane, the kubectl
command-line tool lets you gather the detailed information that you need to
troubleshoot your GKE environment.
The following sections introduce you to some essential commands that are a powerful starting point for GKE troubleshooting.
Before you begin
Before you start, perform the following tasks:
- Install kubectl.
- Configure the `kubectl` command-line tool to communicate with your cluster:

  gcloud container clusters get-credentials CLUSTER_NAME \
      --location=LOCATION

  Replace the following:
  - `CLUSTER_NAME`: the name of your cluster.
  - `LOCATION`: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
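  For example, with illustrative values filled in (the cluster name and region here are placeholders, not values taken from this page):

  ```bash
  # Fetch credentials and update your kubeconfig so that kubectl targets this cluster.
  gcloud container clusters get-credentials example-cluster \
      --location=us-central1
  ```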
- Review your permissions. To see if you have the required permissions to run `kubectl` commands, use the `kubectl auth can-i` command. For example, to see if you have permission to run `kubectl get nodes`, run the `kubectl auth can-i get nodes` command.

  If you have the required permissions, the command returns `yes`; otherwise, the command returns `no`.

  If you lack permission to run a `kubectl` command, you might see an error message similar to the following:

  Error from server (Forbidden): pods "POD_NAME" is forbidden: User "USERNAME@DOMAIN.com" cannot list resource "pods" in API group "" in the namespace "default"
If you don't have the required permissions, ask your cluster administrator to assign the necessary roles to you.
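For example, here are two variations of the permission check that you might find useful; the `default` namespace is used only as an example:

```bash
# Check whether you can list Pods in a specific namespace.
kubectl auth can-i list pods --namespace=default

# List every action that you're allowed to perform in that namespace.
kubectl auth can-i --list --namespace=default
```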
Get an overview of what's running
The kubectl get
command helps you to see an overall view of what's happening
in your cluster. Use the following commands to see the status of two of the most
important cluster components, nodes and Pods:
To check if your nodes are healthy, view details about all nodes and their statuses:
kubectl get nodes
The output is similar to the following:
NAME                                        STATUS   ROLES    AGE     VERSION
gke-cs-cluster-default-pool-8b8a777f-224a   Ready    <none>   4d23h   v1.32.3-gke.1785003
gke-cs-cluster-default-pool-8b8a777f-egb2   Ready    <none>   4d22h   v1.32.3-gke.1785003
gke-cs-cluster-default-pool-8b8a777f-p5bn   Ready    <none>   4d22h   v1.32.3-gke.1785003
Any status other than `Ready` requires additional investigation.

To check if your Pods are healthy, view details about all Pods and their statuses:

kubectl get pods --all-namespaces

The output is similar to the following:

NAMESPACE     NAME         READY   STATUS    RESTARTS   AGE
kube-system   netd-6nbsq   3/3     Running   0          4d23h
kube-system   netd-g7tpl   3/3     Running   0          4d23h

Any status other than `Running` requires additional investigation. Here are some common statuses that you might see:
- `Running`: a healthy, running state.
- `Pending`: the Pod is waiting to be scheduled on a node.
- `CrashLoopBackOff`: the containers in the Pod are repeatedly crashing in a loop because the app starts, exits with an error, and is then restarted by Kubernetes.
- `ImagePullBackOff`: the Pod can't pull the container image.
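On a busy cluster, the full Pod list can be long. The following sketches narrow the output; note that filtering by phase catches `Pending` and `Failed` Pods but not every unhealthy state, because a Pod in `CrashLoopBackOff` still reports a `Running` phase:

```bash
# Show Pods whose phase isn't Running (for example, Pending or Failed Pods).
kubectl get pods --all-namespaces --field-selector=status.phase!=Running

# Add node placement and Pod IP details to the output.
kubectl get pods --all-namespaces -o wide
```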
The preceding commands are only two examples of how you can use the `kubectl get` command. You can also use the command to learn more about many types of Kubernetes resources. For a full list of the resources that you can explore, see `kubectl get` in the Kubernetes documentation.
Learn more about specific resources
After you identify a problem, you need to get more details. An example of a problem could be a Pod that doesn't have a status of `Running`. To get more details, use the `kubectl describe` command.
For example, to describe a specific Pod, run the following command:
kubectl describe pod POD_NAME -n NAMESPACE_NAME
Replace the following:
- `POD_NAME`: the name of the Pod experiencing issues.
- `NAMESPACE_NAME`: the namespace that the Pod is in. If you're not sure what the namespace is, review the `Namespace` column from the output of the `kubectl get pods` command.
The output of the `kubectl describe` command includes detailed information about your resource. Here are some of the most helpful sections to review when you troubleshoot a Pod:
- `Status`: the current status of the Pod.
- `Conditions`: the overall health and readiness of the Pod.
- `Restart Count`: how many times the containers in the Pod have restarted. High numbers can be a cause for concern.
- `Events`: a log of important things that have happened to this Pod, like being scheduled to a node, pulling its container image, and whether any errors occurred. The `Events` section is often where you can find the direct clues to why a Pod is failing.
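If you prefer to look at a Pod's events on their own rather than inside the describe output, the following sketch lists them sorted by time. It uses the same placeholders as the previous command:

```bash
# List events that reference the Pod, most recent last.
kubectl get events -n NAMESPACE_NAME \
    --field-selector=involvedObject.name=POD_NAME \
    --sort-by=.lastTimestamp
```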
As with the `kubectl get` command, you can use the `kubectl describe` command to learn more about multiple types of resources. For a full list of the resources that you can explore, see `kubectl describe` in the Kubernetes documentation.
Conduct historical analysis with Cloud Logging
Although the kubectl
command-line tool is invaluable for inspecting the live
state of your Kubernetes objects, its view is often limited to the present
moment. To understand the root cause of a problem, you often need to investigate
what happened over time. When you need that historical context, use
Cloud Logging.
Cloud Logging aggregates logs from your GKE clusters, containerized apps, and other Google Cloud services.
Understand key log types for troubleshooting
Cloud Logging automatically collects several different types of GKE logs that can help you troubleshoot:
- Node and runtime logs (`kubelet`, `containerd`): the logs from the underlying node services. Because the `kubelet` manages the lifecycle of all Pods on the node, its logs are essential for troubleshooting issues like container startups, Out of Memory (OOM) events, probe failures, and volume mount errors. These logs are also crucial for diagnosing node-level problems, such as a node that has a `NotReady` status.

  Because containerd manages the lifecycle of your containers, including pulling images, its logs are crucial for troubleshooting issues that happen before the kubelet can start the container. containerd logs help you diagnose node-level problems in GKE, as they document the specific activities and potential errors of the container runtime.

- App logs (`stdout`, `stderr`): the standard output and error streams from your containerized processes. These logs are essential for debugging app-specific issues like crashes, errors, or unexpected behavior.

- Audit logs: these logs answer "who did what, where, and when?" for your cluster. They track administrative actions and API calls made to the Kubernetes API server, which is useful for diagnosing issues caused by configuration changes or unauthorized access.
Common troubleshooting scenarios
After you identify an issue, you can query these logs to find out what happened. To get you started, here are some common issues that reviewing logs can help you resolve:
- If a node has a `NotReady` status, review its node logs. The `kubelet` and `containerd` logs often reveal the underlying cause, such as network problems or resource constraints.
- If a new node fails to provision and join the cluster, review the node's serial port logs. These logs capture early boot and kubelet startup activity before the node's logging agents are fully active.
- If a Pod failed to start in the past, review the app logs for that Pod to check for crashes. If the logs are empty or the Pod can't be scheduled, check the audit logs for relevant events or the node logs on the target node for clues about resource pressure or image pull errors.
- If a critical Deployment was deleted and no one knows why, query the Admin Activity audit logs. These logs can help you identify which user or service account issued the delete API call, providing a clear starting point for your investigation.
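For the last scenario, the following query is a starting-point sketch that you can run in Logs Explorer (described in the next section) to find Deployment deletions in the Admin Activity audit logs. The method name follows the usual naming pattern for Kubernetes audit entries, but verify it against the entries in your project; the `protoPayload.authenticationInfo.principalEmail` field in each matching entry identifies who made the call:

```
logName:"cloudaudit.googleapis.com%2Factivity"
resource.type="k8s_cluster"
protoPayload.methodName="io.k8s.apps.v1.deployments.delete"
```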
How to access logs
Use Logs Explorer to query, view, and analyze GKE logs in the Google Cloud console. Logs Explorer provides powerful filtering options that help you to isolate your issue.
To access and use Logs Explorer, complete the following steps:
In the Google Cloud console, go to the Logs Explorer page.
In the query pane, enter a query. Use the Logging query language to write targeted queries. Here are some common filters to get you started:
Filter type | Description | Example value |
---|---|---|
`resource.type` | The type of Kubernetes resource. | `k8s_cluster`, `k8s_node`, `k8s_pod`, `k8s_container` |
`log_id` | The log stream from the resource. | `stdout`, `stderr` |
`resource.labels.RESOURCE_TYPE.name` | Filter for resources with a specific name. Replace `RESOURCE_TYPE` with the name of the resource that you want to query. For example, `namespace` or `pod`. | `example-namespace-name`, `example-pod-name` |
`severity` | The log severity level. | `DEFAULT`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
`jsonPayload.message=~` | A regular expression search for text within the log message. | `scale.down.error.failed.to.delete.node.min.size.reached` |
For example, to troubleshoot a specific Pod, you might want to isolate its error logs. To see only logs with an `ERROR` severity for that Pod, use the following query:

resource.type="k8s_container"
resource.labels.pod_name="POD_NAME"
resource.labels.namespace_name="NAMESPACE_NAME"
severity=ERROR

Replace the following:
- `POD_NAME`: the name of the Pod experiencing issues.
- `NAMESPACE_NAME`: the namespace that the Pod is in. If you're not sure what the namespace is, review the `Namespace` column from the output of the `kubectl get pods` command.
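If you suspect node-level memory pressure rather than an app error, a similar query against the `kubelet` logs can surface out-of-memory activity. This is only a sketch: the regular expression is deliberately broad because the exact message text varies by node image, and on some images the text might be stored in `jsonPayload.MESSAGE` rather than `jsonPayload.message`:

```
resource.type="k8s_node"
log_id("kubelet")
jsonPayload.message=~"(?i)oom"
```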
For more examples, see Kubernetes-related queries in the Google Cloud Observability documentation.
Click Run query.
To see the full log message, including the JSON payload, metadata, and timestamp, click the log entry.
For more information about GKE logs, see About GKE logs.
Perform proactive monitoring with Cloud Monitoring
After an issue occurs, reviewing logs is a critical step in troubleshooting. However, a truly resilient system also requires a proactive approach to identify problems before they cause an outage.
To proactively identify future problems and track key performance indicators over time, use Cloud Monitoring. Cloud Monitoring provides dashboards, metrics, and alerting capabilities. These tools help you find rising error rates, increasing latency, or resource constraints, which help you act before users are affected.
Review useful metrics
GKE automatically sends a set of metrics to Cloud Monitoring. The following sections list some of the most important metrics for troubleshooting:
- Container performance and health metrics
- Node performance and health metrics
- Pod performance and health metrics
For a complete list of GKE metrics, see GKE system metrics.
Container performance and health metrics
Start with these metrics when you suspect a problem with a specific app. These metrics help you monitor the health of your app, including discovering if a container is restarting frequently, running out of memory, or being throttled by CPU limits.
Metric | Description | Troubleshooting significance |
---|---|---|
`kubernetes.io/container/cpu/limit_utilization` | The fraction of the CPU limit that is currently in use on the instance. This value can be greater than 1 as a container might be allowed to exceed its CPU limit. | Identifies CPU throttling. High values can lead to performance degradation. |
`kubernetes.io/container/memory/limit_utilization` | The fraction of the memory limit that is currently in use on the instance. This value cannot exceed 1. | Monitors for risk of OutOfMemory (OOM) errors. |
`kubernetes.io/container/memory/used_bytes` | Actual memory consumed by the container in bytes. | Tracks memory consumption to identify potential memory leaks or risk of OOM errors. |
`kubernetes.io/container/memory/page_fault_count` | Number of page faults, broken down by type: major and minor. | Indicates significant memory pressure. Major page faults mean memory is being read from disk (swapping), even if memory limits aren't reached. |
`kubernetes.io/container/restart_count` | Number of times the container has restarted. | Highlights potential problems such as crashing apps, misconfigurations, or resource exhaustion through a high or increasing number of restarts. |
`kubernetes.io/container/ephemeral_storage/used_bytes` | Local ephemeral storage usage in bytes. | Monitors temporary disk usage to prevent Pod evictions due to full ephemeral storage. |
`kubernetes.io/container/cpu/request_utilization` | The fraction of the requested CPU that is currently in use on the instance. This value can be greater than 1 as usage can exceed the request. | Identifies over- or under-provisioned CPU requests to help you optimize resource allocation. |
`kubernetes.io/container/memory/request_utilization` | The fraction of the requested memory that is currently in use on the instance. This value can be greater than 1 as usage can exceed the request. | Identifies over- or under-provisioned memory requests to improve scheduling and prevent OOM errors. |
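These metrics are what Cloud Monitoring records over time. For a quick, point-in-time view of the same CPU and memory signals from the command line, you can also use `kubectl top`, which works on GKE because the cluster runs a metrics server by default. `NAMESPACE_NAME` is a placeholder:

```bash
# Current CPU and memory usage for each container in a namespace.
kubectl top pods -n NAMESPACE_NAME --containers

# Current CPU and memory usage for each node.
kubectl top nodes
```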
Node performance and health metrics
Examine these metrics when you need to diagnose issues with the underlying GKE infrastructure. These metrics are crucial for understanding the overall health and capacity of your nodes, helping you investigate whether the node is unhealthy or under pressure, or whether the node has enough memory to schedule new Pods.
Metric | Description | Troubleshooting significance |
---|---|---|
`kubernetes.io/node/cpu/allocatable_utilization` | The fraction of the allocatable CPU that is currently in use on the instance. | Indicates if the sum of Pod usage is straining the node's available CPU resources. |
`kubernetes.io/node/memory/allocatable_utilization` | The fraction of the allocatable memory that is currently in use on the instance. This value cannot exceed 1 as usage cannot exceed allocatable memory bytes. | Suggests that the node lacks memory for scheduling new Pods or for existing Pods to operate, especially when values are high. |
`kubernetes.io/node/status_condition` (BETA) | Condition of a node from the node status condition field. | Reports node health conditions like `Ready`, `MemoryPressure`, or `DiskPressure`. |
`kubernetes.io/node/ephemeral_storage/used_bytes` | Local ephemeral storage bytes used by the node. | Helps prevent Pod startup failures or evictions by providing warnings about high ephemeral storage usage. |
`kubernetes.io/node/ephemeral_storage/inodes_free` | Free number of index nodes (inodes) on local ephemeral storage. | Monitors the number of free inodes. Running out of inodes can halt operations even if disk space is available. |
`kubernetes.io/node/interruption_count` (BETA) | Interruptions are system evictions of infrastructure while the customer is in control of that infrastructure. This metric is the current count of interruptions by type and reason. | Explains why a node might disappear unexpectedly due to system evictions. |
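To cross-check the `kubernetes.io/node/status_condition` metric against the live state of a node, you can read the conditions directly from the API. `NODE_NAME` is a placeholder for a node name from `kubectl get nodes`:

```bash
# Print each node condition (Ready, MemoryPressure, DiskPressure, and so on) with its status.
kubectl get node NODE_NAME \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
```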
Pod performance and health metrics
These metrics help you troubleshoot issues related to a Pod's interaction with its environment, such as networking and storage. Use these metrics when you need to diagnose slow-starting Pods, investigate potential network connectivity issues, or proactively manage storage to prevent write failures from full volumes.
Metric | Description | Troubleshooting significance |
---|---|---|
`kubernetes.io/pod/network/received_bytes_count` | Cumulative number of bytes received by the Pod over the network. | Identifies unusual network activity (high or low) that can indicate app or network issues. |
`kubernetes.io/pod/network/policy_event_count` (BETA) | Change in the number of network policy events seen in the dataplane. | Identifies connectivity issues caused by network policies. |
`kubernetes.io/pod/volume/utilization` | The fraction of the volume that is currently being used by the instance. This value cannot be greater than 1 as usage cannot exceed the total available volume space. | Enables proactive management of volume space by warning when high utilization (approaching 1) might lead to write failures. |
`kubernetes.io/pod/latencies/pod_first_ready` (BETA) | The Pod end-to-end startup latency (from Pod Created to Ready), including image pulls. | Diagnoses slow-starting Pods. |
Visualize metrics with Metrics Explorer
To visualize the state of your GKE environment, create charts based on metrics with Metrics Explorer.
To use Metrics Explorer, complete the following steps:
In the Google Cloud console, go to the Metrics Explorer page.
In the Metrics field, select or enter the metric that you want to inspect.
View the results and observe any trends over time.
For example, to investigate the memory consumption of Pods in a specific namespace, you can do the following:
- In the Select a metric list, choose the metric `kubernetes.io/container/memory/used_bytes` and click Apply.
- Click Add filter and select namespace_name.
- In the Value list, select the namespace you want to investigate.
- In the Aggregation field, select Sum > pod_name and click OK. This setting displays a separate time series line for each Pod.
- Click Save chart.
The resulting chart shows you the memory usage for each Pod over time, which can help you visually identify any Pods with unusually high or spiking memory consumption.
Metrics Explorer has a great deal of flexibility in how to construct the metrics that you want to view. For more information about advanced Metrics Explorer options, see Create charts with Metrics Explorer in the Cloud Monitoring documentation.
Create alerts for proactive issue detection
To receive notifications when things go wrong or when metrics breach certain thresholds, set up alerting policies in Cloud Monitoring.
For example, to set up an alerting policy that notifies you when the container CPU limit is over 80% for five minutes, do the following:
In the Google Cloud console, go to the Alerting page.
Click Create policy.
In the Select a metric box, filter for CPU limit utilization and then select the following metric: `kubernetes.io/container/cpu/limit_utilization`.

Click Apply.
Leave the Add a filter field blank. This setting triggers an alert when any cluster violates your threshold.
In the Transform data section, do the following:
- In the Rolling window list, select 1 minute. This setting means that Google Cloud calculates an average value every minute.
In the Rolling window function list, select mean.
Both of these settings average the CPU limit utilization for each container every minute.
Click Next.
In the Configure alert section, do the following:
- For Condition type, select Threshold.
- For Alert trigger, select Any time series violates.
- For Threshold position, select Above threshold.
- For Threshold value, enter `0.8`. This value represents the 80% threshold that you want to monitor for.
- Click Advanced options.
- In the Retest window list, select 5 min. This setting means that the alert triggers only if the CPU utilization stays over 80% for a continuous five-minute period, which reduces false alarms from brief spikes.
- In the Condition name field, give the condition a descriptive name.
- Click Next.
In the Configure the notifications and finalize the alert section, do the following:
- In the Notification channels list, select the channel where you want to receive the alert. If you don't have a channel, click Manage notification channels to create one.
- In the Name the alert policy field, give the policy a clear and descriptive name.
- Leave all other fields with their default values.
- Click Next.
Review your policy, and if it all looks correct, click Create policy.
To learn about the additional ways that you can create alerts, see Alerting overview in the Cloud Monitoring documentation.
Accelerate diagnosis with Gemini Cloud Assist
Sometimes, the cause of your issue isn't immediately obvious, even after you use the tools discussed in the preceding sections. Investigating complex cases can be time-consuming and requires deep expertise. For scenarios like this, Gemini Cloud Assist can help. It can automatically detect hidden patterns, surface anomalies, and provide summaries to help you quickly pinpoint a likely cause.
Access Gemini Cloud Assist
To access Gemini Cloud Assist, complete the following steps:
- In the Google Cloud console, go to any page.
In the Google Cloud console toolbar, click spark Open or close Gemini Cloud Assist chat.
The Cloud Assist panel opens. You can click example prompts if they are displayed, or you can enter a prompt in the Enter a prompt field.
Explore example prompts
To help you understand how Gemini Cloud Assist can help you, here are some example prompts:
Theme | Scenario | Example prompt | How Gemini Cloud Assist can help |
---|---|---|---|
Confusing error message | A Pod has the `CrashLoopBackOff` status, but the error message is hard to understand. | What does this GKE Pod error mean and what are common causes: `panic: runtime error: invalid memory address or nil pointer dereference`? | Gemini Cloud Assist analyzes the message and explains it in clear terms. It also offers potential causes and solutions. |
Performance issues | Your team notices high latency for an app that runs in GKE. | My `api-gateway` service in the `prod` GKE cluster is experiencing high latency. What metrics should I check first, and can you suggest some common GKE-related causes for this? | Gemini Cloud Assist suggests key metrics to examine, explores potential issues (for example, resource constraints or network congestion), and recommends tools and techniques for further investigation. |
Node issues | A GKE node is stuck with a status of `NotReady`. | One of my GKE nodes (`node-xyz`) is showing a `NotReady` status. What are the typical steps to troubleshoot this? | Gemini Cloud Assist provides a step-by-step investigation plan, explaining concepts like node auto-repair and suggesting relevant `kubectl` commands. |
Understanding GKE | You're unsure about a specific GKE feature or how to implement a best practice. | What are the best practices for securing a GKE cluster? Is there any way I can learn more? | Gemini Cloud Assist provides clear explanations of GKE best practices. Click Show related content to see links to official documentation. |
For more information, see the following resources:
- Learn how to write better prompts.
- Learn how to use the Gemini Cloud Assist panel.
- Read Gemini for Google Cloud overview.
- Learn how Gemini for Google Cloud uses your data.
Use Gemini Cloud Assist Investigations
In addition to interactive chat, Gemini Cloud Assist can perform more automated, in-depth analysis through Gemini Cloud Assist Investigations. This feature is integrated directly into workflows like Logs Explorer, and is a powerful root-cause analysis tool.
When you initiate an investigation from an error or a specific resource, Gemini Cloud Assist analyzes logs, configurations, and metrics. It uses this data to produce ranked observations and hypotheses about probable root causes, and then provides you with recommended next steps. You can also transfer these results to a Google Cloud support case to provide valuable context that can help you resolve your issue faster.
For more information, see Gemini Cloud Assist Investigations in the Gemini documentation.
Put it all together: Example troubleshooting scenario
This example shows how you can use a combination of GKE tools to diagnose and understand a common real-world problem: a container that is repeatedly crashing due to insufficient memory.
The scenario
You are the on-call engineer for a web app named product-catalog
that runs in
GKE.
Your investigation begins when you receive an automated alert from Cloud Monitoring:
Alert: High memory utilization for container 'product-catalog' in 'prod' cluster.
This alert tells you that a problem exists and indicates that the problem has
something to do with the product-catalog
workload.
Confirm the problem in the Google Cloud console
You start with a high-level view of your workloads to confirm the issue.
- In the Google Cloud console, you navigate to the Workloads page and filter for your `product-catalog` workload.
- You look at the Pods status column. Instead of the healthy `3/3`, you see the value steadily showing an unhealthy status: `2/3`. This value tells you that one of your app's Pods doesn't have a status of `Ready`.
- You want to investigate further, so you click the name of the `product-catalog` workload to go to its details page.
- On the details page, you view the Managed Pods section. You immediately identify a problem: the `Restarts` column for your Pod shows `14`, an unusually high number.
This high restart count confirms the issue is causing app instability, and suggests that a container is failing its health checks or crashing.
Find the reason with kubectl commands
Now that you know that your app is repeatedly restarting, you need to find out
why. The kubectl describe
command is a good tool for this.
You get the exact name of the unstable Pod:
kubectl get pods -n prod
The output is the following:
NAME                              READY   STATUS             RESTARTS   AGE
product-catalog-d84857dcf-g7v2x   0/1     CrashLoopBackOff   14         25m
product-catalog-d84857dcf-lq8m4   1/1     Running            0          2h30m
product-catalog-d84857dcf-wz9p1   1/1     Running            0          2h30m
You describe the unstable Pod to get the detailed event history:
kubectl describe pod product-catalog-d84857dcf-g7v2x -n prod
You review the output and find clues under the `Last State` and `Events` sections:

Containers:
  product-catalog-api:
    ...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 23 Jun 2025 10:50:15 -0700
      Finished:     Mon, 23 Jun 2025 10:54:58 -0700
    Ready:          False
    Restart Count:  14
    ...
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  25m                 default-scheduler  Successfully assigned prod/product-catalog-d84857dcf-g7v2x to gke-cs-cluster-default-pool-8b8a777f-224a
  Normal   Pulled     8m (x14 over 25m)   kubelet            Container image "us-central1-docker.pkg.dev/my-project/product-catalog/api:v1.2" already present on machine
  Normal   Created    8m (x14 over 25m)   kubelet            Created container product-catalog-api
  Normal   Started    8m (x14 over 25m)   kubelet            Started container product-catalog-api
  Warning  BackOff    3m (x68 over 22m)   kubelet            Back-off restarting failed container
The output gives you two critical clues:

- First, the `Last State` section shows that the container was terminated with `Reason: OOMKilled`, which tells you it ran out of memory. This reason is confirmed by the `Exit Code: 137`, which is the standard Linux exit code for a process that has been killed due to excessive memory consumption.
- Second, the `Events` section shows a `Warning: BackOff` event with the message `Back-off restarting failed container`. This message confirms that the container is in a failure loop, which is the direct cause of the `CrashLoopBackOff` status that you saw earlier.
Visualize the behavior with metrics
The kubectl describe
command told you what happened, but Cloud Monitoring
can show you the behavior of your environment over time.
- In the Google Cloud console, you go to Metrics Explorer.
- You select the `container/memory/used_bytes` metric.
- You filter the output down to your specific cluster, namespace, and Pod name.
The chart shows a distinct pattern: the memory usage climbs steadily, then abruptly drops to zero when the container is OOM killed and restarts. This visual evidence confirms either a memory leak or insufficient memory limit.
Find the root cause in logs
You now know the container is running out of memory, but you still don't know exactly why. To discover the root cause, use Logs Explorer.
- In the Google Cloud console, you navigate to Logs Explorer.
You write a query to filter for your specific container's logs from just before the time of the last crash (which you saw in the output of the `kubectl describe` command):

resource.type="k8s_container"
resource.labels.cluster_name="example-cluster"
resource.labels.namespace_name="prod"
resource.labels.pod_name="product-catalog-d84857dcf-g7v2x"
timestamp >= "2025-06-23T17:50:00Z"
timestamp < "2025-06-23T17:55:00Z"
In the logs, you find a repeating pattern of messages right before each crash:
{ "message": "Processing large image file product-image-large.jpg", "severity": "INFO" }, { "message": "WARN: Memory cache size now at 248MB, nearing limit.", "severity": "WARNING" }
These log entries tell you that the app is trying to process large image files by loading them entirely into memory, which eventually exhausts the container's memory limit.
The findings
By using the tools together, you have a complete picture of the problem:
- The monitoring alert notified you that there was a problem.
- The Google Cloud console showed you that the issue was affecting users (restarts).
- `kubectl` commands pinpointed the exact reason for the restarts (`OOMKilled`).
- Metrics Explorer visualized the memory leak pattern over time.
- Logs Explorer revealed the specific behavior causing the memory issue.
You're now ready to implement a solution. You can either optimize the app code to handle large files more efficiently or, as a short-term fix, increase the container's memory limit (specifically, the `spec.containers.resources.limits.memory` value) in the workload's YAML manifest.
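For example, the following manifest is a minimal sketch of where that field lives. The names, namespace, and image come from the scenario above, but the label selector and the `512Mi` value are illustrative; size the limit based on the usage that you observed in Metrics Explorer:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-catalog
  namespace: prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: product-catalog   # Illustrative label; keep your existing selector.
  template:
    metadata:
      labels:
        app: product-catalog
    spec:
      containers:
      - name: product-catalog-api
        image: us-central1-docker.pkg.dev/my-project/product-catalog/api:v1.2
        resources:
          limits:
            memory: "512Mi"   # Example value only; base it on observed usage.
```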
What's next
For advice about resolving specific problems, review GKE's troubleshooting guides.
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on StackOverflow and using the `google-kubernetes-engine` tag to search for similar issues. You can also join the `#kubernetes-engine` Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.