This page introduces you to fundamental troubleshooting techniques for Google Kubernetes Engine (GKE). It is intended for users who are new to Kubernetes and GKE and who want to learn effective troubleshooting practices.
This page provides an overview of the following tools and techniques for monitoring, diagnosing, and resolving issues with GKE:
- Review cluster and workload health in the Google Cloud console: get a high-level overview to quickly identify potential issues with your clusters and apps.
- Investigate the cluster's state with the `kubectl` command-line tool: use these commands to view the live status of resources such as nodes and Pods.
- Conduct historical analysis with Cloud Logging: query historical log data and examine events to identify the root cause of failures.
- Perform proactive monitoring with Cloud Monitoring: track performance metrics over time, visualize trends, and create alerts to detect and respond to issues before they affect users.
- Accelerate diagnosis with Gemini Cloud Assist: analyze complex error messages, get step-by-step troubleshooting guidance, and investigate issues automatically.
- Put it all together: Example troubleshooting scenario: see how to use these tools together in a step-by-step walkthrough to diagnose and resolve a common real-world app failure.
Understand core concepts
If you're new to Kubernetes and GKE, understanding core concepts, like cluster architecture and the relationship between Pods and nodes, is essential before you start to troubleshoot. If you want to learn more, see Start learning about GKE.
It's also helpful to understand which parts of GKE you're responsible for maintaining and which parts Google Cloud is responsible for maintaining. For more information, see GKE shared responsibility.
Review cluster and workload health in the Google Cloud console
The Google Cloud console is a good starting point for troubleshooting because it provides a quick view of the health of your clusters and workloads. Cluster health refers to the health of the underlying GKE infrastructure like nodes and networking, while workload health refers to the status and performance of your apps running on the cluster.
The following sections describe the cluster and workload pages. To provide a complete picture of your app's health, the Google Cloud console also gives you access to powerful logging and monitoring tools, letting you investigate the root cause of past failures and proactively prevent future ones. For more information about these tools, see the Conduct historical analysis with Cloud Logging and Perform proactive monitoring with Cloud Monitoring sections.
Find cluster issues
The Kubernetes clusters page provides you with an overview of the health of your clusters. To identify problems with any of your clusters, start on this page.
To get started, in the Google Cloud console, go to the Kubernetes clusters page.
Here are some examples of how you can use this page for troubleshooting:
- For advice on improving the health of your cluster, your upgrade strategy, and cost optimization, click View recommendations.
- To identify unhealthy clusters, review the Status column. Any cluster that doesn't have a green checkmark needs attention.
- To see potential issues, review the Notifications column. Click any notification messages for more information.
Investigate a specific cluster
After you discover a problem with a cluster, explore the cluster's Details page for in-depth information that helps you troubleshoot your cluster and understand its configuration.
To go to a cluster's Details page, do the following:
Go to the Kubernetes clusters page.
Review the Name column and click the name of the cluster that you want to investigate.
Here are some examples of how to use the cluster Details page to troubleshoot your cluster:
For general health checks, try the following options:
To view cluster-level dashboards, go to the Observability tab. By default, GKE enables Cloud Monitoring when you create a cluster. When Cloud Monitoring is enabled, GKE automatically sets up the dashboards on this page. Here are some of the views you might find most useful for troubleshooting:
- Overview: view a high-level summary of your cluster's health, resource utilization, and key events. This dashboard helps you quickly assess the overall state of your cluster and identify potential issues.
- Traffic metrics: view node-based networking metrics for insights into the traffic between your Kubernetes workloads.
- Workload state: view the state of Deployments, Pods, and containers. Identify failing or unhealthy instances, and detect resource constraints.
- Control plane: view the control plane's health and performance. This dashboard lets you monitor key metrics of components such as `kube-apiserver` and `etcd`, identify performance bottlenecks, and detect component failures.
To view recent app errors, go to the App errors tab. The information on this tab can help you prioritize and resolve errors by showing the number of occurrences, when an error first appeared, and when it last happened.
To investigate an error further, click the error message to view a detailed error report, including links to relevant logs.
If you're troubleshooting issues after a recent upgrade or change, check the Cluster basics section in the cluster Details tab. Confirm that the version listed in the Version field is what you expect. For further investigation, click Show upgrade history in the Upgrades section.
If you're using a Standard cluster and your Pods are stuck in a `Pending` state, or you suspect that nodes are overloaded, check the Nodes tab. The Nodes tab isn't available for Autopilot clusters because GKE manages nodes for you.
- In the Node Pools section, check that autoscaling is configured correctly and that the machine type is appropriate for your workloads.
- In the Nodes section, look for any node with a status other than `Ready`. A `NotReady` status indicates a problem with the node itself, such as resource pressure or an issue with the kubelet (the kubelet is the agent that runs on each node to manage containers).
Find workload issues
When you suspect that there's a problem with a specific app, like a failed Deployment, go to the Workloads page in the Google Cloud console. This page provides a centralized view of all of the apps that run within your clusters.
To get started, in the Google Cloud console, go to the Workloads page.
Here are some examples of how you can use this page for troubleshooting:
- To identify unhealthy workloads, review the Status column. Any workload that doesn't have a green checkmark needs attention.
- If an app is unresponsive, review the Pods column. For example, a status like 1/3 means only one of three app replicas is running, indicating a problem.
Investigate a specific workload
After you identify a problematic workload from the overview, explore the workload Details page to begin to isolate the root cause.
To go to a workload's Details page, do the following:
Go to the Workloads page.
View the Name column and click the name of the workload that you want to investigate.
Here are some examples of how to use the workload Details page to troubleshoot your workloads:
To check the workload's configuration, use the workload Overview and Details tabs. You can use this information to verify events such as whether the correct container image tag was deployed or check the workload's resource requests and limits.
To find the name of a specific crashing Pod, go to the Managed Pods section. You might need this information for `kubectl` commands. This section lists all the Pods controlled by the workload, along with their statuses.

To see a history of recent changes to a workload, go to the Revision history tab. If you notice performance issues after a new deployment, use this section to identify which revision is active. You can then compare the configurations of the current revision with previous ones to pinpoint the source of the problem. If this tab isn't visible, the workload is either a type that doesn't use revisions or it hasn't yet had any updates.

If a Deployment seems to have failed, go to the Events tab. This page is often the most valuable source of information because it shows Kubernetes-level events.
To look at your app's logs, click the Logs tab. This page helps you understand what's happening inside your cluster. Look here for error messages and stack traces that can help you diagnose issues.
To confirm exactly what was deployed, view the YAML tab. This page shows the live YAML manifest for the workload as it exists on the cluster. This information is useful for finding any discrepancies from your source-controlled manifests. If you're viewing a single Pod's YAML manifest, this tab also shows you the status of the Pod, which provides insights about Pod-level failures.
Investigate the cluster's state with the kubectl command-line tool
Although the Google Cloud console helps you understand if there's a problem,
the kubectl
command-line tool is essential for discovering why. By
communicating directly with the Kubernetes control plane, the kubectl
command-line tool lets you gather the detailed information that you need to
troubleshoot your GKE environment.
The following sections introduce you to some essential commands that are a powerful starting point for GKE troubleshooting.
Before you begin
Before you start, perform the following tasks:
- Install kubectl.
- Configure the `kubectl` command-line tool to communicate with your cluster:

  gcloud container clusters get-credentials CLUSTER_NAME \
      --location=LOCATION

  Replace the following:
  - `CLUSTER_NAME`: the name of your cluster.
  - `LOCATION`: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
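  For example, with illustrative values filled in (the cluster name and region here are placeholders, not values taken from this page):

  ```bash
  # Fetch credentials and update your kubeconfig so that kubectl targets this cluster.
  gcloud container clusters get-credentials example-cluster \
      --location=us-central1
  ```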
- Review your permissions. To see if you have the required permissions to run `kubectl` commands, use the `kubectl auth can-i` command. For example, to see if you have permission to run `kubectl get nodes`, run the `kubectl auth can-i get nodes` command.

  If you have the required permissions, the command returns `yes`; otherwise, the command returns `no`.

  If you lack permission to run a `kubectl` command, you might see an error message similar to the following:

  Error from server (Forbidden): pods "POD_NAME" is forbidden: User "USERNAME@DOMAIN.com" cannot list resource "pods" in API group "" in the namespace "default"
If you don't have the required permissions, ask your cluster administrator to assign the necessary roles to you.
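For example, here are two variations of the permission check that you might find useful; the `default` namespace is used only as an example:

```bash
# Check whether you can list Pods in a specific namespace.
kubectl auth can-i list pods --namespace=default

# List every action that you're allowed to perform in that namespace.
kubectl auth can-i --list --namespace=default
```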
Get an overview of what's running
The kubectl get
command helps you to see an overall view of what's happening
in your cluster. Use the following commands to see the status of two of the most
important cluster components, nodes and Pods:
To check if your nodes are healthy, view details about all nodes and their statuses:
kubectl get nodes
The output is similar to the following:
NAME                                        STATUS   ROLES    AGE     VERSION
gke-cs-cluster-default-pool-8b8a777f-224a   Ready    <none>   4d23h   v1.32.3-gke.1785003
gke-cs-cluster-default-pool-8b8a777f-egb2   Ready    <none>   4d22h   v1.32.3-gke.1785003
gke-cs-cluster-default-pool-8b8a777f-p5bn   Ready    <none>   4d22h   v1.32.3-gke.1785003
Any status other than `Ready` requires additional investigation.

To check if your Pods are healthy, view details about all Pods and their statuses:

kubectl get pods --all-namespaces

The output is similar to the following:

NAMESPACE     NAME         READY   STATUS    RESTARTS   AGE
kube-system   netd-6nbsq   3/3     Running   0          4d23h
kube-system   netd-g7tpl   3/3     Running   0          4d23h

Any status other than `Running` requires additional investigation. Here are some common statuses that you might see:
- `Running`: a healthy, running state.
- `Pending`: the Pod is waiting to be scheduled on a node.
- `CrashLoopBackOff`: the containers in the Pod are repeatedly crashing in a loop because the app starts, exits with an error, and is then restarted by Kubernetes.
- `ImagePullBackOff`: the Pod can't pull the container image.
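On a busy cluster, the full Pod list can be long. The following sketches narrow the output; note that filtering by phase catches `Pending` and `Failed` Pods but not every unhealthy state, because a Pod in `CrashLoopBackOff` still reports a `Running` phase:

```bash
# Show Pods whose phase isn't Running (for example, Pending or Failed Pods).
kubectl get pods --all-namespaces --field-selector=status.phase!=Running

# Add node placement and Pod IP details to the output.
kubectl get pods --all-namespaces -o wide
```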
The preceding commands are only two examples of how you can use the `kubectl get` command. You can also use the command to learn more about many types of Kubernetes resources. For a full list of the resources that you can explore, see `kubectl get` in the Kubernetes documentation.
Learn more about specific resources
After you identify a problem, you need to get more details. An example of a problem could be a Pod that doesn't have a status of `Running`. To get more details, use the `kubectl describe` command.
For example, to describe a specific Pod, run the following command:
kubectl describe pod POD_NAME -n NAMESPACE_NAME
Replace the following:
- `POD_NAME`: the name of the Pod experiencing issues.
- `NAMESPACE_NAME`: the namespace that the Pod is in. If you're not sure what the namespace is, review the `Namespace` column from the output of the `kubectl get pods` command.
The output of the `kubectl describe` command includes detailed information about your resource. Here are some of the most helpful sections to review when you troubleshoot a Pod:
- `Status`: the current status of the Pod.
- `Conditions`: the overall health and readiness of the Pod.
- `Restart Count`: how many times the containers in the Pod have restarted. High numbers can be a cause for concern.
- `Events`: a log of important things that have happened to this Pod, like being scheduled to a node, pulling its container image, and whether any errors occurred. The `Events` section is often where you can find the direct clues to why a Pod is failing.
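If you prefer to look at a Pod's events on their own rather than inside the describe output, the following sketch lists them sorted by time. It uses the same placeholders as the previous command:

```bash
# List events that reference the Pod, most recent last.
kubectl get events -n NAMESPACE_NAME \
    --field-selector=involvedObject.name=POD_NAME \
    --sort-by=.lastTimestamp
```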
As with the `kubectl get` command, you can use the `kubectl describe` command to learn more about multiple types of resources. For a full list of the resources that you can explore, see `kubectl describe` in the Kubernetes documentation.
Conduct historical analysis with Cloud Logging
Although the kubectl
command-line tool is invaluable for inspecting the live
state of your Kubernetes objects, its view is often limited to the present
moment. To understand the root cause of a problem, you often need to investigate
what happened over time. When you need that historical context, use
Cloud Logging.
Cloud Logging aggregates logs from your GKE clusters, containerized apps, and other Google Cloud services.
Understand key log types for troubleshooting
Cloud Logging automatically collects several different types of GKE logs that can help you troubleshoot:
- Node and runtime logs (`kubelet`, `containerd`): the logs from the underlying node services. Because the `kubelet` manages the lifecycle of all Pods on the node, its logs are essential for troubleshooting issues like container startups, Out of Memory (OOM) events, probe failures, and volume mount errors. These logs are also crucial for diagnosing node-level problems, such as a node that has a `NotReady` status.

  Because containerd manages the lifecycle of your containers, including pulling images, its logs are crucial for troubleshooting issues that happen before the kubelet can start the container. containerd logs help you diagnose node-level problems in GKE, as they document the specific activities and potential errors of the container runtime.

- App logs (`stdout`, `stderr`): the standard output and error streams from your containerized processes. These logs are essential for debugging app-specific issues like crashes, errors, or unexpected behavior.

- Audit logs: these logs answer "who did what, where, and when?" for your cluster. They track administrative actions and API calls made to the Kubernetes API server, which is useful for diagnosing issues caused by configuration changes or unauthorized access.
Common troubleshooting scenarios
After you identify an issue, you can query these logs to find out what happened. To get you started, here are some common issues that reviewing logs can help you resolve:
- If a node has a `NotReady` status, review its node logs. The `kubelet` and `containerd` logs often reveal the underlying cause, such as network problems or resource constraints.
- If a new node fails to provision and join the cluster, review the node's serial port logs. These logs capture early boot and kubelet startup activity before the node's logging agents are fully active.
- If a Pod failed to start in the past, review the app logs for that Pod to check for crashes. If the logs are empty or the Pod can't be scheduled, check the audit logs for relevant events or the node logs on the target node for clues about resource pressure or image pull errors.
- If a critical Deployment was deleted and no one knows why, query the Admin Activity audit logs. These logs can help you identify which user or service account issued the delete API call, providing a clear starting point for your investigation.
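For the last scenario, the following query is a starting-point sketch that you can run in Logs Explorer (described in the next section) to find Deployment deletions in the Admin Activity audit logs. The method name follows the usual naming pattern for Kubernetes audit entries, but verify it against the entries in your project; the `protoPayload.authenticationInfo.principalEmail` field in each matching entry identifies who made the call:

```
logName:"cloudaudit.googleapis.com%2Factivity"
resource.type="k8s_cluster"
protoPayload.methodName="io.k8s.apps.v1.deployments.delete"
```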
How to access logs
Use Logs Explorer to query, view, and analyze GKE logs in the Google Cloud console. Logs Explorer provides powerful filtering options that help you to isolate your issue.
To access and use Logs Explorer, complete the following steps:
In the Google Cloud console, go to the Logs Explorer page.
In the query pane, enter a query. Use the Logging query language to write targeted queries. Here are some common filters to get you started:
Filter type | Description | Example value |
---|---|---|
`resource.type` | The type of Kubernetes resource. | `k8s_cluster`, `k8s_node`, `k8s_pod`, `k8s_container` |
`log_id` | The log stream from the resource. | `stdout`, `stderr` |
`resource.labels.RESOURCE_TYPE.name` | Filter for resources with a specific name. Replace `RESOURCE_TYPE` with the name of the resource that you want to query. For example, `namespace` or `pod`. | `example-namespace-name`, `example-pod-name` |
`severity` | The log severity level. | `DEFAULT`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
`jsonPayload.message=~` | A regular expression search for text within the log message. | `scale.down.error.failed.to.delete.node.min.size.reached` |
For example, to troubleshoot a specific Pod, you might want to isolate its error logs. To see only logs with an `ERROR` severity for that Pod, use the following query:

resource.type="k8s_container"
resource.labels.pod_name="POD_NAME"
resource.labels.namespace_name="NAMESPACE_NAME"
severity=ERROR

Replace the following:
- `POD_NAME`: the name of the Pod experiencing issues.
- `NAMESPACE_NAME`: the namespace that the Pod is in. If you're not sure what the namespace is, review the `Namespace` column from the output of the `kubectl get pods` command.
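If you suspect node-level memory pressure rather than an app error, a similar query against the `kubelet` logs can surface out-of-memory activity. This is only a sketch: the regular expression is deliberately broad because the exact message text varies by node image, and on some images the text might be stored in `jsonPayload.MESSAGE` rather than `jsonPayload.message`:

```
resource.type="k8s_node"
log_id("kubelet")
jsonPayload.message=~"(?i)oom"
```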
For more examples, see Kubernetes-related queries in the Google Cloud Observability documentation.
Click Run query.
To see the full log message, including the JSON payload, metadata, and timestamp, click the log entry.
For more information about GKE logs, see About GKE logs.
Perform proactive monitoring with Cloud Monitoring
After an issue occurs, reviewing logs is a critical step in troubleshooting. However, a truly resilient system also requires a proactive approach to identify problems before they cause an outage.
To proactively identify future problems and track key performance indicators over time, use Cloud Monitoring. Cloud Monitoring provides dashboards, metrics, and alerting capabilities. These tools help you find rising error rates, increasing latency, or resource constraints, which help you act before users are affected.
Review useful metrics
GKE automatically sends a set of metrics to Cloud Monitoring. The following sections list some of the most important metrics for troubleshooting:
- Container performance and health metrics
- Node performance and health metrics
- Pod performance and health metrics
For a complete list of GKE metrics, see GKE system metrics.
Container performance and health metrics
Start with these metrics when you suspect a problem with a specific app. These metrics help you monitor the health of your app, including discovering if a container is restarting frequently, running out of memory, or being throttled by CPU limits.
Metric | Description | Troubleshooting significance |
---|---|---|
`kubernetes.io/container/cpu/limit_utilization` | The fraction of the CPU limit that is currently in use on the instance. This value can be greater than 1 as a container might be allowed to exceed its CPU limit. | Identifies CPU throttling. High values can lead to performance degradation. |
`kubernetes.io/container/memory/limit_utilization` | The fraction of the memory limit that is currently in use on the instance. This value cannot exceed 1. | Monitors for risk of OutOfMemory (OOM) errors. |
`kubernetes.io/container/memory/used_bytes` | Actual memory consumed by the container in bytes. | Tracks memory consumption to identify potential memory leaks or risk of OOM errors. |
`kubernetes.io/container/memory/page_fault_count` | Number of page faults, broken down by type: major and minor. | Indicates significant memory pressure. Major page faults mean memory is being read from disk (swapping), even if memory limits aren't reached. |
`kubernetes.io/container/restart_count` | Number of times the container has restarted. | Highlights potential problems such as crashing apps, misconfigurations, or resource exhaustion through a high or increasing number of restarts. |
`kubernetes.io/container/ephemeral_storage/used_bytes` | Local ephemeral storage usage in bytes. | Monitors temporary disk usage to prevent Pod evictions due to full ephemeral storage. |
`kubernetes.io/container/cpu/request_utilization` | The fraction of the requested CPU that is currently in use on the instance. This value can be greater than 1 as usage can exceed the request. | Identifies over- or under-provisioned CPU requests to help you optimize resource allocation. |
`kubernetes.io/container/memory/request_utilization` | The fraction of the requested memory that is currently in use on the instance. This value can be greater than 1 as usage can exceed the request. | Identifies over- or under-provisioned memory requests to improve scheduling and prevent OOM errors. |
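These metrics are what Cloud Monitoring records over time. For a quick, point-in-time view of the same CPU and memory signals from the command line, you can also use `kubectl top`, which works on GKE because the cluster runs a metrics server by default. `NAMESPACE_NAME` is a placeholder:

```bash
# Current CPU and memory usage for each container in a namespace.
kubectl top pods -n NAMESPACE_NAME --containers

# Current CPU and memory usage for each node.
kubectl top nodes
```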
Node performance and health metrics
Examine these metrics when you need to diagnose issues with the underlying GKE infrastructure. These metrics are crucial for understanding the overall health and capacity of your nodes, helping you investigate whether the node is unhealthy or under pressure, or whether the node has enough memory to schedule new Pods.
Metric | Description | Troubleshooting significance |
---|---|---|
`kubernetes.io/node/cpu/allocatable_utilization` | The fraction of the allocatable CPU that is currently in use on the instance. | Indicates if the sum of Pod usage is straining the node's available CPU resources. |
`kubernetes.io/node/memory/allocatable_utilization` | The fraction of the allocatable memory that is currently in use on the instance. This value cannot exceed 1 as usage cannot exceed allocatable memory bytes. | Suggests that the node lacks memory for scheduling new Pods or for existing Pods to operate, especially when values are high. |
`kubernetes.io/node/status_condition` (BETA) | Condition of a node from the node status condition field. | Reports node health conditions like `Ready`, `MemoryPressure`, or `DiskPressure`. |
`kubernetes.io/node/ephemeral_storage/used_bytes` | Local ephemeral storage bytes used by the node. | Helps prevent Pod startup failures or evictions by providing warnings about high ephemeral storage usage. |
`kubernetes.io/node/ephemeral_storage/inodes_free` | Free number of index nodes (inodes) on local ephemeral storage. | Monitors the number of free inodes. Running out of inodes can halt operations even if disk space is available. |
`kubernetes.io/node/interruption_count` (BETA) | Interruptions are system evictions of infrastructure while the customer is in control of that infrastructure. This metric is the current count of interruptions by type and reason. | Explains why a node might disappear unexpectedly due to system evictions. |
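To cross-check the `kubernetes.io/node/status_condition` metric against the live state of a node, you can read the conditions directly from the API. `NODE_NAME` is a placeholder for a node name from `kubectl get nodes`:

```bash
# Print each node condition (Ready, MemoryPressure, DiskPressure, and so on) with its status.
kubectl get node NODE_NAME \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
```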
Pod performance and health metrics
These metrics help you troubleshoot issues related to a Pod's interaction with its environment, such as networking and storage. Use these metrics when you need to diagnose slow-starting Pods, investigate potential network connectivity issues, or proactively manage storage to prevent write failures from full volumes.
Metric | Description | Troubleshooting significance |
---|---|---|
`kubernetes.io/pod/network/received_bytes_count` | Cumulative number of bytes received by the Pod over the network. | Identifies unusual network activity (high or low) that can indicate app or network issues. |
`kubernetes.io/pod/network/policy_event_count` (BETA) | Change in the number of network policy events seen in the dataplane. | Identifies connectivity issues caused by network policies. |
`kubernetes.io/pod/volume/utilization` | The fraction of the volume that is currently being used by the instance. This value cannot be greater than 1 as usage cannot exceed the total available volume space. | Enables proactive management of volume space by warning when high utilization (approaching 1) might lead to write failures. |
`kubernetes.io/pod/latencies/pod_first_ready` (BETA) | The Pod end-to-end startup latency (from Pod Created to Ready), including image pulls. | Diagnoses slow-starting Pods. |
Visualize metrics with Metrics Explorer
To visualize the state of your GKE environment, create charts based on metrics with Metrics Explorer.
To use Metrics Explorer, complete the following steps:
In the Google Cloud console, go to the Metrics Explorer page.
In the Metrics field, select or enter the metric that you want to inspect.
View the results and observe any trends over time.
For example, to investigate the memory consumption of Pods in a specific namespace, you can do the following:
- In the Select a metric list, choose the metric `kubernetes.io/container/memory/used_bytes` and click Apply.
- Click Add filter and select namespace_name.
- In the Value list, select the namespace you want to investigate.
- In the Aggregation field, select Sum > pod_name and click OK. This setting displays a separate time series line for each Pod.
- Click Save chart.
The resulting chart shows you the memory usage for each Pod over time, which can help you visually identify any Pods with unusually high or spiking memory consumption.
Metrics Explorer has a great deal of flexibility in how to construct the metrics that you want to view. For more information about advanced Metrics Explorer options, see Create charts with Metrics Explorer in the Cloud Monitoring documentation.
Create alerts for proactive issue detection
To receive notifications when things go wrong or when metrics breach certain thresholds, set up alerting policies in Cloud Monitoring.
For example, to set up an alerting policy that notifies you when the container CPU limit is over 80% for five minutes, do the following:
In the Google Cloud console, go to the Alerting page.
Click Create policy.
In the Select a metric box, filter for CPU limit utilization and then select the following metric: `kubernetes.io/container/cpu/limit_utilization`.

Click Apply.
Leave the Add a filter field blank. This setting triggers an alert when any cluster violates your threshold.
In the Transform data section, do the following:
- In the Rolling window list, select 1 minute. This setting means that Google Cloud calculates an average value every minute.
In the Rolling window function list, select mean.
Both of these settings average the CPU limit utilization for each container every minute.
Click Next.
In the Configure alert section, do the following:
- For Condition type, select Threshold.
- For Alert trigger, select Any time series violates.
- For Threshold position, select Above threshold.
- For Threshold value, enter `0.8`. This value represents the 80% threshold that you want to monitor for.
- Click Advanced options.
- In the Retest window list, select 5 min. This setting means that the alert triggers only if the CPU utilization stays over 80% for a continuous five-minute period, which reduces false alarms from brief spikes.
- In the Condition name field, give the condition a descriptive name.
- Click Next.
In the Configure the notifications and finalize the alert section, do the following:
- In the Notification channels list, select the channel where you want to receive the alert. If you don't have a channel, click Manage notification channels to create one.
- In the Name the alert policy field, give the policy a clear and descriptive name.
- Leave all other fields with their default values.
- Click Next.
Review your policy, and if it all looks correct, click Create policy.
To learn about the additional ways that you can create alerts, see Alerting overview in the Cloud Monitoring documentation.
Accelerate diagnosis with Gemini Cloud Assist
Sometimes, the cause of your issue isn't immediately obvious, even after you use the tools discussed in the preceding sections. Investigating complex cases can be time-consuming and requires deep expertise. For scenarios like this, Gemini Cloud Assist can help. It can automatically detect hidden patterns, surface anomalies, and provide summaries to help you quickly pinpoint a likely cause.
Access Gemini Cloud Assist
To access Gemini Cloud Assist, complete the following steps:
- In the Google Cloud console, go to any page.
In the Google Cloud console toolbar, click spark Open or close Gemini Cloud Assist chat.
The Cloud Assist panel opens. You can click example prompts if they are displayed, or you can enter a prompt in the Enter a prompt field.
Explore example prompts
To help you understand how Gemini Cloud Assist can help you, here are some example prompts:
Theme | Scenario | Example prompt | How Gemini Cloud Assist can help |
---|---|---|---|
Confusing error message | A Pod has the `CrashLoopBackOff` status, but the error message is hard to understand. | What does this GKE Pod error mean and what are common causes: `panic: runtime error: invalid memory address or nil pointer dereference`? | Gemini Cloud Assist analyzes the message and explains it in clear terms. It also offers potential causes and solutions. |
Performance issues | Your team notices high latency for an app that runs in GKE. | My `api-gateway` service in the `prod` GKE cluster is experiencing high latency. What metrics should I check first, and can you suggest some common GKE-related causes for this? | Gemini Cloud Assist suggests key metrics to examine, explores potential issues (for example, resource constraints or network congestion), and recommends tools and techniques for further investigation. |
Node issues | A GKE node is stuck with a status of `NotReady`. | One of my GKE nodes (`node-xyz`) is showing a `NotReady` status. What are the typical steps to troubleshoot this? | Gemini Cloud Assist provides a step-by-step investigation plan, explaining concepts like node auto-repair and suggesting relevant `kubectl` commands. |
Understanding GKE | You're unsure about a specific GKE feature or how to implement a best practice. | What are the best practices for securing a GKE cluster? Is there any way I can learn more? | Gemini Cloud Assist provides clear explanations of GKE best practices. Click Show related content to see links to official documentation. |
For more information, see the following resources:
- Learn how to write better prompts.
- Learn how to use the Gemini Cloud Assist panel.
- Read Gemini for Google Cloud overview.
- Learn how Gemini for Google Cloud uses your data.
Use Gemini Cloud Assist Investigations
In addition to interactive chat, Gemini Cloud Assist can perform more automated, in-depth analysis through Gemini Cloud Assist Investigations. This feature is integrated directly into workflows like Logs Explorer, and is a powerful root-cause analysis tool.
When you initiate an investigation from an error or a specific resource, Gemini Cloud Assist analyzes logs, configurations, and metrics. It uses this data to produce ranked observations and hypotheses about probable root causes, and then provides you with recommended next steps. You can also transfer these results to a Google Cloud support case to provide valuable context that can help you resolve your issue faster.
For more information, see Gemini Cloud Assist Investigations in the Gemini documentation.
Put it all together: Example troubleshooting scenario
This example shows how you can use a combination of GKE tools to diagnose and understand a common real-world problem: a container that is repeatedly crashing due to insufficient memory.
The scenario
You are the on-call engineer for a web app named product-catalog
that runs in
GKE.
Your investigation begins when you receive an automated alert from Cloud Monitoring:
Alert: High memory utilization for container 'product-catalog' in 'prod' cluster.
This alert tells you that a problem exists and indicates that the problem has
something to do with the product-catalog
workload.
Confirm the problem in the Google Cloud console
You start with a high-level view of your workloads to confirm the issue.
- In the Google Cloud console, you navigate to the Workloads page and filter for your `product-catalog` workload.
- You look at the Pods status column. Instead of the healthy `3/3`, you see the value steadily showing an unhealthy status: `2/3`. This value tells you that one of your app's Pods doesn't have a status of `Ready`.
- You want to investigate further, so you click the name of the `product-catalog` workload to go to its details page.
- On the details page, you view the Managed Pods section. You immediately identify a problem: the `Restarts` column for your Pod shows `14`, an unusually high number.
This high restart count confirms the issue is causing app instability, and suggests that a container is failing its health checks or crashing.
Find the reason with kubectl commands
Now that you know that your app is repeatedly restarting, you need to find out
why. The kubectl describe
command is a good tool for this.
You get the exact name of the unstable Pod:
kubectl get pods -n prod
The output is the following:
NAME                              READY   STATUS             RESTARTS   AGE
product-catalog-d84857dcf-g7v2x   0/1     CrashLoopBackOff   14         25m
product-catalog-d84857dcf-lq8m4   1/1     Running            0          2h30m
product-catalog-d84857dcf-wz9p1   1/1     Running            0          2h30m
You describe the unstable Pod to get the detailed event history:
kubectl describe pod product-catalog-d84857dcf-g7v2x -n prod
You review the output and find clues under the `Last State` and `Events` sections:

Containers:
  product-catalog-api:
    ...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 23 Jun 2025 10:50:15 -0700
      Finished:     Mon, 23 Jun 2025 10:54:58 -0700
    Ready:          False
    Restart Count:  14
    ...
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  25m                 default-scheduler  Successfully assigned prod/product-catalog-d84857dcf-g7v2x to gke-cs-cluster-default-pool-8b8a777f-224a
  Normal   Pulled     8m (x14 over 25m)   kubelet            Container image "us-central1-docker.pkg.dev/my-project/product-catalog/api:v1.2" already present on machine
  Normal   Created    8m (x14 over 25m)   kubelet            Created container product-catalog-api
  Normal   Started    8m (x14 over 25m)   kubelet            Started container product-catalog-api
  Warning  BackOff    3m (x68 over 22m)   kubelet            Back-off restarting failed container
The output gives you two critical clues:

- First, the `Last State` section shows that the container was terminated with `Reason: OOMKilled`, which tells you it ran out of memory. This reason is confirmed by the `Exit Code: 137`, which is the standard Linux exit code for a process that has been killed due to excessive memory consumption.
- Second, the `Events` section shows a `Warning: BackOff` event with the message `Back-off restarting failed container`. This message confirms that the container is in a failure loop, which is the direct cause of the `CrashLoopBackOff` status that you saw earlier.
Visualize the behavior with metrics
The kubectl describe
command told you what happened, but Cloud Monitoring
can show you the behavior of your environment over time.
- In the Google Cloud console, you go to Metrics Explorer.
- You select the `container/memory/used_bytes` metric.
- You filter the output down to your specific cluster, namespace, and Pod name.
The chart shows a distinct pattern: the memory usage climbs steadily, then abruptly drops to zero when the container is OOM killed and restarts. This visual evidence confirms either a memory leak or insufficient memory limit.
Find the root cause in logs
You now know the container is running out of memory, but you still don't know exactly why. To discover the root cause, use Logs Explorer.
- In the Google Cloud console, you navigate to Logs Explorer.
You write a query to filter for your specific container's logs from just before the time of the last crash (which you saw in the output of the `kubectl describe` command):

resource.type="k8s_container"
resource.labels.cluster_name="example-cluster"
resource.labels.namespace_name="prod"
resource.labels.pod_name="product-catalog-d84857dcf-g7v2x"
timestamp >= "2025-06-23T17:50:00Z"
timestamp < "2025-06-23T17:55:00Z"
In the logs, you find a repeating pattern of messages right before each crash:
{ "message": "Processing large image file product-image-large.jpg", "severity": "INFO" }, { "message": "WARN: Memory cache size now at 248MB, nearing limit.", "severity": "WARNING" }
These log entries tell you that the app is trying to process large image files by loading them entirely into memory, which eventually exhausts the container's memory limit.
The findings
By using the tools together, you have a complete picture of the problem:
- The monitoring alert notified you that there was a problem.
- The Google Cloud console showed you that the issue was affecting users (restarts).
- `kubectl` commands pinpointed the exact reason for the restarts (`OOMKilled`).
- Metrics Explorer visualized the memory leak pattern over time.
- Logs Explorer revealed the specific behavior causing the memory issue.
You're now ready to implement a solution. You can either optimize the app code to handle large files more efficiently or, as a short-term fix, increase the container's memory limit (specifically, the `spec.containers.resources.limits.memory` value) in the workload's YAML manifest.
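For example, the following manifest is a minimal sketch of where that field lives. The names, namespace, and image come from the scenario above, but the label selector and the `512Mi` value are illustrative; size the limit based on the usage that you observed in Metrics Explorer:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-catalog
  namespace: prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: product-catalog   # Illustrative label; keep your existing selector.
  template:
    metadata:
      labels:
        app: product-catalog
    spec:
      containers:
      - name: product-catalog-api
        image: us-central1-docker.pkg.dev/my-project/product-catalog/api:v1.2
        resources:
          limits:
            memory: "512Mi"   # Example value only; base it on observed usage.
```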
What's next
For advice about resolving specific problems, review GKE's troubleshooting guides.
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on StackOverflow and using the `google-kubernetes-engine` tag to search for similar issues. You can also join the `#kubernetes-engine` Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.