Troubleshoot GPUs in GKE

Autopilot Standard

If your Google Kubernetes Engine (GKE) Pods are stuck in a Pending state while requesting nvidia.com/gpu resources, or if your nodes fail to register their available GPUs, you might have an issue with the NVIDIA driver installation or your node pool configuration. These problems prevent your workloads from accessing the GPU hardware that they need.

This document shows you how to diagnose and resolve common problems that prevent GKE from scheduling or running GPU-accelerated workloads. Learn how to verify the GPU driver installation, inspect Pod and node logs for errors, and confirm that your configurations are correct.

This information is for Platform admins and operators who manage GPU-enabled node pools and need to resolve NVIDIA driver issues, and for Application developers who need to debug GPU workloads that are stuck or failing to start. For more information about the common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

GPU driver installation

This section provides troubleshooting information for automatic NVIDIA device driver installation in GKE.

Driver installation fails in Ubuntu nodes

If you use Ubuntu nodes that have attached L4, RTX PRO 6000, H100, or H200 GPUs, the default GPU driver that GKE installs might be earlier than the minimum version that those GPUs require. As a result, the GPU device plugin Pod remains stuck in the Pending state and your GPU workloads on those nodes might experience issues.

To resolve this issue, see the instructions for the respective GPU:

L4 and H100

To resolve this issue for L4 and H100 GPUs, we recommend upgrading to one of the following GKE versions, which install GPU driver version 535 as the default driver:

1.26.15-gke.1483000 and later
1.27.15-gke.1039000 and later
1.28.11-gke.1044000 and later
1.29.6-gke.1073000 and later
1.30.2-gke.1124000 and later
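
To check whether a node already runs a version that includes the fix, you can compare its GKE version against the minimum patch listed for the same minor version. The following is a minimal sketch, not an official tool: the `gke_version_at_least` function name is hypothetical, and the helper assumes that both versions use the MAJOR.MINOR.PATCH-gke.BUILD form shown in this document.

```shell
# Hypothetical helper: succeeds (exit 0) if GKE version $1 is at or later
# than version $2. Compare against the minimum for the same minor version.
gke_version_at_least() {
  awk -v cur="$1" -v min="$2" 'BEGIN {
    # Strip an optional leading "v", as in "v1.29.6-gke.1073000".
    sub(/^v/, "", cur); sub(/^v/, "", min)
    # Split on "." and "-": "1.29.6-gke.1073000" becomes 1 29 6 gke 1073000.
    n = split(cur, c, "[.-]"); split(min, m, "[.-]")
    for (i = 1; i <= n; i++) {
      if (c[i] == "gke") continue        # skip the literal "gke" field
      if (c[i] + 0 > m[i] + 0) exit 0    # first differing field is newer
      if (c[i] + 0 < m[i] + 0) exit 1    # first differing field is older
    }
    exit 0                               # all numeric fields are equal
  }'
}
```

For example, `gke_version_at_least 1.29.5-gke.1000000 1.29.6-gke.1073000` fails, which suggests that an upgrade is needed. To list the version that each node reports, you can run `kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion`. The sketch strips a leading `v` because the reported value might include one.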

Alternatively, you can manually install driver version 535 or later by running the following command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R535.yaml

RTX PRO 6000

To resolve this issue for RTX PRO 6000 GPUs, upgrade to one of the following GKE versions. These versions install GPU driver version 580 as the default driver:

1.32.8-gke.1170000 and later
1.33.4-gke.1245000 and later
1.34.0-gke.1662000 and later

H200

To resolve this issue for H200 GPUs, you must manually install driver version 550 or later by running the following command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R550.yaml

GPU device plugins fail with CrashLoopBackOff errors

The following issue occurs if you used the manual driver installation method in your node pool prior to January 25, 2023 and later upgraded your node pool to a GKE version that supports automatic driver installation. Both installation workloads exist at the same time and try to install conflicting driver versions on your nodes.

The GPU device plugin init container fails with the Init:CrashLoopBackOff status. The logs for the container are similar to the following:

failed to verify installation: failed to verify GPU driver installation: exit status 18

To resolve this issue, try one of the following methods:

Remove the manual driver installation DaemonSet from your cluster. This deletes the conflicting installation workload and lets GKE automatically install a driver to your nodes.

Note: Ensure that all of your node pools use automatic installation before you delete the DaemonSet.
```
kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```
Re-apply the manual driver installation DaemonSet manifest to your cluster. On January 25, 2023, we updated the manifest to ignore nodes that use automatic driver installation.
```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```
Disable automatic driver installation for your node pool. The existing driver installation DaemonSet should work as expected after the update operation completes.
```
gcloud container node-pools update POOL_NAME \
    --accelerator=type=GPU_TYPE,count=GPU_COUNT,gpu-driver-version=disabled \
    --cluster=CLUSTER_NAME \
    --location=LOCATION
```
Replace the following:
- POOL_NAME: the name of the node pool.
- GPU_TYPE: the GPU type that the node pool already uses.
- GPU_COUNT: the number of GPUs that are already attached to the node pool.
- CLUSTER_NAME: the name of the GKE cluster that contains the node pool.
- LOCATION: the Compute Engine location of the cluster.

For more information about mapping the GPU driver version to GKE version, see Map the GKE version and Container-Optimized OS node image version to the GPU driver version.

Error: "Container image cos-nvidia-installer:fixed is not present with pull policy of Never." or "Container image ubuntu-nvidia-installer:fixed is not present with pull policy of Never."

This issue occurs when the nvidia-driver-installer Pods are in the PodInitializing state and the GPU device plugin or the GPU driver installer Pods report the following error. The specific error message depends on the operating system running on your node:

COS

Container image "cos-nvidia-installer:fixed" is not present with pull policy of Never.

Ubuntu

Container image "gke-nvidia-installer:fixed" is not present with pull policy of Never.

This issue can occur when the garbage collector removes the preloaded NVIDIA driver image to free space on a node. When the driver Pod is recreated or its container is restarted, GKE won't be able to locate the preloaded image.

To mitigate the garbage collection issue when you are running COS, upgrade your GKE nodes to one of these versions that contain the fix:

1.25.15-gke.1040000 and later
1.26.10-gke.1030000 and later
1.27.6-gke.1513000 and later
1.28.3-gke.1061000 and later

For more information about mapping the GPU driver version to GKE version, see Map the GKE version and Container-Optimized OS node image version to the GPU driver version.

If your nodes are running Ubuntu, no fix is available yet for this garbage collection issue. To mitigate this issue on Ubuntu, you can run a privileged container that interacts with the host to ensure the correct setup of NVIDIA GPU drivers. To do so, run sudo /usr/local/bin/nvidia-container-first-boot from your node or apply the following manifest:

apiVersion: v1
kind: Pod
metadata:
  name: gke-nvidia-installer-fixup
spec:
  nodeSelector:
    cloud.google.com/gke-os-distribution: ubuntu
  hostPID: true
  containers:
  - name: installer
    image: ubuntu
    securityContext:
      privileged: true
    command:
      - nsenter
      - -at
      - '1'
      - --
      - sh
      - -c
      - "/usr/local/bin/nvidia-container-first-boot"
  restartPolicy: Never

Another potential cause of the issue is when the NVIDIA driver images are lost after node reboot or host maintenance. This may occur on confidential nodes, or nodes with GPUs, that use ephemeral local SSD storage. In this situation, GKE preloads the nvidia-installer-driver container images on nodes and moves them from the boot disk to the local SSD on first boot.

To confirm there was a host maintenance event, use the following log filter:

resource.type="gce_instance"
protoPayload.serviceName="compute.googleapis.com"
log_id("cloudaudit.googleapis.com/system_event")
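
If you prefer to run this filter from the command line, the following sketch wraps it in a `gcloud logging read` call. The function names, the seven-day freshness window, the entry limit, and the output columns are assumptions chosen for illustration, so adjust them to your environment.

```shell
# Prints the log filter from this section as a single query string.
host_maintenance_filter() {
  printf '%s AND %s AND %s\n' \
    'resource.type="gce_instance"' \
    'protoPayload.serviceName="compute.googleapis.com"' \
    'log_id("cloudaudit.googleapis.com/system_event")'
}

# Hypothetical wrapper: $1 is your project ID. The time window, limit,
# and table columns are illustrative choices, not required values.
list_system_events() {
  gcloud logging read "$(host_maintenance_filter)" \
    --project="$1" --freshness=7d --limit=20 \
    --format='table(timestamp, resource.labels.instance_id, protoPayload.methodName)'
}
```

Run `list_system_events PROJECT_ID`, and then look for entries whose timestamp and instance match the node that lost its driver images.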

To mitigate the host maintenance issue, upgrade your GKE version to one of these versions:

1.27.13-gke.1166000 and later
1.28.8-gke.1171000 and later
1.29.3-gke.1227000 and later

Error: failed to configure GPU driver installation dirs: failed to create lib64 overlay: failed to create dir /usr/local/nvidia/lib64: mkdir /usr/local/nvidia/lib64: not a directory.

You encounter this error from the GPU driver installer container inside the GPU device plugin when NCCL fastsocket is enabled:

failed to configure GPU driver installation dirs: failed to create lib64 overlay: failed to create dir /usr/local/nvidia/lib64: mkdir /usr/local/nvidia/lib64: not a directory.

This issue only happens on clusters and nodes running GKE 1.28 and 1.29.

The issue is caused by an NCCL fastsocket race condition with the GPU driver installer.

To mitigate this issue, upgrade your GKE version to one of these versions:

1.28.8-gke.1206000 and later
1.29.3-gke.1344000 and later

For more information, read the GPUDirect-TCPXO Release Notes.

Error: Failed to get device for nvidia0: device nvidia0 not found.

The following error indicates an XID 62 error and that RmInitAdapter failed for the GPU with minor number 0:

Failed to get device for nvidia0: device nvidia0 not found.

NVIDIA driver version 525.105.17 has a bug that can cause communication errors (XID) and prevent the GPU from initializing.

To fix this issue, upgrade the NVIDIA driver to driver version 525.110.11 or later.

Map the GKE version and Container-Optimized OS node image version to the GPU driver version

To find the GPU driver versions that are mapped with GKE versions and Container-Optimized OS node image versions, do the following steps:

Map Container-Optimized OS node image versions to GKE patch versions for the specific GKE version where you want to find the GPU driver version. For example, 1.33.0-gke.1552000 uses cos-121-18867-90-4.
Choose the milestone of the Container-Optimized OS node image version in the Container-Optimized OS release notes. For example, choose Milestone 121 for cos-121-18867-90-4.
In the release notes page for the specific milestone, find the release note corresponding with the specific Container-Optimized OS node image version. For example, in Container-Optimized OS Release Notes: Milestone 121, see cos-121-18867-90-4. In the table in the GPU Drivers column, click See List to see the GPU driver version information.

Reset GPUs on A3 VMs

Some issues might require you to reset the GPU on an A3 VM.

To reset the GPU, follow these steps:

Remove Pods that request GPU resources from the node where you need to reset the GPU.

Disable the GPU device plugin on the node:

kubectl get nodes \
    --selector=kubernetes.io/hostname=NODE_NAME \
    --no-headers | awk '{print $1}' \
    | xargs -I{} kubectl label node {} gke-no-default-nvidia-gpu-device-plugin=true

Replace NODE_NAME with the name of the node.

Connect to the VM backing the node.

In the SSH session, reset the GPU:

/home/kubernetes/bin/nvidia/bin/nvidia-smi --gpu-reset

Re-enable the GPU device plugin:

kubectl get nodes --selector=kubernetes.io/hostname=NODE_NAME \
    --no-headers | awk '{print $1}' \
    | xargs -I{} kubectl label node {} gke-no-default-nvidia-gpu-device-plugin=false \
    --overwrite
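
Because the disable and re-enable steps differ only in the label value, you might wrap them in a small helper. The following sketch uses a hypothetical `set_gpu_device_plugin` function and assumes that the node name matches its kubernetes.io/hostname label, so it labels the node directly instead of looking it up with a selector.

```shell
# Hypothetical helper: toggles the default GPU device plugin on one node.
# Usage: set_gpu_device_plugin NODE_NAME on|off
set_gpu_device_plugin() {
  case "$2" in
    off) disabled=true ;;   # label value "true" disables the plugin
    on)  disabled=false ;;  # label value "false" re-enables the plugin
    *)   echo "usage: set_gpu_device_plugin NODE_NAME on|off" >&2
         return 2 ;;
  esac
  # KUBECTL can be overridden, for example KUBECTL=echo for a dry run.
  ${KUBECTL:-kubectl} label node "$1" \
    "gke-no-default-nvidia-gpu-device-plugin=${disabled}" --overwrite
}
```

Run `set_gpu_device_plugin NODE_NAME off` before the reset and `set_gpu_device_plugin NODE_NAME on` afterward. Passing `--overwrite` in both directions lets you repeat the procedure on a node that already has the label.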

GPUs on Confidential GKE Nodes

The following sections show you how to identify and fix issues with GPUs that run on Confidential GKE Nodes.

GPU workloads not scheduling on Confidential GKE Nodes

Confidential GKE Nodes requires that you manually install a GPU driver that corresponds to your selected GPU type and your GKE version. If your GPU Pods aren't scheduling on Confidential GKE Nodes and remain in the Pending state, inspect the driver installation DaemonSet:

kubectl --namespace=kube-system get daemonset nvidia-driver-installer -o yaml

If the output returns a NotFound error, install the driver.

If the DaemonSet is running, the output is similar to the following:

apiVersion: apps/v1
kind: DaemonSet
# lines omitted for clarity
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: nvidia-driver-installer
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: nvidia-driver-installer
        name: nvidia-driver-installer
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
              - key: cloud.google.com/gke-gpu-driver-version
                operator: DoesNotExist
              - key: cloud.google.com/gke-confidential-nodes-instance-type
                operator: In
                values:
                - TDX

In this output, verify that the nodeAffinity field contains the cloud.google.com/gke-confidential-nodes-instance-type key. If the output doesn't contain this key, the driver installation DaemonSet doesn't support Confidential GKE Nodes.
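
To avoid scanning the manifest by eye, you can pipe the output through a text match. The following is a rough sketch with a hypothetical function name. It only checks that the key appears somewhere in the manifest, not that the key is inside the nodeAffinity field, so treat a match as a hint and not as proof.

```shell
# Hypothetical helper: reads a DaemonSet manifest on standard input and
# succeeds (exit 0) only if it mentions the Confidential GKE Nodes key.
# This is a plain text match and does not parse the YAML structure.
daemonset_supports_confidential_nodes() {
  grep -qF 'cloud.google.com/gke-confidential-nodes-instance-type'
}
```

For example, run `kubectl --namespace=kube-system get daemonset nvidia-driver-installer -o yaml | daemonset_supports_confidential_nodes && echo supported`.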

Deploy the DaemonSet that supports GPUs on Confidential GKE Nodes:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/cos/daemonset-confidential.yaml

After you install the drivers, check whether your GPU workloads start successfully.

Error: Failed to allocate device vector

The following error message in your GPU container logs indicates that the GPU was detached from the node VM instance:

Failed to allocate device vector A (error code unknown error)!

This detachment might happen because of a hardware error or because of an issue with the encryption keys.

To resolve this issue, reboot the node instance. This operation is disruptive, and affects all of the workloads on that node. To reboot the instance, do the following steps:

Get the name of the node that runs the GPU Pod:
```
kubectl get pod POD_NAME -o yaml | grep "nodeName"
```
Replace POD_NAME with the name of the failing Pod.

The output is similar to the following:
```
nodeName: gke-cluster-1-default-pool-b7asdfbt-fd3e
```
Reset the Compute Engine instance:
```
gcloud compute instances reset NODE_NAME
```
Replace NODE_NAME with the node name from the output of the previous step.

The gcloud CLI looks for VMs with that name in your active project. If you see a prompt to select a zone, specify Y.
Check whether your GPU workloads run without errors.
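
The grep output in the first step includes the `nodeName:` label, so you need to strip it before you pass the value to the gcloud CLI. The following sketch shows one way to do that. The `pod_node_name` function is hypothetical and assumes that the first `nodeName:` line in the Pod manifest is the one from the Pod spec.

```shell
# Hypothetical helper: reads a Pod manifest on standard input and prints
# only the value of the first "nodeName:" field, without the label.
pod_node_name() {
  sed -n 's/^[[:space:]]*nodeName:[[:space:]]*//p' | head -n 1
}
```

For example, `NODE_NAME="$(kubectl get pod POD_NAME -o yaml | pod_node_name)"` stores the bare node name, which you can then pass to `gcloud compute instances reset "$NODE_NAME"`. As an alternative, `kubectl get pod POD_NAME -o jsonpath='{.spec.nodeName}'` prints the same field without any text processing.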

Error: Decryption failed with error -74

The following error message in your node logs indicates that the encryption keys for the GPU were lost:

Decryption failed with error -74

This error happens when the NVIDIA persistence daemon, which runs on the node VM instance, fails. If you see this error message, reset the instance:

gcloud compute instances reset NODE_NAME

Replace NODE_NAME with the name of the failing node.

The gcloud CLI looks for VMs with that name in your active project. If you see a prompt to select a zone, specify Y.

If resetting the instance doesn't fix this issue, contact Cloud Customer Care or submit a product bug. For more information, see Get support.

Finding XID errors

The gpu-device-plugin daemonset runs within the kube-system namespace and is responsible for the following:

GPU workload scheduling: allocating GPU resources to Pods.
GPU health checking: monitoring the health of your GPUs.
GPU metrics gathering: collecting GPU-related metrics, such as duty cycle and memory usage.

The gpu-device-plugin uses NVIDIA Management Library (NVML) to detect XID errors. When an XID error occurs, the gpu-device-plugin Pod running on the affected node logs the error. You will find two types of XID error logs:

Non-critical XID errors:
- Log format: Skip error Xid=%d as it is not Xid Critical
- Meaning: These errors are considered non-critical. They can be caused by various software or hardware issues.
- Action: GKE takes no automated action for non-critical XID errors.
Critical XID errors:
- Log format: XidCriticalError: Xid=%d, All devices will go unhealthy
- Meaning: These errors indicate a GPU hardware issue.
- Action:
  - GKE marks the node's GPU resources as unhealthy.
  - GKE prevents GPU workloads from being scheduled on the node.
  - If node auto-repair is enabled, GKE will recreate the node.
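
To pull only the XID entries out of the device plugin logs, you can filter on the two log formats listed above. The following sketch assumes that the log lines contain those formats verbatim. The `classify_xid_logs` function name is hypothetical.

```shell
# Hypothetical helper: reads gpu-device-plugin log lines on standard input
# and prints "critical XID" or "non-critical XID" for each XID entry.
# Lines that match neither log format are dropped.
classify_xid_logs() {
  sed -n \
    -e 's/.*XidCriticalError: Xid=\([0-9][0-9]*\),.*/critical \1/p' \
    -e 's/.*Skip error Xid=\([0-9][0-9]*\) as it is not Xid Critical.*/non-critical \1/p'
}
```

For example, run `kubectl logs --namespace=kube-system POD_NAME --all-containers=true | classify_xid_logs`, where POD_NAME is the gpu-device-plugin Pod that runs on the affected node.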

GPUDirect-TCPX(O) issues

This section provides troubleshooting information for GPUDirect-TCPX(O) issues.

Release note and upgrade instructions

For new users, Maximize GPU network bandwidth in Standard mode clusters provides guidance on using GPUDirect-TCPX(O). For existing users, read the GPUDirect-TCPXO Release Notes for release information and upgrade instructions, because new versions are continuously released.

Debug with NCCL logs

If you can't resolve an issue with NCCL, collect NCCL logs with debugging information. These logs contain detailed information about NCCL operations and can help you find the source of your problem. Collect these logs before you open a case with Cloud Customer Care, because they can help Cloud Customer Care resolve your issue more quickly.

To generate and collect the logs, complete the following steps:

Set the following environment variables inside your Pod or application manifest:
```
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,NET,ENV,COLL,GRAPH
NCCL_DEBUG_FILE=/DIRECTORY/FILE_NAME.%h.%p
```
For more information about these environment variables, read collect NCCL debugging logs.
To generate data for your logs, run an NCCL test. The way to run this test depends on the type of cluster that you use. For GKE clusters, you can deploy and run NCCL test with Topology Aware Scheduling (TAS). After you run the NCCL test, NCCL automatically generates the logs on all participating nodes.
Collect the logs from all nodes. To confirm that you've collected NCCL logs from all nodes, check that the logs contain the following information:
- The hostnames of all VMs that are involved in a workload.
- The PIDs of all relevant processes on the VM.
- The ranks of all GPUs that are used by the workload on each VM.
If you're not sure where the log files are located, the following example shows you where NCCL creates the log files when the NCCL_DEBUG_FILE variable is set to /tmp/nccl_log.%h.%p. You have two VMs named a3plus-vm-1 and a3plus-vm-2, and each VM runs eight processes within the workload container. In this scenario, NCCL creates the following log files under the /tmp directory within the workload container on each VM:
- On a3plus-vm-1: eight log files named nccl_log.a3plus-vm-1.PID, where PID is the process ID.
- On a3plus-vm-2: eight log files named nccl_log.a3plus-vm-2.PID.
Review the logs. NCCL log entries have the following format:
```
HOSTNAME:PID:TID [RANK] NCCL_MESSAGE
```
These log entries contain the following values:
- HOSTNAME: the hostname of the VM. This value identifies which VM was being used when NCCL generated the log entry.
- PID: the PID. This value identifies which process generated the log entry.
- TID: the thread ID. This value identifies which thread within the process was being used when NCCL generated the log entry.
- RANK: the local rank ID. This value identifies which GPU was being used when NCCL generated the log entry. Ranks are numbered from 0 to N-1, where N is the total number of GPUs that are involved in the process. For example, if your workload runs with eight GPUs per VM, then each VM should have eight different rank values (0-7).
- NCCL_MESSAGE: a descriptive message that provides more information about the event and includes the timestamp of when NCCL created the log.
For example:
```
gke-a3plus-mega-np-2-aa33fe53-7wvq:1581:1634 [1] NCCL INFO 00:09:24.631392: NET/FasTrak plugin initialized.
```
This example has the following values:
- gke-a3plus-mega-np-2-aa33fe53-7wvq: the hostname.
- 1581: the process ID.
- 1634: the thread ID.
- 1: the local rank ID.
- NCCL INFO 00:09:24.631392: NET/FasTrak plugin initialized.: the message explaining what happened.
If you're opening a support case, package the logs that you collected, along with the output of the NCCL test, into a zip file. Include the zip file when you submit a support case to Cloud Customer Care.
To stop collecting NCCL debugging logs, remove the environment variables that you added in the first step.
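
To check that the collected logs cover every host, process, and rank, you can summarize them with a short filter. The following sketch assumes that each entry follows the HOSTNAME:PID:TID [RANK] NCCL_MESSAGE format described above. The `nccl_log_summary` function name is hypothetical.

```shell
# Hypothetical helper: reads NCCL log lines on standard input and prints
# each distinct "HOSTNAME PID RANK" combination once.
nccl_log_summary() {
  awk '
    # Keep lines shaped like "HOSTNAME:PID:TID [RANK] ...".
    $2 ~ /^\[[0-9]+\]$/ && split($1, id, ":") == 3 {
      rank = substr($2, 2, length($2) - 2)   # drop the square brackets
      key = id[1] " " id[2] " " rank
      if (!(key in seen)) { seen[key] = 1; print key }
    }'
}
```

For example, with NCCL_DEBUG_FILE set to /tmp/nccl_log.%h.%p, running `cat /tmp/nccl_log.* | nccl_log_summary | sort` inside the workload container lists one line per host, process, and rank. If your workload runs with eight GPUs per VM, you would expect to see ranks 0 through 7 for each hostname.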

What's next

If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on Stack Overflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.