Troubleshooting NVMe disks


This document lists errors that you might encounter when using disks with the nonvolatile memory express (NVMe) interface.

You can use the NVMe interface for Local SSDs and persistent disks (Persistent Disk or Google Cloud Hyperdisk). Only the most recent machine series, such as Tau T2A, M3, C3, C3D, and H3, use the NVMe interface for Persistent Disk. Confidential VMs also use NVMe for Persistent Disk. All other Compute Engine machine series use the SCSI disk interface for persistent disks.
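If you aren't sure which interface a VM's disks use, you can check the transport type from inside the guest. The following is a minimal example that assumes a Linux guest with the lsblk utility available:

# The TRAN column reports "nvme" for NVMe-attached devices;
# SCSI-attached disks report a different (or empty) transport.
lsblk -o NAME,TRAN,MODEL

# NVMe controllers and namespaces also appear under /dev,
# for example /dev/nvme0 and /dev/nvme0n1.
ls /dev/nvme* 2>/dev/null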

I/O operation timeout error

If you encounter I/O timeout errors, the latency might be exceeding the default timeout value for I/O operations submitted to NVMe devices.

Error message:

[1369407.045521] nvme nvme0: I/O 252 QID 2 timeout, aborting
[1369407.050941] nvme nvme0: I/O 253 QID 2 timeout, aborting
[1369407.056354] nvme nvme0: I/O 254 QID 2 timeout, aborting
[1369407.061766] nvme nvme0: I/O 255 QID 2 timeout, aborting
[1369407.067168] nvme nvme0: I/O 256 QID 2 timeout, aborting
[1369407.072583] nvme nvme0: I/O 257 QID 2 timeout, aborting
[1369407.077987] nvme nvme0: I/O 258 QID 2 timeout, aborting
[1369407.083395] nvme nvme0: I/O 259 QID 2 timeout, aborting
[1369407.088802] nvme nvme0: I/O 260 QID 2 timeout, aborting
...
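To check whether a VM is logging these timeouts, you can search the kernel log from inside the guest. The following is a minimal example; the exact message text can vary between kernel versions:

# Search the kernel ring buffer for NVMe timeout messages.
sudo dmesg | grep -i 'nvme.*timeout'

# On images that use systemd, you can also search the journal's kernel messages.
sudo journalctl -k | grep -i 'nvme.*timeout'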

Resolution:

To resolve this issue, increase the value of the timeout parameter.

  1. View the current value of the timeout parameter.

    1. Determine which NVMe controller is used by the persistent disk or Local SSD volume.
      ls -l /dev/disk/by-id
      
    2. Display the io_timeout setting, specified in seconds, for the disk.

      cat /sys/class/nvme/CONTROLLER_ID/NAMESPACE/queue/io_timeout
      
      Replace the following:

      • CONTROLLER_ID: the ID of the NVMe disk controller, for example, nvme1
      • NAMESPACE: the namespace of the NVMe disk, for example, nvme1n1

      If you have only a single disk that uses NVMe, use the following command (the example after this procedure shows how to check every NVMe namespace at once):

      cat /sys/class/nvme/nvme0/nvme0n1/queue/io_timeout
      

  2. To increase the timeout parameter for I/O operations submitted to NVMe devices, add the following line to the /lib/udev/rules.d/65-gce-disk-naming.rules file, and then restart the VM:

    KERNEL=="nvme*n*", ENV{DEVTYPE}=="disk", ATTRS{model}=="nvme_card-pd", ATTR{queue/io_timeout}="4294967295"
    
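You can check the current value, or the new value after you add the udev rule in step 2 and restart the VM, for every NVMe namespace at once. The following is a minimal example that loops over the same sysfs path used in step 1 and assumes at least one NVMe disk is attached:

# Print the io_timeout value for each NVMe namespace on the VM.
for f in /sys/class/nvme/nvme*/nvme*n*/queue/io_timeout; do
  echo "$f: $(cat "$f")"
done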

Detached disks still appear in the operating system of a compute instance

On VMs that use Linux kernel versions 6.0 to 6.2, operations that use the Compute Engine API method instances.detachDisk or the gcloud compute instances detach-disk command might not work as expected. The Google Cloud console and the compute instance metadata (shown by the gcloud compute disks describe command) both report the device as removed, but the device mount point and any symlinks created by udev rules are still visible in the guest operating system.

Error message:

Attempting to read from the detached disk on the VM results in I/O errors:

sudo head /dev/nvme0n3

head: error reading '/dev/nvme0n3': Input/output error

Issue:

Operating system images that use a Linux 6.0-6.2 kernel, but don't include a backport of the NVMe fix, fail to recognize when an NVMe disk is detached.
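To confirm that you're seeing this issue, compare the devices that the guest OS still exposes with the disks that the Compute Engine API reports as attached. The following is a minimal example; INSTANCE_NAME and ZONE are placeholders for your own values:

# Inside the guest: list the NVMe namespaces that the kernel still exposes.
ls -l /dev/nvme*n*

# From a machine with the gcloud CLI: the disks list in the output
# should no longer include the detached disk.
gcloud compute instances describe INSTANCE_NAME --zone=ZONE --format=yaml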

Resolution:

Reboot the VM to complete the process of removing the disk.

To avoid this issue, use an operating system with a Linux kernel version that doesn't have this problem:

  • 5.19 or older
  • 6.3 or newer

You can use the uname -r command in the guest OS to view the Linux kernel version.
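If you manage many instances, you can script this check. The following is a minimal example; it assumes that the output of uname -r begins with the major and minor version numbers (for example, 6.1.25). Because some 6.0-6.2 kernels include a backport of the fix, treat this check as a coarse indicator only:

# Report whether the running kernel is in the affected 6.0-6.2 range.
kernel="$(uname -r)"
major="$(echo "$kernel" | cut -d. -f1)"
minor="$(echo "$kernel" | cut -d. -f2)"
if [ "$major" -eq 6 ] && [ "$minor" -le 2 ]; then
  echo "Kernel $kernel is in the affected range; reboot the VM after detaching a disk."
else
  echo "Kernel $kernel is outside the affected 6.0-6.2 range."
fi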
