Troubleshooting

This page includes troubleshooting steps for some common issues and errors.

Known issues

  • The following Lustre features are not supported:
    • Client-side data compression
    • Persistent client caching
    • Unsigned or open-source Lustre clients
  • Some Lustre commands are not supported.
  • Using Managed Lustre from Google Kubernetes Engine has the following limitations:
    • The GKE node pool version must be 1.31.5-gke.1299000 or 1.32.1-gke.1673000, or a newer patch version.
    • Only Container-Optimized OS (COS) nodes are supported.
    • Only Standard clusters are supported. Autopilot clusters are not supported.
  • Some exceptions to POSIX compliance exist.

Google Kubernetes Engine nodes are not able to connect to a Managed Lustre instance

Verify that the Managed Lustre instance has gke-support-enabled specified:

gcloud lustre instances describe INSTANCE_ID \
  --location=LOCATION | grep gkeSupportEnabled
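
If GKE support is enabled on the instance, the output is similar to the following:

gkeSupportEnabled: true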

If the GKE support flag has been enabled and you still cannot connect, continue to the next section.

Log queries

To check logs, run the following queries in Logs Explorer.

Managed Lustre CSI driver controller server logs:

resource.type="k8s_container"
resource.labels.pod_name=~"lustre-csi-controller*"

Managed Lustre CSI driver node server logs:

resource.type="k8s_container"
resource.labels.pod_name=~"lustre-csi-node*"

Pod event warnings

If your workload Pods cannot start up, run the following command to check the Pod events:

kubectl describe pod POD_NAME -n NAMESPACE
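
To list only the events for the Pod, sorted by time, you can also run the following:

kubectl get events -n NAMESPACE \
  --field-selector involvedObject.name=POD_NAME \
  --sort-by=.lastTimestamp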

Then, read the following sections for information about your specific error.

CSI driver enablement issues

The following errors indicate issues with the CSI driver:

MountVolume.MountDevice failed for volume "xxx" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name lustre.csi.storage.gke.io not found in the list of registered CSI drivers
MountVolume.SetUp failed for volume "xxx" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name lustre.csi.storage.gke.io not found in the list of registered CSI drivers

These warnings indicate that the CSI driver is either not installed or not yet running. Double-check that the CSI driver is running in your cluster by following the instructions in Install the CSI driver.

If the cluster was recently scaled, updated, or upgraded, this warning is expected and should be transient, as the CSI driver Pods may take a few minutes to become fully functional after cluster operations.
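
To quickly confirm that the driver is registered and its Pods are running, you can check for the CSIDriver object and the driver Pods (Pod names may vary by driver version):

kubectl get csidriver lustre.csi.storage.gke.io

kubectl get pods --all-namespaces | grep lustre-csi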

MountVolume.SetUp failures

RPC error code values can be used to triage MountVolume.SetUp issues. For example, Unauthenticated and PermissionDenied errors usually mean the authentication was not configured correctly. An RPC error code of Internal means that unexpected issues occurred in the CSI driver; create a new issue on the GitHub project page.

Unauthenticated and PermissionDenied

RPC errors with code Unauthenticated or PermissionDenied usually indicate that your authentication was not configured correctly.

AlreadyExists

An AlreadyExists error may look like the following:

MountVolume.SetUp failed for volume "xxx" : rpc error: code = AlreadyExists desc = A mountpoint with the same lustre filesystem name "xxx" already exists on node "xxx". Please mount different lustre filesystems

Recreate the Managed Lustre instance with a different file system name, or use another Managed Lustre instance with a unique file system name. Mounting multiple volumes from different Managed Lustre instances with the same file system name on a single node is not supported: identical file system names resolve to the same major and minor device numbers, which conflicts with the per-node shared mount architecture.
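
To see which Lustre file systems are already mounted on a node, run the following from a shell on that node:

mount -t lustre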

Internal

An Internal error code usually contains additional information to help locate the issue.

Is the MGS specification correct? Is the filesystem name correct?

MountVolume.MountDevice failed for volume "preprov-pv-wrongfs" : rpc error: code = Internal desc = Could not mount "10.90.2.4@tcp:/testlfs1" at "/var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount" on node gke-lustre-default-nw-6988-pool-1-acbefebf-jl1v: mount failed: exit status 2
Mounting command: mount
Mounting arguments: -t lustre 10.90.2.4@tcp:/testlfs1 /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount
Output: mount.lustre: mount 10.90.2.4@tcp:/testlfs1 at /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

This error means the file system name of the Managed Lustre instance you're trying to mount is incorrect or does not exist. Double-check the file system name of the Managed Lustre instance.
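
To double-check the name, describe the instance and look for the file system field (the exact field name in the output may vary):

gcloud lustre instances describe INSTANCE_ID \
  --location=LOCATION | grep -i filesystem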

Is the MGS running?

MountVolume.MountDevice failed for volume "preprov-pv-wrongfs" : rpc error: code = Internal desc = Could not mount "10.90.2.5@tcp:/testlfs" at "/var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount" on node gke-lustre-default-nw-6988-pool-1-acbefebf-jl1v: mount failed: exit status 5
Mounting command: mount
Mounting arguments: -t lustre 10.90.2.5@tcp:/testlfs /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount
Output: mount.lustre: mount 10.90.2.5@tcp:/testlfs at /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount failed: Input/output error
Is the MGS running?

This error means your Google Kubernetes Engine cluster cannot connect to the Managed Lustre instance using the specified IP address and file system name. Ensure the IP address is correct and that your Google Kubernetes Engine cluster is in the same VPC network as your Managed Lustre instance.

To check that the IP address is correct, run the following command:

sudo lctl ping IP_ADDRESS@tcp0

If the IP address is correct, the result looks like the following:

12345-0@lo
12345-172.26.15.16@tcp

If the IP address is unreachable, the error looks like the following:

failed to ping 172.26.15.16@tcp: Connection timed out
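
To confirm that the two resources share a VPC network, you can compare the network reported for each (a sketch; field names in the describe output may differ):

gcloud container clusters describe CLUSTER_NAME \
  --location=LOCATION --format="value(network)"

gcloud lustre instances describe INSTANCE_ID \
  --location=LOCATION | grep -i network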

Errors not listed

Warnings not listed in this section that include the RPC error code Internal indicate unexpected issues in the CSI driver. Create a new issue on the GitHub project page and include your GKE cluster version, detailed workload information, and the Pod event warning message.

Troubleshooting VPC networks

The following sections describe common VPC network issues.

Permission denied to add peering for service servicenetworking.googleapis.com

ERROR: (gcloud.services.vpc-peerings.connect) User [$(USER)] does not have
permission to access services instance [servicenetworking.googleapis.com]
(or it may not exist): Permission denied to add peering for service
'servicenetworking.googleapis.com'.

This error means that you don't have the servicenetworking.services.addPeering IAM permission on your user account.
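
To check which roles your account already has on the project, you can run the following (replace PROJECT_ID and USER_EMAIL):

gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:user:USER_EMAIL" \
  --format="value(bindings.role)"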

See Access control with IAM for instructions on adding one of the following roles to your account:

  • roles/compute.networkAdmin or
  • roles/servicenetworking.networksAdmin
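
For example, a project owner or IAM admin can grant one of these roles with a command like the following:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/servicenetworking.networksAdmin"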

Cannot modify allocated ranges in CreateConnection

ERROR: (gcloud.services.vpc-peerings.connect) The operation
"operations/[operation_id]" resulted in a failure "Cannot modify allocated
ranges in CreateConnection. Please use UpdateConnection."

This error is returned when a VPC peering connection already exists on this network with different IP ranges. There are two possible solutions:

Replace the existing IP ranges:

gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges=IP_RANGE_NAME \
  --service=servicenetworking.googleapis.com \
  --force

Or, add the new IP range to the existing connection:

  1. Retrieve the list of existing IP ranges for the peering:

    EXISTING_RANGES="$(
      gcloud services vpc-peerings list \
        --network=NETWORK_NAME \
        --service=servicenetworking.googleapis.com \
        --format="value(reservedPeeringRanges.list())" \
        --flatten=reservedPeeringRanges
    )"
    
  2. Then, add the new range to the peering:

    gcloud services vpc-peerings update \
      --network=NETWORK_NAME \
      --ranges="${EXISTING_RANGES}",IP_RANGE_NAME \
      --service=servicenetworking.googleapis.com
    

IP address range exhausted

If instance creation fails with an IP address range exhausted error:

ERROR: (gcloud.lustre.instances.create) FAILED_PRECONDITION: Invalid
resource state for "NETWORK_RANGES_NOT_AVAILABLE": IP address range exhausted

Follow the VPC guide to modify the existing private connection to add IP address ranges.

We recommend a prefix length of at least /20 (4,096 addresses).
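
For example, to allocate an additional /20 range for the private connection (the range name here is illustrative):

gcloud compute addresses create lustre-peering-range \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=20 \
  --network=NETWORK_NAME

Then add the new range to the existing peering connection as described in the previous section.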