This page includes troubleshooting steps for some common issues and errors.
Known issues
- The following Lustre features are not supported:
  - Client-side data compression
  - Persistent client caching
- Some Lustre commands are not supported.
- Some exceptions to POSIX compliance exist.
Compute Engine issues
If you encounter issues mounting a Managed Lustre file system on a Compute Engine instance, follow these steps to diagnose the problem.
Verify that the Managed Lustre instance is reachable
First, ensure that your Managed Lustre instance is reachable from your Compute Engine instance:
sudo lctl ping IP_ADDRESS@tcp
To obtain the value of IP_ADDRESS, see Get an instance.
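As a possible shortcut, you can also read the IP address from the instance's mount point with gcloud; this sketch assumes the instance resource exposes a mountPoint field of the form IP_ADDRESS@tcp:/FILESYSTEM_NAME:
# Sketch only: print the mount point, which contains the instance IP address.
gcloud lustre instances describe INSTANCE_NAME \
  --location=LOCATION \
  --format='get(mountPoint)'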
A successful ping returns a response similar to the following:
12345-0@lo
12345-10.115.0.3@tcp
A failed ping returns the following:
failed to ping 10.115.0.3@tcp: Input/output error
If your ping fails:
Make sure your Managed Lustre instance and your Compute Engine instance are in the same VPC network. Compare the output of the following commands:
gcloud compute instances describe VM_NAME \
  --zone=VM_ZONE \
  --format='get(networkInterfaces[0].network)'

gcloud lustre instances describe INSTANCE_NAME \
  --location=ZONE \
  --format='get(network)'
The output looks like:
https://www.googleapis.com/compute/v1/projects/my-project/global/networks/my-network
projects/my-project/global/networks/my-network
The output of the gcloud compute instances describe command is prefixed with https://www.googleapis.com/compute/v1/; everything following that prefix must match the output of the gcloud lustre instances describe command.
Review your VPC network's firewall rules and routing configuration to ensure they allow traffic between your Compute Engine instance and the Managed Lustre instance.
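For example, a minimal sketch of a firewall rule that allows Lustre traffic on the network might look like the following; the rule name, source range, and ports are assumptions to adapt to your own setup (988 is the default LNet TCP port, and 6988 is used by gke-support-enabled instances):
# Sketch only: allow inbound Lustre (LNet) traffic on the VPC network.
gcloud compute firewall-rules create allow-lustre-lnet \
  --network=NETWORK_NAME \
  --allow=tcp:988,tcp:6988 \
  --source-ranges=SOURCE_CIDR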
Check the LNet accept port
Managed Lustre instances can be configured to support GKE clients by specifying the --gke-support-enabled flag at creation time.
If GKE support has been enabled, you must configure LNet on all Compute Engine instances to use accept_port 6988. See Configure LNet for gke-support-enabled instances.
To determine whether the instance has been configured to support GKE clients, run the following command:
gcloud lustre instances describe INSTANCE_NAME \
--location=LOCATION | grep gkeSupportEnabled
If the command returns gkeSupportEnabled: true, you must configure LNet.
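As a rough illustration, configuring the accept port typically amounts to setting an LNet module option and reloading the Lustre modules; the file name and reload sequence below are assumptions, so follow the Configure LNet page for the authoritative steps:
# Sketch only: set the LNet accept port expected by gke-support-enabled instances.
# The file name is an assumption; merge with any existing lnet options you have.
echo "options lnet accept_port=6988" | sudo tee /etc/modprobe.d/lnet.conf
# Unmount any Lustre file systems, then reload the Lustre kernel modules.
sudo lustre_rmmod
sudo modprobe lustre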
Ubuntu kernel version mismatch with Lustre client
For Compute Engine instances running Ubuntu, the Ubuntu kernel version must match the specific version of the Lustre client packages. If your Lustre client tools are failing, check whether your Compute Engine instance has auto-upgraded to a newer kernel.
To check your kernel version:
uname -r
The response looks like:
6.8.0-1029-gcp
To check your Lustre client package version:
dpkg -l | grep -i lustre
The response looks like:
ii lustre-client-modules-6.8.0-1029-gcp 2.14.0-ddn198-1 amd64 Lustre Linux kernel module (kernel 6.8.0-1029-gcp)
ii lustre-client-utils 2.14.0-ddn198-1 amd64 Userspace utilities for the Lustre filesystem (client)
If the kernel version reported by uname -r doesn't match the kernel version in the Lustre client package names, you must reinstall the Lustre client packages.
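A minimal sketch of the reinstall, assuming your configured package repository provides Lustre client packages built for the running kernel:
# Sketch only: reinstall Lustre client packages that match the running kernel.
sudo apt update
sudo apt install --reinstall lustre-client-utils "lustre-client-modules-$(uname -r)"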
Check dmesg for Lustre errors
Many Lustre warnings and errors are logged to the Linux kernel ring buffer, which the dmesg command prints.
To search for Lustre-specific messages, use grep in conjunction with dmesg:
dmesg | grep -i lustre
Or, to look for more general errors that might be related:
dmesg | grep -i error
Information to include with a support request
If you're unable to resolve the mount failure, gather diagnostic information before creating a support case.
Run sosreport: This utility collects system logs and configuration information and generates a compressed tarball:
sudo sosreport
Attach the sosreport archive and any relevant output from dmesg to your support case.
GKE issues
Before following the troubleshooting steps in this section, refer to the limitations when connecting to Managed Lustre from GKE.
Google Kubernetes Engine nodes are not able to connect to a Managed Lustre instance
Verify that the Managed Lustre instance has gke-support-enabled specified:
gcloud lustre instances describe INSTANCE_ID \
--location=LOCATION | grep gkeSupportEnabled
If the GKE support flag has been enabled and you still cannot connect, continue to the next section.
Log queries
To check logs, run the following query in Logs Explorer.
To return Managed Lustre CSI driver node server logs:
resource.type="k8s_container"
resource.labels.pod_name=~"lustre-csi-node*"
Pod event warnings
If your workload Pods cannot start up, run the following command to check the Pod events:
kubectl describe pod POD_NAME -n NAMESPACE
Then, read the following sections for information about your specific error.
CSI driver enablement issues
The following errors indicate issues with the CSI driver:
MountVolume.MountDevice failed for volume "xxx" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name lustre.csi.storage.gke.io not found in the list of registered CSI drivers
MountVolume.SetUp failed for volume "xxx" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name lustre.csi.storage.gke.io not found in the list of registered CSI drivers
These warnings indicate that the CSI driver is either not installed or not yet running. Double-check that the CSI driver is running in your cluster by following the instructions in Install the CSI driver.
If the cluster was recently scaled, updated, or upgraded, this warning is expected and should be transient, as the CSI driver Pods may take a few minutes to become fully functional after cluster operations.
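One quick way to confirm the driver is present, for example, is to list the CSIDriver object and the driver node Pods; the namespace and Pod name pattern below are assumptions that may differ in your cluster:
# Check that the Lustre CSI driver is registered with the cluster.
kubectl get csidriver lustre.csi.storage.gke.io
# Check the driver node Pods (namespace is an assumption; adjust as needed).
kubectl get pods -n kube-system | grep lustre-csi-node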
MountVolume failures
AlreadyExists
An AlreadyExists error may look like the following:
MountVolume.MountDevice failed for volume "xxx" : rpc error: code = AlreadyExists
desc = A mountpoint with the same lustre filesystem name "xxx" already exists on
node "xxx". Please mount different lustre filesystems
Recreate the Managed Lustre instance with a different file system name, or use another Managed Lustre instance with a unique file system name. Mounting multiple volumes from different Managed Lustre instances with the same file system name on a single node is not supported. This is because identical file system names result in the same major and minor device numbers, which conflicts with the shared mount architecture on a per-node basis.
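To see which Lustre file systems are already mounted on a node before scheduling a workload, you can, for example, list the active Lustre mounts from a shell on that node:
# List active Lustre mounts; the SOURCE column shows IP_ADDRESS@tcp:/FILESYSTEM_NAME.
findmnt -t lustre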
Internal
An Internal error code usually contains additional information to help locate the issue.
Is the MGS specification correct? Is the filesystem name correct?
MountVolume.MountDevice failed for volume "preprov-pv-wrongfs" : rpc error: code = Internal desc = Could not mount "10.90.2.4@tcp:/testlfs1" at "/var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount" on node gke-lustre-default-nw-6988-pool-1-acbefebf-jl1v: mount failed: exit status 2
Mounting command: mount
Mounting arguments: -t lustre 10.90.2.4@tcp:/testlfs1 /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount
Output: mount.lustre: mount 10.90.2.4@tcp:/testlfs1 at /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
This error means the file system name of the Managed Lustre instance you're trying to mount is incorrect or does not exist. Double-check the file system name of the Managed Lustre instance.
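One way to confirm the file system name, assuming the instance resource exposes it in a filesystem field, is:
# Sketch only: print the file system name of the Managed Lustre instance.
gcloud lustre instances describe INSTANCE_NAME \
  --location=LOCATION \
  --format='get(filesystem)'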
Is the MGS running?
MountVolume.MountDevice failed for volume "preprov-pv-wrongfs" : rpc error: code = Internal desc = Could not mount "10.90.2.5@tcp:/testlfs" at "/var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount" on node gke-lustre-default-nw-6988-pool-1-acbefebf-jl1v: mount failed: exit status 5
Mounting command: mount
Mounting arguments: -t lustre 10.90.2.5@tcp:/testlfs /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount
Output: mount.lustre: mount 10.90.2.5@tcp:/testlfs at /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount failed: Input/output error
Is the MGS running?
This error means your Google Kubernetes Engine cluster cannot connect to the Managed Lustre instance using the specified IP address and file system name. Ensure the IP address is correct and that your Google Kubernetes Engine cluster is in the same VPC network as your Managed Lustre instance.
To check that the IP address is correct, run the following command:
sudo lctl ping IP_ADDRESS@tcp0
If the IP address is correct, the result looks like the following:
12345-0@lo
12345-172.26.15.16@tcp
If the IP address is unreachable, the error looks like the following:
failed to ping 172.26.15.16@tcp: Connection timed out
Errors not listed
Warnings not listed in this section that include the RPC error code Internal indicate unexpected issues in the CSI driver. Create a new issue on the GitHub project page, and include your GKE cluster version, detailed workload information, and the Pod event warning message.
VPC network issues
The following sections describe common VPC network issues.
Cannot connect from the 172.17.0.0/16 subnet range
Compute Engine and GKE clients with an IP address in the 172.17.0.0/16 subnet range cannot mount Managed Lustre instances.
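To check whether a Compute Engine client falls in that range, you can print its primary internal IP address and compare it to 172.17.0.0/16, for example:
# Print the VM's primary internal IP address.
gcloud compute instances describe VM_NAME \
  --zone=VM_ZONE \
  --format='get(networkInterfaces[0].networkIP)'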
Permission denied to add peering for service servicenetworking.googleapis.com
ERROR: (gcloud.services.vpc-peerings.connect) User [$(USER)] does not have
permission to access services instance [servicenetworking.googleapis.com]
(or it may not exist): Permission denied to add peering for service
'servicenetworking.googleapis.com'.
This error means that your user account doesn't have the servicenetworking.services.addPeering IAM permission.
See Access control with IAM for instructions on adding one of the following roles to your account: roles/compute.networkAdmin or roles/servicenetworking.networksAdmin.
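For example, a project administrator could grant one of those roles with a command like the following; the project ID and user email are placeholders:
# Sketch only: grant the Service Networking admin role to a user.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=user:USER_EMAIL \
  --role=roles/servicenetworking.networksAdmin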
Cannot modify allocated ranges in CreateConnection
ERROR: (gcloud.services.vpc-peerings.connect) The operation
"operations/[operation_id]" resulted in a failure "Cannot modify allocated
ranges in CreateConnection. Please use UpdateConnection."
This error is returned when you have already created a VPC peering on this network with different IP ranges. There are two possible solutions:
Replace the existing IP ranges:
gcloud services vpc-peerings update \
--network=NETWORK_NAME \
--ranges=IP_RANGE_NAME \
--service=servicenetworking.googleapis.com \
--force
Or, add the new IP range to the existing connection:
Retrieve the list of existing IP ranges for the peering:
EXISTING_RANGES="$(gcloud services vpc-peerings list \
  --network=NETWORK_NAME \
  --service=servicenetworking.googleapis.com \
  --format="value(reservedPeeringRanges.list())" \
  --flatten=reservedPeeringRanges)"
Then, add the new range to the peering:
gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges="${EXISTING_RANGES}",IP_RANGE_NAME \
  --service=servicenetworking.googleapis.com
IP address range exhausted
If instance creation fails with an IP address range exhausted error:
ERROR: (gcloud.lustre.instances.create) FAILED_PRECONDITION: Invalid
resource state for "NETWORK_RANGES_NOT_AVAILABLE": IP address range exhausted
Follow the VPC guide to modify the existing private connection to add IP address ranges.
We recommend a prefix length of at least /20 (1024 addresses).
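For example, a sketch of reserving an additional /20 range and attaching it to the existing private connection might look like the following; the range names are placeholders:
# Sketch only: reserve an additional /20 range for service networking peering.
gcloud compute addresses create NEW_RANGE_NAME \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=20 \
  --network=NETWORK_NAME
# Attach the new range alongside the existing ranges on the peering connection.
gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges=EXISTING_RANGE_NAME,NEW_RANGE_NAME \
  --service=servicenetworking.googleapis.com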