This page includes troubleshooting steps for some common issues and errors.
FAILED instance
The FAILED status means that the instance data has been lost and the instance must be deleted. Parallelstore instances in a FAILED state continue to be billed until they're deleted.
To retrieve an instance's state, follow the instructions at Manage instances: Retrieve an instance.
To delete an instance, see Manage instances: Delete an instance.
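As a sketch, the state check and deletion can be done from the command line. The command group below is an assumption; confirm it with `gcloud parallelstore --help`. `INSTANCE_ID` and `LOCATION` are placeholders.

```shell
# Retrieve only the instance's state:
gcloud parallelstore instances describe INSTANCE_ID \
    --location=LOCATION \
    --format="value(state)"

# If the state is FAILED, delete the instance to stop billing:
gcloud parallelstore instances delete INSTANCE_ID \
    --location=LOCATION
```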
Timeouts during dfuse mount or network tests
If the dfuse -m command times out when mounting your Parallelstore instance, or if network test commands such as self_test or daos health net-test time out, the cause may be a network connectivity issue.
To verify connectivity to the Parallelstore servers, run:
self_test --use-daos-agent-env -r 1
If the test reports a connection issue, two possible reasons are:
- The DAOS agent may have selected the wrong network interface during setup.
- You may need to exclude network interfaces that cannot reach the IPs in the access_points list.
1. Run ifconfig to list the available network interfaces. Example output may show several network interfaces, such as eth0, docker0, ens8, and lo.
2. Stop the daos_agent.
3. Edit /etc/daos/daos_agent.yml to exclude the unwanted network interfaces. Uncomment the exclude_fabric_ifaces line and update the values. The entries you include are specific to your situation. For example: exclude_fabric_ifaces: ["docker0", "ens8", "lo"]
4. Restart the daos_agent.
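The steps above can be sketched as follows. The systemd unit name daos_agent is an assumption; adjust it to however the agent runs on your client.

```shell
# 1-2. List interfaces, then stop the agent:
ifconfig
sudo systemctl stop daos_agent

# 3. Uncomment and update exclude_fabric_ifaces in the agent config, e.g.:
#      exclude_fabric_ifaces: ["docker0", "ens8", "lo"]
sudo vi /etc/daos/daos_agent.yml

# 4. Restart the agent so the new interface exclusions take effect:
sudo systemctl start daos_agent
```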
The instance or client IP address conflicts with internal IP addresses
Parallelstore instances and clients cannot use an IP address from the 172.17.0.0/16 subnet range. See Known issues for more information.
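As a quick sanity check, the following shell sketch flags addresses that fall inside the reserved range. The helper name check_ip is hypothetical.

```shell
# check_ip prints "conflict" if the given IPv4 address is inside the
# reserved 172.17.0.0/16 range, and "ok" otherwise.
check_ip() {
  case "$1" in
    172.17.*) echo "conflict" ;;
    *)        echo "ok" ;;
  esac
}

check_ip 172.17.0.5    # conflict: inside the reserved range
check_ip 10.128.0.2    # ok
```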
ENOSPC when there is unused capacity in the instance
If your instance uses minimum or balanced (the default) file striping, you might run into ENOSPC errors even though the existing files are not using all of the instance's capacity. This is most likely to happen when writing large files, generally greater than 8 GiB, or when importing such files from Cloud Storage.
Use maximum file striping to reduce the likelihood of these errors.
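A sketch of creating an instance with maximum file striping follows. The --file-stripe-level flag and its value are assumptions; confirm them with `gcloud parallelstore instances create --help`. All other values are placeholders.

```shell
# Create an instance that uses maximum file striping to reduce
# ENOSPC errors when writing large files:
gcloud parallelstore instances create INSTANCE_ID \
    --location=LOCATION \
    --capacity-gib=12000 \
    --network=NETWORK_NAME \
    --file-stripe-level=file-stripe-level-max
```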
Google Kubernetes Engine troubleshooting
The following section lists some common issues and steps to resolve them.
Transport endpoint is not connected in workload Pods
This error is due to dfuse termination. In most cases, dfuse was terminated because of an out-of-memory condition. Use the Pod annotations gke-parallelstore/cpu-limit and gke-parallelstore/memory-limit to allocate more resources to the Parallelstore sidecar container. If you don't know how much memory to allocate, you can set gke-parallelstore/memory-limit: "0" to remove the sidecar memory limitation. Note that this only works with Standard clusters; with Autopilot clusters, you cannot use the value 0 to unset the sidecar container resource limits and requests, and you have to explicitly set a larger resource limit for the sidecar container.
Once you've modified the annotations, you must restart your workload Pod. Adding annotations to a running workload doesn't dynamically modify the resource allocation.
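A minimal sketch of a workload Pod carrying these annotations follows; the Pod name, container, and resource values are placeholders, not recommendations.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-workload                        # placeholder
  annotations:
    gke-parallelstore/volumes: "true"      # request sidecar injection
    gke-parallelstore/cpu-limit: "2"
    gke-parallelstore/memory-limit: "4Gi"  # "0" removes the limit (Standard clusters only)
spec:
  containers:
  - name: app                              # placeholder
    image: busybox
```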
Pod event warnings
If your workload Pods cannot start up, check the Pod events:
kubectl describe pod POD_NAME -n NAMESPACE
The following solutions are for common errors.
CSI driver enablement issues
Common CSI driver enablement errors are as follows:
MountVolume.MountDevice failed for volume "volume" : kubernetes.io/csi:
attacher.MountDevice failed to create newCsiDriverClient:
driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers
MountVolume.SetUp failed for volume "volume" : kubernetes.io/csi:
mounter.SetUpAt failed to get CSI client:
driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers
These warnings indicate that the CSI driver is not enabled, or not running.
If your cluster was just scaled, updated, or upgraded, this warning is normal and should be transient. It takes a few minutes for the CSI driver Pods to become functional after cluster operations.
Otherwise, confirm that the CSI driver is enabled on your cluster; see Enable the CSI driver for details. If the CSI driver is enabled, each node shows a Pod named parallelstore-csi-node-id up and running.
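One way to sketch that check from the command line follows; filtering by the Pod name prefix is an assumption, and the driver Pods' namespace may vary by cluster.

```shell
# Expect one parallelstore-csi-node-* Pod per node, in Running state:
kubectl get pods --all-namespaces -o wide | grep parallelstore-csi-node
```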
AttachVolume.Attach failures
After the Pod is scheduled to a node, the volume is attached to the node, and the mounter Pod is created if node mount is used.
This happens on the controller and involves the AttachVolume step from the attachdetach-controller.
| Error code | Pod event warning | Solution |
| --- | --- | --- |
| InvalidArgument | | Invalid mount flags are passed to the PersistentVolume or StorageClass. Check the supported dfuse mount options for more details. |
| NotFound | | The Parallelstore instance does not exist. Verify that the PersistentVolume's volumeHandle has the correct format. |
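A skeleton of a statically provisioned PersistentVolume follows, for checking where volumeHandle lives. The volumeHandle value shown is a placeholder and its format is an assumption; see the static provisioning guide for the exact format the CSI driver requires.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: parallelstore-pv            # placeholder
spec:
  capacity:
    storage: 12000Gi                # placeholder
  accessModes:
    - ReadWriteMany
  csi:
    driver: parallelstore.csi.storage.gke.io
    volumeHandle: "PROJECT/LOCATION/INSTANCE_NAME"  # verify format in the guide
```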
MountVolume.MountDevice failures
After the volume is attached to a node, the volume is staged to the node.
This happens on the node and involves the MountVolume.MountDevice step from kubelet.
| Error code | Pod event warning | Solution |
| --- | --- | --- |
| FailedPrecondition | | This error is usually caused by the mounter Pod being manually deleted. Delete all of the workloads consuming the PVC and redeploy them; this creates a new mounter Pod. |
| DeadlineExceeded | | There's trouble connecting to the Parallelstore instance. Verify that your VPC network and your access points are configured correctly. |
MountVolume.SetUp failures
After the volume is staged to the node, the volume will be mounted and provided to the container on the Pod. This happens on the node and involves the MountVolume.SetUp step in kubelet.
Pod mount
| Error code | Pod event warning | Solution |
| --- | --- | --- |
| ResourceExhausted | | The dfuse process ended, which is usually caused by an out-of-memory (OOM) condition. Consider increasing the sidecar container memory limit by using the gke-parallelstore/memory-limit annotation. If you're unsure about the amount of memory you want to allocate to the parallelstore-sidecar, we recommend setting gke-parallelstore/memory-limit: "0" to remove the sidecar memory limitation (Standard clusters only). |
| Aborted | | The volume mount operation was aborted due to rate limiting or existing operations. This warning is normal and should be transient. |
| InvalidArgument | MountVolume.SetUp failed for volume "volume" : rpc error: code = InvalidArgument desc = | If you supplied invalid arguments in the StorageClass or PersistentVolume, the error log indicates the fields with the invalid arguments. For dynamic provisioning, check the StorageClass. For static provisioning, check the PersistentVolume. |
| FailedPrecondition | MountVolume.SetUp failed for volume "volume" : rpc error: code = FailedPrecondition desc = can not find the sidecar container in Pod spec | The Parallelstore sidecar container was not injected. Check that the gke-parallelstore/volumes: "true" Pod annotation is set correctly. |
Node mount
| Error code | Pod event warning | Solution |
| --- | --- | --- |
| Aborted | | The volume mount operation was aborted due to rate limiting or existing operations. This warning is normal and should be transient. |
| InvalidArgument | MountVolume.SetUp failed for volume "volume" : rpc error: code = InvalidArgument desc = | If you supplied invalid arguments in the StorageClass or PersistentVolume, the error log indicates the fields with the invalid arguments. For dynamic provisioning, check the StorageClass. For static provisioning, check the PersistentVolume. |
| FailedPrecondition | MountVolume.SetUp failed for volume "volume" : rpc error: code = FailedPrecondition desc = mounter pod expected to exist but was not found | The Parallelstore mounter Pod does not exist. If the mounter Pod was accidentally deleted, re-create all workloads to prompt re-creation. |
| DeadlineExceeded | MountVolume.SetUp failed for volume "volume" : rpc error: code = DeadlineExceeded desc = timeout waiting for mounter pod gRPC server to become available | The mounter Pod's gRPC server did not start. Check the mounter Pod's logs for any errors. |
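Inspecting the mounter Pod's logs can be sketched as follows; the name filter is an assumption, and MOUNTER_POD_NAME and NAMESPACE are placeholders to fill in from the first command's output.

```shell
# Locate the mounter Pod, then read its logs for startup or gRPC errors:
kubectl get pods --all-namespaces | grep mounter
kubectl logs MOUNTER_POD_NAME -n NAMESPACE
```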
Troubleshooting VPC networks
Permission denied to add peering for service servicenetworking.googleapis.com
ERROR: (gcloud.services.vpc-peerings.connect) User [$(USER)] does not have
permission to access services instance [servicenetworking.googleapis.com]
(or it may not exist): Permission denied to add peering for service
'servicenetworking.googleapis.com'.
This error means that your user account doesn't have the servicenetworking.services.addPeering IAM permission.
See Access control with IAM for instructions on adding one of the following roles to your account: roles/compute.networkAdmin or roles/servicenetworking.networksAdmin.
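As a sketch, granting one of these roles at the project level can look like the following; PROJECT_ID and USER_EMAIL are placeholders, and your organization may manage IAM bindings differently.

```shell
# Grant the Service Networking admin role to a user account:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/servicenetworking.networksAdmin"
```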
Cannot modify allocated ranges in CreateConnection
ERROR: (gcloud.services.vpc-peerings.connect) The operation
"operations/[operation_id]" resulted in a failure "Cannot modify allocated
ranges in CreateConnection. Please use UpdateConnection.
This error is returned when you have already created a vpc-peering on this network with different IP ranges. There are two possible solutions:
Replace the existing IP ranges:
gcloud services vpc-peerings update \
--network=NETWORK_NAME \
--ranges=IP_RANGE_NAME \
--service=servicenetworking.googleapis.com \
--force
Or, add the new IP range to the existing connection:
Retrieve the list of existing IP ranges for the peering:
EXISTING_RANGES=$(gcloud services vpc-peerings list \
    --network=NETWORK_NAME \
    --service=servicenetworking.googleapis.com \
    --format="value(reservedPeeringRanges.list())")
Then, add the new range to the peering:
gcloud services vpc-peerings update \
    --network=NETWORK_NAME \
    --ranges=$EXISTING_RANGES,IP_RANGE_NAME \
    --service=servicenetworking.googleapis.com
IP address range exhausted
Problem: Instance creation fails with range exhausted error:
ERROR: (gcloud.alpha.Parallelstore.instances.create) FAILED_PRECONDITION: Invalid
resource state for "NETWORK_RANGES_NOT_AVAILABLE": IP address range exhausted
Solution: Follow the VPC guide to either recreate the IP range or extend the existing IP range.
If you're recreating a Parallelstore instance, you must recreate the IP range instead of extending it.
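Recreating the range can be sketched as follows. The range name, prefix length, and network are placeholders; the --purpose=VPC_PEERING flag is required for ranges used by private services access, and a smaller prefix length (for example 20 instead of 24) reserves more addresses.

```shell
# Reserve a new, larger IP range for service peering:
gcloud compute addresses create NEW_IP_RANGE_NAME \
    --global \
    --purpose=VPC_PEERING \
    --prefix-length=20 \
    --network=NETWORK_NAME

# Point the peering at the new range:
gcloud services vpc-peerings update \
    --network=NETWORK_NAME \
    --ranges=NEW_IP_RANGE_NAME \
    --service=servicenetworking.googleapis.com \
    --force
```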