This guide shows you how to resolve common issues that arise when transferring data to GKE clusters by using GKE Volume Populator. It guides you through debugging problems related to PersistentVolumeClaim (PVC) and PersistentVolume (PV) creation, disk performance, and data transfer Job execution.
Inspect temporary Kubernetes resources
Here's how GKE Volume Populator uses temporary resources:
- A temporary PVC is created in the gke-managed-volumepopulator namespace.
- For each zone involved in the transfer, a transfer Job, PVs, and PVCs are created in your PVC's namespace.
- After the data transfer is done, GKE Volume Populator automatically removes all these temporary resources.
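While a transfer is in progress, you can get a quick, read-only overview of these temporary resources by listing them directly. In the following sketch, replace NAMESPACE with the namespace of your PVC:

# List the temporary PVC created by GKE Volume Populator.
kubectl get pvc -n gke-managed-volumepopulator

# List the per-zone transfer Jobs and PVCs in your own namespace.
kubectl get pvc,jobs -n NAMESPACE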
To inspect the temporary resources, follow these steps:
Store the environment variables:
export PVC_NAME=PVC_NAME
export NAMESPACE=NAMESPACE
Replace the following values:
- PVC_NAME: the name of your PersistentVolumeClaim resource.
- NAMESPACE: the namespace where your workloads run.
Check the status:
export PVC_UID=$(kubectl get pvc ${PVC_NAME} -n ${NAMESPACE} -o jsonpath='{.metadata.uid}')
export TEMP_PVC=prime-${PVC_UID}
echo ${TEMP_PVC}
Inspect the temporary PVC in the gke-managed-volumepopulator namespace:
kubectl describe pvc ${TEMP_PVC} -n gke-managed-volumepopulator
Get the names of the temporary PVCs in your namespace:
export TEMP_PVC_LIST=($(kubectl get pvc -n "$NAMESPACE" -o json | grep -Eo "\"name\":\s*\"$TEMP_PVC[^\"]*\"" | awk -F'"' '{print $4}'))
for pvc in "${TEMP_PVC_LIST[@]}"; do
  echo "$pvc"
done
Inspect the temporary PVCs:
kubectl describe pvc "${TEMP_PVC_LIST[0]}" -n $NAMESPACE
GKE Volume Populator creates a transfer Job in each zone (one for a single-zone Hyperdisk ML volume, and multiple for multi-zone Hyperdisk ML volumes). Get the transfer Job name by using the following command:
export TRANSFER_JOB=$(kubectl get pvc "${TEMP_PVC_LIST[0]}" -n "$NAMESPACE" -o "jsonpath={.metadata.annotations['volume-populator\.datalayer\.gke\.io/pd-transfer-requestid']}")
echo $TRANSFER_JOB
Inspect the transfer Job:
kubectl describe job $TRANSFER_JOB -n $NAMESPACE
Get the Pod name from the transfer Job:
export TRANSFER_POD=$(kubectl get pods -n "$NAMESPACE" -l "job-name=$TRANSFER_JOB" -o jsonpath='{.items[0].metadata.name}')
echo $TRANSFER_POD
Inspect the Pod:
kubectl describe pod $TRANSFER_POD -n $NAMESPACE
If you create a PVC across multiple zones, GKE Volume Populator creates distinct temporary PVCs and transfer Job resources for each specified zone. To inspect the resources for every zone involved in the transfer, replace the index 0 for TEMP_PVC_LIST with other index numbers, or loop over the whole list as shown below.
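For example, the following sketch describes every zonal temporary PVC in one pass; it reuses the TEMP_PVC_LIST and NAMESPACE variables that you set earlier:

# Describe each zonal temporary PVC created for the transfer.
for pvc in "${TEMP_PVC_LIST[@]}"; do
  kubectl describe pvc "$pvc" -n "$NAMESPACE"
done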
Check if Workload Identity Federation is enabled
Workload Identity Federation allows transfer pods to securely access Google Cloud services. If the transfer Pods are unable to authenticate to Google Cloud, verify that Workload Identity Federation for GKE is enabled on your cluster.
To check if workloadIdentityConfig is enabled on your cluster, run the following command:
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --project=PROJECT_ID \
    --format="value(workloadIdentityConfig)"
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- LOCATION: the compute region or zone of your cluster.
- PROJECT_ID: your Google Cloud project ID.
Look for the following value in the output:
PROJECT_ID.svc.id.goog
If workloadIdentityConfig is missing from the output, enable Workload Identity Federation for GKE.
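For example, on a Standard cluster you can enable it with commands similar to the following sketch; existing node pools must also be updated to use the GKE metadata server. The node pool name is a placeholder:

# Enable Workload Identity Federation for GKE on the cluster.
gcloud container clusters update CLUSTER_NAME \
    --location=LOCATION \
    --workload-pool=PROJECT_ID.svc.id.goog

# Update an existing node pool so that its Pods can authenticate through
# Workload Identity Federation.
gcloud container node-pools update NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --workload-metadata=GKE_METADATA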
Invalid transfer path
If you encounter an error similar to the following, the transfer path specified in the GCPDatasource resource is incorrect, and the transfer will fail.
ERROR: (gcloud.storage.cp) The following URLs matched no objects or files:
gs://datasets-pd/llama2-7b-hfa/
To resolve this issue, delete the GCPDatasource resource, update the uri field with the correct value, and re-create the resource.
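Before you re-create the GCPDatasource resource, you can confirm that the corrected path matches objects in Cloud Storage. This is a minimal check; the bucket name and path are placeholders:

# The command returns an error or no results if the path matches no objects.
gcloud storage ls gs://BUCKET_NAME/PATH/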
Insufficient permission to access the bucket
If the Kubernetes service account doesn't have access to the bucket URI that's specified in the GCPDatasource resource, the transfer Job will fail. The error might look similar to the following:
ERROR: (gcloud.storage.cp) [test-gke-dev.svc.id.goog] does not have permission to access b instance [small-bucket-7] (or it may not exist): Caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist). This command is authenticated as test-gke-dev.svc.id.goog which is the active account specified by the [core/account] property.
To resolve the issue, grant the necessary permissions to transfer data from the bucket to the disk.
gcloud storage buckets add-iam-policy-binding gs://GCS_BUCKET \
    --member "principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/NAMESPACE/sa/KSA_NAME" \
    --role "ROLE"
Replace the following:
- GCS_BUCKET: your Cloud Storage bucket name.
- PROJECT_NUMBER: your Google Cloud project number.
- PROJECT_ID: your Google Cloud project ID.
- NAMESPACE: the namespace where your workloads run.
- KSA_NAME: the name of your Kubernetes service account.
- ROLE: the IAM role that provides the necessary permissions to access the bucket. For example, use roles/storage.objectViewer to grant read-only access to the bucket.
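To confirm that the binding was applied, you can inspect the bucket's IAM policy; the principal that you granted should appear in the output:

# Review the members and roles bound to the bucket.
gcloud storage buckets get-iam-policy gs://GCS_BUCKET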
Error: error generating accessibility requirements
You might see the following transient error when you check the PVC in the gke-managed-volumepopulator namespace:
Error: error generating accessibility requirements: no available topology found.
If you use a GKE Autopilot cluster or a Standard cluster with node auto-provisioning enabled, this error can occur when no nodes in your cluster are Ready. The error should resolve itself within a few minutes after node auto-provisioning scales up a new node.
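To check whether any nodes in your cluster are Ready yet, you can run:

kubectl get nodes

After at least one node reports the Ready status, check the temporary PVC in the gke-managed-volumepopulator namespace again.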
Transfer Pod is Pending scheduling for a long time
Your PVC event might show that the status of the transfer Pod is Pending for a long time.
To check the transfer Job events and verify whether scheduling failed for the Job, follow these steps:
Describe the PVC:
kubectl describe pvc $PVC_NAME -n $NAMESPACE
The output is similar to the following:
Events:
  Type    Reason              Age              From                          Message
  ----    ------              ----             ----                          -------
  Normal  TransferInProgress  1s (x2 over 2s)  gkevolumepopulator-populator  populateCompleteFn: For PVC pd-pvc79 in namespace default, job with request ID populator-job-0b93fec4-5490-4e02-af32-15b16435d559 is still active with pod status as - Phase: Pending
To inspect the transfer Pod, follow the steps in Inspect temporary Kubernetes resources.
The output is similar to the following:
Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   2m50s              default-scheduler   0/3 nodes are available: 1 Insufficient cpu, 2 node(s) had volume node affinity conflict. preemption: 0/3 nodes are available: 1 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling.
  Warning  FailedScheduling   37s (x2 over 39s)  default-scheduler   0/3 nodes are available: 1 Insufficient cpu, 2 node(s) had volume node affinity conflict. preemption: 0/3 nodes are available: 1 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling.
  Normal   NotTriggerScaleUp  2m40s              cluster-autoscaler  pod didn't trigger scale-up:
If you see the NotTriggerScaleUp message, check if your cluster has node auto-provisioning enabled:
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --format="value(autoscaling.enableNodeAutoprovisioning)"
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- LOCATION: the compute region or zone of your cluster.
If the output is False, enable node auto-provisioning by using the following command:
gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --location=LOCATION \
    --project=PROJECT_ID \
    --min-cpu MINIMUM_CPU \
    --min-memory MINIMUM_MEMORY \
    --max-cpu MAXIMUM_CPU \
    --max-memory MAXIMUM_MEMORY \
    --autoprovisioning-scopes=https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring,https://www.googleapis.com/auth/devstorage.read_only
Replace the following:
- CLUSTER_NAME: the name of the cluster that you're updating to enable node auto-provisioning.
- LOCATION: the compute zone or region for your cluster. For example, us-central1-a or us-central1.
- PROJECT_ID: your Google Cloud project ID.
- MINIMUM_CPU: the minimum number of vCPUs to auto-provision. For example, 10.
- MINIMUM_MEMORY: the minimum amount of memory in GiB to auto-provision. For example, 200.
- MAXIMUM_CPU: the maximum number of vCPUs to auto-provision. For example, 100. This limit is the total of the CPU resources across all existing, manually created node pools and all the node pools that GKE might automatically create.
- MAXIMUM_MEMORY: the maximum amount of memory in GiB to auto-provision. For example, 1000. This limit is the total of the memory resources across all existing, manually created node pools and all the node pools that GKE might automatically create.
If node auto-provisioning is enabled, verify that node auto-provisioning has sufficient autoscaling resourceLimits to scale up the transfer Job. The transfer Job uses 24 vCPUs by default.
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --format="value(autoscaling.resourceLimits)"
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- LOCATION: the compute region or zone of your cluster.
The output is similar to the following:
{'maximum': '1000000000', 'resourceType': 'cpu'};{'maximum': '1000000000', 'resourceType': 'memory'};
If node auto-provisioning does not have sufficient autoscaling limits, update the cluster with the correct configuration.
gcloud container clusters update CLUSTER_NAME \
    --location=LOCATION \
    --project=PROJECT_ID \
    --max-cpu MAXIMUM_CPU \
    --max-memory MAXIMUM_MEMORY
Replace the following:
- CLUSTER_NAME: the name of the cluster that you're updating to enable node auto-provisioning.
- LOCATION: the compute zone or region for your cluster. For example, us-central1-a or us-central1.
- PROJECT_ID: your Google Cloud project ID.
- MAXIMUM_CPU: the maximum number of vCPUs to auto-provision. For example, 100. This limit is the total of the CPU resources across all existing, manually created node pools and all the node pools that GKE might automatically create.
- MAXIMUM_MEMORY: the maximum amount of memory in GiB to auto-provision. For example, 1000. This limit is the total of the memory resources across all existing, manually created node pools and all the node pools that GKE might automatically create.
For Standard clusters without node auto-provisioning enabled, verify that the node you created for the transfer Job has the required compute class label:
kubectl get node -l cloud.google.com/compute-class=gcs-to-hdml-compute-class
If the output doesn't list the node that you created for the transfer Job, add the gcs-to-hdml-compute-class compute class label to the node:
kubectl label node NODE_NAME cloud.google.com/compute-class=gcs-to-hdml-compute-class
Replace NODE_NAME with the name of the node where you want to add the compute class label.
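Alternatively, you can create a node pool that already carries the compute class label so that nodes for the transfer Job are labeled automatically. The following is a minimal sketch; the node pool name, machine type, and node count are assumptions that you should size for the transfer Job, which uses 24 vCPUs by default:

# Create a labeled node pool for the data transfer.
gcloud container node-pools create gcs-to-hdml-pool \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --machine-type=c3-standard-44 \
    --num-nodes=1 \
    --node-labels=cloud.google.com/compute-class=gcs-to-hdml-compute-class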
GCE quota exceeded error
You might encounter an error message similar to the following when you check the Pod for the transfer Job:
Node scale up in zones us-west1-b associated with this pod failed: GCE quota exceeded. Pod is at risk of not being scheduled.
To inspect the transfer Pod, follow the steps in Inspect temporary Kubernetes resources.
To resolve the error, increase the quota or delete existing resources that might be preventing scale-up. For more information, see Troubleshoot quota errors.
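To review your current Compute Engine quota usage in the affected region, you can describe the region and inspect its quotas. The region shown here is taken from the example error message; replace it with your own:

# The output includes a 'quotas' section with usage and limit values per metric.
gcloud compute regions describe us-west1 --project=PROJECT_ID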
Hyperdisk ML HDML_TOTAL_THROUGHPUT exceeded error
If the temporary PVC in the gke-managed-volumepopulator namespace fails to provision the Hyperdisk ML volume, it's possible that the regional quota for creating the new Hyperdisk ML volume for your data transfer is exceeded.
To confirm that the provisioning of the Hyperdisk ML volume failed because of a regional quota issue, inspect the event logs associated with the temporary PVC that was created by the GKE Volume Populator. Follow these steps:
Store the relevant environment variables:
export PVC_NAME=PVC_NAME
export NAMESPACE=NAMESPACE
Replace the following values:
- PVC_NAME: the name of your PersistentVolumeClaim resource.
- NAMESPACE: the namespace where your workloads run.
Check the status of the temporary PVC:
export PVC_UID=$(kubectl get pvc ${PVC_NAME} -n ${NAMESPACE} -o jsonpath='{.metadata.uid}')
export TEMP_PVC=prime-$PVC_UID
echo $TEMP_PVC
kubectl describe pvc $TEMP_PVC -n gke-managed-volumepopulator
Check the PVC events to find the QUOTA_EXCEEDED error, which is similar to the following:
Events:
  Type     Reason              Age   From                                                                                             Message
  ----     ------              ----  ----                                                                                             -------
  Warning  ProvisioningFailed  105s  pd.csi.storage.gke.io_gke-3ef909a7688d424b94a2-d0d9-b185-vm_6a77d057-54e3-415a-8b39-82b666516b6b  failed to provision volume with StorageClass "pd-sc": rpc error: code = Unavailable desc = CreateVolume failed: rpc error: code = Unavailable desc = CreateVolume failed to create single zonal disk pvc-73c69fa8-d23f-4dcb-a244-bcd120a3c221: failed to insert zonal disk: unknown error when polling the operation: rpc error: code = ResourceExhausted desc = operation operation-1739194889804-62dc9dd9a1cae-9d24a5ad-938e5299 failed (QUOTA_EXCEEDED): Quota 'HDML_TOTAL_THROUGHPUT' exceeded. Limit: 30720.0 in region us-central1
To resolve this issue:
- Request additional quota to create new Hyperdisk ML volumes in your project.
- Delete any unused Hyperdisk ML disks in your project.
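To help identify disks that you can delete, you can list the Hyperdisk ML disks in your project; disks with an empty USERS column are not attached to any instance. The type filter and output fields in this sketch are assumptions that you can adjust:

# List Hyperdisk ML disks and show whether they are attached to instances.
gcloud compute disks list \
    --project=PROJECT_ID \
    --filter="type~hyperdisk-ml" \
    --format="table(name,zone.basename(),sizeGb,users.basename())"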
No space left on device
If you see the No space left on device error message on your PVC, it means that the Hyperdisk ML volume is full and no more data can be written to it. The error might look similar to the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning TransferContainerFailed 57m gkevolumepopulator-populator populateCompleteFn: For PVC vp-pvc in namespace default, job with request ID populator-job-c2a2a377-6168-4ff1-afc8-c4ca713c43e2 for zone us-central1-c has a failed pod container with message: on device
ERROR: Failed to download one or more components of sliced download.
ERROR: [Errno 28] No space left on device
To resolve this issue, delete your PVC, increase the value of the spec.resources.requests.storage field in your PVC manifest, and re-create the PVC to start the transfer process again.
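For example, the following sketch re-creates the PVC with a larger capacity. The access mode, storage class name, and size shown here are assumptions; keep the values from your original manifest, including the data source reference, and change only the storage request:

kubectl delete pvc PVC_NAME -n NAMESPACE

kubectl apply -n NAMESPACE -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: PVC_NAME
spec:
  accessModes:
    - ReadOnlyMany                        # assumption: match your original manifest
  storageClassName: hyperdisk-ml-class    # assumption: your existing Hyperdisk ML StorageClass
  resources:
    requests:
      storage: 1Ti                        # increase this so the volume can hold all transferred data
  # Keep the dataSourceRef block from your original manifest so that
  # GKE Volume Populator starts the transfer again.
EOF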
What's next
- If you can't find a solution to your problem in the documentation, see Get support.