This document describes the errors and corresponding error codes that you might encounter when using Backup for GKE to perform restore operations. Each section explains the cause of an error, lists things to consider, and provides instructions for resolving it.
Error 200010301: Failure to complete restore operation due to unavailable admission webhook service
Error 200010301 occurs when an attempt to complete a restore operation fails
because an admission webhook service, also referred to as an HTTP callback, is
unavailable, which results in the following error message. The error message
indicates that the GKE API server attempted to contact an
admission webhook while trying to restore a resource, but the service backing
the webhook was either unavailable or not found:
resource [/example-group/ClusterSecretStore/example-store] restore failed:
Internal error occurred: failed calling webhook "example-webhook.io":
failed to call webhook: Post "https://example-webhook.example-namespace.svc:443/validate-example": service "example-webhook" not found.
This error occurs when a ValidatingAdmissionWebhook or
MutatingAdmissionWebhook GKE resource is active in the target
cluster, but the GKE API server can't reach the endpoint
configured in the webhook. Admission webhooks intercept requests to the GKE API server, and their configuration tells the GKE API server which requests to send to the webhook and how to reach its backend.
The webhook's clientConfig specifies the backend that handles the admission
requests, which can be an internal cluster service or an external URL. The
choice between these two options depends on the specific operational and
architectural requirements of your webhook. Depending on which option the webhook uses, the restore operation might have failed for the following reasons:
In-cluster services: the GKE service and its backing pods aren't restored or ready when the GKE API server attempts to call the webhook. This can happen during restore operations where cluster-scoped webhook configurations are applied before the namespaced services are fully in a ready state.
External URLs: the external endpoint is temporarily unavailable due to network connectivity issues between the GKE cluster and the external endpoint, DNS resolution issues, or firewall rules.
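For reference, the two clientConfig forms resemble the following sketch of a ValidatingWebhookConfiguration. The names example-webhook, example-namespace, and the URL are placeholders rather than values from your cluster:

# Illustrative ValidatingWebhookConfiguration showing the two clientConfig options.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-webhook
webhooks:
- name: example-webhook.example-webhook.io
  admissionReviewVersions: ["v1"]
  sideEffects: None
  rules:
  - apiGroups: ["example-group"]
    apiVersions: ["*"]
    operations: ["CREATE", "UPDATE"]
    resources: ["clustersecretstores"]
  # Option 1: service-based clientConfig. The GKE API server calls a Service
  # inside the cluster, so the Service and its backing Pods must be restored
  # and ready before this configuration is applied.
  clientConfig:
    service:
      name: example-webhook
      namespace: example-namespace
      path: /validate-example
      port: 443
  # Option 2: URL-based clientConfig. The GKE API server calls an external
  # HTTPS endpoint, so DNS resolution, firewall rules, and network
  # connectivity to that endpoint must work from the control plane.
  # clientConfig:
  #   url: https://example-webhook.example.com/validate-example

A webhook uses only one of the two options; the URL-based form is shown commented out.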
To resolve this error, use the following instructions:
Identify the failing webhook mentioned in the error message. For example, failed calling webhook "...".
Inspect the webhook by running the kubectl get validatingwebhookconfigurations command:
kubectl get validatingwebhookconfigurations WEBHOOK_NAME -o yaml
Replace WEBHOOK_NAME with the name of the webhook that was identified in the error message.
You can also run the kubectl get mutatingwebhookconfigurations command to inspect the webhook:
kubectl get mutatingwebhookconfigurations WEBHOOK_NAME -o yaml
Replace WEBHOOK_NAME with the name of the webhook that was identified in the error message.
Perform the following troubleshooting steps based on your configuration type:
Service-based clientConfig
Define a custom restore order by modifying the RestorePlan resource to include a RestoreOrder with GroupKindDependency entries. This allows the components backing the webhook, such as Deployment, StatefulSet, or Service, to be restored and ready before the ValidatingWebhookConfiguration or MutatingWebhookConfiguration. For instructions on how to define a custom restore order, see Specify resource restore ordering during restoration. A sketch of such a restore order appears after these steps.
This approach can fail if the service's pods don't enter a fully ready state even after the Service object is created, or if the webhook configuration is created unexpectedly by another application. Alternatively, you can perform a two-stage restore operation using the following steps:
Create a Restore resource from the backup and configure the restore operation with a fine-grained restore filter that includes only the specific resources that are required for the webhook to function, for example, Namespaces, Deployments, StatefulSets, or Services. For more information on how to configure the restore with a fine-grained restore filter, see Enable fine-grained restore.
Create another Restore resource from the same backup and configure it to restore the rest of the resources that you choose.
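For the first approach, the restore order in the RestorePlan resource might look similar to the following sketch. The field names follow the Backup for GKE RestoreOrder and GroupKindDependency API; treat the values as an illustration and confirm them against the restore ordering documentation for your version:

# Illustrative restoreConfig excerpt for a RestorePlan. Each dependency states
# that the "satisfying" group and kind must be restored before the "requiring"
# group and kind, so the webhook's backing Service and Deployment come up
# before the webhook configurations are applied.
restoreConfig:
  restoreOrder:
    groupKindDependencies:
    - satisfying:
        resourceGroup: ""                 # core API group, for Service
        resourceKind: Service
      requiring:
        resourceGroup: admissionregistration.k8s.io
        resourceKind: ValidatingWebhookConfiguration
    - satisfying:
        resourceGroup: apps
        resourceKind: Deployment
      requiring:
        resourceGroup: admissionregistration.k8s.io
        resourceKind: ValidatingWebhookConfiguration
    - satisfying:
        resourceGroup: apps
        resourceKind: Deployment
      requiring:
        resourceGroup: admissionregistration.k8s.io
        resourceKind: MutatingWebhookConfiguration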
URL-based clientConfig
Verify the external HTTPS endpoint and make sure it's active, reachable, and functioning correctly.
Confirm that there is network connectivity from your GKE cluster's nodes and control plane to the external URL. You might also need to check firewall rules (for example, in the Virtual Private Cloud network, on-premises environment, or cloud provider hosting the webhook), network policies, and DNS resolution. One way to test connectivity from inside the cluster is shown in the example after these steps.
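One way to test connectivity from inside the cluster is to run a temporary Pod and call the webhook endpoint directly. The image and URL below are placeholders, and because the admission call originates from the control plane, a successful test from a Pod verifies node-level networking but not necessarily control-plane egress:

# Call the external webhook endpoint from a temporary Pod. Replace the URL
# with the value from the webhook's clientConfig.url field.
kubectl run webhook-connectivity-test \
  --image=curlimages/curl:latest \
  --restart=Never --rm -it -- \
  curl -sv --max-time 10 https://example-webhook.example.com/validate-example

# Check DNS resolution separately if the call times out.
kubectl run webhook-dns-test \
  --image=busybox:1.36 \
  --restart=Never --rm -it -- \
  nslookup example-webhook.example.com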
Retry the restore operation. If the operation continues to fail, contact Cloud Customer Care for further assistance.
Error 200010302: Failure to complete restore operation due to denied resource creation request
Error 200010302 occurs when an attempt to complete a restore operation fails because an admission webhook denies a resource creation request, which results in the following error message. The error message indicates that a resource from your backup couldn't be created in the target cluster because an active admission webhook intercepted the request and rejected it based on a custom policy:
[KubeError]; e.g. resource
[/example-namespace/example-api/ExampleResource/example-name]
restore failed: admission webhook "example-webhook.example.com" denied the request: {reason for denial}
This error is caused by the configuration set in the target GKE
cluster, which has either a ValidatingAdmissionWebhook or
MutatingAdmissionWebhook that enforces specific rules on resource creation
and modification, blocking the resource creation request.
For example, a webhook might prevent the creation of a resource because a related but conflicting resource already exists in the cluster, such as denying the creation of a Deployment that is already managed by a HorizontalPodAutoscaler GKE API resource.
To resolve this error, use the following instructions:
Identify the webhook that is denying the request using the error message that occurs when the restore operation fails. For example, webhook WEBHOOK_NAME denied the request. The error message contains the following information:
Webhook name: the name of the webhook denying the request.
Reason for denial: the specific reason for denying the request.
Inspect the webhook by running the kubectl get validatingwebhookconfigurations command:
kubectl get validatingwebhookconfigurations WEBHOOK_NAME -o yaml
Replace WEBHOOK_NAME with the name of the webhook you identified in the error message.
You can also run the kubectl get mutatingwebhookconfigurations command to inspect the webhook:
kubectl get mutatingwebhookconfigurations WEBHOOK_NAME -o yaml
Replace WEBHOOK_NAME with the name of the webhook you identified in the error message.
Resolve the underlying issue in the target cluster. The correct action depends on the specific error. For example, if there is a HorizontalPodAutoscaler conflict, you need to delete the existing HorizontalPodAutoscaler in the target cluster before running the restore so that the backed-up workloads and their associated resources can be created (see the example after these steps).
Retry the restore operation. If the restore operation continues to fail, contact Cloud Customer Care for further assistance.
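For the HorizontalPodAutoscaler example, you can confirm and remove the conflicting object before retrying the restore. The namespace and resource names are placeholders:

# List HorizontalPodAutoscalers in the target namespace to find the object
# that conflicts with the backed-up workload.
kubectl get hpa -n NAMESPACE_NAME

# Delete the conflicting HorizontalPodAutoscaler so that the restore
# operation can recreate the backed-up workload and its associated resources.
kubectl delete hpa HPA_NAME -n NAMESPACE_NAME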
Error 200060202: Failure to complete restore operation due to missing GKE resource during workload validation
Error 200060202 occurs during the workload validation phase of a restore
operation when a GKE resource that Backup for GKE expects to
validate cannot be found in the target cluster, resulting in the following
error message:
Workload Validation Error: [KIND] "[NAME]" not found
For example: Workload Validation Error: pods "jenkins-0" not found
This error occurs when Backup for GKE successfully creates or updates a GKE resource during the restore operation, but the resource is no longer present in the target cluster when the validation stage begins. This happens when the resource is deleted after the restore process initially creates or updates it, but before workload validation for that resource can complete. An error like this can occur for the following reasons:
Manual deletion: a user or administrator manually deleted the resource using kubectl or other Google Cloud tools.
External automation: GitOps controllers such as Config Sync, ArgoCD, Flux, custom scripts, or other cluster management tools reverted or deleted the resource to match a desired state in a repository.
GKE controllers: a GKE controller deleted the resource because it conflicts with other resources or policies, because an OwnerReference chain led to garbage collection, or because GKE's automated cleanup process deleted dependent resources when their owner resource was deleted.
To resolve this error, use the following instructions:
Identify the missing resource using the error message that appears when the restore operation fails to complete.
Locate the namespace the resource belongs to using one of the following methods:
GKE audit logs: examine the GKE audit logs that were generated when you attempted the restore operation. You can filter logs for delete operations on the resource Kind and Name. The audit log entry contains the original namespace.
Backup details: review the scope of your restore operation and the contents of the backup. The backup index shows the original namespace of the resource. You can also verify whether the RestorePlan contains a TransformationRule that specifies rules to restore the resource in a namespace you choose.
Search across namespaces: run the kubectl get command to search for the resource across all namespaces:
kubectl get KIND --all-namespaces | grep NAME
Replace KIND and NAME with the values from the error message. If the resource still exists, this command shows its namespace.
Verify deletion by running the kubectl get command:
kubectl get KIND NAME -n NAMESPACE_NAME
Replace KIND and NAME with the values from the error message, and NAMESPACE_NAME with the namespace that you located. You should receive a not found error message.
Investigate the cause of deletion using one of the following methods:
GKE audit logs: identify which entity issued the deletion request, for example, the user, service account, or controller (see the example query after this list).
Review configured automations: if you use GitOps or other automation tools, check their logs and status to see if they interfered with the restored resources.
Examine related events: check GKE events in the namespace you located by running the kubectl get events command:
kubectl get events -n NAMESPACE_NAME --sort-by='.lastTimestamp'
Replace NAMESPACE_NAME with the name of the namespace.
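As an illustration of the audit log approach, a query similar to the following sketch returns the delete operations and the principal that issued them. The resource name fragment pods/jenkins-0 matches the earlier example error; substitute the Kind and Name from your own error message and adjust the cluster labels:

# Search Admin Activity audit logs for delete calls on the missing resource.
gcloud logging read \
  'resource.type="k8s_cluster" AND resource.labels.cluster_name="CLUSTER_NAME" AND protoPayload.methodName:"delete" AND protoPayload.resourceName:"pods/jenkins-0"' \
  --project=PROJECT_ID \
  --limit=20 \
  --format="value(timestamp, protoPayload.authenticationInfo.principalEmail, protoPayload.methodName, protoPayload.resourceName)"

The principalEmail field in the output shows which user, service account, or controller deleted the resource.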
Address the cause of the resource deletion based on the results of the previous step. For example, pause conflicting automations, correct misconfigurations, or adjust user permissions.
Recover the missing resource using one of the following methods:
Re-apply manifest files: if you have the manifest for the missing resource, you can re-apply it to the correct namespace.
Perform a fine-grained restore: perform a fine-grained restore operation to selectively restore just the missing resource from the same backup, which ensures you specify the correct namespace. For more information about how to perform a fine-grained restore operation, see Enable fine-grained restore.
Retry the restore operation. If the restore operation continues to fail, contact Cloud Customer Care for further assistance.
Error 200060201: Failure to complete restore operation due to workload validation timeout
Error 200060201 occurs when one or more restored workloads fail to become
fully ready during a restore operation within the expected time limit after
the resources have been created in the cluster, resulting in the following error
message:
Workload Validation Error: Timedout waiting for workloads to be ready - [namespace/workload_name, ...]
This error occurs because Backup for GKE performs a validation step after
restoring GKE resource configurations to ensure that critical
workloads are functioning correctly. Backup for GKE waits for certain
workloads to reach a ready state, but at least one workload didn't meet the
following readiness criteria within the allocated timeout period:
For Pods: status.Phase is Running
For Deployments: status.ReadyReplicas equals spec.Replicas
For StatefulSets: status.ReadyReplicas equals spec.Replicas
For DaemonSets: status.NumberReady equals status.DesiredNumberScheduled
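One way to spot-check these criteria is to read the relevant fields directly. The workload and namespace names below are placeholders:

# Compare desired and ready replica counts for a Deployment and a StatefulSet.
kubectl get deployment example-app -n example-namespace \
  -o jsonpath='desired={.spec.replicas} ready={.status.readyReplicas}{"\n"}'
kubectl get statefulset example-db -n example-namespace \
  -o jsonpath='desired={.spec.replicas} ready={.status.readyReplicas}{"\n"}'

# Compare scheduled and ready Pod counts for a DaemonSet.
kubectl get daemonset example-agent -n example-namespace \
  -o jsonpath='desired={.status.desiredNumberScheduled} ready={.status.numberReady}{"\n"}'

# List Pods that are not in the Running phase.
kubectl get pods -n example-namespace --field-selector=status.phase!=Running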
To resolve this error, use the following instructions:
Identify the workloads that aren't in a ready state from the error message, which lists the workloads and their namespaces that failed to enter a ready state.
Inspect workload status and get details and events for the failed workloads by running the kubectl describe command:
kubectl describe WORKLOAD_TYPE WORKLOAD_NAME -n NAMESPACE_NAME
kubectl get pods -n NAMESPACE_NAME -l SELECTOR_FOR_WORKLOAD
Replace the following:
WORKLOAD_TYPE: the type of workload, for example, Deployment, StatefulSet, or DaemonSet.
WORKLOAD_NAME: the name of the specific workload instance.
NAMESPACE_NAME: the namespace where the workload is located.
SELECTOR_FOR_WORKLOAD: the label selector to find Pods associated with the workload. For example, app=my-app.
For Pods within Deployments or StatefulSets workload types, check the status of individual Pods by running the kubectl describe pod command:
kubectl describe pod POD_NAME -n NAMESPACE_NAME
Replace the following:
POD_NAME: the name of the specific Pod.
NAMESPACE_NAME: the namespace where the Pod is located.
In the Events section, analyze events and logs in the describe output and locate the following information:
ImagePullBackOff / ErrImagePull: indicates that there are issues fetching container images.
CrashLoopBackOff: indicates that containers are starting and crashing.
In the Containers section of the describe output, find the container name, and then analyze the container logs by running the kubectl logs command:
kubectl logs POD_NAME -n NAMESPACE_NAME -c CONTAINER_NAME
Replace the following:
POD_NAME: the name of the specific Pod.
NAMESPACE_NAME: the namespace where the Pod is located.
CONTAINER_NAME: the name of the container within the Pod.
Based on the describe output, there are several reasons the Pod might not be in a ready state, including the following:
Readiness probe failures: the container's readiness probes aren't succeeding.
Resource issues: there is insufficient CPU, memory, or other resources in the cluster, or quota limits are being reached.
Init container issues: failures in init containers are blocking the main containers from starting.
Config errors: there are errors in ConfigMaps, Secrets, or environment variables.
Network issues: Pods are unable to communicate with required services.
Check the GKE cluster resources to ensure the cluster has sufficient node capacity, CPU, and memory to run the restored workloads. In Autopilot clusters, node auto-provisioning might take additional time, so we recommend checking for any node scaling limitations or errors.
Address the underlying issues based on your findings, and resolve the issues preventing the workloads from entering a ready state. This can involve correcting manifests, adjusting resource requests or limits, fixing network policies, or ensuring dependencies are met.
After the underlying issues are resolved, wait for the workloads to enter a ready state, for example by using the commands shown after these steps. You don't need to run the restore operation again.
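For example, you can watch for the restored workloads to become ready with kubectl wait. The names, label selector, and timeout below are placeholders:

# Wait for a restored Deployment to report that it is available.
kubectl wait deployment/example-app -n example-namespace \
  --for=condition=Available --timeout=10m

# Wait for the Pods of a restored StatefulSet, selected by label, to become Ready.
kubectl wait pod -l app=example-db -n example-namespace \
  --for=condition=Ready --timeout=10m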
If the issue persists, contact Cloud Customer Care for further assistance.
Error 200060102: Failure to complete restore operation due to volume validation error
Error 200060102 occurs because one or more VolumeRestore resources, which
manage the process of restoring data from a VolumeBackup to a
PersistentVolume, have entered a failed or deleting state during the
volume validation phase of a restore operation. The failed volume restore
results in the following error message in the restore resource's stateReason
field:
Volume Validation Error: Some of the volume restores failed - [projects/PROJECT_ID/locations/LOCATION/restorePlans/RESTORE_PLAN_ID/restores/RESTORE_ID/volumeRestores/VOLUME_RESTORE_ID (PVC: NAMESPACE/PVC_NAME), ...]
The error message lists the full resource names of the failed VolumeRestore resources, including the target PersistentVolumeClaim name and namespace. The error message indicates that the data restoration process for the affected PersistentVolumeClaim didn't complete successfully: Backup for GKE initiated VolumeRestore resources to provision PersistentVolumes from VolumeBackups, but the underlying Persistent Disk creation from the snapshot failed. VolumeRestore failures can occur for the following reasons:
Insufficient quota: there isn't enough allocated Persistent Disk quota in the project or region, for example, SSD_TOTAL_GB. You can check the regional quota as shown in the example after this list.
Permission issues: the service account used by Backup for GKE lacks the necessary permissions to create disks or access snapshots.
Network issues: there are transient or persistent network issues interrupting the disk creation process.
Invalid snapshot: the source VolumeBackup or the underlying Persistent Disk snapshot is corrupted or inaccessible.
Resource constraints: other cluster resource constraints are hindering volume provisioning.
Internal errors: there are internal issues within the Persistent Disk service.
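For the quota case, one way to check the regional SSD Persistent Disk quota (the region and project are placeholders) is:

# Show the limit and current usage for the SSD_TOTAL_GB quota in the region
# where the restore runs.
gcloud compute regions describe REGION --project=PROJECT_ID \
  | grep -B1 -A1 "metric: SSD_TOTAL_GB"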
To resolve this error, use the following instructions:
Identify the failed PersistentVolumeClaims listed in the error message, which lists the full resource names of the VolumeRestore objects that failed.
Get details of each failed VolumeRestore resource by running the gcloud beta container backup-restore volume-restores describe command:
gcloud beta container backup-restore volume-restores describe VOLUME_RESTORE_ID \
  --project=PROJECT_ID \
  --location=LOCATION \
  --restore-plan=RESTORE_PLAN_ID \
  --restore=RESTORE_ID
Replace the following:
VOLUME_RESTORE_ID: the ID of the failed VolumeRestore resource.
PROJECT_ID: the ID of your Google Cloud project.
LOCATION: the Google Cloud location of the restore.
RESTORE_PLAN_ID: the ID of the restore plan.
RESTORE_ID: the ID of the restore operation.
Examine the state and stateMessage fields in the output for details regarding the failure.
Examine the state of the target PersistentVolumeClaim by running the kubectl get pvc command:
kubectl get pvc PVC_NAME -n NAMESPACE_NAME -o yaml
Replace the following:
PVC_NAME: the name of the PersistentVolumeClaim resource.
NAMESPACE_NAME: the namespace where the PersistentVolumeClaim is located.
Confirm that the status.phase field of the output indicates a Pending phase. This phase means that the PersistentVolumeClaim isn't yet bound to a PersistentVolume, which is expected if the VolumeRestore fails.
Inspect the Events section in the YAML output for messages related to provisioning failures, such as ProvisioningFailed, for example:
Cloud KMS error when using key projects/PROJECT_ID/locations/LOCATION/keyRings/KEY_RING/cryptoKeys/CRYPTO_KEY: Permission 'cloudkms.cryptoKeyVersions.useToEncrypt' denied on resource 'projects/PROJECT_ID/locations/LOCATION/keyRings/KEY_RING/cryptoKeys/CRYPTO_KEY' (or it may not exist).
This output indicates that there is a permission issue while accessing the encryption key during disk creation. To grant the Compute Engine service agent the relevant permission to access the key, use the instructions described in the Backup for GKE documentation about enabling CMEK encryption (an example grant is sketched at the end of this section).
Review the GKE events in the PersistentVolumeClaim namespace, which provide detailed error messages from the PersistentVolume controller or CSI driver, by running the kubectl get events command:
kubectl get events -n NAMESPACE_NAME --sort-by='.lastTimestamp'
Replace NAMESPACE_NAME with the namespace of the PersistentVolumeClaim.
Identify events related to the PersistentVolumeClaim name that contain keywords such as FailedProvisioning or ExternalProvisioning. The events can also contain errors from the storage provisioner, such as pd.csi.storage.gke.io.
Examine Persistent Disk logs by checking Cloud Audit Logs and Persistent Disk logs in Cloud Logging for any errors related to disk creation operations around the time of the failure.
Based on the generated error messages, address the following underlying issues:
Increase Persistent Disk quotas if indicated by an error such as QUOTA_EXCEEDED: Quota SSD_TOTAL_GB exceeded.
Verify and correct IAM permissions.
Investigate and resolve network issues.
Contact Cloud Customer Care to resolve issues with the snapshot or the Persistent Disk service.
The PersistentVolumeClaim remains in a Pending state, and the restore operation doesn't automatically retry the VolumeRestore. To resolve this, trigger a restore operation for the Deployment or StatefulSet workload that uses the affected PersistentVolumeClaim.
Use a fine-grained restore to selectively restore the Deployment or StatefulSet workload associated with the failed PersistentVolumeClaim. This approach lets the standard GKE mechanisms handle the PersistentVolumeClaim creation and binding process again if the underlying issue is fixed. For more information about fine-grained restore, see Enable fine-grained restore.
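If the Events output points to the Cloud KMS permission error shown earlier, granting the Compute Engine service agent access to the key typically looks like the following sketch. All values are placeholders, and PROJECT_NUMBER is the numeric project number of the project that runs the cluster; follow the Backup for GKE CMEK documentation for the authoritative steps:

# Allow the Compute Engine service agent to use the CMEK key that protects
# the restored Persistent Disks.
gcloud kms keys add-iam-policy-binding CRYPTO_KEY \
  --project=PROJECT_ID \
  --location=LOCATION \
  --keyring=KEY_RING \
  --member="serviceAccount:service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"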
If the issue persists or the cause of the VolumeRestore failure is unclear,
contact Cloud Customer Care for further assistance.
Error 200060101: Failure to complete restore operation due to volume validation timeout
Error 200060101 occurs during the volume validation phase of a restore
operation when Backup for GKE stops waiting because at least one
VolumeRestore resource, which manages restoring data from a VolumeBackup,
didn't reach a succeeded state within the allocated timeout period. Other
VolumeRestore resources might also be incomplete.
The error message in the Restore resource's stateReason field shows the
first VolumeRestore resource encountered that wasn't yet in a succeeded
state when the timeout was checked. It includes the target
PersistentVolumeClaim name and namespace for that specific VolumeRestore,
for example:
Volume Validation Error: Timed out waiting for volume restore [projects/PROJECT_ID/locations/LOCATION/restorePlans/RESTORE_PLAN_NAME/restores/RESTORE_NAME/volumeRestores/VOLUME_RESTORE_ID (PVC: PVC_NAMESPACE/PVC_NAME)]
Backup for GKE initiates VolumeRestore resources to provision
PersistentVolumes from VolumeBackups. The error indicates that the
underlying Persistent Disk creation from the snapshot and the subsequent binding of
the PersistentVolumeClaim to the PersistentVolume took longer than the
calculated timeout for the cited VolumeRestore. Other VolumeRestores for
the same restore operation might also be in a non-completed state.
Even though the timeout was reached from a Backup for GKE perspective, the underlying disk creation process for the mentioned VolumeRestore resource, and potentially other VolumeRestore resources, might still be ongoing or might have failed.
To resolve this issue, use the following instructions:
Identify the timed-out PersistentVolumeClaim name and namespace in the error message, for example, (PVC: PVC_NAMESPACE/PVC_NAME).
List all the VolumeRestores associated with the restore operation to see their current states by running the gcloud beta container backup-restore volume-restores list command:
gcloud beta container backup-restore volume-restores list \
  --project=PROJECT_ID \
  --location=LOCATION \
  --restore-plan=RESTORE_PLAN_NAME \
  --restore=RESTORE_NAME
Replace the following:
PROJECT_ID: the ID of the Google Cloud project.
LOCATION: the Google Cloud location of the restore.
RESTORE_PLAN_NAME: the name of the restore plan.
RESTORE_NAME: the name of the restore operation.
Locate VolumeRestores that aren't in a succeeded state.
Get details about the VolumeRestore mentioned in the error and any other VolumeRestores that aren't in a succeeded state by running the gcloud beta container backup-restore volume-restores describe command:
gcloud beta container backup-restore volume-restores describe VOLUME_RESTORE_ID \
  --project=PROJECT_ID \
  --location=LOCATION \
  --restore-plan=RESTORE_PLAN_NAME \
  --restore=RESTORE_NAME
Replace the following:
VOLUME_RESTORE_ID: the ID of the VolumeRestore resource.
PROJECT_ID: the ID of your Google Cloud project.
LOCATION: the Google Cloud location of the restore.
RESTORE_PLAN_NAME: the name of the restore plan.
RESTORE_NAME: the name of the restore operation.
Check the state and stateMessage fields. The value of the state field is likely creating or restoring. The stateMessage field might provide more context and contain the target PersistentVolumeClaim details.
Examine the state of the identified target PersistentVolumeClaims by running the kubectl get pvc command:
kubectl get pvc PVC_NAME -n PVC_NAMESPACE -o yaml
Replace the following:
PVC_NAME: the name of the PersistentVolumeClaim.
PVC_NAMESPACE: the namespace of the PersistentVolumeClaim.
The value of the PersistentVolumeClaim's status.phase is likely to be Pending. Check the Events section for the following errors:
Waiting for first consumer to be created before binding: indicates that the StorageClass has volumeBindingMode: WaitForFirstConsumer. Provisioning of the PersistentVolume is delayed until a Pod that uses the PersistentVolumeClaim is created and scheduled. The issue might be with the Pod scheduling, not the volume provisioning itself. Therefore, we recommend confirming why the Pods consuming the PersistentVolumeClaim aren't being scheduled or aren't starting.
FailedProvisioning or errors from the storage provisioner: for example, pd.csi.storage.gke.io.
Review GKE events in the relevant namespaces by running the kubectl get events command:
kubectl get events -n PVC_NAMESPACE --sort-by='.lastTimestamp'
Replace PVC_NAMESPACE with the namespace of the PersistentVolumeClaim.
Look for events related to the PersistentVolumeClaim names, such as provisioning messages or errors.
Check Cloud Audit Logs and Persistent Disk logs in Cloud Logging.
Monitor the status of all VolumeRestores in creating and restoring states (see the example list command after these steps). After the issue is fixed, the status of the VolumeRestores can transition to either a succeeded or failed state. If the VolumeRestores reach a succeeded state, the PersistentVolumeClaims should become Bound and workloads should be functional. If any VolumeRestore enters a failed state, you need to perform troubleshooting steps to resolve the volume validation error. For more information, see Error 200060102: Failure to complete restore operation due to volume validation error.
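To watch only the VolumeRestores that haven't completed, you can filter the list output. The uppercase state value in the filter is an assumption about how the gcloud command reports states; adjust it if your output differs:

# List only VolumeRestore resources that have not yet succeeded for this
# restore operation. Replace the placeholders with your values.
gcloud beta container backup-restore volume-restores list \
  --project=PROJECT_ID \
  --location=LOCATION \
  --restore-plan=RESTORE_PLAN_NAME \
  --restore=RESTORE_NAME \
  --filter="state!=SUCCEEDED" \
  --format="table(name.basename(), state)"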
If VolumeRestores remain in creating or restoring states for an excessive
period of time, contact Cloud Customer Care for further assistance.