Data backup and recovery for Parallelstore on Google Kubernetes Engine

Parallelstore is available by invitation only. If you'd like to request access to Parallelstore in your Google Cloud project, contact your sales representative.

This guide describes how to back up the data in your Google Kubernetes Engine (GKE) connected Parallelstore instance to a Cloud Storage bucket, and how to prevent potential data loss by configuring a GKE CronJob that automatically backs up the data on a schedule. It also describes how to recover data for a Parallelstore instance.

Before you begin

Follow Create and connect to a Parallelstore instance from GKE to set up your GKE cluster and Parallelstore instance.

Data backup

The following sections describe how to set up a GKE CronJob that periodically backs up data from the Parallelstore instance connected to your GKE cluster, so that you can prevent data loss.

Connect to your GKE cluster

Get the credentials for your GKE cluster:

    gcloud container clusters get-credentials CLUSTER_NAME \
      --project=PROJECT_ID \
      --location=CLUSTER_LOCATION

Replace the following:

  • CLUSTER_NAME: the GKE cluster name.
  • PROJECT_ID: the Google Cloud project ID.
  • CLUSTER_LOCATION: the Compute Engine zone containing the cluster. Your cluster must be in a supported zone for the Parallelstore CSI driver.
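
You can optionally confirm that kubectl now points at the cluster, for example by listing its nodes:

    kubectl get nodes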

Provision required permissions

Your GKE CronJob needs the roles/parallelstore.admin and roles/storage.admin roles to import and export data between Cloud Storage and Parallelstore.

Create a Google Cloud service account

    gcloud iam service-accounts create parallelstore-sa \
      --project=PROJECT_ID

Grant the Google Cloud service account roles

Grant the Parallelstore Admin and Cloud Storage Admin roles to the service account:

    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member=serviceAccount:parallelstore-sa@PROJECT_ID.iam.gserviceaccount.com \
      --role=roles/parallelstore.admin
    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member=serviceAccount:parallelstore-sa@PROJECT_ID.iam.gserviceaccount.com \
      --role=roles/storage.admin
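
Optionally, you can confirm that both bindings exist by inspecting the project's IAM policy. This is a quick sanity check, not a required step:

    gcloud projects get-iam-policy PROJECT_ID \
      --flatten="bindings[].members" \
      --filter="bindings.members:parallelstore-sa@PROJECT_ID.iam.gserviceaccount.com" \
      --format="table(bindings.role)"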

Set up a GKE service account

You need to set up a GKE service account and allow it to impersonate the Google Cloud service account. Use the following steps to bind the GKE service account to the Google Cloud service account.

  1. Create the following parallelstore-sa.yaml service account manifest:

      # GKE service account used by workload and will have access to Parallelstore and GCS
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: parallelstore-sa
        namespace: default
    

    Next, deploy it to your GKE cluster using this command:

      kubectl apply -f parallelstore-sa.yaml
    
  2. Allow the GKE service account to impersonate the Google Cloud service account.

      # Bind the GCP SA and GKE SA
      gcloud iam service-accounts add-iam-policy-binding parallelstore-sa@PROJECT_ID.iam.gserviceaccount.com \
          --role roles/iam.workloadIdentityUser \
          --member "serviceAccount:PROJECT_ID.svc.id.goog[default/parallelstore-sa]"
    
      # Annotate the GKE SA with GCP SA
      kubectl annotate serviceaccount parallelstore-sa \
          --namespace default \
      iam.gke.io/gcp-service-account=parallelstore-sa@PROJECT_ID.iam.gserviceaccount.com
    

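You can optionally verify the setup by checking that the annotation appears on the GKE service account:

    kubectl get serviceaccount parallelstore-sa \
      --namespace default \
      --output yaml
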
Grant permissions to the Parallelstore Agent service account

    gcloud storage buckets add-iam-policy-binding GCS_BUCKET \
      --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-parallelstore.iam.gserviceaccount.com \
      --role=roles/storage.admin

Replace the following:

  • GCS_BUCKET: The Cloud Storage bucket URI in the format of gs://<bucket_name>.
  • PROJECT_NUMBER: The Google Cloud project number.
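
If you need to look up the project number, or want to confirm the bucket binding afterwards, the following commands may help:

    # Look up the project number for the current project
    gcloud projects describe PROJECT_ID --format="value(projectNumber)"

    # Optionally confirm the binding on the bucket
    gcloud storage buckets get-iam-policy GCS_BUCKET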

Start the CronJob

Configure and start a GKE CronJob for periodically exporting data from Parallelstore to Cloud Storage.

Create the configuration file ps-to-gcs-backup.yaml for the CronJob:

  apiVersion: batch/v1
  kind: CronJob
  metadata:
    name: ps-to-gcs-backup
  spec:
    concurrencyPolicy: Forbid
    failedJobsHistoryLimit: 1
    schedule: "0 * * * *"
    successfulJobsHistoryLimit: 3
    suspend: false
    jobTemplate:
      spec:
        template:
          metadata:
            annotations:
              gke-parallelstore/cpu-limit: "0"
              gke-parallelstore/ephemeral-storage-limit: "0"
              gke-parallelstore/memory-limit: "0"
              gke-parallelstore/volumes: "true"
          spec:
            serviceAccountName: parallelstore-sa
            containers:
            - name: pstore-backup
              image: google/cloud-sdk:slim
              imagePullPolicy: IfNotPresent
              command:
              - /bin/bash
              - -c
              - |
                #!/bin/bash
                set -ex

                # Retrieve the modification timestamp of the most recently modified folder, truncated to the minute
                latest_folder_timestamp=$(find $PSTORE_MOUNT_PATH/$SOURCE_PARALLELSTORE_PATH -type d -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2- | xargs -I{} stat -c %y {} | xargs -I{} date -d {} +"%Y-%m-%d %H:%M")

                # Start exporting from PStore to GCS
                operation=$(gcloud beta parallelstore instances export-data $PSTORE_NAME \
                  --location=$PSTORE_LOCATION \
                  --source-parallelstore-path=$SOURCE_PARALLELSTORE_PATH \
                  --destination-gcs-bucket-uri=$DESTINATION_GCS_URI \
                  --async \
                  --format="value(name)")

                # Wait until the operation completes
                while true; do
                  status=$(gcloud beta parallelstore operations describe $operation \
                    --location=$PSTORE_LOCATION \
                    --format="value(done)")
                  if [ "$status" == "True" ]; then
                    break
                  fi
                  sleep 60
                done

                # Check if export succeeded
                error=$(gcloud beta parallelstore operations describe $operation \
                  --location=$PSTORE_LOCATION \
                  --format="value(error)")
                if [ "$error" != "" ]; then
                  echo "!!! ERROR while exporting data !!!"
                fi

                # Delete the old files from PStore if requested
                # This will not delete the folder with the latest modification timestamp
                if $DELETE_AFTER_BACKUP && [ "$error" == "" ]; then
                  find $PSTORE_MOUNT_PATH/$SOURCE_PARALLELSTORE_PATH -mindepth 1 -type d |
                    while read dir; do
                        # Only delete folders that were modified earlier than the latest modification timestamp
                        folder_timestamp=$(stat -c %y "$dir")
                        if [ $(date -d "$folder_timestamp" +%s) -lt $(date -d "$latest_folder_timestamp" +%s) ]; then
                          echo "Deleting $dir"
                          rm -rf "$dir"
                        fi
                    done
                fi
              env:
              - name: PSTORE_MOUNT_PATH # mount path of the Parallelstore instance, should match the volumeMount defined for this container
                value: "PSTORE_MOUNT_PATH"
              - name: PSTORE_NAME # name of the Parallelstore instance that needs backup
                value: "PSTORE_NAME"
              - name: PSTORE_LOCATION # location/zone of the Parallelstore instance that needs backup
                value: "PSTORE_LOCATION"
              - name: SOURCE_PARALLELSTORE_PATH # absolute path from the PStore instance, without volume mount path
                value: "SOURCE_PARALLELSTORE_PATH"
              - name: DESTINATION_GCS_URI # GCS bucket uri used for storing backups, starting with "gs://"
                value: "DESTINATION_GCS_URI"
              - name: DELETE_AFTER_BACKUP # will delete old data from Parallelstore if true
                value: "DELETE_AFTER_BACKUP"
              volumeMounts:
              - mountPath: PSTORE_MOUNT_PATH # should match the value of env var PSTORE_MOUNT_PATH
                name: PSTORE_PV_NAME
            dnsPolicy: ClusterFirst
            restartPolicy: OnFailure
            terminationGracePeriodSeconds: 30
            volumes:
            - name: PSTORE_PV_NAME
              persistentVolumeClaim:
                claimName: PSTORE_PVC_NAME

Replace the following variables:

  • PSTORE_MOUNT_PATH: The mount path of the Parallelstore instance. It must match the volumeMount defined for this container.
  • PSTORE_PV_NAME: The name of the GKE PersistentVolume that points to your Parallelstore instance. This should have been set up in your GKE cluster as part of the prerequisites.
  • PSTORE_PVC_NAME: The name of the GKE PersistentVolumeClaim that requests the usage of the Parallelstore PersistentVolume. This should have been set up in your GKE cluster as part of the prerequisites.
  • PSTORE_NAME: The name of the Parallelstore instance that needs backup.
  • PSTORE_LOCATION: The location of the Parallelstore instance that needs backup.
  • SOURCE_PARALLELSTORE_PATH: The absolute path on the Parallelstore instance, without the volume mount path; it must start with /.
  • DESTINATION_GCS_URI: The URI of a Cloud Storage bucket, or a path within a bucket, in the format gs://<bucket_name>/<optional_path_inside_bucket>.
  • DELETE_AFTER_BACKUP: Whether to delete old data from Parallelstore after the backup to free up space. Supported values: true or false.

Deploy the CronJob to your GKE cluster using the following command:

  kubectl apply -f ps-to-gcs-backup.yaml

See CronJob for more details about setting up a CronJob.
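
After deploying, you can optionally confirm that the CronJob is registered and trigger a one-off run to validate the backup path before the first scheduled run; for example:

    # Confirm the CronJob exists and view its schedule
    kubectl get cronjob ps-to-gcs-backup

    # Trigger a manual run from the CronJob definition
    kubectl create job --from=cronjob/ps-to-gcs-backup ps-to-gcs-backup-manual

    # Follow the logs of the job's pod
    kubectl logs --selector=job-name=ps-to-gcs-backup-manual --follow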

Detecting data loss

When the state of a Parallelstore instance is FAILED, the data on the instance may no longer be accessible. You can use the following Google Cloud CLI command to check the state of the Parallelstore instance:

    gcloud beta parallelstore instances describe PARALLELSTORE_NAME \
      --location=PARALLELSTORE_LOCATION \
      --format="value(state)"
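
As a minimal sketch, you could run this check periodically (for example, from another CronJob) and alert when the instance reports FAILED; the echo below is only a placeholder for your own alerting mechanism:

    #!/bin/bash
    # Minimal sketch: flag potential data loss when the instance state is FAILED.
    state=$(gcloud beta parallelstore instances describe PARALLELSTORE_NAME \
      --location=PARALLELSTORE_LOCATION \
      --format="value(state)")
    if [ "$state" == "FAILED" ]; then
      echo "Parallelstore instance PARALLELSTORE_NAME is in FAILED state; data may be inaccessible." >&2
      exit 1
    fi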

Data recovery

When a disaster happens or the Parallelstore instance fails for any reason, you can either use GKE Volume Populator to automatically preload data from Cloud Storage into a GKE-managed Parallelstore instance, or manually create a new Parallelstore instance and import data from a Cloud Storage backup.

If you are recovering from a checkpoint of your workload, you need to decide which checkpoint to recover from by providing the path inside the Cloud Storage bucket.

The Parallelstore export in Cloud Storage might have partial data if the Parallelstore instance failed in the middle of the export operation. Check the data for completeness in the target Cloud Storage location before importing it to Parallelstore and resuming your workload.
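
For example, you can list the backup objects recursively and spot-check them before starting the import:

    gcloud storage ls --recursive gs://<bucket_name>/<optional_path_inside_bucket>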

GKE Volume Populator

GKE Volume Populator can be used to preload data from a Cloud Storage bucket path into a newly created Parallelstore instance. Instructions for this can be found in Preload Parallelstore.

Manual recovery

You can also create a Parallelstore instance manually and import data from a Cloud Storage bucket with the following steps.

  1. Create a new Parallelstore instance:

      gcloud beta parallelstore instances create PARALLELSTORE_NAME \
        --capacity-gib=CAPACITY_GIB \
        --location=PARALLELSTORE_LOCATION \
        --network=NETWORK_NAME \
        --project=PROJECT_ID
    
  2. Import data from Cloud Storage:

      gcloud beta parallelstore instances import-data PARALLELSTORE_NAME \
        --location=PARALLELSTORE_LOCATION \
        --source-gcs-bucket-uri=SOURCE_GCS_URI \
        --destination-parallelstore-path=DESTINATION_PARALLELSTORE_PATH \
        --async
    

Replace the following:

  • PARALLELSTORE_NAME: The name of this Parallelstore instance.
  • CAPACITY_GIB: The storage capacity of the Parallelstore instance in GiB. Supported values range from 12000 to 100000, in multiples of 4000.
  • PARALLELSTORE_LOCATION: The location of the Parallelstore instance; it must be in a supported zone.
  • NETWORK_NAME: The name of the VPC network that you created during Configure a VPC network. It must be the same network your GKE cluster uses and have Private Services Access enabled.
  • SOURCE_GCS_URI: The URI of the Cloud Storage bucket, or a path within the bucket, that contains the data you want to import, in the format gs://<bucket_name>/<optional_path_inside_bucket>.
  • DESTINATION_PARALLELSTORE_PATH: The absolute path on the Parallelstore instance to which you want to import the data; it must start with /.

More details about importing data into a Parallelstore instance can be found in Transfer data to or from Cloud Storage.
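
Because the import command runs with --async, it prints an operation name that you can poll until done is True, in the same way the backup CronJob polls the export operation. For example, with IMPORT_OPERATION_NAME standing in for the operation name printed by the import command:

    gcloud beta parallelstore operations describe IMPORT_OPERATION_NAME \
      --location=PARALLELSTORE_LOCATION \
      --format="value(done)"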