About GKE Volume Populator


The Google Kubernetes Engine (GKE) Volume Populator can help you automate and streamline the process of preloading data from Cloud Storage buckets to destination PersistentVolumeClaims (PVCs) during dynamic provisioning.

How GKE Volume Populator works

GKE Volume Populator builds on the core Kubernetes Volume Populator concept. Instead of provisioning an empty volume, GKE Volume Populator lets a PVC reference a GCPDataSource custom resource. This custom resource specifies the source Cloud Storage bucket and the credentials needed to read from it.
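To make the shape of a GCPDataSource resource concrete, the following Python sketch builds one as a dictionary and prints it as JSON, which kubectl accepts in place of YAML. The API version (datalayer.gke.io/v1) and the field names under spec.cloudStorage are assumptions that you should verify against the GKE reference. The resource name, namespace, bucket, and service account are hypothetical placeholders.

```python
import json


def gcp_data_source(name: str, namespace: str, bucket_uri: str, service_account: str) -> dict:
    """Build a GCPDataSource manifest as a plain dictionary.

    Assumption: the apiVersion and the spec.cloudStorage field names below
    reflect the GKE API. Check them against the GKE documentation before use.
    """
    if not bucket_uri.startswith("gs://"):
        raise ValueError("bucket_uri must be a Cloud Storage URI that starts with gs://")
    return {
        "apiVersion": "datalayer.gke.io/v1",  # assumed API group and version
        "kind": "GCPDataSource",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "cloudStorage": {
                "uri": bucket_uri,                      # source bucket or prefix
                "serviceAccountName": service_account,  # Kubernetes service account used for access
            }
        },
    }


# Hypothetical values, for illustration only.
manifest = gcp_data_source(
    name="training-data-source",
    namespace="ml-training",
    bucket_uri="gs://example-training-data/",
    service_account="populator-ksa",
)
print(json.dumps(manifest, indent=2))
```

You could save the printed JSON to a file and apply it with kubectl apply -f.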

When you create a PVC with a dataSourceRef pointing to a GCPDataSource resource, the GKE Volume Populator initiates the data transfer. It copies data from the specified Cloud Storage bucket URI into the underlying persistent storage volume before making the volume available to your Pods.
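As an illustration of that reference, the following Python sketch builds a PVC manifest whose dataSourceRef points to a GCPDataSource resource. The dataSourceRef field is part of the core Kubernetes PersistentVolumeClaim API. The datalayer.gke.io API group is an assumption to verify against the GKE reference, and the claim name, StorageClass name, access mode, and size are hypothetical.

```python
import json


def populated_pvc(name: str, namespace: str, storage_class: str,
                  size: str, access_mode: str, data_source_name: str) -> dict:
    """Build a PVC manifest that requests population from a GCPDataSource.

    Assumption: GCPDataSource belongs to the datalayer.gke.io API group.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
            # The populator acts on claims whose dataSourceRef names a GCPDataSource.
            "dataSourceRef": {
                "apiGroup": "datalayer.gke.io",  # assumed API group
                "kind": "GCPDataSource",
                "name": data_source_name,
            },
        },
    }


# Hypothetical values, for illustration only.
pvc = populated_pvc(
    name="training-data-pvc",
    namespace="ml-training",
    storage_class="example-hyperdisk-ml",
    size="300Gi",
    access_mode="ReadOnlyMany",
    data_source_name="training-data-source",
)
print(json.dumps(pvc, indent=2))
```

The claim and the GCPDataSource resource that it names are placed in the same namespace in this sketch.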

This process reduces the need for manual data transfer scripts or CLI commands, and automates the transfer of large datasets to persistent volumes. GKE Volume Populator transfers data from Cloud Storage (the source) to Parallelstore or Hyperdisk ML volumes (the destinations).

GKE Volume Populator is a GKE managed component that's enabled by default on both Autopilot and Standard clusters. You primarily interact with GKE Volume Populator through the gcloud CLI and kubectl CLI.

Architecture

The following diagram shows how data flows from the source storage to the destination storage, and how the PersistentVolume for the destination storage is created by using GKE Volume Populator.

  1. You create a PVC that references a GCPDataSource custom resource.
  2. The GKE Volume Populator detects the PVC and initiates a data transfer Job.
  3. The transfer Job runs on an existing node pool, or a new one is created if node auto-provisioning is enabled.
  4. The transfer Job copies data from the Cloud Storage bucket specified in the GCPDataSource resource to the destination storage volume.
  5. After the transfer is complete, the PVC is bound to the destination storage volume, making the data available to the workload Pod.
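Because the PVC is bound only after the transfer completes (step 5), the claim's phase is one way to tell whether the data is ready. The following Python sketch is a minimal illustration: it assumes that you've captured the output of kubectl get pvc PVC_NAME -o json as a string, and it reads the standard status.phase field. The sample inputs are hand-written stand-ins, not real cluster output.

```python
import json


def pvc_is_bound(pvc_json: str) -> bool:
    """Return True if the PVC described by the JSON text has phase "Bound"."""
    pvc = json.loads(pvc_json)
    # A claim with no status yet is treated as not bound.
    return pvc.get("status", {}).get("phase") == "Bound"


# Hand-written sample inputs, for illustration only.
pending = json.dumps({"kind": "PersistentVolumeClaim", "status": {"phase": "Pending"}})
bound = json.dumps({"kind": "PersistentVolumeClaim", "status": {"phase": "Bound"}})

print(pvc_is_bound(pending))  # False: the transfer has not finished
print(pvc_is_bound(bound))    # True: the volume is ready for the workload Pod
```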

Figure: Data transfer from the source storage and creation of the PersistentVolume for the destination storage by using GKE Volume Populator.

Key benefits

The GKE Volume Populator offers several benefits:

  • Automated data population: automatically populate volumes with data from Cloud Storage during provisioning, which helps reduce operational overhead.
  • Seamless data portability: move data from object storage to high-performance file (Parallelstore) or block storage (Hyperdisk) systems to help optimize for price or performance based on your workload needs.
  • Simplified workflows: reduce the need for separate data loading Jobs, or manual intervention to prepare persistent volumes.
  • Integration with Identity and Access Management (IAM): use IAM-based authentication through Workload Identity Federation for GKE to help ensure secure data transfer with fine-grained access control.
  • Accelerated AI/ML workloads: quickly preload large datasets, models, and weights directly into high-performance storage to help speed up training and inference tasks.

Use cases for GKE Volume Populator

You can use GKE Volume Populator to load large training datasets for AI/ML. Imagine you have a multi-terabyte dataset for training a large language model (LLM) stored in a Cloud Storage bucket. Your training Job runs on GKE and requires high I/O performance. Instead of manually copying the data, you can use the GKE Volume Populator to automatically provision a Parallelstore or Hyperdisk ML volume, and populate it with the dataset from Cloud Storage when the PVC is created. This automated process helps ensure that your training Pods start with immediate, high-speed access to the data.

Here are some more examples where you can use the GKE Volume Populator:

  • Pre-caching AI/ML model weights and assets from Cloud Storage into Hyperdisk ML volumes to accelerate model loading times for inference serving.
  • Migrating data from Cloud Storage to persistent volumes for stateful applications requiring performant disk access.

What's next