Storage

If you're interested in Vertex AI Managed Training, contact your sales representative for access.

Choosing the right storage configuration is critical for the performance and stability of your Managed Training cluster. The service integrates with two distinct, high-performance storage solutions:

  • Filestore: A required managed file service that provides the shared /home directories for all nodes in the cluster.
  • Google Cloud Managed Lustre: An optional parallel file system designed for extreme I/O performance, ideal for training on massive datasets.

This page provides an overview of their key uses and outlines the specific networking and deployment requirements for a successful integration with your cluster.

Storage integration for Managed Training

Managed Training relies on specific, networked storage solutions for its operation. Filestore is required to provide the shared /home directories for the cluster, while Managed Lustre is an optional high-performance file system for demanding workloads.

It's critical to configure the networking for these storage services correctly before deploying your cluster.

Filestore for home directories

Managed Training uses a Filestore instance to provide the shared /home directory for the cluster. To ensure proper connectivity, you must create your cloud resources in this specific order:

  1. Create the VPC network: First, deploy a VPC network configured with the recommended MTU (for example, 8896), as shown in the sketch after this list.
  2. Create the Filestore instance: Next, deploy the Filestore instance into the VPC you just created.
  3. Create the Managed Training cluster: Finally, deploy the cluster, which will then be able to connect to the Filestore instance within the same network.
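
For reference, step 1 might look like the following minimal sketch. It assumes a custom-mode VPC and placeholder values (NETWORK, SUBNET_NAME, REGION, and the 10.0.0.0/16 range) that you would replace with your own:

    # Create a VPC network with the recommended MTU
    gcloud compute networks create NETWORK \
        --project=PROJECT_ID \
        --subnet-mode=custom \
        --mtu=8896

    # Create a subnet for the cluster nodes (example range)
    gcloud compute networks subnets create SUBNET_NAME \
        --project=PROJECT_ID \
        --network=NETWORK \
        --region=REGION \
        --range=10.0.0.0/16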

Google Cloud Managed Lustre for high-performance workloads

For workloads that require maximum I/O performance, you can attach a Managed Lustre file system. This service connects to your VPC using Private Service Access.

Critical networking limitation: No transitive peering

A critical limitation for both Filestore and Google Cloud Managed Lustre is that they don't support transitive peering. This means only resources within the directly connected VPC can access the storage service. For example, if your cluster's VPC (N1) is peered with the storage service, another VPC (N2) that is peered with N1 won't have access.

Filestore

Key uses and requirements of Filestore with Managed Training:

  1. The Managed Training API requires that a Filestore instance be attached to the cluster to serve as the cluster's /home directory.
    • This Filestore instance must reside in the same zone or region, and in the same network, as all the compute nodes and login nodes.
    • The instance ID for the Filestore used for the /home directory is configured using the FS_INSTANCE_ID environment variable.
    • When creating a cluster using the API, the Filestore instance used for the home directory is specified in the orchestrator_spec.slurm_spec.home_directory_storage field (see the illustrative sketch after this list).
  2. Storage tiers and performance: Filestore offers different service tiers that users can choose based on performance needs, including:
    • Zonal: Recommended for high performance and higher capacity ceilings.
    • Basic SSD: Recommended for good performance at the lowest cost; suited to frequent, simple jobs.
  3. Configuring storage options: When creating a Managed Training cluster, you are presented with storage configuration options, which include Filestore, Managed Lustre, and Cloud Storage. Configuring Filestore requires enabling the Filestore API as a prerequisite.
  4. Additional file storage: Users can also configure and attach additional Filestore instances to the cluster. Note: If specified in the node pool protocol, Managed Training automatically mounts this additional Filestore instance into the /mnt/filestore directory on the cluster nodes.
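
As a rough illustration of where the home-directory Filestore is referenced, a cluster creation request might contain a field along the following lines. This is only a sketch: the exact request structure, field casing, and value format (instance ID versus full resource name) are assumptions here, so consult the Managed Training API reference for the authoritative shape.

    # Illustrative only; field path taken from the list above, value format assumed
    orchestrator_spec:
      slurm_spec:
        home_directory_storage: projects/PROJECT_ID/locations/ZONE/instances/FS_INSTANCE_ID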

Configure Filestore storage

Create a zonal or regional Filestore instance in the zone where you want to create the cluster. The Managed Training API requires a Filestore instance to be attached to the cluster to serve as the /home directory, and this instance must be in the same zone or region, and in the same network, as all the compute nodes and login nodes. In the following example, the 172.16.10.0/24 range is used for the Filestore deployment.

    SERVICE_TIER=ZONAL # Can also use BASIC_SSD

    # Create the reserved IP address range
    gcloud compute addresses create CLUSTER_ID-fs-ip-range \
        --project=PROJECT_ID \
        --global \
        --purpose=VPC_PEERING \
        --addresses=172.16.10.0 \
        --prefix-length=24 \
        --description="Filestore instance reserved IP range" \
        --network=NETWORK

    # Get the CIDR range
    FS_IP_RANGE=$(
      gcloud compute addresses describe CLUSTER_ID-fs-ip-range \
        --project=PROJECT_ID \
        --global \
        --format="value[separator=/](address, prefixLength)"
    )

    # Create the Filestore instance
    gcloud filestore instances create FS_INSTANCE_ID \
        --project=PROJECT_ID \
        --location=ZONE \
        --tier="${SERVICE_TIER}" \
        --file-share=name="nfsshare",capacity=1024 \
        --network=name=NETWORK,connect-mode=DIRECT_PEERING,reserved-ip-range="${FS_IP_RANGE}"
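
To verify the instance and find its IP address (useful when troubleshooting mounts), you can describe it with the standard gcloud filestore command:

    gcloud filestore instances describe FS_INSTANCE_ID \
        --project=PROJECT_ID \
        --location=ZONE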

Lustre

Google Cloud Managed Lustre delivers a high-performance, fully managed parallel file system optimized for AI and HPC applications. With multi-petabyte-scale capacity and up to 1 TBps throughput, Managed Lustre facilitates the migration of demanding workloads to the cloud.

Managed Lustre instances live in zones within regions. A region is a specific geographical location where you can run your resources. Each region is subdivided into several zones. For example, the us-central1 region in the central United States has zones us-central1-a, us-central1-b, us-central1-c, and us-central1-f. For more information, see Geography and regions.

To decrease network latency, we recommend creating a Managed Lustre instance in a region and zone that's close to where you plan to use it.

When creating a Managed Lustre instance, you must define the following properties:

  • The name of the instance used by Google Cloud.
  • The file system name used by client-side tools, for example lfs.
  • The storage capacity in gibibytes (GiB). Capacity can range from 9,000 GiB to 7,632,000 GiB (about 7.3 PiB). The maximum size of an instance depends on its performance tier.
  • The performance tier, ranging from 125 MBps per TiB to 1,000 MBps per TiB.
  • The zone. For best performance, create your instance in the same zone as your Managed Training cluster.
  • The VPC network, which must be the same network your Managed Training cluster uses.

Managed Lustre offers 4 performance tiers, each with a different maximum throughput speed per TiB. Performance tiers also affect the minimum and maximum instance size, and the step size between acceptable capacity values. You cannot change an instance's performance tier after it's been created.

Deploying Managed Lustre requires Private Service Access, which establishes VPC peering between the Managed Training VPC and the VPC hosting Managed Lustre, using a dedicated /20 subnet.
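
If Private Service Access has not already been set up on the VPC, it can be established with the standard service-networking commands. The following is a rough sketch, where psa-range is an illustrative name for the reserved /20 range:

    # Reserve a /20 range for Private Service Access
    gcloud compute addresses create psa-range \
        --project=PROJECT_ID \
        --global \
        --purpose=VPC_PEERING \
        --prefix-length=20 \
        --network=NETWORK

    # Create the peering to the service networking infrastructure
    gcloud services vpc-peerings connect \
        --project=PROJECT_ID \
        --service=servicenetworking.googleapis.com \
        --ranges=psa-range \
        --network=NETWORK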

Configure Managed Lustre instance (optional)

Configure a Google Cloud Managed Lustre instance only if you plan to use Managed Lustre with your Managed Training cluster.

Google Cloud Managed Lustre is a fully managed, high-performance parallel file system service on Google Cloud. It's specifically designed to accelerate demanding workloads in AI/Machine Learning and High-Performance Computing (HPC).

For optimal performance, deploy Google Cloud Managed Lustre in the same VPC network and zone as your Managed Training cluster, connected through VPC peering to the service networking infrastructure (Private Service Access).

Create Lustre instance

    gcloud lustre instances create LUSTRE_INSTANCE_ID \
        --project=PROJECT_ID \
        --location=ZONE \
        --filesystem=lustrefs \
        --per-unit-storage-throughput=500 \
        --capacity-gib=36000 \
        --network=NETWORK_NAME
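
After the instance is created, cluster nodes mount it with standard Lustre client syntax. The following is a minimal sketch that assumes the Lustre client packages are installed on the node and that MGS_IP_ADDRESS is the management-server address reported for the instance (for example, in the output of gcloud lustre instances describe):

    # Illustrative mount; lustrefs matches the --filesystem name used above
    sudo mkdir -p /mnt/lustre
    sudo mount -t lustre MGS_IP_ADDRESS@tcp:/lustrefs /mnt/lustre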

Cloud Storage mounting

As a prerequisite, make sure that the VM service account has the Storage Object User role.
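
For example, you can grant the role to the VM service account with a standard IAM binding, where SERVICE_ACCOUNT_EMAIL is a placeholder for your service account:

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
        --role="roles/storage.objectUser"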

Default mount

Managed Training uses Cloud Storage FUSE to dynamically mount your Cloud Storage buckets on all login and compute nodes, making them accessible under the /gcs directory. Dynamically mounted buckets can't be listed from the root mount point /gcs. You can access the dynamically mounted buckets as subdirectories:

    user@testcluster:$ ls /gcs/your-bucket-name
    user@testcluster:$ cd /gcs/your-bucket-name

Custom mount

To mount a specific Cloud Storage bucket to a local directory with custom options, use the following command structure, either by including it in the startup script at cluster creation or by running it directly on the nodes after the cluster is created.

    sudo mkdir -p $MOUNT_DIR
    echo "$GCS_BUCKET $MOUNT_DIR gcsfuse $OPTION_1,$OPTION_2,..." | sudo tee -a /etc/fstab
    sudo mount -a

For example, to mount the bucket mtdata to the /data directory, use the following command:

    sudo mkdir -p /data
    echo "mtdata /data gcsfuse defaults,_netdev,implicit_dirs,allow_other,dir_mode=777,file_mode=777,metadata_cache_negative_ttl_secs=0,metadata_cache_ttl_secs=-1,stat_cache_max_size_mb=-1,type_cache_max_size_mb=-1,enable_streaming_writes=true" | sudo tee -a /etc/fstab
    sudo mount -a

For a fully automated and consistent setup, include your custom mount scripts within the cluster's startup scripts. This practice ensures that your Cloud Storage buckets are automatically mounted across all nodes on startup, eliminating the need for manual configuration.
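
As a sketch of that approach, a startup script can wrap the same fstab-based mount shown earlier; the bucket name mtdata and the /data mount point are simply the values from the previous example:

    #!/bin/bash
    # Example startup script: mount the mtdata bucket at /data on every node
    sudo mkdir -p /data
    # Avoid duplicate fstab entries if the script runs more than once
    if ! grep -q "^mtdata /data gcsfuse" /etc/fstab; then
      echo "mtdata /data gcsfuse defaults,_netdev,implicit_dirs,allow_other,dir_mode=777,file_mode=777" | sudo tee -a /etc/fstab
    fi
    sudo mount -a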

For additional configuration recommendations tailored to AI/ML workloads, see the Performance tuning best practices guide. It provides specific guidance for optimizing Cloud Storage FUSE for training, inference, and checkpointing.