Design storage for AI and ML workloads in Google Cloud

Last reviewed 2024-03-20 UTC

When you choose Google Cloud storage services for your artificial intelligence (AI) and machine learning (ML) workloads, you must select the correct combination of storage options for each specific job. This careful selection applies when you upload your dataset, train and tune your model, place the model into production, or store the dataset and model in an archive. In short, you need to select the storage services that provide the right latency, scale, and cost for each stage of your AI and ML workloads.

To help you make well-informed choices, this document provides design guidance on how to use and integrate the variety of storage options offered by Google Cloud for key AI and ML workloads.

Figure 1 shows a summary of the primary storage choices. As shown in the diagram, you typically choose Cloud Storage when you have larger file sizes, lower input and output operations per second (IOPS), or higher latency. However, when you require higher IOPS, smaller file sizes, or lower latency, choose Filestore instead.

Figure 1: Primary AI and ML storage considerations

Choose Cloud Storage when you have larger file sizes, lower IOPS, or higher latency. Choose Filestore when you require higher IOPS, smaller file sizes, or lower latency.

Overview of AI and ML workload stages

AI and ML workloads consist of four primary stages: prepare, train, serve, and archive. At each of these stages in the lifecycle of a workload, you need to decide which storage options to use. In most cases, we recommend that you continue to use the same storage choice that you select in the prepare stage for the remaining stages. Following this recommendation helps you reduce the copying of datasets between storage services. However, there are some exceptions to this general rule, which are described later in this guide.

Some storage solutions work better than others at each stage and might need to be combined with additional storage choices for the best results. The effectiveness of the storage choice depends on the dataset properties, scale of the required compute and storage resources, latency, and other factors. The following table describes the stages and a brief summary of the recommended storage choices for each stage. For a visual representation of this table and additional details, see the decision tree.

Table 1: Storage recommendations for the stages and steps in AI and ML workloads
Stages Steps Storage recommendations

Prepare

Data preparation

  • Upload and ingest your data.
  • Transform the data into the correct format before training the model.

Cloud Storage

  • Large files (50 MB or larger) that can tolerate higher storage latency (tens of milliseconds).

Filestore Zonal

  • Smaller datasets with smaller files (less than 50 MB) and lower storage latency (~ 1 millisecond).

Train

  1. Model development
    • Develop your model by using notebooks and applying iterative trial and error.
  2. Model training
    • Use small-to-large scale numbers of graphics processing units (Cloud GPUs) or Tensor Processing Units (Cloud TPUs) to repeatedly read the training dataset.
    • Apply an iterative process to model development and training.

Cloud Storage

  • If you select Cloud Storage in the prepare stage, it's best to train your data in Cloud Storage.

Cloud Storage with Local SSD or Filestore

  • If you select Cloud Storage in the prepare stage but need to support small I/O requests or small files, you can supplement your training tasks. To do so, move some of your data from Cloud Storage to Local SSD or Filestore Zonal.

Filestore

  • If you select Filestore in the prepare stage, it's best to train your data in Filestore.
  • Create a Local SSD cache to supplement your Filestore training tasks.
  3. Checkpointing and restart
    • Save state periodically during model training by creating a checkpoint so that the training can restart after a node failure.
    • Make this selection based on the I/O pattern and the amount of data that needs to be saved at the checkpoint.

Cloud Storage

  • If you select Cloud Storage in the prepare stage, it's best to use Cloud Storage for checkpointing and restart.
  • Good for throughput, and workloads that need large numbers of threads.

Filestore Zonal

  • If you select Filestore in the prepare stage, it's best to use Filestore for checkpointing and restart.
  • Good for latency, high per-client throughput, and low numbers of threads.

Serve

  • Store the model.
  • Load the model into an instance running Cloud GPUs or Cloud TPUs at startup.
  • Store results of model inference, such as generated images.
  • Optionally, store and load the dataset used for model inference.

Cloud Storage

  • If you train your model in Cloud Storage, it's best to use Cloud Storage to serve your model.
  • Save the content generated by your model in Cloud Storage.

Filestore

  • If you train your model in Filestore, it's best to use Filestore for serving your model.
  • If you need durability and low latency when generating small files, choose Filestore Zonal (zonal availability) or Filestore Enterprise (regional availability).

Archive

  • Retain the training data and the model for extended time periods.

Cloud Storage

  • Optimize storage costs with multiple storage classes, Autoclass, or object lifecycle management.
  • If you use Filestore, you can use Filestore snapshots and backups, or copy the data to Cloud Storage.

For more details about the underlying assumptions for this table, see the following sections:

Criteria

To narrow your choices of which storage options to use for your AI and ML workloads, start by answering these questions:

  • Are your AI and ML I/O request sizes and file sizes small, medium, or large?
  • Are your AI and ML workloads sensitive to I/O latency and time to first byte (TTFB)?
  • Do you require high read and write throughput for single clients, aggregated clients, or both?
  • What is the largest number of Cloud GPUs or Cloud TPUs that your single largest AI and ML training workload requires?
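
The first three criteria can be captured in a small screening function. The following sketch is illustrative only: the 50 MB file-size cutoff and the sub-millisecond latency threshold come from the guidance in this document, but the function name and structure are our own, not a Google Cloud API.

```python
def recommend_primary_storage(avg_file_mb: float,
                              needs_sub_ms_latency: bool,
                              needs_high_iops: bool) -> str:
    """Rough first-pass recommendation based on the criteria above.

    Returns "Filestore" for small-file, low-latency, or high-IOPS
    workloads, and "Cloud Storage" otherwise.
    """
    if avg_file_mb < 50 or needs_sub_ms_latency or needs_high_iops:
        return "Filestore"
    return "Cloud Storage"

# Large files that tolerate tens of milliseconds of latency:
print(recommend_primary_storage(200, False, False))  # Cloud Storage
# Many small files with tight latency and IOPS needs:
print(recommend_primary_storage(5, True, True))      # Filestore
```

Treat the result as a starting point; the remaining questions about throughput and accelerator scale can still shift the decision.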

In addition to answering the previous questions, you also need to be aware of the compute options and accelerators that you can choose to help optimize your AI and ML workloads.

Compute platform considerations

Google Cloud supports three primary methods for running AI and ML workloads:

  • Compute Engine
  • Google Kubernetes Engine (GKE)
  • Vertex AI

For both Compute Engine and GKE, we recommend using the HPC Toolkit to deploy repeatable and turnkey clusters that follow Google Cloud best practices.

Accelerator considerations

When you select storage choices for AI and ML workloads, you also need to select the accelerator processing options that are appropriate for your task. Google Cloud supports two accelerator choices: NVIDIA Cloud GPUs and Google Cloud TPUs, which are custom-developed application-specific integrated circuits (ASICs). Both types of accelerator process machine learning workloads more efficiently than standard processors.

There are some important storage differences between Cloud GPUs and Cloud TPU accelerators. Instances that use Cloud GPUs support Local SSD and offer up to 200 GBps of remote storage throughput. Cloud TPU nodes and VMs don't support Local SSD and rely exclusively on remote storage access.

For more information about accelerator-optimized machine types, see Accelerator-optimized machine family. For more information about Cloud GPUs, see Cloud GPUs platforms. For more information about Cloud TPUs, see Introduction to Cloud TPU. For more information about choosing between Cloud TPUs and Cloud GPUs, see When to use Cloud TPUs.

Storage options

As summarized previously in Table 1, use object storage or file storage with your AI and ML workloads and then supplement this storage option with block storage. Figure 2 shows three typical options that you can consider when selecting the initial storage choice for your AI and ML workload: Cloud Storage, Filestore, and Google Cloud NetApp Volumes.

Figure 2: AI and ML appropriate storage services offered by Google Cloud

The three options that you can consider when selecting the initial storage choice for your AI and ML workloads are Cloud Storage, Filestore, and NetApp Volumes.

If you need object storage, choose Cloud Storage. Cloud Storage provides the following:

  • A storage location for unstructured data and objects.
  • APIs, such as the Cloud Storage JSON API, to access your storage buckets.
  • Persistent storage to save your data.
  • Throughput of terabytes per second, but with higher storage latency.

If you need file storage, you have two choices, Filestore and NetApp Volumes, which offer the following:

  • Filestore
    • Enterprise, high-performance file storage based on NFS.
    • Persistent storage to save your data.
    • Low storage latency, and throughput of 26 GBps.
  • NetApp Volumes
    • File storage compatible with NFS and Server Message Block (SMB).
    • Managed service with the option to use the NetApp ONTAP storage software.
    • Persistent storage to save your data.
    • Throughput of 4.5 GBps.

Use the following storage options as your first choice for AI and ML workloads: Cloud Storage, Filestore, or NetApp Volumes.

Use the following storage options to supplement your AI and ML workloads: Local SSD and Persistent Disk.

If you need to transfer data between these storage options, you can use the data transfer tools.

Cloud Storage

Cloud Storage is a fully managed object storage service that focuses on data preparation, AI model training, data serving, backup, and archiving for unstructured data. Some of the benefits of Cloud Storage include the following:

  • Unlimited storage capacity that scales to exabytes on a global basis
  • Ultra-high throughput performance
  • Regional and dual-region storage options for AI and ML workloads

Cloud Storage scales throughput to terabytes per second and beyond, but it has relatively higher latency (tens of milliseconds) than Filestore or a local file system. Individual thread throughput is limited to approximately 100-200 MB per second, which means that high throughput can only be achieved by using hundreds to thousands of individual threads. Additionally, high throughput also requires the use of large files and large I/O requests.

Cloud Storage supports client libraries in a variety of programming languages, but it also supports Cloud Storage FUSE. Cloud Storage FUSE lets you mount Cloud Storage buckets to your local file system. Cloud Storage FUSE enables your applications to use standard file system APIs to read from a bucket or write to a bucket. You can store and access your training data, models, and checkpoints with the scale, affordability, and performance of Cloud Storage.
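
Because Cloud Storage FUSE exposes a bucket as a directory, ordinary file I/O is all that's needed, for example to write and reload a training checkpoint. In the minimal sketch below, the mount point is hypothetical; a temporary directory stands in so the code runs anywhere, but on a VM you would point it at the directory where the bucket is mounted.

```python
import os
import tempfile

# Hypothetical Cloud Storage FUSE mount point; a temporary directory
# stands in so this sketch runs without a mounted bucket.
mount_point = os.environ.get("GCS_MOUNT", tempfile.mkdtemp())

ckpt_path = os.path.join(mount_point, "checkpoints", "step_01000.bin")
os.makedirs(os.path.dirname(ckpt_path), exist_ok=True)

# Write a checkpoint exactly as you would to a local file system.
with open(ckpt_path, "wb") as f:
    f.write(b"\x00" * 1024)  # placeholder for serialized model state

# Read it back, for example when restarting training after a node failure.
with open(ckpt_path, "rb") as f:
    state = f.read()
print(len(state))  # 1024
```

The same pattern applies to reading training data and saving model outputs through the mount.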

To learn more about Cloud Storage, use the following resources:

Filestore

Filestore is a fully managed NFS file-based storage service. The Filestore service tiers used for AI and ML workloads include the following:

  • Enterprise tier: Used for mission-critical workloads requiring regional availability.
  • Zonal tier: Used for high-performance applications that require zonal availability with high IOPS and throughput performance requirements.
  • Basic tier: Used for file sharing, software development, web hosting, and basic AI and ML workloads.

Filestore delivers low latency I/O performance. It's a good choice for datasets with either small I/O access requirements or small files. However, Filestore can also handle large I/O or large file use cases as needed. Filestore can scale up to approximately 100 TB in size. For AI training workloads that read data repeatedly, you can improve read throughput by using FS-Cache with Local SSD.

For more information about Filestore, see the Filestore overview. For more information about Filestore service tiers, see Service tiers. For more information about Filestore performance, see Optimize and test instance performance.

Google Cloud NetApp Volumes

NetApp Volumes is a fully managed service with advanced data management features that support NFS, SMB, and multiprotocol environments. NetApp Volumes supports low latency, multi-tebibyte volumes, and gigabytes per second of throughput.

For more information about NetApp Volumes, see What is Google Cloud NetApp Volumes? For more information about NetApp Volumes performance, see Expected performance.

Block storage

After you select your primary storage choice, you can use block storage to supplement performance, transfer data between storage options, and take advantage of low latency operations. You have two storage options with block storage: Local SSD and Persistent Disk.

Local SSD

Local SSD provides local storage directly to a VM or a container. Most Google Cloud machine types that contain Cloud GPUs include some amount of Local SSD. Because Local SSD disks are physically attached to the servers that host the VMs, they provide low latency access with potentially millions of IOPS. In contrast, Cloud TPU-based instances don't include Local SSD.

Although Local SSD delivers high performance, each storage instance is ephemeral. Thus, the data stored on a Local SSD drive is lost when you stop or delete the instance. Because of the ephemeral nature of Local SSD, consider other types of storage when your data requires better durability.

However, when the amount of training data is very small, it's common to copy the training data from Cloud Storage to the Local SSD of a GPU. The reason is that Local SSD provides lower I/O latency and reduces training time.
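
A staging step of this kind can be as simple as a recursive copy performed once before training starts. In the sketch below, both mount paths are hypothetical (temporary directories stand in so the code runs anywhere), and the shard file names are placeholders.

```python
import os
import shutil
import tempfile

# Hypothetical locations: a bucket mounted with Cloud Storage FUSE and
# a Local SSD mount. Temporary directories stand in for both.
bucket_mount = tempfile.mkdtemp()   # e.g., a mounted training-data bucket
local_ssd = tempfile.mkdtemp()      # e.g., a Local SSD mount point

# Stage a couple of placeholder training files in the "bucket".
for name in ("shard-0000.tfrecord", "shard-0001.tfrecord"):
    with open(os.path.join(bucket_mount, name), "wb") as f:
        f.write(os.urandom(1024))

# Copy the dataset once before training so that every subsequent epoch
# reads from the low-latency Local SSD instead of remote storage.
dest = shutil.copytree(bucket_mount, os.path.join(local_ssd, "dataset"))
print(sorted(os.listdir(dest)))
```

For larger datasets, a parallel copy tool such as `gcloud storage cp` is usually faster than a single-threaded copy.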

For more information about Local SSD, see About Local SSDs. For more information about the amount of Local SSD capacity available with Cloud GPUs instance types, see GPU platforms.

Persistent Disk

Persistent Disk is a network block storage service with a comprehensive suite of data persistence and management capabilities. In addition to its use as a boot disk, you can use Persistent Disk with AI workloads, such as scratch storage. Persistent Disk is available in the following options:

  • Standard, which provides efficient and reliable block storage.
  • Balanced, which provides cost-effective and reliable block storage.
  • SSD, which provides fast and reliable block storage.
  • Extreme, which provides the highest performance block storage option with customizable IOPS.

For more information about Persistent Disk, see Persistent Disk.

Data transfer tools

When you perform AI and ML tasks, there are times when you need to copy your data from one location to another. For example, if your data starts in Cloud Storage, you might move it elsewhere to train the model, then copy the checkpoint snapshots or trained model back to Cloud Storage. You could also perform most of your tasks in Filestore, then move your data and model into Cloud Storage for archive purposes. This section discusses your options for moving data between storage services in Google Cloud.

Storage Transfer Service

With the Storage Transfer Service, you can transfer your data between Cloud Storage, Filestore, and NetApp Volumes. This fully managed service also lets you copy data between your on-premises file storage and object storage repositories, your Google Cloud storage, and other cloud providers. The Storage Transfer Service lets you copy your data securely from the source location to the target location, and perform periodic transfers of changed data. It also provides data integrity validation, automatic retries, and load balancing.

For more information about Storage Transfer Service, see What is Storage Transfer Service?

Command-line interface options

When you move data between Filestore and Cloud Storage, you can use the following tools:

  • gcloud storage (recommended): Create and manage Cloud Storage buckets and objects with optimal throughput and a full suite of gcloud CLI commands.
  • gsutil: Manage and maintain Cloud Storage components. Requires fine-tuning to achieve better throughput.
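
As an illustration of the recommended tool, the sketch below assembles a `gcloud storage` copy command for moving a prepared dataset into a bucket. The source path and bucket name are hypothetical; run the printed command in a shell or through `subprocess`.

```python
import shlex

# Hypothetical Filestore mount and destination bucket.
source_dir = "/mnt/filestore/dataset"
dest_bucket = "gs://example-training-data"

# `gcloud storage cp --recursive` copies a directory tree into a bucket.
cmd = ["gcloud", "storage", "cp", "--recursive", source_dir, dest_bucket]
print(shlex.join(cmd))
```

Building the command as a list and joining with `shlex` avoids quoting mistakes when paths contain spaces.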

Map your storage choices to the AI and ML stages

This section expands upon the summary provided in Table 1 to explore the specific recommendations and guidance for each stage of an AI and ML workload. The goal is to help you understand the rationale for these choices and select the best storage options for each AI and ML stage. This analysis results in three primary recommendations that are explored in the section, Storage recommendations for AI and ML.

The following figure provides a decision tree that shows the recommended storage options for the four main stages of an AI and ML workload. The diagram is followed by a detailed explanation of each stage and the choices that you can make at each stage.

Figure 3: Storage choices for each AI and ML stage

A decision tree that shows the recommended storage options for the four main stages of an AI and ML workload.

Prepare

At this initial stage, you need to select whether you want to use Cloud Storage or Filestore as your persistent source of truth for your data. You can also select potential optimizations for data-intensive training. Be aware that different teams in your organization can have varying workload and dataset types that might result in those teams making different storage decisions. To accommodate these varied needs, you can mix and match your storage choices between Cloud Storage and Filestore accordingly.

Cloud Storage for the prepare stage

You should select Cloud Storage to prepare your data if any of the following conditions apply:

  • Your workload contains large files of 50 MB or more.
  • Your workload requires lower IOPS.
  • Your workload can tolerate higher storage latency in the tens of milliseconds.
  • You need to gain access to the dataset through Cloud Storage APIs, or Cloud Storage FUSE and a subset of file APIs.

To optimize your workload in Cloud Storage, you can select regional storage and place your bucket in the same region as your compute resources. However, if you need higher reliability, or if you use accelerators located in two different regions, you'll want to select dual-region storage.

Filestore for the prepare stage

You should select Filestore to prepare your data if any of the following conditions apply:

  • Your workload contains smaller file sizes of less than 50 MB.
  • Your workload requires higher IOPS.
  • Your workload needs lower latency of less than 1 millisecond to meet storage requirements for random I/O and metadata access.
  • Your users need a desktop-like experience with full POSIX support to view and manage the data.
  • Your users need to perform other tasks, such as software development.

Other considerations for the prepare stage

If you find it hard to choose an option at this stage, consider the following points to help you make your decision:

  • If you want to use other frameworks on the dataset, such as Dataflow, Spark, or BigQuery, then Cloud Storage is a logical choice because of the custom integration it has with these types of frameworks.
  • Filestore has a maximum capacity of approximately 100 TB. If you need to train your model with datasets larger than this, or if you can't break the set into multiple 100 TB instances, then Cloud Storage is a better option.

During the data preparation phase, many users reorganize their data into large chunks to improve access efficiency and avoid random read requests. To further reduce the I/O performance requirements on the storage system, many users use pipelining, training optimization to increase the number of I/O threads, or both.
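
The reorganization step can be sketched with the standard library: pack many small sample files into a few large shard archives so that training reads become large sequential I/O. The shard size and file names below are arbitrary illustrative choices, not a Google Cloud convention.

```python
import os
import tarfile
import tempfile

src = tempfile.mkdtemp()   # directory of small source samples
out = tempfile.mkdtemp()   # directory for large shard archives

# Create 100 small placeholder samples.
for i in range(100):
    with open(os.path.join(src, f"sample-{i:04d}.bin"), "wb") as f:
        f.write(b"x" * 512)

samples = sorted(os.listdir(src))
shard_size = 25  # samples per shard; in practice, size shards to 50 MB+
shards = []
for s in range(0, len(samples), shard_size):
    shard_path = os.path.join(out, f"shard-{s // shard_size:05d}.tar")
    with tarfile.open(shard_path, "w") as tar:
        for name in samples[s:s + shard_size]:
            tar.add(os.path.join(src, name), arcname=name)
    shards.append(shard_path)

print(len(shards))  # 4 shards of 25 samples each
```

Training jobs then read whole shards sequentially instead of issuing one small random read per sample.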

Train

At the train stage, you typically reuse the primary storage option that you selected for the prepare stage. If your primary storage choice can't handle the training workload alone, you might need to supplement the primary storage. You can add supplemental storage as needed, such as Local SSDs, to balance the workload.

In addition to providing recommendations for using either Cloud Storage or Filestore at this stage, this section also provides you with more details about these recommendations. The details include the following:

  • Guidance for file sizes and request sizes
  • Suggestions on when to supplement your primary storage choice
  • An explanation of the implementation details for the two key workloads at this stage—data loading, and checkpointing and restart

Cloud Storage for the train stage

The main reasons to select Cloud Storage when training your data include the following:

  • If you use Cloud Storage when you prepare your data, it's best to keep your data in Cloud Storage for training.
  • Cloud Storage is a good choice for high aggregate throughput, for workloads that don't require high single-VM throughput, and for workloads that use many threads to increase throughput as needed.

Cloud Storage with Local SSD or Filestore for the train stage

The main reason to select Cloud Storage with Local SSD or Filestore when training your data occurs when you need to support small I/O requests or small files. In this case, you can supplement your Cloud Storage training task by moving some of the data to Local SSD or Filestore Zonal.

Filestore for the train stage

The main reasons to select Filestore when training your data include the following:

  • If you use Filestore when you prepare your data, in most cases you should continue to keep your data in Filestore for training.
  • Filestore is a good choice for low latency, high per-client throughput, and applications that use a low number of threads but still require high performance.
  • If you need to supplement your training tasks in Filestore, consider creating a Local SSD cache as needed.

File sizes and request sizes

Once the dataset is ready for training, there are two main access patterns that can help you evaluate the different storage options.

Data sets containing large files that are accessed with large request sizes

In this option, the training job consists primarily of larger files of 50 MB or more. The training job ingests the files with 1 MB to 16 MB per request. We generally recommend Cloud Storage with Cloud Storage FUSE for this option because the files are large enough that Cloud Storage should be able to keep the accelerators supplied. Keep in mind that you might need hundreds to thousands of threads to achieve maximum performance with this option.

However, if you require full POSIX APIs for other applications, or your workload isn't appropriate for the high number of required threads, then Filestore is a good alternative.
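
The thread-level parallelism described above can be sketched with a thread pool that issues fixed-size range reads against one large object. A local temporary file stands in for an object in a mounted bucket, and the 1 MiB request size follows the guidance above; real workloads would use far more threads.

```python
import concurrent.futures
import os
import tempfile

# Per-thread throughput to Cloud Storage tops out around 100-200 MB/s,
# so aggregate throughput comes from many parallel range reads.
path = os.path.join(tempfile.mkdtemp(), "large-object.bin")
data = os.urandom(4 * 1024 * 1024)  # 4 MiB stand-in for a large file
with open(path, "wb") as f:
    f.write(data)

chunk = 1024 * 1024  # 1 MiB request size

def read_range(offset: int) -> bytes:
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(chunk)

offsets = range(0, len(data), chunk)
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(read_range, offsets))  # preserves range order

assert b"".join(parts) == data
print(len(parts))  # 4 ranges read in parallel
```

The same structure scales to hundreds of threads; `pool.map` keeps the results in offset order so the object can be reassembled.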

Data sets containing small-to-medium sized files, or files accessed with small request sizes

With this option, you can classify your training job in one of two ways:

  • Many small-to-medium sized files of less than 50 MB.
  • A dataset with larger files, but the data is read sequentially or randomly with relatively small read request sizes (for example, less than 1 MB). An example of this use case is when the system reads less than 100 KB at a time from a multi-gigabyte or multi-terabyte file.

If you already use Filestore for its POSIX capabilities, then we recommend keeping your data in Filestore for training. Filestore offers low I/O latency access to the data. This lower latency can reduce the overall training time and might lower the cost of training your model.

If you use Cloud Storage to store your data, then we recommend that you copy your data to Local SSD or Filestore prior to training.

Data loading

During data loading, Cloud GPUs or Cloud TPUs import batches of data repeatedly to train the model. This phase can be cache friendly, depending on the size of the batches and the order in which you request them. Your goal at this point is to train the model with maximum efficiency but at the lowest cost.

If the size of your training data scales to petabytes, the data might need to be re-read multiple times. Such a scale requires intensive processing by a GPU or TPU accelerator. However, you need to ensure that your Cloud GPUs and Cloud TPUs aren't idle, but process your data actively. Otherwise, you pay for an expensive, idle accelerator while you copy the data from one location to another.
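
A common way to keep accelerators busy is to prefetch: while the accelerator consumes batch n, a background thread loads batch n+1. The sketch below is a minimal illustration of that overlap; `load_batch` and `train_step` are placeholders for your own I/O and training code, and the sleeps simulate storage and compute time.

```python
import concurrent.futures
import time

def load_batch(i: int) -> list:
    time.sleep(0.01)  # simulate storage I/O latency
    return [i] * 4    # placeholder batch

def train_step(batch: list) -> None:
    time.sleep(0.01)  # simulate accelerator compute

processed = []
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as loader:
    future = loader.submit(load_batch, 0)      # prefetch the first batch
    for i in range(1, 5):
        batch = future.result()                # wait for batch i-1
        future = loader.submit(load_batch, i)  # overlap the next load
        train_step(batch)                      # compute on current batch
        processed.append(batch[0])
    processed.append(future.result()[0])       # drain the final batch

print(processed)  # [0, 1, 2, 3, 4]
```

ML frameworks provide this pattern natively (for example, prefetching data loaders), but the principle is the same: storage reads and accelerator compute should overlap, not alternate.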

For data loading, consider the following:

  • Parallelism: There are numerous ways to parallelize training, and each can have an impact on the overall storage performance required and