Design storage for AI and ML workloads in Google Cloud

Last reviewed 2024-03-20 UTC

When you choose Google Cloud storage services for your artificial intelligence (AI) and machine learning (ML) workloads, you must select the correct combination of storage options for each specific job. This careful selection applies when you upload your dataset, train and tune your model, place the model into production, or store the dataset and model in an archive. In short, you need to select the storage services that provide the right latency, scale, and cost for each stage of your AI and ML workloads.

To help you make well-informed choices, this document provides design guidance on how to use and integrate the variety of storage options offered by Google Cloud for key AI and ML workloads.

Figure 1 shows a summary of the primary storage choices. As shown in the diagram, you typically choose Cloud Storage when you have larger file sizes, lower input and output operations per second (IOPS), or higher latency. However, when you require higher IOPS, smaller file sizes, or lower latency, choose Filestore instead.

Figure 1: Primary AI and ML storage considerations

Choose Cloud Storage when you have larger file sizes, lower IOPS, or higher latency. Choose Filestore when you require higher IOPS, smaller file sizes, or lower latency.

Overview of AI and ML workload stages

AI and ML workloads consist of four primary stages: prepare, train, serve, and archive. At each of these stages in the lifecycle of a workload, you need to decide which storage options to use. In most cases, we recommend that you continue to use the same storage choice that you select in the prepare stage for the remaining stages. Following this recommendation helps you reduce the copying of datasets between storage services. However, there are some exceptions to this general rule, which are described later in this guide.

Some storage solutions work better than others at each stage and might need to be combined with additional storage choices for the best results. The effectiveness of the storage choice depends on the dataset properties, scale of the required compute and storage resources, latency, and other factors. The following table describes the stages and a brief summary of the recommended storage choices for each stage. For a visual representation of this table and additional details, see the decision tree.

Table 1: Storage recommendations for the stages and steps in AI and ML workloads
Stages Steps Storage recommendations

Prepare

Data preparation

  • Upload and ingest your data.
  • Transform the data into the correct format before training the model.

Cloud Storage

  • Large files (50 MB or larger) that can tolerate higher storage latency (tens of milliseconds).

Filestore Zonal

  • Smaller datasets with smaller files (less than 50 MB) and lower storage latency (~ 1 millisecond).

Train

  1. Model development
    • Develop your model by using notebooks and applying iterative trial and error.
  2. Model training
    • Use small-to-large scale numbers of graphics processing units (Cloud GPUs) or Tensor Processing Units (Cloud TPUs) to repeatedly read the training dataset.
    • Apply an iterative process to model development and training.

Cloud Storage

  • If you select Cloud Storage in the prepare stage, it's best to train your data in Cloud Storage.

Cloud Storage with Local SSD or Filestore

  • If you select Cloud Storage in the prepare stage but need to support small I/O requests or small files, you can supplement your training tasks. To do so, move some of your data from Cloud Storage to Local SSD or Filestore Zonal.

Filestore

  • If you select Filestore in the prepare stage, it's best to train your data in Filestore.
  • Create a Local SSD cache to supplement your Filestore training tasks.
  3. Checkpointing and restart
    • Save state periodically during model training by creating a checkpoint so that the training can restart after a node failure.
    • Make this selection based on the I/O pattern and the amount of data that needs to be saved at the checkpoint.

Cloud Storage

  • If you select Cloud Storage in the prepare stage, it's best to use Cloud Storage for checkpointing and restart.
  • Good for throughput, and workloads that need large numbers of threads.

Filestore Zonal

  • If you select Filestore in the prepare stage, it's best to use Filestore for checkpointing and restart.
  • Good for latency, high per-client throughput, and low numbers of threads.

Serve

  • Store the model.
  • Load the model into an instance running Cloud GPUs or Cloud TPUs at startup.
  • Store results of model inference, such as generated images.
  • Optionally, store and load the dataset used for model inference.

Cloud Storage

  • If you train your model in Cloud Storage, it's best to use Cloud Storage to serve your model.
  • Save the content generated by your model in Cloud Storage.

Filestore

  • If you train your model in Filestore, it's best to use Filestore for serving your model.
  • If you need durability and low latency when generating small files, choose Filestore Zonal (zonal availability) or Filestore Enterprise (regional availability).

Archive

  • Retain the training data and the model for extended time periods.

Cloud Storage

  • Optimize storage costs with multiple storage classes, Autoclass, or object lifecycle management.
  • If you use Filestore, you can use Filestore snapshots and backups, or copy the data to Cloud Storage.

For more details about the underlying assumptions for this table, see the following sections:

Criteria

To narrow your choices of which storage options to use for your AI and ML workloads, start by answering these questions:

  • Are your AI and ML I/O request sizes and file sizes small, medium, or large?
  • Are your AI and ML workloads sensitive to I/O latency and time to first byte (TTFB)?
  • Do you require high read and write throughput for single clients, aggregated clients, or both?
  • What is the largest number of Cloud GPUs or Cloud TPUs that your single largest AI and ML training workload requires?
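
The first three criteria can be captured in a small screening function. The following sketch is illustrative only: the 50 MB file-size cutoff and the sub-millisecond latency threshold come from the guidance in this document, but the function name and structure are our own, not a Google Cloud API.

```python
def recommend_primary_storage(avg_file_mb: float,
                              needs_sub_ms_latency: bool,
                              needs_high_iops: bool) -> str:
    """Rough first-pass recommendation based on the criteria above.

    Returns "Filestore" for small-file, low-latency, or high-IOPS
    workloads, and "Cloud Storage" otherwise.
    """
    if avg_file_mb < 50 or needs_sub_ms_latency or needs_high_iops:
        return "Filestore"
    return "Cloud Storage"

# Large files that tolerate tens of milliseconds of latency:
print(recommend_primary_storage(200, False, False))  # Cloud Storage
# Many small files with tight latency and IOPS needs:
print(recommend_primary_storage(5, True, True))      # Filestore
```

Treat the result as a starting point; the remaining questions about throughput and accelerator scale can still shift the decision.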

In addition to answering the previous questions, you also need to be aware of the compute options and accelerators that you can choose to help optimize your AI and ML workloads.

Compute platform considerations

Google Cloud supports three primary methods for running AI and ML workloads:

  • Compute Engine
  • Google Kubernetes Engine (GKE)
  • Vertex AI

For both Compute Engine and GKE, we recommend using the HPC Toolkit to deploy repeatable and turnkey clusters that follow Google Cloud best practices.

Accelerator considerations

When you select storage choices for AI and ML workloads, you also need to select the accelerator processing options that are appropriate for your task. Google Cloud supports two accelerator choices: NVIDIA Cloud GPUs and Google Cloud TPUs, which are custom-developed application-specific integrated circuits (ASICs). Both types of accelerator process machine learning workloads more efficiently than standard processors.

There are some important storage differences between Cloud GPUs and Cloud TPU accelerators. Instances that use Cloud GPUs support Local SSD and offer up to 200 GBps of remote storage throughput. Cloud TPU nodes and VMs don't support Local SSD and rely exclusively on remote storage access.

For more information about accelerator-optimized machine types, see Accelerator-optimized machine family. For more information about Cloud GPUs, see Cloud GPUs platforms. For more information about Cloud TPUs, see Introduction to Cloud TPU. For more information about choosing between Cloud TPUs and Cloud GPUs, see When to use Cloud TPUs.

Storage options

As summarized previously in Table 1, use object storage or file storage with your AI and ML workloads and then supplement this storage option with block storage. Figure 2 shows three typical options that you can consider when selecting the initial storage choice for your AI and ML workload: Cloud Storage, Filestore, and Google Cloud NetApp Volumes.

Figure 2: AI and ML appropriate storage services offered by Google Cloud

The three options that you can consider when selecting the initial storage choice for your AI and ML workloads are Cloud Storage, Filestore, and NetApp Volumes.

If you need object storage, choose Cloud Storage. Cloud Storage provides the following:

  • A storage location for unstructured data and objects.
  • APIs, such as the Cloud Storage JSON API, to access your storage buckets.
  • Persistent storage to save your data.
  • Throughput of terabytes per second, but with higher storage latency.

If you need file storage, you have two choices, Filestore and NetApp Volumes, which offer the following:

  • Filestore
    • Enterprise, high-performance file storage based on NFS.
    • Persistent storage to save your data.
    • Low storage latency, and throughput of 26 GBps.
  • NetApp Volumes
    • File storage compatible with NFS and Server Message Block (SMB).
    • Managed service with the option to use the NetApp ONTAP storage software.
    • Persistent storage to save your data.
    • Throughput of 4.5 GBps.

Use the following storage options as your first choice for AI and ML workloads: Cloud Storage, Filestore, or NetApp Volumes.

Use the following storage options to supplement your AI and ML workloads: Local SSD and Persistent Disk.

If you need to transfer data between these storage options, you can use the data transfer tools.

Cloud Storage

Cloud Storage is a fully managed object storage service that focuses on data preparation, AI model training, data serving, backup, and archiving for unstructured data. Some of the benefits of Cloud Storage include the following:

  • Unlimited storage capacity that scales to exabytes on a global basis
  • Ultra-high throughput performance
  • Regional and dual-region storage options for AI and ML workloads

Cloud Storage scales throughput to terabytes per second and beyond, but it has relatively higher latency (tens of milliseconds) than Filestore or a local file system. Individual thread throughput is limited to approximately 100-200 MB per second, which means that high throughput can only be achieved by using hundreds to thousands of individual threads. Additionally, high throughput also requires the use of large files and large I/O requests.

Cloud Storage supports client libraries in a variety of programming languages, but it also supports Cloud Storage FUSE. Cloud Storage FUSE lets you mount Cloud Storage buckets to your local file system. Cloud Storage FUSE enables your applications to use standard file system APIs to read from a bucket or write to a bucket. You can store and access your training data, models, and checkpoints with the scale, affordability, and performance of Cloud Storage.
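
Because Cloud Storage FUSE exposes a bucket as a directory, ordinary file I/O is all that's needed, for example to write and reload a training checkpoint. In the minimal sketch below, the mount point is hypothetical; a temporary directory stands in so the code runs anywhere, but on a VM you would point it at the directory where the bucket is mounted.

```python
import os
import tempfile

# Hypothetical Cloud Storage FUSE mount point; a temporary directory
# stands in so this sketch runs without a mounted bucket.
mount_point = os.environ.get("GCS_MOUNT", tempfile.mkdtemp())

ckpt_path = os.path.join(mount_point, "checkpoints", "step_01000.bin")
os.makedirs(os.path.dirname(ckpt_path), exist_ok=True)

# Write a checkpoint exactly as you would to a local file system.
with open(ckpt_path, "wb") as f:
    f.write(b"\x00" * 1024)  # placeholder for serialized model state

# Read it back, for example when restarting training after a node failure.
with open(ckpt_path, "rb") as f:
    state = f.read()
print(len(state))  # 1024
```

The same pattern applies to reading training data and saving model outputs through the mount.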

To learn more about Cloud Storage, use the following resources:

Filestore

Filestore is a fully managed NFS file-based storage service. The Filestore service tiers used for AI and ML workloads include the following:

  • Enterprise tier: Used for mission-critical workloads requiring regional availability.
  • Zonal tier: Used for high-performance applications that require zonal availability with high IOPS and throughput performance requirements.
  • Basic tier: Used for file sharing, software development, web hosting, and basic AI and ML workloads.

Filestore delivers low latency I/O performance. It's a good choice for datasets with either small I/O access requirements or small files. However, Filestore can also handle large I/O or large file use cases as needed. Filestore can scale up to approximately 100 TB in size. For AI training workloads that read data repeatedly, you can improve read throughput by using FS-Cache with Local SSD.

For more information about Filestore, see the Filestore overview. For more information about Filestore service tiers, see Service tiers. For more information about Filestore performance, see Optimize and test instance performance.

Google Cloud NetApp Volumes

NetApp Volumes is a fully managed service with advanced data management features that support NFS, SMB, and multiprotocol environments. NetApp Volumes supports low latency, multi-tebibyte volumes, and gigabytes per second of throughput.

For more information about NetApp Volumes, see What is Google Cloud NetApp Volumes? For more information about NetApp Volumes performance, see Expected performance.

Block storage

After you select your primary storage choice, you can use block storage to supplement performance, transfer data between storage options, and take advantage of low latency operations. You have two storage options with block storage: Local SSD and Persistent Disk.

Local SSD

Local SSD provides local storage directly to a VM or a container. Most Google Cloud machine types that contain Cloud GPUs include some amount of Local SSD. Because Local SSD disks are physically attached to the servers that host the VMs, they provide low latency access with potentially millions of IOPS. In contrast, Cloud TPU-based instances don't include Local SSD.

Although Local SSD delivers high performance, each storage instance is ephemeral. Thus, the data stored on a Local SSD drive is lost when you stop or delete the instance. Because of the ephemeral nature of Local SSD, consider other types of storage when your data requires better durability.

However, when the amount of training data is very small, it's common to copy the training data from Cloud Storage to the Local SSD of a GPU. The reason is that Local SSD provides lower I/O latency and reduces training time.
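
A staging step of this kind can be as simple as a recursive copy performed once before training starts. In the sketch below, both mount paths are hypothetical (temporary directories stand in so the code runs anywhere), and the shard file names are placeholders.

```python
import os
import shutil
import tempfile

# Hypothetical locations: a bucket mounted with Cloud Storage FUSE and
# a Local SSD mount. Temporary directories stand in for both.
bucket_mount = tempfile.mkdtemp()   # e.g., a mounted training-data bucket
local_ssd = tempfile.mkdtemp()      # e.g., a Local SSD mount point

# Stage a couple of placeholder training files in the "bucket".
for name in ("shard-0000.tfrecord", "shard-0001.tfrecord"):
    with open(os.path.join(bucket_mount, name), "wb") as f:
        f.write(os.urandom(1024))

# Copy the dataset once before training so that every subsequent epoch
# reads from the low-latency Local SSD instead of remote storage.
dest = shutil.copytree(bucket_mount, os.path.join(local_ssd, "dataset"))
print(sorted(os.listdir(dest)))
```

For larger datasets, a parallel copy tool such as `gcloud storage cp` is usually faster than a single-threaded copy.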

For more information about Local SSD, see About Local SSDs. For more information about the amount of Local SSD capacity available with Cloud GPUs instance types, see GPU platforms.

Persistent Disk

Persistent Disk is a network block storage service with a comprehensive suite of data persistence and management capabilities. In addition to its use as a boot disk, you can use Persistent Disk with AI workloads, such as scratch storage. Persistent Disk is available in the following options:

  • Standard, which provides efficient and reliable block storage.
  • Balanced, which provides cost-effective and reliable block storage.
  • SSD, which provides fast and reliable block storage.
  • Extreme, which provides the highest performance block storage option with customizable IOPS.

For more information about Persistent Disk, see Persistent Disk.

Data transfer tools

When you perform AI and ML tasks, there are times when you need to copy your data from one location to another. For example, if your data starts in Cloud Storage, you might move it elsewhere to train the model, then copy the checkpoint snapshots or trained model back to Cloud Storage. You could also perform most of your tasks in Filestore, then move your data and model into Cloud Storage for archive purposes. This section discusses your options for moving data between storage services in Google Cloud.

Storage Transfer Service

With the Storage Transfer Service, you can transfer your data between Cloud Storage, Filestore, and NetApp Volumes. This fully managed service also lets you copy data between your on-premises file storage and object storage repositories, your Google Cloud storage, and other cloud providers. The Storage Transfer Service lets you copy your data securely from the source location to the target location, and perform periodic transfers of changed data. It also provides data integrity validation, automatic retries, and load balancing.

For more information about Storage Transfer Service, see What is Storage Transfer Service?

Command-line interface options

When you move data between Filestore and Cloud Storage, you can use the following tools:

  • gcloud storage (recommended): Create and manage Cloud Storage buckets and objects with optimal throughput and a full suite of gcloud CLI commands.
  • gsutil: Manage and maintain Cloud Storage components. Requires fine-tuning to achieve better throughput.
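
As an illustration of the recommended tool, the sketch below assembles a `gcloud storage` copy command for moving a prepared dataset into a bucket. The source path and bucket name are hypothetical; run the printed command in a shell or through `subprocess`.

```python
import shlex

# Hypothetical Filestore mount and destination bucket.
source_dir = "/mnt/filestore/dataset"
dest_bucket = "gs://example-training-data"

# `gcloud storage cp --recursive` copies a directory tree into a bucket.
cmd = ["gcloud", "storage", "cp", "--recursive", source_dir, dest_bucket]
print(shlex.join(cmd))
```

Building the command as a list and joining with `shlex` avoids quoting mistakes when paths contain spaces.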

Map your storage choices to the AI and ML stages

This section expands upon the summary provided in Table 1 to explore the specific recommendations and guidance for each stage of an AI and ML workload. The goal is to help you understand the rationale for these choices and select the best storage options for each AI and ML stage. This analysis results in three primary recommendations that are explored in the section, Storage recommendations for AI and ML.

The following figure provides a decision tree that shows the recommended storage options for the four main stages of an AI and ML workload. The diagram is followed by a detailed explanation of each stage and the choices that you can make at each stage.

Figure 3: Storage choices for each AI and ML stage

A decision tree that shows the recommended storage options for the four main stages of an AI and ML workload.

Prepare

At this initial stage, you need to select whether you want to use Cloud Storage or Filestore as your persistent source of truth for your data. You can also select potential optimizations for data-intensive training. Be aware that different teams in your organization can have varying workload and dataset types that might result in those teams making different storage decisions. To accommodate these varied needs, you can mix and match your storage choices between Cloud Storage and Filestore accordingly.

Cloud Storage for the prepare stage

You should select Cloud Storage to prepare your data if any of the following conditions apply:

  • Your workload contains large files of 50 MB or more.
  • Your workload requires lower IOPS.
  • Your workload can tolerate higher storage latency in the tens of milliseconds.
  • You need to gain access to the dataset through Cloud Storage APIs, or Cloud Storage FUSE and a subset of file APIs.

To optimize your workload in Cloud Storage, you can select regional storage and place your bucket in the same region as your compute resources. However, if you need higher reliability, or if you use accelerators located in two different regions, you'll want to select dual-region storage.

Filestore for the prepare stage

You should select Filestore to prepare your data if any of the following conditions apply:

  • Your workload contains smaller file sizes of less than 50 MB.
  • Your workload requires higher IOPS.
  • Your workload needs lower latency of less than 1 millisecond to meet storage requirements for random I/O and metadata access.
  • Your users need a desktop-like experience with full POSIX support to view and manage the data.
  • Your users need to perform other tasks, such as software development.

Other considerations for the prepare stage

If you find it hard to choose an option at this stage, consider the following points to help you make your decision:

  • If you want to use other frameworks on the dataset, such as Dataflow, Spark, or BigQuery, then Cloud Storage is a logical choice because of the custom integration it has with these types of frameworks.
  • Filestore has a maximum capacity of approximately 100 TB. If you need to train your model with datasets larger than this, or if you can't break the set into multiple 100 TB instances, then Cloud Storage is a better option.

During the data preparation phase, many users reorganize their data into large chunks to improve access efficiency and avoid random read requests. To further reduce the I/O performance requirements on the storage system, many users use pipelining, training optimization to increase the number of I/O threads, or both.
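
The reorganization step can be sketched with the standard library: pack many small sample files into a few large shard archives so that training reads become large sequential I/O. The shard size and file names below are arbitrary illustrative choices, not a Google Cloud convention.

```python
import os
import tarfile
import tempfile

src = tempfile.mkdtemp()   # directory of small source samples
out = tempfile.mkdtemp()   # directory for large shard archives

# Create 100 small placeholder samples.
for i in range(100):
    with open(os.path.join(src, f"sample-{i:04d}.bin"), "wb") as f:
        f.write(b"x" * 512)

samples = sorted(os.listdir(src))
shard_size = 25  # samples per shard; in practice, size shards to 50 MB+
shards = []
for s in range(0, len(samples), shard_size):
    shard_path = os.path.join(out, f"shard-{s // shard_size:05d}.tar")
    with tarfile.open(shard_path, "w") as tar:
        for name in samples[s:s + shard_size]:
            tar.add(os.path.join(src, name), arcname=name)
    shards.append(shard_path)

print(len(shards))  # 4 shards of 25 samples each
```

Training jobs then read whole shards sequentially instead of issuing one small random read per sample.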

Train

At the train stage, you typically reuse the primary storage option that you selected for the prepare stage. If your primary storage choice can't handle the training workload alone, you might need to supplement the primary storage. You can add supplemental storage as needed, such as Local SSDs, to balance the workload.

In addition to providing recommendations for using either Cloud Storage or Filestore at this stage, this section also provides you with more details about these recommendations. The details include the following:

  • Guidance for file sizes and request sizes
  • Suggestions on when to supplement your primary storage choice
  • An explanation of the implementation details for the two key workloads at this stage—data loading, and checkpointing and restart

Cloud Storage for the train stage

The main reasons to select Cloud Storage when training your data include the following:

  • If you use Cloud Storage when you prepare your data, it's best to keep your data in Cloud Storage for training.
  • Cloud Storage is a good choice for high aggregate throughput, for workloads that don't require high single-VM throughput, and for workloads that use many threads to increase throughput as needed.

Cloud Storage with Local SSD or Filestore for the train stage

The main reason to select Cloud Storage with Local SSD or Filestore when training your data occurs when you need to support small I/O requests or small files. In this case, you can supplement your Cloud Storage training task by moving some of the data to Local SSD or Filestore Zonal.

Filestore for the train stage

The main reasons to select Filestore when training your data include the following:

  • If you use Filestore when you prepare your data, in most cases you should continue to keep your data in Filestore for training.
  • Filestore is a good choice for low latency, high per-client throughput, and applications that use a low number of threads but still require high performance.
  • If you need to supplement your training tasks in Filestore, consider creating a Local SSD cache as needed.

File sizes and request sizes

Once the dataset is ready for training, there are two main access patterns that can help you evaluate the different storage options.

Data sets containing large files that are accessed with large request sizes

In this option, the training job consists primarily of larger files of 50 MB or more. The training job ingests the files with 1 MB to 16 MB per request. We generally recommend Cloud Storage with Cloud Storage FUSE for this option because the files are large enough that Cloud Storage should be able to keep the accelerators supplied. Keep in mind that you might need hundreds to thousands of threads to achieve maximum performance with this option.

However, if you require full POSIX APIs for other applications, or your workload isn't appropriate for the high number of required threads, then Filestore is a good alternative.
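
The thread-level parallelism described above can be sketched with a thread pool that issues fixed-size range reads against one large object. A local temporary file stands in for an object in a mounted bucket, and the 1 MiB request size follows the guidance above; real workloads would use far more threads.

```python
import concurrent.futures
import os
import tempfile

# Per-thread throughput to Cloud Storage tops out around 100-200 MB/s,
# so aggregate throughput comes from many parallel range reads.
path = os.path.join(tempfile.mkdtemp(), "large-object.bin")
data = os.urandom(4 * 1024 * 1024)  # 4 MiB stand-in for a large file
with open(path, "wb") as f:
    f.write(data)

chunk = 1024 * 1024  # 1 MiB request size

def read_range(offset: int) -> bytes:
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(chunk)

offsets = range(0, len(data), chunk)
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(read_range, offsets))  # preserves range order

assert b"".join(parts) == data
print(len(parts))  # 4 ranges read in parallel
```

The same structure scales to hundreds of threads; `pool.map` keeps the results in offset order so the object can be reassembled.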

Data sets containing small-to-medium sized files, or files accessed with small request sizes

With this option, you can classify your training job in one of two ways:

  • Many small-to-medium sized files of less than 50 MB.
  • A dataset with larger files, but the data is read sequentially or randomly with relatively small read request sizes (for example, less than 1 MB). An example of this use case is when the system reads less than 100 KB at a time from a multi-gigabyte or multi-terabyte file.

If you already use Filestore for its POSIX capabilities, then we recommend keeping your data in Filestore for training. Filestore offers low I/O latency access to the data. This lower latency can reduce the overall training time and might lower the cost of training your model.

If you use Cloud Storage to store your data, then we recommend that you copy your data to Local SSD or Filestore prior to training.

Data loading

During data loading, Cloud GPUs or Cloud TPUs import batches of data repeatedly to train the model. This phase can be cache friendly, depending on the size of the batches and the order in which you request them. Your goal at this point is to train the model with maximum efficiency but at the lowest cost.

If the size of your training data scales to petabytes, the data might need to be re-read multiple times. Such a scale requires intensive processing by a GPU or TPU accelerator. However, you need to ensure that your Cloud GPUs and Cloud TPUs aren't idle, but process your data actively. Otherwise, you pay for an expensive, idle accelerator while you copy the data from one location to another.
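
A common way to keep accelerators busy is to prefetch: while the accelerator consumes batch n, a background thread loads batch n+1. The sketch below is a minimal illustration of that overlap; `load_batch` and `train_step` are placeholders for your own I/O and training code, and the sleeps simulate storage and compute time.

```python
import concurrent.futures
import time

def load_batch(i: int) -> list:
    time.sleep(0.01)  # simulate storage I/O latency
    return [i] * 4    # placeholder batch

def train_step(batch: list) -> None:
    time.sleep(0.01)  # simulate accelerator compute

processed = []
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as loader:
    future = loader.submit(load_batch, 0)      # prefetch the first batch
    for i in range(1, 5):
        batch = future.result()                # wait for batch i-1
        future = loader.submit(load_batch, i)  # overlap the next load
        train_step(batch)                      # compute on current batch
        processed.append(batch[0])
    processed.append(future.result()[0])       # drain the final batch

print(processed)  # [0, 1, 2, 3, 4]
```

ML frameworks provide this pattern natively (for example, prefetching data loaders), but the principle is the same: storage reads and accelerator compute should overlap, not alternate.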

For data loading, consider the following:

  • Parallelism: There are numerous ways to parallelize training, and each can have an impact on the overall storage performance required and