Storage services

This document describes use cases and recommendations for storage services in artificial intelligence (AI) and machine learning (ML) workloads.

Storage use cases

Storage services might be used in the following AI and ML workloads:

  • Preparing and loading data for training
  • Loading model weights for inference
  • Saving and restoring model checkpoints
  • Loading VM images
  • Logging data
  • Home directories
  • Loading application libraries, packages, and dependencies

Storage recommendations

The following storage solutions are recommended for optimizing AI and ML system performance:

Cloud Storage

Overview: A highly scalable, highly durable, low-cost object store. It's suitable for storing the vast datasets required for training, for model checkpoints, and for hosting the final trained models. Cloud Storage with Cloud Storage FUSE is the recommended storage solution for most AI and ML use cases because it lets you scale your data storage more cost-efficiently than file system services.

  • Supports large-scale training data (up to exabytes) for GPU and TPU clusters.
  • Supports high throughput (1.25 TB/s of bandwidth or greater). To maximize your throughput in Cloud Storage, request more bandwidth.
  • Through integration with Cloud Storage FUSE, Cloud Storage buckets can be mounted as local file systems. The Cloud Storage FUSE CSI driver also lets you mount buckets as local file systems in Google Kubernetes Engine (GKE) for scaled AI and ML workloads.
  • Use Anywhere Cache to co-locate storage in the same zone as compute workloads, providing higher throughput (up to 2.5 TB/s), lower latency, and location flexibility when used with a multi-region bucket.
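Because Cloud Storage FUSE presents a bucket as a local directory, applications can read objects with ordinary file I/O and don't need an object-storage client library. The sketch below illustrates this; the mount point path is a hypothetical example, and the function itself is just plain POSIX-style file access.

```python
import os

def load_training_shards(mount_point):
    """Read training shards through a Cloud Storage FUSE mount.

    `mount_point` is the directory where the bucket is mounted
    (a hypothetical path such as /mnt/training-data). Objects in
    the bucket appear as ordinary files, so standard file reads
    work without any object-storage client library.
    """
    shards = []
    for name in sorted(os.listdir(mount_point)):
        path = os.path.join(mount_point, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                shards.append(f.read())
    return shards
```

The same code works unchanged against any mounted directory, which is what makes the FUSE integration convenient for training pipelines that were written against local file systems.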

Recommended for:

  • Cost efficiency
  • Data processing and preparation
  • Model training and inference
  • Saving and restoring model checkpoints

Not recommended for:

  • Applications that require full POSIX compliance
  • Home directories
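To make the "full POSIX compliance" caveat concrete: fully POSIX-compliant file systems support semantics such as advisory file locking, which applications like shared home directories can depend on, while object-store-backed mounts may not fully provide them. The following is a minimal sketch of that pattern using `fcntl.flock`; the function name and usage are illustrative, not part of any Cloud Storage API.

```python
import fcntl

def exclusive_update(path, data):
    """Append to a shared file under an exclusive advisory lock.

    Advisory locking (fcntl.flock) is one of the POSIX behaviors
    that a fully compliant file system provides. Code like this
    is a reason home directories and lock-dependent applications
    belong on a POSIX file system rather than an object store.
    """
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block until we hold the lock
        try:
            f.write(data)
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```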
Google Cloud Managed Lustre

Overview: A high-performance, fully managed parallel file system that's optimized for AI and high performance computing (HPC) applications. It's suited for environments in which multiple compute nodes need fast, consistent access to shared data for simulations, modeling, and analysis.

  • Scales to 1 PB (936 TiB) of capacity and up to 1 TB/s of throughput.
  • Supports thousands of IOPS per TiB.
  • Delivers ultra-low, sub-millisecond latency.
  • Offers full POSIX support, which enables out-of-the-box migration of on-premises AI workloads to Google Cloud.

Recommended for:

  • Migrating AI and ML workloads to the cloud
  • Model simulations
  • Model training and inference
  • Saving and restoring model checkpoints
  • Workloads with frequent small reads and writes
  • Home directories

Not recommended for:

  • Workloads that need more than 1 PB of data
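For the checkpointing use case, a POSIX file system such as a Managed Lustre mount supports the standard write-then-rename pattern, in which a rename is atomic, so a reader never observes a partially written checkpoint. A minimal sketch, with hypothetical file names and pickle standing in for a framework's serialization:

```python
import os
import pickle

def save_checkpoint(directory, state):
    """Atomically commit a checkpoint on a POSIX file system.

    Write to a temporary file, sync it, then rename it into
    place. On POSIX file systems, os.replace (rename) is atomic,
    so restores only ever see a fully written checkpoint.
    """
    tmp = os.path.join(directory, "ckpt.tmp")
    final = os.path.join(directory, "ckpt.pkl")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # persist the data before the rename
    os.replace(tmp, final)    # atomic commit
    return final

def restore_checkpoint(directory):
    """Load the most recently committed checkpoint."""
    with open(os.path.join(directory, "ckpt.pkl"), "rb") as f:
        return pickle.load(f)
```

The same pattern applies to any POSIX-compliant mount; frameworks typically wrap it, but it's useful to know what guarantee the file system is providing.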
Filestore (Zonal tier)

Overview: An easy-to-use, managed network-attached storage (NAS) service that's available over a Network File System (NFS) mount. It's optimal for a wide range of enterprise workloads where extreme parallel performance isn't the primary driver. Filestore isn't recommended for large-scale HPC or AI and ML workloads.

  • Scales to dozens of clients and 100 TB of capacity.
  • Offers full POSIX support, which enables out-of-the-box migration of on-premises AI workloads to Google Cloud.

Recommended for:

  • Home directories with small-scale development and test environments

Not recommended for:

  • Home directories in large-scale compute clusters
  • Production-scale HPC or AI and ML workloads
  • Data-intensive workloads