Storage services

This document describes use cases and recommendations for storage services in artificial intelligence (AI) and machine learning (ML) workloads.

Storage use cases

Storage services might be used in the following AI and ML workloads:

  • Preparing and loading data for training
  • Loading model weights for inference
  • Saving and restoring model checkpoints
  • Loading VM images
  • Logging data
  • Home directories
  • Loading application libraries, packages, and dependencies

Storage recommendations

The following storage solutions are recommended for optimizing AI and ML system performance:

Cloud Storage

Overview: A highly scalable, highly durable, low-cost object store. It's suitable for storing the vast datasets required for training, for model checkpoints, and for hosting the final trained models. Cloud Storage with Cloud Storage FUSE is the recommended storage solution for most AI and ML use cases because it lets you scale your data storage more cost-efficiently than file system services.

  • Supports large-scale training data (up to exabytes) for GPU and TPU clusters.
  • Supports high throughput (1.25 TB/s of bandwidth or greater). To maximize your throughput in Cloud Storage, request more bandwidth.
  • Through integration with Cloud Storage FUSE, Cloud Storage buckets can be mounted as local file systems. The Cloud Storage FUSE CSI driver also lets you mount buckets as local file systems in Google Kubernetes Engine (GKE) for scaled AI and ML workloads.
  • Use Anywhere Cache to co-locate storage in the same zone as compute workloads, providing higher throughput (up to 2.5 TB/s), lower latency, and location flexibility when used with a multi-region bucket.
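Because Cloud Storage FUSE presents a bucket as a local directory, applications can read objects with ordinary file I/O and don't need an object-storage client library. The sketch below illustrates this; the mount point path is a hypothetical example, and the function itself is just plain POSIX-style file access.

```python
import os

def load_training_shards(mount_point):
    """Read training shards through a Cloud Storage FUSE mount.

    `mount_point` is the directory where the bucket is mounted
    (a hypothetical path such as /mnt/training-data). Objects in
    the bucket appear as ordinary files, so standard file reads
    work without any object-storage client library.
    """
    shards = []
    for name in sorted(os.listdir(mount_point)):
        path = os.path.join(mount_point, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                shards.append(f.read())
    return shards
```

The same code works unchanged against any mounted directory, which is what makes the FUSE integration convenient for training pipelines that were written against local file systems.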

Recommended for:

  • Cost efficiency
  • Data processing and preparation
  • Model training and inference
  • Saving and restoring model checkpoints

Not recommended for:

  • Applications that require full POSIX compliance
  • Home directories
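To make the "full POSIX compliance" caveat concrete: fully POSIX-compliant file systems support semantics such as advisory file locking, which applications like shared home directories can depend on, while object-store-backed mounts may not fully provide them. The following is a minimal sketch of that pattern using `fcntl.flock`; the function name and usage are illustrative, not part of any Cloud Storage API.

```python
import fcntl

def exclusive_update(path, data):
    """Append to a shared file under an exclusive advisory lock.

    Advisory locking (fcntl.flock) is one of the POSIX behaviors
    that a fully compliant file system provides. Code like this
    is a reason home directories and lock-dependent applications
    belong on a POSIX file system rather than an object store.
    """
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block until we hold the lock
        try:
            f.write(data)
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```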
Google Cloud Managed Lustre

Overview: A high-performance, fully managed parallel file system that's optimized for AI and high performance computing (HPC) applications. It's suited for environments in which multiple compute nodes need fast, consistent access to shared data for simulations, modeling, and analysis.

  • Scales to 1 PB (936 TiB) of capacity and up to 1 TB/s of throughput.
  • Supports thousands of IOPS per TiB.
  • Delivers ultra-low, sub-millisecond latency.
  • Offers full POSIX support, which enables out-of-the-box migration of on-premises AI workloads to Google Cloud.

Recommended for:

  • Migrating AI and ML workloads to the cloud
  • Model simulations
  • Model training and inference
  • Saving and restoring model checkpoints
  • Workloads with frequent small reads and writes
  • Home directories

Not recommended for:

  • Workloads that need more than 1 PB of data
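For the checkpointing use case, a POSIX file system such as a Managed Lustre mount supports the standard write-then-rename pattern, in which a rename is atomic, so a reader never observes a partially written checkpoint. A minimal sketch, with hypothetical file names and pickle standing in for a framework's serialization:

```python
import os
import pickle

def save_checkpoint(directory, state):
    """Atomically commit a checkpoint on a POSIX file system.

    Write to a temporary file, sync it, then rename it into
    place. On POSIX file systems, os.replace (rename) is atomic,
    so restores only ever see a fully written checkpoint.
    """
    tmp = os.path.join(directory, "ckpt.tmp")
    final = os.path.join(directory, "ckpt.pkl")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # persist the data before the rename
    os.replace(tmp, final)    # atomic commit
    return final

def restore_checkpoint(directory):
    """Load the most recently committed checkpoint."""
    with open(os.path.join(directory, "ckpt.pkl"), "rb") as f:
        return pickle.load(f)
```

The same pattern applies to any POSIX-compliant mount; frameworks typically wrap it, but it's useful to know what guarantee the file system is providing.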
Filestore (Zonal tier)

Overview: An easy-to-use, managed network-attached storage (NAS) service that's available over a Network File System (NFS) mount. It's optimal for a wide range of enterprise workloads where extreme parallel performance isn't the primary driver. Filestore isn't recommended for large-scale HPC or AI and ML workloads.

  • Scales to dozens of clients and 100 TB of capacity.
  • Offers full POSIX support, which enables out-of-the-box migration of on-premises AI workloads to Google Cloud.

Recommended for:

  • Home directories with small-scale development and test environments

Not recommended for:

  • Home directories in large-scale compute clusters
  • Production-scale HPC or AI and ML workloads
  • Data-intensive workloads