Profile-based configurations for AI/ML workloads

This document describes how you can use profile-based configurations to streamline adoption and enhance the performance of Cloud Storage FUSE for your artificial intelligence or machine learning (AI/ML) workloads.

To streamline Cloud Storage FUSE configuration for your training, checkpointing, or serving workloads, you can apply a pre-configured profile based on your workload type by using the profile field in a Cloud Storage FUSE configuration file or the --profile command-line option. A profile applies a predefined, optimized set of Cloud Storage FUSE settings for caching, threading, and buffer sizes, ensuring high performance with minimal effort. The available profile values are aiml-training, aiml-checkpointing, and aiml-serving.
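
For example, you can apply a profile at mount time either directly on the command line or through a configuration file. The following is a minimal sketch; the bucket name, mount point, and configuration file path are placeholders:

  # Apply a profile by using the --profile option:
  gcsfuse --profile=aiml-training my-training-bucket /mnt/training-data

  # Or set the profile field in a configuration file and pass the file at mount time.
  # Here, config.yaml contains the line: profile: aiml-training
  gcsfuse --config-file=config.yaml my-training-bucket /mnt/training-data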

Considerations

  • You can set the --profile option or the profile field only during a mount operation. To update the option or field, you must remount your Cloud Storage FUSE bucket.

  • When you use profile-based configurations, Cloud Storage FUSE sets the metadata cache capacity and time to live (TTL) to unlimited, meaning that entries are never evicted from the metadata cache. If your virtual machine doesn't have enough memory, you might experience Out of Memory (OOM) errors. Therefore, we recommend reviewing your memory capacity before you apply profile-based configurations. OOM errors are more likely to occur on machines with less than one TiB of memory.

  • When a Cloud Storage FUSE parameter is configured in multiple ways, the following order of precedence applies (from highest to lowest):

    1. Values set directly in a gcsfuse command or a Cloud Storage FUSE configuration file.
    2. Values set by a profile, where the profile is specified using the --profile option in a gcsfuse command or the profile field in a Cloud Storage FUSE configuration file.
    3. Default values automatically applied when Cloud Storage FUSE detects a high-performance machine type. For more information, see Automated configuration values for high-performance machine types.
  • Cloud Storage FUSE CSI volumes in Google Kubernetes Engine Pods don't support the profile field or --profile option.

  • File caching can't be enabled by using profile-based configurations, because file caching requires Cloud Storage FUSE configuration fields and CLI options that can't be generalized. To enable file caching for serving, training, or checkpointing workloads, you must configure the file caching fields or options explicitly, as shown in the example after this list.
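
The following configuration file sketch illustrates both of the preceding points: an explicitly set field takes precedence over the value that a profile would apply, and file caching is enabled only because it's configured explicitly. The override value, cache directory path, bucket name, and mount point are illustrative assumptions, not recommendations:

  # config.yaml
  profile: aiml-serving
  # An explicitly set field takes precedence over the value set by the profile:
  metadata-cache:
    ttl-secs: 600
  # File caching must be enabled explicitly; the profile doesn't enable it:
  cache-dir: /mnt/local-ssd/gcsfuse-cache
  file-cache:
    max-size-mb: -1

Mounting with this file, for example with gcsfuse --config-file=config.yaml my-serving-bucket /mnt/models, applies the aiml-serving profile but uses the explicit metadata cache TTL and file cache settings shown above.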

Apply profile-based configurations for training workloads

The training-specific profile optimizes performance for high-throughput reads of large datasets and helps keep Cloud GPU and Cloud TPU hardware from waiting for data.

To apply the training-specific profile, specify either profile: aiml-training using a Cloud Storage FUSE configuration file or --profile=aiml-training using the Cloud Storage FUSE CLI. The following configurations are then applied:

  # Create implicit directories locally when accessed:
  - implicit-dirs
  # Disable caching for lookups of files or directories that don't exist:
  - metadata-cache:negative-ttl-secs:0
  # Keep cached metadata (file attributes, types) indefinitely time-wise:
  - metadata-cache:ttl-secs:-1
  # Allow unlimited size for the file attribute (stat) cache:
  - metadata-cache:stat-cache-max-size-mb:-1
  # Allow unlimited size for the file/directory type cache:
  - metadata-cache:type-cache-max-size-mb:-1
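
For reference, the settings that this profile applies correspond roughly to the following explicit configuration file fields. This is a sketch of the equivalent notation only; you don't need to write these fields yourself, because the aiml-training profile applies them for you:

  implicit-dirs: true
  metadata-cache:
    negative-ttl-secs: 0
    ttl-secs: -1
    stat-cache-max-size-mb: -1
    type-cache-max-size-mb: -1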

Apply profile-based configurations for checkpointing workloads

The checkpointing-specific profile optimizes performance for high-throughput writes of large files, drastically reducing the time it takes to save multi-gigabyte checkpoints and minimizing training pauses.

To apply the checkpointing-specific profile, specify either profile: aiml-checkpointing using a Cloud Storage FUSE configuration file or --profile=aiml-checkpointing using the Cloud Storage FUSE CLI. The following configurations are then applied:

  # Create implicit directories locally when accessed:
  - implicit-dirs
  # Disable caching for lookups of files/dirs that don't exist:
  - metadata-cache:negative-ttl-secs:0
  # Keep cached metadata (file attributes, types) indefinitely time-wise:
  - metadata-cache:ttl-secs:-1
  # Allow unlimited size for the file attribute (stat) cache:
  - metadata-cache:stat-cache-max-size-mb:-1
  # Allow unlimited size for the file/directory type cache:
  - metadata-cache:type-cache-max-size-mb:-1
  # Cache the entire file when any part is read sequentially:
  - file-cache:cache-file-for-range-read:true
  # Allow renaming directories that contain a large number of files in non-HNS buckets:
  - file-system:rename-dir-limit:200000
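
Compared with the training profile, the checkpointing profile adds the file-cache and file-system settings listed above. Written as explicit configuration file fields, the additional settings would look roughly like the following sketch; the profile applies them for you, so this is shown only for illustration:

  file-cache:
    cache-file-for-range-read: true
  file-system:
    rename-dir-limit: 200000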

Apply profile-based configurations for serving workloads

The serving-specific profile optimizes performance for serving workloads by improving data access and caching mechanisms.

To apply the serving-specific profile, specify either profile: aiml-serving using a Cloud Storage FUSE configuration file or --profile=aiml-serving using the Cloud Storage FUSE CLI. The following configurations are then applied:

  # Create implicit directories locally when accessed:
  - implicit-dirs
  # Disable caching for lookups of files/dirs that don't exist:
  - metadata-cache:negative-ttl-secs:0
  # Keep cached metadata (file attributes, types) indefinitely time-wise:
  - metadata-cache:ttl-secs:-1
  # Allow unlimited size for the file attribute (stat) cache:
  - metadata-cache:stat-cache-max-size-mb:-1
  # Allow unlimited size for the file/directory type cache:
  - metadata-cache:type-cache-max-size-mb:-1
  # Cache the entire file when any part is read sequentially:
  - file-cache:cache-file-for-range-read:true
  # Enable the kernel list cache indefinitely to make listing faster for a read-only file system hierarchy:
  - file-system:kernel-list-cache-ttl-secs:-1
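
For example, serving workloads such as loading model weights typically mount the bucket read-only, which is why the profile keeps the kernel list cache indefinitely. A minimal mount sketch, assuming a placeholder bucket and mount point and the standard FUSE read-only mount option:

  gcsfuse --profile=aiml-serving -o ro my-model-weights-bucket /mnt/models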

What's next