Profile-based configurations for AI/ML workloads

This document describes how you can use profile-based configurations to streamline adoption and enhance the performance of Cloud Storage FUSE for your artificial intelligence or machine learning (AI/ML) workloads.

To streamline Cloud Storage FUSE configuration for training, checkpointing, or serving workloads, you can apply a preconfigured profile that matches your workload type by using the profile field in a Cloud Storage FUSE configuration file or the --profile option in the Cloud Storage FUSE CLI. Each profile applies a predefined, optimized set of Cloud Storage FUSE settings for caching, threading, and buffer sizes, so you get high performance with minimal effort. The profile values are aiml-training for training workloads, aiml-checkpointing for checkpointing workloads, and aiml-serving for serving workloads.

Considerations

  • You can set the --profile option or profile field only during a mount operation. To change the profile, you must remount your Cloud Storage FUSE bucket.

  • When you use profile-based configurations, Cloud Storage FUSE sets the metadata cache capacity and time to live (TTL) to unlimited, meaning that entries are never evicted from the metadata cache. If your virtual machine doesn't have enough memory, you might experience out-of-memory (OOM) errors. We therefore recommend reviewing your memory capacity before you apply profile-based configurations. OOM errors are more likely to occur on machines with less than 1 TiB of memory.

  • When you specify configuration values using profiles, detected high-performance machine types, a gcsfuse command, or a Cloud Storage FUSE configuration file, the methods take precedence in the following order, where each method supersedes the methods below it:

    1. Values set as part of a gcsfuse command or a Cloud Storage FUSE configuration file.

    2. Values set as the argument to the --profile option in a gcsfuse command or the profile field in a Cloud Storage FUSE configuration file.

    3. Automated configuration values set when Cloud Storage FUSE detects that a high-performance machine type is being used. For more information, see Automated configuration values.

  • Cloud Storage FUSE CSI volumes in Google Kubernetes Engine Pods don't support the profile field or --profile option.

  • File caching cannot be enabled using profile-based configurations because file caching requires the use of Cloud Storage FUSE configuration fields and Cloud Storage FUSE CLI options that can't be generalized. To enable file caching for serving, training, or checkpointing workloads, you must configure file caching options or fields explicitly.
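
The precedence order described in this list can be modeled as a layered lookup. The following Python sketch is illustrative only and is not how Cloud Storage FUSE is implemented: the function name resolve_settings and all sample values (including the auto-detected one) are hypothetical, and only the ordering of the three sources comes from the list.

```python
from collections import ChainMap


def resolve_settings(explicit, profile, auto):
    """Merge three setting sources; earlier arguments win over later ones."""
    # ChainMap searches its mappings left to right, so a key found in
    # `explicit` hides the same key in `profile` and `auto`.
    return dict(ChainMap(explicit, profile, auto))


# Hypothetical sample values, keyed like the settings listed in this document.
explicit = {"metadata-cache:ttl-secs": 600}           # 1. command or config file
profile = {"metadata-cache:ttl-secs": -1,             # 2. values from a profile
           "metadata-cache:negative-ttl-secs": 0}
auto = {"metadata-cache:negative-ttl-secs": 5}        # 3. machine-type defaults

effective = resolve_settings(explicit, profile, auto)
print(effective["metadata-cache:ttl-secs"])           # 600
print(effective["metadata-cache:negative-ttl-secs"])  # 0
```

In this model, the explicitly set TTL of 600 supersedes the profile's -1, and the profile's negative TTL of 0 supersedes the hypothetical auto-detected value.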

Apply profile-based configurations for training workloads

The training-specific profile optimizes performance for high-throughput reads of large datasets, which helps prevent Cloud GPUs and Cloud TPU hardware from waiting for data.

To apply the training-specific profile, specify either profile=aiml-training using a Cloud Storage FUSE configuration file or --profile=aiml-training using the Cloud Storage FUSE CLI. The following configurations are then applied:

   # Create implicit directories locally when accessed:
   - implicit-dirs
   # Disable caching for lookups of files or directories that don't exist:
   - metadata-cache:negative-ttl-secs:0
   # Keep cached metadata (file attributes, types) indefinitely time-wise:
   - metadata-cache:ttl-secs:-1
   # Allow unlimited size for the file attribute (stat) cache:
   - metadata-cache:stat-cache-max-size-mb:-1
   # Allow unlimited size for the file/directory type cache:
   - metadata-cache:type-cache-max-size-mb:-1
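
As a minimal sketch of the CLI form, the following Python helper assembles a gcsfuse mount command that passes --profile. The helper name build_mount_command, the bucket name, and the mount point are hypothetical placeholders; only the --profile option and its three values come from this document.

```python
import shlex

# Profile values named in this document.
PROFILES = {"aiml-training", "aiml-checkpointing", "aiml-serving"}


def build_mount_command(profile, bucket, mount_point):
    """Return the argv list for mounting `bucket` with a profile applied."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return ["gcsfuse", f"--profile={profile}", bucket, mount_point]


# Placeholder bucket and mount point; replace with your own.
cmd = build_mount_command("aiml-training", "my-training-bucket", "/mnt/training-data")
print(shlex.join(cmd))
# gcsfuse --profile=aiml-training my-training-bucket /mnt/training-data
```

To perform the mount, pass the list to subprocess.run(cmd, check=True). Because the profile can be set only at mount time, changing it means unmounting the bucket and running the command again.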

Apply profile-based configurations for checkpointing workloads

The checkpointing-specific profile optimizes performance for high-throughput writes of large files. It reduces the time it takes to save multi-gigabyte checkpoints, which minimizes training pauses.

To apply the checkpointing-specific profile, specify either profile=aiml-checkpointing using a Cloud Storage FUSE configuration file or --profile=aiml-checkpointing using the Cloud Storage FUSE CLI. The following configurations are then applied:

  # Create implicit directories locally when accessed:
  - implicit-dirs
  # Disable caching for lookups of files/dirs that don't exist:
  - metadata-cache:negative-ttl-secs:0
  # Keep cached metadata (file attributes, types) indefinitely time-wise:
  - metadata-cache:ttl-secs:-1
  # Allow unlimited size for the file attribute (stat) cache:
  - metadata-cache:stat-cache-max-size-mb:-1
  # Allow unlimited size for the file/directory type cache:
  - metadata-cache:type-cache-max-size-mb:-1
  # Cache the entire file when any part is read sequentially:
  - file-cache:cache-file-for-range-read:true
  # Allow renaming directories with a lot of files in non-HNS buckets.
  - file-system:rename-dir-limit:200000
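
The configuration-file form can be sketched as follows. This example assumes that profile is a top-level field in the Cloud Storage FUSE configuration file and that the file is passed with the --config-file option; the file name, bucket, mount point, and the rename-dir-limit override value are hypothetical. The override illustrates an explicitly set value superseding the profile's value of 200000.

```python
import tempfile
from pathlib import Path

# Assumed YAML layout: `profile` at the top level, other fields nested.
CONFIG_TEXT = """\
profile: aiml-checkpointing
file-system:
  rename-dir-limit: 500000  # illustrative explicit override of the profile value
"""


def write_config(directory):
    """Write the sketch configuration into `directory` and return its path."""
    path = Path(directory) / "gcsfuse-checkpointing.yaml"
    path.write_text(CONFIG_TEXT)
    return path


config_path = write_config(tempfile.mkdtemp())
# Placeholder bucket and mount point; replace with your own.
cmd = ["gcsfuse", f"--config-file={config_path}", "my-checkpoint-bucket", "/mnt/checkpoints"]
print(" ".join(cmd))
```

If you omit the file-system block, the profile's own rename-dir-limit value applies.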

Apply profile-based configurations for serving workloads

The serving-specific profile optimizes performance for serving workloads by improving data access and caching mechanisms.

To apply the serving-specific profile, specify either profile=aiml-serving using a Cloud Storage FUSE configuration file or --profile=aiml-serving using the Cloud Storage FUSE CLI. The following configurations are then applied:

  # Create implicit directories locally when accessed:
  - implicit-dirs
  # Disable caching for lookups of files/dirs that don't exist:
  - metadata-cache:negative-ttl-secs:0
  # Keep cached metadata (file attributes, types) indefinitely time-wise:
  - metadata-cache:ttl-secs:-1
  # Allow unlimited size for the file attribute (stat) cache:
  - metadata-cache:stat-cache-max-size-mb:-1
  # Allow unlimited size for the file/directory type cache:
  - metadata-cache:type-cache-max-size-mb:-1
  # Cache the entire file when any part is read sequentially:
  - file-cache:cache-file-for-range-read:true
  # Keep kernel-cached directory listings indefinitely to speed up listing, because the file system hierarchy is read-only:
  - file-system:kernel-list-cache-ttl-secs:-1
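
Because profiles don't enable file caching, a serving mount that needs the file cache must request it explicitly alongside the profile. The sketch below assumes that the --cache-dir and --file-cache-max-size-mb CLI options are available in your Cloud Storage FUSE version; check the CLI reference before relying on them. The helper name, paths, bucket, and size value are hypothetical.

```python
def build_serving_command(bucket, mount_point, cache_dir, cache_size_mb):
    """Return argv for a serving mount with the file cache set explicitly."""
    return [
        "gcsfuse",
        "--profile=aiml-serving",
        # Assumed option names for explicit file cache configuration.
        f"--cache-dir={cache_dir}",
        f"--file-cache-max-size-mb={cache_size_mb}",
        bucket,
        mount_point,
    ]


# Placeholder values; size the cache to fit your local disk.
cmd = build_serving_command("my-model-bucket", "/mnt/models", "/var/cache/gcsfuse", 102400)
print(" ".join(cmd))
```

Values passed this way are explicit settings, so they take precedence over anything the aiml-serving profile sets.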

What's next