Compare Vertex AI custom training and Ray on Vertex AI
Vertex AI offers two options for custom training: Vertex AI custom training and Ray on Vertex AI. This page compares the two options to help you choose between them.
| | Vertex AI Training | Ray on Vertex AI |
| --- | --- | --- |
| Focus | General-purpose custom model training. | Scaling AI and Python applications, including model training, distributed applications, and model serving. |
| Underlying framework | Supports various ML frameworks: for example, TensorFlow, PyTorch, scikit-learn. | Uses the open-source Ray framework. Supports various frameworks: TensorFlow, PyTorch, scikit-learn, and Spark on Ray (using RayDP). |
| Flexibility | High flexibility in terms of code and environment. | High flexibility for building distributed applications; can use existing Ray code with minimal changes. |
| Scalability | Supports distributed training across multiple machines. Offers scalable compute resources (CPUs, GPUs, TPUs). | Designed for high scalability using Ray's distributed computing capabilities (up to 2,000 nodes). Supports manual scaling and autoscaling. |
| Integration | Integrated with other Vertex AI services (Datasets, Vertex AI Experiments, and more). | Integrates with other Google Cloud services such as Vertex AI Prediction and BigQuery. |
| Ease of use | Easier to use for standard distributed training paradigms. | Requires familiarity with Ray framework concepts. |
| Environment | Managed environment for running custom training code using prebuilt or custom containers. | Managed environment for running distributed applications using the Ray framework; simplifies management of the Ray cluster on Vertex AI. |
| Hyperparameter tuning | Includes hyperparameter tuning capabilities. | Simplifies hyperparameter tuning with tools for efficient optimization and experiment management. |
| Training pipelines | Supports complex ML workflows with multiple steps. | Not applicable. |
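For custom training, distribution is configured per job through worker pool specs. The sketch below builds the dict-based `worker_pool_specs` structure that `aiplatform.CustomJob` accepts; the image URI and machine choices are placeholders, so treat this as a shape reference rather than a ready-to-run job.

```python
# Sketch of the worker pool configuration a Vertex AI custom training job
# takes. The dict layout matches aiplatform.CustomJob's worker_pool_specs;
# the image URI and machine types below are placeholders.

def build_worker_pool_specs(image_uri, machine_type="n1-standard-4",
                            replica_count=1, accelerator_type=None,
                            accelerator_count=0):
    """Build a single-pool worker_pool_specs list for a CustomJob."""
    machine_spec = {"machine_type": machine_type}
    if accelerator_type:
        machine_spec["accelerator_type"] = accelerator_type
        machine_spec["accelerator_count"] = accelerator_count
    return [{
        "machine_spec": machine_spec,
        "replica_count": replica_count,
        "container_spec": {"image_uri": image_uri},
    }]

specs = build_worker_pool_specs(
    "us-docker.pkg.dev/my-project/my-repo/trainer:latest",  # placeholder
    machine_type="n1-standard-8",
    replica_count=2,
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)

# Submitting would then look roughly like the following (requires the
# google-cloud-aiplatform package and credentials, so it's commented out):
#
# from google.cloud import aiplatform
# aiplatform.init(project="my-project", location="us-central1")
# job = aiplatform.CustomJob(display_name="demo", worker_pool_specs=specs)
# job.run()
print(specs[0]["machine_spec"]["machine_type"])  # n1-standard-8
```

Each entry in the list is one worker pool, so a chief-plus-workers topology is just a two-element list with different `replica_count` and `machine_spec` values.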
Key differences between Vertex AI custom training and Ray on Vertex AI
Vertex AI custom training is a broader service managing various training methods, while Ray on Vertex AI specifically uses the Ray distributed computing framework.
| | Vertex AI Training | Ray on Vertex AI |
| --- | --- | --- |
| Focus | Primarily focused on model development and training. Manages various training methods. | Designed for general-purpose distributed Python applications, including data processing, model serving, and scaling training. |
| Underlying framework | Tied to the distributed capabilities of specific ML frameworks (for example, TensorFlow, PyTorch). | Uses Ray as the central distributed computing framework. Handles task distribution regardless of the underlying ML framework used within Ray tasks. |
| Resource configuration | Configure resources for individual training jobs. | Manage Ray clusters on Vertex AI; Ray handles the distribution of tasks within the cluster. |
| Configuration of distribution | Configure the number and types of replicas for a specific training job. | Configure the size and composition of the Ray cluster on Vertex AI; Ray's scheduler dynamically distributes tasks and actors across the available nodes. |
| Scope of distribution | Generally focused on a single, potentially long-running training job. | Provides a persistent, general-purpose distributed computing environment where you can run multiple distributed tasks and applications over the lifetime of the Ray cluster. |
Summary
If you want distributed computing with the Ray framework in the Google Cloud environment, use Ray on Vertex AI. It can be considered a specific tool within the larger Vertex AI ecosystem, particularly useful for highly scalable and distributed workloads.
If you need a more general-purpose managed platform for various model training approaches, including automated options, custom code execution, and hyperparameter tuning, use the broader Vertex AI custom training service.