Vertex AI custom training
Vertex AI custom training provides the following strengths and capabilities:
- Custom training: Lets you run your own training code in a managed environment, using either prebuilt or custom containers. You have control over the training process, libraries, and infrastructure.
- Hyperparameter tuning: Helps optimize model performance by automatically searching for the best hyperparameter configurations.
- Training pipelines: Lets you define and run complex ML workflows with multiple steps.
- Distributed training: Supports training machine learning models across multiple machines (nodes) in a managed training cluster to accelerate the training process for large datasets and complex models.
- Lifecycle: Provides a managed platform for the entire model training lifecycle, from data preparation to model evaluation.
- Scalability: Offers scalable compute resources, including CPUs, GPUs, and TPUs. You can configure the number and type of machines for your training jobs.
- Integration: Integrates with other Vertex AI services such as Vertex AI Datasets and Vertex AI Experiments.
- Flexibility: Provides high flexibility over both your training code and its runtime environment.
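To make the resource-configuration point concrete, the following is a minimal sketch of assembling the `worker_pool_specs` payload that a Vertex AI custom training job takes. The image URI, machine type, and arguments are placeholders; in a real project you would pass this list to `google.cloud.aiplatform.CustomJob` and call `job.run()`.

```python
# Sketch: build the worker_pool_specs list for a single-replica GPU
# custom container job. All names and URIs below are placeholders.

def build_worker_pool_specs(image_uri, machine_type="n1-standard-8",
                            accelerator_type="NVIDIA_TESLA_T4",
                            accelerator_count=1, replica_count=1,
                            args=None):
    """Assemble a worker pool spec for a custom-container training job."""
    return [
        {
            "machine_spec": {
                "machine_type": machine_type,
                "accelerator_type": accelerator_type,
                "accelerator_count": accelerator_count,
            },
            "replica_count": replica_count,
            "container_spec": {
                "image_uri": image_uri,
                "args": args or [],
            },
        }
    ]

specs = build_worker_pool_specs(
    "us-docker.pkg.dev/my-project/my-repo/trainer:latest",  # placeholder
    args=["--epochs", "10"],
)
print(specs[0]["machine_spec"]["machine_type"])  # n1-standard-8
```

Because the spec is plain data, you can vary machine count and accelerator type per job without changing your training code.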
Ray on Vertex AI
Ray on Vertex AI uses the open-source Ray framework to scale AI and Python applications within Vertex AI. It provides managed infrastructure for distributed computing and parallel processing, and can be used to scale training workloads, run distributed applications, and even serve models.
Ray on Vertex AI includes the following:
- Focus: Provides a managed environment for running distributed applications using the Ray framework. It simplifies the management of Ray clusters on Google Cloud.
- Scalability: Designed for high scalability using Ray's distributed computing capabilities. You can create Ray clusters with up to 2,000 nodes. Ray on Vertex AI supports both manual and autoscaling of Ray clusters based on resource needs.
- Ease of use: If you are already familiar with Ray, you can often use your existing Ray code with minimal changes. Vertex AI manages the underlying infrastructure for the Ray cluster. However, it might require a deeper understanding of Ray concepts compared to basic custom training jobs.
- Integration: Integrates with other Google Cloud services like Vertex AI Prediction and BigQuery. You can read and write data with BigQuery from your Ray cluster.
- Flexibility: Offers high flexibility for building distributed applications beyond just model training. Ray on Vertex AI supports various frameworks like TensorFlow, PyTorch, and scikit-learn. You can also run Spark on Ray using RayDP.
Key differences between Vertex AI custom training and Ray on Vertex AI
Vertex AI custom training is a broader service managing various training methods, while Ray on Vertex AI specifically uses the Ray distributed computing framework.
While Ray on Vertex AI can be used for scaling training, it's also designed for more general-purpose distributed Python applications, including data processing and model serving. Vertex AI training is primarily focused on model development and training.
Vertex AI training offers different levels of abstraction (AutoML being the highest, and custom training the lowest). Ray on Vertex AI provides a specific framework that requires understanding Ray's concepts.
With general Vertex AI custom training, you configure resources for individual training jobs. With Ray on Vertex AI, you manage Ray clusters, and Ray handles the distribution of tasks within the cluster.
Compare the distributed training options
The following compares Vertex AI custom training and Ray on Vertex AI with respect to distributed training.
Framework-centric versus Ray-centric
- Vertex AI distributed training is typically tied to the distributed capabilities of a specific ML framework (for example, TensorFlow or PyTorch). You configure Vertex AI to provide the necessary infrastructure for that framework to manage the distribution.
- Ray on Vertex AI uses Ray as the central distributed computing framework. You structure your application using Ray's primitives, and Ray handles the distribution of work across the cluster, regardless of the underlying ML framework used within the Ray tasks.
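As an illustration of the framework-centric approach, a training script on Vertex AI typically discovers its role in the cluster from an environment variable such as `CLUSTER_SPEC` that the service sets on each replica. The JSON shape used below is an assumption based on Vertex AI's documented format; verify the field names against the current docs before relying on them.

```python
import json
import os

# Assumed CLUSTER_SPEC-style payload (field names are an assumption,
# modeled on Vertex AI's documented distributed training format).
sample = {
    "cluster": {
        "workerpool0": ["host0:2222"],                 # primary replica
        "workerpool1": ["host1:2222", "host2:2222"],   # workers
    },
    "task": {"type": "workerpool1", "index": 1},
}
os.environ["CLUSTER_SPEC"] = json.dumps(sample)

def describe_role(env=os.environ):
    """Return (role, rank-within-pool, total node count) for this replica."""
    spec = json.loads(env["CLUSTER_SPEC"])
    task = spec["task"]
    world_size = sum(len(hosts) for hosts in spec["cluster"].values())
    role = "chief" if task["type"] == "workerpool0" else "worker"
    return role, task["index"], world_size

print(describe_role())  # ('worker', 1, 3)
```

The ML framework (TensorFlow, PyTorch) then uses this topology to set up its own collective communication; Vertex AI only provides the infrastructure and the topology description.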
Configuration of distribution
- In Vertex AI distributed training, you configure the number and types of replicas for a specific training job.
- With Ray on Vertex AI, you configure the size and composition of the Ray cluster, and then Ray's scheduler dynamically distributes tasks and actors across the available nodes.
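The two configuration styles can be contrasted as plain data. This is a hypothetical sketch (not the actual SDK APIs): with custom training you size one job's replicas, while with Ray on Vertex AI you size a cluster and let Ray's scheduler place work on it.

```python
# Hypothetical configuration sketch (not the actual SDK APIs).

# Per-job configuration: replicas belong to a single training job and
# exist only for that job's lifetime.
training_job_config = {
    "display_name": "my-training-job",   # placeholder name
    "replica_count": 4,
    "machine_type": "n1-standard-16",
}

# Cluster configuration: head and worker nodes outlive any single
# workload; Ray schedules tasks and actors onto them dynamically.
ray_cluster_config = {
    "head_node": {"machine_type": "n1-standard-16", "node_count": 1},
    "worker_nodes": [{"machine_type": "n1-standard-16", "node_count": 8}],
}

def total_nodes(cluster):
    """Count the nodes Ray's scheduler can place work on."""
    return cluster["head_node"]["node_count"] + sum(
        pool["node_count"] for pool in cluster["worker_nodes"]
    )

print(total_nodes(ray_cluster_config))  # 9
```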
Scope of distribution
- Vertex AI distributed training is generally focused on a single, potentially long-running training job.
- Ray on Vertex AI provides a more persistent and general-purpose distributed computing environment where you can run multiple distributed tasks and applications over the lifecycle of the Ray cluster.
Summary
If you want to use the Ray framework for distributed computing within Google Cloud, Ray on Vertex AI is the service to use. It can be considered a specific tool within the larger Vertex AI ecosystem, particularly useful for highly scalable and distributed workloads.
If you need a more general-purpose managed platform for various model training approaches, including automated options, custom code execution, and hyperparameter tuning, the broader Vertex AI custom training services are a better fit.