For custom training jobs that request GPU resources, Dynamic Workload Scheduler lets you schedule the jobs based on when the requested GPU resources become available. This page shows you how to schedule custom training jobs by using Dynamic Workload Scheduler, and how to customize the scheduling behavior on Vertex AI.
Recommended use cases
We recommend using Dynamic Workload Scheduler to schedule custom training jobs in the following situations:
- The custom training job requests A100 or H100 GPUs and you want to run the job as soon as the requested resources become available. For example, when Vertex AI allocates the GPU resources outside of peak hours.
- Your workload requires multiple nodes and can't start running until all GPU nodes are provisioned and ready at the same time. For example, you're creating a distributed training job.
Requirements
To use Dynamic Workload Scheduler, your custom training job must meet the following requirements:
- Your custom training job requests A100 or H100 GPUs.
- Your custom training job has a maximum timeout of 7 days or less.
- Your custom training job uses the same machine configuration for all worker pools.
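As a quick sanity check, the requirements above can be verified before you submit a job. The following sketch is illustrative only: the check_dws_requirements helper is not part of the Vertex AI SDK, and the accelerator enum names it allows are assumptions (confirm them against the AcceleratorType values in your SDK version).

```python
# Hypothetical pre-submit check for the Dynamic Workload Scheduler requirements.
# The accelerator names below are assumptions; verify against the AcceleratorType enum.
A100_H100_ACCELERATORS = {
    "NVIDIA_TESLA_A100",
    "NVIDIA_A100_80GB",
    "NVIDIA_H100_80GB",
}
MAX_TIMEOUT_SECONDS = 7 * 24 * 60 * 60  # 7 days


def check_dws_requirements(worker_pool_specs: list[dict], timeout_seconds: int) -> list[str]:
    """Return a list of requirement violations (empty if the job qualifies)."""
    problems = []

    machine_specs = [pool.get("machineSpec", {}) for pool in worker_pool_specs]

    # Requirement: the job requests A100 or H100 GPUs.
    if any(spec.get("acceleratorType") not in A100_H100_ACCELERATORS
           for spec in machine_specs):
        problems.append("every worker pool must request A100 or H100 GPUs")

    # Requirement: maximum timeout of 7 days or less.
    if timeout_seconds > MAX_TIMEOUT_SECONDS:
        problems.append("timeout must be 7 days or less")

    # Requirement: the same machine configuration for all worker pools.
    if any(spec != machine_specs[0] for spec in machine_specs[1:]):
        problems.append("all worker pools must use the same machine configuration")

    return problems
```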
Supported job types
All custom training job types are supported, including CustomJob, HyperparameterTuningJob, and TrainingPipeline.
Enable Dynamic Workload Scheduler in your custom training job
To enable Dynamic Workload Scheduler in your custom training job, set the scheduling.strategy API field to FLEX_START when you create the job.
For details on how to create a custom training job, see the following links.
Configure the duration to wait for resource availability
You can configure how long your job can wait for resources in the scheduling.maxWaitDuration field. A value of 0 means that the job waits indefinitely until the requested resources become available. The default value is 1 day.
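In the REST API, this duration is expressed as a string of seconds, such as "1800s" for 30 minutes. A small sketch of the conversion; the max_wait_duration_str helper is hypothetical, not part of any Google library:

```python
from datetime import timedelta


def max_wait_duration_str(wait: timedelta) -> str:
    """Format a wait duration as the seconds string the REST field expects, e.g. "1800s".

    Per the scheduling.maxWaitDuration semantics, "0s" means the job waits
    indefinitely until the requested resources become available.
    """
    seconds = int(wait.total_seconds())
    if seconds < 0:
        raise ValueError("wait duration can't be negative")
    return f"{seconds}s"
```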
Examples
The following examples show you how to enable Dynamic Workload Scheduler for a CustomJob.
Select the tab for the interface that you want to use.
gcloud
When submitting a job using the Google Cloud CLI, add the scheduling.strategy field in the config.yaml file.
Example YAML configuration file:
```yaml
workerPoolSpecs:
  machineSpec:
    machineType: n1-highmem-2
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/ucaip-test/ucaip-training-test
    args:
    - port=8500
    command:
    - start
scheduling:
  strategy: FLEX_START
  maxWaitDuration: 1800s
```
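You then submit the job with the gcloud ai custom-jobs create command, passing the configuration file with the --config flag. A sketch, assuming a region of us-central1 and a display name of my-dws-job (substitute your own values):

```shell
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=my-dws-job \
  --config=config.yaml
```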
Python
When submitting a job using the Vertex AI SDK for Python, set the scheduling_strategy field in the relevant CustomJob creation method.
```python
from typing import Optional

from google.cloud import aiplatform
from google.cloud.aiplatform_v1.types import custom_job as gca_custom_job_compat


def create_custom_job_with_dws_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    script_path: str,
    container_uri: str,
    service_account: str,
    experiment: str,
    experiment_run: Optional[str] = None,
) -> None:
    aiplatform.init(
        project=project,
        location=location,
        staging_bucket=staging_bucket,
        experiment=experiment,
    )

    job = aiplatform.CustomJob.from_local_script(
        display_name=display_name,
        script_path=script_path,
        container_uri=container_uri,
        enable_autolog=True,
    )

    job.run(
        service_account=service_account,
        experiment=experiment,
        experiment_run=experiment_run,
        max_wait_duration=1800,  # seconds
        scheduling_strategy=gca_custom_job_compat.Scheduling.Strategy.FLEX_START,
    )
```
REST
When submitting a job using the Vertex AI REST API, set the scheduling.strategy and scheduling.maxWaitDuration fields when creating your custom training job.
Example request JSON body:
```json
{
  "displayName": "MyDwsJob",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "a2-highgpu-1g",
          "acceleratorType": "NVIDIA_TESLA_A100",
          "acceleratorCount": 1
        },
        "replicaCount": 1,
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "python:3.10",
          "command": ["sleep"],
          "args": ["100"]
        }
      }
    ],
    "scheduling": {
      "maxWaitDuration": "1800s",
      "strategy": "FLEX_START"
    }
  }
}
```
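As a sketch, a request body like the one above can be sent with curl to the customJobs endpoint. LOCATION and PROJECT_ID are placeholders for your own values, and request.json is assumed to hold the JSON body:

```shell
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request.json \
  "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs"
```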
Quota
When you submit a job using Dynamic Workload Scheduler, instead of consuming on-demand Vertex AI quota, Vertex AI consumes preemptible quota. For example, for NVIDIA H100 GPUs, instead of consuming aiplatform.googleapis.com/custom_model_training_nvidia_h100_gpus, Vertex AI consumes aiplatform.googleapis.com/custom_model_training_preemptible_nvidia_h100_gpus.
However, the preemptible quota is used in name only. Your resources aren't preemptible and behave like standard resources.
Before submitting a job using Dynamic Workload Scheduler, ensure that your preemptible quotas have been increased to a sufficient amount. For details on Vertex AI quotas and instructions for making quota increase requests, see Vertex AI quotas and limits.
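Based on the H100 example above, the naming pattern appears mechanical: the preemptible metric inserts preemptible_ after the custom_model_training_ prefix. A small sketch of that mapping; the helper is illustrative, not part of any Google API, and you should confirm each metric name on the quotas page before relying on it:

```python
PREFIX = "aiplatform.googleapis.com/custom_model_training_"


def preemptible_quota_metric(on_demand_metric: str) -> str:
    """Return the preemptible quota metric a Dynamic Workload Scheduler job consumes.

    Assumes the preemptible metric name simply inserts "preemptible_" after the
    custom-model-training prefix, as in the H100 example.
    """
    if not on_demand_metric.startswith(PREFIX):
        raise ValueError(f"unexpected quota metric: {on_demand_metric}")
    return PREFIX + "preemptible_" + on_demand_metric[len(PREFIX):]
```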
Billing
You're charged only for the duration that the job is running and not for the time that the job is waiting for resources to become available. For details, see Pricing.
What's next
- Learn more about configuring compute resources for custom training jobs.
- Learn more about using distributed training for custom training jobs.
- Learn more about other scheduling options.