Schedule training jobs based on resource availability

For custom training jobs that request GPU resources, Dynamic Workload Scheduler lets you schedule the jobs based on when the requested GPU resources become available. This page shows you how to schedule custom training jobs by using Dynamic Workload Scheduler, and how to customize the scheduling behavior on Vertex AI.

We recommend using Dynamic Workload Scheduler to schedule custom training jobs in the following situations:

  • The custom training job requests A100 or H100 GPUs and you want to run the job as soon as the requested resources become available, for example when Vertex AI allocates the GPU resources outside of peak hours.
  • Your workload requires multiple nodes and can't start running until all GPU nodes are provisioned and ready at the same time, for example when you're creating a distributed training job.

Requirements

To use Dynamic Workload Scheduler, your custom training job must meet the following requirements:

  • Your custom training job has a maximum timeout of 7 days or less.
  • Your custom training job uses the same machine configuration for all worker pools.
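
The two requirements above can be checked before a job is submitted. The following is a minimal sketch, not part of the Vertex AI SDK: the function name check_dws_requirements is hypothetical, worker pools are assumed to be plain dictionaries shaped like the REST workerPoolSpecs entries, and "same machine configuration" is assumed to mean identical machineSpec values.

```python
from datetime import timedelta

# Limit taken from the requirements list above.
MAX_TIMEOUT = timedelta(days=7)


def check_dws_requirements(worker_pool_specs, timeout):
    """Return a list of problems; an empty list means none were found.

    Hypothetical helper: worker_pool_specs is a list of dicts that each
    carry a "machineSpec" key, and timeout is a datetime.timedelta.
    """
    problems = []

    # Requirement 1: the job timeout must be 7 days or less.
    if timeout > MAX_TIMEOUT:
        problems.append(f"timeout {timeout} exceeds the 7-day maximum")

    # Requirement 2: every worker pool must use the same machine
    # configuration (assumed here to mean equal machineSpec dicts).
    machine_specs = [pool.get("machineSpec", {}) for pool in worker_pool_specs]
    if any(spec != machine_specs[0] for spec in machine_specs[1:]):
        problems.append("worker pools use different machine configurations")

    return problems
```

An empty result only means that these two local checks passed; Vertex AI still performs its own validation when the job is created.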

Supported job types

All custom training job types are supported, including CustomJob, HyperparameterTuningJob, and TrainingPipeline.

Enable Dynamic Workload Scheduler in your custom training job

To enable Dynamic Workload Scheduler in your custom training job, set the scheduling.strategy API field to FLEX_START when you create the job.

For details on how to create a custom training job, see the following links.

Configure the duration to wait for resource availability

You can configure how long your job can wait for resources in the scheduling.maxWaitDuration field. A value of 0 means that the job waits indefinitely until the requested resources become available. The default value is 1 day.
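
As an illustration of how these two fields fit together, the following sketch builds the scheduling portion of a job specification as a plain dictionary. The helper name build_scheduling is hypothetical; the field names, the FLEX_START value, the seconds-based duration string (such as 1800s), and the 1-day default are taken from this page.

```python
from datetime import timedelta


def build_scheduling(max_wait=timedelta(days=1)):
    """Build the "scheduling" block for a job that uses FLEX_START.

    Hypothetical helper. max_wait defaults to 1 day, matching the
    documented default; timedelta(0) asks the job to wait indefinitely.
    """
    total_seconds = int(max_wait.total_seconds())
    if total_seconds < 0:
        raise ValueError("max_wait must not be negative")

    return {
        "strategy": "FLEX_START",
        # Durations are written as a number of seconds followed by "s".
        "maxWaitDuration": f"{total_seconds}s",
    }
```

For example, build_scheduling(timedelta(minutes=30)) produces the same scheduling values that the examples on this page use.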

Examples

The following examples show you how to enable Dynamic Workload Scheduler for a CustomJob by using the Google Cloud CLI, the Vertex AI SDK for Python, or the REST API.

gcloud

When submitting a job using the Google Cloud CLI, add the scheduling.strategy field in the config.yaml file.

Example YAML configuration file:

workerPoolSpecs:
  machineSpec:
    machineType: n1-highmem-2
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/ucaip-test/ucaip-training-test
    args:
    - port=8500
    command:
    - start
scheduling:
  strategy: FLEX_START
  maxWaitDuration: 1800s

Python

When submitting a job using the Vertex AI SDK for Python, set the scheduling_strategy parameter, and optionally max_wait_duration (in seconds), when you call the run method of the CustomJob.

from typing import Optional

from google.cloud import aiplatform
from google.cloud.aiplatform_v1.types import custom_job as gca_custom_job_compat

def create_custom_job_with_dws_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    script_path: str,
    container_uri: str,
    service_account: str,
    experiment: str,
    experiment_run: Optional[str] = None,
) -> None:
    aiplatform.init(project=project, location=location, staging_bucket=staging_bucket, experiment=experiment)

    job = aiplatform.CustomJob.from_local_script(
        display_name=display_name,
        script_path=script_path,
        container_uri=container_uri,
        enable_autolog=True,
    )

    job.run(
        service_account=service_account,
        experiment=experiment,
        experiment_run=experiment_run,
        max_wait_duration=1800,
        scheduling_strategy=gca_custom_job_compat.Scheduling.Strategy.FLEX_START
    )

REST

When submitting a job using the Vertex AI REST API, set the fields scheduling.strategy and scheduling.maxWaitDuration when creating your custom training job.

Example request JSON body:

{
  "displayName": "MyDwsJob",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "a2-highgpu-1g",
          "acceleratorType": "NVIDIA_TESLA_A100",
          "acceleratorCount": 1
        },
        "replicaCount": 1,
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "python:3.10",
          "command": [
            "sleep"
          ],
          "args": [
            "100"
          ]
        }
      }
    ],
    "scheduling": {
      "maxWaitDuration": "1800s",
      "strategy": "FLEX_START"
    }
  }
}
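
One way to send a body like this from Python is sketched below using only the standard library. The function name build_create_custom_job_request is hypothetical, and the endpoint URL is an assumption based on the usual regional Vertex AI v1 endpoint pattern for creating custom jobs; verify it against the REST reference before relying on it. The sketch only constructs the request and doesn't send it.

```python
import json
import urllib.request


def build_create_custom_job_request(project, location, access_token, body):
    """Build (but do not send) a POST request that creates a custom job.

    Hypothetical helper. The URL pattern is an assumption; body is a
    dict shaped like the example request JSON body above.
    """
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/customJobs"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the job. The access token is assumed to come from your usual Google Cloud credentials flow.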

Quota

When you submit a job using Dynamic Workload Scheduler, Vertex AI consumes preemptible quota instead of on-demand Vertex AI quota. For example, for NVIDIA H100 GPUs, instead of consuming:

aiplatform.googleapis.com/custom_model_training_nvidia_h100_gpus

Vertex AI consumes:

aiplatform.googleapis.com/custom_model_training_preemptible_nvidia_h100_gpus

However, the quota is preemptible in name only. Your resources aren't preemptible and behave like standard resources.

Before submitting a job using Dynamic Workload Scheduler, ensure that your preemptible quotas have been increased to a sufficient amount. For details on Vertex AI quotas and instructions for making quota increase requests, see Vertex AI quotas and limits.
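
To work out which quota to check, the naming pattern in the H100 example can be expressed as a small helper. This is a sketch only: the function name preemptible_quota_name is hypothetical, and applying the same pattern to accelerators other than the H100 is an assumption, so confirm the exact metric name on the quotas page.

```python
# Common prefix of the two quota names shown in the H100 example above.
PREFIX = "aiplatform.googleapis.com/custom_model_training_"


def preemptible_quota_name(on_demand_name):
    """Derive the preemptible quota name from an on-demand quota name.

    Hypothetical helper that inserts "preemptible_" after the common
    prefix, following the single H100 example on this page.
    """
    if not on_demand_name.startswith(PREFIX):
        raise ValueError(f"unexpected quota name: {on_demand_name}")

    suffix = on_demand_name[len(PREFIX):]
    if suffix.startswith("preemptible_"):
        # The name already refers to preemptible quota; return it unchanged.
        return on_demand_name
    return f"{PREFIX}preemptible_{suffix}"
```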

Billing

You're charged only for the duration that the job is running and not for the time that the job is waiting for resources to become available. For details, see Pricing.
