Schedule training jobs based on resource availability
For custom training jobs that request GPU resources, Dynamic Workload Scheduler lets you schedule the jobs based on when the requested GPU resources become available.
This page shows you how to schedule custom training jobs by using Dynamic Workload Scheduler,
and how to customize the scheduling behavior on Vertex AI.
Recommended use cases
We recommend using Dynamic Workload Scheduler to schedule custom training jobs in the following situations:
The custom training job requests L4, A100, H100, H200, or B200 GPUs and you want to run the
job as soon as the requested resources become available. For example, when Vertex AI allocates the GPU resources outside of peak hours.
Your workload requires multiple nodes and can't start running until all GPU nodes are provisioned and ready at the same time. For example,
you're creating a distributed training job.
Requirements
To use Dynamic Workload Scheduler, your custom training job must meet the following requirements:
Your custom training job requests L4, A100, H100, H200, or B200 GPUs.
Your custom training job has a maximum timeout of 7 days or less.
Your custom training job uses the same machine configuration for all worker pools.
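The requirements above can be sketched as a pre-submission check. This helper is not part of the Vertex AI SDK, and the accelerator type identifiers in the set are illustrative assumptions (only NVIDIA_TESLA_A100 appears in this page's examples); it only mirrors the three rules listed above.

```python
# Hypothetical pre-submission check (not a Vertex AI API): validates a job
# spec against the Dynamic Workload Scheduler requirements listed above.

SUPPORTED_GPUS = {
    "NVIDIA_L4", "NVIDIA_TESLA_A100", "NVIDIA_H100_80GB",  # names are assumed
    "NVIDIA_H200_141GB", "NVIDIA_B200",
}
MAX_TIMEOUT_SECONDS = 7 * 24 * 3600  # 7-day maximum timeout

def check_dws_requirements(worker_pool_specs, timeout_seconds):
    """Return a list of problems; an empty list means the spec qualifies."""
    problems = []
    machine_specs = [p["machineSpec"] for p in worker_pool_specs]
    if any(m.get("acceleratorType") not in SUPPORTED_GPUS for m in machine_specs):
        problems.append("all pools must request L4, A100, H100, H200, or B200 GPUs")
    if timeout_seconds > MAX_TIMEOUT_SECONDS:
        problems.append("timeout must be 7 days or less")
    # Every worker pool must use the same machine configuration.
    if len({tuple(sorted(m.items())) for m in machine_specs}) > 1:
        problems.append("all worker pools must use the same machine configuration")
    return problems

pools = [{"machineSpec": {"machineType": "a2-highgpu-1g",
                          "acceleratorType": "NVIDIA_TESLA_A100",
                          "acceleratorCount": 1}}]
print(check_dws_requirements(pools, timeout_seconds=3600))  # []
```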
Supported job types
All custom training job types are supported, including CustomJob,
HyperparameterTuningJob, and TrainingPipeline.
Enable Dynamic Workload Scheduler in your custom training job
To enable Dynamic Workload Scheduler in your custom training job, set the
scheduling.strategy API field to FLEX_START when you create the job.
For details on how to create a custom training job, see the following links:
Create a CustomJob
Create a HyperparameterTuningJob
Create a TrainingPipeline
Configure the duration to wait for resource availability
You can configure how long your job can wait for resources in the scheduling.maxWaitDuration field. A value of 0 means that the job waits
indefinitely until the requested resources become available. The default value is 1 day.
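In request bodies, maxWaitDuration is expressed as a count of seconds with an "s" suffix (for example, "1800s"). A minimal sketch of building the scheduling block, assuming you budget the wait in hours; the helper name is ours, not part of any SDK:

```python
# Minimal sketch: build a maxWaitDuration value for the job's scheduling
# block. The API expresses the duration in seconds with an "s" suffix
# (for example "1800s"); a value of 0 means wait indefinitely.

def max_wait_duration(hours: float) -> str:
    """Convert a wait budget in hours to the API's seconds-string form."""
    return f"{int(hours * 3600)}s"

scheduling = {
    "strategy": "FLEX_START",
    "maxWaitDuration": max_wait_duration(2),  # wait up to 2 hours
}
print(scheduling["maxWaitDuration"])  # 7200s
```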
Examples
The following examples show you how to enable Dynamic Workload Scheduler for a CustomJob.
Select the tab for the interface that you want to use.
gcloud
When submitting a job using the Google Cloud CLI, add the scheduling.strategy
field in the
config.yaml file.
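An example YAML configuration file (the machine type, image URI, and container arguments are placeholders to adapt to your own job):

```yaml
workerPoolSpecs:
  machineSpec:
    machineType: a2-highgpu-1g
    acceleratorType: NVIDIA_TESLA_A100
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/ucaip-test/ucaip-training-test
    args:
    - port=8500
    command:
    - start
scheduling:
  strategy: FLEX_START
  maxWaitDuration: 7200s
```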
REST
When submitting a job using the Vertex AI REST API, set the
scheduling.strategy and scheduling.maxWaitDuration fields when creating your
custom training job.
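An example request JSON body (display name, machine spec, and container spec are placeholders; the scheduling block is what enables Dynamic Workload Scheduler):

```json
{
  "displayName": "MyDwsJob",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "a2-highgpu-1g",
          "acceleratorType": "NVIDIA_TESLA_A100",
          "acceleratorCount": 1
        },
        "replicaCount": 1,
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "python:3.10",
          "command": ["sleep"],
          "args": ["100"]
        }
      }
    ],
    "scheduling": {
      "maxWaitDuration": "1800s",
      "strategy": "FLEX_START"
    }
  }
}
```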
Quota
When you submit a job using Dynamic Workload Scheduler, instead of consuming on-demand Vertex AI quota, Vertex AI consumes preemptible quota. For example, for Nvidia H100 GPUs, instead of consuming aiplatform.googleapis.com/custom_model_training_nvidia_h100_gpus, Vertex AI consumes aiplatform.googleapis.com/custom_model_training_preemptible_nvidia_h100_gpus.
However, preemptible quota is used in name only. Your resources aren't preemptible and behave like standard resources.
Before submitting a job using Dynamic Workload Scheduler, ensure that your preemptible quotas
have been increased to a sufficient amount. For details on Vertex AI quotas and instructions for making quota increase requests, see Vertex AI quotas and limits.
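The on-demand-to-preemptible naming pattern from the H100 example above can be sketched as a string transformation. This helper is illustrative only (not a Google API), and the pattern is confirmed here only for the H100 metric; verify the exact metric names for other GPU types on the quotas page.

```python
# Illustrative helper (not a Google API): derive the preemptible quota
# metric that a Dynamic Workload Scheduler job consumes from the
# corresponding on-demand metric, following the H100 example above.

def preemptible_quota_metric(on_demand_metric: str) -> str:
    prefix = "aiplatform.googleapis.com/custom_model_training_"
    if not on_demand_metric.startswith(prefix):
        raise ValueError(f"unexpected quota metric: {on_demand_metric}")
    # Insert "preemptible_" after the shared custom-training prefix.
    return prefix + "preemptible_" + on_demand_metric[len(prefix):]

print(preemptible_quota_metric(
    "aiplatform.googleapis.com/custom_model_training_nvidia_h100_gpus"
))
# aiplatform.googleapis.com/custom_model_training_preemptible_nvidia_h100_gpus
```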
Billing
You're charged only for the duration that the job is running and not for the
time that the job is waiting for resources to become available. For details, see Pricing.
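A toy worked example of the billing rule above; every number here (wait time, run time, hourly rate) is made up purely for illustration:

```python
# Sketch of the billing rule above: with Dynamic Workload Scheduler you pay
# for the time the job runs, not for the time it spends queued waiting for
# GPUs. All figures below are hypothetical.

wait_hours = 3.0       # job queued while waiting for GPU capacity (not billed)
run_hours = 5.0        # job actually training (billed)
rate_per_hour = 10.0   # hypothetical GPU price, $/hour

billed = run_hours * rate_per_hour  # waiting time incurs no charge
print(billed)  # 50.0
```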
Last updated 2025-09-04 UTC.