Create cluster

If you're interested in Vertex AI Managed Training, contact your sales representative for access.

This page provides the direct, API-driven method for creating and managing a Managed Training cluster. You'll learn how to define your cluster's complete configuration in a JSON file, including login nodes, high-performance GPU partitions such as A4, and Slurm orchestrator settings. You'll also learn how to use curl and REST API calls to deploy this configuration, creating the cluster and managing its lifecycle with GET, LIST, UPDATE, and DELETE operations.

Define the cluster configuration

Create a JSON file to define the complete configuration for your Managed Training cluster. The following examples provide templates for two common scenarios: a high-performance cluster utilizing GPU accelerators, and a general-purpose, CPU-only cluster. The first example configures an a4 partition for GPU workloads, while the second configures a cpu partition for CPU-based tasks. Choose the tab that best fits your workload and replace all the placeholder variables with the values for your environment.

If your organizational policy prohibits public IP addresses on compute instances, deploy the Managed Training cluster with enable_public_ips set to false and use Cloud NAT for internet egress.
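
For example, a node pool entry with public IPs disabled looks like the following. This is a minimal sketch that reuses the node pool fields from the full examples later on this page; set enable_public_ips to false on every node pool that must not receive a public address, and replace the remaining placeholders with values for your environment.

{
  "id": "login",
  "machine_spec": {
    "machine_type": "n2-standard-8"
  },
  "scaling_spec": {
    "min_node_count": MIN_NODE_COUNT,
    "max_node_count": MAX_NODE_COUNT
  },
  "enable_public_ips": false,
  "zone": "ZONE",
  "boot_disk": {
    "boot_disk_type": "pd-standard",
    "boot_disk_size_gb": 200
  }
}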

GPU with Filestore

To create a GPU-enabled cluster, you must choose a persistent storage configuration. The following tabs provide two options based on your workload's performance requirements.

Select the tab that matches your needs to get the complete JSON specification for creating your cluster.

Standard (Filestore Only)

This is the standard configuration. It provides a Filestore instance that serves as the /home directory for the cluster, suitable for general use and storing user data.

The following example shows the content of vmds-gpu-filestore.json. This specification creates a cluster with a GPU partition. You can use this as a template and modify values such as the machineType or nodeCount to fit your needs.

The following list describes each parameter used in the vmds-gpu-filestore.json example. The parameters are grouped by the resource they configure: general cluster settings, the login node pool, worker node pools, and storage.

General settings

  • DISPLAY_NAME: A unique name for your Managed Training cluster. The string can only contain lowercase alphanumeric characters and is limited to 10 characters. The first character must be a letter. The string should be human-readable and easy to identify.
  • PROJECT_ID: Your Google Cloud project ID.
  • REGION: The Google Cloud region for the cluster and its resources.
  • ZONE: The Google Cloud zone for the cluster and its resources.

Login node pool

  • MACHINE_TYPE: The machine type for the login node (for example, n2-standard-4).
  • MIN_NODE_COUNT: The minimum number of nodes in the login node pool.
  • MAX_NODE_COUNT: For the login node pool, the MAX_NODE_COUNT must be the same as the MIN_NODE_COUNT.
  • ENABLE_PUBLIC_IPS: A boolean (true or false) to determine if the login node has a public IP address.
  • BOOT_DISK_TYPE: The boot disk type for the login node (for example, pd-standard, pd-ssd).
  • BOOT_DISK_SIZE_GB: The boot disk size in GB for the login node.

Worker node pool configuration

  • ID: A unique identifier for the node pool within the cluster.
  • MACHINE_TYPE: The machine type for the worker nodes. Supported values are a3-megagpu-8g, a3-ultragpu-8g, and a4-highgpu-8g.
  • ACCELERATOR_TYPE: The GPU accelerator type that corresponds to the machine type: NVIDIA_H100_MEGA_80GB, NVIDIA_H200_141GB, or NVIDIA_B200, respectively.
  • ACCELERATOR_COUNT: The number of accelerators for the node pool.
  • PROVISIONING_MODEL: The consumption mode for the node pool. For example, ON_DEMAND, SPOT, RESERVATION, or FLEX_START.
  • RESERVATION_AFFINITY_TYPE: The reservation affinity type for the node pool (for example, SPECIFIC_RESERVATION).
  • RESERVATION: The name of the reservation for the node pool.
  • MIN_NODE_COUNT: The minimum number of nodes in the node pool.
  • MAX_NODE_COUNT: The maximum number of nodes in the node pool.
  • ENABLE_PUBLIC_IPS: A boolean (true or false) to determine if the worker nodes have public IP addresses.
  • BOOT_DISK_TYPE: The boot disk type for the node pool.
  • BOOT_DISK_SIZE_GB: The boot disk size in GB for the node pool.

Storage configuration

  • FILESTORE: The name of the Filestore instance to mount as the /home directory.

{
  "display_name": "DISPLAY_NAME",
  "name": "projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/",
  "network": {
    "network": "projects/PROJECT_ID/global/networks/NETWORK",
    "subnetwork": "projects/PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK"
  },
  "node_pools": [
    {
      "id": "login",
      "machine_spec": {
        "machine_type": "n2-standard-8"
      },
      "scaling_spec": {
        "min_node_count": MIN_NODE_COUNT,
        "max_node_count": MAX_NODE_COUNT
      },
      "enable_public_ips": true,
      "zone": "ZONE",
      "boot_disk": {
        "boot_disk_type": "pd-standard",
        "boot_disk_size_gb": 200
      }
    },
    {
      "id": "a4",
      "machine_spec": {
        "machine_type": "a4-highgpu-8g",
        "accelerator_type": "NVIDIA_B200",
        "provisioning_model": "RESERVATION",
        "accelerator_count": 8,
        "reservation_affinity": {
          "reservationAffinityType": "RESERVATION_AFFINITY_TYPE",
          "key": "compute.googleapis.com/reservation-name",
          "values": [
            "projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION"
          ]
        }
      },
      "scaling_spec": {
        "min_node_count": MIN_NODE_COUNT,
        "max_node_count": MAX_NODE_COUNT
      },
      "enable_public_ips": true,
      "zone": "ZONE",
      "boot_disk": {
        "boot_disk_type": "hyperdisk-balanced",
        "boot_disk_size_gb": 200
      }
    }
  ],
  "orchestrator_spec": {
    "slurm_spec": {
      "home_directory_storage": "projects/PROJECT_ID/locations/ZONE/instances/FILESTORE",
      "partitions": [
        {
          "id": "a4",
          "node_pool_ids": [
            "a4"
          ]
        }
      ],
      "login_node_pool_id": "login"
    }
  }
}

Advanced (Filestore + Lustre)

This advanced configuration includes the standard Filestore instance in addition to a high-performance Lustre file system. Choose this option if your training jobs require high-throughput access to large datasets.

The following list describes each parameter used in the vmds-gpu-filestore-lustre.json example. This advanced configuration adds one or more high-performance Lustre file systems to the standard Filestore setup. The parameters are grouped by the resource they configure: general cluster settings, the login node pool, worker node configuration, and storage configuration.

General settings

  • DISPLAY_NAME: A unique name for your Managed Training cluster. The string can only contain lowercase alphanumeric characters, must begin with a letter, and is limited to 10 characters.
  • PROJECT_ID: Your Google Cloud project ID.
  • REGION: The Google Cloud region where the cluster and its resources will be located.
  • ZONE: The Google Cloud zone where the cluster and its resources will be located.
  • NETWORK: The VPC network to use for the cluster's resources.
  • SUBNETWORK: The subnetwork to use for the cluster's resources.

Login node configuration

  • MACHINE_TYPE: The machine type for the login node (for example, n2-standard-4).
  • MIN_NODE_COUNT: For the login node pool, the MIN_NODE_COUNT must be the same as the MAX_NODE_COUNT.
  • MAX_NODE_COUNT: For the login node pool, the MAX_NODE_COUNT must be the same as the MIN_NODE_COUNT.
  • ENABLE_PUBLIC_IPS: A boolean (true or false) to determine if the login node has a public IP address.
  • BOOT_DISK_TYPE: The boot disk type for the login node (for example, pd-standard, pd-ssd).
  • BOOT_DISK_SIZE_GB: The boot disk size in GB for the login node.

Worker node configuration

  • ID: A unique identifier for the node pool within the cluster.
  • MACHINE_TYPE: The machine type for the worker node. Supported values are a3-megagpu-8g, a3-ultragpu-8g, and a4-highgpu-8g.
  • ACCELERATOR_TYPE: The corresponding GPU accelerator to attach to the worker nodes. Supported values are:
    • NVIDIA_H100_MEGA_80GB
    • NVIDIA_H200_141GB
    • NVIDIA_B200
  • ACCELERATOR_COUNT: The number of accelerators to attach to each worker node.
  • RESERVATION_AFFINITY_TYPE: The reservation affinity for the node pool (for example, SPECIFIC_RESERVATION).
  • RESERVATION_NAME: The name of the reservation to use for the node pool.
  • ENABLE_PUBLIC_IPS: A boolean (true or false) to determine if the worker node has a public IP address.
  • BOOT_DISK_TYPE: The boot disk type for the worker node.
  • BOOT_DISK_SIZE_GB: The boot disk size in GB for the worker nodes.

Storage configuration

  • FILESTORE: The name of the Filestore instance to mount as the /home directory.
  • LUSTRE: A list of pre-existing Lustre instances to mount on the cluster nodes for high-performance file access.

{
  "display_name": "DISPLAY_NAME",
  "name": "projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/",
  "network": {
    "network": "projects/PROJECT_ID/global/networks/NETWORK",
    "subnetwork": "projects/PROJECT_ID/regions/asia-sREGION/subnetworks/SUBNETWORK"
  },
  "node_pools": [
    {
      "id": "login",
      "machine_spec": {
        "machine_type": "n2-standard-8"
      },
      "scaling_spec": {
        "min_node_count": MIN_NODE_COUNT,
        "max_node_count": MAX_NODE_COUNT
      },
      "enable_public_ips": true,
      "zone": "ZONE",
      "boot_disk": {
        "boot_disk_type": "pd-standard",
        "boot_disk_size_gb": 200
      },
      "lustres": [
        "projects/PROJECT_ID/locations/ZONE/instances/LUSTRE"
      ]
    },
    {
      "id": "a4",
      "machine_spec": {
        "machine_type": "a4-highgpu-8g",
        "accelerator_type": "NVIDIA_B200",
        "accelerator_count": 8,
        "reservation_affinity": {
          "reservation_affinity_type": RESERVATION_AFFINITY_TYPE,
          "key": "compute.googleapis.com/reservation-name",
          "values": [
            "projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME"
          ]
        }
      },
      "provisioning_model": "RESERVATION",
      "scaling_spec": {
        "min_node_count": MIN_NODE_COUNT,
        "max_node_count": MAX_NODE_COUNT
      },
      "enable_public_ips": true,
      "zone": "ZONE",
      "boot_disk": {
        "boot_disk_type": "hyperdisk-balanced",
        "boot_disk_size_gb": 200
      },
      "lustres": [
        "projects/PROJECT_ID/locations/ZONE/instances/LUSTRE"
      ]
    }
  ],
  "orchestrator_spec": {
    "slurm_spec": {
      "home_directory_storage": "projects/PROJECT_ID/locations/ZONE/instances/FILESTORE",
      "partitions": [
        {
          "id": "a4",
          "node_pool_ids": [
            "a4"
          ]
        }
      ],
      "login_node_pool_id": "login"
    }
  }
}
  

CPU only cluster

To provision a Managed Training cluster, you must first define its complete configuration in a JSON file. This file acts as the blueprint for your cluster, specifying everything from its name and network settings to the hardware for its login and worker nodes.

The following section provides a detailed breakdown of each parameter available in the configuration file. For clarity, the parameters are organized into logical groups based on their function. Correctly defining these settings is the essential first step to deploying a cluster tailored to your specific training needs.

General settings

  • DISPLAY_NAME: A unique name for your Managed Training cluster. The string can only contain lowercase alphanumeric characters and is limited to 10 characters. The first character must be a letter. The string should be human-readable and easy to identify.
  • PROJECT_ID: Your Google Cloud project ID.
  • REGION: The Google Cloud region for the cluster and its resources.
  • ZONE: The Google Cloud zone for the cluster and its resources.

Network configuration

  • NETWORK: The Google Cloud VPC network for the cluster and its resources.
  • SUBNETWORK: The Google Cloud subnetwork for the cluster and its resources.

Storage configuration

  • FILESTORE: The name of the Filestore instance to mount as the /home directory.

Login node pool

  • MACHINE_TYPE: The machine type for the login node (for example, n2-standard-4).
  • MIN_NODE_COUNT: The minimum number of nodes in the login node pool.
  • MAX_NODE_COUNT: For the login node pool, the MAX_NODE_COUNT must be the same as the MIN_NODE_COUNT.
  • ENABLE_PUBLIC_IPS: A boolean (true or false) to determine if the login node has a public IP address.

Worker node pool configuration

  • MACHINE_TYPE: The machine type for the worker nodes, for example, a CPU machine type like n2-standard-4.
  • PROVISIONING_MODEL: The provisioning model for the node pool. Options include ON_DEMAND, RESERVATION, or SPOT. Defaults to ON_DEMAND for CPU machine types.
  • MIN_NODE_COUNT: The minimum number of nodes in the node pool.
  • MAX_NODE_COUNT: The maximum number of nodes in the node pool.
  • ENABLE_PUBLIC_IPS: A boolean (true or false) to determine if the worker nodes have public IP addresses.
  • BOOT_DISK_TYPE: The boot disk type for the node pool.
  • BOOT_DISK_SIZE_GB: The boot disk size in GB for the node pool.

Here's an example of the content of cpu.json. This definition creates a CPU partition. You can modify this specification to meet your needs, for example, by using a different machine type or node count.

{
  "display_name": "DISPLAY_NAME",
  "name": "projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/",
  "network": {
    "network": "projects/PROJECT_ID/global/networks/NETWORK",
    "subnetwork": "projects/PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK"
  },
  "node_pools": [
    {
      "id": "cpu",
      "machine_spec": {
        "machine_type": "n2-standard-8"
      },
      "scaling_spec": {
        "min_node_count": MIN_NODE_COUNT,
        "max_node_count": MAX_NODE_COUNT
      },
      "zone": "ZONE",
      "enable_public_ips": true,
      "boot_disk": {
        "boot_disk_type": "pd-standard",
        "boot_disk_size_gb": 120
      }
    },
    {
      "id": "login",
      "machine_spec": {
        "machine_type": "n2-standard-8",
      }
      "scaling_spec": {
        "min_node_count": MIN_NODE_COUNT,
        "max_node_count": MAX_NODE_COUNT
      },
      "zone": "ZONE",
      "enable_public_ips": true,
      "boot_disk": {
        "boot_disk_type": "pd-standard",
        "boot_disk_size_gb": 120
      }
    }
  ],
  "orchestrator_spec": {
    "slurm_spec": {
      "home_directory_storage": "projects/PROJECT_ID/locations/ZONE/instances/FILESTORE",
      "partitions": [
        {
          "id": "cpu",
          "node_pool_ids": [
            "cpu"
          ]
        }
      ],
      "login_node_pool_id": "login"
    }
  }
}

Once your cluster is defined in a JSON file, use the following REST API commands to deploy and manage the cluster. The examples use a gcurl alias, which is a convenient, authenticated shortcut for interacting with the API endpoints. These commands cover the full lifecycle, from initially deploying your cluster to updating it, getting its status, listing all clusters, and ultimately deleting the cluster.

Authentication

alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'
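
The alias attaches a fresh OAuth access token from your active gcloud credentials to every request. Before calling the API, you can confirm that your credentials can produce a token:

gcloud auth print-access-token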

Create a JSON file

Create a JSON file (for example, vmds-cpu.json) to specify the configuration for your Managed Training cluster.

Deploy the cluster

Before you run the deployment command, replace the following placeholder variables with values for your environment:

  • PROJECT_ID: Your Google Cloud project ID where the cluster will be created.
  • REGION: The Google Cloud region for the cluster and its resources.
  • ZONE: The Google Cloud zone where the cluster resources will be provisioned.
  • CLUSTER_ID: A unique identifier for your Managed Training cluster, which is also used as a prefix for naming related resources.

  gcurl -X POST -d @vmds-cpu.json https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters?model_development_cluster_id=CLUSTER_ID

Once the deployment starts, an Operation ID will be generated. Be sure to copy this ID. You'll need it to validate your cluster in the next step.

  gcurl -X POST -d @vmds-cpu.json https://us-central1-aiplatform.googleapis.com/v1beta1/projects/managedtraining-project/locations/us-central1/modelDevelopmentClusters?model_development_cluster_id=training
  {
    "name": "projects/1059558423163/locations/us-central1/operations/2995239222190800896",
    "metadata": {
      "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.CreateModelDevelopmentClusterOperationMetadata",
      "genericMetadata": {
        "createTime": "2025-10-24T14:16:59.233332Z",
        "updateTime": "2025-10-24T14:16:59.233332Z"
      },
      "progressMessage": "Create Model Development Cluster request received, provisioning..."
    }
  }

Validate cluster deployment

Track the deployment's progress using the operation ID provided when you deployed the cluster. For example, 2995239222190800896 is the operation ID in the example cited earlier.

    gcurl https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/operations/OPERATION_ID
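
To avoid re-running the command manually, you can poll the operation from your shell until the response reports completion. The following loop is a minimal sketch that assumes the gcurl alias defined earlier, a 60-second polling interval, and the same placeholder values as the command above.

    # Poll the long-running operation until the response contains "done": true.
    until gcurl https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/operations/OPERATION_ID | grep -q '"done": true'; do
      echo "Cluster provisioning still in progress..."
      sleep 60
    done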
    

In summary

Submitting your cluster configuration with the gcurl POST command initiates the provisioning of your cluster, which is an asynchronous, long-running operation. The API immediately returns a response containing an Operation ID. It's crucial to save this ID, since you'll use it in the following steps to monitor the deployment's progress, verify that the cluster has been created successfully, and manage its lifecycle.
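
The remaining lifecycle calls follow the same resource path as the create request. The following commands are a sketch of the typical GET, LIST, and DELETE patterns for this API surface; confirm the exact methods and any request fields against the API reference before relying on them.

    # Get the details and status of a single cluster.
    gcurl https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/CLUSTER_ID

    # List all clusters in the project and region.
    gcurl https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters

    # Delete a cluster. Like create, this returns a long-running operation that you can poll.
    gcurl -X DELETE https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/CLUSTER_ID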