Configurations

This document outlines the factors to consider when creating virtual machine (VM) instances and clusters that use Cluster Director. For more information about Cluster Director, see Cluster Director.

Overview

Consider the following factors when you create VMs and clusters:

The deployment type for your cluster. To learn more, see Deployment types.
The maintenance scheduling type. See Maintenance scheduling types.
The deployment tool. See Cluster deployment tool.
The provisioning model. See Provisioning models.

Reservation block deployment types

Cluster Director can provision blocks of densely allocated hosts. With dense deployments, you get the following benefits:

Hosts are allocated physically close to each other to minimize network hops, and are optimized for the lowest latency.
Non-blocking networking for consistent high-bandwidth, low-latency VM connectivity using Google's dynamic ML network fabric.
Access to network topology provides a hierarchical view of the relative proximity between VMs. This is useful for advanced job scheduling use cases.
Fine-grained, topology-aware placement when using orchestrators.
Fine-grained user control over maintenance schedules to maximize job scheduling and uptime while minimizing overall downtime.
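
As a rough illustration of how a scheduler might use a hierarchical topology view, the following Python sketch ranks VMs by how many topology levels they share with a given VM. The level names (block, sub-block, host), the VM names, and both helper functions are hypothetical and are not taken from the Cluster Director API; the sketch only assumes that each VM's position can be written as a path from the coarsest level to the finest.

```python
from typing import Dict, List, Sequence


def shared_levels(a: Sequence[str], b: Sequence[str]) -> int:
    """Count how many leading topology levels two VMs have in common."""
    depth = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break
        depth += 1
    return depth


def closest_peers(name: str, topology: Dict[str, Sequence[str]]) -> List[str]:
    """Return the other VMs, ordered from nearest to farthest from `name`."""
    origin = topology[name]
    others = [vm for vm in topology if vm != name]
    # More shared levels means closer placement, so sort in descending order.
    return sorted(others, key=lambda vm: -shared_levels(origin, topology[vm]))


# Hypothetical topology paths, ordered from coarsest to finest level.
vm_topology = {
    "vm-a": ("block-1", "subblock-1", "host-1"),
    "vm-b": ("block-1", "subblock-1", "host-2"),
    "vm-c": ("block-1", "subblock-2", "host-7"),
    "vm-d": ("block-2", "subblock-5", "host-3"),
}

print(closest_peers("vm-a", vm_topology))
```

A job scheduler could use a ranking like this to place tightly communicating tasks on VMs that share the most topology levels.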

Maintenance scheduling types

Cluster Director provides options for how you schedule host maintenance for VMs running in your cluster. VMs can either be grouped and have synchronized scheduling or can be loosely coupled and have independent scheduling.

Grouped maintenance scheduling

This maintenance scheduling type ensures that all VMs running the same workload have the same planned maintenance frequency, whether you provision them at the same time or individually. This tightly coupled maintenance is particularly useful for environments that use a job scheduler such as Slurm or Google Kubernetes Engine (GKE).

This maintenance scheduling type is ideal if you are running training or other highly parallelized computing workloads. It lets you optimize your jobs by giving you complete control over your used and unused capacity.

Independent maintenance scheduling

This maintenance scheduling type lets VMs have different maintenance schedules. It is ideal if you are running inference or limited-scale training, where the workloads run more efficiently when they have separate schedules. Limited-scale training workloads are small in scale, can tolerate lower SLOs, and run for short durations.

The independent maintenance scheduling type will be available in a future release.

Cluster deployment

Cluster Toolkit is open-source software offered by Google Cloud and is the recommended deployment tool for Cluster Director. Cluster Toolkit can deploy both GKE and Slurm clusters.

Alternatively, you can choose to provision your VM groups by using either the Bulk API or managed instance groups (MIGs). With these alternatives, you can incorporate your own workload scheduler as needed. You can also provision single VMs by using the instance create methods.

Provisioning models

You can choose from the following provisioning models to obtain resources to create VMs or clusters, based on the consumption option:

Reservation-bound: you use this model to reserve resources at a discounted price for a future date and duration. On your chosen date, use the reserved resources to create VMs or clusters.
Flex-start (Preview): you use this model to request discounted resources for up to seven days. Compute Engine provides your requested resources when they're available.
Spot: you use this model to immediately obtain deeply discounted resources based on availability. However, Compute Engine might stop or delete your VMs to reclaim capacity.

Reservation-bound provisioning model

The reservation-bound provisioning model links the VMs that you create to the capacity that you reserved in Cluster Director. When you reserve capacity, Compute Engine creates an empty reservation. Then, at the reservation start time, the following occurs:

Compute Engine adds your reserved number of VMs to the reservation. You have exclusive access to the reserved capacity until the reservation end time.
Google Cloud charges you for the reserved capacity until the end time, whether you use the capacity or not.

You can then use the reserved resources to create VMs without additional charges. You only pay for resources that aren't included in the reservation, such as disks or IP addresses.

To specify the reservation-bound provisioning model when creating VMs or MIGs, do the following:

In the Google Cloud console, in the Provisioning model list, select Reservation-bound.
In the gcloud CLI, include the --provisioning-model=RESERVATION_BOUND flag in the command.
In the Compute Engine API, include the "provisioningModel": "RESERVATION_BOUND" field in the request body.
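
The following Python sketch shows one way the API request body might be assembled. It is a minimal illustration rather than a complete request: the helper name and the example values are hypothetical, the fragment omits required parts such as disks and network interfaces, and it assumes that the provisioningModel field sits in the scheduling object of the instance resource and that a specific reservation is targeted through the reservationAffinity object.

```python
import json


def reservation_bound_body(name: str, zone: str, machine_type: str,
                           reservation: str) -> dict:
    """Build a partial request body for a reservation-bound VM (sketch only)."""
    return {
        "name": name,
        "machineType": f"zones/{zone}/machineTypes/{machine_type}",
        # Assumption: provisioningModel is set inside the scheduling object.
        "scheduling": {"provisioningModel": "RESERVATION_BOUND"},
        # Assumption: the VM consumes one specific, named reservation.
        "reservationAffinity": {
            "consumeReservationType": "SPECIFIC_RESERVATION",
            "key": "compute.googleapis.com/reservation-name",
            "values": [reservation],
        },
    }


# All values below are placeholders, not real resources.
body = reservation_bound_body(
    name="example-vm",
    zone="example-zone",
    machine_type="example-machine-type",
    reservation="example-reservation",
)
print(json.dumps(body, indent=2))
```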

For more information about setting these parameters when you create VMs or MIGs after you reserve capacity, see VM and cluster creation overview. If you use Cluster Toolkit to deploy your clusters, then the cluster blueprint sets the provisioning model for you.

Flex-start provisioning model

The flex-start provisioning model adds VMs to a MIG when your requested capacity is available. The MIG adds VMs simultaneously, preventing charges for partial capacity delivery. This model uses a secured capacity pool, increasing your chance to obtain GPUs.

To specify the flex-start provisioning model when creating an instance template for the MIG, do the following:

In the Google Cloud console, in the Provisioning model list, select Flex-start.
In the gcloud CLI, include the --provisioning-model=FLEX_START flag in the command.
In the Compute Engine API, include the "provisioningModel": "FLEX_START" field in the request body.
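
As a sketch of the API option, the following Python builds a partial instance template body for flex-start and rejects run durations longer than the seven-day limit described earlier. The helper name and example values are hypothetical. Placing provisioningModel under properties.scheduling and expressing the run duration through a maxRunDuration field are assumptions about the instance template resource, so verify them against the API reference before relying on them.

```python
import json

SECONDS_PER_DAY = 24 * 60 * 60
MAX_FLEX_START_DAYS = 7  # flex-start resources are requested for up to seven days


def flex_start_template(name: str, machine_type: str, run_days: float) -> dict:
    """Build a partial instance template body for flex-start (sketch only)."""
    if not 0 < run_days <= MAX_FLEX_START_DAYS:
        raise ValueError(
            f"run_days must be greater than 0 and at most {MAX_FLEX_START_DAYS}"
        )
    return {
        "name": name,
        "properties": {
            "machineType": machine_type,
            "scheduling": {
                "provisioningModel": "FLEX_START",
                # Assumption: the run duration is carried by maxRunDuration.
                "maxRunDuration": {"seconds": int(run_days * SECONDS_PER_DAY)},
            },
        },
    }


# Placeholder values; a real template also needs disks and network interfaces.
template = flex_start_template("example-template", "example-machine-type", 2)
print(json.dumps(template, indent=2))
```

Validating the duration on the client side surfaces an out-of-range request before it is sent.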

For more information about setting these parameters when you create VMs or clusters, see one of the following:

Create MIGs with resize requests
Create Slurm clusters
Create GKE clusters:
- Create a cluster with the default configuration
- Create a custom cluster

Spot provisioning model

The Spot provisioning model lets you create discounted VMs based on availability. However, Compute Engine might stop or delete VMs to reclaim capacity. This process is called preemption.

To specify the Spot provisioning model when creating VMs or MIGs, do the following:

In the Google Cloud console, in the Provisioning model list, select Spot.
In the gcloud CLI, include the --provisioning-model=SPOT flag in the command.
In the Compute Engine API, include the "provisioningModel": "SPOT" field in the request body.
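
Because Compute Engine might either stop or delete a preempted VM, a request can also state which of the two you prefer. The following Python sketch builds the scheduling portion of a request body for a Spot VM. The helper name and example VM name are hypothetical, and the sketch assumes that the preference is expressed through an instanceTerminationAction field next to provisioningModel in the scheduling object.

```python
import json

ALLOWED_TERMINATION_ACTIONS = ("STOP", "DELETE")


def spot_scheduling(termination_action: str = "STOP") -> dict:
    """Build the scheduling object for a Spot VM (sketch only)."""
    action = termination_action.upper()
    if action not in ALLOWED_TERMINATION_ACTIONS:
        raise ValueError(
            f"termination_action must be one of {ALLOWED_TERMINATION_ACTIONS}"
        )
    return {
        "provisioningModel": "SPOT",
        # Assumption: this field selects what happens to the VM on preemption.
        "instanceTerminationAction": action,
    }


# Placeholder request fragment; other required fields are omitted.
request_body = {
    "name": "example-spot-vm",
    "scheduling": spot_scheduling("DELETE"),
}
print(json.dumps(request_body, indent=2))
```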

For more information about setting these parameters when you create VMs or MIGs, see VM and cluster creation overview.

What's next?

Reserve capacity