Create and use Spot VMs


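This page explains how to create and manage Spot VMs, including the following:

  • How to create, start, and identify Spot VMs
  • How to detect, handle, and test preemption of Spot VMs
  • Best practices for Spot VMs

Spot VMs are virtual machine (VM) instances that use the Spot provisioning model. Spot VMs are available at a 60% to 91% discount compared to the price of standard VMs. However, Compute Engine might preempt Spot VMs at any time to reclaim the resources. Use Spot VMs only for fault-tolerant applications that can withstand VM preemption. Before you decide to create Spot VMs, make sure that your application can handle preemption.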
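As a rough illustration of the discount range above, the following sketch computes the span of hourly Spot prices implied by a 60% to 91% discount. The function name and the input price are hypothetical, and the actual discount for a given machine type and region has to be looked up in the pricing documentation.

```python
from __future__ import annotations


def spot_price_range(standard_hourly_price: float) -> tuple[float, float]:
    """Return the (lowest, highest) Spot price implied by a 60%-91% discount.

    Assumption: the discount applies uniformly to the standard hourly price.
    Real Spot prices vary by machine type and region.
    """
    lowest = standard_hourly_price * (1 - 0.91)   # largest discount
    highest = standard_hourly_price * (1 - 0.60)  # smallest discount
    return lowest, highest


if __name__ == "__main__":
    # Hypothetical standard price of 1.00 per hour.
    low, high = spot_price_range(1.00)
    print(f"Spot price between {low:.2f} and {high:.2f} per hour")
```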

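Before you begin

  • Read the conceptual documentation for Spot VMs:
    • Review the limitations and pricing of Spot VMs.
    • To prevent Spot VMs from consuming your quotas for standard VM CPUs, GPUs, and disks, consider requesting preemptible quota for Spot VMs.
  • If you haven't already, set up authentication. Authentication verifies your identity for access to Google Cloud services and APIs. To run code or samples from a local development environment, you can authenticate to Compute Engine by selecting one of the following options: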

    Select the tab for how you plan to use the samples on this page:

    Console

    When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.

    gcloud

    1. After installing the Google Cloud CLI, initialize it by running the following command:

      gcloud init

      If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

    2. Set a default region and zone.

    Terraform

    To use the Terraform samples on this page in a local development environment, install and initialize the gcloud CLI, and then set up Application Default Credentials with your user credentials.

    1. Install the Google Cloud CLI.
    2. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

    3. To initialize the gcloud CLI, run the following command:

      gcloud init
    4. If you're using a local shell, then create local authentication credentials for your user account:

      gcloud auth application-default login

      You don't need to do this if you're using Cloud Shell.

      If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

    For more information, see Set up authentication for a local development environment.

    REST

    To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.

      After installing the Google Cloud CLI, initialize it by running the following command:

      gcloud init

      If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

    For more information, see Authenticate for using REST in the Google Cloud authentication documentation.

Create a Spot VM

Create a Spot VM by using the Google Cloud console, the gcloud CLI, or the Compute Engine API. A Spot VM is any VM that is configured to use the Spot provisioning model:

  • VM provisioning model set to Spot in the Google Cloud console
  • --provisioning-model=SPOT in the gcloud CLI
  • "provisioningModel": "SPOT" in the Compute Engine API

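Console

  1. In the Google Cloud console, go to the Create an instance page.

    Go to Create an instance

  2. In the navigation menu, click Advanced. In the Advanced pane that appears, complete the following steps:

    1. In the Provisioning model section, select Spot in the VM provisioning model list.
    2. Optional: To select the termination action that occurs when Compute Engine preempts the VM, complete the following steps:

      1. Expand the VM provisioning model advanced settings section.
      2. In the On VM termination list, select one of the following options:
        • To stop the VM during preemption, select Stop (default).
        • To delete the VM during preemption, select Delete.
  3. Optional: Specify other configuration options. For more information, see Configuration options during instance creation.

  4. To create and start the VM, click Create.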

gcloud

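To create a VM from the gcloud CLI, use the gcloud compute instances create command. To create Spot VMs, you must include the --provisioning-model=SPOT flag. Optionally, you can also specify a termination action for Spot VMs by including the --instance-termination-action flag.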

gcloud compute instances create VM_NAME \
    --provisioning-model=SPOT \
    --instance-termination-action=TERMINATION_ACTION

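Replace the following:

  • VM_NAME: the name of the new VM.
  • TERMINATION_ACTION: Optional: the action that Compute Engine takes when it preempts the VM, either STOP (default behavior) or DELETE.

For more information about the options you can specify when creating a VM, see Configuration options during instance creation. For example, to create Spot VMs with a specified machine type and image, use the following command: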

gcloud compute instances create VM_NAME \
    --provisioning-model=SPOT \
    [--image=IMAGE | --image-family=IMAGE_FAMILY] \
    --image-project=IMAGE_PROJECT \
    --machine-type=MACHINE_TYPE \
    --instance-termination-action=TERMINATION_ACTION

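Replace the following:

  • VM_NAME: the name of the new VM.
  • IMAGE or IMAGE_FAMILY: specify one of the following:
    • IMAGE: a specific version of a public image. For example, --image=debian-10-buster-v20200309.
    • IMAGE_FAMILY: an image family. This creates the VM from the most recent, non-deprecated OS image. For example, if you specify --image-family=debian-10, Compute Engine creates a VM from the latest version of the OS image in the Debian 10 image family.
  • IMAGE_PROJECT: the project containing the image. For example, if you specify debian-10 as the image family, specify debian-cloud as the image project.
  • MACHINE_TYPE: the predefined or custom machine type for the new VM.
  • TERMINATION_ACTION: Optional: the action that Compute Engine takes when it preempts the VM, either STOP (default behavior) or DELETE.

    To get a list of the machine types available in a zone, use the gcloud compute machine-types list command with the --zones flag.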
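If you generate these commands from a script, the flag combinations described above can be assembled programmatically. The following is a minimal sketch, not an official helper: the function name build_spot_create_command and the example values are hypothetical, and only the flags shown on this page are used. It treats --image and --image-family as alternatives, as in the command syntax above, and leaves out --instance-termination-action when no action is given.

```python
from __future__ import annotations

import shlex


def build_spot_create_command(
    vm_name: str,
    termination_action: str | None = None,
    machine_type: str | None = None,
    image: str | None = None,
    image_family: str | None = None,
    image_project: str | None = None,
) -> list[str]:
    """Return the argv list for `gcloud compute instances create` with Spot flags."""
    if image and image_family:
        raise ValueError("Specify either image or image_family, not both.")
    if termination_action not in (None, "STOP", "DELETE"):
        raise ValueError("termination_action must be STOP or DELETE.")

    command = [
        "gcloud", "compute", "instances", "create", vm_name,
        "--provisioning-model=SPOT",  # required for a Spot VM
    ]
    if image:
        command.append(f"--image={image}")
    if image_family:
        command.append(f"--image-family={image_family}")
    if image_project:
        command.append(f"--image-project={image_project}")
    if machine_type:
        command.append(f"--machine-type={machine_type}")
    if termination_action:
        command.append(f"--instance-termination-action={termination_action}")
    return command


if __name__ == "__main__":
    # Example values are placeholders chosen for illustration.
    argv = build_spot_create_command(
        "example-spot-vm",
        termination_action="DELETE",
        machine_type="e2-medium",
        image_family="debian-11",
        image_project="debian-cloud",
    )
    print(shlex.join(argv))
```

Returning a list rather than a single string lets you pass the result directly to subprocess.run without shell quoting concerns.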

Terraform

You can use a Terraform resource to create a Spot VM by using a scheduling block.


resource "google_compute_instance" "spot_vm_instance" {
  name         = "spot-instance-name"
  machine_type = "f1-micro"
  zone         = "us-central1-c"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  scheduling {
    preemptible                 = true
    automatic_restart           = false
    provisioning_model          = "SPOT"
    instance_termination_action = "STOP"
  }

  network_interface {
    # A default network is created for all GCP projects
    network = "default"
    access_config {
    }
  }
}

REST

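To create a VM from the Compute Engine API, use the instances.insert method. You must specify a machine type and a name for the VM. Optionally, you can also specify an image for the boot disk.

To create Spot VMs, you must include the "provisioningModel": "SPOT" field. Optionally, you can also specify a termination action for Spot VMs by including the "instanceTerminationAction" field.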

POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances
{
 "machineType": "zones/ZONE/machineTypes/MACHINE_TYPE",
 "name": "VM_NAME",
 "disks": [
   {
     "initializeParams": {
       "sourceImage": "projects/IMAGE_PROJECT/global/images/IMAGE"
     },
     "boot": true
   }
 ],
 "scheduling":
 {
     "provisioningModel": "SPOT",
     "instanceTerminationAction": "TERMINATION_ACTION"
 },
 ...
}

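Replace the following:

  • PROJECT_ID: the project ID of the project to create the VM in.
  • ZONE: the zone to create the VM in. The zone must also support the machine type to use for the new VM.
  • MACHINE_TYPE: the predefined or custom machine type for the new VM.
  • VM_NAME: the name of the new VM.
  • IMAGE_PROJECT: the project containing the image. For example, if you specify family/debian-10 as the image family, specify debian-cloud as the image project.
  • IMAGE: specify one of the following:
    • A specific version of a public image. For example, a specific image is "sourceImage": "projects/debian-cloud/global/images/debian-10-buster-v20200309", where debian-cloud is the IMAGE_PROJECT.
    • An image family. This creates the VM from the most recent, non-deprecated OS image. For example, if you specify "sourceImage": "projects/debian-cloud/global/images/family/debian-10", where debian-cloud is the IMAGE_PROJECT, Compute Engine creates a VM from the latest version of the OS image in the Debian 10 image family.
  • TERMINATION_ACTION: Optional: the action that Compute Engine takes when it preempts the VM, either STOP (default behavior) or DELETE.

For more information about the options you can specify when creating a VM, see Configuration options during instance creation.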
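The request body above can also be built in code before you send it with an HTTP client of your choice. The following sketch only constructs the JSON body from the placeholders described in the list; the function name build_spot_insert_body is hypothetical. The fields that the example request elides with "..." (for instance, networking settings) are not filled in here, so treat the result as a starting point rather than a complete request.

```python
from __future__ import annotations

import json


def build_spot_insert_body(
    vm_name: str,
    zone: str,
    machine_type: str,
    image_project: str,
    image: str,
    termination_action: str | None = None,
) -> dict:
    """Build the instances.insert body fields shown on this page for a Spot VM."""
    scheduling = {"provisioningModel": "SPOT"}
    if termination_action is not None:
        if termination_action not in ("STOP", "DELETE"):
            raise ValueError("termination_action must be STOP or DELETE.")
        scheduling["instanceTerminationAction"] = termination_action

    return {
        "machineType": f"zones/{zone}/machineTypes/{machine_type}",
        "name": vm_name,
        "disks": [
            {
                "initializeParams": {
                    # `image` is either a specific image name or "family/<name>".
                    "sourceImage": f"projects/{image_project}/global/images/{image}"
                },
                "boot": True,
            }
        ],
        "scheduling": scheduling,
    }


if __name__ == "__main__":
    # Example values are placeholders chosen for illustration.
    body = build_spot_insert_body(
        "example-spot-vm", "us-central1-a", "e2-medium",
        "debian-cloud", "family/debian-11", termination_action="STOP",
    )
    print(json.dumps(body, indent=2))
```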

Go


import (
	"context"
	"fmt"
	"io"

	compute "cloud.google.com/go/compute/apiv1"
	"cloud.google.com/go/compute/apiv1/computepb"
	"google.golang.org/protobuf/proto"
)

// createSpotInstance creates a new Spot VM instance with Debian 11 operating system.
func createSpotInstance(w io.Writer, projectID, zone, instanceName string) error {
	// projectID := "your_project_id"
	// zone := "europe-central2-b"
	// instanceName := "your_instance_name"

	ctx := context.Background()
	imagesClient, err := compute.NewImagesRESTClient(ctx)
	if err != nil {
		return fmt.Errorf("NewImagesRESTClient: %w", err)
	}
	defer imagesClient.Close()

	instancesClient, err := compute.NewInstancesRESTClient(ctx)
	if err != nil {
		return fmt.Errorf("NewInstancesRESTClient: %w", err)
	}
	defer instancesClient.Close()

	req := &computepb.GetFromFamilyImageRequest{
		Project: "debian-cloud",
		Family:  "debian-11",
	}

	image, err := imagesClient.GetFromFamily(ctx, req)
	if err != nil {
		return fmt.Errorf("getImageFromFamily: %w", err)
	}

	diskType := fmt.Sprintf("zones/%s/diskTypes/pd-standard", zone)
	disks := []*computepb.AttachedDisk{
		{
			AutoDelete: proto.Bool(true),
			Boot:       proto.Bool(true),
			InitializeParams: &computepb.AttachedDiskInitializeParams{
				DiskSizeGb:  proto.Int64(10),
				DiskType:    proto.String(diskType),
				SourceImage: proto.String(image.GetSelfLink()),
			},
			Type: proto.String(computepb.AttachedDisk_PERSISTENT.String()),
		},
	}

	req2 := &computepb.InsertInstanceRequest{
		Project: projectID,
		Zone:    zone,
		InstanceResource: &computepb.Instance{
			Name:        proto.String(instanceName),
			Disks:       disks,
			MachineType: proto.String(fmt.Sprintf("zones/%s/machineTypes/%s", zone, "n1-standard-1")),
			NetworkInterfaces: []*computepb.NetworkInterface{
				{
					Network: proto.String("global/networks/default"),
				},
			},
			Scheduling: &computepb.Scheduling{
				ProvisioningModel: proto.String(computepb.Scheduling_SPOT.String()),
			},
		},
	}
	op, err := instancesClient.Insert(ctx, req2)
	if err != nil {
		return fmt.Errorf("insert: %w", err)
	}

	if err = op.Wait(ctx); err != nil {
		return fmt.Errorf("unable to wait for the operation: %w", err)
	}

	instance, err := instancesClient.Get(ctx, &computepb.GetInstanceRequest{
		Project:  projectID,
		Zone:     zone,
		Instance: instanceName,
	})

	if err != nil {
		return fmt.Errorf("createInstance: %w", err)
	}

	fmt.Fprintf(w, "Instance created: %v\n", instance)
	return nil
}

Java


import com.google.cloud.compute.v1.AccessConfig;
import com.google.cloud.compute.v1.AccessConfig.Type;
import com.google.cloud.compute.v1.Address.NetworkTier;
import com.google.cloud.compute.v1.AttachedDisk;
import com.google.cloud.compute.v1.AttachedDiskInitializeParams;
import com.google.cloud.compute.v1.ImagesClient;
import com.google.cloud.compute.v1.InsertInstanceRequest;
import com.google.cloud.compute.v1.Instance;
import com.google.cloud.compute.v1.InstancesClient;
import com.google.cloud.compute.v1.NetworkInterface;
import com.google.cloud.compute.v1.Scheduling;
import com.google.cloud.compute.v1.Scheduling.ProvisioningModel;
import java.io.IOException;
import java.util.UUID;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CreateSpotVm {
  public static void main(String[] args)
          throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "your-project-id";
    // Name of the virtual machine to create.
    String instanceName = "your-instance-name";
    // Name of the zone you want to use. For example: "us-west3-b"
    String zone = "your-zone";

    createSpotInstance(projectId, instanceName, zone);
  }

  // Create a new Spot VM instance with Debian 11 operating system.
  public static Instance createSpotInstance(String projectId, String instanceName, String zone)
          throws IOException, ExecutionException, InterruptedException, TimeoutException {
    String image;
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (ImagesClient imagesClient = ImagesClient.create()) {
      image = imagesClient.getFromFamily("debian-cloud", "debian-11").getSelfLink();
    }
    AttachedDisk attachedDisk = buildAttachedDisk(image, zone);
    String machineTypes = String.format("zones/%s/machineTypes/%s", zone, "n1-standard-1");

    // Send an instance creation request to the Compute Engine API and wait for it to complete.
    Instance instance =
            createInstance(projectId, zone, instanceName, attachedDisk, true, machineTypes, false);

    System.out.printf("Spot instance '%s' has been created successfully", instance.getName());

    return instance;
  }

  // disk: the AttachedDisk object describing the boot disk of the new instance.
  // isSpot: boolean value indicating if the new instance should be a Spot VM or not.
  // machineType: machine type of the VM being created. This value uses the
  //     following format: "zones/{zone}/machineTypes/{type_name}".
  //     For example: "zones/europe-west3-c/machineTypes/f1-micro"
  // externalAccess: boolean flag indicating if the instance should have an external IPv4
  //     address assigned.
  private static Instance createInstance(String projectId, String zone, String instanceName,
                                         AttachedDisk disk, boolean isSpot, String machineType,
                                         boolean externalAccess)
          throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (InstancesClient client = InstancesClient.create()) {
      Instance instanceResource =
              buildInstanceResource(instanceName, disk, machineType, externalAccess, isSpot);

      InsertInstanceRequest build = InsertInstanceRequest.newBuilder()
              .setProject(projectId)
              .setRequestId(UUID.randomUUID().toString())
              .setZone(zone)
              .setInstanceResource(instanceResource)
              .build();
      client.insertCallable().futureCall(build).get(60, TimeUnit.SECONDS);

      return client.get(projectId, zone, instanceName);
    }
  }

  private static Instance buildInstanceResource(String instanceName, AttachedDisk disk,
                                                String machineType, boolean externalAccess,
                                                boolean isSpot) {
    NetworkInterface networkInterface =
            networkInterface(externalAccess);
    Instance.Builder builder = Instance.newBuilder()
            .setName(instanceName)
            .addDisks(disk)
            .setMachineType(machineType)
            .addNetworkInterfaces(networkInterface);

    if (isSpot) {
      // Set the Spot VM setting
      Scheduling.Builder scheduling = builder.getScheduling()
              .toBuilder()
              .setProvisioningModel(ProvisioningModel.SPOT.name())
              .setInstanceTerminationAction("STOP");
      builder.setScheduling(scheduling);
    }

    return builder.build();
  }

  private static NetworkInterface networkInterface(boolean externalAccess) {
    NetworkInterface.Builder build = NetworkInterface.newBuilder()
            .setNetwork("global/networks/default");

    if (externalAccess) {
      AccessConfig.Builder accessConfig = AccessConfig.newBuilder()
              .setType(Type.ONE_TO_ONE_NAT.name())
              .setName("External NAT")
              .setNetworkTier(NetworkTier.PREMIUM.name());
      build.addAccessConfigs(accessConfig.build());
    }

    return build.build();
  }

  private static AttachedDisk buildAttachedDisk(String sourceImage, String zone) {
    AttachedDiskInitializeParams initializeParams = AttachedDiskInitializeParams.newBuilder()
            .setSourceImage(sourceImage)
            .setDiskSizeGb(10)
            .setDiskType(String.format("zones/%s/diskTypes/pd-standard", zone))
            .build();
    return AttachedDisk.newBuilder()
            .setInitializeParams(initializeParams)
            // Remember to set auto_delete to True if you want the disk to be deleted
            // when you delete your VM instance.
            .setAutoDelete(true)
            .setBoot(true)
            .build();
  }
}

Python

from __future__ import annotations

import re
import sys
from typing import Any
import warnings

from google.api_core.extended_operation import ExtendedOperation
from google.cloud import compute_v1


def get_image_from_family(project: str, family: str) -> compute_v1.Image:
    """
    Retrieve the newest image that is part of a given family in a project.

    Args:
        project: project ID or project number of the Cloud project you want to get image from.
        family: name of the image family you want to get image from.

    Returns:
        An Image object.
    """
    image_client = compute_v1.ImagesClient()
    # List of public operating system (OS) images: https://cloud.google.com/compute/docs/images/os-details
    newest_image = image_client.get_from_family(project=project, family=family)
    return newest_image


def disk_from_image(
    disk_type: str,
    disk_size_gb: int,
    boot: bool,
    source_image: str,
    auto_delete: bool = True,
) -> compute_v1.AttachedDisk:
    """
    Create an AttachedDisk object to be used in VM instance creation. Uses an image as the
    source for the new disk.

    Args:
         disk_type: the type of disk you want to create. This value uses the following format:
            "zones/{zone}/diskTypes/(pd-standard|pd-ssd|pd-balanced|pd-extreme)".
            For example: "zones/us-west3-b/diskTypes/pd-ssd"
        disk_size_gb: size of the new disk in gigabytes
        boot: boolean flag indicating whether this disk should be used as a boot disk of an instance
        source_image: source image to use when creating this disk. You must have read access to this disk. This can be one
            of the publicly available images or an image from one of your projects.
            This value uses the following format: "projects/{project_name}/global/images/{image_name}"
        auto_delete: boolean flag indicating whether this disk should be deleted with the VM that uses it

    Returns:
        AttachedDisk object configured to be created using the specified image.
    """
    boot_disk = compute_v1.AttachedDisk()
    initialize_params = compute_v1.AttachedDiskInitializeParams()
    initialize_params.source_image = source_image
    initialize_params.disk_size_gb = disk_size_gb
    initialize_params.disk_type = disk_type
    boot_disk.initialize_params = initialize_params
    # Remember to set auto_delete to True if you want the disk to be deleted when you delete
    # your VM instance.
    boot_disk.auto_delete = auto_delete
    boot_disk.boot = boot
    return boot_disk


def wait_for_extended_operation(
    operation: ExtendedOperation, verbose_name: str = "operation", timeout: int = 300
) -> Any:
    """
    Waits for the extended (long-running) operation to complete.

    If the operation is successful, it will return its result.
    If the operation ends with an error, an exception will be raised.
    If there were any warnings during the execution of the operation
    they will be printed to sys.stderr.

    Args:
        operation: a long-running operation you want to wait on.
        verbose_name: (optional) a more verbose name of the operation,
            used only during error and warning reporting.
        timeout: how long (in seconds) to wait for operation to finish.
            If None, wait indefinitely.

    Returns:
        Whatever the operation.result() returns.

    Raises:
        This method will raise the exception received from `operation.exception()`
        or RuntimeError if there is no exception set, but there is an `error_code`
        set for the `operation`.

        In case of an operation taking longer than `timeout` seconds to complete,
        a `concurrent.futures.TimeoutError` will be raised.
    """
    result = operation.result(timeout=timeout)

    if operation.error_code:
        print(
            f"Error during {verbose_name}: [Code: {operation.error_code}]: {operation.error_message}",
            file=sys.stderr,
            flush=True,
        )
        print(f"Operation ID: {operation.name}", file=sys.stderr, flush=True)
        raise operation.exception() or RuntimeError(operation.error_message)

    if operation.warnings:
        print(f"Warnings during {verbose_name}:\n", file=sys.stderr, flush=True)
        for warning in operation.warnings:
            print(f" - {warning.code}: {warning.message}", file=sys.stderr, flush=True)

    return result


def create_instance(
    project_id: str,
    zone: str,
    instance_name: str,
    disks: list[compute_v1.AttachedDisk],
    machine_type: str = "n1-standard-1",
    network_link: str = "global/networks/default",
    subnetwork_link: str = None,
    internal_ip: str = None,
    external_access: bool = False,
    external_ipv4: str = None,
    accelerators: list[compute_v1.AcceleratorConfig] = None,
    preemptible: bool = False,
    spot: bool = False,
    instance_termination_action: str = "STOP",
    custom_hostname: str = None,
    delete_protection: bool = False,
) -> compute_v1.Instance:
    """
    Send an instance creation request to the Compute Engine API and wait for it to complete.

    Args:
        project_id: project ID or project number of the Cloud project you want to use.
        zone: name of the zone to create the instance in. For example: "us-west3-b"
        instance_name: name of the new virtual machine (VM) instance.
        disks: a list of compute_v1.AttachedDisk objects describing the disks
            you want to attach to your new instance.
        machine_type: machine type of the VM being created. This value uses the
            following format: "zones/{zone}/machineTypes/{type_name}".
            For example: "zones/europe-west3-c/machineTypes/f1-micro"
        network_link: name of the network you want the new instance to use.
            For example: "global/networks/default" represents the network
            named "default", which is created automatically for each project.
        subnetwork_link: name of the subnetwork you want the new instance to use.
            This value uses the following format:
            "regions/{region}/subnetworks/{subnetwork_name}"
        internal_ip: internal IP address you want to assign to the new instance.
            By default, a free address from the pool of available internal IP addresses of
            used subnet will be used.
        external_access: boolean flag indicating if the instance should have an external IPv4
            address assigned.
        external_ipv4: external IPv4 address to be assigned to this instance. If you specify
            an external IP address, it must live in the same region as the zone of the instance.
            This setting requires `external_access` to be set to True to work.
        accelerators: a list of AcceleratorConfig objects describing the accelerators that will
            be attached to the new instance.
        preemptible: boolean value indicating if the new instance should be preemptible
            or not. Preemptible VMs have been deprecated and you should now use Spot VMs.
        spot: boolean value indicating if the new instance should be a Spot VM or not.
        instance_termination_action: What action should be taken once a Spot VM is terminated.
            Possible values: "STOP", "DELETE"
        custom_hostname: Custom hostname of the new VM instance.
            Custom hostnames must conform to RFC 1035 requirements for valid hostnames.
        delete_protection: boolean value indicating if the new virtual machine should be
            protected against deletion or not.
    Returns:
        Instance object.
    """
    instance_client = compute_v1.InstancesClient()

    # Use the network interface provided in the network_link argument.
    network_interface = compute_v1.NetworkInterface()
    network_interface.network = network_link
    if subnetwork_link:
        network_interface.subnetwork = subnetwork_link

    if internal_ip:
        network_interface.network_i_p = internal_ip

    if external_access:
        access = compute_v1.AccessConfig()
        access.type_ = compute_v1.AccessConfig.Type.ONE_TO_ONE_NAT.name
        access.name = "External NAT"
        access.network_tier = access.NetworkTier.PREMIUM.name
        if external_ipv4:
            access.nat_i_p = external_ipv4
        network_interface.access_configs = [access]

    # Collect information into the Instance object.
    instance = compute_v1.Instance()
    instance.network_interfaces = [network_interface]
    instance.name = instance_name
    instance.disks = disks
    if re.match(r"^zones/[a-z\d\-]+/machineTypes/[a-z\d\-]+$", machine_type):
        instance.machine_type = machine_type
    else:
        instance.machine_type = f"zones/{zone}/machineTypes/{machine_type}"

    instance.scheduling = compute_v1.Scheduling()
    if accelerators:
        instance.guest_accelerators = accelerators
        instance.scheduling.on_host_maintenance = (
            compute_v1.Scheduling.OnHostMaintenance.TERMINATE.name
        )

    if preemptible:
        # Set the preemptible setting
        warnings.warn(
            "Preemptible VMs are being replaced by Spot VMs.", DeprecationWarning
        )
        instance.scheduling = compute_v1.Scheduling()
        instance.scheduling.preemptible = True

    if spot:
        # Set the Spot VM setting
        instance.scheduling.provisioning_model = (
            compute_v1.Scheduling.ProvisioningModel.SPOT.name
        )
        instance.scheduling.instance_termination_action = instance_termination_action

    if custom_hostname is not None:
        # Set the custom hostname for the instance
        instance.hostname = custom_hostname

    if delete_protection:
        # Set the delete protection bit
        instance.deletion_protection = True

    # Prepare the request to insert an instance.
    request = compute_v1.InsertInstanceRequest()
    request.zone = zone
    request.project = project_id
    request.instance_resource = instance

    # Wait for the create operation to complete.
    print(f"Creating the {instance_name} instance in {zone}...")

    operation = instance_client.insert(request=request)

    wait_for_extended_operation(operation, "instance creation")

    print(f"Instance {instance_name} created.")
    return instance_client.get(project=project_id, zone=zone, instance=instance_name)


def create_spot_instance(
    project_id: str, zone: str, instance_name: str
) -> compute_v1.Instance:
    """
    Create a new Spot VM instance with Debian 11 operating system.

    Args:
        project_id: project ID or project number of the Cloud project you want to use.
        zone: name of the zone to create the instance in. For example: "us-west3-b"
        instance_name: name of the new virtual machine (VM) instance.

    Returns:
        Instance object.
    """
    newest_debian = get_image_from_family(project="debian-cloud", family="debian-11")
    disk_type = f"zones/{zone}/diskTypes/pd-standard"
    disks = [disk_from_image(disk_type, 10, True, newest_debian.self_link)]
    instance = create_instance(project_id, zone, instance_name, disks, spot=True)
    return instance

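To create multiple Spot VMs with the same properties, you can create an instance template and use the template to create a managed instance group (MIG). For more information, see best practices.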

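Start Spot VMs

Like other VMs, Spot VMs start upon creation. Likewise, if Spot VMs are stopped, you can restart the VMs to resume the RUNNING state. You can stop and restart preempted Spot VMs as many times as you want, as long as there is capacity. For more information, see VM instance life cycle.

If Compute Engine stops one or more Spot VMs in an autoscaling managed instance group (MIG) or Google Kubernetes Engine (GKE) cluster, the group restarts the VMs when the resources become available again.

Identify a VM's provisioning model and termination action

Identify a VM's provisioning model to see whether it is a standard VM, a Spot VM, or a preemptible VM. For a Spot VM, you can also identify the termination action. You can identify a VM's provisioning model and termination action by using the Google Cloud console, the gcloud CLI, or the Compute Engine API.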

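Console

  1. Go to the VM instances page.

    Go to the VM instances page

  2. Click the Name of the VM that you want to identify. The VM instance details page opens.

  3. Go to the Management section at the bottom of the page. In the Availability policies subsection, check the following options:

    • If the VM provisioning model is set to Spot, the VM is a Spot VM.
      • On VM termination indicates which action Compute Engine takes when it preempts the VM, either Stop or Delete the VM.
    • Otherwise, if the VM provisioning model is set to Standard or -:
      • If the Preemptibility option is set to On, the VM is a preemptible VM.
      • Otherwise, the VM is a standard VM.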

gcloud

To describe a VM from the gcloud CLI, use the gcloud compute instances describe command:

gcloud compute instances describe VM_NAME

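Replace VM_NAME with the name of the VM that you want to check.

In the output, check the scheduling field to identify the VM:

  • If the output includes the provisioningModel field set to SPOT, similar to the following, the VM is a Spot VM.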

    ...
    scheduling:
    ...
    provisioningModel: SPOT
    instanceTerminationAction: TERMINATION_ACTION
    ...
    

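    where TERMINATION_ACTION indicates which action Compute Engine takes when it preempts the VM, either stop (STOP) or delete (DELETE) the VM. If the instanceTerminationAction field is missing, the default value is STOP.

  • Otherwise, if the output includes the provisioningModel field set to STANDARD, or if the output omits the provisioningModel field:

    • If the output includes the preemptible field set to true, the VM is a preemptible VM.
    • Otherwise, the VM is a standard VM.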

REST

To describe a VM from the Compute Engine API, use the instances.get method:

GET https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME

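Replace the following:

  • PROJECT_ID: the project ID of the project that the VM is in.
  • ZONE: the zone where the VM is located.
  • VM_NAME: the name of the VM that you want to check.

In the output, check the scheduling field to identify the VM:

  • If the output includes the provisioningModel field set to SPOT, similar to the following, the VM is a Spot VM.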

    {
      ...
      "scheduling":
      {
         ...
         "provisioningModel": "SPOT",
         "instanceTerminationAction": "TERMINATION_ACTION"
         ...
      },
      ...
    }
    

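    where TERMINATION_ACTION indicates which action Compute Engine takes when it preempts the VM, either stop (STOP) or delete (DELETE) the VM. If the instanceTerminationAction field is missing, the default value is STOP.

  • Otherwise, if the output includes the provisioningModel field set to STANDARD, or if the output omits the provisioningModel field:

    • If the output includes the preemptible field set to true, the VM is a preemptible VM.
    • Otherwise, the VM is a standard VM.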
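The decision rules above can be sketched as a small function that takes the scheduling object from the instances.get response (or the parsed scheduling section of the gcloud output) as a dictionary. The function name classify_provisioning and the returned labels are hypothetical and chosen only for illustration.

```python
from __future__ import annotations


def classify_provisioning(scheduling: dict) -> tuple[str, str | None]:
    """Classify a VM from its `scheduling` fields.

    Returns (kind, termination_action), where kind is "spot", "preemptible",
    or "standard". The termination action is reported only for Spot VMs.
    """
    if scheduling.get("provisioningModel") == "SPOT":
        # A missing instanceTerminationAction field means the default, STOP.
        return "spot", scheduling.get("instanceTerminationAction", "STOP")

    # provisioningModel is set to another value or omitted entirely.
    if scheduling.get("preemptible") is True:
        return "preemptible", None
    return "standard", None


if __name__ == "__main__":
    print(classify_provisioning({"provisioningModel": "SPOT"}))
    print(classify_provisioning({"preemptible": True}))
    print(classify_provisioning({}))
```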

Go


import (
	"context"
	"fmt"
	"io"

	compute "cloud.google.com/go/compute/apiv1"
	"cloud.google.com/go/compute/apiv1/computepb"
)

// isSpotVM checks if a given instance is a Spot VM or not.
func isSpotVM(w io.Writer, projectID, zone, instanceName string) (bool, error) {
	// projectID := "your_project_id"
	// zone := "europe-central2-b"
	// instanceName := "your_instance_name"
	ctx := context.Background()
	client, err := compute.NewInstancesRESTClient(ctx)
	if err != nil {
		return false, fmt.Errorf("NewInstancesRESTClient: %w", err)
	}
	defer client.Close()

	req := &computepb.GetInstanceRequest{
		Project:  projectID,
		Zone:     zone,
		Instance: instanceName,
	}

	instance, err := client.Get(ctx, req)
	if err != nil {
		return false, fmt.Errorf("GetInstance: %w", err)
	}

	isSpot := instance.GetScheduling().GetProvisioningModel() == computepb.Scheduling_SPOT.String()

	var isSpotMessage string
	if !isSpot {
		isSpotMessage = " not"
	}
	fmt.Fprintf(w, "Instance %s is%s spot\n", instanceName, isSpotMessage)

	return isSpot, nil
}

Java


import com.google.cloud.compute.v1.Instance;
import com.google.cloud.compute.v1.InstancesClient;
import com.google.cloud.compute.v1.Scheduling;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

public class CheckIsSpotVm {
  public static void main(String[] args)
          throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "your-project-id";
    // Name of the virtual machine to check.
    String instanceName = "your-instance-name";
    // Name of the zone you want to use. For example: "us-west3-b"
    String zone = "your-zone";

    boolean isSpotVm = isSpotVm(projectId, instanceName, zone);
    System.out.printf("Is %s spot VM instance - %s", instanceName, isSpotVm);
  }

  // Check if a given instance is Spot VM or not.
  public static boolean isSpotVm(String projectId, String instanceName, String zone)
          throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (InstancesClient client = InstancesClient.create()) {
      Instance instance = client.get(projectId, zone, instanceName);

      return instance.getScheduling().getProvisioningModel()
              .equals(Scheduling.ProvisioningModel.SPOT.name());
    }
  }
}

Python

from google.cloud import compute_v1


def is_spot_vm(project_id: str, zone: str, instance_name: str) -> bool:
    """
    Check if a given instance is Spot VM or not.
    Args:
        project_id: project ID or project number of the Cloud project you want to use.
        zone: name of the zone you want to use. For example: "us-west3-b"
        instance_name: name of the virtual machine to check.
    Returns:
        The Spot VM status of the instance.
    """
    instance_client = compute_v1.InstancesClient()
    instance = instance_client.get(
        project=project_id, zone=zone, instance=instance_name
    )
    return (
        instance.scheduling.provisioning_model
        == compute_v1.Scheduling.ProvisioningModel.SPOT.name
    )

Manage preemption of Spot VMs

To learn how to manage preemption of Spot VMs, review the following sections:

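Handle preemption with a shutdown script

When Compute Engine preempts a Spot VM, you can use a shutdown script to try to perform cleanup actions before the VM is preempted. For example, you can gracefully stop a running process and copy a checkpoint file to Cloud Storage. Notably, the maximum length of the shutdown period is shorter for a preemption notice than for a user-initiated shutdown. For more information about the shutdown period for a preemption notice, see Preemption process in the Spot VMs conceptual documentation.

The following is an example of a shutdown script that you can add to a running Spot VM or add while creating a new Spot VM. This script runs when the VM starts to shut down, before the operating system's normal kill command stops all remaining processes. After gracefully stopping the desired program, the script performs a parallel upload of a checkpoint file to a Cloud Storage bucket.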

#!/bin/bash

MY_PROGRAM="PROGRAM_NAME" # For example, "apache2" or "nginx"
MY_USER="LOCAL_USER"
CHECKPOINT="/home/$MY_USER/checkpoint.out"
BUCKET_NAME="BUCKET_NAME" # For example, "my-checkpoint-files" (without gs://)

echo "Shutting down!  Seeing if ${MY_PROGRAM} is running."

# Find the newest copy of $MY_PROGRAM
PID="$(pgrep -n "$MY_PROGRAM")"

if [[ "$?" -ne 0 ]]; then
  echo "${MY_PROGRAM} not running, shutting down immediately."
  exit 0
fi

echo "Sending SIGINT to $PID"
kill -2 "$PID"

# Portable waitpid equivalent: poll until the process has exited
while kill -0 "$PID" 2>/dev/null; do
   sleep 1
done

echo "$PID is done, copying ${CHECKPOINT} to gs://${BUCKET_NAME} as ${MY_USER}"

su "${MY_USER}" -c "gcloud storage cp $CHECKPOINT gs://${BUCKET_NAME}/"

echo "Done uploading, shutting down."

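This script assumes the following:

  • The VM was created with at least read/write access to Cloud Storage. For instructions about how to create a VM with the appropriate scopes, see the authentication documentation.

  • You have an existing Cloud Storage bucket and permission to write to it.

To add this script to a VM, configure the script to work with an application on your VM and add it to the VM's metadata.

  1. Copy or download the shutdown script:

    • Copy the preceding shutdown script and replace the following:

      • PROGRAM_NAME is the name of the process or program you want to shut down. For example, apache2 or nginx.
      • LOCAL_USER is the username you are logged into the virtual machine as.
      • BUCKET_NAME is the name of the Cloud Storage bucket where you want to save the program's checkpoint file. Note that the bucket name does not start with gs:// in this case.
    • Download the shutdown script to your local workstation and then replace the following variables in the file:

      • [PROGRAM_NAME] is the name of the process or program you want to shut down. For example, apache2 or nginx.
      • [LOCAL_USER] is the username you are logged into the virtual machine as.
      • [BUCKET_NAME] is the name of the Cloud Storage bucket where you want to save the program's checkpoint file. Note that the bucket name does not start with gs:// in this case.
  2. Add the shutdown script to a new VM or an existing VM.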
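
The shutdown script in this section sends SIGINT to the program and waits for it to exit before uploading the checkpoint file, so the program itself has to respond to SIGINT by saving its state. The following Python sketch shows one way a worker could do that. The JSON checkpoint format, the function names, and the placeholder work loop are assumptions made for illustration; they are not part of Compute Engine.

```python
import json
import os
import signal
import tempfile
import threading

# Set by the signal handler; checked by the work loop.
stop_event = threading.Event()


def install_handler():
    """Register the SIGINT handler. Call once, from the main thread."""
    signal.signal(signal.SIGINT, lambda signum, frame: stop_event.set())


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so the file is never partial."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0}


def run(path, total_items):
    """Resume from the checkpoint and work until done or asked to stop."""
    state = load_checkpoint(path)
    while state["next_item"] < total_items and not stop_event.is_set():
        # Placeholder for the real work on item number state["next_item"].
        state["next_item"] += 1
    save_checkpoint(path, state)
    return state
```

A worker built this way would call install_handler() at startup and then run() with the same path that the shutdown script uses for CHECKPOINT. Because run() always starts from load_checkpoint(), a replacement VM that downloads the checkpoint file first picks up where the preempted VM stopped.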

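Detect preemption of Spot VMs

Determine if Spot VMs were preempted by Compute Engine using the Google Cloud console, the gcloud CLI, or the Compute Engine API.

Console

You can check if a VM was preempted by checking the system activity logs.

  1. In the Google Cloud console, go to the Logs page.

    Go to Logs

  2. Select your project and click Continue.

  3. Add compute.instances.preempted to the filter by label or text search field.

  4. Optionally, you can also enter a VM name if you want to see preemption operations for a specific VM.

  5. Press Enter to apply the specified filters. The Google Cloud console updates the list of logs to show only the operations where a VM was preempted.

  6. Select an operation in the list to see details about the VM that was preempted.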

gcloud

Use the gcloud compute operations list command with a filter parameter to get a list of preemption events in your project.

gcloud compute operations list \
    --filter="operationType=compute.instances.preempted"

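You can optionally use additional filter parameters to further scope the results. For example, to see preemption events only for the instances within a managed instance group, use the following command: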

gcloud compute operations list \
    --filter="operationType=compute.instances.preempted AND targetLink:instances/BASE_INSTANCE_NAME"

where BASE_INSTANCE_NAME is the base name specified as a prefix for the names of all the VMs in this managed instance group.

The output is similar to the following:

NAME                  TYPE                         TARGET                                        HTTP_STATUS STATUS TIMESTAMP
systemevent-xxxxxxxx  compute.instances.preempted  us-central1-f/instances/example-instance-xxx  200         DONE   2015-04-02T12:12:10.881-07:00

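An operation type of compute.instances.preempted indicates that the VM instance was preempted. You can use the gcloud compute operations describe command to get more information about a specific preemption operation.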

gcloud compute operations describe SYSTEM_EVENT \
    --zone=ZONE

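Replace the following:

  • SYSTEM_EVENT: the system event from the output of the gcloud compute operations list command, for example, systemevent-xxxxxxxx.
  • ZONE: the zone of the system event, for example, us-central1-f.

The output is similar to the following: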

...
operationType: compute.instances.preempted
progress: 100
selfLink: https://compute.googleapis.com/compute/v1/projects/my-project/zones/us-central1-f/operations/systemevent-xxxxxxxx
startTime: '2015-04-02T12:12:10.881-07:00'
status: DONE
statusMessage: Instance was preempted.
...
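
To reuse these filters in automation, one option is to assemble the filter expression and the gcloud argument list in code. The following Python sketch only builds the arguments shown in this section. The helper names are hypothetical, and the --format=json flag is added on the assumption that the caller wants machine-readable output.

```python
import json
import subprocess


def preemption_filter(base_instance_name=None):
    """Build the --filter expression for preemption operations."""
    expression = "operationType=compute.instances.preempted"
    if base_instance_name:
        # Narrow the results to VMs whose names start with the MIG base name.
        expression += f" AND targetLink:instances/{base_instance_name}"
    return expression


def list_preemptions_command(base_instance_name=None):
    """Return the gcloud invocation as an argument list (no shell quoting)."""
    return [
        "gcloud", "compute", "operations", "list",
        f"--filter={preemption_filter(base_instance_name)}",
        "--format=json",
    ]


def list_preemptions(base_instance_name=None):
    """Run gcloud and return the parsed operations. Requires the gcloud CLI."""
    result = subprocess.run(
        list_preemptions_command(base_instance_name),
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Passing the command as a list avoids shell quoting problems with the spaces inside the filter expression.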

REST

To get a list of recent system operations for a specific project and zone, use the zoneOperations.list method.

GET https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/operations

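Replace the following:

  • PROJECT_ID: your project ID.
  • ZONE: your zone.

To scope the response to show only preemption operations, you can add a filter to your API request:

operationType="compute.instances.preempted"

Alternatively, to see preemption operations for a specific VM, add a targetLink parameter to the filter:

operationType="compute.instances.preempted" AND
targetLink="https://www.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME"

Replace VM_NAME with the name of a specific VM in this zone and project. PROJECT_ID and ZONE are the same project ID and zone used in the request URL.

The response contains a list of recent operations. For example, a preemption operation looks similar to the following: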

{
  "kind": "compute#operation",
  "id": "15041793718812375371",
  "name": "systemevent-xxxxxxxx",
  "zone": "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-f",
  "operationType": "compute.instances.preempted",
  "targetLink": "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-f/instances/example-instance",
  "targetId": "12820389800990687210",
  "status": "DONE",
  "statusMessage": "Instance was preempted.",
  ...
}
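
A client that calls this endpoint needs to URL-encode the filter and then pick the preempted VMs out of the response. The following Python sketch does both with the standard library only. It assumes that the list response wraps the operations in an items array, and the function names are illustrative; sending the request with credentials is left out.

```python
from urllib.parse import urlencode

API_ROOT = "https://compute.googleapis.com/compute/v1"


def operations_list_url(project_id, zone, vm_name=None):
    """Build the list URL with a filter for preemption operations."""
    expression = 'operationType="compute.instances.preempted"'
    if vm_name:
        target = (
            "https://www.googleapis.com/compute/v1/projects/"
            f"{project_id}/zones/{zone}/instances/{vm_name}"
        )
        expression += f' AND targetLink="{target}"'
    base = f"{API_ROOT}/projects/{project_id}/zones/{zone}/operations"
    return f"{base}?{urlencode({'filter': expression})}"


def preempted_instances(response):
    """Return the VM names from the preemption operations in a response."""
    names = []
    for operation in response.get("items", []):
        if operation.get("operationType") == "compute.instances.preempted":
            # The VM name is the last path segment of targetLink.
            names.append(operation["targetLink"].rsplit("/", 1)[-1])
    return names
```

Checking operationType again in preempted_instances() keeps the function correct even when the request was sent without a filter.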

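Alternatively, you can determine if a VM was preempted from inside the VM itself. This is useful if you want a shutdown script to handle a shutdown caused by a Compute Engine preemption differently from a normal shutdown. To do this, check the metadata server for the preempted value in your VM's default metadata.

For example, use curl from within your VM to obtain the value for preempted: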

curl "http://metadata.google.internal/computeMetadata/v1/instance/preempted" -H "Metadata-Flavor: Google"
TRUE

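If this value is TRUE, the VM was preempted by Compute Engine; otherwise, it is FALSE.

If you want to use this outside of a shutdown script, you can append ?wait_for_change=true to the URL. This performs a hanging HTTP GET request that returns only when the metadata has changed and the VM has been preempted.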

curl "http://metadata.google.internal/computeMetadata/v1/instance/preempted?wait_for_change=true" -H "Metadata-Flavor: Google"
TRUE
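
An application running on the VM can make the same metadata request from code instead of calling curl. The following Python sketch uses only the standard library and mirrors the two curl commands in this section; the function names are illustrative, and the network call works only from inside a Compute Engine VM.

```python
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def build_request(wait_for_change=False):
    """Build the metadata request, optionally as a hanging GET."""
    url = METADATA_URL
    if wait_for_change:
        url += "?wait_for_change=true"
    # The metadata server rejects requests without this header.
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def parse_preempted(body):
    """Convert the TRUE/FALSE response body to a bool."""
    return body.strip().upper() == "TRUE"


def was_preempted(wait_for_change=False, timeout=None):
    """Query the metadata server. With wait_for_change, this call blocks."""
    request = build_request(wait_for_change)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_preempted(response.read().decode("utf-8"))
```

With wait_for_change=True the call blocks until the value changes, so a long-running program would typically run it in a background thread and leave timeout unset.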

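Test preemption settings

You can run simulated maintenance events on your VMs to force them to be preempted. Use this feature to test how your applications handle preemption of Spot VMs. To learn how to test maintenance events on your instances, see Simulate a host maintenance event.

You can also simulate a VM preemption by stopping the VM instance. You can use this approach instead of simulating a maintenance event, and it avoids quota limits.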

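Best practices

Here are some best practices to help you get the most out of Spot VMs.

  • Use instance templates. Rather than creating Spot VMs one at a time, you can use instance templates to create multiple Spot VMs with the same properties. Instance templates are required for using MIGs. Alternatively, you can create multiple Spot VMs using the bulk instance API.

  • Use MIGs to regionally distribute and automatically recreate Spot VMs. Use MIGs to make workloads on Spot VMs more flexible and resilient. For example, use regional MIGs to distribute VMs across multiple zones, which helps mitigate resource-availability errors. Additionally, use autohealing to automatically recreate Spot VMs after they are preempted.

  • Pick smaller machine types. Resources for Spot VMs come out of excess and backup Google Cloud capacity. Capacity for Spot VMs is often easier to get for smaller machine types, meaning machine types with fewer resources such as vCPUs and memory. You might find more capacity for Spot VMs by selecting a smaller custom machine type, but capacity is even more likely for smaller predefined machine types. For example, compared to capacity for the n2-standard-32 predefined machine type, capacity for the n2-custom-24-96 custom machine type is more likely, but capacity for the n2-standard-16 predefined machine type is even more likely.

  • Run large clusters of Spot VMs during off-peak times. The load on Google Cloud data centers varies with location and time of day, but is generally lowest on nights and weekends. As such, nights and weekends are the best times to run large clusters of Spot VMs.

  • Design your applications to be fault tolerant and preemption tolerant. Be prepared for preemption patterns to change at different points in time. For example, if a zone suffers a partial outage, large numbers of Spot VMs could be preempted to make room for standard VMs that need to be moved as part of the recovery. In that small window of time, the preemption rate would look very different from any other day. If your application assumes that preemptions always happen in small groups, you might not be prepared for such an event.

  • Retry creating Spot VMs that have been preempted. If your Spot VMs have been preempted, try creating new Spot VMs once or twice before falling back to standard VMs. Depending on your requirements, it might be a good idea to combine standard VMs and Spot VMs in your clusters to ensure that work proceeds at an adequate pace.

  • Use shutdown scripts. Manage shutdown and preemption notices with a shutdown script that can save a job's progress so that it can pick up where it left off, rather than start over from scratch.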
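
The retry recommendation in the preceding list can be sketched as a small wrapper around whatever function creates the VM. In the following Python sketch, create_vm is a hypothetical callable supplied by the caller that takes a provisioning model ("SPOT" or "STANDARD") and either returns a VM handle or raises an exception; the attempt count and delay are assumptions to tune for your workload.

```python
import time


def create_with_fallback(create_vm, spot_attempts=2, delay_seconds=0.0):
    """Try Spot capacity a few times, then fall back to a standard VM.

    Returns a (provisioning_model, vm) tuple so the caller knows which
    kind of VM was created.
    """
    for _ in range(spot_attempts):
        try:
            return "SPOT", create_vm("SPOT")
        except Exception:
            # Real code should catch only the capacity-related error that
            # its client library raises; a broad catch keeps the sketch short.
            time.sleep(delay_seconds)
    # Spot capacity was not available; errors from this call propagate.
    return "STANDARD", create_vm("STANDARD")
```

Returning the provisioning model alongside the VM lets a scheduler track how much of the cluster has fallen back to standard VMs.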

What's next