Use reservations with prediction

This document explains how to use Compute Engine reservations to gain a high level of assurance that your prediction jobs have the necessary resources to run.

To ensure that your prediction jobs have the virtual machine (VM) resources available when they need them, use Compute Engine reservations. Reservations provide a high level of assurance in obtaining capacity for one or more VMs with the specified hardware configuration. A reservation for a VM incurs the costs of that VM from the time you create the reservation until you delete it. However, while you're consuming that VM, the total cost is the same as for a VM without a reservation. To learn more, see Reservations of Compute Engine zonal resources.

Limitations and requirements

When using Compute Engine reservations with Vertex AI, consider the following limitations and requirements:

  • Vertex AI can only consume reservations of VMs that have GPUs attached.
  • Vertex AI can't consume reservations of VMs that have Local SSD disks manually attached.
  • Using Compute Engine reservations with Vertex AI is only supported for custom training and prediction.
  • For a Vertex AI workload to consume a reservation, the reservation's VM properties must exactly match the workload's. For example, if a reservation specifies an a2-ultragpu-8g machine type, then the Vertex AI workload can only consume the reservation if it also uses an a2-ultragpu-8g machine type. See Requirements.
  • To consume a shared reservation of GPU VMs, you must consume it using its owner project or a consumer project with which the reservation is shared. See How shared reservations work.
  • To support regular updates of your Vertex AI deployments, we recommend increasing the VM count of your reservation by at least 1 additional VM for each concurrent deployment.
  • The following services and capabilities aren't supported when using Compute Engine reservations with Vertex AI prediction:

    • Federal Risk and Authorization Management Program (FedRAMP) compliance
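
The consumption rules in the preceding list can be sketched as a pre-flight check. This is a minimal illustration, not part of any Google Cloud library: the Reservation and Workload classes, their field names, and both helper functions are hypothetical, and the check covers only the rules stated in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    """Hypothetical, simplified view of a Compute Engine reservation."""
    owner_project: str
    machine_type: str
    accelerator_type: str
    accelerator_count: int
    manual_local_ssd: bool = False  # Local SSD disks attached manually
    shared_with: set = field(default_factory=set)  # consumer projects


@dataclass
class Workload:
    """Hypothetical, simplified view of a Vertex AI deployment."""
    project: str
    machine_type: str
    accelerator_type: str
    accelerator_count: int


def blocking_reasons(reservation, workload):
    """Return the reasons (possibly none) why consumption would fail."""
    reasons = []
    if reservation.accelerator_count < 1:
        reasons.append("reservation has no GPUs attached")
    if reservation.manual_local_ssd:
        reasons.append("reservation has manually attached Local SSD disks")
    reserved = (reservation.machine_type, reservation.accelerator_type,
                reservation.accelerator_count)
    requested = (workload.machine_type, workload.accelerator_type,
                 workload.accelerator_count)
    if reserved != requested:
        reasons.append("VM properties don't match exactly")
    allowed = {reservation.owner_project} | set(reservation.shared_with)
    if workload.project not in allowed:
        reasons.append("project is neither the owner nor a consumer project")
    return reasons


def recommended_vm_count(replicas_needed, concurrent_deployments):
    """Add 1 VM of headroom for each concurrent deployment."""
    return replicas_needed + concurrent_deployments
```

For example, under these assumptions a reservation sized for 4 replicas and 2 concurrent deployments would hold recommended_vm_count(4, 2) VMs, that is, 6.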

Billing

When using Compute Engine reservations, you're billed for the following:

  • Compute Engine pricing for the Compute Engine resources, including any applicable committed use discounts (CUDs). See Compute Engine pricing.
  • Vertex AI prediction management fees in addition to your infrastructure usage. See Prediction pricing.
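
The two billing components can be combined in a rough estimate. The following is a sketch under stated assumptions: the function and all parameter names are hypothetical, the rates are placeholders that you supply from the pricing pages, committed use discounts are ignored, and the management fee is assumed to apply only to VM hours that Vertex AI actually consumes.

```python
def estimate_cost(reserved_vms, reservation_hours, consumed_vm_hours,
                  vm_hourly_rate, management_fee_hourly):
    """Estimate charges for a reservation used by Vertex AI prediction.

    Assumption: infrastructure is billed for every reserved VM for as long
    as the reservation exists, whether or not the VM is consumed.
    Assumption: the management fee applies only to consumed VM hours.
    """
    reserved_vm_hours = reserved_vms * reservation_hours
    if consumed_vm_hours > reserved_vm_hours:
        raise ValueError("consumed VM hours exceed reserved VM hours")
    infrastructure = reserved_vm_hours * vm_hourly_rate
    management = consumed_vm_hours * management_fee_hourly
    return {
        "infrastructure": infrastructure,
        "management": management,
        "total": infrastructure + management,
    }
```

The infrastructure term doesn't depend on consumed_vm_hours, which reflects that a reservation incurs VM costs from creation until deletion.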

Before you begin

Allow a reservation to be consumed

Before consuming a reservation of GPU VMs, you must set its sharing policy to allow Vertex AI to consume the reservation. To do so, use one of the following methods:

Allow consumption while creating a reservation

When creating a single-project or shared reservation of GPU VMs, you can allow Vertex AI to consume the reservation as follows:

  • If you're using the Google Cloud console, then, in the Google Cloud services section, select Share reservation.
  • If you're using the Google Cloud CLI, then include the --reservation-sharing-policy flag set to ALLOW_ALL.
  • If you're using the REST API, then, in the request body, include the serviceShareType field set to ALLOW_ALL.
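
For the REST API option, a request body might be assembled as in the following sketch. Only the serviceShareType field and its ALLOW_ALL value come from this page. The enclosing reservationSharingPolicy object and the specificReservation layout are assumptions about the Compute Engine reservations resource, and the helper name is hypothetical, so verify the field names against the Compute Engine API reference before use.

```python
import json


def reservation_request_body(name, vm_count, machine_type,
                             accelerator_type, accelerator_count):
    """Build (but don't send) a reservation body that Vertex AI can consume."""
    return {
        "name": name,
        # Assumed layout of the reserved VM properties.
        "specificReservation": {
            "count": vm_count,
            "instanceProperties": {
                "machineType": machine_type,
                "guestAccelerators": [
                    {
                        "acceleratorType": accelerator_type,
                        "acceleratorCount": accelerator_count,
                    }
                ],
            },
        },
        # Assumed parent object for the serviceShareType field.
        "reservationSharingPolicy": {"serviceShareType": "ALLOW_ALL"},
    }


# Illustrative values only.
body = reservation_request_body(
    "RESERVATION_NAME", 2, "n1-standard-8", "nvidia-tesla-t4", 1)
print(json.dumps(body, indent=2))
```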

Allow consumption in an existing reservation

To allow Vertex AI to consume an existing reservation of GPU VMs, see Modify the sharing policy of a reservation.

Get predictions by using a reservation

To create a model deployment that consumes a Compute Engine reservation of GPU VMs, use the REST API or Vertex AI SDK for Python.

REST

Before using any of the request data, make the following replacements:

  • LOCATION_ID: The region where you are using Vertex AI.
  • PROJECT_ID: the project where the reservation was created. To consume a shared reservation from another project, you must share the reservation with that project. For more information, see Modify the consumer projects in a shared reservation.
  • ENDPOINT_ID: The ID for the endpoint.
  • MODEL_ID: The ID for the model to be deployed.
  • DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
  • MACHINE_TYPE: the machine type to use for each node in this deployment. Its default setting is n1-standard-2. For more information about the supported machine types, see Configure compute resources for prediction.
  • ACCELERATOR_TYPE: the type of accelerator to attach to the machine. For more information about the type of GPU that each machine type supports, see GPUs for compute workloads.
  • ACCELERATOR_COUNT: the number of accelerators to attach to the machine.
  • RESERVATION_AFFINITY_TYPE: Must be ANY_RESERVATION, SPECIFIC_RESERVATION, or NO_RESERVATION.
    • ANY_RESERVATION means that the VMs of your deployment can automatically consume any reservation with matching properties.
    • SPECIFIC_RESERVATION means that the VMs of your deployment can consume only a reservation that the VMs specifically target by name.
    • NO_RESERVATION means that the VMs of your deployment can't consume any reservation. Specifying NO_RESERVATION has the same effect as omitting a reservation affinity specification.
  • ZONE: the zone where the reservation is located.
  • RESERVATION_NAME: the name of your reservation.
  • MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
  • MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
  • TRAFFIC_SPLIT_THIS_MODEL: the percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
  • DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
  • TRAFFIC_SPLIT_MODEL_N: the traffic split percentage value for the deployed model ID key.
  • PROJECT_NUMBER: Your project's automatically generated project number.

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel

Request JSON body:

{
  "deployedModel": {
    "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
    "displayName": "DEPLOYED_MODEL_NAME",
    "dedicatedResources": {
      "machineSpec": {
        "machineType": "MACHINE_TYPE",
        "acceleratorType": "ACCELERATOR_TYPE",
        "acceleratorCount": ACCELERATOR_COUNT,
        "reservationAffinity": {
          "reservationAffinityType": "RESERVATION_AFFINITY_TYPE",
          "key": "compute.googleapis.com/reservation-name",
          "values": [
            "projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME"
          ]
        }
      },
      "minReplicaCount": MIN_REPLICA_COUNT,
      "maxReplicaCount": MAX_REPLICA_COUNT
    }
  },
  "trafficSplit": {
    "0": TRAFFIC_SPLIT_THIS_MODEL,
    "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1,
    "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2
  }
}
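
The URL and request body above can also be assembled programmatically. The following sketch only builds the request and sends nothing. The function name and its parameters are hypothetical, and leaving out key and values for any affinity type other than SPECIFIC_RESERVATION is an assumption rather than documented behavior.

```python
import json

RESERVATION_KEY = "compute.googleapis.com/reservation-name"


def build_deploy_request(project_id, location_id, endpoint_id, model_id,
                         display_name, machine_spec, min_replicas,
                         max_replicas, traffic_split, affinity_type,
                         reservation_path=None):
    """Return (url, body) for a deployModel call."""
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic split percentages must add up to 100")

    affinity = {"reservationAffinityType": affinity_type}
    if affinity_type == "SPECIFIC_RESERVATION":
        if not reservation_path:
            raise ValueError("a specific reservation needs a reservation path")
        affinity["key"] = RESERVATION_KEY
        affinity["values"] = [reservation_path]

    parent = f"projects/{project_id}/locations/{location_id}"
    url = (f"https://{location_id}-aiplatform.googleapis.com/v1beta1/"
           f"{parent}/endpoints/{endpoint_id}:deployModel")
    body = {
        "deployedModel": {
            "model": f"{parent}/models/{model_id}",
            "displayName": display_name,
            "dedicatedResources": {
                "machineSpec": {**machine_spec,
                                "reservationAffinity": affinity},
                "minReplicaCount": min_replicas,
                "maxReplicaCount": max_replicas,
            },
        },
        "trafficSplit": traffic_split,
    }
    return url, body


# Illustrative values only; "0" is the key for the model being deployed.
url, body = build_deploy_request(
    "my-project", "us-central1", "1234567890", "9876543210", "my-model",
    {"machineType": "a2-highgpu-1g",
     "acceleratorType": "NVIDIA_TESLA_A100",
     "acceleratorCount": 1},
    1, 2, {"0": 100}, "SPECIFIC_RESERVATION",
    "projects/my-project/zones/us-central1-a/reservations/my-reservation")
print(url)
print(json.dumps(body, indent=2))
```

Building the body from a dictionary and serializing it with json.dumps avoids hand-editing mistakes such as trailing commas.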

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.DeployModelOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-10-19T17:53:16.502088Z",
      "updateTime": "2020-10-19T17:53:16.502088Z"
    }
  }
}
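
The name field in this response identifies an operation resource. As a sketch, assuming the response follows the standard long-running operation shape (a done field that becomes true on completion) and that the operation can be read with a GET request on its name, the following hypothetical helpers derive the operation ID and a status URL:

```python
def operation_id(response):
    """Return the last path segment of the operation name."""
    return response["name"].rsplit("/", 1)[-1]


def operation_status_url(response, location_id, api_version="v1beta1"):
    """Build a URL for checking the operation's status (assumed pattern)."""
    return (f"https://{location_id}-aiplatform.googleapis.com/"
            f"{api_version}/{response['name']}")


def is_done(operation):
    """Treat a missing done field as not finished."""
    return bool(operation.get("done", False))
```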

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

Before running any of the following scripts, make the following replacements:

  • DEPLOYED_NAME: a name for the deployed model.
  • TRAFFIC_SPLIT: a dictionary that maps deployed model IDs to traffic split percentages. Use the key "0" for the model being deployed. All percentages must add up to 100.
  • MACHINE_TYPE: the machine type used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
  • ACCELERATOR_TYPE: the type of accelerator to attach to the machine. For more information about the type of GPU that each machine type supports, see GPUs for compute workloads.
  • ACCELERATOR_COUNT: the number of accelerators to attach to the machine.
  • PROJECT_ID: the project where the reservation was created. To consume a shared reservation from another project, you must share the reservation with that project. For more information, see Modify the consumer projects in a shared reservation.
  • ZONE: the zone where the reservation is located.
  • RESERVATION_NAME: the name of your reservation.
  • MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
  • MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
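
Before calling the SDK, the replica counts and the reservation path can be checked locally. The following is a minimal sketch: the function name is hypothetical, and the path pattern mirrors the projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME form used in this section.

```python
import re

_RESERVATION_PATH = re.compile(
    r"^projects/(?P<project>[^/]+)/zones/(?P<zone>[^/]+)"
    r"/reservations/(?P<reservation>[^/]+)$")


def validate_deployment_settings(min_replica_count, max_replica_count,
                                 reservation_path):
    """Raise ValueError on invalid settings; return the parsed path parts."""
    if min_replica_count < 1:
        raise ValueError("MIN_REPLICA_COUNT must be greater than or equal to 1")
    if max_replica_count < min_replica_count:
        raise ValueError("MAX_REPLICA_COUNT must not be less than "
                         "MIN_REPLICA_COUNT")
    match = _RESERVATION_PATH.match(reservation_path)
    if match is None:
        raise ValueError("expected projects/PROJECT_ID/zones/ZONE/"
                         "reservations/RESERVATION_NAME")
    return match.groupdict()
```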

Depending on the type of reservation that you want to consume, do one of the following:

  • To consume a specific reservation:
    # Assumes that endpoint5 is an existing aiplatform.Endpoint and that
    # model is an existing aiplatform.Model.
    endpoint5.deploy(
        model=model,
        deployed_model_display_name="DEPLOYED_NAME",
        traffic_split=TRAFFIC_SPLIT,
        machine_type="MACHINE_TYPE",
        accelerator_type="ACCELERATOR_TYPE",
        accelerator_count=ACCELERATOR_COUNT,
        # Target one reservation by its full resource path.
        reservation_affinity_type="SPECIFIC_RESERVATION",
        reservation_affinity_key="compute.googleapis.com/reservation-name",
        reservation_affinity_values=["projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME"],
        min_replica_count=MIN_REPLICA_COUNT,
        max_replica_count=MAX_REPLICA_COUNT,
        sync=True
    )
  • To automatically consume any matching reservation:
    # Assumes that endpoint5 is an existing aiplatform.Endpoint and that
    # model is an existing aiplatform.Model.
    endpoint5.deploy(
        model=model,
        deployed_model_display_name="DEPLOYED_NAME",
        traffic_split=TRAFFIC_SPLIT,
        machine_type="MACHINE_TYPE",
        accelerator_type="ACCELERATOR_TYPE",
        accelerator_count=ACCELERATOR_COUNT,
        # No key or values: any reservation with matching properties qualifies.
        reservation_affinity_type="ANY_RESERVATION",
        min_replica_count=MIN_REPLICA_COUNT,
        max_replica_count=MAX_REPLICA_COUNT,
        sync=True
    )
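
The two calls differ only in their reservation arguments, so those arguments can be factored into a small helper. This is a sketch with a hypothetical function name. The keyword names and values are the ones used in the preceding samples.

```python
RESERVATION_KEY = "compute.googleapis.com/reservation-name"


def reservation_deploy_kwargs(reservation_path=None):
    """Return the reservation keyword arguments for a deploy call.

    With a path, target that specific reservation. Without one, let the
    deployment consume any reservation with matching properties.
    """
    if reservation_path is None:
        return {"reservation_affinity_type": "ANY_RESERVATION"}
    return {
        "reservation_affinity_type": "SPECIFIC_RESERVATION",
        "reservation_affinity_key": RESERVATION_KEY,
        "reservation_affinity_values": [reservation_path],
    }
```

The result can then be unpacked into the call, for example endpoint5.deploy(model=model, machine_type="MACHINE_TYPE", **reservation_deploy_kwargs()), together with the remaining arguments from the samples.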

What's next