Using Spot VMs or reservations to deploy a Vertex AI Llama-3.1 endpoint on Cloud GPUs

This guide shows you how to deploy the Meta-Llama-3.1-8B model on a Vertex AI endpoint using different provisioning options to balance cost and availability.

This document covers the following topics:

  • Choose your deployment option: Compare Spot VMs and reservations to pick a provisioning strategy.
  • Before you begin: Set up your project, quota, and Hugging Face access.
  • Deploy the Llama-3.1 model: Deploy using Spot VMs, ANY_RESERVATION, or SPECIFIC_RESERVATION.
  • Test your endpoint: Send prompts to verify the deployment.
  • Clean up: Delete the resources that you created.

Choose your deployment option

Vertex AI offers different ways to provision resources for your model endpoints. The following table compares the options covered in this guide to help you choose the best one for your workload.

| Option | Description | Pros | Cons | Use case |
| --- | --- | --- | --- | --- |
| Spot VMs | Virtual machines that are acquired at a lower price than standard VMs but can be preempted if the resources are needed elsewhere. | Significant cost savings. | No guarantee of availability; instances can be stopped with little warning. | Fault-tolerant, stateless, or batch-processing workloads that can handle interruptions. |
| ANY_RESERVATION | The deployment consumes capacity from any available reservation in the project that matches the required machine type and zone. | Provides guaranteed capacity without needing to specify a particular reservation name. Simplifies deployment configuration. | Less control over which specific reservation is used. | Workloads that require guaranteed capacity but don't need to be tied to a specific, named reservation. |
| SPECIFIC_RESERVATION | The deployment consumes capacity from a single, named reservation that you specify. | Provides guaranteed capacity with precise control over which reservation is used. | Requires more specific configuration during deployment. | Production workloads where you need to ensure capacity is drawn from a designated reservation, often for capacity planning or billing attribution. |

For more information, see Spot VMs or Reservations of Compute Engine resources.

Before you begin

Before you deploy the model, complete the following prerequisites in the Colab Enterprise notebook.

  1. Make sure that you have the following:

    • A Google Cloud project with billing enabled.
    • The Vertex AI and Compute Engine APIs enabled.
    • Sufficient quota for the machine type and accelerator you plan to use, such as NVIDIA L4 GPUs. To check your quotas, go to Quotas and system limits in the Google Cloud console.
    • A Hugging Face account and a User Access Token with read access.
    • If you use shared reservations, you need IAM permissions granted between projects. The notebook provides instructions for setting these permissions.
  2. Set up your environment: Open the notebook and set the PROJECT_ID, SHARED_PROJECT_ID (if applicable), BUCKET_URI, and REGION variables. A sketch with example values appears after this list.

    If you plan to use a reservation from a different project, set the SHARED_PROJECT_ID variable. The notebook grants the compute.viewer role to the Vertex AI service agent (P4SA) of both projects. This cross-project permission lets the Vertex AI endpoint in your primary project use the reservation capacity in the shared project.

  3. Configure Hugging Face authentication: To download the Llama-3.1 model, provide your Hugging Face User Access Token in the HF_TOKEN variable in the notebook. If you don't provide a token, the deployment fails with a Cannot access gated repository for URL error.

    Figure 1: Hugging Face Access Token Settings
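
The following sketch shows example values for the variables from steps 2 and 3 and one way to initialize the Vertex AI SDK with them. The values are placeholders (assumptions); only the variable names come from the notebook.

# Placeholder values -- replace them with your own project, region, bucket, and token.
PROJECT_ID = "my-project"                      # project that hosts the endpoint
SHARED_PROJECT_ID = ""                         # set only if the reservation lives in another project
REGION = "us-central1"                         # region with G2 (NVIDIA L4) capacity
BUCKET_URI = "gs://my-project-vertex-staging"  # staging bucket for deployment artifacts
HF_TOKEN = "hf_..."                            # Hugging Face User Access Token with read access

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)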

Deploy the Llama-3.1 model

After you complete the prerequisites, you can deploy the model using either Spot VMs for cost optimization or reservations for guaranteed capacity.

Deploy using Spot VMs

To deploy the Llama model to a Spot VM for fault-tolerant workloads, go to the Spot VM Vertex AI Endpoint Deployment section in the Colab notebook and set is_spot=True.

base_model_name = "Meta-Llama-3.1-8B"
hf_model_id = "meta-llama/" + base_model_name

if "8b" in base_model_name.lower():
    accelerator_type = "NVIDIA_L4"
    machine_type = "g2-standard-12"
    accelerator_count = 1
    max_loras = 5
else:
    raise ValueError(
        f"Recommended GPU setting not found for: {accelerator_type} and {base_model_name}."
    )

common_util.check_quota(
    project_id=PROJECT_ID,
    region=REGION,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    is_for_training=False,
)

gpu_memory_utilization = 0.95  # fraction of GPU memory that vLLM can use for weights and KV cache
max_model_len = 8192  # maximum context length for the served model

models["vllm_gpu_spotvm"], endpoints["vllm_gpu_spotvm"] = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(prefix="llama3_1-serve-spotvm"),
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    gpu_memory_utilization=gpu_memory_utilization,
    max_model_len=max_model_len,
    max_loras=max_loras,
    enforce_eager=True,
    enable_lora=True,
    use_dedicated_endpoint=False,
    model_type="llama3.1",
    is_spot=True,
)

Deploy using reservations

To get guaranteed resource availability for production workloads, deploy the model using a reservation.

1. Create and share a reservation

Reservations can be for a single project (default) or shared across multiple projects to improve resource utilization. This guide uses shared reservations. For more information, see How shared reservations work.

  • Create the reservation: In the Set Up Reservations for Vertex AI Predictions section of the notebook, set the RES_ZONE, RESERVATION_NAME, RES_MACHINE_TYPE, RES_ACCELERATOR_TYPE, and RES_ACCELERATOR_COUNT variables.

    RES_ZONE = "a"
    RES_ZONE = f"{REGION}-{RES_ZONE}"
    
    RESERVATION_NAME = "shared-reservation-1"
    RESERVATION_NAME = f"{PROJECT_ID}-{RESERVATION_NAME}"
    RES_MACHINE_TYPE = "g2-standard-12"
    RES_ACCELERATOR_TYPE = "nvidia-l4"
    RES_ACCELERATOR_COUNT = 1
    rev_names.append(RESERVATION_NAME)
    
    create_reservation(
        res_project_id=PROJECT_ID,
        res_zone=RES_ZONE,
        res_name=RESERVATION_NAME,
        res_machine_type=RES_MACHINE_TYPE,
        res_accelerator_type=RES_ACCELERATOR_TYPE,
        res_accelerator_count=RES_ACCELERATOR_COUNT,
        shared_project_id=SHARED_PROJECT_ID,
    )
    
  • Share the reservation: In the Google Cloud console, configure the reservation to Share with other Google Services.

    Figure 2: Share reservation with other Google services

2. Deploy with ANY_RESERVATION

To deploy the endpoint using any available matching reservation, go to the Deploy Llama-3.1 Endpoint with ANY_RESERVATION section of the notebook and set reservation_affinity_type="ANY_RESERVATION".

hf_model_id = "meta-llama/Meta-Llama-3.1-8B"

models["vllm_gpu_any_reserve"], endpoints["vllm_gpu_any_reserve"] = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(
        prefix=f"llama3_1-serve-any-{RESERVATION_NAME}"
    ),
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    model_type="llama3.1",
    reservation_affinity_type="ANY_RESERVATION",
)

3. Deploy with SPECIFIC_RESERVATION

To deploy the endpoint using a specific, named reservation, go to the Deploy Llama-3.1 Endpoint with SPECIFIC_RESERVATION section of the notebook. Set reservation_affinity_type="SPECIFIC_RESERVATION" and specify the reservation_name, reservation_project, and reservation_zone.

hf_model_id = "meta-llama/Meta-Llama-3.1-8B"

MACHINE_TYPE = "g2-standard-12"
ACCELERATOR_TYPE = "NVIDIA_L4"
ACCELERATOR_COUNT = 1

(
    models["vllm_gpu_specific_reserve"],
    endpoints["vllm_gpu_specific_reserve"],
) = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(
        prefix=f"llama3_1-serve-specific-{RESERVATION_NAME}"
    ),
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    model_type="llama3.1",
    reservation_name=RESERVATION_NAME,
    reservation_affinity_type="SPECIFIC_RESERVATION",
    reservation_project=PROJECT_ID,
    reservation_zone=RES_ZONE,
)

Test your endpoint

After you deploy the endpoint, use the notebook to send prompts to it to verify that it's working correctly.
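
For example, the following sketch sends a prompt to the Spot VM endpoint through the Vertex AI SDK. The request fields shown here (prompt, max_tokens, temperature, top_p, top_k) are typical vLLM serving parameters and are assumptions; adjust them to match the test cells in the notebook.

instances = [
    {
        "prompt": "What is a car?",
        "max_tokens": 128,     # maximum number of new tokens to generate
        "temperature": 0.7,    # sampling temperature
        "top_p": 0.9,
        "top_k": 40,
    },
]

# Use whichever endpoint you deployed, for example the Spot VM endpoint.
response = endpoints["vllm_gpu_spotvm"].predict(instances=instances)

for prediction in response.predictions:
    print(prediction)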

If you used a reservation, you can verify in the Google Cloud console that the reservation is being consumed by Vertex AI online prediction.

Figure 3: Check that the reservation is used by Vertex AI online prediction
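
If you prefer to check programmatically, the following sketch reads the reservation's in-use count with the Compute Engine Python client. It assumes that the google-cloud-compute library is available and reuses the PROJECT_ID, RES_ZONE, and RESERVATION_NAME variables from earlier cells.

from google.cloud import compute_v1

reservation = compute_v1.ReservationsClient().get(
    project=PROJECT_ID,
    zone=RES_ZONE,
    reservation=RESERVATION_NAME,
)

# in_use_count increases when the Vertex AI deployment consumes the reserved VMs.
print("Reserved VMs:", reservation.specific_reservation.count)
print("In use:", reservation.specific_reservation.in_use_count)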

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the models, endpoints, and reservations that you created. The Clean Up section in the Colab notebook provides code to automate this process.
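
If you prefer to run the cleanup yourself, a minimal sketch using the Vertex AI SDK and the Compute Engine client might look like the following. It assumes that the models, endpoints, and rev_names collections from the earlier cells are still in scope.

from google.cloud import compute_v1

# Undeploy the models, then delete the endpoints and uploaded models.
for endpoint in endpoints.values():
    endpoint.undeploy_all()
    endpoint.delete()
for model in models.values():
    model.delete()

# Delete the reservations that were created for this tutorial.
reservations_client = compute_v1.ReservationsClient()
for res_name in rev_names:
    operation = reservations_client.delete(
        project=PROJECT_ID, zone=RES_ZONE, reservation=res_name
    )
    operation.result()  # wait for the deletion to complete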

Troubleshooting

  • Hugging Face Token Errors: Double-check that your Hugging Face token has read permissions and is correctly set in the HF_TOKEN variable. You can verify access with the check shown after this list.
  • Quota Errors: Verify that you have sufficient GPU quota in the region you are deploying to. If needed, request a quota increase.
  • Reservation Conflicts: Make sure that the machine type and accelerator configuration of your endpoint deployment match the settings of your reservation. Also, verify that the reservation is enabled to be shared with Google Cloud services.
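
To confirm that a token can reach the gated Llama repository, a quick check with the huggingface_hub library (assumed to be installed in the notebook environment) looks like this:

from huggingface_hub import HfApi

# Raises an error (for example, a gated-repo access error) if the token lacks access.
HfApi(token=HF_TOKEN).model_info("meta-llama/Meta-Llama-3.1-8B")
print("The token can access the gated Llama 3.1 repository.")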

Next steps