Using Spot VMs or reservations to deploy a Vertex AI Llama-3.1 endpoint on Cloud GPUs

This guide shows you how to deploy the Meta-Llama-3.1-8B model on a Vertex AI endpoint using different provisioning options to balance cost and availability. This document covers the following topics: choosing a deployment option, completing the prerequisites, deploying the model using Spot VMs or reservations, testing the endpoint, and cleaning up.

Choose your deployment option

Vertex AI offers different ways to provision resources for your model endpoints: Spot VMs, ANY_RESERVATION, and SPECIFIC_RESERVATION. The following table compares the options covered in this guide to help you choose the best one for your workload. For more information, see Spot VMs or Reservations of Compute Engine resources.
| Option | Description | Pros | Cons | Use case |
| --- | --- | --- | --- | --- |
| Spot VMs | Virtual machines that are acquired at a lower price than standard VMs but can be preempted if the resources are needed elsewhere. | Significant cost savings. | No guarantee of availability; instances can be stopped with little warning. | Fault-tolerant, stateless, or batch-processing workloads that can handle interruptions. |
| ANY_RESERVATION | The deployment consumes capacity from any available reservation in the project that matches the required machine type and zone. | Provides guaranteed capacity without needing to specify a particular reservation name. Simplifies deployment configuration. | Less control over which specific reservation is used. | Workloads that require guaranteed capacity but don't need to be tied to a specific, named reservation. |
| SPECIFIC_RESERVATION | The deployment consumes capacity from a single, named reservation that you specify. | Provides guaranteed capacity with precise control over which reservation is used. | Requires more specific configuration during deployment. | Production workloads where you need to ensure capacity is drawn from a designated reservation, often for capacity planning or billing attribution. |
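The three options in the table surface in the notebook as keyword arguments to its `deploy_model_vllm` helper (shown in the code blocks later in this guide). As an illustration only, the per-option arguments differ roughly as follows; the reservation name, project, and zone values below are placeholders:

```python
# Illustrative mapping of each provisioning option to the extra keyword
# arguments passed to the notebook's deploy_model_vllm helper.
PROVISIONING_KWARGS = {
    "spot": {"is_spot": True},
    "any_reservation": {"reservation_affinity_type": "ANY_RESERVATION"},
    "specific_reservation": {
        "reservation_affinity_type": "SPECIFIC_RESERVATION",
        # A specific reservation also needs its name, project, and zone:
        "reservation_name": "my-reservation",  # placeholder
        "reservation_project": "my-project",   # placeholder
        "reservation_zone": "us-central1-a",   # placeholder
    },
}

print(sorted(PROVISIONING_KWARGS))
```

Only the `SPECIFIC_RESERVATION` option requires identifying a particular reservation; the other two need a single flag.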
Before you begin

Before you deploy the model, complete the following prerequisites in the Colab Enterprise notebook.

Set up your environment:

- Open the notebook and set the PROJECT_ID, SHARED_PROJECT_ID (if applicable), BUCKET_URI, and REGION variables.
- If you plan to use a reservation from a different project, set the SHARED_PROJECT_ID variable. The notebook grants the compute.viewer role to the P4SA (Principal Service Account) of both projects. This cross-project permission lets the Vertex AI endpoint in your primary project use the reservation capacity in the shared project.

Configure Hugging Face authentication:

- To download the Llama-3.1 model, provide your Hugging Face User Access Token in the HF_TOKEN variable in the notebook. If you don't provide a token, the deployment fails with a Cannot access gated repository for URL error.

Figure 1: Hugging Face Access Token Settings
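Because a missing token only surfaces later as a deployment failure, it can help to fail fast when the notebook starts. The following guard is a hypothetical sketch, not part of the notebook; reading the token from an environment variable is an assumption for illustration:

```python
import os

# HF_TOKEN is the notebook variable; reading it from the environment here
# is only an assumption for this illustration.
HF_TOKEN = os.environ.get("HF_TOKEN", "")

def require_hf_token(token):
    """Fail fast with a clear message instead of a later
    'Cannot access gated repository for URL' deployment error."""
    if not token:
        raise ValueError(
            "HF_TOKEN is empty. Create a Hugging Face User Access Token "
            "with read access and set it before deploying."
        )
    return token

# Uncomment once HF_TOKEN is set:
# require_hf_token(HF_TOKEN)
```

Running this check at the top of the notebook turns a cryptic mid-deployment failure into an immediate, actionable error.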
Deploy the Llama-3.1 model

After you complete the prerequisites, you can deploy the model using either Spot VMs for cost optimization or reservations for guaranteed capacity.

Deploy using Spot VMs

To deploy the Llama model to a Spot VM for fault-tolerant workloads, go to the Spot VM Vertex AI Endpoint Deployment section in the Colab notebook and set is_spot=True:
```python
base_model_name = "Meta-Llama-3.1-8B"
hf_model_id = "meta-llama/" + base_model_name

if "8b" in base_model_name.lower():
    accelerator_type = "NVIDIA_L4"
    machine_type = "g2-standard-12"
    accelerator_count = 1
    max_loras = 5
else:
    raise ValueError(f"Recommended GPU setting not found for: {base_model_name}.")

common_util.check_quota(
    project_id=PROJECT_ID,
    region=REGION,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    is_for_training=False,
)

gpu_memory_utilization = 0.95
max_model_len = 8192

models["vllm_gpu_spotvm"], endpoints["vllm_gpu_spotvm"] = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(prefix="llama3_1-serve-spotvm"),
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    gpu_memory_utilization=gpu_memory_utilization,
    max_model_len=max_model_len,
    max_loras=max_loras,
    enforce_eager=True,
    enable_lora=True,
    use_dedicated_endpoint=False,
    model_type="llama3.1",
    is_spot=True,
)
```
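Because Spot-backed endpoints can be preempted with little warning, client code should treat transient prediction failures as retryable. The helper below is a minimal sketch (it is not part of the notebook) showing exponential backoff with jitter around any prediction callable:

```python
import random
import time

def predict_with_retry(predict_fn, max_attempts=4, base_delay=1.0):
    """Call predict_fn, retrying with exponential backoff and jitter.

    Useful for Spot-backed endpoints, where a replica may be preempted
    mid-request and the call can fail transiently.
    """
    for attempt in range(max_attempts):
        try:
            return predict_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            # Exponential backoff plus a small random jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Demo with a flaky stand-in that fails twice, then succeeds:
calls = {"count": 0}

def flaky_predict():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("replica preempted")
    return "ok"

print(predict_with_retry(flaky_predict, base_delay=0.01))  # prints: ok
```

In practice you would pass a closure over the real call, for example `lambda: endpoint.predict(instances=instances)`.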
Deploy using reservations

To get guaranteed resource availability for production workloads, deploy the model using a reservation.

1. Create and share a reservation

Reservations can be for a single project (default) or shared across multiple projects to improve resource utilization. This guide uses shared reservations. For more information, see How shared reservations work.

Create the reservation: In the Set Up Reservations for Vertex AI Predictions section of the notebook, set the RES_ZONE, RESERVATION_NAME, RES_MACHINE_TYPE, RES_ACCELERATOR_TYPE, and RES_ACCELERATOR_COUNT variables.

```python
RES_ZONE = "a"
RES_ZONE = f"{REGION}-{RES_ZONE}"
RESERVATION_NAME = "shared-reservation-1"
RESERVATION_NAME = f"{PROJECT_ID}-{RESERVATION_NAME}"
RES_MACHINE_TYPE = "g2-standard-12"
RES_ACCELERATOR_TYPE = "nvidia-l4"
RES_ACCELERATOR_COUNT = 1

rev_names.append(RESERVATION_NAME)

create_reservation(
    res_project_id=PROJECT_ID,
    res_zone=RES_ZONE,
    res_name=RESERVATION_NAME,
    res_machine_type=RES_MACHINE_TYPE,
    res_accelerator_type=RES_ACCELERATOR_TYPE,
    res_accelerator_count=RES_ACCELERATOR_COUNT,
    shared_project_id=SHARED_PROJECT_ID,
)
```
Share the reservation: In the Google Cloud console, configure the reservation to Share with other Google services.

Figure 2: Share reservation with other Google services
2. Deploy with ANY_RESERVATION

To deploy the endpoint using any available matching reservation, go to the Deploy Llama-3.1 Endpoint with ANY_RESERVATION section of the notebook and set reservation_affinity_type="ANY_RESERVATION":

```python
hf_model_id = "meta-llama/Meta-Llama-3.1-8B"

models["vllm_gpu_any_reserve"], endpoints["vllm_gpu_any_reserve"] = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(
        prefix=f"llama3_1-serve-any-{RESERVATION_NAME}"
    ),
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    model_type="llama3.1",
    reservation_affinity_type="ANY_RESERVATION",
)
```
3. Deploy with SPECIFIC_RESERVATION

To deploy the endpoint using a specific, named reservation, go to the Deploy Llama-3.1 Endpoint with SPECIFIC_RESERVATION section of the notebook. Set reservation_affinity_type="SPECIFIC_RESERVATION" and specify the reservation_name, reservation_project, and reservation_zone:

```python
hf_model_id = "meta-llama/Meta-Llama-3.1-8B"
MACHINE_TYPE = "g2-standard-12"
ACCELERATOR_TYPE = "NVIDIA_L4"
ACCELERATOR_COUNT = 1

(
    models["vllm_gpu_specific_reserve"],
    endpoints["vllm_gpu_specific_reserve"],
) = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(
        prefix=f"llama3_1-serve-specific-{RESERVATION_NAME}"
    ),
    model_id=hf_model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    model_type="llama3.1",
    reservation_name=RESERVATION_NAME,
    reservation_affinity_type="SPECIFIC_RESERVATION",
    reservation_project=PROJECT_ID,
    reservation_zone=RES_ZONE,
)
```
Test your endpoint

After you deploy the endpoint, use the notebook to send prompts to it to verify that it's working correctly. If you used a reservation, you can verify in the Google Cloud console that the reservation is being consumed by Vertex AI online prediction.

Figure 3: Check that the reservation is used by Vertex AI online prediction
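As a sketch of what a test request can look like: vLLM serving containers typically accept instances with prompt, max_tokens, and temperature fields, but the exact schema depends on the container version, so check the notebook for the authoritative request format. The helper below is hypothetical:

```python
def build_vllm_instance(prompt, max_tokens=128, temperature=0.7):
    """Build one prediction instance for a vLLM-served Llama endpoint.

    The field names here are the ones commonly used by vLLM serving
    containers; confirm the exact schema in the notebook.
    """
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

instances = [build_vllm_instance("What is a Spot VM?")]

# With a deployed endpoint from the notebook, you would then call:
# response = endpoints["vllm_gpu_spotvm"].predict(instances=instances)
# print(response.predictions[0])
print(instances[0]["prompt"])  # prints: What is a Spot VM?
```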
Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the models, endpoints, and reservations that you created. The Clean Up section in the Colab notebook provides code to automate this process.
Troubleshooting

If the deployment fails with a Cannot access gated repository for URL error, verify that your Hugging Face token has read permissions and is correctly set in the notebook.

Next steps
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-08-27 UTC.