Overview
Vertex AI Prediction supports multi-host GPU deployment for serving models that exceed the memory capacity of a single GPU node, such as DeepSeek-V3, DeepSeek-R1, and Meta Llama 3.1 405B (non-quantized version).
This guide describes how to serve a DeepSeek-V3 model using multi-host graphics processing units (GPUs) on Vertex AI Prediction with vLLM. Setup for other models is similar. For more information, see vLLM serving for text and multimodal language models.
Before you begin, ensure that you are familiar with Vertex AI Prediction and with serving large language models by using vLLM. This deployment uses billable Google Cloud resources; use the Pricing Calculator to generate a cost estimate based on your projected usage.
Containers
To support multi-host deployments, this guide uses a prebuilt vLLM container image with Ray integration from Model Garden. Ray enables the distributed processing required to run models across multiple GPU nodes. This container also supports serving streaming requests by using the Chat Completions API.
If desired, you can create your own vLLM multi-node image. Note that this custom container image needs to be compatible with Vertex AI Prediction.
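If you want to inspect the prebuilt image locally before deploying, one option is to pull it with Docker. This is an optional sketch; it assumes that you have Docker and the Google Cloud CLI installed and that your account can read the registry:

# Let Docker authenticate to the us-docker.pkg.dev registry through gcloud.
gcloud auth configure-docker us-docker.pkg.dev

# Pull the prebuilt vLLM + Ray serving image that is used later in this guide.
docker pull us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250130_0916_RC01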
Before you begin
Before you begin your model deployment, complete the prerequisites listed in this section.
Set up a Google Cloud project
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Vertex AI API.
- In the Google Cloud console, activate Cloud Shell.
  At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
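If you work from Cloud Shell or another terminal, you can also complete the project configuration and API enablement with the Google Cloud CLI. A minimal sketch, assuming PROJECT_ID is replaced with your project ID:

# Point the gcloud CLI at your project.
gcloud config set project PROJECT_ID

# Enable the Vertex AI API in that project.
gcloud services enable aiplatform.googleapis.com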
Request GPU quota
To deploy DeepSeek-V3, you need two a3-highgpu-8g VMs with eight H100 GPUs each, for a total of 16 H100 GPUs. It's likely that you'll need to request an H100 GPU quota increase, as the default value is less than 16.

To view your H100 GPU quota, go to the Quotas & System Limits page in the Google Cloud console.
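If you prefer the command line, one option is the Service Usage quota listing (an alpha gcloud command at the time of writing). The following is only a sketch; because the exact quota metric name isn't shown here, it filters loosely for entries that mention H100:

# List quotas for the Vertex AI API in your project and keep entries mentioning H100.
# PROJECT_ID is the same placeholder used elsewhere in this guide.
gcloud alpha services quota list \
    --service=aiplatform.googleapis.com \
    --consumer=projects/PROJECT_ID \
    --format=json | grep -i -B 2 -A 2 "h100"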
Upload the model
To upload your model as a Model resource to Vertex AI Prediction, run the gcloud ai models upload command as follows:

gcloud ai models upload \
    --region=LOCATION \
    --project=PROJECT_ID \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250130_0916_RC01 \
    --container-args='^;^/vllm-workspace/ray_launcher.sh;python;-m;vllm.entrypoints.api_server;--host=0.0.0.0;--port=8080;--model=deepseek-ai/DeepSeek-V3;--tensor-parallel-size=16;--pipeline-parallel-size=1;--gpu-memory-utilization=0.9;--trust-remote-code;--max-model-len=32768' \
    --container-deployment-timeout-seconds=4500 \
    --container-ports=8080 \
    --container-env-vars=MODEL_ID=deepseek-ai/DeepSeek-V3
Make the following replacements:
- LOCATION: the region where you are using Vertex AI
- PROJECT_ID: the ID of your Google Cloud project
- MODEL_DISPLAY_NAME: the display name you want for your model
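To confirm that the Model resource was created, you can list the models in the region and filter by the display name you chose; a quick check using the same placeholders:

gcloud ai models list \
    --project=PROJECT_ID \
    --region=LOCATION \
    --filter=display_name=MODEL_DISPLAY_NAME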
Create a dedicated online prediction endpoint
To support chat completion requests, the Model Garden container requires a dedicated endpoint. Because dedicated endpoints are in Preview and aren't supported by the Google Cloud CLI, you need to use the REST API to create the endpoint.
To create the dedicated endpoint, run the following command:
PROJECT_ID=PROJECT_ID
REGION=LOCATION
ENDPOINT="${REGION}-aiplatform.googleapis.com"

curl \
    -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    https://${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints \
    -d '{
      "displayName": "ENDPOINT_DISPLAY_NAME",
      "dedicatedEndpointEnabled": true
    }'
Make the following replacements:
- ENDPOINT_DISPLAY_NAME: the display name for your endpoint
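To verify that the dedicated endpoint was created, you can list the endpoints in the region and filter by its display name; a quick check using the same placeholders:

gcloud ai endpoints list \
    --project=PROJECT_ID \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_DISPLAY_NAME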
Deploy the model
Get the endpoint ID for the online prediction endpoint by running the gcloud ai endpoints list command:

ENDPOINT_ID=$(gcloud ai endpoints list \
    --project=PROJECT_ID \
    --region=LOCATION \
    --filter=display_name~'ENDPOINT_DISPLAY_NAME' \
    --format="value(name)")
Get the model ID for your model by running the gcloud ai models list command:

MODEL_ID=$(gcloud ai models list \
    --project=PROJECT_ID \
    --region=LOCATION \
    --filter=display_name~'MODEL_DISPLAY_NAME' \
    --format="value(name)")
Deploy the model to the endpoint by running the gcloud alpha ai endpoints deploy-model command:

gcloud alpha ai endpoints deploy-model $ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION \
    --model=$MODEL_ID \
    --display-name="DEPLOYED_MODEL_NAME" \
    --machine-type=a3-highgpu-8g \
    --traffic-split=0=100 \
    --accelerator=type=nvidia-h100-80gb,count=8 \
    --multihost-gpu-node-count=2
Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).
Deploying large models like DeepSeek-V3 can take longer than the default deployment timeout. If the deploy-model command times out, the deployment process continues to run in the background.

The deploy-model command returns an operation ID that can be used to check when the operation is finished. You can poll for the status of the operation until the response includes "done": true. Use the following command to poll the status:

gcloud ai operations describe \
    --region=LOCATION \
    OPERATION_ID
Replace OPERATION_ID with the operation ID that was returned by the previous command.
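If you prefer to wait in a loop rather than poll manually, the following sketch checks the operation every 60 seconds until it reports done. It assumes OPERATION_ID and LOCATION are replaced as described above:

while true; do
  # The "done" field is printed as True once the long-running operation finishes.
  DONE=$(gcloud ai operations describe OPERATION_ID \
      --region=LOCATION \
      --format="value(done)")
  if [ "$DONE" = "True" ]; then
    echo "Deployment operation finished."
    break
  fi
  echo "Deployment still in progress; checking again in 60 seconds..."
  sleep 60
done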
Get online predictions from the deployed model
This section describes how to send an online prediction request to the dedicated public endpoint where the DeepSeek-V3 model is deployed.
Get the project number by running the gcloud projects describe command:

PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
Send a raw predict request:
curl \
    -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}:rawPredict \
    -d '{
      "prompt": "Write a short story about a robot.",
      "stream": false,
      "max_tokens": 50,
      "temperature": 0.7
    }'
Send a chat completion request:
curl \
    -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}/chat/completions \
    -d '{
      "stream": false,
      "messages": [{"role": "user", "content": "Summer travel plan to Paris"}],
      "max_tokens": 40,
      "temperature": 0.4,
      "top_k": 10,
      "top_p": 0.95,
      "n": 1
    }'
To enable streaming, change the value of "stream" from false to true.
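For example, a streaming version of the previous chat completion request looks like the following; only the "stream" value changes, and the endpoint then returns the response incrementally:

# Same request as above, with streaming enabled.
curl \
    -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}/chat/completions \
    -d '{"stream": true, "messages": [{"role": "user", "content": "Summer travel plan to Paris"}], "max_tokens": 40, "temperature": 0.4, "top_k": 10, "top_p": 0.95, "n": 1}'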
Clean up
To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created during this tutorial:
To undeploy the model from the endpoint and delete the endpoint, run the following commands:
ENDPOINT_ID=$(gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_DISPLAY_NAME \
    --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
    --region=LOCATION \
    --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
    --region=LOCATION \
    --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
    --region=LOCATION \
    --quiet
To delete your model, run the following commands:
MODEL_ID=$(gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=MODEL_DISPLAY_NAME \
    --format="value(name)")

gcloud ai models delete $MODEL_ID \
    --region=LOCATION \
    --quiet
What's next
- For comprehensive reference information about multi-host GPU deployment on Vertex AI Prediction with vLLM, see vLLM serving for text and multimodal language models.
- Learn to create your own vLLM multi-node image. Note that your custom container image needs to be compatible with Vertex AI Prediction.