Serve a DeepSeek-V3 model using multi-host GPU deployment

Overview

Vertex AI Prediction supports multi-host GPU deployment for serving models that exceed the memory capacity of a single GPU node, such as DeepSeek-V3, DeepSeek-R1, and Meta Llama 3.1 405B (non-quantized version).

This guide describes how to serve a DeepSeek-V3 model using multi-host graphics processing units (GPUs) on Vertex AI Prediction with vLLM. Setup for other models is similar. For more information, see vLLM serving for text and multimodal language models.

Before you begin, ensure that you're familiar with Vertex AI Prediction and with serving models using vLLM.

Use the Pricing Calculator to generate a cost estimate based on your projected usage.

Containers

To support multi-host deployments, this guide uses a prebuilt vLLM container image with Ray integration from Model Garden. Ray enables the distributed processing required to run models across multiple GPU nodes. This container also supports serving streaming requests by using the Chat Completions API.

If desired, you can create your own vLLM multi-node image. Note that this custom container image needs to be compatible with Vertex AI Prediction.

Before you begin

Before you begin your model deployment, complete the prerequisites listed in this section.

Set up a Google Cloud project

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Vertex AI API.

    Enable the API

  5. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
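
    The gcloud commands in this guide pass --project and --region explicitly. Optionally, you can set defaults so that these flags can be omitted. The following is a convenience sketch; the ai/region property applies to gcloud ai commands:

    gcloud config set project PROJECT_ID
    gcloud config set ai/region LOCATION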

Request GPU quota

To deploy DeepSeek-V3, you need two a3-highgpu-8g VMs with eight H100 GPUs each, for a total of 16 H100 GPUs. It's likely that you'll need to request an H100 GPU quota increase, as the default value is less than 16.

  1. To view the H100 GPU quota, go to the Google Cloud console Quotas & System Limits page.

    Go to Quotas & System Limits

  2. Request a quota adjustment.
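
You can also check the current limit from the command line. The following is a minimal sketch that reads the Compute Engine per-region quota metrics and filters for H100 entries; the Vertex AI serving quota shown on the Quotas & System Limits page is the authoritative value for this deployment:

    # List per-region quota metrics and keep only the H100 entries.
    gcloud compute regions describe LOCATION \
        --flatten="quotas[]" \
        --format="table(quotas.metric,quotas.limit,quotas.usage)" \
        | grep H100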

Upload the model

  1. To upload your model as a Model resource to Vertex AI Prediction, run the gcloud ai models upload command as follows:

    gcloud ai models upload \
        --region=LOCATION \
        --project=PROJECT_ID \
        --display-name=MODEL_DISPLAY_NAME \
        --container-image-uri=us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250130_0916_RC01 \
        --container-args='^;^/vllm-workspace/ray_launcher.sh;python;-m;vllm.entrypoints.api_server;--host=0.0.0.0;--port=8080;--model=deepseek-ai/DeepSeek-V3;--tensor-parallel-size=16;--pipeline-parallel-size=1;--gpu-memory-utilization=0.9;--trust-remote-code;--max-model-len=32768' \
        --container-deployment-timeout-seconds=4500 \
        --container-ports=8080 \
        --container-env-vars=MODEL_ID=deepseek-ai/DeepSeek-V3
    

    Make the following replacements:

    • LOCATION: the region where you are using Vertex AI
    • PROJECT_ID: the ID of your Google Cloud project
    • MODEL_DISPLAY_NAME: the display name you want for your model
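
    In --container-args, the leading ^;^ is gcloud's alternate-delimiter syntax: it tells gcloud to split the list on semicolons instead of the default commas. Expanded for readability, the container runs the following command:

    # Start the Ray cluster across both nodes, then launch the vLLM API
    # server sharded across all 16 GPUs (2 nodes x 8 GPUs each).
    /vllm-workspace/ray_launcher.sh \
        python -m vllm.entrypoints.api_server \
        --host=0.0.0.0 \
        --port=8080 \
        --model=deepseek-ai/DeepSeek-V3 \
        --tensor-parallel-size=16 \
        --pipeline-parallel-size=1 \
        --gpu-memory-utilization=0.9 \
        --trust-remote-code \
        --max-model-len=32768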

Create a dedicated online prediction endpoint

To support chat completion requests, the Model Garden container requires a dedicated endpoint. Dedicated endpoints are in preview and don't support the Google Cloud CLI, so you must use the REST API to create the endpoint.

  1. To create the dedicated endpoint, run the following command:

    PROJECT_ID=PROJECT_ID
    REGION=LOCATION
    ENDPOINT="${REGION}-aiplatform.googleapis.com"
    
    curl \
      -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      https://${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints \
      -d '{
        "displayName": "ENDPOINT_DISPLAY_NAME",
        "dedicatedEndpointEnabled": true
        }'
    

    Make the following replacements:

    • PROJECT_ID: the ID of your Google Cloud project
    • LOCATION: the region where you are using Vertex AI
    • ENDPOINT_DISPLAY_NAME: the display name for your endpoint
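
    The create call returns a long-running operation whose name embeds the new endpoint ID (projects/.../endpoints/ENDPOINT_ID/operations/OPERATION_ID). If you save the JSON response, you can extract the ID directly. The following is a minimal sketch, assuming the response was written to response.json and that jq is available (Cloud Shell includes it); the next section shows the equivalent lookup by display name with gcloud:

    # Pull the endpoint ID out of the operation name in the response.
    ENDPOINT_ID=$(jq -r '.name' response.json \
        | sed -E 's|.*/endpoints/([0-9]+)/.*|\1|')
    echo "ENDPOINT_ID=${ENDPOINT_ID}"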

Deploy the model

  1. Get the endpoint ID for the online prediction endpoint by running the gcloud ai endpoints list command:

    ENDPOINT_ID=$(gcloud ai endpoints list \
     --project=PROJECT_ID \
     --region=LOCATION \
     --filter=display_name~'ENDPOINT_DISPLAY_NAME' \
     --format="value(name)")
    
  2. Get the model ID for your model by running the gcloud ai models list command:

    MODEL_ID=$(gcloud ai models list \
     --project=PROJECT_ID \
     --region=LOCATION \
     --filter=display_name~'MODEL_DISPLAY_NAME' \
     --format="value(name)")
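
    Before deploying, confirm that both lookups succeeded; empty output means the display-name filter didn't match anything:

    echo "ENDPOINT_ID=${ENDPOINT_ID}"
    echo "MODEL_ID=${MODEL_ID}"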
    
  3. Deploy the model to the endpoint by running the gcloud alpha ai endpoints deploy-model command:

    gcloud alpha ai endpoints deploy-model $ENDPOINT_ID \
     --project=PROJECT_ID \
     --region=LOCATION \
     --model=$MODEL_ID \
     --display-name="DEPLOYED_MODEL_NAME" \
     --machine-type=a3-highgpu-8g \
     --traffic-split=0=100 \
     --accelerator=type=nvidia-h100-80gb,count=8 \
     --multihost-gpu-node-count=2
    

    Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).

    Deploying large models like DeepSeek-V3 can take longer than the default deployment timeout. If the deploy-model command times out, the deployment process continues to run in the background.

    The deploy-model command returns an operation ID that you can use to check the status of the deployment. Poll the operation until the response includes "done": true, using the following command:

    gcloud ai operations describe OPERATION_ID \
        --region=LOCATION
    

    Replace OPERATION_ID with the operation ID that was returned by the previous command.
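
    Rather than re-running the describe command by hand, you can poll in a loop. A minimal bash sketch:

    # Poll every 60 seconds until the deployment operation reports done.
    while [ "$(gcloud ai operations describe OPERATION_ID \
        --region=LOCATION --format='value(done)')" != "True" ]; do
      echo "Deployment still in progress..."
      sleep 60
    done
    echo "Deployment finished."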

Get online predictions from the deployed model

This section describes how to send an online prediction request to the dedicated public endpoint where the DeepSeek-V3 model is deployed.

  1. Get the project number by running the gcloud projects describe command:

    PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
    
  2. Send a raw predict request:

    curl \
    -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}:rawPredict \
    -d '{
       "prompt": "Write a short story about a robot.",
       "stream": false,
       "max_tokens": 50,
       "temperature": 0.7
       }'
    
  3. Send a chat completion request:

    curl \
    -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}/chat/completions \
    -d '{"stream":false, "messages":[{"role": "user", "content": "Summer travel plan to Paris"}], "max_tokens": 40,"temperature":0.4,"top_k":10,"top_p":0.95, "n":1}'
    

    To enable streaming, change the value of "stream" from false to true.
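
    For example, the following streaming variant of the previous request returns tokens as they're generated; curl's -N flag disables output buffering:

    curl -N \
    -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}/chat/completions \
    -d '{"stream":true, "messages":[{"role": "user", "content": "Summer travel plan to Paris"}], "max_tokens": 40,"temperature":0.4,"top_k":10,"top_p":0.95, "n":1}'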

Clean up

To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created during this tutorial:

  1. To undeploy the model from the endpoint and delete the endpoint, run the following commands:

    ENDPOINT_ID=$(gcloud ai endpoints list \
       --region=LOCATION \
       --filter=display_name=ENDPOINT_DISPLAY_NAME \
       --format="value(name)")
    
    DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
       --region=LOCATION \
       --format="value(deployedModels.id)")
    
    gcloud ai endpoints undeploy-model $ENDPOINT_ID \
      --region=LOCATION \
      --deployed-model-id=$DEPLOYED_MODEL_ID
    
    gcloud ai endpoints delete $ENDPOINT_ID \
       --region=LOCATION \
       --quiet
    
  2. To delete your model, run the following commands:

    MODEL_ID=$(gcloud ai models list \
       --region=LOCATION \
       --filter=display_name=MODEL_DISPLAY_NAME \
       --format="value(name)")
    
    gcloud ai models delete $MODEL_ID \
       --region=LOCATION \
       --quiet
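
    To confirm that cleanup succeeded, list the remaining endpoints and models in the region; both commands should return empty output:

    gcloud ai endpoints list --region=LOCATION
    gcloud ai models list --region=LOCATION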
    
