Call MaaS APIs for open models

Many open models on Vertex AI are offered as fully managed, serverless APIs through the Vertex AI Chat Completions API. For these models, there's no need to provision or manage infrastructure.

You can stream responses to reduce perceived latency for end users. A streamed response uses server-sent events (SSE) to return the response incrementally.

This page shows how to make streaming and non-streaming calls to open models that support the OpenAI chat completions API. For Llama-specific considerations, see Request Llama predictions.

Before you begin

To use open models with Vertex AI, perform the following steps. The Vertex AI API (aiplatform.googleapis.com) must be enabled to use Vertex AI. If you already have a project with the Vertex AI API enabled, you can use that project instead of creating a new one.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Vertex AI API. If you prefer the command line, you can also enable it with the gcloud CLI, as shown in the sketch after this list.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

  5. Go to the Model Garden model card for the model you want to use, then click Enable to enable the model for use in your project.

    Go to Model Garden
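
If you prefer the command line, one way to enable the Vertex AI API is with the gcloud CLI. This sketch assumes that the gcloud CLI is installed and authenticated; PROJECT_ID is a placeholder for your project ID:

gcloud services enable aiplatform.googleapis.com --project=PROJECT_ID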

Make a streaming call to an open model

The following sample makes a streaming call to an open model:

Python

Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Before running this sample, make sure to set the OPENAI_BASE_URL environment variable. For more information, see Authentication and credentials.

from openai import OpenAI
client = OpenAI()

stream = client.chat.completions.create(
    model="MODEL",
    messages=[{"role": "ROLE", "content": "CONTENT"}],
    max_tokens=MAX_OUTPUT_TOKENS,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

Replace the following:

  • MODEL: The model name you want to use, for example deepseek-ai/deepseek-v3.1-maas.
  • ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
  • CONTENT: The content, such as text, of the user or assistant message.
  • MAX_OUTPUT_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

    Specify a lower value for shorter responses and a higher value for potentially longer responses.
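
The sample assumes that the OpenAI client can find the endpoint URL and credentials in your environment (for example, through OPENAI_BASE_URL). As a rough sketch of one way to configure the client in code instead, using the same endpoint format as the REST example that follows (PROJECT_ID and LOCATION are placeholders you supply; this is an illustration, not the only supported setup):

from google.auth import default
from google.auth.transport.requests import Request
from openai import OpenAI

# Obtain a short-lived access token from Application Default Credentials.
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

# Point the client at the Vertex AI OpenAI-compatible endpoint and use the
# access token as the API key. Replace PROJECT_ID and LOCATION with your values.
client = OpenAI(
    base_url="https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi",
    api_key=credentials.token,
)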

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • LOCATION: A region that supports open models.
  • MODEL: The model name you want to use, for example deepseek-ai/deepseek-v2.
  • ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
  • CONTENT: The content, such as text, of the user or assistant message.
  • MAX_OUTPUT_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

    Specify a lower value for shorter responses and a higher value for potentially longer responses.

  • STREAM: A boolean that specifies whether the response is streamed or not. Stream your response to reduce perceived latency for end users. Set to true to stream the response and false to return the response all at once.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions

Request JSON body:

{
  "model": "MODEL",
  "messages": [
    {
      "role": "ROLE",
      "content": "CONTENT"
    }
  ],
  "max_tokens": MAX_OUTPUT_TOKENS,
  "stream": true
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content

You should receive a JSON response similar to the following.
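
The exact fields depend on the model, but a streamed response arrives as a series of server-sent events in the standard chat completions chunk format. The following is an illustrative sketch only; the values are placeholders, not real model output:

data: {"id":"...","object":"chat.completion.chunk","created":1700000000,"model":"MODEL","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}

data: {"id":"...","object":"chat.completion.chunk","created":1700000000,"model":"MODEL","choices":[{"index":0,"delta":{"content":", world."},"finish_reason":null}]}

data: {"id":"...","object":"chat.completion.chunk","created":1700000000,"model":"MODEL","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}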

Make a non-streaming call to an open model

The following sample makes a non-streaming call to an open model:

Python

Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Before running this sample, make sure to set the OPENAI_BASE_URL environment variable. For more information, see Authentication and credentials.

from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
    model="MODEL",
    messages=[{"role": "ROLE", "content": "CONTENT"}],
    max_tokens=MAX_OUTPUT_TOKENS,
    stream=False,
)
print(completion.choices[0].message)

Replace the following:

  • MODEL: The model name you want to use, for example deepseek-ai/deepseek-v3.1-maas.
  • ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
  • CONTENT: The content, such as text, of the user or assistant message.
  • MAX_OUTPUT_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

    Specify a lower value for shorter responses and a higher value for potentially longer responses.
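
The sample prints the full message object. If you only need the generated text, or token counts when the endpoint reports usage, a brief follow-up sketch:

# Print only the generated text.
print(completion.choices[0].message.content)

# Token accounting, if the endpoint returns usage information.
if completion.usage:
    print(completion.usage.total_tokens)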

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • LOCATION: A region that supports open models.
  • MODEL: The model name you want to use, for example deepseek-ai/deepseek-v2.
  • ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
  • CONTENT: The content, such as text, of the user or assistant message.
  • MAX_OUTPUT_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

    Specify a lower value for shorter responses and a higher value for potentially longer responses.

  • STREAM: A boolean that specifies whether the response is streamed or not. Stream your response to reduce perceived latency for end users. Set to true to stream the response and false to return the response all at once.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions

Request JSON body:

{
  "model": "MODEL",
  "messages": [
    {
      "role": "ROLE",
      "content": "CONTENT"
    }
  ],
  "max_tokens": MAX_OUTPUT_TOKENS,
  "stream": false
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content

You should receive a JSON response similar to the following.
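
The exact fields depend on the model, but the response follows the standard chat completions shape. The following is an illustrative sketch only; the values are placeholders, not real model output:

{
  "id": "...",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "MODEL",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 25,
    "total_tokens": 35
  }
}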

Regional and global endpoints

When you use a regional endpoint, requests are served from the region that you specify. Use a regional endpoint if you have data residency requirements or if a model doesn't support the global endpoint.

When you use the global endpoint, Google can process and serve your requests from any region that is supported by the model that you are using. This might result in higher latency in some cases. The global endpoint helps improve overall availability and helps reduce errors.

There is no price difference between the global endpoint and the regional endpoints. However, global endpoint quotas and supported model capabilities can differ from those of the regional endpoints. For more information, view the related third-party model page.

Specify the global endpoint

To use the global endpoint, set the region to global.

For example, the request URL for a curl command uses the following format: https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/endpoints/openapi

For the Vertex AI SDK, a regional endpoint is the default. Set the region to GLOBAL to use the global endpoint.
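
For example, with the OpenAI client used in the samples on this page, one way to target the global endpoint is to build the base URL without a regional prefix. This is a minimal sketch; PROJECT_ID is a placeholder, and ACCESS_TOKEN stands for a token obtained as in the earlier setup sketch:

from openai import OpenAI

# The global endpoint host has no regional prefix, and the location is "global".
client = OpenAI(
    base_url="https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/endpoints/openapi",
    api_key=ACCESS_TOKEN,  # for example, credentials.token from Application Default Credentials
)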

Restrict global API endpoint usage

To help enforce the use of regional endpoints, use the constraints/gcp.restrictEndpointUsage organization policy constraint to block requests to the global API endpoint. For more information, see Restricting endpoint usage.
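
As a rough sketch of what such a policy might look like, assuming the constraint accepts service endpoint hostnames as values (verify the exact value format in Restricting endpoint usage before applying it), save a policy file such as the following and apply it with gcloud org-policies set-policy policy.json:

{
  "name": "organizations/ORGANIZATION_ID/policies/gcp.restrictEndpointUsage",
  "spec": {
    "rules": [
      {
        "values": {
          "deniedValues": ["aiplatform.googleapis.com"]
        }
      }
    ]
  }
}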

What's next