Many open models on Vertex AI offer fully managed and serverless models as APIs using the Vertex AI Chat Completions API. For these models, there's no need to provision or manage infrastructure.
You can stream your responses to reduce the end-user latency perception. A streamed response uses server-sent events (SSE) to incrementally stream the response.
This page shows how to make streaming and non-streaming calls to open models that support the OpenAI chat completions API. For Llama-specific considerations, see Request Llama predictions.
Before you begin
To use open models with Vertex AI, you must perform the
following steps. The Vertex AI API
(aiplatform.googleapis.com
) must be enabled to use
Vertex AI. If you already have an existing project with the
Vertex AI API enabled, you can use that project instead of creating a
new project.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
-
Create a project: To create a project, you need the Project Creator
(
roles/resourcemanager.projectCreator
), which contains theresourcemanager.projects.create
permission. Learn how to grant roles.
-
Verify that billing is enabled for your Google Cloud project.
-
Enable the Vertex AI API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (
roles/serviceusage.serviceUsageAdmin
), which contains theserviceusage.services.enable
permission. Learn how to grant roles. -
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
-
Create a project: To create a project, you need the Project Creator
(
roles/resourcemanager.projectCreator
), which contains theresourcemanager.projects.create
permission. Learn how to grant roles.
-
Verify that billing is enabled for your Google Cloud project.
-
Enable the Vertex AI API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (
roles/serviceusage.serviceUsageAdmin
), which contains theserviceusage.services.enable
permission. Learn how to grant roles. - Go to the Model Garden model card for the model you want to use, then click Enable to enable the model for use in your project.
Make a streaming call to an open model
The following sample makes a streaming call to an open model:
Python
Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Before running this sample, make sure to set the OPENAI_BASE_URL
environment variable.
For more information, see Authentication and credentials.
from openai import OpenAI client = OpenAI() stream = client.chat.completions.create( model="MODEL", messages=[{"role": "ROLE", "content": "CONTENT"}], max_tokens=MAX_OUTPUT_TOKENS, stream=True, ) for chunk in stream: print(chunk.choices[0].delta.content or "", end="")
- MODEL: The model name you want to use,
for example
deepseek-ai/deepseek-v3.1-maas
. - ROLE: The role associated with a
message. You can specify a
user
or anassistant
. The first message must use theuser
role. The models operate with alternatinguser
andassistant
turns. If the final message uses theassistant
role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response. - CONTENT: The content, such as
text, of the
user
orassistant
message. - MAX_OUTPUT_TOKENS:
Maximum number of tokens that can be generated in the response. A token is
approximately four characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.
Before using any of the request data, make the following replacements:
- LOCATION: A region that supports open models.
- MODEL: The model name you want to use,
for example
deepseek-ai/deepseek-v2
. - ROLE: The role associated with a
message. You can specify a
user
or anassistant
. The first message must use theuser
role. The models operate with alternatinguser
andassistant
turns. If the final message uses theassistant
role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response. - CONTENT: The content, such as
text, of the
user
orassistant
message. - MAX_OUTPUT_TOKENS:
Maximum number of tokens that can be generated in the response. A token is
approximately four characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
- STREAM: A boolean that specifies
whether the response is streamed or not. Stream your response to reduce the
end-use latency perception. Set to
true
to stream the response andfalse
to return the response all at once.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions
Request JSON body:
{ "model": "MODEL", "messages": [ { "role": "ROLE", "content": "CONTENT" } ], "max_tokens": MAX_OUTPUT_TOKENS, "stream": true }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json
,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"
PowerShell
Save the request body in a file named request.json
,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content
You should receive a JSON response similar to the following.
Make a non-streaming call to an open model
The following sample makes a non-streaming call to an open model:
Python
Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Before running this sample, make sure to set the OPENAI_BASE_URL
environment variable.
For more information, see Authentication and credentials.
from openai import OpenAI client = OpenAI() completion = client.chat.completions.create( model="MODEL", messages=[{"role": "ROLE", "content": "CONTENT"}], max_tokens=MAX_OUTPUT_TOKENS, stream=False, ) print(completion.choices[0].message)
- MODEL: The model name you want to use,
for example
deepseek-ai/deepseek-v3.1-maas
. - ROLE: The role associated with a
message. You can specify a
user
or anassistant
. The first message must use theuser
role. The models operate with alternatinguser
andassistant
turns. If the final message uses theassistant
role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response. - CONTENT: The content, such as
text, of the
user
orassistant
message. - MAX_OUTPUT_TOKENS:
Maximum number of tokens that can be generated in the response. A token is
approximately four characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.
Before using any of the request data, make the following replacements:
- LOCATION: A region that supports open models.
- MODEL: The model name you want to use,
for example
deepseek-ai/deepseek-v2
. - ROLE: The role associated with a
message. You can specify a
user
or anassistant
. The first message must use theuser
role. The models operate with alternatinguser
andassistant
turns. If the final message uses theassistant
role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response. - CONTENT: The content, such as
text, of the
user
orassistant
message. - MAX_OUTPUT_TOKENS:
Maximum number of tokens that can be generated in the response. A token is
approximately four characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
- STREAM: A boolean that specifies
whether the response is streamed or not. Stream your response to reduce the
end-use latency perception. Set to
true
to stream the response andfalse
to return the response all at once.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions
Request JSON body:
{ "model": "MODEL", "messages": [ { "role": "ROLE", "content": "CONTENT" } ], "max_tokens": MAX_OUTPUT_TOKENS, "stream": false }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json
,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"
PowerShell
Save the request body in a file named request.json
,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content
You should receive a JSON response similar to the following.
Regional and global endpoints
For regional endpoints, requests are served from your specified region. In cases where you have data residency requirements or if a model doesn't support the global endpoint, use the regional endpoints.
When you use the global endpoint, Google can process and serve your requests from any region that is supported by the model that you are using. This might result in higher latency in some cases. The global endpoint helps improve overall availability and helps reduce errors.
There is no price difference with the regional endpoints when you use the global endpoint. However, the global endpoint quotas and supported model capabilities can differ from the regional endpoints. For more information, view the related third-party model page.
Specify the global endpoint
To use the global endpoint, set the region to global
.
For example, the request URL for a curl command uses the following format:
https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/endpoints/openapi
For the Vertex AI SDK, a regional endpoint is the default. Set the
region to GLOBAL
to use the global endpoint.
Restrict global API endpoint usage
To help enforce the use of regional endpoints, use the
constraints/gcp.restrictEndpointUsage
organization policy constraint to block
requests to the global API endpoint. For more information, see Restricting
endpoint usage.
What's next
- Learn how to use Function calling.
- Learn about Structured output.
- Learn about Batch predictions.