You can use curl commands to send requests to the Vertex AI endpoint
using the following model names: To use Llama models with Vertex AI, you must perform the
following steps. The Vertex AI API
( In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Verify that billing is enabled for your Google Cloud project.
Enable the Vertex AI API.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Verify that billing is enabled for your Google Cloud project.
Enable the Vertex AI API.
The following sample makes a streaming call to a Llama model.
After you
set up your environment,
you can use REST to test a text prompt. The following sample sends a request to the publisher
model endpoint.
Before using any of the request data,
make the following replacements:
Specify a lower value for shorter responses and a higher value for potentially longer
responses.
HTTP method and URL:
Request JSON body:
To send your request, choose one of these options:
Save the request body in a file named
Save the request body in a file named You should receive a JSON response similar to the following.
llama-4-maverick-17b-128e-instruct-maas
llama-4-scout-17b-16e-instruct-maas
llama-3.3-70b-instruct-maas
llama-3.2-90b-vision-instruct-maas
llama-3.1-405b-instruct-maas
llama-3.1-70b-instruct-maas
llama-3.1-8b-instruct-maas
Before you begin
aiplatform.googleapis.com
) must be enabled to use
Vertex AI. If you already have an existing project with the
Vertex AI API enabled, you can use that project instead of creating a
new project.
Make a streaming call to a Llama model
REST
user
or an assistant
.
The first message must use the user
role. The models
operate with alternating user
and assistant
turns.
If the final message uses the assistant
role, then the response
content continues immediately from the content in that message. You can use
this to constrain part of the model's response.user
or assistant
message.true
to stream the response
and false
to return the response all at once.POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions
{
"model": "meta/MODEL",
"messages": [
{
"role": "ROLE",
"content": "CONTENT"
}
],
"max_tokens": MAX_OUTPUT_TOKENS,
"stream": true,
"extra_body": {
"google": {
"model_safety_settings": {
"enabled": ENABLE_LLAMA_GUARD,
"llama_guard_settings": {}
}
}
}
}
curl
request.json
,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"PowerShell
request.json
,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content
Make a unary call to a Llama model
The following sample makes a unary call to a Llama model.
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.
Before using any of the request data, make the following replacements:
- LOCATION: A region that supports Llama models.
- MODEL: The model name you want to use.
- ROLE: The role associated with a
message. You can specify a
user
or anassistant
. The first message must use theuser
role. The models operate with alternatinguser
andassistant
turns. If the final message uses theassistant
role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response. - CONTENT: The content, such as
text, of the
user
orassistant
message. - MAX_OUTPUT_TOKENS:
Maximum number of tokens that can be generated in the response. A token is
approximately four characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
- STREAM: A boolean that specifies
whether the response is streamed or not. Stream your response to reduce the
end-use latency perception. Set to
true
to stream the response andfalse
to return the response all at once. - ENABLE_LLAMA_GUARD: A boolean that specifies whether to enable Llama Guard on your inputs and outputs. By default, Llama Guard is enabled and flags responses if it determines they are unsafe.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions
Request JSON body:
{ "model": "meta/MODEL", "messages": [ { "role": "ROLE", "content": "CONTENT" } ], "max_tokens": MAX_OUTPUT_TOKENS, "stream": false, "extra_body": { "google": { "model_safety_settings": { "enabled": ENABLE_LLAMA_GUARD, "llama_guard_settings": {} } } } }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json
,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"
PowerShell
Save the request body in a file named request.json
,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content
You should receive a JSON response similar to the following.
Flagged responses
By default, Llama Guard 3 8B is enabled on all predictions that you make with Llama 3.3 and Llama 3.1 models. By default, Llama Guard 3 11B vision is enabled on all predictions that you make with for Llama 3.2 models. Llama Guard helps safeguard responses by checking inputs and outputs. If Llama Guard determines they are unsafe, it flags the response.
If you want to disable Llama Guard, modify the model safety setting. For more
information, see the model_safety_settings
field in the
streaming or unary example.
Use Vertex AI Studio
For Llama models, you can use Vertex AI Studio to quickly prototype and test generative AI models in the Google Cloud console. As an example, you can use Vertex AI Studio to compare Llama model responses with other supported models such as Google's Gemini.
For more information, see Quickstart: Send text prompts to Gemini using Vertex AI Studio.
Llama model region availability and quotas
For Llama models, a quota applies for each region where the model is available. The quota is specified in queries per minute (QPM).
Model | Region | Quotas | Context length |
---|---|---|---|
Llama 4 Maverick 17B-128E | |||
us-east5 |
|
524,288 | |
Llama 4 Scout 17B-16E | |||
us-east5 |
|
1,310,720 | |
Llama 3.3 70B | |||
us-central1 |
|
128,000 | |
Llama 3.2 90B | |||
us-central1 |
|
128,000 | |
Llama 3.1 405B | |||
us-central1 |
|
128,000 | |
Llama 3.1 70B | |||
us-central1 |
|
128,000 | |
Llama 3.1 8B | |||
us-central1 |
|
128,000 |
If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.