Llama models on Vertex AI offer fully managed and serverless models as APIs. To use a Llama model on Vertex AI, send a request directly to the Vertex AI API endpoint. Because Llama models use a managed API, there's no need to provision or manage infrastructure.
You can stream your responses to reduce the end-user latency perception. A streamed response uses server-sent events (SSE) to incrementally stream the response.
Available Llama models
The following Llama models are available from Meta to use in Vertex AI. To access a Llama model, go to its Model Garden model card.
Llama 3.2
Llama 3.2 lets developers to build and deploy the latest generative AI models and applications that use the latest Llama's capabilities, such as image reasoning. Llama 3.2 is also designed to be more accessible for on-device applications.
Go to the Llama 3.2 model cardThere are no charges during the Preview period. If you require a production-ready service, use the self-hosted Llama models.
Considerations
When using llama-3.2-90b-vision-instruct-maas
, there are no restriction when you send
text-only prompts. However, if you include an image in your prompt, the image
must be at beginning of your prompt, and you can include only one image. You
cannot, for example, include some text and then an image.
Llama 3.1
Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
Llama 3.1 405B is Generally Available. You are charged as you use the model (pay as you go). For pay-as-you-go pricing, see Llama model pricing on the Vertex AI pricing page.
The other Llama 3.1 models are in Preview. There are no charges for the Preview models. If you require a production-ready service, use the self-hosted Llama models.
Go to the Llama 3.1 model cardUse Llama models
When you send requests to use Llama's models, use the following model names:
- For Llama 3.2 90B (Preview), use
llama-3.2-90b-vision-instruct-maas
. - For Llama 3.1 405B (GA), use
llama3-405b-instruct-maas
. - For Llama 3.1 70B (Preview), use
llama3-70b-instruct-maas
. - For Llama 3.1 8B (Preview), use
llama3-8b-instruct-maas
.
We recommend that you use the model versions that include a suffix that
starts with an @
symbol because of the possible differences between
model versions. If you don't specify a model version, the latest version is
always used, which can inadvertently affect your workflows when a model version
changes.
Before you begin
To use Llama models with Vertex AI, you must perform the
following steps. The Vertex AI API
(aiplatform.googleapis.com
) must be enabled to use
Vertex AI. If you already have an existing project with the
Vertex AI API enabled, you can use that project instead of creating a
new project.
Make sure you have the required permissions to enable and use partner models. For more information, see Grant the required permissions.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the Vertex AI API.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the Vertex AI API.
- Go to one of the following Model Garden model cards, then click enable:
Make a streaming call to a Llama model
The following sample makes a streaming call to a Llama model.
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.
Before using any of the request data, make the following replacements:
- LOCATION: A region that supports Llama models.
- MODEL: The model name you want to use.
- ROLE: The role associated with a
message. You can specify a
user
or anassistant
. The first message must use theuser
role. The models operate with alternatinguser
andassistant
turns. If the final message uses theassistant
role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response. - CONTENT: The content, such as
text, of the
user
orassistant
message. - MAX_OUTPUT_TOKENS:
Maximum number of tokens that can be generated in the response. A token is
approximately four characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
- STREAM: A boolean that specifies
whether the response is streamed or not. Stream your response to reduce the
end-use latency perception. Set to
true
to stream the response andfalse
to return the response all at once. - ENABLE_LLAMA_GUARD: A boolean that specifies whether to enable Llama Guard on your inputs and outputs. By default, Llama Guard is enabled and flags responses if it determines they are unsafe.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions
Request JSON body:
{ "model": "meta/MODEL", "messages": [ { "role": "ROLE", "content": "CONTENT" } ], "max_tokens": MAX_OUTPUT_TOKENS, "stream": true, "extra_body": { "google": { "model_safety_settings": { "enabled": ENABLE_LLAMA_GUARD, "llama_guard_settings": {} } } } }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json
,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"
PowerShell
Save the request body in a file named request.json
,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content
You should receive a JSON response similar to the following.
Make a unary call to a Llama model
The following sample makes a unary call to a Llama model.
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.
Before using any of the request data, make the following replacements:
- LOCATION: A region that supports Llama models.
- MODEL: The model name you want to use.
- ROLE: The role associated with a
message. You can specify a
user
or anassistant
. The first message must use theuser
role. The models operate with alternatinguser
andassistant
turns. If the final message uses theassistant
role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response. - CONTENT: The content, such as
text, of the
user
orassistant
message. - MAX_OUTPUT_TOKENS:
Maximum number of tokens that can be generated in the response. A token is
approximately four characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
- STREAM: A boolean that specifies
whether the response is streamed or not. Stream your response to reduce the
end-use latency perception. Set to
true
to stream the response andfalse
to return the response all at once. - ENABLE_LLAMA_GUARD: A boolean that specifies whether to enable Llama Guard on your inputs and outputs. By default, Llama Guard is enabled and flags responses if it determines they are unsafe.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions
Request JSON body:
{ "model": "meta/MODEL", "messages": [ { "role": "ROLE", "content": "CONTENT" } ], "max_tokens": MAX_OUTPUT_TOKENS, "stream": false, "extra_body": { "google": { "model_safety_settings": { "enabled": ENABLE_LLAMA_GUARD, "llama_guard_settings": {} } } } }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json
,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"
PowerShell
Save the request body in a file named request.json
,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content
You should receive a JSON response similar to the following.
Flagged responses
By default, Llama Guard is enabled on all predictions that you make with Llama 3.1 models. Llama Guard helps safeguard responses by checking inputs and outputs. If Llama Guard determines they are unsafe, it flags the response.
If you want to disable Llama Guard, modify the model safety setting. For more
information, see the model_safety_settings
field in the
streaming or unary example.
Llama model region availability and quotas
For Llama models, a quota applies for each region where the model is available. The quota is specified in queries per minute (QPM).
The supported regions, default quotas, and maximum context length for each Llama model is listed in the following tables:
Llama 3.2 90B (Preview)
Region | Quota system | Supported context length |
---|---|---|
us-central1 |
30 QPM | 128,000 tokens |
Llama 3.1 405B (GA)
Region | Quota system | Supported context length |
---|---|---|
us-central1 |
60 QPM | 128,000 tokens |
Llama 3.1 70B (Preview)
Region | Quota system | Supported context length |
---|---|---|
us-central1 |
60 QPM | 128,000 tokens |
Llama 3.1 8B (Preview)
Region | Quota system | Supported context length |
---|---|---|
us-central1 |
60 QPM | 128,000 tokens |
If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.