Request Llama predictions

You can use curl commands to send requests to the Vertex AI endpoint using the following model names:

For Llama 4 Maverick 17B-128E, use llama-4-maverick-17b-128e-instruct-maas
For Llama 4 Scout 17B-16E, use llama-4-scout-17b-16e-instruct-maas
For Llama 3.3 70B, use llama-3.3-70b-instruct-maas
For Llama 3.2 90B, use llama-3.2-90b-vision-instruct-maas
For Llama 3.1 405B, use llama-3.1-405b-instruct-maas
For Llama 3.1 70B, use llama-3.1-70b-instruct-maas
For Llama 3.1 8B, use llama-3.1-8b-instruct-maas

To learn how to make streaming and non-streaming calls to Llama models, see Call MaaS APIs for open models.

Before you begin

To use Llama models with Vertex AI, you must perform the following steps. The Vertex AI API (aiplatform.googleapis.com) must be enabled to use Vertex AI. If you already have an existing project with the Vertex AI API enabled, you can use that project instead of creating a new project.

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

Go to one of the following Model Garden model cards, then click Enable:

Make a streaming call to a Llama model

The following sample makes a streaming call to a Llama model.

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

LOCATION: A region that supports Llama models.
MODEL: The model name you want to use.
ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
CONTENT: The content, such as text, of the user or assistant message.
MAX_OUTPUT_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
STREAM: A boolean that specifies whether the response is streamed or not. Stream your response to reduce the end-use latency perception. Set to true to stream the response and false to return the response all at once.
ENABLE_LLAMA_GUARD: A boolean that specifies whether to enable Llama Guard on your inputs and outputs. By default, Llama Guard is enabled and flags responses if it determines they are unsafe.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions

Request JSON body:

{
  "model": "meta/MODEL",
  "messages": [
    {
      "role": "ROLE",
      "content": "CONTENT"
    }
  ],
  "max_tokens": MAX_OUTPUT_TOKENS,
  "stream": true,
  "extra_body": {
    "google": {
      "model_safety_settings": {
        "enabled": ENABLE_LLAMA_GUARD,
        "llama_guard_settings": {}
      }
    }
  }
}

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login , or by using Cloud Shell, which automatically logs you into the gcloud CLI . You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login . You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Response

data: {
  "choices": [
    {
      "delta": {
        "content": "CONTENT",
        "role": "assistant",
        "refusal: "REFUSAL_REASON" #If using Llama Guard and response was flagged by Llama Guard
      },
      "index": 0
    }
  ],
  "model": "meta/MODEL_NAME",
  "object": "chat.completion.chunk"
}

data: {
  "choices": [
    {
      "delta": {
        "content": "CONTENT",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "model": "meta/MODEL_NAME",
  "object": "chat.completion.chunk",
  "usage": {
    "completion_tokens": 131,
    "prompt_tokens": 14,
    "total_tokens": 145
  }
}

Make a unary call to a Llama model

The following sample makes a unary call to a Llama model.

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

LOCATION: A region that supports Llama models.
MODEL: The model name you want to use.
ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
CONTENT: The content, such as text, of the user or assistant message.
MAX_OUTPUT_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
STREAM: A boolean that specifies whether the response is streamed or not. Stream your response to reduce the end-use latency perception. Set to true to stream the response and false to return the response all at once.
ENABLE_LLAMA_GUARD: A boolean that specifies whether to enable Llama Guard on your inputs and outputs. By default, Llama Guard is enabled and flags responses if it determines they are unsafe.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions

Request JSON body:

{
  "model": "meta/MODEL",
  "messages": [
    {
      "role": "ROLE",
      "content": "CONTENT"
    }
  ],
  "max_tokens": MAX_OUTPUT_TOKENS,
  "stream": false,
  "extra_body": {
    "google": {
      "model_safety_settings": {
        "enabled": ENABLE_LLAMA_GUARD,
        "llama_guard_settings": {}
      }
    }
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Response

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "CONTENT",
        "role": "assistant",
        "refusal: "REFUSAL_REASON" #If using Llama Guard and response was flagged by Llama Guard
      }
    }
  ],
  "model": "meta/llama3-405b-instruct-maas",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 367,
    "prompt_tokens": 14,
    "total_tokens": 381
  }
}

Flagged responses

By default, Llama Guard 3 8B is enabled on all predictions that you make with Llama 3.3 and Llama 3.1 models. By default, Llama Guard 3 11B vision is enabled on all predictions that you make with for Llama 3.2 models. Llama Guard helps safeguard responses by checking inputs and outputs. If Llama Guard determines they are unsafe, it flags the response.

If you want to disable Llama Guard, modify the model safety setting. For more information, see the model_safety_settings field in the streaming or unary example.

Use Vertex AI Studio

For Llama models, you can use Vertex AI Studio to quickly prototype and test generative AI models in the Google Cloud console. As an example, you can use Vertex AI Studio to compare Llama model responses with other supported models such as Google's Gemini.

For more information, see Quickstart: Send text prompts to Gemini using Vertex AI Studio.

Llama model region availability and quotas

For Llama models, a quota applies for each region where the model is available. The quota is specified in queries per minute (QPM).

Model	Region	Quotas	Context length
Llama 4 Maverick 17B-128E
Llama 4 Maverick 17B-128E	`us-east5`		524,288
Llama 4 Scout 17B-16E
Llama 4 Scout 17B-16E	`us-east5`		1,310,720
Llama 3.3 70B
Llama 3.3 70B	`us-central1`	QPM: 100	128,000
Llama 3.2 90B
Llama 3.2 90B	`us-central1`	QPM: 30	128,000
Llama 3.1 405B
Llama 3.1 405B	`us-central1`	QPM: 60	128,000
Llama 3.1 70B
Llama 3.1 70B	`us-central1`	QPM: 60	128,000
Llama 3.1 8B
Llama 3.1 8B	`us-central1`	QPM: 60	128,000

If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see the Cloud Quotas overview.

Request Llama predictions Stay organized with collections Save and categorize content based on your preferences.

Before you begin

Make a streaming call to a Llama model

REST

curl

PowerShell

Response

Make a unary call to a Llama model

REST

curl

PowerShell

Response

Flagged responses

Use Vertex AI Studio

Llama model region availability and quotas

Request Llama predictions