Request predictions with Claude models

This guide shows how to send requests to Anthropic's Claude models using Vertex AI, covering the following topics:

Before you begin: Describes the initial setup steps required to enable and use the Claude models.
Choose an interaction method: Provides a comparison of the different ways to send requests to the models.
Use the Anthropic SDK: Explains how to make API requests programmatically using the Anthropic SDK.
Use a curl command: Shows how to send requests from the command line using curl.
Tool use (function calling): Demonstrates how to enhance model capabilities with function calling.
Use Vertex AI Studio: Details how to prototype and test models using the Google Cloud console UI.
Anthropic Claude quotas and region availability: Lists the quotas and available regions for each model.

The following diagram summarizes the overall workflow:

Before you begin

To use the Anthropic Claude models with Vertex AI, perform the following steps. You must enable the Vertex AI API (aiplatform.googleapis.com) to use Vertex AI. If you already have a project with the Vertex AI API enabled, you can use that project instead of creating a new one.

Make sure that you have the required permissions to enable and use partner models. For more information, see Grant the required permissions.

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Enable the API

Go to one of the following Model Garden model cards, then click Enable:

Anthropic recommends that you enable 30-day logging of your prompt and completion activity to record any model misuse. To enable logging, see Log requests and responses.

Choose an interaction method

You can send requests to Claude models in several ways. The following table provides an overview of the available options to help you decide which is best for your use case.

Method	Description	Use Case
Anthropic SDK	Programmatic access using the official Python or TypeScript SDKs.	Integrating Claude models into your applications.
curl command	Send requests directly to the REST API endpoint from your command line.	Quick testing, scripting, and environments without an SDK.
Vertex AI Studio	A web-based UI in the Google Cloud console for interactive prompting.	Rapid prototyping, model comparison, and no-code exploration.

You can use Anthropic's SDK or curl commands to send requests to the Vertex AI endpoint using the following model names:

For Claude Opus 4.1, use claude-opus-4-1@20250805
For Claude Opus 4, use claude-opus-4@20250514
For Claude Sonnet 4, use claude-sonnet-4@20250514
For Claude 3.7 Sonnet, use claude-3-7-sonnet@20250219
For Claude 3.5 Sonnet v2, use claude-3-5-sonnet-v2@20241022
For Claude 3.5 Haiku, use claude-3-5-haiku@20241022
For Claude 3.5 Sonnet, use claude-3-5-sonnet@20240620
For Claude 3 Opus, use claude-3-opus@20240229
For Claude 3 Haiku, use claude-3-haiku@20240307

Anthropic Claude model versions must be used with a suffix that starts with an @ symbol (such as claude-3-7-sonnet@20250219 or claude-3-5-haiku@20241022) to guarantee consistent behavior.
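
As a minimal sketch of the naming rule above, the following helper checks that a model name carries an @ version suffix before you send a request. The function name is_pinned_model_id and the assumption that the suffix is always an eight-digit date (as in every model name listed in this guide) are illustrative and not part of the Vertex AI API.

```python
import re

# Assumption: a pinned version is "@" followed by an 8-digit date,
# which matches every model name listed in this guide.
_PINNED_MODEL = re.compile(r"claude-[a-z0-9-]+@\d{8}")


def is_pinned_model_id(model_id: str) -> bool:
    """Return True if model_id ends with an @YYYYMMDD version suffix."""
    return _PINNED_MODEL.fullmatch(model_id) is not None


print(is_pinned_model_id("claude-3-7-sonnet@20250219"))
print(is_pinned_model_id("claude-3-7-sonnet"))
```

A check like this can run before each request so that an unversioned name fails fast in your own code.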

Use the Anthropic SDK

You can make API requests to the Anthropic Claude models using the Anthropic Claude SDK. To learn more, see the following:

Make a streaming call to a Claude model using the Anthropic Vertex SDK

The following code sample uses the Anthropic Vertex SDK to perform a streaming call to a Claude model.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

The following sample uses regional endpoints. To use the global endpoint, see Specify the global endpoint.

# TODO(developer): Vertex AI SDK - uncomment below & run
# pip3 install --upgrade --user google-cloud-aiplatform
# gcloud auth application-default login
# pip3 install -U 'anthropic[vertex]'

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"

from anthropic import AnthropicVertex

client = AnthropicVertex(project_id=PROJECT_ID, region="us-east5")
result = []

with client.messages.stream(
    model="claude-3-5-sonnet-v2@20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Send me a recipe for banana bread.",
        }
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
        result.append(text)

# Example response:
# Here's a simple recipe for delicious banana bread:
# Ingredients:
# - 2-3 ripe bananas, mashed
# - 1/3 cup melted butter
# ...
# ...
# 8. Bake for 50-60 minutes, or until a toothpick inserted into the center comes out clean.
# 9. Let cool in the pan for a few minutes, then remove and cool completely on a wire rack.

Make a unary call to a Claude model using the Anthropic Vertex SDK

The following code sample uses the Anthropic Vertex SDK to perform a unary call to a Claude model.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

The following sample uses regional endpoints. To use the global endpoint, see Specify the global endpoint.

# TODO(developer): Vertex AI SDK - uncomment below & run
# pip3 install --upgrade --user google-cloud-aiplatform
# gcloud auth application-default login
# pip3 install -U 'anthropic[vertex]'

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"

from anthropic import AnthropicVertex

client = AnthropicVertex(project_id=PROJECT_ID, region="us-east5")
message = client.messages.create(
    model="claude-3-5-sonnet-v2@20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Send me a recipe for banana bread.",
        }
    ],
)
print(message.model_dump_json(indent=2))
# Example response:
# {
#   "id": "msg_vrtx_0162rhgehxa9rvJM5BSVLZ9j",
#   "content": [
#     {
#       "text": "Here's a simple recipe for delicious banana bread:\n\nIngredients:\n- 2-3 ripe bananas...
#   ...

Use a curl command

You can use a curl command to send a request to the Vertex AI endpoint for a specific Claude model.

Anthropic Claude model versions must be used with a suffix that starts with an @ symbol (such as claude-3-7-sonnet@20250219 or claude-3-5-haiku@20241022) to guarantee consistent behavior.

The following section shows you how to create a curl command and includes a sample curl command.

REST

To test a text prompt by using the Vertex AI API, send a POST request to the publisher model endpoint.

The following sample uses regional endpoints. To use the global endpoint, see Specify the global endpoint.

Before using any of the request data, make the following replacements:

LOCATION: A region that supports Anthropic Claude models. To use the global endpoint, see Specify the global endpoint.
MODEL: The model name you want to use.
ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. Claude models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
STREAM: A boolean that specifies whether the response is streamed or not. Stream your response to reduce the latency that end users perceive. Set to true to stream the response and false to return the response all at once.
CONTENT: The content, such as text, of the user or assistant message.
MAX_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
TOP_P (Optional): Top-P changes how the model selects tokens for output. Tokens are selected from the most probable to least probable until the sum of their probabilities equals the top-P value. For example, if tokens A, B, and C have probabilities of 0.3, 0.2, and 0.1 and the top-P value is 0.5, then the model selects either A or B as the next token by using temperature and excludes C as a candidate.
Specify a lower value for less random responses and a higher value for more random responses.
TOP_K (Optional): Top-K changes how the model selects tokens for output. A top-K of 1 means the next selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding), while a top-K of 3 means that the next token is selected from among the three most probable tokens by using temperature.
For each token selection step, the top-K tokens with the highest probabilities are sampled. Then tokens are further filtered based on top-P with the final token selected using temperature sampling.
Specify a lower value for less random responses and a higher value for more random responses.
TYPE: For Claude 3.7 Sonnet and later Claude models, to enable extended thinking mode, specify enabled.
BUDGET_TOKENS: If you enable extended thinking, you must specify the number of tokens that the model can use for its internal reasoning as part of the output. Larger budgets can enable more thorough analysis for complex problems and improve response quality. You must specify a value greater than or equal to 1024 but less than MAX_TOKENS.
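
The interaction between TOP_K and TOP_P described above can be illustrated with a small sketch. This is not how Vertex AI or Claude implements sampling internally; the function name filter_candidates and the floating-point tolerance are assumptions chosen for illustration. The function keeps the top-K most probable tokens, then walks them from most to least probable until the cumulative probability reaches the top-P value, which reproduces the A, B, and C example.

```python
def filter_candidates(probs: dict[str, float], top_k: int, top_p: float) -> list[str]:
    """Return the tokens that remain candidates after top-K, then top-P filtering."""
    # Step 1: keep only the top_k most probable tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Step 2: accumulate probability mass until it reaches top_p.
    kept: list[str] = []
    cumulative = 0.0
    for token, probability in ranked:
        kept.append(token)
        cumulative += probability
        # Small tolerance (an assumption) to absorb floating-point rounding.
        if cumulative >= top_p - 1e-9:
            break
    return kept


probs = {"A": 0.3, "B": 0.2, "C": 0.1}
print(filter_candidates(probs, top_k=3, top_p=0.5))  # ['A', 'B']
print(filter_candidates(probs, top_k=1, top_p=0.5))  # ['A']
```

The final token would then be drawn from the remaining candidates by temperature sampling, which this sketch leaves out.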

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:streamRawPredict
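
If you build the URL in code, a sketch like the following fills in the placeholders. The helper name build_endpoint_url is hypothetical, and the sketch covers only the regional URL pattern and the two method names that appear in this guide (streamRawPredict and rawPredict); it does not cover the global endpoint.

```python
# Method names taken from the URLs shown in this guide.
_METHODS = ("streamRawPredict", "rawPredict")


def build_endpoint_url(project_id: str, location: str, model: str,
                       method: str = "streamRawPredict") -> str:
    """Fill in the regional publisher model endpoint URL."""
    if method not in _METHODS:
        raise ValueError(f"method must be one of {_METHODS}")
    return (
        f"https://{location}-aiplatform.googleapis.com/v1"
        f"/projects/{project_id}/locations/{location}"
        f"/publishers/anthropic/models/{model}:{method}"
    )


# "my-project" is a placeholder project ID.
print(build_endpoint_url("my-project", "us-east5", "claude-3-7-sonnet@20250219"))
```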

Request JSON body:

{
  "anthropic_version": "vertex-2023-10-16",
  "messages": [
   {
    "role": "ROLE",
    "content": "CONTENT"
   }],
  "max_tokens": MAX_TOKENS,
  "stream": STREAM,
  "thinking": {
    "type": "TYPE",
    "budget_tokens": BUDGET_TOKENS
  }
}
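
The request body above can also be assembled in code, which makes it easier to enforce the constraints listed for ROLE and BUDGET_TOKENS before sending. The following is a minimal sketch; the function name build_request_body is an assumption, and the checks are client-side conveniences that mirror this guide's descriptions rather than the service's own validation.

```python
import json


def build_request_body(messages, max_tokens, stream=False, budget_tokens=None):
    """Build a request body dict, optionally enabling extended thinking."""
    if not messages or messages[0]["role"] != "user":
        raise ValueError("The first message must use the user role.")

    body = {
        "anthropic_version": "vertex-2023-10-16",
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,
    }

    # Add the thinking block only when a budget is supplied.
    if budget_tokens is not None:
        if not 1024 <= budget_tokens < max_tokens:
            raise ValueError("budget_tokens must be >= 1024 and < max_tokens.")
        body["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
    return body


body = build_request_body(
    [{"role": "user", "content": "Hello!"}], max_tokens=2048, budget_tokens=1024
)
# Write the body to request.json for use with the curl command.
with open("request.json", "w", encoding="utf-8") as f:
    json.dump(body, f, indent=2)
```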

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:streamRawPredict"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:streamRawPredict" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Response

{
  "id":"msg_012NDLxqh6LsztWCU7zTb14C",
  "type":"message",
  "role":"assistant",
  "content":[{
    "type":"text",
    "text":"Hello! Nice to meet you."
  }],
  "model":"claude-2.1",
  "stop_reason":"end_turn",
  "stop_sequence":null,
  "usage":{
    "input_tokens":11,
    "output_tokens":11
  }
}
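
To work with a non-streamed response like the one above in code, you can parse the JSON and read the content and usage fields. The following sketch uses only the field names visible in the sample response; the helper names are illustrative.

```python
import json

# The sample response shown above.
raw = """
{
  "id": "msg_012NDLxqh6LsztWCU7zTb14C",
  "type": "message",
  "role": "assistant",
  "content": [{"type": "text", "text": "Hello! Nice to meet you."}],
  "model": "claude-2.1",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {"input_tokens": 11, "output_tokens": 11}
}
"""


def extract_text(response: dict) -> str:
    """Concatenate the text of every text block in the response."""
    return "".join(
        block["text"] for block in response["content"] if block["type"] == "text"
    )


def total_tokens(response: dict) -> int:
    """Sum the input and output tokens reported in the usage field."""
    usage = response["usage"]
    return usage["input_tokens"] + usage["output_tokens"]


response = json.loads(raw)
print(extract_text(response))  # Hello! Nice to meet you.
print(total_tokens(response))  # 22
```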

Example curl command

MODEL_ID="MODEL"
LOCATION="us-central1"
PROJECT_ID="PROJECT_ID"

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/anthropic/models/${MODEL_ID}:streamRawPredict -d \
'{
  "anthropic_version": "vertex-2023-10-16",
  "messages": [{
    "role": "user",
    "content": "Hello!"
  }],
  "max_tokens": 50,
  "stream": true}'

Tool use (function calling)

The Anthropic Claude models support tool use and function calling to enhance a model's capabilities. For more information, see the Tool use overview in the Anthropic documentation.

The following samples demonstrate how to use tools with an SDK or a curl command. The samples search for nearby restaurants in San Francisco that are open.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

The following sample uses regional endpoints. To use the global endpoint, see Specify the global endpoint.

# TODO(developer): Vertex AI SDK - uncomment below & run
# pip3 install --upgrade --user google-cloud-aiplatform
# gcloud auth application-default login
# pip3 install -U 'anthropic[vertex]'
from anthropic import AnthropicVertex

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"

client = AnthropicVertex(project_id=PROJECT_ID, region="us-east5")
message = client.messages.create(
    model="claude-3-5-sonnet-v2@20241022",
    max_tokens=1024,
    tools=[
        {
            "name": "text_search_places_api",
            "description": "returns information about a set of places based on a string",
            "input_schema": {
                "type": "object",
                "properties": {
                    "textQuery": {
                        "type": "string",
                        "description": "The text string on which to search",
                    },
                    "priceLevels": {
                        "type": "array",
                        "description": "Price levels to query places, value can be one of [PRICE_LEVEL_INEXPENSIVE, PRICE_LEVEL_MODERATE, PRICE_LEVEL_EXPENSIVE, PRICE_LEVEL_VERY_EXPENSIVE]",
                    },
                    "openNow": {
                        "type": "boolean",
                        "description": "whether those places are open for business.",
                    },
                },
                "required": ["textQuery"],
            },
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "What are some affordable and good Italian restaurants open now in San Francisco?",
        }
    ],
)
print(message.model_dump_json(indent=2))
# Example response:
# {
#   "id": "msg_vrtx_018pk1ykbbxAYhyWUdP1bJoQ",
#   "content": [
#     {
#       "text": "To answer your question about affordable and good Italian restaurants
#       that are currently open in San Francisco....
# ...

REST

The following sample uses regional endpoints. To use the global endpoint, see Specify the global endpoint.

Before using any of the request data, make the following replacements:

LOCATION: A region that supports Anthropic Claude models. To use the global endpoint, see Specify the global endpoint.
MODEL: The model name to use.
ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. Claude models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
STREAM: A boolean that specifies whether the response is streamed or not. Stream your response to reduce the latency that end users perceive. Set to true to stream the response and false to return the response all at once.
CONTENT: The content, such as text, of the user or assistant message.
MAX_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters. 100 tokens correspond to roughly 60-80 words.
Specify a lower value for shorter responses and a higher value for potentially longer responses.
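
To pick a starting value for MAX_TOKENS, you can apply the approximations above (about 3.5 characters per token, and roughly 60-80 words per 100 tokens). The following sketch is a rough estimate only; actual token counts depend on the model's tokenizer, and the function names and the conservative default of 60 words per 100 tokens are assumptions.

```python
import math

CHARS_PER_TOKEN = 3.5  # Approximation stated in this guide.


def estimate_tokens(text: str) -> int:
    """Roughly estimate how many tokens a piece of text uses."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def max_tokens_for_words(words: int, words_per_100_tokens: int = 60) -> int:
    """Estimate a MAX_TOKENS value that leaves room for a response of `words` words.

    The default of 60 uses the low end of the 60-80 range, so the result
    errs on the side of a larger token allowance.
    """
    return math.ceil(words * 100 / words_per_100_tokens)


print(estimate_tokens("Hello!"))      # 2
print(max_tokens_for_words(300))      # 500
print(max_tokens_for_words(300, 80))  # 375
```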

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:rawPredict

Request JSON body:


{
  "anthropic_version": "vertex-2023-10-16",
  "max_tokens": MAX_TOKENS,
  "stream": STREAM,
  "tools": [
    {
      "name": "text_search_places_api",
      "description": "Returns information about a set of places based on a string",
      "input_schema": {
        "type": "object",
        "properties": {
          "textQuery": {
            "type": "string",
            "description": "The text string on which to search"
          },
          "priceLevels": {
            "type": "array",
            "description": "Price levels to query places, value can be one of [PRICE_LEVEL_INEXPENSIVE, PRICE_LEVEL_MODERATE, PRICE_LEVEL_EXPENSIVE, PRICE_LEVEL_VERY_EXPENSIVE]"
          },
          "openNow": {
            "type": "boolean",
            "description": "Describes whether a place is open for business at the time of the query."
          }
        },
        "required": ["textQuery"]
      }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "What are some affordable and good Italian restaurants that are open now in San Francisco?"
    }
  ]
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:rawPredict"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:rawPredict" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Response

{
  "id": "msg_vrtx_01ErR7VMNQdnvDt3n7Nmc4ER",
  "type": "message",
  "role": "assistant",
  "model": "claude-3-opus-20240229",
  "content": [
    {
      "type": "text",
      "text": "\nTo find affordable and good Italian restaurants that are currently open in San Francisco, the text_search_places_api tool seems most relevant. \n\nThe required textQuery parameter can be inferred as \"Italian restaurants in San Francisco\", since the user specified Italian restaurants and the location of San Francisco.\n\nTwo optional parameters are also relevant:\nopenNow - this should be set to true, since the user specified they want restaurants open now\npriceLevels - to find affordable restaurants, this can be set to [PRICE_LEVEL_INEXPENSIVE, PRICE_LEVEL_MODERATE]\n\nWith the textQuery provided and the two optional parameters that can help narrow the results to match the user's criteria, we have enough information to make a good call to the text_search_places_api tool to try to answer the user's request.\n"
    },
    {
      "type": "tool_use",
      "id": "toolu_vrtx_01TAJCTkxe8HhRoaQ69N4ouP",
      "name": "text_search_places_api",
      "input": {
        "textQuery": "Italian restaurants in San Francisco",
        "openNow": true,
        "priceLevels": [
          "PRICE_LEVEL_INEXPENSIVE",
          "PRICE_LEVEL_MODERATE"
        ]
      }
    }
  ],
  "stop_reason": "tool_use",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 727,
    "output_tokens": 308
  }
}
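
When stop_reason is tool_use, your application is expected to run the requested tool and return its output to the model in a follow-up user message that contains a tool_result block, as described in Anthropic's tool use documentation. The following sketch shows that step for the response above. The search_places function is a hypothetical stand-in that returns placeholder data, not a real places API.

```python
import json

# Trimmed copy of the sample response above (the text block is omitted).
response = {
    "role": "assistant",
    "content": [
        {
            "type": "tool_use",
            "id": "toolu_vrtx_01TAJCTkxe8HhRoaQ69N4ouP",
            "name": "text_search_places_api",
            "input": {
                "textQuery": "Italian restaurants in San Francisco",
                "openNow": True,
                "priceLevels": ["PRICE_LEVEL_INEXPENSIVE", "PRICE_LEVEL_MODERATE"],
            },
        }
    ],
    "stop_reason": "tool_use",
}


def search_places(textQuery, priceLevels=None, openNow=None):
    """Hypothetical stand-in for a real places lookup; returns placeholder data."""
    return {"query": textQuery, "openNow": openNow,
            "priceLevels": priceLevels, "places": []}


def run_tool_calls(response: dict) -> dict:
    """Run each tool_use block and build the follow-up user message."""
    results = []
    for block in response["content"]:
        if block["type"] != "tool_use":
            continue
        output = search_places(**block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],  # Must match the tool_use block's id.
            "content": json.dumps(output),
        })
    return {"role": "user", "content": results}


follow_up = run_tool_calls(response)
print(json.dumps(follow_up, indent=2))
```

To continue the conversation, append the assistant message (with its original content blocks) and then this follow-up message to the messages list, and send the request again with the same tools definition.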

Use Vertex AI Studio

For some of the Anthropic Claude models, you can use Vertex AI Studio to prototype and test generative AI models in the Google Cloud console. As an example, you can use Vertex AI Studio to compare Claude model responses with other supported models such as Google Gemini.

For more information, see Quickstart: Send text prompts to Gemini using Vertex AI Studio.

Anthropic Claude quotas and region availability

Claude models have regional quotas and, for models that support a global endpoint, a global quota. The quota is specified in queries per minute (QPM) and tokens per minute (TPM). TPM includes both input and output tokens.

Maximum quotas can vary by account to maintain service performance and ensure acceptable use. In some cases, access might be restricted. To view your project's quotas, see the Quotas & System Limits page in the Google Cloud console. You also need the following quotas:

online_prediction_requests_per_base_model and global_online_prediction_requests_per_base_model define your QPM quota.
For TPM, there are three quota values that apply to certain models:
- For models that count input and output tokens together, online_prediction_tokens_per_minute_per_base_model and global_online_prediction_tokens_per_minute_per_base_model define the model TPM quota.
- For models that count input and output tokens separately, online_prediction_input_tokens_per_minute_per_base_model and global_online_prediction_input_tokens_per_minute_per_base_model define the input TPM quota, and online_prediction_output_tokens_per_minute_per_base_model and global_online_prediction_output_tokens_per_minute_per_base_model define the output TPM quota.
To see which models count input and output tokens separately, see Quotas by model and region.

Input

The following list defines the input tokens that can count towards your input TPM quota. The input tokens that each model counts can vary. To see which input tokens a model counts, see Quotas by model and region.

Input tokens: Includes all input tokens, including cache read and cache write tokens.
Uncached input tokens: Includes only the input tokens that weren't read from a cache; cache read tokens are excluded.
Cache write tokens: Includes tokens that were used to create or update a cache.
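
The following sketch shows one way to tally the input tokens that count toward an input TPM quota under each counting rule. It rests on an assumption that is not stated in this guide: that a request's input tokens split into three non-overlapping buckets (uncached, cache write, and cache read). The class name, function name, and rule names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class InputUsage:
    """Input tokens for one request, split into disjoint buckets (an assumption)."""
    uncached: int     # Neither read from nor written to a cache.
    cache_write: int  # Used to create or update a cache.
    cache_read: int   # Read from a cache.


def counted_input_tokens(usage: InputUsage, rule: str) -> int:
    """Return the input tokens counted under the given (illustrative) rule name."""
    if rule == "input":
        return usage.uncached + usage.cache_write + usage.cache_read
    if rule == "uncached":
        return usage.uncached
    if rule == "uncached_and_cache_write":
        return usage.uncached + usage.cache_write
    raise ValueError(f"unknown rule: {rule}")


usage = InputUsage(uncached=1000, cache_write=200, cache_read=5000)
print(counted_input_tokens(usage, "input"))                     # 6200
print(counted_input_tokens(usage, "uncached"))                  # 1000
print(counted_input_tokens(usage, "uncached_and_cache_write"))  # 1200
```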

Quotas by model and region

The following table shows the default quotas and supported context length for each model in each region.

Model	Region	Quotas	Context length (tokens)
Claude Opus 4.1
	`us-east5`	QPM: 25; Input TPM: 60,000 (uncached and cache write); Output TPM: 6,000	200,000
	`global endpoint`	QPM: 25; Input TPM: 60,000 (uncached and cache write); Output TPM: 6,000	200,000
Claude Opus 4
	`us-east5`	QPM: 25; Input TPM: 60,000 (uncached and cache write); Output TPM: 6,000	200,000
	`global endpoint`	QPM: 25; Input TPM: 60,000 (uncached and cache write); Output TPM: 6,000	200,000
Claude Sonnet 4
	`us-east5`	QPM: 35; Input TPM: 280,000 (uncached and cache write); Output TPM: 20,000	200,000
	`europe-west1`	QPM: 25; Input TPM: 180,000 (uncached and cache write); Output TPM: 20,000	200,000
	`asia-east1`	QPM: 70; Input TPM: 550,000 (uncached and cache write); Output TPM: 50,000	200,000
	`global endpoint`	QPM: 35; Input TPM: 276,000 (uncached and cache write); Output TPM: 24,000	200,000
Claude 3.7 Sonnet
	`us-east5`	QPM: 55; TPM: 500,000 (uncached input and output)	200,000
	`europe-west1`	QPM: 40; TPM: 300,000 (uncached input and output)	200,000
	`global endpoint`	QPM: 35; TPM: 300,000 (uncached input and output)	200,000
Claude 3.5 Sonnet v2
	`us-east5`	QPM: 90; TPM: 540,000 (input and output)	200,000
	`europe-west1`	QPM: 55; TPM: 330,000 (input and output)	200,000
	`global endpoint`	QPM: 25; TPM: 140,000 (input and output)	200,000
Claude 3.5 Haiku
	`us-east5`	QPM: 80; TPM: 350,000 (input and output)	200,000
	`europe-west1`	QPM: 90; TPM: 400,000 (input and output)	200,000
Claude 3.5 Sonnet
	`us-east5`	QPM: 80; TPM: 350,000 (input and output)	200,000
	`europe-west1`	QPM: 130; TPM: 600,000 (input and output)	200,000
	`asia-southeast1`	QPM: 35; TPM: 150,000 (input and output)	200,000
Claude 3 Opus
	`us-east5`	QPM: 20; TPM: 105,000 (input and output)	200,000
Claude 3 Haiku
	`us-east5`	QPM: 245; TPM: 600,000 (input and output)	200,000
	`europe-west1`	QPM: 75; TPM: 181,000 (input and output)	200,000
	`asia-southeast1`	QPM: 70; TPM: 174,000 (input and output)	200,000
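
To judge whether a planned workload fits within a row of the table, you can check it against both limits, because QPM and TPM apply at the same time. The following sketch uses the default Claude 3.5 Sonnet v2 quota for us-east5 from the table (QPM: 90, TPM: 540,000); the function names are illustrative, and your project's actual quotas might differ from the defaults.

```python
def tokens_per_request_at_full_qpm(qpm: int, tpm: int) -> int:
    """Average tokens each request can use if every allowed request is sent."""
    return tpm // qpm


def fits_quota(requests_per_minute: int, avg_tokens_per_request: int,
               qpm: int, tpm: int) -> bool:
    """Return True if the workload stays within both the QPM and TPM limits."""
    within_qpm = requests_per_minute <= qpm
    within_tpm = requests_per_minute * avg_tokens_per_request <= tpm
    return within_qpm and within_tpm


QPM, TPM = 90, 540_000  # Claude 3.5 Sonnet v2, us-east5 defaults from the table.

print(tokens_per_request_at_full_qpm(QPM, TPM))  # 6000
print(fits_quota(60, 8_000, QPM, TPM))           # True  (480,000 tokens per minute)
print(fits_quota(90, 7_000, QPM, TPM))           # False (630,000 tokens per minute)
```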

If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.
