You can use Anthropic's SDK or curl commands to send requests to the Vertex AI endpoint using the following model names:
- For Claude Opus 4, use `claude-opus-4@20250514`
- For Claude Sonnet 4, use `claude-sonnet-4@20250514`
- For Claude 3.7 Sonnet, use `claude-3-7-sonnet@20250219`
- For Claude 3.5 Sonnet v2, use `claude-3-5-sonnet-v2@20241022`
- For Claude 3.5 Haiku, use `claude-3-5-haiku@20241022`
- For Claude 3.5 Sonnet, use `claude-3-5-sonnet@20240620`
- For Claude 3 Opus, use `claude-3-opus@20240229`
- For Claude 3 Haiku, use `claude-3-haiku@20240307`
Anthropic Claude model versions must be used with a suffix that starts with an `@` symbol (such as `claude-3-7-sonnet@20250219` or `claude-3-5-haiku@20241022`) to guarantee consistent behavior.
Before you begin
To use the Anthropic Claude models with Vertex AI, you must perform the following steps. The Vertex AI API (`aiplatform.googleapis.com`) must be enabled to use Vertex AI. If you already have an existing project with the Vertex AI API enabled, you can use that project instead of creating a new project.
Make sure you have the required permissions to enable and use partner models. For more information, see Grant the required permissions.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Vertex AI API (a gcloud CLI equivalent is sketched after this list).
- Go to one of the following Model Garden model cards, then click Enable:
  - Go to the Claude Opus 4 model card
  - Go to the Claude Sonnet 4 model card
  - Go to the Claude 3.7 Sonnet model card
  - Go to the Claude 3.5 Sonnet v2 model card
  - Go to the Claude 3.5 Haiku model card
  - Go to the Claude 3.5 Sonnet model card
  - Go to the Claude 3 Opus model card
  - Go to the Claude 3 Haiku model card
- Anthropic recommends that you enable 30-day logging of your prompt and completion activity to record any model misuse. To enable logging, see [Log requests and responses][logging].
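If you prefer the command line, the project and API setup can also be done with the gcloud CLI. This is a minimal sketch, assuming the gcloud CLI is installed and you are authenticated; `PROJECT_ID` is a placeholder for your own project ID:

```bash
# Set the active project (PROJECT_ID is a placeholder).
gcloud config set project PROJECT_ID

# Enable the Vertex AI API in that project.
gcloud services enable aiplatform.googleapis.com
```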
Use the Anthropic SDK
You can make API requests to the Anthropic Claude models using the Anthropic Claude SDK. To learn more, see the following:
- Claude messages API reference
- Anthropic Python API library
- Anthropic Vertex AI TypeScript API Library
Make a streaming call to a Claude model using the Anthropic Vertex SDK
The following code sample uses the Anthropic Vertex SDK to perform a streaming call to a Claude model.
Vertex AI SDK for Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.
The following sample uses regional endpoints. To use the global endpoint, see Specify the global endpoint.
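A minimal streaming sketch using the `AnthropicVertex` client from the Anthropic Python SDK; the project ID, region, and model version below are placeholders to replace with your own values:

```python
# Requires the Anthropic Python SDK with Vertex support:
#   pip install -U 'anthropic[vertex]'
from anthropic import AnthropicVertex

# PROJECT_ID and the region are placeholders.
client = AnthropicVertex(project_id="PROJECT_ID", region="us-east5")

# Stream the response and print text chunks as they arrive.
with client.messages.stream(
    model="claude-3-7-sonnet@20250219",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Send me a recipe for banana bread."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```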
Make a unary call to a Claude model using the Anthropic Vertex SDK
The following code sample uses the Anthropic Vertex SDK to perform a unary call to a Claude model.
Vertex AI SDK for Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.
The following sample uses regional endpoints. To use the global endpoint, see Specify the global endpoint.
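A minimal unary (non-streaming) sketch with the same assumed placeholders:

```python
from anthropic import AnthropicVertex

# PROJECT_ID and the region are placeholders.
client = AnthropicVertex(project_id="PROJECT_ID", region="us-east5")

# Send a single request and wait for the complete response.
message = client.messages.create(
    model="claude-3-7-sonnet@20250219",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Send me a recipe for banana bread."}],
)
print(message.model_dump_json(indent=2))
```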
Use a curl command
You can use a curl command to make a request to the Vertex AI endpoint. The curl command specifies which supported Claude model you want to use.
Anthropic Claude model versions must be used with a suffix that starts with an `@` symbol (such as `claude-3-7-sonnet@20250219` or `claude-3-5-haiku@20241022`) to guarantee consistent behavior.
The following topic shows you how to create a curl command and includes a sample curl command.
REST
To test a text prompt by using the Vertex AI API, send a POST request to the publisher model endpoint.
The following sample uses regional endpoints. To use the global endpoint, see Specify the global endpoint.
Before using any of the request data, make the following replacements:
- LOCATION: A region that supports Anthropic Claude models. To use the global endpoint, see Specify the global endpoint.
- MODEL: The model name you want to use.
- ROLE: The role associated with a message. You can specify a `user` or an `assistant`. The first message must use the `user` role. Claude models operate with alternating `user` and `assistant` turns. If the final message uses the `assistant` role, the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
- STREAM: A boolean that specifies whether the response is streamed. Stream your response to reduce the end-user perception of latency. Set to `true` to stream the response and `false` to return the response all at once.
- CONTENT: The content, such as text, of the `user` or `assistant` message.
- MAX_TOKENS: The maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters. 100 tokens correspond to roughly 60-80 words. Specify a lower value for shorter responses and a higher value for potentially longer responses.
- TOP_P (Optional): Top-P changes how the model selects tokens for output. Tokens are selected from the most (see top-K) to least probable until the sum of their probabilities equals the top-P value. For example, if tokens A, B, and C have probabilities of 0.3, 0.2, and 0.1 and the top-P value is `0.5`, then the model selects either A or B as the next token by using temperature and excludes C as a candidate. Specify a lower value for less random responses and a higher value for more random responses.
- TOP_K (Optional): Top-K changes how the model selects tokens for output. A top-K of `1` means the next selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding), while a top-K of `3` means that the next token is selected from among the three most probable tokens by using temperature. For each token selection step, the top-K tokens with the highest probabilities are sampled. Then tokens are further filtered based on top-P, with the final token selected using temperature sampling. Specify a lower value for less random responses and a higher value for more random responses.
- TYPE: For Claude 3.7 Sonnet only, to enable extended thinking mode, specify `enabled`.
- BUDGET_TOKENS: If you enable extended thinking, you must specify the number of tokens that the model can use for its internal reasoning as part of the output. Larger budgets can enable more thorough analysis for complex problems and improve response quality. You must specify a value greater than or equal to `1024` but less than `MAX_TOKENS`.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:streamRawPredict
Request JSON body:
{ "anthropic_version": "vertex-2023-10-16", "messages": [ { "role": "ROLE", "content": "CONTENT" }], "max_tokens": MAX_TOKENS, "stream": STREAM, "thinking": { "type": "TYPE", "budget_tokens": BUDGET_TOKENS } }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:streamRawPredict"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:streamRawPredict" | Select-Object -Expand Content
You should receive a JSON response similar to the following.
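The exact body depends on STREAM: with `"stream": true` the endpoint returns a stream of server-sent events. With `"stream": false` it returns a single Anthropic Messages object; the following sketch is illustrative only, with placeholder ID, text, and token counts:

```json
{
  "id": "msg_0123456789abcdef",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Hello! How can I help you today?"
    }
  ],
  "model": "claude-3-7-sonnet@20250219",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 10,
    "output_tokens": 25
  }
}
```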
Example curl command
MODEL_ID="MODEL"
LOCATION="us-central1"
PROJECT_ID="PROJECT_ID"
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/anthropic/models/${MODEL_ID}:streamRawPredict -d \
'{
  "anthropic_version": "vertex-2023-10-16",
  "messages": [{
    "role": "user",
    "content": "Hello!"
  }],
  "max_tokens": 50,
  "stream": true
}'
Tool use (function calling)
The Anthropic Claude models support tools and function calling to enhance a model's capabilities. For more information, see the Tool use overview in the Anthropic documentation.
The following samples demonstrate how to use tools by using an SDK or curl command. The samples search for nearby restaurants in San Francisco that are open.
Vertex AI SDK for Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.
The following sample uses regional endpoints. To use the global endpoint, see Specify the global endpoint.
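A minimal tool-use sketch with the `AnthropicVertex` client, mirroring the REST body below; the project ID, region, and model version are placeholders. Because the sample only declares the tool's schema, the model's reply is a `tool_use` request for your code to execute, not a final answer:

```python
from anthropic import AnthropicVertex

# PROJECT_ID and the region are placeholders.
client = AnthropicVertex(project_id="PROJECT_ID", region="us-east5")

# Declare the tool schema; the model decides whether to call it.
tools = [
    {
        "name": "text_search_places_api",
        "description": "Returns information about a set of places based on a string",
        "input_schema": {
            "type": "object",
            "properties": {
                "textQuery": {
                    "type": "string",
                    "description": "The text string on which to search",
                },
                "priceLevels": {
                    "type": "array",
                    "description": "Price levels to query places",
                },
                "openNow": {
                    "type": "boolean",
                    "description": "Whether a place is open at the time of the query",
                },
            },
            "required": ["textQuery"],
        },
    }
]

message = client.messages.create(
    model="claude-3-7-sonnet@20250219",
    max_tokens=1024,
    tools=tools,
    messages=[
        {
            "role": "user",
            "content": "What are some affordable and good Italian restaurants "
            "that are open now in San Francisco?",
        }
    ],
)
print(message.model_dump_json(indent=2))
```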
REST
The following sample uses regional endpoints. To use the global endpoint, see Specify the global endpoint.
Before using any of the request data, make the following replacements:
- LOCATION: A region that supports Anthropic Claude models. To use the global endpoint, see Specify the global endpoint.
- MODEL: The model name to use.
- ROLE: The role associated with a message. You can specify a `user` or an `assistant`. The first message must use the `user` role. Claude models operate with alternating `user` and `assistant` turns. If the final message uses the `assistant` role, the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
- STREAM: A boolean that specifies whether the response is streamed. Stream your response to reduce the end-user perception of latency. Set to `true` to stream the response and `false` to return the response all at once.
- CONTENT: The content, such as text, of the `user` or `assistant` message.
- MAX_TOKENS: The maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters. 100 tokens correspond to roughly 60-80 words. Specify a lower value for shorter responses and a higher value for potentially longer responses.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:rawPredict
Request JSON body:
{ "anthropic_version": "vertex-2023-10-16", "max_tokens": MAX_TOKENS, "stream": STREAM, "tools": [ { "name": "text_search_places_api", "description": "Returns information about a set of places based on a string", "input_schema": { "type": "object", "properties": { "textQuery": { "type": "string", "description": "The text string on which to search" }, "priceLevels": { "type": "array", "description": "Price levels to query places, value can be one of [PRICE_LEVEL_INEXPENSIVE, PRICE_LEVEL_MODERATE, PRICE_LEVEL_EXPENSIVE, PRICE_LEVEL_VERY_EXPENSIVE]", }, "openNow": { "type": "boolean", "description": "Describes whether a place is open for business at the time of the query." }, }, "required": ["textQuery"] } } ], "messages": [ { "role": "user", "content": "What are some affordable and good Italian restaurants that are open now in San Francisco??" } ] }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:rawPredict"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:rawPredict" | Select-Object -Expand Content
You should receive a JSON response similar to the following.
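Because the request only declares the tool, a typical reply stops with a `tool_use` content block that your code is expected to execute; this sketch is illustrative only, with placeholder IDs, input values, and token counts:

```json
{
  "id": "msg_0123456789abcdef",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_0123456789abcdef",
      "name": "text_search_places_api",
      "input": {
        "textQuery": "affordable Italian restaurants San Francisco",
        "priceLevels": ["PRICE_LEVEL_INEXPENSIVE", "PRICE_LEVEL_MODERATE"],
        "openNow": true
      }
    }
  ],
  "model": "claude-3-7-sonnet@20250219",
  "stop_reason": "tool_use",
  "usage": {
    "input_tokens": 400,
    "output_tokens": 60
  }
}
```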
Use Vertex AI Studio
For some of the Anthropic Claude models, you can use Vertex AI Studio to quickly prototype and test generative AI models in the Google Cloud console. As an example, you can use Vertex AI Studio to compare Claude model responses with other supported models such as Google Gemini.
For more information, see Quickstart: Send text prompts to Gemini using Vertex AI Studio.
Anthropic Claude quotas and region availability
Claude models have regional quotas and, for models that support a global endpoint, a global quota. Quota is specified in queries per minute (QPM) and tokens per minute (TPM). TPM includes both input and output tokens; for example, a request that sends 1,000 input tokens and receives 500 output tokens consumes 1,500 tokens of TPM quota.
To maintain overall service performance and acceptable use, the maximum quotas might vary by account and, in some cases, access might be restricted. View your project's quotas on the Quotas & System Limits page in the Google Cloud console. You must also have the following quotas available:
- Online prediction requests per base model per minute per region per base_model
- Online prediction tokens per minute per base model per minute per region per base_model
- online_prediction_input_tokens_per_minute_per_base_model
- online_prediction_output_tokens_per_minute_per_base_model
Input tokens
The following list defines the input tokens that can count towards your input TPM quota. The input tokens that each model counts can vary. To see which input tokens a model counts, see Quotas by model and region.
- Input tokens: includes all input tokens, including cache read and cache write tokens.
- Uncached input tokens: includes only the input tokens that weren't read from a cache; cache read tokens are excluded.
- Cache write tokens: includes the tokens that were used to create or update a cache.

For example, if a request sends 1,000 input tokens of which 800 are read from a cache, a model that counts all input tokens records 1,000 tokens against the input TPM quota, while a model that counts only uncached input tokens records 200.
Quotas by model and region
The following table shows the default quotas and supported context length for each model in each region.
Model | Region | Quotas | Context length (tokens)
---|---|---|---
Claude Opus 4 | us-east5 | | 200,000
Claude Sonnet 4 | us-east5 | | 200,000
Claude Sonnet 4 | europe-west1 | | 200,000
Claude Sonnet 4 | global | | 200,000
Claude 3.7 Sonnet | us-east5 | | 200,000
Claude 3.7 Sonnet | europe-west1 | | 200,000
Claude 3.7 Sonnet | global | | 200,000
Claude 3.5 Sonnet v2 | us-east5 | | 200,000
Claude 3.5 Sonnet v2 | europe-west1 | | 200,000
Claude 3.5 Sonnet v2 | global | | 200,000
Claude 3.5 Haiku | us-east5 | | 200,000
Claude 3.5 Sonnet | us-east5 | | 200,000
Claude 3.5 Sonnet | europe-west1 | | 200,000
Claude 3.5 Sonnet | asia-southeast1 | | 200,000
Claude 3 Opus | us-east5 | | 200,000
Claude 3 Haiku | us-east5 | | 200,000
Claude 3 Haiku | europe-west1 | | 200,000
Claude 3 Haiku | asia-southeast1 | | 200,000
If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.