Google models
The following table shows the throughput, purchase increment, and burndown rates for Google models that support Provisioned Throughput. Your per-second throughput is defined as the total of your prompt inputs and generated outputs across all requests per second.
Provisioned Throughput only supports models that you call directly from your project using the model's API and doesn't support models that are called by other Vertex AI products, including Vertex AI Agents and Vertex AI Search.
To find out how many tokens your workload requires, refer to the SDK tokenizer or the countTokens API.
Model | Per-second throughput per GSU | Units | Minimum GSU purchase increment | Burndown rates |
---|---|---|---|---|
Gemini 2.0 Flash-Lite | 6,720 | Tokens | 1 | 1 input text token = 1 token; 1 input image token = 1 token; 1 input video token = 1 token; 1 input audio token = 1 token; 1 output text token = 4 tokens |
Gemini 2.0 Flash | 3,360 | Tokens | 1 | 1 input text token = 1 token; 1 input image token = 1 token; 1 input video token = 1 token; 1 input audio token = 7 tokens; 1 output text token = 4 tokens |
Imagen 3 | 0.025 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
Imagen 3 Fast | 0.05 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
Imagen 2 | 0.05 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
Imagen 2 Edit | 0.05 | Images | 1 | Only output images count toward your Provisioned Throughput quota. |
MedLM medium | 2,000 | Characters | 1 | 1 input char = 1 char; 1 output char = 2 chars |
MedLM large | 200 | Characters | 1 | 1 input char = 1 char; 1 output char = 3 chars |
MedLM large 1.5 | 200 | Characters | 1 | 1 input char = 1 char; 1 output char = 3 chars |
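To illustrate how the burndown rates in the table translate into a GSU purchase, here is a minimal sizing sketch for Gemini 2.0 Flash (the workload numbers and the `gsus_needed` helper are hypothetical, not part of any SDK; the multipliers and the 3,360 tokens-per-second-per-GSU figure come from the table above):

```python
import math

# Burndown rates for Gemini 2.0 Flash, from the table above:
# input text/image/video = 1 token each, input audio = 7 tokens,
# output text = 4 tokens. Per-second throughput per GSU = 3,360 tokens.

def gsus_needed(input_text_tps, input_audio_tps, output_text_tps,
                per_gsu_throughput=3360, min_increment=1):
    """Estimate the GSUs required for a steady-state workload.

    Rates are raw tokens per second; the burndown multipliers convert
    them into the tokens counted against Provisioned Throughput.
    """
    burned = (input_text_tps * 1
              + input_audio_tps * 7
              + output_text_tps * 4)
    gsus = math.ceil(burned / per_gsu_throughput)
    # Round up to the minimum purchase increment (1 GSU for Gemini models).
    return max(gsus, min_increment)

# Example: 5,000 input text tokens/sec, 200 input audio tokens/sec, and
# 1,000 output text tokens/sec burn 5,000 + 1,400 + 4,000 = 10,400
# tokens/sec, so ceil(10,400 / 3,360) = 4 GSUs.
print(gsus_needed(5000, 200, 1000))
```

For actual token counts of a real workload, measure with the SDK tokenizer or the countTokens API mentioned above rather than estimating.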
For more information about supported locations, see Available locations.
You can upgrade to new models as they are made available. For information about model availability and discontinuation dates, see Google models.
Supervised fine-tuned model support
The following is supported for Google models that support supervised fine-tuning:

- Provisioned Throughput can be applied to both base models and supervised fine-tuned versions of those base models.
- Supervised fine-tuned model endpoints and their corresponding base model count toward the same Provisioned Throughput quota.
For example, Provisioned Throughput purchased for `gemini-2.0-flash-lite-001` for a specific project prioritizes requests that are made from supervised fine-tuned versions of `gemini-2.0-flash-lite-001` created within that project. Use the appropriate header to control traffic behavior.
Google legacy models
See Legacy models that support Provisioned Throughput.
Partner models
The following table shows the throughput, purchase increment, and burndown rates for partner models that support Provisioned Throughput. Claude models are measured in tokens per second, defined as the total of input and output tokens across all requests per second.
Model | Throughput per GSU (tokens/sec) | Minimum GSU purchase | GSU purchase increment | Burndown rates |
---|---|---|---|---|
Anthropic's Claude 3.7 Sonnet | 350 | 25 | 1 | 1 input token = 1 token; 1 output token = 5 tokens |
Anthropic's Claude 3.5 Sonnet v2 | 350 | 25 | 1 | 1 input token = 1 token; 1 output token = 5 tokens |
Anthropic's Claude 3.5 Haiku | 2,000 | 10 | 1 | 1 input token = 1 token; 1 output token = 5 tokens |
Anthropic's Claude 3 Opus | 70 | 35 | 1 | 1 input token = 1 token; 1 output token = 5 tokens |
Anthropic's Claude 3 Haiku | 4,200 | 5 | 1 | 1 input token = 1 token; 1 output token = 5 tokens |
Anthropic's Claude 3.5 Sonnet | 350 | 25 | 1 | 1 input token = 1 token; 1 output token = 5 tokens |
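The same style of estimate applies to Claude models, with the added wrinkle of a per-model minimum GSU purchase. A hypothetical sketch for Claude 3.7 Sonnet (the `claude_gsus` helper and the workload numbers are illustrative; the 350 tokens/sec per GSU, 5x output burndown, and 25-GSU minimum come from the table above):

```python
import math

def claude_gsus(input_tps, output_tps, per_gsu=350, min_purchase=25):
    """Estimate GSUs for Claude 3.7 Sonnet: output tokens burn at 5x."""
    burned = input_tps * 1 + output_tps * 5
    return max(math.ceil(burned / per_gsu), min_purchase)

# 2,000 input tokens/sec and 800 output tokens/sec burn
# 2,000 + 4,000 = 6,000 tokens/sec -> ceil(6,000 / 350) = 18 GSUs,
# which is then raised to the 25-GSU minimum purchase.
print(claude_gsus(2000, 800))
```

Note how the minimum purchase, not the throughput math, can dominate for smaller workloads; the heavier output burndown also means output-heavy traffic consumes quota much faster than input-heavy traffic.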
For information about supported locations, see Anthropic Claude region availability. To order Provisioned Throughput for Anthropic models, contact your Google Cloud account representative.