Starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.
This guide explains how to calculate your Provisioned Throughput requirements by understanding Generative AI Scale Units (GSUs) and burndown rates. It covers the following topics:
GSU and burndown rate: Learn about the core concepts of Generative AI Scale Units and burndown rates, which are used for pricing and calculation.
Important Considerations: Review key factors to consider when planning for Provisioned Throughput, such as request prioritization and how throughput is measured.
GSU and burndown rate
A Generative AI Scale Unit (GSU) is a measure of throughput for your prompts and responses. This amount specifies how much throughput to provision a model with.
A burndown rate is a ratio that converts the input and output units (such as
tokens, characters, or images) to input tokens per second, input characters per
second, or input images per second, respectively. This ratio represents the
throughput and is used to produce a standard unit across models.
Different models use different amounts of throughput. For information about the
minimum GSU purchase amount and increments for each model, see Supported models
and burndown rates.
This equation demonstrates how throughput is calculated:
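```
inputs_per_query = inputs_across_modalities_converted_using_burndown_rates
outputs_per_query = outputs_across_modalities_converted_using_burndown_rates

throughput_per_second = (inputs_per_query + outputs_per_query) * queries_per_second
```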
The calculated throughput per second determines how many GSUs you need for your use case.
Important Considerations
To help you plan for your Provisioned Throughput needs, review the following considerations:
Requests are prioritized.
Provisioned Throughput customers are prioritized and served before on-demand requests.
Throughput doesn't accumulate.
Unused throughput doesn't accumulate or carry over to the next
month.
Provisioned Throughput is measured in tokens per second, characters per second, or images per second.
Provisioned Throughput isn't measured solely based on queries per minute
(QPM). It's measured based on the query size for your use case, the response
size, and the QPM.
Provisioned Throughput is specific to a project, region, model, and version.
Provisioned Throughput is assigned to a specific
project-region-model-version combination. The same model called from a
different region won't count against your Provisioned Throughput
quota and won't be prioritized over on-demand requests.
Context caching
Provisioned Throughput supports default
context caching.
However, Provisioned Throughput doesn't support caching requests
using the Vertex AI API that include retrieving information about a context
cache.
By default, Google automatically caches inputs to reduce cost and latency.
For the Gemini 2.5 Flash and Gemini 2.5 Pro models, cached
tokens are charged at a 75% discount
relative to standard input tokens when a cache hit occurs. For
Provisioned Throughput, the discount is applied through a
reduced burndown rate.
For example, Gemini 2.5 Pro has the following burndown rates for input
text tokens and cached tokens:
1 input text token = 1 token
1 input cached text token = 0.25 tokens
Sending 1,000 input tokens to this model results in a burndown of your
Provisioned Throughput by 1,000 input tokens per second. However,
if you send 1,000 cached tokens to Gemini 2.5 Pro, this results in a
burndown of your Provisioned Throughput by 250 tokens per second.
Note that this can result in higher throughput consumption for similar queries when the tokens aren't cached and the cache discount isn't applied.
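For illustration, the following minimal Python sketch shows how the reduced burndown rate for cached tokens changes throughput consumption. The rates match the Gemini 2.5 Pro example above; the helper function and constant names are illustrative only and aren't part of any Vertex AI SDK.

```python
# Illustrative burndown rates for Gemini 2.5 Pro input tokens (see the example above).
INPUT_TEXT_RATE = 1.0     # 1 input text token = 1 token
CACHED_TEXT_RATE = 0.25   # 1 input cached text token = 0.25 tokens

def input_burndown(fresh_tokens: int, cached_tokens: int) -> float:
    """Return the burndown-adjusted input tokens consumed by one query."""
    return fresh_tokens * INPUT_TEXT_RATE + cached_tokens * CACHED_TEXT_RATE

print(input_burndown(fresh_tokens=1000, cached_tokens=0))   # 1000.0 tokens of throughput
print(input_burndown(fresh_tokens=0, cached_tokens=1000))   # 250.0 tokens of throughput
```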
Understand the burndown for Live API
Provisioned Throughput supports Gemini 2.5 Flash with Live API. To understand how to calculate the burndown while using the Live API, see Calculate throughput for Live API.
Example of estimating your Provisioned Throughput needs
To estimate your Provisioned Throughput needs, you can use the
estimation tool in the Google Cloud console.
The following example illustrates the process of estimating the amount of
Provisioned Throughput for your model. The region isn't considered
in the estimation calculations.
The example uses the following burndown rates for gemini-2.0-flash:
1 input text token = 1 token
1 input audio token = 7 tokens
1 output text token = 4 tokens
1 GSU = 3,360 tokens per second of throughput
In this example, your requirement is to support 10 queries per second (QPS), where each query has an input of 1,000 text tokens and 500 audio tokens and receives an output of 300 text tokens, using gemini-2.0-flash. To calculate your throughput, refer to the burndown rates for your selected model.
Calculate your throughput.
Multiply your inputs by the burndown rates to arrive at total input tokens:
1,000*(1 token per input text token) + 500*(7 tokens per input audio
token) = 4,500 burndown adjusted input tokens per query.
Multiply your outputs by the burndown rates to arrive at total output tokens:
300*(4 tokens per output text token) = 1,200 burndown adjusted output
tokens per query
Add your totals together:
4,500 burndown adjusted input tokens + 1,200 burndown adjusted output
tokens = 5,700 total tokens per query
Multiply the total number of tokens by the QPS to arrive at total
throughput per second:
5,700 total tokens per query * 10 QPS = 57,000 total tokens per second
Calculate your GSUs.
To calculate the number of GSUs, divide the total tokens per second by the per-second throughput of one GSU from the burndown table:
57,000 total tokens per second ÷ 3,360 per-second throughput per GSU = 16.96 GSUs
The minimum GSU purchase increment for gemini-2.0-flash is 1, so you need 17 GSUs to cover your workload.
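As a quick check, the whole estimate can be reproduced with a short Python sketch. The burndown rates and per-GSU throughput below are the example values for gemini-2.0-flash used in this walkthrough; the variable names are illustrative only.

```python
import math

# Example burndown rates for gemini-2.0-flash (from the walkthrough above).
INPUT_TEXT_RATE = 1        # tokens per input text token
INPUT_AUDIO_RATE = 7       # tokens per input audio token
OUTPUT_TEXT_RATE = 4       # tokens per output text token
THROUGHPUT_PER_GSU = 3360  # tokens per second provided by one GSU

# Workload: 10 QPS, each query has 1,000 input text tokens, 500 input audio tokens,
# and 300 output text tokens.
qps = 10
input_tokens = 1000 * INPUT_TEXT_RATE + 500 * INPUT_AUDIO_RATE  # 4,500 per query
output_tokens = 300 * OUTPUT_TEXT_RATE                          # 1,200 per query
tokens_per_query = input_tokens + output_tokens                 # 5,700 per query

total_throughput = tokens_per_query * qps                       # 57,000 tokens per second
gsus_needed = math.ceil(total_throughput / THROUGHPUT_PER_GSU)  # ceil(16.96...) = 17

print(f"{total_throughput} tokens per second -> {gsus_needed} GSUs")
```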
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-27 UTC."],[],[],null,["# Calculate Provisioned Throughput requirements\n\nThis section explains the concepts of generative AI scale unit (GSU) and\nburndown rates. Provisioned Throughput is calculated and priced\nusing generative AI scale units (GSUs) and burndown rates.\n\nGSU and burndown rate\n---------------------\n\nA *Generative AI Scale Unit (GSU)* is a measure of throughput for your prompts\nand responses. This amount specifies how much throughput to provision a model\nwith.\n\nA *burndown rate* is a ratio that converts the input and output units (such as\ntokens, characters, or images) to input tokens per second, input characters per\nsecond, or input images per second, respectively. This ratio represents the\nthroughput and is used to produce a standard unit across models.\n\nDifferent models use different amounts of throughput. For information about the\nminimum GSU purchase amount and increments for each model, see [Supported models\nand burndown rates](/vertex-ai/generative-ai/docs/supported-models) in this document.\n\nThis equation demonstrates how throughput is calculated: \n\n inputs_per_query = inputs_across_modalities_converted_using_burndown_rates\n outputs_per_query = outputs_across_modalities_converted_using_burndown_rates\n\n throughput_per_second = (inputs_per_query + outputs_per_query) * queries_per_second\n\nThe calculated throughput per second determines how many GSUs that you need for\nyour use case.\n\nImportant Considerations\n------------------------\n\nTo help you plan for your Provisioned Throughput needs, review the\nfollowing important considerations:\n\n- **Requests are prioritized.**\n\n Provisioned Throughput customers are prioritized and serviced\n first before on-demand requests.\n- **Throughput doesn't accumulate.**\n\n Unused throughput doesn't accumulate or carry over to the next\n month.\n- **Provisioned Throughput is measured in tokens per second, characters per second, or images per second.**\n\n Provisioned Throughput isn't measured solely based on queries per minute\n (QPM). It's measured based on the query size for your use case, the response\n size, and the QPM.\n- **Provisioned Throughput is specific to a project, region, model, and version.**\n\n Provisioned Throughput is assigned to a specific\n project-region-model-version combination. 
The same model called from a\n different region won't count against your Provisioned Throughput\n quota and won't be prioritized over on-demand requests.\n\n### Context caching\n\n|\n| **Preview**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nProvisioned Throughput supports default\n[context caching](/vertex-ai/generative-ai/docs/context-cache/context-cache-overview).\nHowever, Provisioned Throughput doesn't support caching requests\nusing the Vertex AI API that include retrieving information about a context\ncache.\n\nBy default, Google automatically caches inputs to reduce cost and latency.\nFor the Gemini 2.5 Flash and Gemini 2.5 Pro models, cached\ntokens are charged at a [75% discount](/vertex-ai/generative-ai/pricing)\nrelative to standard input tokens when a cache hit occurs. For\nProvisioned Throughput, the discount is applied through a\nreduced burndown rate.\n\nFor example, Gemini 2.5 Pro has the following burndown rates for input\ntext tokens and cached tokens:\n\n- 1 input text token = 1 token\n\n- 1 input cached text token = 0.25 tokens\n\nSending 1,000 input tokens to this model results in a burndown of your\nProvisioned Throughput by 1,000 input tokens per second. However,\nif you send 1,000 cached tokens to Gemini 2.5 Pro, this results in a\nburndown of your Provisioned Throughput by 250 tokens per second.\n\nNote that this can lead to higher throughput for similar queries where the tokens\naren't cached and the cache discount isn't applied.\n\nTo view the burndown rates for models supported in Provisioned Throughput,\nsee [Supported models and burndown rates](/vertex-ai/generative-ai/docs/supported-models).\n\nUnderstand the burndown for Live API\n------------------------------------\n\n| **Request access:** For information about access to this release, see the [access request page](https://docs.google.com/forms/d/e/1FAIpQLScxBeD4UJ8GbUfX4SXjj5a1XJ1K7Urwvb0iSGdGccNcFRBrpQ/viewform).\n\nProvisioned Throughput supports the Gemini 2.5 Flash with\nLive API. To understand how to calculate the burndown while using\nthe Live API, see\n[Calculate throughput for Live API](/vertex-ai/generative-ai/docs/provisioned-throughput/live-api#calculate).\n\nFor more information about using Provisioned Throughput\nfor Gemini 2.5 Flash with Live API, see\n[Provisioned Throughput for Live API](/vertex-ai/generative-ai/docs/provisioned-throughput/live-api).\n\nExample of estimating your Provisioned Throughput needs\n-------------------------------------------------------\n\nTo estimate your Provisioned Throughput needs, use the\n[estimation tool in the Google Cloud console](/vertex-ai/generative-ai/docs/purchase-provisioned-throughput#estimate-provisioned-throughput).\nThe following example illustrates the process of estimating the amount of\nProvisioned Throughput for your model. The region isn't considered\nin the estimation calculations.\n\nThis table provides the burndown rates for `gemini-2.0-flash` that you\ncan use to follow the example.\n\n1. Gather your requirements.\n\n 1. 
In this example, your requirement is to verify that you can support 10\n queries per second (QPS) of a query with an input of 1,000 text tokens and\n 500 audio tokens, to receive an output of 300 text tokens using\n `gemini-2.0-flash`.\n\n This step means that you understand your use case, because you have\n identified your model, the QPS, and the size of your inputs and outputs.\n 2. To calculate your throughput, refer to the\n [burndown rates](/vertex-ai/generative-ai/docs/supported-models#google-models) for your selected model.\n\n2. Calculate your throughput.\n\n 1. Multiply your inputs by the burndown rates to arrive at total input tokens:\n\n 1,000\\*(1 token per input text token) + 500\\*(7 tokens per input audio\n token) = 4,500 burndown adjusted input tokens per query.\n 2. Multiply your outputs by the burndown rates to arrive at total output tokens:\n\n 300\\*(4 tokens per output text token) = 1,200 burndown adjusted output\n tokens per query\n 3. Add your totals together:\n\n 4,500 burndown adjusted input tokens + 1,200 burndown adjusted output\n tokens = 5,700 total tokens per query\n 4. Multiply the total number of tokens by the QPS to arrive at total\n throughput per second:\n\n 5,700 total tokens per query \\* 10 QPS = 57,000 total tokens per second\n3. Calculate your GSUs.\n\n 1. The GSUs are the total tokens per second divided by per-second throughput per GSU from the burndown table.\n\n 57,000 total tokens per second ÷ 3,360 per-second throughput per GSU = 16.96 GSUs\n 2. The minimum GSU purchase increment for `gemini-2.0-flash` is\n 1, so you'll need 17 GSUs to assure your workload.\n\nWhat's next\n-----------\n\n- [Purchase Provisioned Throughput](/vertex-ai/generative-ai/docs/purchase-provisioned-throughput)."]]