Vertex AI provides different ways to manage throughput for generative AI models to help you balance cost and performance. This document describes the available options: a flexible pay-as-you-go model and reserved capacity for predictable throughput.
Managed model quotas
Vertex AI offers two ways to manage throughput for its managed generative AI models, which lets you balance cost, flexibility, and performance. You can either use a flexible pay-as-you-go model or reserve a dedicated amount of throughput for a fixed price.
Pay-as-you-go
For the default pay-as-you-go model, Vertex AI uses Dynamic Shared Quota, which doesn't have a predefined usage limit. Instead, you get access to a large, shared pool of resources that are dynamically allocated based on real-time availability and demand.
This model allows your workloads to use more resources when they are available.
If you receive a resource exhausted (429) error, it means the shared pool is
temporarily experiencing high demand from many users at once. You should
implement retry mechanisms in your application, as availability can change
quickly.
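As a sketch of that retry advice, the following shows exponential backoff with jitter around a request function. The `ResourceExhausted` class and `call_with_backoff` helper are illustrative stand-ins, not part of any Vertex AI SDK; in practice, the Python client surfaces 429 errors as `google.api_core.exceptions.ResourceExhausted`, which you would catch in the same way.

```python
import random
import time


class ResourceExhausted(Exception):
    """Stand-in for a 429 RESOURCE_EXHAUSTED error raised by the API client."""


def call_with_backoff(send_request, max_attempts=5, base_delay=1.0, max_delay=32.0):
    """Call send_request(), retrying on ResourceExhausted with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... (capped at max_delay),
    plus a small random jitter so concurrent clients don't retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return send_request()
        except ResourceExhausted:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; surface the error to the caller.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))


# Example: a request that succeeds on the third attempt.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ResourceExhausted("shared pool busy")
    return "response"

result = call_with_backoff(flaky_request, base_delay=0.01)
```

Because Dynamic Shared Quota availability can change within seconds, short initial delays with a capped exponential schedule usually recover quickly without hammering the shared pool.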
Reserved capacity
For critical production applications that require consistent performance and predictable costs, you can use Provisioned Throughput, a fixed-cost subscription that reserves a specific amount of throughput for your models in a chosen location.
Quotas for Generative AI services
Vertex AI offers a suite of generative AI services, such as model tuning, model evaluation, batch prediction, embeddings, and retrieval-augmented generation. To learn more about the quotas for these services, see Generative AI on Vertex AI quotas and system limits.
What's next
- Learn more about Dynamic Shared Quota.
- Learn more about Provisioned Throughput.
- Learn more about generative AI quotas and system limits.
- Learn more about Google Cloud quotas.