The Gen AI evaluation service uses gemini-2.0-flash as the default judge model for model-based metrics.
A single evaluation request for a model-based metric can result in multiple underlying requests to the Gen AI evaluation service. Each model's quota is calculated per project, so requests to gemini-2.0-flash for both model inference and model-based evaluation count against the same quota.
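Because of this fan-out, it can help to estimate judge-model traffic before launching a large evaluation run. The sketch below is illustrative arithmetic only: the values of `dataset_size`, `model_based_metrics`, and `requests_per_metric` are assumptions for the example, not documented constants, and the actual number of underlying requests per instance varies by metric.

```python
# Rough, illustrative estimate of judge-model traffic for an evaluation run.
# All constants here are assumptions for the sketch, not documented values.

dataset_size = 500          # number of evaluation instances
model_based_metrics = 3     # e.g. coherence, fluency, safety
requests_per_metric = 2     # assumed judge calls per instance per metric

total_judge_requests = dataset_size * model_based_metrics * requests_per_metric
print(f"Estimated judge-model requests: {total_judge_requests}")

# Compare against the per-minute quota to get a lower bound on run time.
quota_per_minute = 1000     # check your project's actual quota in the console
minutes_at_quota = total_judge_requests / quota_per_minute
print(f"Lower bound on run time at quota: {minutes_at_quota:.1f} minutes")
```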
Quotas for the Gen AI evaluation service and the underlying judge model are shown
in the following table:
Request quota | Default quota |
---|---|
Gen AI evaluation service requests per minute | 1,000 requests per project per region |
Online prediction requests per minute for base_model: gemini-2.0-flash | See Quotas by region and model. |
If you receive an error related to quotas while using the Gen AI evaluation service, you might need to file a quota increase request. See View and manage quotas for more information.
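While a quota increase is pending, a common workaround for intermittent quota errors is to retry with exponential backoff. The following sketch assumes the Python client surfaces quota errors as google.api_core.exceptions.ResourceExhausted (HTTP 429) and wraps a hypothetical run_evaluation() callable; adapt it to however your code actually invokes the service.

```python
import time

from google.api_core.exceptions import ResourceExhausted


def evaluate_with_backoff(run_evaluation, max_attempts=5, base_delay=2.0):
    """Retry a quota-limited call with exponential backoff.

    `run_evaluation` is a hypothetical zero-argument callable that issues
    one Gen AI evaluation service request and returns its result.
    """
    for attempt in range(max_attempts):
        try:
            return run_evaluation()
        except ResourceExhausted:  # HTTP 429: per-minute quota exceeded
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```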
Limit | Value |
---|---|
Gen AI evaluation service request timeout | 60 seconds |
When you use the Gen AI evaluation service for the first time in a new project, you might experience an initial setup delay of up to two minutes. If your first request fails, wait a few minutes and then retry. Subsequent evaluation requests typically complete within 60 seconds.
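For this first-use case, a single longer wait before one retry is usually enough; the backoff loop above targets sustained quota pressure instead. Another hedged sketch, again assuming the same hypothetical run_evaluation() callable:

```python
import time

from google.api_core.exceptions import GoogleAPICallError


def first_request_with_cold_start_retry(run_evaluation, wait_seconds=120):
    """Retry the very first evaluation request once after a cold-start wait.

    The two-minute wait mirrors the documented initial setup delay for
    new projects; `run_evaluation` is a hypothetical callable.
    """
    try:
        return run_evaluation()
    except GoogleAPICallError:
        time.sleep(wait_seconds)  # allow first-time project setup to finish
        return run_evaluation()
```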
The maximum input and output tokens for model-based metrics depend on the model used as the judge model. See Google models for a list of models.
Vertex AI Pipelines quotas
Each tuning job uses Vertex AI Pipelines. For more information, see Vertex AI Pipelines quotas and limits.
What's next
- To learn about quotas and limits for Vertex AI, see Vertex AI quotas and limits.
- To learn more about Google Cloud quotas and limits, see Understand quota values and system limits.