Starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.
The GenAI evaluation service provides enterprise-grade tools for objective, data-driven assessment of generative AI models. It supports and informs a number of development tasks like model migrations, prompt editing, and fine-tuning.
Gen AI evaluation service features
The defining feature of the Gen AI evaluation service is the ability to use adaptive rubrics, a set of tailored pass or fail tests for each individual prompt. Evaluation rubrics are similar to unit tests in software development and aim to improve model performance across a variety of tasks.
The service also supports the following other common evaluation methods:
Static rubrics: Apply a fixed set of scoring criteria across all prompts.
Computation-based metrics: Use deterministic algorithms like ROUGE or BLEU when a ground truth is available.
Custom functions: Define your own evaluation logic in Python for specialized requirements.
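As an illustration of the custom-function option, the following sketch defines a simple pass/fail check in Python; the function name and the record fields it reads are hypothetical placeholders, not part of the service's API.

# Illustrative custom metric: 1.0 (pass) if the response stays within a word
# budget and mentions a required term, otherwise 0.0 (fail). The record fields
# ("response", "required_term", "max_words") are hypothetical placeholders.
def within_budget_and_on_topic(record: dict) -> float:
    response = record["response"]
    within_budget = len(response.split()) <= record["max_words"]
    mentions_term = record["required_term"].lower() in response.lower()
    # Returning 0.0 or 1.0 lets scores be averaged across a dataset.
    return 1.0 if (within_budget and mentions_term) else 0.0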
Evaluation dataset generation
You can create an evaluation dataset through the following methods:
Upload a file containing complete prompt instances, or provide a prompt template along with a corresponding file of variable values used to populate the prompts (see the sketch after this list).
Sample directly from production logs to evaluate the real-world usage of your model.
Use synthetic data generation to generate a large number of consistent examples for any prompt template.
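As a sketch of the template-plus-variables approach mentioned above, the following populates a hypothetical prompt template from a small table of values with pandas; the column names and template text are illustrative only.

import pandas as pd

# Hypothetical prompt template and variable values; replace with your own.
template = "Summarize the following {document_type} in {num_sentences} sentences:\n{text}"

variables_df = pd.DataFrame({
    "document_type": ["news article", "support ticket"],
    "num_sentences": [3, 2],
    "text": ["<article text>", "<ticket text>"],
})

# Populate the template to produce one complete prompt per row.
prompts_df = pd.DataFrame({
    "prompt": [template.format(**row) for row in variables_df.to_dict(orient="records")],
})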
Supported interfaces
You can define and run your evaluations using the following interfaces:
Google Cloud console: A web UI that provides a guided, end-to-end workflow. Manage your datasets, run evaluations, and dive deep into interactive reports and visualizations.
Python SDK: A notebook-native experience for developers. Programmatically run evaluations and render side-by-side model comparisons directly in your Colab or Jupyter environment.
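If you use the Python SDK from a notebook, one common way to set up your environment is to initialize the Vertex AI SDK with your project and region. The following is a minimal setup sketch; the project ID and region are placeholder values.

import vertexai

# Placeholder project ID and region; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")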
Use cases
The Gen AI evaluation service lets you see how a model performs on your specific tasks and against your unique criteria, providing insights that can't be derived from public leaderboards and general benchmarks. This supports critical development tasks, including:
Model migrations: Compare model versions to understand behavioral differences and fine-tune your prompts and settings accordingly.
Finding the best model: Run head-to-head comparisons of Google and third-party models on your data to establish a performance baseline and identify the best fit for your use case.
Prompt improvement: Use evaluation results to guide your customization efforts. Re-running an evaluation creates a tight feedback loop, providing immediate, quantifiable feedback on your changes.
Model fine-tuning: Evaluate the quality of a fine-tuned model by applying consistent evaluation criteria to every run.
Evaluations with adaptive rubrics
Adaptive rubrics are the recommended method for most evaluation use cases and are typically the fastest way to get started with evaluations.
Instead of using a general set of rating rubrics like most LLM-as-a-judge systems, the test-driven evaluation framework adaptively generates a unique set of pass or fail rubrics for each individual prompt in your dataset. This approach ensures that every evaluation is relevant to the specific task being evaluated.
The evaluation process for each prompt uses a two-step system:
Rubric generation: The service first analyzes your prompt and generates a list of specific, verifiable tests—the rubrics—that a good response should meet.
Rubric validation: After your model generates a response, the service assesses the response against each rubric, delivering a clear Pass or Fail verdict and a rationale.
The final result is an aggregated pass rate and a detailed breakdown of which rubrics the model passed, giving you actionable insights to diagnose issues and measure improvements.
By moving from high-level, subjective scores to granular, objective test results, you can adopt an evaluation-driven development cycle and bring software engineering best practices to the process of building generative AI applications.
Rubrics evaluation example
To understand how the Gen AI evaluation service generates and uses rubrics, consider this example:
User prompt: Write a four-sentence summary of the provided article about renewable energy, maintaining an optimistic tone.
For this prompt, the rubric generation step might produce the following rubrics:
Rubric 1: The response is a summary of the provided article.
Rubric 2: The response contains exactly four sentences.
Rubric 3: The response maintains an optimistic tone.
Your model may produce the following response: The article highlights significant growth in solar and wind power. These advancements are making clean energy more affordable. The future looks bright for renewables. However, the report also notes challenges with grid infrastructure.
During rubric validation, the Gen AI evaluation service assesses the response against each rubric:
Rubric 1: The response is a summary of the provided article.
Verdict: Pass
Reason: The response accurately summarizes the main points.
Rubric 2: The response contains exactly four sentences.
Verdict: Pass
Reason: The response is composed of four distinct sentences.
Rubric 3: The response maintains an optimistic tone.
Verdict: Fail
Reason: The final sentence introduces a negative point, which detracts from the optimistic tone.
The final pass rate for this response is 66.7%. To compare two models, you can evaluate their responses against this same set of generated tests and compare their overall pass rates.
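The aggregation itself is straightforward: the pass rate is the share of rubrics with a Pass verdict (here, 2 of 3). The following illustrative snippet shows that calculation with a hypothetical verdict structure; it is not the service's actual output format.

# Hypothetical representation of per-rubric verdicts for one response.
verdicts = [
    {"rubric": "The response is a summary of the provided article.", "passed": True},
    {"rubric": "The response contains exactly four sentences.", "passed": True},
    {"rubric": "The response maintains an optimistic tone.", "passed": False},
]

# Aggregate pass rate: the fraction of rubrics that passed.
pass_rate = sum(v["passed"] for v in verdicts) / len(verdicts)
print(f"{pass_rate:.1%}")  # 66.7%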
Evaluation workflow
Completing an evaluation typically requires going through the following steps:
Create an evaluation dataset: Assemble a dataset of prompt instances that reflect your specific use case. You can include reference answers (ground truth) if you plan to use computation-based metrics.
Define evaluation metrics: Choose the metrics you want to use to measure model performance. The SDK supports all metric types, while the console supports adaptive rubrics.
Generate model responses: Select one or more models to generate responses for your dataset. The SDK supports any model callable via LiteLLM, while the console supports Google Gemini models.
Run the evaluation: Execute the evaluation job, which assesses each model's responses against your selected metrics.
Interpret the results: Review the aggregated scores and individual responses to analyze model performance.
Alternatively, the following code shows how to complete an evaluation with the GenAI Client in Vertex AI SDK:
from vertexai import client
from vertexai import types
import pandas as pd

# Create an evaluation dataset
prompts_df = pd.DataFrame({
    "prompt": [
        "Write a simple story about a dinosaur",
        "Generate a poem about Vertex AI",
    ],
})

# Get responses from one or multiple models
eval_dataset = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df,
)

# Define the evaluation metrics and run the evaluation job
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.RubricMetric.GENERAL_QUALITY],
)

# View the evaluation results
eval_result.show()
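To compare two models on the same dataset, a natural extension of the snippet above is to run inference once per candidate and evaluate both result sets together. The following sketch assumes that evaluate() accepts a list of inference results for side-by-side comparison; verify the exact signature in the SDK reference.

# Sketch only: assumes evaluate() accepts a list of inference results for
# side-by-side comparison. Reuses client, types, and prompts_df from above.
candidate_a = client.evals.run_inference(model="gemini-2.5-flash", src=prompts_df)
candidate_b = client.evals.run_inference(model="gemini-2.5-pro", src=prompts_df)

comparison_result = client.evals.evaluate(
    dataset=[candidate_a, candidate_b],
    metrics=[types.RubricMetric.GENERAL_QUALITY],
)
comparison_result.show()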
The Gen AI evaluation service offers two SDK interfaces:
GenAI Client in Vertex AI SDK (Recommended) (Preview)
from vertexai import client
The GenAI Client is the newer, recommended interface for evaluation, accessed through the unified Client class. It supports all evaluation methods and is designed for workflows that include model comparison, in-notebook visualization, and insights for model customization.
Evaluation module in Vertex AI SDK (GA)
from vertexai.evaluation import EvalTask
The evaluation module is the older interface, maintained for backward compatibility with existing workflows but no longer under active development. It is accessed through the EvalTask class. This method supports standard LLM-as-a-judge and computation-based metrics but does not support newer evaluation methods like adaptive rubrics.
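For reference, a minimal sketch of the older interface with computation-based metrics might look like the following; the metric names and dataset columns follow the evaluation module's documented conventions, but treat the details as an approximation and check the module's reference before relying on them.

import pandas as pd
from vertexai.evaluation import EvalTask

# Computation-based metrics compare each response against a reference (ground truth).
eval_dataset = pd.DataFrame({
    "response": ["The cat sat on the mat."],
    "reference": ["The cat is sitting on the mat."],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["bleu", "rouge_l_sum"],
)
result = eval_task.evaluate()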
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-27 UTC."],[],[],null,["# Gen AI evaluation service overview\n\n| **Note:** Vertex AI provides model evaluation metrics for both predictive AI and generative AI models. This page provides an overview of the evaluation service for generative AI models. To evaluate a predictive AI model, see [Model\n| evaluation in Vertex AI](/vertex-ai/docs/evaluation/introduction).\n\nThe Gen AI evaluation service in Vertex AI lets you evaluate any generative model or application and benchmark the evaluation results against your own judgment, using your own evaluation criteria.\n\nWhile leaderboards and reports offer insights into overall model performance, they don't reveal how a model handles your specific needs. The Gen AI evaluation service helps you define your own evaluation criteria, ensuring a clear understanding of how well generative AI models and applications align with your unique use case.\n\nEvaluation is important at every step of your Gen AI development process including model selection, prompt engineering, and model customization. Evaluating Gen AI is integrated within Vertex AI to help you launch and reuse evaluations as needed.\n\nGen AI evaluation service capabilities\n--------------------------------------\n\nThe Gen AI evaluation service can help you with the following tasks:\n\n- **Model selection**: Choose the best pre-trained model for your task based on benchmark results and its performance on your specific data.\n\n- **Generation settings**: Tweak model parameters (like temperature) to optimize output for your needs.\n\n- **Prompt engineering**: Craft effective prompts and prompt templates to guide the model towards your preferred behavior and responses.\n\n- **Improve and safeguard fine-tuning**: Fine-tune a model to improve performance for your use case, while avoiding biases or undesirable behaviors.\n\n- **RAG optimization**: Select the most effective Retrieval Augmented Generation (RAG) architecture to enhance performance for your application.\n\n- **Migration**: Continuously assess and improve the performance of your AI solution by migrating to newer models when they provide a clear advantage for your specific use case.\n\n- **Translation** (preview): Assess the quality of your model's translations.\n\n- **Evaluate agents**: Evaluate the performance of your agents using the Gen AI evaluation service.\n\nEvaluation process\n------------------\n\nThe Gen AI evaluation service lets you evaluate any Gen AI model or application on your evaluation criteria by following these steps:\n\n1. [**Define evaluation metrics**](/vertex-ai/generative-ai/docs/models/determine-eval):\n\n - Learn how to tailor model-based metrics to your business criteria.\n\n - Evaluate a single model (pointwise) or determine the winner when comparing 2 models (pairwise).\n\n - Include computation-based metrics for additional insights.\n\n2. [**Prepare your evaluation dataset**](/vertex-ai/generative-ai/docs/models/evaluation-dataset).\n\n - Provide a dataset that reflects your specific use case.\n3. 
[**Run an evaluation**](/vertex-ai/generative-ai/docs/models/run-evaluation).\n\n - Start from scratch, use a template, or adapt existing examples.\n\n - Define candidate models and create an `EvalTask` to reuse your evaluation logic through Vertex AI.\n\n4. [**View and interpret your evaluation results**](/vertex-ai/generative-ai/docs/models/view-evaluation).\n\n5. (Optional) Evaluate and improve the quality of the judge model:\n\n - [Evaluate the judge model](/vertex-ai/generative-ai/docs/models/evaluate-judge-model).\n\n - Use [advanced prompt engineering techniques](/vertex-ai/generative-ai/docs/models/prompt-judge-model) for judge model customization.\n\n - Use [system instructions and judge model configurations](/vertex-ai/generative-ai/docs/models/configure-judge-model) to improve evaluate results consistency and reduce judge model bias.\n\n6. (Optional) [Evaluate generative AI agents](/vertex-ai/generative-ai/docs/models/evaluation-agents).\n\nNotebooks for evaluation use cases\n----------------------------------\n\nThe following table lists Vertex AI SDK for Python notebooks for various generative AI evaluation use cases:\n\nSupported models and languages\n------------------------------\n\nThe Vertex AI Gen AI evaluation service supports Google's foundation models, third party models, and open models. You can provide pre-generated predictions directly, or automatically generate candidate model responses in the following ways:\n\n- Automatically generate responses for Google's foundation models (such as Gemini 2.0 Flash) and any model deployed in Vertex AI Model Registry.\n\n- Integrate with SDK text generation APIs from other third party and open models.\n\n- Wrap model endpoints from other providers using the Vertex AI SDK.\n\nFor Gemini model-based metrics, the Gen AI evaluation service supports all input languages that are [supported by Gemini 2.0 Flash](/vertex-ai/generative-ai/docs/learn/models#languages-gemini). 
However, the quality of evaluations for non-English inputs may not be as high as the quality for English inputs.\n\nThe Gen AI evaluation service supports the following languages for model-based translation metrics: \n\n### MetricX\n\n**Supported languages for [MetricX](https://github.com/google-research/metricx)**: Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.\n\n### COMET\n\n**Supported languages for [COMET](https://huggingface.co/Unbabel/wmt22-comet-da#languages-covered**)**: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish, Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western, Frisian, Xhosa, Yiddish.\n\nWhat's next\n-----------\n\n- Try the [evaluation quickstart](/vertex-ai/generative-ai/docs/models/evaluation-quickstart).\n\n- [Define your evaluation metrics](/vertex-ai/generative-ai/docs/models/determine-eval).\n\n- Learn how to [tune a foundation model](/vertex-ai/generative-ai/docs/models/tune-models)."]]