For model-based metrics, the Gen AI evaluation service evaluates your models with a foundational model, such as Gemini, that has been configured and prompted as a judge model. If you want to learn more about the judge model, the Advanced judge model customization series describes additional tools you can use to evaluate and configure the judge model.
For the basic evaluation workflow, see the Gen AI evaluation service quickstart. The Advanced judge model customization series includes the following pages:
1. Evaluate a judge model (current page)
2. Prompting for judge model customization
3. Configure a judge model
Overview
Using human judges to evaluate large language models (LLMs) can be expensive and time-consuming. Using a judge model is a more scalable way to evaluate LLMs. By default, the Gen AI evaluation service uses a configured Gemini 2.0 Flash model as the judge model, with customizable prompts to evaluate your model for various use cases.
The following sections show how to evaluate a customized judge model for your ideal use case.
Prepare the dataset
To evaluate model-based metrics, prepare an evaluation dataset with human ratings as the ground truth. The goal is to compare the scores from the model-based metrics with the human ratings to determine whether the metrics deliver sufficient quality for your use case.
For PointwiseMetric, prepare the {metric_name}/human_rating column in the dataset as the ground truth for the {metric_name}/score result generated by model-based metrics.
For PairwiseMetric, prepare the {metric_name}/human_pairwise_choice column in the dataset as the ground truth for the {metric_name}/pairwise_choice result generated by model-based metrics.
Use the following dataset schema:
Model-based metric | Human rating column
PointwiseMetric | {metric_name}/human_rating
PairwiseMetric | {metric_name}/human_pairwise_choice
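For example, a pointwise evaluation dataset for a fluency metric might look like the following sketch. The prompts, responses, and ratings are hypothetical placeholders; only the fluency/human_rating column name follows the schema above.

import pandas as pd

# Hypothetical pointwise dataset: each row pairs a model response with a
# human-assigned fluency rating that serves as the ground truth for the
# fluency/score result produced by the model-based metric.
pointwise_dataset = pd.DataFrame({
    "prompt": ["Summarize the article.", "Draft a product description."],
    "response": ["The article argues that...", "This lightweight jacket..."],
    "fluency/human_rating": [5, 3],
})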
Available metrics
For a PointwiseMetric that returns only 2 scores (such as 0 and 1), and a PairwiseMetric that only has 2 preference types (Model A or Model B), the following metrics are available:
Use the confusion_matrix and confusion_matrix_labels fields to calculate metrics such as True positive rate (TPR), True negative rate (TNR), False positive rate (FPR), and False negative rate (FNR).
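As an illustrative sketch, assuming the confusion matrix is exposed as a 2x2 nested list whose rows and columns are ordered by confusion_matrix_labels (verify the exact layout in your own results), these rates can be derived as follows:

# Hypothetical 2x2 confusion matrix: rows are human ratings (ground truth),
# columns are judge model scores, both ordered by confusion_matrix_labels.
confusion_matrix_labels = ["0", "1"]
confusion_matrix = [
    [40, 10],  # human rating 0: 40 true negatives, 10 false positives
    [5, 45],   # human rating 1: 5 false negatives, 45 true positives
]

tn, fp = confusion_matrix[0]
fn, tp = confusion_matrix[1]

tpr = tp / (tp + fn)  # true positive rate
tnr = tn / (tn + fp)  # true negative rate
fpr = fp / (fp + tn)  # false positive rate
fnr = fn / (fn + tp)  # false negative rate

print(f"TPR={tpr:.2f}, TNR={tnr:.2f}, FPR={fpr:.2f}, FNR={fnr:.2f}")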
For a PointwiseMetric that returns more than 2 scores (such as 1 through 5), and a PairwiseMetric that has more than 2 preference types (Model A, Model B, or Tie), the following metrics are available:
Where:
\( f1 = 2 \cdot precision \cdot recall / (precision + recall) \)
\( precision = \text{True Positives} / (\text{True Positives} + \text{False Positives}) \)
\( recall = \text{True Positives} / (\text{True Positives} + \text{False Negatives}) \)
\( n \): number of classes
\( cnt_i \): number of \( class_i \) in ground truth data
\( sum \): number of elements in ground truth data
To calculate other metrics, you can use open-source libraries.
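For example, the following sketch (an assumption, not part of the Gen AI evaluation service) uses scikit-learn to compute additional agreement metrics from the human ratings and the model-based metric scores, with column names following the {metric_name}/human_rating and {metric_name}/score pattern described earlier:

import pandas as pd
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score, f1_score

# Hypothetical results table with human ratings and judge model scores
# for a pointwise "fluency" metric.
results = pd.DataFrame({
    "fluency/human_rating": [1, 2, 5, 4, 3],
    "fluency/score": [1, 2, 4, 4, 3],
})

y_true = results["fluency/human_rating"]
y_pred = results["fluency/score"]

print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))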
Evaluate the model-based metric
The following example defines a pairwise metric with a custom definition of fluency, then evaluates the quality of that metric against human ratings.
import pandas as pd

from vertexai.preview.evaluation import (
    AutoraterConfig,
    EvalTask,
    PairwiseMetric,
)
from vertexai.preview.evaluation.autorater_utils import evaluate_autorater

# Step 1: Prepare the evaluation dataset with the human rating data column.
human_rated_dataset = pd.DataFrame({
    "prompt": [PROMPT_1, PROMPT_2],
    "response": [RESPONSE_1, RESPONSE_2],
    "baseline_model_response": [BASELINE_MODEL_RESPONSE_1, BASELINE_MODEL_RESPONSE_2],
    "pairwise_fluency/human_pairwise_choice": ["model_A", "model_B"],
})

# Step 2: Get the results from the model-based metric.
pairwise_fluency = PairwiseMetric(
    metric="pairwise_fluency",
    metric_prompt_template="please evaluate pairwise fluency...",
)

eval_result = EvalTask(
    dataset=human_rated_dataset,
    metrics=[pairwise_fluency],
).evaluate()

# Step 3: Calibrate the model-based metric result against the human preferences.
# eval_result contains the human evaluation result from human_rated_dataset.
evaluate_autorater_result = evaluate_autorater(
    evaluate_autorater_input=eval_result.metrics_table,
    eval_metrics=[pairwise_fluency],
)
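Continuing from the example above, you can also compare the judge model's choices with the human choices directly. This is a minimal sketch that assumes the metrics table contains the pairwise_fluency/pairwise_choice and pairwise_fluency/human_pairwise_choice columns described earlier:

# Fraction of examples where the judge model's pairwise choice matches
# the human pairwise choice (simple agreement rate).
metrics_table = eval_result.metrics_table
agreement = (
    metrics_table["pairwise_fluency/pairwise_choice"]
    == metrics_table["pairwise_fluency/human_pairwise_choice"]
).mean()
print(f"Judge-human agreement: {agreement:.0%}")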
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-29 UTC."],[],[],null,["# Evaluate a judge model\n\n| **Preview**\n|\n|\n| This product or feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA products and features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nFor model-based metrics, the Gen AI evaluation service evaluates your models with a foundational model, such as Gemini, that has been configured and prompted as a judge model. If you want to learn more about the judge model, the *Advanced judge model customization series* describes additional tools you can use to evaluate and configure the judge model.\n\nFor the basic evaluation workflow, see the [Gen AI evaluation service quickstart](/vertex-ai/generative-ai/docs/models/evaluation-quickstart). The *Advanced judge model customization series* includes the following pages:\n\n1. Evaluate a judge model (current page)\n2. [Prompting for judge model customization](/vertex-ai/generative-ai/docs/models/prompt-judge-model)\n3. [Configure a judge model](/vertex-ai/generative-ai/docs/models/configure-judge-model)\n\nOverview\n--------\n\nUsing human judges to evaluate large language models (LLMs) can be expensive and\ntime consuming. Using a judge model is a more scalable way to evaluate LLMs. The\nGen AI evaluation service uses a configured Gemini 2.0 Flash\nmodel by default as the judge model, with customizable prompts to evaluate your\nmodel for various use cases.\n\nThe following sections show how to evaluate a customized judge model for your ideal use case.\n\nPrepare the dataset\n-------------------\n\nTo evaluate model-based metrics, prepare an evaluation dataset with human ratings as the ground truth. 
The goal is to compare the scores from model-based metrics with human ratings and see if model-based metrics have the ideal quality for your use case.\n\n- For `PointwiseMetric`, prepare the `{metric_name}/human_rating` column in the dataset as the ground truth for the `{metric_name}/score` result generated by model-based metrics.\n\n- For `PairwiseMetric`, prepare the `{metric_name}/human_pairwise_choice` column in the dataset as the ground truth for the `{metric_name}/pairwise_choice` result generated by model-based metrics.\n\nUse the following dataset schema:\n\nAvailable metrics\n-----------------\n\nFor a `PointwiseMetric` that returns only 2 scores (such as 0 and 1), and a `PairwiseMetric` that only has 2 preference types (Model A or Model B), the following metrics are available:\n\nFor a `PointwiseMetric` that returns more than 2 scores (such as 1 through 5), and a `PairwiseMetric` that has more than 2 preference types (Model A, Model B, or Tie), the following metrics are available:\n\nWhere:\n\n- \\\\( f1 = 2 \\* precision \\* recall / (precision + recall) \\\\)\n\n - \\\\( precision = True Positives / (True Positives + False Positives) \\\\)\n\n - \\\\( recall = True Positives / (True Positives + False Negatives) \\\\)\n\n- \\\\( n \\\\) : number of classes\n\n- \\\\( cnt_i \\\\) : number of \\\\( class_i \\\\) in ground truth data\n\n- \\\\( sum \\\\): number of elements in ground truth data\n\nTo calculate other metrics, you can use open-source libraries.\n\nEvaluate the model-based metric\n-------------------------------\n\nThe following example updates the model-based metric with a custom definition of fluency, then evaluates the quality of the metric. \n\n from vertexai.preview.evaluation import {\n AutoraterConfig,\n PairwiseMetric,\n }\n from vertexai.preview.evaluation.autorater_utils import evaluate_autorater\n\n\n # Step 1: Prepare the evaluation dataset with the human rating data column.\n human_rated_dataset = pd.DataFrame({\n \"prompt\": [PROMPT_1, PROMPT_2],\n \"response\": [RESPONSE_1, RESPONSE_2],\n \"baseline_model_response\": [BASELINE_MODEL_RESPONSE_1, BASELINE_MODEL_RESPONSE_2],\n \"pairwise_fluency/human_pairwise_choice\": [\"model_A\", \"model_B\"]\n })\n\n # Step 2: Get the results from model-based metric\n pairwise_fluency = PairwiseMetric(\n metric=\"pairwise_fluency\",\n metric_prompt_template=\"please evaluate pairwise fluency...\"\n )\n\n eval_result = EvalTask(\n dataset=human_rated_dataset,\n metrics=[pairwise_fluency],\n ).evaluate()\n\n # Step 3: Calibrate model-based metric result and human preferences.\n # eval_result contains human evaluation result from human_rated_dataset.\n evaluate_autorater_result = evaluate_autorater(\n evaluate_autorater_input=eval_result.metrics_table,\n eval_metrics=[pairwise_fluency]\n )\n\nWhat's next\n-----------\n\n- [Prompting for judge model customization](/vertex-ai/generative-ai/docs/models/prompt-judge-model)\n- [Configure your judge model](/vertex-ai/generative-ai/docs/models/configure-judge-model)"]]