Evaluate a judge model

For model-based metrics, the Gen AI evaluation service evaluates your models with a foundation model, such as Gemini, that has been configured and prompted to act as a judge model. To learn more, the Advanced judge model customization series describes additional tools that you can use to evaluate and configure the judge model.

For the basic evaluation workflow, see the Gen AI evaluation service quickstart. The Advanced judge model customization series includes the following pages:

  1. Evaluate a judge model (current page)
  2. Prompting for judge model customization
  3. Configure a judge model

Overview

Evaluating large language models (LLMs) with human judges can be expensive and time-consuming. Using a judge model is a more scalable way to evaluate LLMs. The Gen AI evaluation service uses a configured Gemini 1.5 Pro model by default as the judge model, with customizable prompts to evaluate your model for various use cases.

The following sections show how to evaluate a customized judge model for your use case.

Prepare the dataset

To evaluate model-based metrics, prepare an evaluation dataset that includes human ratings as the ground truth. The goal is to compare the scores from model-based metrics with the human ratings and determine whether the model-based metrics provide the quality you need for your use case.

  • For PointwiseMetric, prepare the {metric_name}/human_rating column in the dataset as the ground truth for the {metric_name}/score result generated by model-based metrics.

  • For PairwiseMetric, prepare the {metric_name}/human_pairwise_choice column in the dataset as the ground truth for the {metric_name}/pairwise_choice result generated by model-based metrics.

Use the following dataset schema:

Model-based metric | Human rating column
PointwiseMetric    | {metric_name}/human_rating
PairwiseMetric     | {metric_name}/human_pairwise_choice
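
For example, a minimal pointwise dataset with human ratings might look like the following sketch. The metric name (fluency), the prompt and response strings, and the ratings are hypothetical placeholders:

import pandas as pd

# A minimal sketch of a pointwise evaluation dataset. The metric name
# ("fluency") and all values are hypothetical placeholders.
pointwise_human_rated_dataset = pd.DataFrame({
    "prompt": ["Summarize the article.", "Translate the sentence to French."],
    "response": ["The article describes...", "La phrase traduite est..."],
    # Ground-truth human ratings for the fluency metric.
    "fluency/human_rating": [4, 2],
})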

Available metrics

For a PointwiseMetric that returns only 2 scores (such as 0 and 1), and a PairwiseMetric that only has 2 preference types (Model A or Model B), the following metrics are available:

Metric | Calculation
2-class balanced accuracy | \( \frac{1}{2} \times (\text{True Positive Rate} + \text{True Negative Rate}) \)
2-class balanced f1 score | \( \sum_{i=0,1} \frac{cnt_i}{sum} \times f1(class_i) \)
Confusion matrix | Use the confusion_matrix and confusion_matrix_labels fields to calculate metrics such as the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).

For example, the following result:

confusion_matrix = [[20, 31, 15],
                    [10, 11,  3],
                    [ 3,  2,  2]]
confusion_matrix_labels = ['BASELINE', 'CANDIDATE', 'TIE']

translates to the following confusion matrix:

              | BASELINE | CANDIDATE | TIE
    BASELINE  |    20    |    31     |  15
    CANDIDATE |    10    |    11     |   3
    TIE       |     3    |     2     |   2
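
As an illustration, the following sketch computes the true positive rate, true negative rate, and 2-class balanced accuracy from a 2x2 confusion matrix with NumPy. The matrix values are made up for this example:

import numpy as np

# Hypothetical 2x2 confusion matrix (rows: ground truth, columns: metric result),
# with the negative class first; the counts are made up for this example.
confusion_matrix = np.array([[40, 10],   # actual negative: 40 TN, 10 FP
                             [ 5, 45]])  # actual positive:  5 FN, 45 TP

tn, fp = confusion_matrix[0]
fn, tp = confusion_matrix[1]

tpr = tp / (tp + fn)  # true positive rate (recall of the positive class)
tnr = tn / (tn + fp)  # true negative rate (recall of the negative class)
balanced_accuracy = 0.5 * (tpr + tnr)

print(f"TPR={tpr:.3f}, TNR={tnr:.3f}, balanced accuracy={balanced_accuracy:.3f}")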

For a PointwiseMetric that returns more than 2 scores (such as 1 through 5), and a PairwiseMetric that has more than 2 preference types (Model A, Model B, or Tie), the following metrics are available:

Metric | Calculation
Multiple-class balanced accuracy | \( \frac{1}{n} \sum_{i=1}^{n} recall(class_i) \)
Multiple-class balanced f1 score | \( \sum_{i=1}^{n} \frac{cnt_i}{sum} \times f1(class_i) \)

Where:

  • \( f1 = \frac{2 \times precision \times recall}{precision + recall} \)

    • \( precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \)

    • \( recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \)

  • \( n \): number of classes

  • \( cnt_i \): number of \( class_i \) in the ground truth data

  • \( sum \): number of elements in the ground truth data

To calculate other metrics, you can use open-source libraries.
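
For example, if you extract the score and human rating columns from your evaluation results, a sketch like the following uses scikit-learn, an open-source library, to compute the multiple-class balanced accuracy and the count-weighted f1 score. The rating lists are hypothetical:

from sklearn.metrics import balanced_accuracy_score, f1_score

# Hypothetical ground-truth human ratings and model-based metric scores on a
# 1-5 scale; replace them with the columns from your own evaluation results.
human_ratings = [5, 3, 4, 1, 2, 5, 4, 3]
metric_scores = [5, 3, 3, 1, 2, 4, 4, 3]

# Multiple-class balanced accuracy: the mean of per-class recall.
balanced_acc = balanced_accuracy_score(human_ratings, metric_scores)

# Multiple-class balanced f1: per-class f1 weighted by class counts in the ground truth.
balanced_f1 = f1_score(human_ratings, metric_scores, average="weighted")

print(f"balanced accuracy={balanced_acc:.3f}, balanced f1={balanced_f1:.3f}")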

Evaluate the model-based metric

The following example updates the model-based metric with a custom definition of fluency, then evaluates the quality of the metric.

import pandas as pd

from vertexai.evaluation import (
    AutoraterConfig,
    EvalTask,
    PairwiseMetric,
    evaluate_autorater,
)


# Step 1: Prepare the evaluation dataset with the human rating data column.
human_rated_dataset = pd.DataFrame({
    "prompt": [PROMPT_1, PROMPT_2],
    "response": [RESPONSE_1, RESPONSE_2],
    "pairwise_fluency/human_pairwise_choice": ["model_A", "model_B"],
})

# Step 2: Get the results from the model-based metric.
pairwise_fluency = PairwiseMetric(
    metric="pairwise_fluency",
    metric_prompt_template="please evaluate pairwise fluency..."
)

eval_result = EvalTask(
    dataset=human_rated_dataset,
    metrics=[pairwise_fluency],
).evaluate()

# Step 3: Compare the model-based metric results with the human preferences.
# eval_result.metrics_table also contains the human ratings from human_rated_dataset.
evaluate_autorater_result = evaluate_autorater(eval_result.metrics_table, [pairwise_fluency])
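
To check how closely the metric agrees with the human raters, inspect the returned result, for example by printing it. This is a minimal sketch; the exact structure of the returned object depends on your SDK version:

# Step 4 (optional): Inspect the agreement between the metric and the human raters.
# The exact structure of the returned object depends on the SDK version.
print(evaluate_autorater_result)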

What's next