Model-based metrics let you customize how you generate evaluation metrics based on your criteria and use cases. This guide shows you how to configure a judge model and covers the following topics:
- Choose a configuration option: Compare the different methods available to customize your judge model.
- System instructions: Provide high-level, persistent instructions to the judge model to influence its behavior.
- Response flipping: Reduce potential positional bias by swapping the position of the baseline and candidate model responses.
- Multi-sampling: Improve consistency by calling the judge model multiple times for the same input and aggregating the results.
- Tuned judge model: Use a fine-tuned LLM as the judge model for specialized evaluation tasks.
For the basic evaluation workflow, see the Gen AI evaluation service quickstart. The Advanced judge model customization series includes the following pages:
- Evaluate a judge model
- Prompting for judge model customization
- Configure a judge model (current page)
Choose a configuration option
You have several options to configure your judge model for improved quality. The following table provides a high-level comparison of each approach.
| Option | Description | Use case |
|---|---|---|
| System instructions | Provides high-level, persistent instructions to the judge model that influence its behavior for all subsequent evaluation prompts. | When you need to define a consistent role, persona, or output format for the judge model across the entire evaluation task. |
| Response flipping | Swaps the position of the baseline and candidate model responses for half of the evaluation calls. | To reduce potential positional bias in pairwise evaluations, where the judge model might favor the response in the first or second position. |
| Multi-sampling | Calls the judge model multiple times for the same input and aggregates the results. | To improve the consistency and reliability of evaluation scores by mitigating the effects of randomness in the judge model's responses. |
| Tuned judge model | Uses a fine-tuned LLM as the judge model for evaluation. | For specialized evaluation tasks that require nuanced understanding or domain-specific knowledge that a general-purpose model lacks. |
System instructions
Gemini models can take in system instructions, which are a set of instructions that impact how the model processes prompts. You can use system instructions when you initialize or generate content from a model to specify product-level behavior such as roles, personas, contextual information, and explanation style and tone. The judge model typically gives more weight to system instructions than to input prompts.
For a list of models that support system instructions, see Supported models.
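As an illustration only (separate from the evaluation workflow), a minimal sketch of attaching system instructions when you initialize a Gemini model with the Vertex AI SDK's GenerativeModel class might look like the following; the model name and instruction text are placeholders:
from vertexai.generative_models import GenerativeModel

# Initialize a Gemini model with persistent system instructions.
model = GenerativeModel(
    "gemini-2.0-flash",
    system_instruction=["You are an expert evaluator. Explain your ratings concisely."],
)

# The system instructions apply to every subsequent call to this model.
response = model.generate_content("Rate the fluency of this sentence: ...")
print(response.text)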
The following example uses the Vertex AI SDK to add system_instruction at the metric level for PointwiseMetric:
from vertexai.preview.evaluation import EvalTask, PointwiseMetric

# Define system instructions for the judge model.
system_instruction = "You are an expert evaluator."

# Attach the system instructions to the metric.
linguistic_acceptability = PointwiseMetric(
    metric="linguistic_acceptability",
    metric_prompt_template=linguistic_acceptability_metric_prompt_template,
    system_instruction=system_instruction,
)

# Run the evaluation with the customized metric.
eval_result = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[linguistic_acceptability],
).evaluate()
You can use the same approach with PairwiseMetric.
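For example, a minimal sketch of the same customization for a pairwise metric might look like the following; the metric name and the pairwise_linguistic_acceptability_prompt_template variable are illustrative placeholders that you define yourself:
from vertexai.preview.evaluation import EvalTask, PairwiseMetric

# Attach the same system instructions to a pairwise metric.
pairwise_linguistic_acceptability = PairwiseMetric(
    metric="pairwise_linguistic_acceptability",
    metric_prompt_template=pairwise_linguistic_acceptability_prompt_template,
    system_instruction=system_instruction,
)

eval_result = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[pairwise_linguistic_acceptability],
).evaluate()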
Response flipping
For PairwiseMetric, the Gen AI evaluation service uses responses from both a baseline model and a candidate model. The judge model evaluates which response better aligns with the criteria in the metric_prompt_template. However, the judge model might be biased toward the baseline or candidate model in certain settings.
To reduce bias in the evaluation results, you can enable response flipping. This technique swaps the baseline and candidate model responses for half of the calls to the judge model. The following example shows how to enable response flipping using the Vertex AI SDK:
from vertexai.preview.evaluation import AutoraterConfig, EvalTask, PairwiseMetric

pairwise_relevance_prompt_template = """
# Instruction
…
### Response A
{baseline_model_response}
### Response B
{candidate_model_response}
"""

my_pairwise_metric = PairwiseMetric(
    metric="my_pairwise_metric",
    metric_prompt_template=pairwise_relevance_prompt_template,
    candidate_response_field_name="candidate_model_response",
    baseline_response_field_name="baseline_model_response",
)

# Define an AutoraterConfig with response flipping enabled.
my_autorater_config = AutoraterConfig(flip_enabled=True)

# Run the evaluation with the AutoraterConfig.
flip_enabled_eval_result = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[my_pairwise_metric],
    autorater_config=my_autorater_config,
).evaluate()
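After the evaluation completes, you can inspect the aggregated scores and the per-row judgments on the returned result object. This is a minimal sketch that assumes the result object exposes summary_metrics and metrics_table, as in other Gen AI evaluation service examples:
# Aggregated pairwise results, such as the candidate model win rate.
print(flip_enabled_eval_result.summary_metrics)

# Per-row judgments and explanations as a pandas DataFrame.
print(flip_enabled_eval_result.metrics_table)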
Multi-sampling
The judge model can exhibit randomness in its responses during an evaluation. To mitigate the effects of this randomness and produce more consistent results, you can use additional sampling. This technique is also known as multi-sampling.
However, increasing the sampling count also increases request latency. You can set the sampling_count value in AutoraterConfig to an integer between 1 and 32. We recommend the default sampling_count value of 4 to balance randomness and latency.
Using the Vertex AI SDK, you can specify the number of samples to execute for each request:
from vertexai.preview.evaluation import AutoraterConfig, EvalTask

# Define a customized sampling count in AutoraterConfig.
autorater_config = AutoraterConfig(sampling_count=6)

# Run the evaluation with the customized sampling count.
eval_result = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[METRICS],
    autorater_config=autorater_config,
).evaluate()
Tuned judge model
If you have high-quality tuning data for your evaluation use case, you can use the Vertex AI SDK to tune a Gemini model as the judge model and then use the tuned model for evaluation. You specify a tuned model as the judge model through AutoraterConfig:
from vertexai.preview.evaluation import (
    AutoraterConfig,
    EvalTask,
    PairwiseMetric,
    evaluate_autorater,
    tune_autorater,
)

# Tune a model to be the judge model. The tune_autorater helper function
# returns an AutoraterConfig with the judge model set to the tuned model.
autorater_config: AutoraterConfig = tune_autorater(
    base_model="gemini-2.0-flash",
    train_dataset=f"{BUCKET_URI}/train/sft_train_samples.jsonl",
    validation_dataset=f"{BUCKET_URI}/val/sft_val_samples.jsonl",
    tuned_model_display_name=tuned_model_display_name,
)

# Alternatively, set up the judge model with an existing tuned model endpoint.
autorater_config = AutoraterConfig(autorater_model=TUNED_MODEL)

# Use the tuned judge model for evaluation.
eval_result = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[METRICS],
    autorater_config=autorater_config,
).evaluate()