Class PairwiseMetric (1.54.0)

PairwiseMetric(
    metric: typing.Literal["summarization_quality", "question_answering_quality"],
    baseline_model: typing.Optional[
        typing.Union[
            vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
        ]
    ] = None,
    use_reference: bool = False,
    version: typing.Optional[int] = None
)

The Side-by-side (SxS) Pairwise Metric.

A model-based evaluation metric that compares two generative models side by side, letting users A/B test them to determine which model performs better on a given evaluation task.

For more details on when to use pairwise metrics, see Evaluation methods and metrics.

Result Details:

* In `EvalResult.summary_metrics`, win rates are computed for both the
baseline and candidate models, showing the rate at which each model
performs better on the given task. The win rate is computed as the number
of times the candidate model performs better than the baseline model
divided by the total number of examples. The win rate is a number between
0 and 1.

* In `EvalResult.metrics_table`, a pairwise metric produces three
evaluation results for each row in the dataset:
    * `pairwise_choice`: An enumeration that indicates whether the
      candidate or the baseline model performs better.
    * `explanation`: The model AutoRater's rationale behind each verdict
      using chain-of-thought reasoning. These explanations help users
      scrutinize the AutoRater's judgment and build appropriate trust in
      its decisions.
    * `confidence`: A score between 0 and 1 that signifies how confident
      the AutoRater was in its verdict. A score closer to 1 means higher
      confidence.

See the documentation page for more details on understanding the metric
results.
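
The win-rate arithmetic described above can be sketched in a few lines of plain Python. This is an illustration only, not the SDK's implementation: the `rows` data is made up, and the choice labels `"CANDIDATE"`, `"BASELINE"`, and `"TIE"` are assumed for the example rather than taken from the SDK.

```python
from collections import Counter

# Made-up rows mirroring the per-row results described above.
# The labels "CANDIDATE", "BASELINE", and "TIE" are assumptions for this sketch.
rows = [
    {"pairwise_choice": "CANDIDATE", "confidence": 0.9},
    {"pairwise_choice": "BASELINE", "confidence": 0.6},
    {"pairwise_choice": "CANDIDATE", "confidence": 0.4},
    {"pairwise_choice": "TIE", "confidence": 0.5},
]


def win_rates(rows):
    """Return (candidate_win_rate, baseline_win_rate) over all rows."""
    if not rows:
        raise ValueError("need at least one evaluated row")
    counts = Counter(row["pairwise_choice"] for row in rows)
    total = len(rows)
    # Each rate is "number of wins" divided by "total number of examples".
    return counts["CANDIDATE"] / total, counts["BASELINE"] / total


candidate_rate, baseline_rate = win_rates(rows)
print(candidate_rate, baseline_rate)

# Verdicts the AutoRater was less sure about may merit a manual look.
low_confidence = [row for row in rows if row["confidence"] < 0.5]
```

Because the denominator is the total number of examples, the two win rates need not sum to 1 if some rows favor neither model.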


Usage:

import pandas as pd
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask, PairwiseMetric

baseline_model = GenerativeModel("gemini-1.0-pro")
candidate_model = GenerativeModel("gemini-1.5-pro")

# Compare every candidate response against the baseline model's response.
pairwise_summarization_quality = PairwiseMetric(
    metric="summarization_quality",
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=pd.DataFrame({
        "instruction": [...],
        "context": [...],
    }),
    metrics=[pairwise_summarization_quality],
)

# The candidate model is the one passed to evaluate().
pairwise_results = eval_task.evaluate(
    prompt_template="instruction: {instruction}. context: {context}",
    model=candidate_model,
)



PairwiseMetric(
    metric: typing.Literal["summarization_quality", "question_answering_quality"],
    baseline_model: typing.Optional[
        typing.Union[
            vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
        ]
    ] = None,
    use_reference: bool = False,
    version: typing.Optional[int] = None
)

Initializes the Side-by-side (SxS) Pairwise evaluation metric.
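
The signature types `baseline_model` as either a `GenerativeModel` or a `Callable[[str], str]`. A minimal sketch of the callable form follows, assuming a hypothetical `truncate_baseline` function that stands in for any baseline mapping a prompt string to a response string. The `build_metric` helper is also illustrative, not part of the SDK. It imports `vertexai` lazily so the stand-in baseline can be exercised without the SDK installed.

```python
from typing import Callable


def truncate_baseline(prompt: str) -> str:
    """Hypothetical baseline: 'summarizes' by keeping the first 20 words."""
    return " ".join(prompt.split()[:20])


def build_metric(baseline: Callable[[str], str]):
    """Wrap a str -> str baseline in a PairwiseMetric (needs the Vertex AI SDK)."""
    # Imported here so the rest of this sketch runs without the SDK.
    from vertexai.preview.evaluation import PairwiseMetric

    return PairwiseMetric(
        metric="summarization_quality",
        baseline_model=baseline,
    )


# The baseline can be checked on its own before it is handed to the metric.
print(truncate_baseline("one two three"))
```

With the SDK installed and initialized, `build_metric(truncate_baseline)` would be passed to `EvalTask` through its `metrics` list in the same way as a metric built from a `GenerativeModel` baseline.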