# Class PairwiseMetric (1.95.1)

    PairwiseMetric(
        *,
        metric: str,
        metric_prompt_template: typing.Union[
            vertexai.evaluation.metrics.metric_prompt_template.PairwiseMetricPromptTemplate,
            str,
        ],
        baseline_model: typing.Optional[
            typing.Union[
                vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
            ]
        ] = None
    )

A Model-based Pairwise Metric.

A model-based evaluation metric that compares two generative models' responses
side by side, allowing users to A/B test their generative models and determine
which model performs better.

For more details on when to use pairwise metrics, see
[Evaluation methods and
metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#pointwise_versus_pairwise).

Result Details:

* In `EvalResult.summary_metrics`, win rates are computed for both the
  baseline and candidate models. The win rate is the proportion of a
  model's wins out of the total number of comparisons, expressed as a
  decimal value between 0 and 1.
* In `EvalResult.metrics_table`, a pairwise metric produces two
evaluation results per dataset row:
  * `pairwise_choice`: Indicates whether the candidate model or the
    baseline model performed better, or whether the two are equally good.
  * `explanation`: The rationale behind each verdict, produced with
    chain-of-thought reasoning. The explanation helps users scrutinize
    the judgment and build appropriate trust in the decisions.
See the [documentation
page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#understand-results)
for more details on understanding the metric results; a minimal sketch of
reading these fields is shown below.
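The sketch below reads both fields from the `EvalResult` returned by
`EvalTask.evaluate()` (as in the usage example further down) for a metric named
`pairwise_groundedness`. The exact summary keys and table column names used here
are assumptions about the SDK's naming convention and may differ by version.

```
# Sketch only: `pairwise_result` is the EvalResult from EvalTask.evaluate(),
# and the key/column names assume a metric named "pairwise_groundedness".

# Win rates from the summary metrics (decimal values between 0 and 1).
summary = pairwise_result.summary_metrics
print(summary["pairwise_groundedness/candidate_model_win_rate"])
print(summary["pairwise_groundedness/baseline_model_win_rate"])

# Per-row results: one choice and one explanation for each dataset row.
table = pairwise_result.metrics_table  # a pandas DataFrame
for _, row in table.iterrows():
    print(row["pairwise_groundedness/pairwise_choice"])  # e.g. "CANDIDATE", "BASELINE", or "TIE"
    print(row["pairwise_groundedness/explanation"])
```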
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-07 UTC."],[],[],null,["# Class PairwiseMetric (1.95.1)\n\nVersion latestkeyboard_arrow_down\n\n- [1.95.1 (latest)](/python/docs/reference/vertexai/latest/vertexai.evaluation.PairwiseMetric)\n- [1.94.0](/python/docs/reference/vertexai/1.94.0/vertexai.evaluation.PairwiseMetric)\n- [1.93.1](/python/docs/reference/vertexai/1.93.1/vertexai.evaluation.PairwiseMetric)\n- [1.92.0](/python/docs/reference/vertexai/1.92.0/vertexai.evaluation.PairwiseMetric)\n- [1.91.0](/python/docs/reference/vertexai/1.91.0/vertexai.evaluation.PairwiseMetric)\n- [1.90.0](/python/docs/reference/vertexai/1.90.0/vertexai.evaluation.PairwiseMetric)\n- [1.89.0](/python/docs/reference/vertexai/1.89.0/vertexai.evaluation.PairwiseMetric)\n- [1.88.0](/python/docs/reference/vertexai/1.88.0/vertexai.evaluation.PairwiseMetric)\n- [1.87.0](/python/docs/reference/vertexai/1.87.0/vertexai.evaluation.PairwiseMetric)\n- [1.86.0](/python/docs/reference/vertexai/1.86.0/vertexai.evaluation.PairwiseMetric)\n- [1.85.0](/python/docs/reference/vertexai/1.85.0/vertexai.evaluation.PairwiseMetric)\n- [1.84.0](/python/docs/reference/vertexai/1.84.0/vertexai.evaluation.PairwiseMetric)\n- [1.83.0](/python/docs/reference/vertexai/1.83.0/vertexai.evaluation.PairwiseMetric)\n- [1.82.0](/python/docs/reference/vertexai/1.82.0/vertexai.evaluation.PairwiseMetric)\n- [1.81.0](/python/docs/reference/vertexai/1.81.0/vertexai.evaluation.PairwiseMetric)\n- [1.80.0](/python/docs/reference/vertexai/1.80.0/vertexai.evaluation.PairwiseMetric)\n- [1.79.0](/python/docs/reference/vertexai/1.79.0/vertexai.evaluation.PairwiseMetric)\n- [1.78.0](/python/docs/reference/vertexai/1.78.0/vertexai.evaluation.PairwiseMetric)\n- [1.77.0](/python/docs/reference/vertexai/1.77.0/vertexai.evaluation.PairwiseMetric)\n- [1.76.0](/python/docs/reference/vertexai/1.76.0/vertexai.evaluation.PairwiseMetric)\n- [1.75.0](/python/docs/reference/vertexai/1.75.0/vertexai.evaluation.PairwiseMetric)\n- [1.74.0](/python/docs/reference/vertexai/1.74.0/vertexai.evaluation.PairwiseMetric)\n- [1.73.0](/python/docs/reference/vertexai/1.73.0/vertexai.evaluation.PairwiseMetric)\n- [1.72.0](/python/docs/reference/vertexai/1.72.0/vertexai.evaluation.PairwiseMetric)\n- [1.71.1](/python/docs/reference/vertexai/1.71.1/vertexai.evaluation.PairwiseMetric)\n- [1.70.0](/python/docs/reference/vertexai/1.70.0/vertexai.evaluation.PairwiseMetric)\n- [1.69.0](/python/docs/reference/vertexai/1.69.0/vertexai.evaluation.PairwiseMetric)\n- [1.68.0](/python/docs/reference/vertexai/1.68.0/vertexai.evaluation.PairwiseMetric)\n- [1.67.1](/python/docs/reference/vertexai/1.67.1/vertexai.evaluation.PairwiseMetric)\n- [1.66.0](/python/docs/reference/vertexai/1.66.0/vertexai.evaluation.PairwiseMetric)\n- [1.65.0](/python/docs/reference/vertexai/1.65.0/vertexai.evaluation.PairwiseMetric)\n- [1.63.0](/python/docs/reference/vertexai/1.63.0/vertexai.evaluation.PairwiseMetric)\n- [1.62.0](/python/docs/reference/vertexai/1.62.0/vertexai.evaluation.PairwiseMetric)\n- [1.60.0](/python/docs/reference/vertexai/1.60.0/vertexai.evaluation.PairwiseMetric)\n- 
[1.59.0](/python/docs/reference/vertexai/1.59.0/vertexai.evaluation.PairwiseMetric) \n\n PairwiseMetric(\n *,\n metric: str,\n metric_prompt_template: typing.Union[\n vertexai.evaluation.metrics.metric_prompt_template.PairwiseMetricPromptTemplate,\n str,\n ],\n baseline_model: typing.Optional[\n typing.Union[\n vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]\n ]\n ] = None\n )\n\nA Model-based Pairwise Metric.\n\nA model-based evaluation metric that compares two generative models' responses\nside-by-side, and allows users to A/B test their generative models to\ndetermine which model is performing better.\n\nFor more details on when to use pairwise metrics, see\n[Evaluation methods and\nmetrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#pointwise_versus_pairwise).\n\nResult Details: \n\n * In `EvalResult.summary_metrics`, win rates for both the baseline and\n candidate model are computed. The win rate is computed as proportion of\n wins of one model's responses to total attempts as a decimal value\n between 0 and 1.\n\n * In `EvalResult.metrics_table`, a pairwise metric produces two\n evaluation results per dataset row:\n * `pairwise_choice`: The choice shows whether the candidate model or\n the baseline model performs better, or if they are equally good.\n * `explanation`: The rationale behind each verdict using\n chain-of-thought reasoning. The explanation helps users scrutinize\n the judgment and builds appropriate trust in the decisions.\n\n See [documentation\n page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#understand-results)\n for more details on understanding the metric results.\n\nUsage Examples: \n\n ```\n baseline_model = GenerativeModel(\"gemini-1.0-pro\")\n candidate_model = GenerativeModel(\"gemini-1.5-pro\")\n\n pairwise_groundedness = PairwiseMetric(\n metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(\n \"pairwise_groundedness\"\n ),\n baseline_model=baseline_model,\n )\n eval_dataset = pd.DataFrame({\n \"prompt\" : [...],\n })\n pairwise_task = EvalTask(\n dataset=eval_dataset,\n metrics=[pairwise_groundedness],\n experiment=\"my-pairwise-experiment\",\n )\n pairwise_result = pairwise_task.evaluate(\n model=candidate_model,\n experiment_run_name=\"gemini-pairwise-eval-run\",\n )\n ```\n\nMethods\n-------\n\n### PairwiseMetric\n\n PairwiseMetric(\n *,\n metric: str,\n metric_prompt_template: typing.Union[\n vertexai.evaluation.metrics.metric_prompt_template.PairwiseMetricPromptTemplate,\n str,\n ],\n baseline_model: typing.Optional[\n typing.Union[\n vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]\n ]\n ] = None\n )\n\nInitializes a pairwise evaluation metric."]]
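Based only on the constructor signature above, `metric_prompt_template` can also
be a plain string and `baseline_model` can be any `Callable[[str], str]`. The
sketch below illustrates that combination; the metric name, the
`{prompt}`/`{response}`/`{baseline_model_response}` placeholders, and the
`call_my_legacy_model` helper are assumptions for illustration, not part of this
page.

```
# Hypothetical sketch: a custom judge prompt as a plain string, and a Python
# callable standing in for the baseline model.
def call_my_legacy_model(prompt: str) -> str:
    # Stand-in for any text-in/text-out system used as the baseline.
    return "baseline answer for: " + prompt

custom_pairwise_quality = PairwiseMetric(
    metric="custom_pairwise_quality",
    metric_prompt_template=(
        "You are comparing two answers to the same user prompt.\n"
        "User prompt: {prompt}\n"
        "Response A (baseline): {baseline_model_response}\n"
        "Response B (candidate): {response}\n"
        "State which response is better, or whether they are equally good, "
        "and explain your reasoning."
    ),
    baseline_model=call_my_legacy_model,  # Callable[[str], str] per the signature
)
```

Passing a callable lets an existing system (for example, a deployed endpoint
behind a thin wrapper) serve as the baseline without constructing a
`GenerativeModel`.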