This guide shows you how to perform pairwise model-based evaluation using AutoSxS, a tool that uses the evaluation pipeline service. This guide covers the following topics:
- What is AutoSxS?: Learn about the automatic side-by-side evaluation tool and its core component, the autorater.
- Supported models, tasks, and criteria: Find out which models and tasks are supported and the criteria used for evaluation.
- Prepare the evaluation dataset: Understand the required dataset format and how to configure prompt parameters for inference and evaluation.
- Perform model evaluation: Follow step-by-step instructions to run an evaluation job using the REST API, Python SDK, or the Google Cloud console.
- View evaluation results: Learn how to interpret the output, including judgments, aggregate metrics, and human-preference alignment metrics.
- AutoSxS use cases: Explore common scenarios for using AutoSxS, such as comparing models, comparing predictions, and checking alignment with human preferences.
What is AutoSxS?
Automatic side-by-side (AutoSxS) is a pairwise model-based evaluation tool that uses the evaluation pipeline service. You can use AutoSxS to evaluate the performance of generative AI models in the Vertex AI Model Registry or to evaluate pre-generated predictions. This capability lets AutoSxS support Vertex AI foundation models, tuned generative AI models, and third-party language models. AutoSxS is available on demand and evaluates language models with performance comparable to human raters.
The autorater
At a high level, the following diagram shows how AutoSxS compares the predictions of models A and B with a third model, the autorater.
Models A and B receive input prompts, and each model generates a response. These responses are then sent to the autorater. Similar to a human rater, an autorater is a language model that judges the quality of model responses based on an original inference prompt. With AutoSxS, the autorater compares the quality of two model responses based on their inference instruction by using a set of criteria. The autorater uses these criteria to determine which model performed better. The autorater outputs response preferences as aggregate metrics, along with preference explanations and confidence scores for each example. For more information, see the judgment table.
Supported models
AutoSxS supports the evaluation of any model when you provide pre-generated predictions. AutoSxS also supports automatically generating responses for any model in the Vertex AI Model Registry that supports batch prediction on Vertex AI.
If your text model isn't in the Vertex AI Model Registry, AutoSxS also accepts pre-generated predictions stored as a JSON Lines file in Cloud Storage or in a BigQuery table. For pricing, see Text generation.
Supported tasks and criteria
AutoSxS supports evaluating models for summarization and question-answering tasks. The evaluation criteria are predefined for each task to make language evaluation more objective and to improve response quality.
The following sections list the criteria for each task.
Summarization
The summarization task has a 4,096 input token limit. The evaluation criteria for summarization are as follows:
Criteria | Description
--- | ---
1. Follows instructions | To what extent does the model's response demonstrate an understanding of the instruction from the prompt?
2. Grounded | Does the response include only information from the inference context and inference instruction?
3. Comprehensive | To what extent does the model capture key details in the summarization?
4. Brief | Is the summarization verbose? Does it include flowery language? Is it overly terse?
Question answering
The question_answering task has a 4,096 input token limit. The evaluation criteria for question_answering are as follows:
Criteria | Description
--- | ---
1. Fully answers the question | Does the response answer the question completely?
2. Grounded | Does the response include only information from the inference context and inference instruction?
3. Relevance | Does the content of the answer relate to the question?
4. Comprehensive | To what extent does the model capture key details in the question?
Prepare the evaluation dataset
This section details the data to provide in your AutoSxS evaluation dataset and best practices for dataset construction. The examples should mirror real-world inputs that your models might encounter in production and effectively contrast the behavior of your models.
AutoSxS accepts a single evaluation dataset with a flexible schema. You can provide the dataset as a BigQuery table or as a JSON Lines file in Cloud Storage.
Each row of the evaluation dataset represents a single example, and the columns fall into one of the following categories:
- ID columns: Used to identify each unique example.
- Data columns: Used to fill out prompt templates. See Prompt parameters.
- Pre-generated predictions: Predictions made by the same model using the same prompt. Using pre-generated predictions saves time and resources.
- Ground-truth human preferences: Used to benchmark AutoSxS against your ground-truth preference data when pre-generated predictions are provided for both models.
The following is an example evaluation dataset where context and question are data columns, and model_b_response contains pre-generated predictions.
context | question | model_b_response
--- | --- | ---
Some might think that steel is the hardest material or titanium, but diamond is actually the hardest material. | What is the hardest material? | Diamond is the hardest material. It is harder than steel or titanium.
For more information about how to call AutoSxS, see Perform model evaluation. For details about token length, see Supported tasks and criteria. To upload your data to Cloud Storage, see Upload evaluation dataset to Cloud Storage.
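A minimal sketch of uploading a local JSON Lines evaluation dataset to Cloud Storage with the google-cloud-storage client follows. The project ID, bucket name, and object path are hypothetical placeholders.

# A minimal sketch: upload a local JSONL evaluation dataset to Cloud Storage.
# The project, bucket, and object names below are hypothetical placeholders.
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.bucket("my-evaluation-bucket")
blob = bucket.blob("autosxs/evaluation_dataset.jsonl")

# Upload the local file; the resulting gs:// URI can be used as evaluation_dataset.
blob.upload_from_filename("evaluation_dataset.jsonl")
print(f"gs://{bucket.name}/{blob.name}")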
Prompt parameters
Many language models take prompt parameters as inputs instead of a single prompt
string. For example,
chat-bison
takes
several prompt parameters (messages, examples, context), which make up pieces
of the prompt. However, text-bison
has only one prompt parameter, prompt, which contains the entire
prompt.
This section outlines how you can flexibly specify model prompt parameters at inference and evaluation time. AutoSxS lets you call language models with varying inputs by using templated prompt parameters.
Inference
If any of the models don't have pre-generated predictions, AutoSxS uses
Vertex AI batch prediction to generate responses. You must specify each model's prompt parameters.
In AutoSxS, you can provide a single column in the evaluation dataset as a prompt parameter.
{'some_parameter': {'column': 'my_column'}}
Alternatively, you can define templates that use columns from the evaluation dataset as variables to specify prompt parameters:
{'some_parameter': {'template': 'Summarize the following: {{ my_column }}.'}}
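AutoSxS resolves these templates for you at runtime. The following illustrative sketch only approximates how a {{ my_column }} placeholder is filled from one dataset row; the row contents are hypothetical.

# Illustrative only: AutoSxS resolves prompt templates internally. This sketch
# approximates how a '{{ my_column }}' placeholder is filled from a dataset row.
import re

def render_template(template: str, row: dict) -> str:
    # Replace each '{{ column_name }}' placeholder with the row's value.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(row[m.group(1)]), template)

row = {"my_column": "The report covers Q3 revenue and churn"}  # hypothetical row
print(render_template("Summarize the following: {{ my_column }}.", row))
# -> Summarize the following: The report covers Q3 revenue and churn.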
When providing model prompt parameters for inference, you can use the protected default_instruction keyword as a template argument, which is replaced with the default inference instruction for the given task:
model_prompt_parameters = {
'prompt': {'template': '{{ default_instruction }}: {{ context }}'},
}
If you are generating predictions, provide model prompt parameters and an output column.
See the following examples:
Gemini
For Gemini models, the keys for model prompt parameters are contents (required) and system_instruction (optional), which align with the Gemini request body schema.
model_a_prompt_parameters={
'contents': {
'column': 'context'
},
'system_instruction': {'template': '{{ default_instruction }}'},
},
text-bison
For example, text-bison uses "prompt" for input and "content" for output. Follow these steps:
- Identify the inputs and outputs needed by the models being evaluated.
- Define the inputs as model prompt parameters.
- Pass the output to the response column.
model_a_prompt_parameters={
    'prompt': {
        'template': 'Answer the following question from the point of view of a college professor: {{ context }}\n{{ question }}',
    },
},
response_column_a='content', # Column in Model A response.
response_column_b='model_b_response', # Column in eval dataset.
Evaluation
Just as you must provide prompt parameters for inference, you must also provide
prompt parameters for evaluation. The autorater requires the
following prompt parameters:
Autorater prompt parameter | Configurable by user? | Description | Example
--- | --- | --- | ---
Autorater instruction | No | A calibrated instruction describing the criteria the autorater should use to judge the given responses. | Pick the response that answers the question and best follows instructions.
Inference instruction | Yes | A description of the task each candidate model should perform. | Answer the question accurately: Which is the hardest material?
Inference context | Yes | Additional context for the task being performed. | While titanium and diamond are both harder than copper, diamond has a hardness rating of 98 while titanium has a rating of 36. A higher rating means higher hardness.
Responses | No¹ | A pair of responses to evaluate, one from each candidate model. | Diamond
¹You can only configure the prompt parameter through pre-generated responses.
The following is sample code using the parameters:
autorater_prompt_parameters={
'inference_instruction': {
'template': 'Answer the following question from the point of view of a college professor: {{ question }}.'
},
'inference_context': {
'column': 'context'
}
}
The inference instructions and context for Model A and Model B can be formatted
differently, even if they contain the same information. The autorater accepts a single, separate inference instruction and context for its evaluation.
Example evaluation dataset
This section provides an example of a question-answer task evaluation dataset,
including pre-generated predictions for model B. In this example, AutoSxS
performs inference only for model A. This example provides an id
column to differentiate
between examples with the same question and context.
{
"id": 1,
"question": "What is the hardest material?",
"context": "Some might think that steel is the hardest material, or even titanium. However, diamond is actually the hardest material.",
"model_b_response": "Diamond is the hardest material. It is harder than steel or titanium."
}
{
"id": 2,
"question": "What is the highest mountain in the world?",
"context": "K2 and Everest are the two tallest mountains, with K2 being just over 28k feet and Everest being 29k feet tall.",
"model_b_response": "Mount Everest is the tallest mountain, with a height of 29k feet."
}
{
"id": 3,
"question": "Who directed The Godfather?",
"context": "Mario Puzo and Francis Ford Coppola co-wrote the screenplay for The Godfather, and the latter directed it as well.",
"model_b_response": "Francis Ford Coppola directed The Godfather."
}
{
"id": 4,
"question": "Who directed The Godfather?",
"context": "Mario Puzo and Francis Ford Coppola co-wrote the screenplay for The Godfather, and the latter directed it as well.",
"model_b_response": "John Smith."
}
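If you prefer to build the dataset programmatically, a minimal sketch of writing such records to a JSON Lines file with the standard library follows; the output filename is a hypothetical placeholder.

# A minimal sketch: write evaluation examples to a JSON Lines file.
# The filename is a hypothetical placeholder.
import json

examples = [
    {
        "id": 1,
        "question": "What is the hardest material?",
        "context": "Some might think that steel is the hardest material, or even titanium. However, diamond is actually the hardest material.",
        "model_b_response": "Diamond is the hardest material. It is harder than steel or titanium.",
    },
    # ... remaining examples ...
]

with open("evaluation_dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")  # one JSON object per line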
Best practices
Follow these best practices when you define your evaluation dataset:
- Provide examples that represent the types of inputs that your models process
in production.
- Your dataset must include at least one evaluation example. For high-quality aggregate metrics, a dataset of around 100 examples is recommended. The rate of
aggregate-metric quality improvements tends to decrease when more than 400
examples are provided.
- For a guide to writing prompts, see Design text
prompts.
- If you're using pre-generated predictions for either model, include the pre-generated predictions in a column of your evaluation dataset. Providing pre-generated predictions lets you compare the output of models that aren't in the Vertex AI Model Registry and lets you reuse responses.
Perform model evaluation
You can evaluate models by using the REST API, the Vertex AI SDK for Python, or the Google Cloud console. The following table compares these methods:
Method | Description | Use case
--- | --- | ---
REST API | Provides a language-agnostic interface for running evaluation jobs through direct HTTP requests. | Integrating evaluation into non-Python environments or custom workflows where maximum control is needed.
Vertex AI SDK for Python | Offers a high-level Pythonic interface that simplifies interaction with the evaluation service. | Automating evaluation pipelines, integrating with MLOps workflows, and running experiments from notebooks for data scientists and ML engineers.
Google Cloud console | A web-based graphical user interface that guides you through the evaluation setup process. | Quick, one-off evaluations, exploring the service's capabilities, or for users who prefer a visual, code-free approach.
Permissions required for this task
To perform this task, you must grant Identity and Access Management (IAM) roles to each of the following service accounts:
Service account | Default principal | Description | Roles
--- | --- | --- | ---
Vertex AI Service Agent | service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com | The Vertex AI Service Agent is automatically provisioned for your project and granted a predefined role. However, if an org policy modifies the default permissions of the Vertex AI Service Agent, you must manually grant the role to the service agent. | Vertex AI Service Agent (roles/aiplatform.serviceAgent)
Vertex AI Pipelines Service Account | PROJECT_NUMBER-compute@developer.gserviceaccount.com | The service account that runs the pipeline. The default service account used is the Compute Engine default service account. Optionally, you can use a custom service account instead of the default service account. |
Depending on your input and output data sources, you may also need to grant the Vertex AI Pipelines Service Account additional roles.
Specify the path to your model as either a fully-qualified model resource name or a publisher model resource name, as described for the MODEL_A and MODEL_B parameters.
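The following illustrative snippet shows both path formats. The publisher model path matches the example used elsewhere in this guide; the registry model path (project, location, model ID, and version) is a hypothetical placeholder.

# Two ways to reference a model, per the MODEL_A / MODEL_B parameter formats.
# Publisher (foundation) model, as used elsewhere in this guide:
model_a = 'publishers/google/models/text-bison@002'
# Model in the Vertex AI Model Registry (identifiers below are hypothetical):
model_b = 'projects/my-project/locations/us-central1/models/1234567890@1'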
REST
To create a model evaluation job, send a POST
request by using the
pipelineJobs method.
Before using any of the request data, make the following replacements:
- PIPELINEJOB_DISPLAYNAME: Display name for the pipelineJob.
- PROJECT_ID: Google Cloud project that runs the pipeline components.
- LOCATION: Region to run the pipeline components. us-central1 is supported.
- OUTPUT_DIR: Cloud Storage URI to store evaluation output.
- EVALUATION_DATASET: BigQuery table or a comma-separated list of Cloud Storage paths to a JSONL dataset containing evaluation examples.
- TASK: Evaluation task, which can be one of [summarization, question_answering].
- ID_COLUMNS: Columns that distinguish unique evaluation examples.
- AUTORATER_PROMPT_PARAMETERS: Autorater prompt parameters mapped to columns or templates. The expected parameters are inference_instruction (details on how to perform a task) and inference_context (content to reference to perform the task). For example, {'inference_context': {'column': 'my_prompt'}} uses the evaluation dataset's my_prompt column for the autorater's context.
- RESPONSE_COLUMN_A: Either the name of a column in the evaluation dataset containing predefined predictions, or the name of the column in the Model A output containing predictions. If no value is provided, AutoSxS attempts to infer the correct model output column name.
- RESPONSE_COLUMN_B: Either the name of a column in the evaluation dataset containing predefined predictions, or the name of the column in the Model B output containing predictions. If no value is provided, AutoSxS attempts to infer the correct model output column name.
- MODEL_A (Optional): A fully-qualified model resource name (projects/{project}/locations/{location}/models/{model}@{version}) or publisher model resource name (publishers/{publisher}/models/{model}). If Model A responses are specified, don't provide this parameter.
- MODEL_B (Optional): A fully-qualified model resource name (projects/{project}/locations/{location}/models/{model}@{version}) or publisher model resource name (publishers/{publisher}/models/{model}). If Model B responses are specified, don't provide this parameter.
- MODEL_A_PROMPT_PARAMETERS (Optional): Model A's prompt template parameters mapped to columns or templates. If Model A responses are predefined, don't provide this parameter. Example: {'prompt': {'column': 'my_prompt'}} uses the evaluation dataset's my_prompt column for the prompt parameter named prompt.
- MODEL_B_PROMPT_PARAMETERS (Optional): Model B's prompt template parameters mapped to columns or templates. If Model B responses are predefined, don't provide this parameter. Example: {'prompt': {'column': 'my_prompt'}} uses the evaluation dataset's my_prompt column for the prompt parameter named prompt.
- JUDGMENTS_FORMAT (Optional): The format to write judgments to. Can be jsonl (default), json, or bigquery.
- BIGQUERY_DESTINATION_PREFIX: BigQuery table to write judgments to if the specified format is bigquery.
Request JSON body
{
"displayName": "PIPELINEJOB_DISPLAYNAME",
"runtimeConfig": {
"gcsOutputDirectory": "gs://OUTPUT_DIR",
"parameterValues": {
"evaluation_dataset": "EVALUATION_DATASET",
"id_columns": ["ID_COLUMNS"],
"task": "TASK",
"autorater_prompt_parameters": AUTORATER_PROMPT_PARAMETERS,
"response_column_a": "RESPONSE_COLUMN_A",
"response_column_b": "RESPONSE_COLUMN_B",
"model_a": "MODEL_A",
"model_a_prompt_parameters": MODEL_A_PROMPT_PARAMETERS,
"model_b": "MODEL_B",
"model_b_prompt_parameters": MODEL_B_PROMPT_PARAMETERS,
"judgments_format": "JUDGMENTS_FORMAT",
"bigquery_destination_prefix":BIGQUERY_DESTINATION_PREFIX,
},
},
"templateUri": "https://us-kfp.pkg.dev/ml-pipeline/google-cloud-registry/autosxs-template/default"
}
To send your request, use curl.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs"
Response
"state": "PIPELINE_STATE_PENDING",
"labels": {
"vertex-ai-pipelines-run-billing-id": "1234567890123456789"
},
"runtimeConfig": {
"gcsOutputDirectory": "gs://my-evaluation-bucket/output",
"parameterValues": {
"evaluation_dataset": "gs://my-evaluation-bucket/output/data.json",
"id_columns": [
"context"
],
"task": "question_answering",
"autorater_prompt_parameters": {
"inference_instruction": {
"template": "Answer the following question: {{ question }} }."
},
"inference_context": {
"column": "context"
}
},
"response_column_a": "",
"response_column_b": "response_b",
"model_a": "publishers/google/models/text-bison@002",
"model_a_prompt_parameters": {
"prompt": {
"template": "Answer the following question from the point of view of a college professor: {{ question }}\n{{ context }} }"
}
},
"model_b": "",
"model_b_prompt_parameters": {}
}
},
"serviceAccount": "123456789012-compute@developer.gserviceaccount.com",
"templateUri": "https://us-kfp.pkg.dev/ml-pipeline/google-cloud-registry/autosxs-template/default",
"templateMetadata": {
"version": "sha256:7366b784205551ed28f2c076e841c0dbeec4111b6df16743fc5605daa2da8f8a"
}
}
Vertex AI SDK for Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API. For more information about pipeline parameters, see the Google Cloud Pipeline Components Reference Documentation.
Before using any of the request data, make the same replacements described in the REST section.
import os
from google.cloud import aiplatform
parameters = {
'evaluation_dataset': 'EVALUATION_DATASET',
'id_columns': ['ID_COLUMNS'],
'task': 'TASK',
'autorater_prompt_parameters': AUTORATER_PROMPT_PARAMETERS,
'response_column_a': 'RESPONSE_COLUMN_A',
'response_column_b': 'RESPONSE_COLUMN_B',
'model_a': 'MODEL_A',
'model_a_prompt_parameters': MODEL_A_PROMPT_PARAMETERS,
'model_b': 'MODEL_B',
'model_b_prompt_parameters': MODEL_B_PROMPT_PARAMETERS,
'judgments_format': 'JUDGMENTS_FORMAT',
    'bigquery_destination_prefix': 'BIGQUERY_DESTINATION_PREFIX',
}
aiplatform.init(project='PROJECT_ID', location='LOCATION', staging_bucket='gs://OUTPUT_DIR')
aiplatform.PipelineJob(
display_name='PIPELINEJOB_DISPLAYNAME',
pipeline_root=os.path.join('gs://OUTPUT_DIR', 'PIPELINEJOB_DISPLAYNAME'),
template_path=(
'https://us-kfp.pkg.dev/ml-pipeline/google-cloud-registry/autosxs-template/default'),
parameter_values=parameters,
).run()
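If you prefer to submit the job asynchronously and poll its status, a minimal sketch follows; it reuses the parameters dictionary defined in the preceding example.

# A minimal sketch, assuming the `parameters` dictionary defined above.
# submit() returns immediately; wait() blocks until the pipeline finishes.
job = aiplatform.PipelineJob(
    display_name='PIPELINEJOB_DISPLAYNAME',
    pipeline_root=os.path.join('gs://OUTPUT_DIR', 'PIPELINEJOB_DISPLAYNAME'),
    template_path=(
        'https://us-kfp.pkg.dev/ml-pipeline/google-cloud-registry/autosxs-template/default'),
    parameter_values=parameters,
)
job.submit()
job.wait()
print(job.state)  # For example, PipelineState.PIPELINE_STATE_SUCCEEDED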
Console
To create a pairwise model evaluation job by using the Google Cloud console, perform the following steps:
1. Start with a Google foundation model, or use a model that already exists in your Vertex AI Model Registry.
2. For each step on the evaluation creation page, enter the required information and click Continue:
   - For the Evaluation dataset step, select an evaluation objective and a model to compare against your selected model. Select an evaluation dataset and enter the ID columns (response columns).
   - For the Model settings step, specify whether to use the model responses already in your dataset or to use Vertex AI Batch Prediction to generate the responses. Specify the response columns for both models. For the Vertex AI Batch Prediction option, you can specify your inference model prompt parameters.
   - For the Autorater settings step, enter your autorater prompt parameters and an output location for the evaluations.
3. Click Start Evaluation.
View evaluation results
You can find the evaluation results in
Vertex AI Pipelines
by inspecting the following artifacts produced by the AutoSxS pipeline:
Judgments
AutoSxS outputs judgments (example-level metrics) that help you understand
model performance at the example level. Judgments include the following
information:
- Inference prompts
- Model responses
- Autorater decisions
- Rating explanations
- Confidence scores
You can write judgments to Cloud Storage in JSONL format or to a
BigQuery table with the following columns:
Column | Description
--- | ---
id columns | Columns that distinguish unique evaluation examples.
inference_instruction | Instruction used to generate model responses.
inference_context | Context used to generate model responses.
response_a | Model A's response, given the inference instruction and context.
response_b | Model B's response, given the inference instruction and context.
choice | The model with the better response. Possible values are Model A, Model B, or Error. Error means that an error prevented the autorater from determining whether Model A's response or Model B's response was best.
confidence | A score between 0 and 1 that signifies how confident the autorater was with its choice.
explanation | The autorater's reason for its choice.
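A minimal sketch of loading JSONL judgments from Cloud Storage for inspection with pandas follows. It assumes the gcsfs package is installed so that pandas can read gs:// URIs; the file path is a hypothetical placeholder.

# A minimal sketch of inspecting the judgments output. Assumes gcsfs is
# installed so pandas can read gs:// paths; the URI below is hypothetical.
import pandas as pd

judgments = pd.read_json(
    'gs://my-evaluation-bucket/output/judgments.jsonl', lines=True)

# Look at each example's winner, confidence, and explanation.
print(judgments[['choice', 'confidence', 'explanation']].head())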
Aggregate metrics
AutoSxS calculates aggregate (win-rate) metrics using the judgments
table. If you don't provide human-preference data, AutoSxS generates the following aggregate metrics:
Metric | Description
--- | ---
AutoRater model A win rate | Percentage of time the autorater decided model A had the better response.
AutoRater model B win rate | Percentage of time the autorater decided model B had the better response.
To better understand the win rate, look at the row-based results and the
autorater's explanations to determine if the results and explanations align
with your expectations.
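For example, the win rates can be recomputed directly from the judgments table. A minimal sketch, assuming the judgments DataFrame loaded in the earlier sketch:

# A minimal sketch, assuming the `judgments` DataFrame from the earlier example.
# The win rate is the share of examples where the autorater chose each model;
# examples with an Error choice are excluded here.
valid = judgments[judgments['choice'] != 'Error']
model_a_win_rate = (valid['choice'] == 'Model A').mean()
model_b_win_rate = (valid['choice'] == 'Model B').mean()
print(f'Model A win rate: {model_a_win_rate:.1%}')
print(f'Model B win rate: {model_b_win_rate:.1%}')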
Human-preference alignment metrics
If you provide human-preference data, AutoSxS outputs the following metrics:
Metric | Description
--- | ---
AutoRater model A win rate | Percentage of time the autorater decided model A had the better response.
AutoRater model B win rate | Percentage of time the autorater decided model B had the better response.
Human-preference model A win rate | Percentage of time humans decided model A had the better response.
Human-preference model B win rate | Percentage of time humans decided model B had the better response.
TP | Number of examples where both the autorater and human preferences were that Model A had the better response.
FP | Number of examples where the autorater chose Model A as the better response, but the human preference was that Model B had the better response.
TN | Number of examples where both the autorater and human preferences were that Model B had the better response.
FN | Number of examples where the autorater chose Model B as the better response, but the human preference was that Model A had the better response.
Accuracy | Percentage of time where the autorater agreed with human raters.
Precision | Percentage of time where both the autorater and humans thought Model A had a better response, out of all cases where the autorater thought Model A had a better response.
Recall | Percentage of time where both the autorater and humans thought Model A had a better response, out of all cases where humans thought Model A had a better response.
F1 | Harmonic mean of precision and recall.
Cohen's Kappa | A measurement of agreement between the autorater and human raters that takes the likelihood of random agreement into account. Cohen suggests the following interpretation:
Kappa | Interpretation
--- | ---
-1.0 to 0.0 | Agreement worse than or equivalent to random chance
0.0 to 0.2 | Slight agreement
0.2 to 0.4 | Fair agreement
0.4 to 0.6 | Moderate agreement
0.6 to 0.8 | Substantial agreement
0.8 to 1.0 | Nearly perfect agreement
1.0 | Perfect agreement
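The classification metrics above follow the standard definitions. A minimal sketch of computing them from paired autorater and human labels follows; it assumes both are encoded as 'Model A' or 'Model B' (an assumption, so check your judgments output for the exact label values), and the label lists are hypothetical.

# A minimal sketch of computing the alignment metrics from paired autorater and
# human labels. Assumes both are encoded as 'Model A' / 'Model B'; check your
# judgments output for the exact label values. The labels below are hypothetical.
from sklearn.metrics import cohen_kappa_score

autorater = ['Model A', 'Model A', 'Model B', 'Model B']
human     = ['Model A', 'Model B', 'Model B', 'Model A']

tp = sum(a == 'Model A' and h == 'Model A' for a, h in zip(autorater, human))
fp = sum(a == 'Model A' and h == 'Model B' for a, h in zip(autorater, human))
tn = sum(a == 'Model B' and h == 'Model B' for a, h in zip(autorater, human))
fn = sum(a == 'Model B' and h == 'Model A' for a, h in zip(autorater, human))

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
kappa = cohen_kappa_score(autorater, human)
print(accuracy, precision, recall, f1, kappa)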
AutoSxS use cases
You can explore how to use AutoSxS with the following three use case scenarios.
Compare models
Evaluate a tuned first-party (1p) model against a reference 1p model.
You can specify that inference runs on both models simultaneously.
This code sample evaluates a tuned model from the Vertex AI Model Registry against a reference model from the same registry.
# Evaluation dataset schema:
# my_question: str
# my_context: str
parameters = {
'evaluation_dataset': DATASET,
'id_columns': ['my_context'],
'task': 'question_answering',
'autorater_prompt_parameters': {
'inference_instruction': {'column': 'my_question'},
'inference_context': {'column': 'my_context'},
},
'model_a': 'publishers/google/models/text-bison@002',
    'model_a_prompt_parameters': {'prompt': {'template': '{{my_question}}\nCONTEXT: {{my_context}}'}},
'response_column_a': 'content',
'model_b': 'projects/abc/locations/abc/models/tuned_bison',
'model_b_prompt_parameters': {'prompt': {'template': '{{my_context}}\n{{my_question}}'}},
'response_column_b': 'content',
}
Compare predictions
Evaluate a tuned third-party (3p) model against a reference 3p model.
You can skip inference by directly supplying model responses.
This code sample evaluates a tuned 3p model against a reference 3p model.
# Evaluation dataset schema:
# my_question: str
# my_context: str
# response_b: str
parameters = {
'evaluation_dataset': DATASET,
'id_columns': ['my_context'],
'task': 'question_answering',
    'autorater_prompt_parameters': {
'inference_instruction': {'column': 'my_question'},
'inference_context': {'column': 'my_context'},
},
'response_column_a': 'content',
'response_column_b': 'response_b',
}
Check alignment
All supported tasks are benchmarked against human-rater data to align autorater responses with human preferences. If you want to benchmark AutoSxS for your use cases, provide human-preference data directly to AutoSxS, which then outputs alignment-aggregate statistics.
To check alignment against a human-preference dataset, provide pre-generated predictions for both models to the autorater, along with the corresponding human-preference data.
This code sample verifies that the autorater's results and explanations
align with your expectations.
# Evaluation dataset schema:
# my_question: str
# my_context: str
# response_a: str
# response_b: str
# actual: str
parameters = {
'evaluation_dataset': DATASET,
'id_columns': ['my_context'],
'task': 'question_answering',
'autorater_prompt_parameters': {
'inference_instruction': {'column': 'my_question'},
'inference_context': {'column': 'my_context'},
},
'response_column_a': 'response_a',
'response_column_b': 'response_b',
'human_preference_column': 'actual',
}
What's next