Metric prompt templates for model-based evaluation

This page provides a list of templates you can use for model-based evaluation using the Gen AI Evaluation Service. For more information about model-based metrics, see Define your own metrics.

Overview

For model-based evaluation, we send a prompt to the judge model to generate the metric score based on specified criteria, score rubrics, and other instructions.

The following table provides an overview of the available metric prompt template examples:

	Text use case	Multi-turn chat use case	Other key use cases
Pointwise	Fluency, Coherence, Groundedness, Safety, Instruction Following, Verbosity, Text Quality	Multi-turn Chat Quality, Multi-turn Safety	Summarization Quality, Question Answering Quality
Pairwise	Fluency, Coherence, Groundedness, Safety, Instruction Following, Verbosity, Text Quality	Multi-turn Chat Quality, Multi-turn Safety	Summarization Quality, Question Answering Quality

Structure a metric prompt template

A metric prompt template should include the following main sections:

Instruction
Evaluation
User inputs and AI-generated response

Each section may contain sub-sections.
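
The three-section layout can be sketched in plain Python. This is a minimal illustration that only builds a string: the helper `build_metric_prompt_template` and the constant names are hypothetical and are not part of the Gen AI Evaluation Service, and the section texts are shortened from the examples on this page.

```python
# Illustrative only: plain string assembly, not a Gen AI Evaluation Service API.
INSTRUCTION = (
    "You are an expert evaluator. Your task is to evaluate the quality of "
    "the responses generated by AI models."
)

# Sub-sections of the Evaluation section, keyed by heading (hypothetical layout).
EVALUATION = {
    "Metric Definition": (
        "You will be assessing summarization quality, which measures the "
        "overall ability to summarize text."
    ),
    "Criteria": (
        "Groundedness: The response contains information included only in "
        "the context."
    ),
    "Rating Rubric": (
        "5: (Very good). The summary is grounded.\n"
        "1: (Very bad). The summary is not grounded."
    ),
}

# Input variables stay as {placeholders} until the template is filled.
USER_INPUTS = (
    "## User Inputs\n### Prompt\n{prompt}\n\n"
    "## AI-generated Response\n{response}"
)


def build_metric_prompt_template(instruction, evaluation, user_inputs):
    """Join the three main sections into one template string."""
    evaluation_body = "\n\n".join(
        f"## {name}\n{text}" for name, text in evaluation.items()
    )
    return (
        f"# Instruction\n{instruction}\n\n"
        f"# Evaluation\n{evaluation_body}\n\n"
        f"# User Inputs and AI-generated Response\n{user_inputs}"
    )


template = build_metric_prompt_template(INSTRUCTION, EVALUATION, USER_INPUTS)
# Fill the input variables for one row of evaluation data.
filled = template.format(
    prompt="Summarize the article.", response="A short summary."
)
```

Because this sketch fills the template with `str.format`, any literal braces in the instruction or evaluation text would need to be doubled (`{{` and `}}`).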

Instruction

Component	Function	Type	Example
Instruction	Includes a persona for the judge model and a brief description of its task.	Default value	You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user input and AI-generated responses. You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below. You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

Evaluation

Component	Function	Type	Example
Metric definition	Specifies the name and definition of the metric.	Optional user inputs	`You will be assessing a metric called SummarizationQuality, which measures the overall ability to summarize text`
Criteria	Defines the criteria (and optionally, subcriteria) for the metric.	Required user inputs	`Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements. Groundedness: The response contains information included only in the context. The response does not reference any outside information.`
Rating rubric	Specifies the scoring scale for the metric with explanations about the meaning of each score.	Required user inputs	`5: (Very good). The summary follows instructions, is grounded, is concise, and fluent. 4: (Good). The summary follows instructions, is grounded, concise, and fluent. 3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent. 2: (Bad). The summary is grounded, but does not follow the instructions. 1: (Very bad). The summary is not grounded.`
Few-shot examples	Examples of the task.	Optional user inputs. Note: Not only can few-shot examples improve performance, but they can also improve formatting of the judge model response. We suggest starting with 5-10 few-shot examples.	`RESPONSE: Purple monkeys jumped onto the submarine while Beethoven's Fifth Symphony played loudly and the chef cooked spaghetti with meatballs. EXPLANATION: The provided response is a single sentence lacking any discernible structure or connections between the ideas presented. There's no logical flow to assess, no organization, and the juxtaposition of elements (monkeys, submarine, symphony, spaghetti) creates jarring incoherence. SCORE: 1`
Evaluation steps	Step-by-step instructions on how to carry out the task.	Optional user inputs. Note: You can specify rankings of criteria in the evaluation steps.	`STEP 1: Assess the response in aspects of instruction following, groundedness, helpfulness, and verbosity according to the criteria. STEP 2: Score based on the rubrics.`

User inputs

Component	Function	Type	Example
Input variables	The inputs users need to provide to complete the prompt for the judge model and get a response.	Required user inputs	`## User Inputs ### Prompt {prompt} ## AI-generated Response {response}`

If the column names in your data don't match the input variables and you don't want to rename the columns, you can provide a mapping:

Component	Function	Type	Example
Metric column mapping	A mapping from the input variables in the user prompt to user data.	Optional user inputs. Note: `prompt`, `response`, and `baseline_model_response` don't support mapping if `evaluate()` runs model inference.	`metric_column_mapping = {"reference":"ground_truth"}`
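
To make the direction of the mapping concrete, the following sketch resolves input variables against one row of data. It is an assumption-laden illustration of the idea, not the service's implementation: `resolve_input_variables` is a hypothetical helper, and the row contents are invented sample values. Only the mapping `{"reference": "ground_truth"}` comes from the table above.

```python
# Hypothetical helper: shows how a mapping from input variable to data column
# could be applied. The actual lookup performed by the service is not shown here.
def resolve_input_variables(row, input_variables, metric_column_mapping=None):
    """Return {input_variable: value}, using the mapping for renamed columns."""
    mapping = metric_column_mapping or {}
    resolved = {}
    for variable in input_variables:
        # Fall back to a column with the same name when no mapping is given.
        column = mapping.get(variable, variable)
        if column not in row:
            raise KeyError(
                f"Input variable '{variable}' expects column '{column}', "
                "which is missing from the data."
            )
        resolved[variable] = row[column]
    return resolved


# Sample row (invented values); the reference text lives in "ground_truth".
row = {
    "prompt": "Summarize the article.",
    "response": "A short summary.",
    "ground_truth": "The reference summary.",
}

values = resolve_input_variables(
    row,
    input_variables=["prompt", "response", "reference"],
    metric_column_mapping={"reference": "ground_truth"},
)
```

The keys of the mapping are the input variables used in the template, and the values are the column names in your data.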

Adapt a metric prompt template to your input data

To adapt a template for your specific data and evaluation criteria, follow these steps:

Identify the missing criteria: Determine which criteria are not adequately addressed by the existing template.
Add new criteria: Include the missing criteria in the prompt, clearly defining what you expect the model to consider.
Adjust user input fields: If your evaluation dataset has extra columns that you want to use for evaluation, add them to the user input fields and tell the judge model how to use them.
Update the rating rubric: Modify the rating rubric to reflect the new criteria and their relative importance.

For example, if you want to evaluate a summarization model based on how well the response summary aligns with a reference summary, you can add a new criterion called "reference alignment" and add the reference data as part of User Inputs:

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss of key information and without being too verbose or terse.
Fluency: The response is well-organized and easy to read.
Reference alignment: The response is consistent and aligned with the reference response.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, concise, fluent, and aligned with the reference summary.
4: (Good). The summary follows instructions, is grounded, concise, and fluent, but is not aligned with the reference summary.
3: (Ok). The summary mostly follows instructions and is grounded, but is not very concise, not fluent, and not aligned with the reference summary.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, fluency and reference alignment according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Reference
{reference}

### Prompt
{prompt}

## AI-generated Response
{response}
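
When you add an input variable such as `{reference}`, it can help to check that your dataset has a matching column before you run an evaluation. The following is a minimal sketch that uses only the Python standard library; the helper names `template_input_variables` and `missing_columns` are hypothetical, and the check assumes that placeholders follow Python `str.format` syntax.

```python
from string import Formatter

# The User Inputs section from the adapted template above.
USER_INPUTS_SECTION = """# User Inputs and AI-generated Response
## User Inputs
### Reference
{reference}

### Prompt
{prompt}

## AI-generated Response
{response}
"""


def template_input_variables(template):
    """Return placeholder names in the template, in order of first appearance."""
    names = []
    for _, field_name, _, _ in Formatter().parse(template):
        # field_name is None for literal text that has no placeholder.
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def missing_columns(template, columns):
    """Return input variables that have no matching column in the dataset."""
    return [
        name for name in template_input_variables(template)
        if name not in columns
    ]


# A dataset that has no "reference" column yet.
print(missing_columns(USER_INPUTS_SECTION, ["prompt", "response"]))
# ['reference']
```

A non-empty result means that you need to either add the column to the dataset or provide a column mapping for that variable.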

Provide few-shot examples to improve quality

Few-shot examples can significantly improve the quality and consistency of evaluation responses by guiding the model toward your chosen output formats and styles. We suggest starting with 5-10 few-shot examples.

To incorporate few-shot examples:

Identify relevant examples: Select examples that are similar to the type of input data you'll be evaluating.
Include examples in the prompt: Place the examples directly within the evaluation prompt, before the task or context.
Format examples: Ensure the examples follow the chosen output format and style.

For example, you can provide few-shot examples for the coherence metric and add the instruction to use the examples as follows:

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps as shown in the few-shot examples. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
...

## Criteria
...

## Rating Rubric
...

## Few-shot Examples
Response: Purple monkeys jumped onto the submarine while Beethoven's Fifth Symphony played loudly and the chef cooked spaghetti with meatballs.
Explanation: The provided response is a single sentence lacking any discernible structure or connections between the ideas presented. There's no logical flow to assess, no organization, and the juxtaposition of elements (monkeys, submarine, symphony, spaghetti) creates jarring incoherence.
Score: 1

Response: Learning a new language can be a rewarding experience for children, opening doors to different cultures and expanding their understanding of the world. There are many resources available to help children learn languages, from online courses and apps to language exchange programs and immersion schools.
Explanation: The response presents two related ideas: the benefits of learning a new language for children and the resources available to aid in that process. However, there is no clear transition or connection between these two distinct points. While both sentences are relevant to the topic of language acquisition in children, the relationship between them could be made more explicit.
Score: 3

Response: Although the internet has revolutionized communication and information sharing, it has also created echo chambers where individuals are only exposed to opinions and beliefs that align with their own. This polarization can lead to increased hostility and misunderstanding between different groups, making it difficult to find common ground on important issues. Consequently, fostering media literacy and critical thinking skills is essential for navigating the vast and often biased landscape of online information. By teaching individuals to evaluate sources, identify biases, and consider diverse perspectives, we can empower them to break free from echo chambers and engage in meaningful dialogue with those who hold differing views.
Explanation: The response exhibits a clear and logical flow of ideas. The transition words 'although' and 'consequently' effectively signal the relationship between the internet's advantages, its drawbacks (echo chambers), and the proposed solution (media literacy). The text maintains cohesion through consistent focus on the central theme of online polarization and its remedies.
Score: 5

## Evaluation Steps
...

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
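
If you keep few-shot examples as structured data, you can generate the Few-shot Examples section so that every example follows the same Response, Explanation, and Score format. The following is a minimal sketch: the `FewShotExample` class and the `format_few_shot_section` helper are hypothetical names, and the brace escaping assumes that the finished template is later filled with Python `str.format`.

```python
from dataclasses import dataclass


@dataclass
class FewShotExample:
    """One graded example for the judge model (hypothetical structure)."""
    response: str
    explanation: str
    score: int


def escape_braces(text):
    """Double literal braces so they are not read as input variables."""
    return text.replace("{", "{{").replace("}", "}}")


def format_few_shot_section(examples):
    """Render examples in the Response / Explanation / Score format."""
    blocks = [
        f"Response: {escape_braces(ex.response)}\n"
        f"Explanation: {escape_braces(ex.explanation)}\n"
        f"Score: {ex.score}"
        for ex in examples
    ]
    return "## Few-shot Examples\n" + "\n\n".join(blocks)


examples = [
    FewShotExample(
        response=(
            "Purple monkeys jumped onto the submarine while Beethoven's "
            "Fifth Symphony played loudly and the chef cooked spaghetti "
            "with meatballs."
        ),
        explanation=(
            "The provided response is a single sentence lacking any "
            "discernible structure or connections between the ideas presented."
        ),
        score=1,
    ),
]

few_shot_section = format_few_shot_section(examples)
```

Generating the section from data keeps the format of each example identical, which is the format you want the judge model to reproduce in its own output.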