After you create an evaluation dataset, the next step is to define the metrics used to measure model performance. Generative AI models power applications across a wide range of tasks, and the Gen AI evaluation service uses a test-driven framework that turns evaluation from subjective ratings into objective, actionable results.
The following are core concepts related to evaluation metrics:
Rubrics: The criteria used to rate the response of an LLM or LLM-based application.
Metrics: A score that measures the model output against the rating rubrics.
The Gen AI evaluation service offers the following categories of metrics:
Rubric-based metrics: Incorporate LLMs into evaluation workflows.
Adaptive rubrics (recommended): Rubrics are dynamically generated for each prompt, and responses are evaluated with granular, explainable pass or fail feedback specific to that prompt.
Static rubrics: Rubrics are defined explicitly, and the same rubric applies to all prompts. Responses are evaluated with the same set of numerical scoring-based evaluators, producing a single numerical score (such as 1-5) per prompt. Use static rubrics when you need to evaluate a very specific dimension or apply the exact same rubric across all prompts.
Computation-based metrics: Evaluate responses with deterministic algorithms, usually against ground truth, producing a numerical score (such as 0.0-1.0) per prompt. Use computation-based metrics when ground truth is available and can be matched with a deterministic method.
Custom function metrics: Define your own metric through a Python function.
Rubric-based metrics
Rubric-based metrics incorporate large language models into the evaluation workflow to assess the quality of model responses. Rubric-based evaluations are suited to a variety of tasks, especially writing quality, safety, and instruction following, which are often difficult to evaluate with deterministic algorithms.
Adaptive rubrics
Adaptive rubrics function like unit tests for your models: the service dynamically generates a unique set of pass or fail tests for each individual prompt in your dataset. The rubrics keep the evaluation relevant to the requested task and aim to provide objective, explainable, and consistent results.
The following example shows how adaptive rubrics might be generated for a set of prompts:
Prompt | Adaptive rubrics |
---|---|
"Summarize the following article about the benefits of solar power in under 100 words…" | The response summarizes the provided article. The response covers the benefits of solar power. The response is under 100 words. |
"Write a short, friendly email inviting employees to the annual company picnic. Mention the date is September 15th and that vegetarian options will be available…" | The response is written as an email invitation. The response uses a short, friendly tone. The response mentions that the picnic is on September 15th. The response mentions that vegetarian options will be available. |
You can access adaptive rubrics through the SDK. We recommend starting with GENERAL_QUALITY as the default.
General quality metric
GENERAL_QUALITY generates a set of rubrics covering a variety of tasks, such as instruction following, formatting, tone, and style, depending on the input prompt. You can combine rubric generation and validation in a single call:
from vertexai import types
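# Assumes `client` is the Vertex AI client (for example, created with
# vertexai.Client(project=..., location=...)) and `eval_dataset` is the
# evaluation dataset built in the previous step.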
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.RubricMetric.GENERAL_QUALITY,
    ],
)
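To review the generated rubrics and their pass or fail verdicts, you can render the result in a notebook. This is a minimal sketch that assumes the result object exposes a show() helper in your SDK version:
# Render a summary report of the evaluation results in a notebook
# (assumes the result object provides a show() helper).
eval_result.show()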
You can generate rubrics separately (to review or re-use them across models and agents) before using them to evaluate model responses:
from vertexai import types
# Use GENERAL_QUALITY recipe to generate rubrics, and store them
# as a rubric group named "general_quality_rubrics".
data_with_rubrics = client.evals.generate_rubrics(
    src=eval_dataset_df,
    rubric_group_name="general_quality_rubrics",
    predefined_spec_name=types.RubricMetric.GENERAL_QUALITY,
)
# Specify the group of rubrics to use for the evaluation.
eval_result = client.evals.evaluate(
    dataset=data_with_rubrics,
    metrics=[types.RubricMetric.GENERAL_QUALITY(
        rubric_group_name="general_quality_rubrics",
    )],
)
You can also guide GENERAL_QUALITY with natural language guidelines to focus rubric generation on the criteria that are most important to you. The Gen AI evaluation service then generates rubrics covering both its default tasks and the guidelines you specify.
from vertexai import types
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.RubricMetric.GENERAL_QUALITY(
            metric_spec_parameters={
                "guidelines": "The response must maintain a professional tone and must not provide financial advice."
            }
        )
    ],
)
Targeted quality metrics
If you need to evaluate a more targeted aspect of model quality, you can use metrics that generate rubrics focused on a specific area. For example:
from vertexai import types
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.RubricMetric.TEXT_QUALITY,
        types.RubricMetric.INSTRUCTION_FOLLOWING,
    ],
)
The Gen AI evaluation service offers the following types of adaptive rubrics:
INSTRUCTION_FOLLOWING: Measures how well the response adheres to the specific constraints and instructions in the prompt.
TEXT_QUALITY: Focuses specifically on the linguistic quality of the response, assessing fluency, coherence, and grammar.
Multi-turn conversation
multi_turn_general_quality: Evaluates overall conversational quality in a multi-turn dialogue.
multi_turn_text_quality: Evaluates the text quality of the responses within a multi-turn dialogue.
Agent evaluation
final_response_reference_free: Evaluates the quality of an agent's final answer without needing a reference answer.
For more details about targeted adaptive rubrics, see Adaptive rubric details.
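The sketch below shows how these targeted metrics could be passed to an evaluation. It assumes the multi-turn and agent metrics are exposed on types.RubricMetric under uppercase names that mirror the identifiers above; verify the exact names and the expected multi-turn dataset format against your SDK version.
from vertexai import types

# Hypothetical call; assumes uppercase attributes mirroring the metric
# identifiers listed above and a dataset that contains conversation history.
eval_result = client.evals.evaluate(
    dataset=multi_turn_eval_dataset,  # assumed multi-turn dataset
    metrics=[
        types.RubricMetric.MULTI_TURN_GENERAL_QUALITY,
        types.RubricMetric.FINAL_RESPONSE_REFERENCE_FREE,
    ],
)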
Static rubrics
A static rubric applies a single, fixed set of scoring guidelines to every example in your dataset. This score-driven approach is useful when you need to measure performance against a consistent benchmark across all prompts.
For example, the following static rubric rates text quality on a 1-5 scale:
5: (Very good). Exceptionally clear, coherent, fluent, and concise. Fully adheres to instructions and stays grounded.
4: (Good). Well-written, coherent, and fluent. Mostly adheres to instructions and stays grounded. Minor room for improvement.
3: (Ok). Adequate writing with decent coherence and fluency. Partially fulfills instructions and may contain minor ungrounded information. Could be more concise.
2: (Bad). Poorly written, lacking coherence and fluency. Struggles to adhere to instructions and may include ungrounded information. Issues with conciseness.
1: (Very bad). Very poorly written, incoherent, and non-fluent. Fails to follow instructions and contains substantial ungrounded information. Severely lacking in conciseness.
The Gen AI evaluation service provides the following static rubric metrics:
GROUNDING: Checks for factuality and consistency against a provided source text (ground truth). This metric is crucial for RAG systems.
SAFETY: Assesses the model's response for violations of safety policies, such as hate speech or dangerous content.
You can also use metric prompt templates such as FLUENCY.
from vertexai import types
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.RubricMetric.SAFETY,
        types.RubricMetric.GROUNDING,
        types.RubricMetric.FLUENCY,
    ],
)
Customizing static rubrics
For highly specialized needs, you can create your own static rubric. This method offers maximum control but requires you to carefully design the evaluation prompt to ensure consistent and reliable results. We recommend using guidelines with GENERAL_QUALITY before customizing static rubrics.
from vertexai import types

# Define a custom metric to evaluate language simplicity
simplicity_metric = types.LLMMetric(
    name='language_simplicity',
    prompt_template=types.MetricPromptBuilder(
        instruction="Evaluate the story's simplicity for a 5-year-old.",
        criteria={
            "Vocabulary": "Uses simple words.",
            "Sentences": "Uses short sentences.",
        },
        rating_scores={
            "5": "Excellent: Very simple, ideal for a 5-year-old.",
            "4": "Good: Mostly simple, with minor complex parts.",
            "3": "Fair: Mix of simple and complex; may be challenging for a 5-year-old.",
            "2": "Poor: Largely too complex, with difficult words/sentences.",
            "1": "Very Poor: Very complex, unsuitable for a 5-year-old."
        }
    )
)

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        simplicity_metric
    ],
)
Computation-based metrics
Computation-based metrics use deterministic algorithms to score a model's response by comparing it to a reference answer. They require a ground truth in your dataset and are ideal for tasks where a "correct" answer is well-defined.
Recall-Oriented Understudy for Gisting Evaluation (rouge_l, rouge_1): Measures the overlap of n-grams (contiguous sequences of words) between the model's response and a reference text. It's commonly used for evaluating text summarization.
Bilingual Evaluation Understudy (bleu): Measures how similar a response is to a high-quality reference text by counting matching n-grams. It is the standard metric for translation quality but can also be used for other text generation tasks.
Exact Match (exact_match): Measures the percentage of responses that are identical to the reference answer. This is useful for fact-based question-answering or tasks where there is only one correct response.
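Because these metrics compare each response against a reference, your evaluation dataset must include a ground-truth column. The following is a minimal sketch, assuming the dataset is a pandas DataFrame with prompt, response, and reference columns (adjust the column names to match your dataset schema):
import pandas as pd

# Hypothetical evaluation dataset with a ground-truth "reference" column,
# as required by computation-based metrics.
eval_dataset = pd.DataFrame({
    "prompt": ["What is the capital of France?"],
    "response": ["The capital of France is Paris."],
    "reference": ["Paris"],
})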
from vertexai import types
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.Metric(name='bleu'),
        types.Metric(name='rouge_l'),
        types.Metric(name='exact_match')
    ],
)
Custom function metric
You can also implement custom evaluation logic by passing a custom Python function to the custom_function parameter. The Gen AI evaluation service executes this function for each row of your dataset.
from vertexai import types

# Define a custom function to check for the presence of a keyword
def contains_keyword(instance: dict) -> dict:
    keyword = "magic"
    response_text = instance.get("response", "")
    score = 1.0 if keyword in response_text.lower() else 0.0
    return {"score": score}

keyword_metric = types.Metric(
    name="keyword_check",
    custom_function=contains_keyword
)

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[keyword_metric]
)
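If your dataset includes a reference column, the same mechanism can express reference-aware checks. The following is a minimal sketch; it assumes each row's columns are passed through in the instance dictionary, so the available keys (here "response" and "reference") depend on your dataset schema:
# Hypothetical custom function that compares the response to the reference
# after normalizing case and surrounding whitespace; the "reference" key is
# assumed to be present because the dataset includes a reference column.
def normalized_match(instance: dict) -> dict:
    response = instance.get("response", "").strip().lower()
    reference = instance.get("reference", "").strip().lower()
    return {"score": 1.0 if response == reference else 0.0}

normalized_match_metric = types.Metric(
    name="normalized_match",
    custom_function=normalized_match,
)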