API documentation for the evaluation package.
Classes
CustomMetric
The custom evaluation metric.
The evaluation function must take a dataset row (instance) as the `metric_function` input and return the per-instance metric result as a dictionary, with the metric score mapped to `CustomMetric.name` as the key.
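A minimal sketch of a custom metric. It assumes `CustomMetric` can be imported from `vertexai.preview.evaluation` and that its constructor takes the same `name` and `metric_function` parameters as `make_metric` below; the row keys and scoring logic are illustrative.
```
from vertexai.preview.evaluation import CustomMetric

# Per-instance metric: 1.0 if the reference string appears in the response.
# The instance is a single dataset row passed in as a dict, and the returned
# dict must use the metric name as its key.
def contains_reference(instance: dict) -> dict:
    score = 1.0 if instance["reference"] in instance["response"] else 0.0
    return {"contains_reference": score}

contains_reference_metric = CustomMetric(
    name="contains_reference",
    metric_function=contains_reference,
)
# The metric can then be passed to EvalTask(metrics=[contains_reference_metric, ...]).
```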
EvalResult
Evaluation result.
EvalTask
A class representing an EvalTask.
An evaluation task is defined to measure a model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.
Dataset Details:
Default dataset column names:
* content_column_name: "content"
* reference_column_name: "reference"
* response_column_name: "response"
Requirements for different use cases:
* Bring your own prediction: A `response` column is required. The response
  column name can be customized by providing the `response_column_name`
  parameter (see the sketch after this list).
* Without prompt template: A column representing the input prompt to the
  model is required. If `content_column_name` is not specified, the
  eval dataset requires a `content` column by default. Any existing
  response column is ignored; new responses are generated from the
  content column and used for evaluation.
* With prompt template: The dataset must contain columns corresponding to
  the placeholder names in the prompt template. For example, if the prompt
  template is "Instruction: {instruction}, context: {context}", the
  dataset must contain `instruction` and `context` columns.
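For example, a minimal bring-your-own-prediction sketch with a custom response column name. This assumes `response_column_name` is accepted by the `EvalTask` constructor, as the list above implies; the column names and metric are illustrative.
```
import pandas as pd
from vertexai.preview.evaluation import EvalTask

eval_dataset = pd.DataFrame({
    "reference": [...],
    "model_output": [...],  # pre-computed model responses under a custom column name
})
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["rouge_l_sum"],
    # Assumption: response_column_name is a constructor parameter, per the list above.
    response_column_name="model_output",
)
```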
Metrics Details:
The supported metrics, metric bundle descriptions, grading rubrics, and
the required input fields can be found on the Vertex AI public
documentation page [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).
Usage:
1. To perform bring-your-own-prediction (BYOP) evaluation, provide the model
responses in the response column of the dataset. The response column name
is "response" by default; specify the `response_column_name` parameter to
customize it.
```
import pandas as pd
from vertexai.preview.evaluation import EvalTask

eval_dataset = pd.DataFrame({
    "reference": [...],
    "response": [...],
})
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["bleu", "rouge_l_sum", "coherence", "fluency"],
    experiment="my-experiment",
)
eval_result = eval_task.evaluate(
    experiment_run_name="eval-experiment-run"
)
```
2. To perform evaluation with built-in Gemini model inference, specify the
`model` parameter with a GenerativeModel instance. The default input
column for the model is `content`.
```
from vertexai.generative_models import GenerativeModel

eval_dataset = pd.DataFrame({
    "reference": [...],
    "content": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "exact_match", "bleu", "rouge_1", "rouge_2", "rouge_l_sum"
    ],
    experiment="my-experiment",
).evaluate(
    model=GenerativeModel("gemini-pro"),
    experiment_run_name="gemini-pro-eval-run",
)
```
3. If a `prompt_template` is specified, the `content` column is not required.
Prompts can be assembled from the evaluation dataset, and all placeholder
names must be present in the dataset columns.
```
eval_dataset = pd.DataFrame({
    "context": [...],
    "instruction": [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality"],
).evaluate(
    model=model,
    prompt_template="{instruction}. Article: {context}. Summary:",
)
```
4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom prediction function. The `content` column in the
dataset is used to generate predictions with the custom model function for
evaluation.
```
from openai import OpenAI

# `client` is an OpenAI client; an OpenAI API key is assumed to be configured
# in the environment.
client = OpenAI()

def custom_model_fn(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": input}
        ],
    )
    return response.choices[0].message.content

eval_dataset = pd.DataFrame({
    "content": [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["text_generation_similarity", "text_generation_quality"],
    experiment="my-experiment",
).evaluate(
    model=custom_model_fn,
    experiment_run_name="gpt-eval-run",
)
```
PairwiseMetric
The Side-by-Side (SxS) Pairwise Metric.
A model-based evaluation metric that compares two generative models side-by-side, and allows users to A/B test their generative models to determine which model is performing better on the given evaluation task.
For more details on when to use pairwise metrics, see Evaluation methods and metrics.
Result Details:
* In `EvalResult.summary_metrics`, win rates for both the baseline and
  candidate model are computed, showing the rate at which each model performs
  better on the given task. The win rate is computed as the number of times
  the candidate model performs better than the baseline model divided by the
  total number of examples. The win rate is a number between 0 and 1.
* In `EvalResult.metrics_table`, a pairwise metric produces three
  evaluation results for each row in the dataset:
    * `pairwise_choice`: an enumeration that indicates whether the candidate
      or baseline model performs better.
    * `explanation`: the AutoRater's rationale behind each verdict, produced
      using chain-of-thought reasoning. These explanations help users
      scrutinize the AutoRater's judgment and build appropriate trust in
      its decisions.
    * `confidence`: a score between 0 and 1 that signifies how confident
      the AutoRater was in its verdict. A score closer to 1 means higher
      confidence.
See [documentation page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#understand-results)
for more details on understanding the metric results.
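For example, a hedged sketch of reading these results, assuming `pairwise_results` is the `EvalResult` returned by `eval_task.evaluate()` in the usage example below; the exact metric keys and column names depend on the metric used, so they are printed rather than hard-coded.
```
# Aggregate statistics, including the baseline and candidate win rates.
print(pairwise_results.summary_metrics)

# Per-row results; look for the pairwise_choice, explanation, and
# confidence columns produced for the chosen metric.
print(pairwise_results.metrics_table.columns)
print(pairwise_results.metrics_table.head())
```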
Usage:
```
import pandas as pd
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask, PairwiseMetric

baseline_model = GenerativeModel("gemini-1.0-pro")
candidate_model = GenerativeModel("gemini-1.5-pro")

pairwise_summarization_quality = PairwiseMetric(
    metric="summarization_quality",
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=pd.DataFrame({
        "instruction": [...],
        "context": [...],
    }),
    metrics=[pairwise_summarization_quality],
)
pairwise_results = eval_task.evaluate(
    prompt_template="instruction: {instruction}. context: {context}",
    model=candidate_model,
)
```
PromptTemplate
A prompt template for creating prompts with placeholders.
The `PromptTemplate` class allows users to define a template string with placeholders represented in curly braces, such as `{placeholder}`. Placeholder names cannot contain spaces. These placeholders can be replaced with specific values using the `assemble` method, providing flexibility in generating dynamic prompts.
Usage:
```
from vertexai.preview.evaluation import PromptTemplate

template_str = "Hello, {name}! Today is {day}. How are you?"
prompt_template = PromptTemplate(template_str)
completed_prompt = prompt_template.assemble(name="John", day="Monday")
print(completed_prompt)  # "Hello, John! Today is Monday. How are you?"
```
Functions
make_metric
```
make_metric(
    name: str,
    metric_function: typing.Callable[
        [typing.Dict[str, typing.Any]], typing.Dict[str, typing.Any]
    ],
) -> vertexai.preview.evaluation.metrics._base.CustomMetric
```
Makes a custom metric.
Parameters

| Name | Description |
|---|---|
| `name` | The name of the metric. |
| `metric_function` | The evaluation function. It must take a dataset row (instance) as input and return the per-instance metric result as a dictionary, with the metric score mapped to `CustomMetric.name` as the key. |
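A minimal sketch of defining a metric with `make_metric`, following the signature above. The import path is assumed from this package's listing, and the column names and scoring logic are illustrative.
```
from vertexai.preview.evaluation import make_metric

# The metric function receives one dataset row as a dict and must return a
# dict keyed by the metric name.
def case_insensitive_match_fn(instance: dict) -> dict:
    match = instance["response"].strip().lower() == instance["reference"].strip().lower()
    return {"case_insensitive_match": 1.0 if match else 0.0}

case_insensitive_match = make_metric(
    name="case_insensitive_match",
    metric_function=case_insensitive_match_fn,
)
# The returned CustomMetric can be included in EvalTask(metrics=[...]).
```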