API documentation for the evaluation package.
Classes
CustomMetric
The custom evaluation metric.
The evaluation function. It must take the dataset row/instance as the metric_function input and return the per-instance metric result as a dictionary. The metric score must be mapped to the CustomMetric.name as the key.
EvalResult
Evaluation result.
EvalTask
A class representing an EvalTask.
An evaluation task is defined to measure the model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.
Dataset details:
Default dataset column names:
- content_column_name: "content"
- reference_column_name: "reference"
- response_column_name: "response"
Requirements for different use cases:
- Bring your own prediction: A response column is required. The response column name can be customized by providing the response_column_name parameter.
- Without prompt template: A column representing the input prompt to the model is required. If content_column_name is not specified, the eval dataset requires a content column by default. The response column is not used if present; new responses are generated by the model from the content column and used for evaluation.
- With prompt template: The dataset must contain column names corresponding to the placeholder names in the prompt template. For example, if the prompt template is "Instruction: {instruction}, context: {context}", the dataset must contain instruction and context columns.
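The column requirements above can be sketched as a small pre-flight check. This is only an illustration in plain Python, not part of the evaluation package: the helper names `placeholder_names` and `missing_columns` and the `bring_your_own_prediction` flag are hypothetical, and the metrics you choose may require further columns (such as a reference column) that this sketch does not check.

```python
from string import Formatter
from typing import List, Optional, Set


def placeholder_names(prompt_template: str) -> Set[str]:
    """Return the placeholder names found in a format-style template string."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text, so it is filtered out.
    return {field for _, field, _, _ in Formatter().parse(prompt_template) if field}


def missing_columns(
    columns: List[str],
    prompt_template: Optional[str] = None,
    bring_your_own_prediction: bool = False,  # hypothetical flag for this sketch
    content_column_name: str = "content",
    response_column_name: str = "response",
) -> Set[str]:
    """Return the required column names that are absent from `columns`."""
    if bring_your_own_prediction:
        # Responses are supplied by the user, so only the response column is needed.
        required = {response_column_name}
    elif prompt_template is not None:
        # Every placeholder in the template must be a dataset column.
        required = placeholder_names(prompt_template)
    else:
        # No template: the model is queried with the content column.
        required = {content_column_name}
    return required - set(columns)


template = "Instruction: {instruction}, context: {context}"
print(missing_columns(["instruction", "reference"], prompt_template=template))
```

Running a check like this before creating an evaluation task makes a missing column visible early, rather than during model inference.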
Metrics Details: The supported metrics, metric bundle descriptions, grading rubrics, and the required input fields can be found on the Vertex AI public documentation page Evaluation methods and metrics.
Usage:
To perform bring-your-own-prediction evaluation, provide the model responses in the response column of the dataset. The response column name is "response" by default; specify the response_column_name parameter to customize it.
```
eval_dataset = pd.DataFrame({
    "reference": [...],
    "response": [...],
})
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["bleu", "rouge_l_sum", "coherence", "fluency"],
    experiment="my-experiment",
)
eval_result = eval_task.evaluate(
    experiment_run_name="eval-experiment-run"
)
```
To perform evaluation with built-in Gemini model inference, specify the model parameter with a GenerativeModel instance. The default query column name to the model is content.
```
eval_dataset = pd.DataFrame({
    "reference": [...],
    "content": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "bleu", "rouge_1", "rouge_2", "rouge_l_sum"],
    experiment="my-experiment",
).evaluate(
    model=GenerativeModel("gemini-pro"),
    experiment_run_name="gemini-pro-eval-run"
)
```
If a prompt_template is specified, the content column is not required. Prompts can be assembled from the evaluation dataset, and all placeholder names must be present in the dataset columns.
```
eval_dataset = pd.DataFrame({
    "context": [...],
    "instruction": [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality"],
).evaluate(
    model=model,
    prompt_template="{instruction}. Article: {context}. Summary:",
)
```
To perform evaluation with custom model inference, specify the model parameter with a custom prediction function. The content column in the dataset is used to generate predictions with the custom model function for evaluation.
```
def custom_model_fn(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": input}
        ]
    )
    return response.choices[0].message.content

eval_dataset = pd.DataFrame({
    "content": [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["text_generation_similarity", "text_generation_quality"],
    experiment="my-experiment",
).evaluate(
    model=custom_model_fn,
    experiment_run_name="gpt-eval-run"
)
```
PairwiseMetric
The side-by-side (SxS) pairwise metric.
PromptTemplate
A prompt template for creating prompts with placeholders.
The PromptTemplate class allows users to define a template string with placeholders represented in curly braces {placeholder}. The placeholder names cannot contain spaces. These placeholders can be replaced with specific values using the assemble method, providing flexibility in generating dynamic prompts.
Example Usage:
```
template_str = "Hello, {name}! Today is {day}. How are you?"
prompt_template = PromptTemplate(template_str)
completed_prompt = prompt_template.assemble(name="John", day="Monday")
print(completed_prompt)
```
A set of placeholder names from the template string.
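To make the placeholder behaviour concrete, here is a minimal stand-in written in plain Python. It is a sketch under stated assumptions, not the library's PromptTemplate: the class name `SimplePromptTemplate` is hypothetical, its `assemble` returns a plain string, and raising an error on a missing value is a choice made for this illustration only.

```python
import re
from typing import Set


class SimplePromptTemplate:
    """Hypothetical stand-in that mimics placeholder substitution."""

    # A placeholder is a run of word characters in curly braces, so names
    # containing spaces are never matched.
    _PLACEHOLDER = re.compile(r"\{(\w+)\}")

    def __init__(self, template: str) -> None:
        self.template = template
        # The set of placeholder names found in the template string.
        self.placeholders: Set[str] = set(self._PLACEHOLDER.findall(template))

    def assemble(self, **kwargs: str) -> str:
        """Replace every placeholder with the matching keyword argument."""
        missing = self.placeholders - set(kwargs)
        if missing:
            # Assumption of this sketch: all placeholders must be filled.
            raise ValueError(f"Missing values for placeholders: {sorted(missing)}")
        return self._PLACEHOLDER.sub(lambda m: str(kwargs[m.group(1)]), self.template)


template = SimplePromptTemplate("Hello, {name}! Today is {day}. How are you?")
print(sorted(template.placeholders))
print(template.assemble(name="John", day="Monday"))
```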
Packages Functions
make_metric
make_metric(
name: str,
metric_function: typing.Callable[
[typing.Dict[str, typing.Any]], typing.Dict[str, typing.Any]
],
) -> vertexai.preview.evaluation.metrics._base.CustomMetric
Makes a custom metric.
Parameters | |
---|---|
Name | Description |
name | The name of the metric. |
metric_function | The evaluation function. It must take the dataset row/instance as the metric_function input and return the per-instance metric result as a dictionary. The metric score must be mapped to the CustomMetric.name as the key. |
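As an illustration of the metric_function contract described above, the following sketch defines a function that takes one dataset row as a dictionary and returns a dictionary keyed by the metric name. The metric name `response_word_count`, the function `word_count_metric`, and the sample rows are hypothetical; the block runs in plain Python and only mentions `make_metric` in a comment, so it does not require the SDK.

```python
from typing import Any, Dict, List

# Hypothetical metric name; the returned dictionary must use it as the key.
METRIC_NAME = "response_word_count"


def word_count_metric(instance: Dict[str, Any]) -> Dict[str, Any]:
    """Score one dataset row by counting the words in its response."""
    response = str(instance.get("response", ""))
    return {METRIC_NAME: len(response.split())}


# With the SDK installed, the function could be wrapped following the
# signature shown above:
#   make_metric(name=METRIC_NAME, metric_function=word_count_metric)

# Illustrative rows standing in for dataset instances.
rows: List[Dict[str, Any]] = [
    {"content": "Summarize the article.", "response": "a short answer"},
    {"content": "Translate 'bonjour'.", "response": "hello"},
]
scores = [word_count_metric(row)[METRIC_NAME] for row in rows]
print(scores)
```

Keeping the metric name in a single constant avoids a mismatch between the name passed to make_metric and the key returned by the function.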