EvalTask(
*,
dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
metrics: typing.List[
typing.Union[
typing.Literal[
"exact_match",
"bleu",
"rouge_1",
"rouge_2",
"rouge_l",
"rouge_l_sum",
"coherence",
"fluency",
"safety",
"groundedness",
"fulfillment",
"summarization_quality",
"summarization_helpfulness",
"summarization_verbosity",
"question_answering_quality",
"question_answering_relevance",
"question_answering_helpfulness",
"question_answering_correctness",
"text_generation_similarity",
"text_generation_quality",
"text_generation_instruction_following",
"text_generation_safety",
"text_generation_factuality",
"summarization_pointwise_reference_free",
"qa_pointwise_reference_free",
"qa_pointwise_reference_based",
"tool_call_quality",
],
vertexai.preview.evaluation.metrics._base.CustomMetric,
]
],
experiment: typing.Optional[str] = None,
content_column_name: str = "content",
reference_column_name: str = "reference",
response_column_name: str = "response"
)
A class representing an EvalTask.
An Evaluation Task is defined to measure the model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.
Dataset details: Default dataset column names:
- content_column_name: "content"
- reference_column_name: "reference"
- response_column_name: "response"
Requirements for different use cases:
- Bring your own prediction: A response column is required. The response column name can be customized by providing the response_column_name parameter.
- Without prompt template: A column representing the input prompt to the model is required. If content_column_name is not specified, the eval dataset requires a content column by default. The response column is not used if present; new responses are generated from the content column and used for evaluation.
- With prompt template: The dataset must contain column names corresponding to the placeholder names in the prompt template. For example, if the prompt template is "Instruction: {instruction}, context: {context}", the dataset must contain instruction and context columns.
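The prompt-template requirement above can be checked before running an evaluation. The following is a minimal sketch using only the Python standard library; the helper name missing_template_columns is hypothetical and is not part of the Vertex AI SDK, and the column list stands in for the dataset's column names.

```python
from string import Formatter
from typing import List


def missing_template_columns(prompt_template: str, columns: List[str]) -> List[str]:
    """Return placeholder names in the template that have no matching column.

    Hypothetical helper for illustration; not part of the Vertex AI SDK.
    """
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    # field_name is None for literal text that is not followed by a placeholder.
    placeholders = [
        field_name
        for _, field_name, _, _ in Formatter().parse(prompt_template)
        if field_name
    ]
    return [name for name in placeholders if name not in columns]


template = "Instruction: {instruction}, context: {context}"
print(missing_template_columns(template, ["instruction", "context", "reference"]))  # []
print(missing_template_columns(template, ["instruction", "reference"]))  # ['context']
```

With a pandas DataFrame, the column list would typically be list(eval_dataset.columns).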
Metrics Details: The supported metrics, metric bundle descriptions, grading rubrics, and the required input fields can be found on the Vertex AI public documentation.
Usage:
To perform bring your own prediction evaluation, provide the model responses in the response column of the dataset. The response column name is "response" by default; specify the response_column_name parameter to customize it.

    import pandas as pd
    from vertexai.preview.evaluation import EvalTask

    eval_dataset = pd.DataFrame({
        "reference": [...],
        "response" : [...],
    })
    eval_task = EvalTask(
        dataset=eval_dataset,
        metrics=["bleu", "rouge_l_sum", "coherence", "fluency"],
        experiment="my-experiment",
    )
    eval_result = eval_task.evaluate(
        experiment_run_name="eval-experiment-run"
    )
To perform evaluation with built-in Gemini model inference, specify the model parameter with a GenerativeModel instance. The default query column name for the model is content.

    from vertexai.generative_models import GenerativeModel

    eval_dataset = pd.DataFrame({
        "reference": [...],
        "content" : [...],
    })
    result = EvalTask(
        dataset=eval_dataset,
        metrics=["exact_match", "bleu", "rouge_1", "rouge_2", "rouge_l_sum"],
        experiment="my-experiment",
    ).evaluate(
        model=GenerativeModel("gemini-pro"),
        experiment_run_name="gemini-pro-eval-run"
    )
If a prompt_template is specified, the content column is not required. Prompts are assembled from the evaluation dataset, and all placeholder names must be present in the dataset columns.

    eval_dataset = pd.DataFrame({
        "context" : [...],
        "instruction": [...],
        "reference" : [...],
    })
    # `model` is a GenerativeModel instance or a custom prediction function.
    result = EvalTask(
        dataset=eval_dataset,
        metrics=["summarization_quality"],
    ).evaluate(
        model=model,
        prompt_template="{instruction}. Article: {context}. Summary:",
    )
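To illustrate how prompts can be assembled from dataset rows, here is a minimal sketch that fills a template once per row with str.format. It only illustrates the idea and is not the SDK's implementation; the function assemble_prompts and the sample row are hypothetical.

```python
from typing import Dict, List


def assemble_prompts(prompt_template: str, rows: List[Dict[str, str]]) -> List[str]:
    """Fill the template once per row (hypothetical helper, not the SDK's code)."""
    # str.format ignores extra keys such as "reference" and raises KeyError
    # if a placeholder has no matching key in the row.
    return [prompt_template.format(**row) for row in rows]


rows = [
    {
        "instruction": "Summarize the article",
        "context": "Cats sleep a lot",
        "reference": "Cats sleep.",
    }
]
prompts = assemble_prompts("{instruction}. Article: {context}. Summary:", rows)
print(prompts[0])  # Summarize the article. Article: Cats sleep a lot. Summary:
```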
To perform evaluation with custom model inference, specify the model parameter with a custom prediction function. The content column in the dataset is used to generate predictions with the custom model function for evaluation.

    # `client` is assumed to be an already-configured chat completions client.
    def custom_model_fn(input: str) -> str:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "user", "content": input}
            ]
        )
        return response.choices[0].message.content

    eval_dataset = pd.DataFrame({
        "content" : [...],
        "reference": [...],
    })
    result = EvalTask(
        dataset=eval_dataset,
        metrics=["text_generation_similarity", "text_generation_quality"],
        experiment="my-experiment",
    ).evaluate(
        model=custom_model_fn,
        experiment_run_name="gpt-eval-run"
    )
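A custom prediction function only needs to accept a string and return a string, matching the Callable[[str], str] type in the evaluate signature. The sketch below shows the idea of applying such a function to each value in the content column. It is an illustration under that assumption, not the SDK's internal logic; echo_model_fn and generate_responses are hypothetical names, and the stand-in function makes no network calls.

```python
from typing import Callable, List


def echo_model_fn(input: str) -> str:
    """Hypothetical stand-in for a real model call; returns the prompt upper-cased."""
    return input.upper()


def generate_responses(model_fn: Callable[[str], str], contents: List[str]) -> List[str]:
    """Call the prediction function once per content value (illustrative only)."""
    return [model_fn(content) for content in contents]


responses = generate_responses(echo_model_fn, ["hello", "world"])
print(responses)  # ['HELLO', 'WORLD']
```

A stand-in like this can help confirm that a dataset and function signature line up before switching to a real model call.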
Methods
EvalTask
EvalTask(
*,
dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
metrics: typing.List[
typing.Union[
typing.Literal[
"exact_match",
"bleu",
"rouge_1",
"rouge_2",
"rouge_l",
"rouge_l_sum",
"coherence",
"fluency",
"safety",
"groundedness",
"fulfillment",
"summarization_quality",
"summarization_helpfulness",
"summarization_verbosity",
"question_answering_quality",
"question_answering_relevance",
"question_answering_helpfulness",
"question_answering_correctness",
"text_generation_similarity",
"text_generation_quality",
"text_generation_instruction_following",
"text_generation_safety",
"text_generation_factuality",
"summarization_pointwise_reference_free",
"qa_pointwise_reference_free",
"qa_pointwise_reference_based",
"tool_call_quality",
],
vertexai.preview.evaluation.metrics._base.CustomMetric,
]
],
experiment: typing.Optional[str] = None,
content_column_name: str = "content",
reference_column_name: str = "reference",
response_column_name: str = "response"
)
Initializes an EvalTask.
display_runs
display_runs()
Displays experiment runs associated with this EvalTask.
evaluate
evaluate(
*,
model: typing.Optional[
typing.Union[
vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
]
] = None,
prompt_template: typing.Optional[str] = None,
experiment_run_name: typing.Optional[str] = None,
response_column_name: str = "response"
) -> vertexai.preview.evaluation._base.EvalResult
Runs an evaluation for the EvalTask.