Class EvalTask (1.66.0)

EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "tool_call_valid",
                "tool_name_match",
                "tool_parameter_key_match",
                "tool_parameter_kv_match",
            ],
            vertexai.evaluation.CustomMetric,
            vertexai.evaluation.metrics._base._AutomaticMetric,
            vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
            vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
    output_uri_prefix: typing.Optional[str] = ""
)

A class representing an EvalTask.

An evaluation task measures a model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.

Dataset Details:

Default dataset column names:
    * prompt_column_name: "prompt"
    * reference_column_name: "reference"
    * response_column_name: "response"
    * baseline_model_response_column_name: "baseline_model_response"

Requirements for different use cases:
  * Bring-your-own-response: A `response` column is required. The
      response column name can be customized with the
      `response_column_name` parameter. If a pairwise metric is used and
      a baseline model is not provided, a `baseline_model_response`
      column is also required. The baseline model response column name
      can be customized with the `baseline_model_response_column_name`
      parameter. If the `response` column or `baseline_model_response`
      column is present while the corresponding model is also specified,
      an error is raised.
  * Perform model inference without a prompt template: A `prompt` column
      is required in the evaluation dataset. Its values are passed
      directly to the model as input.
  * Perform model inference with a prompt template: The evaluation
      dataset must contain columns whose names match the variable names
      in the prompt template. For example, if the prompt template is
      "Instruction: {instruction}, context: {context}", the dataset must
      contain `instruction` and `context` columns.
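
The column requirements above can be checked before an evaluation is started. The following is a minimal sketch, not the library's own validation logic: the helper names `template_variables` and `missing_columns` are hypothetical, it assumes prompt templates use `str.format`-style `{name}` placeholders, and it covers only the three use cases listed above (pairwise baseline columns are not checked).

```python
from string import Formatter
from typing import Iterable, List, Optional


def template_variables(prompt_template: str) -> List[str]:
    """Return the placeholder names found in a str.format-style template, in order."""
    names: List[str] = []
    for _, field_name, _, _ in Formatter().parse(prompt_template):
        # field_name is None for trailing literal text and "" for a bare "{}".
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def missing_columns(
    columns: Iterable[str],
    prompt_template: Optional[str] = None,
    model_provided: bool = True,
) -> List[str]:
    """Return the required dataset columns that are absent from `columns`.

    Hypothetical helper mirroring the documented rules:
      * no model (bring-your-own-response): `response` is required
      * model, no prompt template: `prompt` is required
      * model and prompt template: every template variable is required
    """
    if not model_provided:
        required = ["response"]
    elif prompt_template is None:
        required = ["prompt"]
    else:
        required = template_variables(prompt_template)
    present = set(columns)
    return [name for name in required if name not in present]


template = "Instruction: {instruction}, context: {context}"
print(missing_columns(["instruction"], prompt_template=template))
```

Because the helper takes any iterable of column names, it can be called with `eval_dataset.columns` for a pandas DataFrame or with the keys of a dictionary dataset.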

Metrics Details:

Descriptions of the supported metrics, their rating rubrics, and their
required input variables are available in the Vertex AI public documentation:
[Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).

Usage Examples:

1. To perform bring-your-own-response (BYOR) evaluation, provide the model
responses in the `response` column of the dataset. If a pairwise metric is
used for BYOR evaluation, also provide the baseline model responses in the
`baseline_model_response` column.

  ```
  import pandas as pd
  from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

  # Responses are supplied in the dataset, so no model is passed to evaluate().
  eval_dataset = pd.DataFrame({
      "prompt": [...],
      "reference": [...],
      "response": [...],
      "baseline_model_response": [...],
  })
  eval_task = EvalTask(
      dataset=eval_dataset,
      metrics=[
          "bleu",
          "rouge_l_sum",
          MetricPromptTemplateExamples.Pointwise.FLUENCY,
          MetricPromptTemplateExamples.Pairwise.SAFETY,
      ],
      experiment="my-experiment",
  )
  eval_result = eval_task.evaluate(experiment_run_name="eval-experiment-run")
  ```

2. To perform evaluation with Gemini model inference, specify the `model`
parameter with a `GenerativeModel` instance. The model reads its input from
the `prompt` column, which must be present in the dataset.

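  ```
  import pandas as pd
  from vertexai.evaluation import EvalTask
  from vertexai.generative_models import GenerativeModel

  eval_dataset = pd.DataFrame({
      "reference": [...],
      "prompt": [...],
  })
  # Responses are generated by the model passed to evaluate().
  result = EvalTask(
      dataset=eval_dataset,
      metrics=["exact_match", "bleu", "rouge_1", "rouge_l_sum"],
      experiment="my-experiment",
  ).evaluate(
      model=GenerativeModel("gemini-1.5-pro"),
      experiment_run_name="gemini-eval-run",
  )
  ```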

3. If a `prompt_template` is specified, the `prompt` column is not required.
Prompts are assembled from the evaluation dataset, so every variable name in
the prompt template must be present as a dataset column.

  ```
  import pandas as pd
  from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
  from vertexai.generative_models import GenerativeModel

  eval_dataset = pd.DataFrame({
      "context": [...],
      "instruction": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
  ).evaluate(
      model=GenerativeModel("gemini-1.5-pro"),
      # {instruction} and {context} are filled from the matching columns.
      prompt_template="{instruction}. Article: {context}. Summary:",
  )
  ```

4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom inference function. The input column name to the
custom inference function is `prompt` and must be present in the dataset.

  ```
  import pandas as pd
  from openai import OpenAI
  from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

  client = OpenAI()

  def custom_model_fn(prompt: str) -> str:
      # Send one prompt to the external model and return its text response.
      response = client.chat.completions.create(
          model="gpt-3.5-turbo",
          messages=[
              {"role": "user", "content": prompt}
          ],
      )
      return response.choices[0].message.content

  eval_dataset = pd.DataFrame({
      "prompt": [...],
      "reference": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[MetricPromptTemplateExamples.Pointwise.SAFETY],
      experiment="my-experiment",
  ).evaluate(
      model=custom_model_fn,
      experiment_run_name="gpt-eval-run",
  )
  ```

5. To perform pairwise metric evaluation with a model inference step, pass
the `baseline_model` to a `PairwiseMetric` instance and the candidate
`model` to the `EvalTask.evaluate()` function. Both models read their input
from the `prompt` column, which must be present in the dataset.

  ```
  import pandas as pd
  from vertexai.evaluation import (
      EvalTask,
      MetricPromptTemplateExamples,
      PairwiseMetric,
  )
  from vertexai.generative_models import GenerativeModel

  baseline_model = GenerativeModel("gemini-1.0-pro")
  candidate_model = GenerativeModel("gemini-1.5-pro")

  # The baseline model is attached to the metric, not to evaluate().
  pairwise_groundedness = PairwiseMetric(
      metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
          "pairwise_groundedness"
      ),
      baseline_model=baseline_model,
  )
  eval_dataset = pd.DataFrame({
      "prompt": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[pairwise_groundedness],
      experiment="my-pairwise-experiment",
  ).evaluate(
      model=candidate_model,
      experiment_run_name="gemini-pairwise-eval-run",
  )
  ```

Properties

dataset

Returns evaluation dataset.

experiment

Returns experiment name.

metrics

Returns metrics.

Methods

EvalTask

EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "tool_call_valid",
                "tool_name_match",
                "tool_parameter_key_match",
                "tool_parameter_kv_match",
            ],
            vertexai.evaluation.CustomMetric,
            vertexai.evaluation.metrics._base._AutomaticMetric,
            vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
            vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
    output_uri_prefix: typing.Optional[str] = ""
)

Initializes an EvalTask.

display_runs

display_runs()

Displays experiment runs associated with this EvalTask.

evaluate

evaluate(
    *,
    model: typing.Optional[
        typing.Union[
            vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
        ]
    ] = None,
    prompt_template: typing.Optional[str] = None,
    experiment_run_name: typing.Optional[str] = None,
    response_column_name: typing.Optional[str] = None,
    baseline_model_response_column_name: typing.Optional[str] = None,
    evaluation_service_qps: typing.Optional[float] = None,
    retry_timeout: float = 600.0,
    output_file_name: typing.Optional[str] = None
) -> vertexai.evaluation.EvalResult

Runs an evaluation for the EvalTask.
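
According to the signature, `model` may be any `Callable[[str], str]`, so an inference function can be wrapped before it is passed to `evaluate()`. The sketch below illustrates one possible wrapper that caches responses for repeated prompts. The names `cached_model_fn` and `echo_model` are hypothetical and not part of the SDK, and whether caching suits a given evaluation (for example, with non-deterministic sampling) is a judgment call for the caller.

```python
from functools import lru_cache
from typing import Callable


def cached_model_fn(
    generate: Callable[[str], str], maxsize: int = 1024
) -> Callable[[str], str]:
    """Wrap a prompt -> response function so identical prompts hit a cache."""

    @lru_cache(maxsize=maxsize)
    def model_fn(prompt: str) -> str:
        return generate(prompt)

    return model_fn


def echo_model(prompt: str) -> str:
    # Hypothetical stand-in for a real inference call (for example, an HTTP request).
    return prompt.upper()


model_fn = cached_model_fn(echo_model)
print(model_fn("summarize this article"))
```

The wrapped function keeps the `str -> str` shape from the annotation, so it would be passed the same way as the custom inference function in the usage examples: `evaluate(model=model_fn)`.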