Package evaluation (1.66.0)

API documentation for evaluation package.

Classes

CustomMetric

The custom evaluation metric.

A fully-customized CustomMetric that can be used to evaluate a single model by defining a metric function for a computation-based metric. The CustomMetric is computed on the client-side using the user-defined metric function in SDK only, not by the Vertex Gen AI Evaluation Service.

Attributes: name: The name of the metric. metric_function: The user-defined evaluation function to compute a metric score. Must use the dataset row dictionary as the metric function input and return per-instance metric result as a dictionary output. The metric score must mapped to the name of the CustomMetric as key.

EvalResult

Evaluation result.

EvalTask

A class representing an EvalTask.

An Evaluation Tasks is defined to measure the model's ability to perform a certain task in response to specific prompts or inputs. Evaluation tasks must contain an evaluation dataset, and a list of metrics to evaluate. Evaluation tasks help developers compare propmpt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.

Dataset Details:

Default dataset column names:
    * prompt_column_name: "prompt"
    * reference_column_name: "reference"
    * response_column_name: "response"
    * baseline_model_response_column_name: "baseline_model_response"

Requirement for different use cases:
  * Bring-your-own-response: A `response` column is required. Response
      column name can be customized by providing `response_column_name`
      parameter. If a pairwise metric is used and a baseline model is
      not provided, a `baseline_model_response` column is required.
      Baseline model response column name can be customized by providing
      `baseline_model_response_column_name` parameter. If the `response`
      column or `baseline_model_response` column is present while the
      corresponding model is specified, an error will be raised.
  * Perform model inference without a prompt template: A `prompt` column
      in the evaluation dataset representing the input prompt to the
      model is required and is used directly as input to the model.
  * Perform model inference with a prompt template: Evaluation dataset
      must contain column names corresponding to the variable names in
      the prompt template. For example, if prompt template is
      "Instruction: {instruction}, context: {context}", the dataset must
      contain `instruction` and `context` columns.

Metrics Details:

The supported metrics descriptions, rating rubrics, and the required
input variables can be found on the Vertex AI public documentation page.
[Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).

Usage Examples:

1. To perform bring-your-own-response(BYOR) evaluation, provide the model
responses in the `response` column in the dataset. If a pairwise metric is
used for BYOR evaluation, provide the baseline model responses in the
`baseline_model_response` column.

  ```
  eval_dataset = pd.DataFrame({
          "prompt"  : [...],
          "reference": [...],
          "response" : [...],
          "baseline_model_response": [...],
  })
  eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
            "bleu",
            "rouge_l_sum",
            MetricPromptTemplateExamples.Pointwise.FLUENCY,
            MetricPromptTemplateExamples.Pairwise.SAFETY
    ],
    experiment="my-experiment",
  )
  eval_result = eval_task.evaluate(experiment_run_name="eval-experiment-run")
  ```

2. To perform evaluation with Gemini model inference, specify the `model`
parameter with a `GenerativeModel` instance.  The input column name to the
model is `prompt` and must be present in the dataset.

  ```
  eval_dataset = pd.DataFrame({
        "reference": [...],
        "prompt"  : [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=["exact_match", "bleu", "rouge_1", "rouge_l_sum"],
      experiment="my-experiment",
  ).evaluate(
      model=GenerativeModel("gemini-1.5-pro"),
      experiment_run_name="gemini-eval-run"
  )
  ```

3. If a `prompt_template` is specified, the `prompt` column is not required.
Prompts can be assembled from the evaluation dataset, and all prompt
template variable names must be present in the dataset columns.
  ```
  eval_dataset = pd.DataFrame({
      "context"    : [...],
      "instruction": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
  ).evaluate(
      model=GenerativeModel("gemini-1.5-pro"),
      prompt_template="{instruction}. Article: {context}. Summary:",
  )
  ```

4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom inference function. The input column name to the
custom inference function is `prompt` and must be present in the dataset.

  ```
  from openai import OpenAI
  client = OpenAI()
  def custom_model_fn(input: str) -> str:
    response = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "user", "content": input}
      ]
    )
    return response.choices[0].message.content

  eval_dataset = pd.DataFrame({
        "prompt"  : [...],
        "reference": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[MetricPromptTemplateExamples.Pointwise.SAFETY],
      experiment="my-experiment",
  ).evaluate(
      model=custom_model_fn,
      experiment_run_name="gpt-eval-run"
  )
  ```

5. To perform pairwise metric evaluation with model inference step, specify
the `baseline_model` input to a `PairwiseMetric` instance and the candidate
`model` input to the `EvalTask.evaluate()` function. The input column name
to both models is `prompt` and must be present in the dataset.

  ```
  baseline_model = GenerativeModel("gemini-1.0-pro")
  candidate_model = GenerativeModel("gemini-1.5-pro")

  pairwise_groundedness = PairwiseMetric(
      metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
          "pairwise_groundedness"
      ),
      baseline_model=baseline_model,
  )
  eval_dataset = pd.DataFrame({
        "prompt"  : [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[pairwise_groundedness],
      experiment="my-pairwise-experiment",
  ).evaluate(
      model=candidate_model,
      experiment_run_name="gemini-pairwise-eval-run",
  )
  ```

MetricPromptTemplateExamples

Examples of metric prompt templates for model-based evaluation.

PairwiseMetric

A Model-based Pairwise Metric.

A model-based evaluation metric that compares two generative models' responses side-by-side, and allows users to A/B test their generative models to determine which model is performing better.

For more details on when to use pairwise metrics, see Evaluation methods and metrics.

Result Details:

* In `EvalResult.summary_metrics`, win rates for both the baseline and
candidate model are computed. The win rate is computed as proportion of
wins of one model's responses to total attempts as a decimal value
between 0 and 1.

* In `EvalResult.metrics_table`, a pairwise metric produces two
evaluation results per dataset row:
    * `pairwise_choice`: The choice shows whether the candidate model or
      the baseline model performs better, or if they are equally good.
    * `explanation`: The rationale behind each verdict using
      chain-of-thought reasoning. The explanation helps users scrutinize
      the judgment and builds appropriate trust in the decisions.

See [documentation
page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#understand-results)
for more details on understanding the metric results.

Usage Examples:

```
baseline_model = GenerativeModel("gemini-1.0-pro")
candidate_model = GenerativeModel("gemini-1.5-pro")

pairwise_groundedness = PairwiseMetric(
    metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
        "pairwise_groundedness"
    ),
    baseline_model=baseline_model,
)
eval_dataset = pd.DataFrame({
      "prompt"  : [...],
})
pairwise_task = EvalTask(
    dataset=eval_dataset,
    metrics=[pairwise_groundedness],
    experiment="my-pairwise-experiment",
)
pairwise_result = pairwise_task.evaluate(
    model=candidate_model,
    experiment_run_name="gemini-pairwise-eval-run",
)
```

PairwiseMetricPromptTemplate

Pairwise metric prompt template for pairwise model-based metrics.

PointwiseMetric

A Model-based Pointwise Metric.

A model-based evaluation metric that evaluate a single generative model's response.

For more details on when to use model-based pointwise metrics, see Evaluation methods and metrics.

Usage Examples:

```
candidate_model = GenerativeModel("gemini-1.5-pro")
eval_dataset = pd.DataFrame({
    "prompt"  : [...],
})
fluency_metric = PointwiseMetric(
    metric="fluency",
    metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template('fluency'),
)
pointwise_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        fluency_metric,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
    ],
)
pointwise_result = pointwise_eval_task.evaluate(
    model=candidate_model,
)
```

PointwiseMetricPromptTemplate

Pointwise metric prompt template for pointwise model-based metrics.

PromptTemplate

A prompt template for creating prompts with variables.

The PromptTemplate class allows users to define a template string with variables represented in curly braces {variable}. The variable names cannot contain spaces. These variables can be replaced with specific values using the assemble method, providing flexibility in generating dynamic prompts.

Usage:

```
template_str = "Hello, {name}! Today is {day}. How are you?"
prompt_template = PromptTemplate(template_str)
completed_prompt = prompt_template.assemble(name="John", day="Monday")
print(completed_prompt)
```

Rouge

The ROUGE Metric.

Calculates the recall of n-grams in prediction as compared to reference and returns a score ranging between 0 and 1. Supported rouge types are rougen[1-9], rougeL, and rougeLsum.