API documentation for the evaluation package.
Classes
CustomMetric
The custom evaluation metric.
The evaluation function must take a dataset row (instance) as the `metric_function` input and return the per-instance metric result as a dictionary, with the metric score mapped to `CustomMetric.name` as the key.
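A minimal sketch of a custom metric. It assumes `CustomMetric` can be imported from `vertexai.preview.evaluation` and that its constructor takes the same `name` and `metric_function` parameters as `make_metric` below; the row keys and scoring logic are illustrative.
```
from vertexai.preview.evaluation import CustomMetric

# Per-instance metric: 1.0 if the reference string appears in the response.
# The instance is a single dataset row passed in as a dict, and the returned
# dict must use the metric name as its key.
def contains_reference(instance: dict) -> dict:
    score = 1.0 if instance["reference"] in instance["response"] else 0.0
    return {"contains_reference": score}

contains_reference_metric = CustomMetric(
    name="contains_reference",
    metric_function=contains_reference,
)
# The metric can then be passed to EvalTask(metrics=[contains_reference_metric, ...]).
```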
EvalResult
Evaluation result.
EvalTask
A class representing an EvalTask.
An evaluation task is defined to measure a model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.
Dataset Details:
Default dataset column names:
* content_column_name: "content"
* reference_column_name: "reference"
* response_column_name: "response"
Requirements for different use cases:
* Bring your own prediction: A `response` column is required. The response
  column name can be customized by providing the `response_column_name`
  parameter (see the sketch after this list).
* Without prompt template: A column representing the input prompt to the
  model is required. If `content_column_name` is not specified, the
  eval dataset requires a `content` column by default. Any existing
  response column is ignored; new responses are generated from the
  content column and used for evaluation.
* With prompt template: The dataset must contain columns corresponding to
  the placeholder names in the prompt template. For example, if the prompt
  template is "Instruction: {instruction}, context: {context}", the
  dataset must contain `instruction` and `context` columns.
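For example, a minimal bring-your-own-prediction sketch with a custom response column name. This assumes `response_column_name` is accepted by the `EvalTask` constructor, as the list above implies; the column names and metric are illustrative.
```
import pandas as pd
from vertexai.preview.evaluation import EvalTask

eval_dataset = pd.DataFrame({
    "reference": [...],
    "model_output": [...],  # pre-computed model responses under a custom column name
})
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["rouge_l_sum"],
    # Assumption: response_column_name is a constructor parameter, per the list above.
    response_column_name="model_output",
)
```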
Metrics Details:
The supported metrics, metric bundle descriptions, grading rubrics, and
the required input fields can be found on the Vertex AI public
documentation page [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).
Usage:
1. To perform bring-your-own-prediction (BYOP) evaluation, provide the model
responses in the response column of the dataset. The response column name
is "response" by default; specify the `response_column_name` parameter to
customize it.
```
import pandas as pd
from vertexai.preview.evaluation import EvalTask

eval_dataset = pd.DataFrame({
    "reference": [...],
    "response": [...],
})
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["bleu", "rouge_l_sum", "coherence", "fluency"],
    experiment="my-experiment",
)
eval_result = eval_task.evaluate(
    experiment_run_name="eval-experiment-run"
)
```
2. To perform evaluation with built-in Gemini model inference, specify the
`model` parameter with a GenerativeModel instance. The default input
column for the model is `content`.
```
from vertexai.generative_models import GenerativeModel

eval_dataset = pd.DataFrame({
    "reference": [...],
    "content": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "exact_match", "bleu", "rouge_1", "rouge_2", "rouge_l_sum"
    ],
    experiment="my-experiment",
).evaluate(
    model=GenerativeModel("gemini-pro"),
    experiment_run_name="gemini-pro-eval-run",
)
```
3. If a `prompt_template` is specified, the `content` column is not required.
Prompts can be assembled from the evaluation dataset, and all placeholder
names must be present in the dataset columns.
```
eval_dataset = pd.DataFrame({
    "context": [...],
    "instruction": [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality"],
).evaluate(
    model=model,
    prompt_template="{instruction}. Article: {context}. Summary:",
)
```
4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom prediction function. The `content` column in the
dataset is used to generate predictions with the custom model function for
evaluation.
```
from openai import OpenAI

# `client` is an OpenAI client; an OpenAI API key is assumed to be configured
# in the environment.
client = OpenAI()

def custom_model_fn(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": input}
        ],
    )
    return response.choices[0].message.content

eval_dataset = pd.DataFrame({
    "content": [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["text_generation_similarity", "text_generation_quality"],
    experiment="my-experiment",
).evaluate(
    model=custom_model_fn,
    experiment_run_name="gpt-eval-run",
)
```
PairwiseMetric
The Side-by-Side (SxS) Pairwise Metric.
A model-based evaluation metric that compares two generative models side-by-side, and allows users to A/B test their generative models to determine which model is performing better on the given evaluation task.
For more details on when to use pairwise metrics, see Evaluation methods and metrics.
Result Details:
* In `EvalResult.summary_metrics`, win rates for both the baseline and
  candidate model are computed, showing the rate at which each model performs
  better on the given task. The win rate is computed as the number of times
  the candidate model performs better than the baseline model divided by the
  total number of examples. The win rate is a number between 0 and 1.
* In `EvalResult.metrics_table`, a pairwise metric produces three
  evaluation results for each row in the dataset:
    * `pairwise_choice`: an enumeration that indicates whether the candidate
      or baseline model performs better.
    * `explanation`: the AutoRater's rationale behind each verdict, produced
      using chain-of-thought reasoning. These explanations help users
      scrutinize the AutoRater's judgment and build appropriate trust in
      its decisions.
    * `confidence`: a score between 0 and 1 that signifies how confident
      the AutoRater was in its verdict. A score closer to 1 means higher
      confidence.
See [documentation page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#understand-results)
for more details on understanding the metric results.
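For example, a hedged sketch of reading these results, assuming `pairwise_results` is the `EvalResult` returned by `eval_task.evaluate()` in the usage example below; the exact metric keys and column names depend on the metric used, so they are printed rather than hard-coded.
```
# Aggregate statistics, including the baseline and candidate win rates.
print(pairwise_results.summary_metrics)

# Per-row results; look for the pairwise_choice, explanation, and
# confidence columns produced for the chosen metric.
print(pairwise_results.metrics_table.columns)
print(pairwise_results.metrics_table.head())
```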
Usage:
```
import pandas as pd
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask, PairwiseMetric

baseline_model = GenerativeModel("gemini-1.0-pro")
candidate_model = GenerativeModel("gemini-1.5-pro")

pairwise_summarization_quality = PairwiseMetric(
    metric="summarization_quality",
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=pd.DataFrame({
        "instruction": [...],
        "context": [...],
    }),
    metrics=[pairwise_summarization_quality],
)
pairwise_results = eval_task.evaluate(
    prompt_template="instruction: {instruction}. context: {context}",
    model=candidate_model,
)
```
PromptTemplate
A prompt template for creating prompts with placeholders.
The `PromptTemplate` class allows users to define a template string with placeholders represented in curly braces, such as `{placeholder}`. Placeholder names cannot contain spaces. These placeholders can be replaced with specific values using the `assemble` method, providing flexibility in generating dynamic prompts.
Usage:
```
from vertexai.preview.evaluation import PromptTemplate

template_str = "Hello, {name}! Today is {day}. How are you?"
prompt_template = PromptTemplate(template_str)
completed_prompt = prompt_template.assemble(name="John", day="Monday")
print(completed_prompt)  # "Hello, John! Today is Monday. How are you?"
```
Functions
make_metric
```
make_metric(
    name: str,
    metric_function: typing.Callable[
        [typing.Dict[str, typing.Any]], typing.Dict[str, typing.Any]
    ],
) -> vertexai.preview.evaluation.metrics._base.CustomMetric
```
Makes a custom metric.
Parameters

| Name | Description |
|---|---|
| `name` | The name of the metric. |
| `metric_function` | The evaluation function. It must take a dataset row (instance) as input and return the per-instance metric result as a dictionary, with the metric score mapped to `CustomMetric.name` as the key. |
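A minimal sketch of defining a metric with `make_metric`, following the signature above. The import path is assumed from this package's listing, and the column names and scoring logic are illustrative.
```
from vertexai.preview.evaluation import make_metric

# The metric function receives one dataset row as a dict and must return a
# dict keyed by the metric name.
def case_insensitive_match_fn(instance: dict) -> dict:
    match = instance["response"].strip().lower() == instance["reference"].strip().lower()
    return {"case_insensitive_match": 1.0 if match else 0.0}

case_insensitive_match = make_metric(
    name="case_insensitive_match",
    metric_function=case_insensitive_match_fn,
)
# The returned CustomMetric can be included in EvalTask(metrics=[...]).
```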