Package evaluation (1.51.0)

API documentation for the evaluation package.

Classes

CustomMetric

The custom evaluation metric.

The evaluation function (metric_function) must take a dataset row/instance as its input and return the per-instance metric result as a dictionary. The metric score must be mapped to CustomMetric.name as the key.

EvalResult

Evaluation result.

EvalTask

A class representing an EvalTask.

An evaluation task is defined to measure the model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.

Dataset details:

Default dataset column names:

  • content_column_name: "content"
  • reference_column_name: "reference"
  • response_column_name: "response"

Requirements for different use cases:

  • Bring your own prediction: A response column is required. The response column name can be customized with the response_column_name parameter.
  • Without prompt template: A column containing the input prompt to the model is required. If content_column_name is not specified, the dataset must contain a "content" column by default. Any existing response column is ignored: new responses are generated from the content column and used for evaluation.
  • With prompt template: The dataset must contain columns whose names match the placeholder names in the prompt template. For example, if the prompt template is "Instruction: {instruction}, context: {context}", the dataset must contain instruction and context columns.
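The column requirements above can be sketched with the standard library alone. This is an illustration of the rules as stated on this page, not the SDK's own validation logic: `placeholder_names`, `required_columns`, and the `runs_inference` flag are hypothetical names, and columns needed by individual metrics (such as a reference column) are not covered.

```python
from string import Formatter
from typing import Optional, Set


def placeholder_names(prompt_template: str) -> Set[str]:
    """Collect the {placeholder} names that appear in a template string."""
    return {
        field
        for _, field, _, _ in Formatter().parse(prompt_template)
        if field  # skip literal-only segments (None) and empty "{}" fields
    }


def required_columns(
    runs_inference: bool,
    prompt_template: Optional[str] = None,
    content_column_name: str = "content",
    response_column_name: str = "response",
) -> Set[str]:
    """Return the dataset columns each use case needs (hypothetical helper)."""
    if not runs_inference:
        # Bring your own prediction: responses are already in the dataset.
        return {response_column_name}
    if prompt_template is None:
        # Without prompt template: the content column is sent to the model.
        return {content_column_name}
    # With prompt template: one column per placeholder name.
    return placeholder_names(prompt_template)


# Column names of an example dataset, checked against a prompt template.
columns = ["instruction", "context", "reference"]
needed = required_columns(
    runs_inference=True,
    prompt_template="Instruction: {instruction}, context: {context}",
)
missing = needed - set(columns)
print(sorted(needed), sorted(missing))
```

A non-empty `missing` set would indicate that the dataset cannot fill every placeholder in the template.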

Metric details: The supported metrics, metric bundle descriptions, grading rubrics, and required input fields are listed on the Vertex AI public documentation page "Evaluation methods and metrics".

Usage:

  1. To perform bring-your-own-prediction evaluation, provide the model responses in the response column of the dataset. The response column name is "response" by default; specify the response_column_name parameter to customize it.

    import pandas as pd
    from vertexai.preview.evaluation import EvalTask

    eval_dataset = pd.DataFrame({
        "reference": [...],
        "response": [...],
    })
    eval_task = EvalTask(
        dataset=eval_dataset,
        metrics=["bleu", "rouge_l_sum", "coherence", "fluency"],
        experiment="my-experiment",
    )
    eval_result = eval_task.evaluate(
        experiment_run_name="eval-experiment-run"
    )
    
  2. To perform evaluation with built-in Gemini model inference, pass a GenerativeModel instance as the model parameter. By default, the model is queried with the content column.

    from vertexai.generative_models import GenerativeModel

    eval_dataset = pd.DataFrame({
        "reference": [...],
        "content": [...],
    })
    result = EvalTask(
        dataset=eval_dataset,
        metrics=["exact_match", "bleu", "rouge_1", "rouge_2", "rouge_l_sum"],
        experiment="my-experiment",
    ).evaluate(
        model=GenerativeModel("gemini-pro"),
        experiment_run_name="gemini-pro-eval-run"
    )
    
  3. If a prompt_template is specified, the content column is not required. Prompts can be assembled from the evaluation dataset, and all placeholder names must be present in the dataset columns.

    eval_dataset = pd.DataFrame({
        "context": [...],
        "instruction": [...],
        "reference": [...],
    })
    result = EvalTask(
        dataset=eval_dataset,
        metrics=["summarization_quality"],
    ).evaluate(
        model=model,
        prompt_template="{instruction}. Article: {context}. Summary:",
    )
    
  4. To perform evaluation with custom model inference, specify the model parameter with a custom prediction function. The content column in the dataset is used to generate predictions with the custom model function for evaluation.

    # `client` is assumed to be an already-configured OpenAI client.
    def custom_model_fn(input: str) -> str:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "user", "content": input}
            ]
        )
        return response.choices[0].message.content

    eval_dataset = pd.DataFrame({
        "content": [...],
        "reference": [...],
    })
    result = EvalTask(
        dataset=eval_dataset,
        metrics=["text_generation_similarity", "text_generation_quality"],
        experiment="my-experiment",
    ).evaluate(
        model=custom_model_fn,
        experiment_run_name="gpt-eval-run"
    )
    

PairwiseMetric

The side-by-side (SxS) pairwise metric.

PromptTemplate

A prompt template for creating prompts with placeholders.

The PromptTemplate class allows users to define a template string with placeholders represented in curly braces {placeholder}. The placeholder names cannot contain spaces. These placeholders can be replaced with specific values using the assemble method, providing flexibility in generating dynamic prompts.

Example Usage:

```
template_str = "Hello, {name}! Today is {day}. How are you?"
prompt_template = PromptTemplate(template_str)
completed_prompt = prompt_template.assemble(name="John", day="Monday")
print(completed_prompt)
```

The placeholders attribute holds the set of placeholder names from the template string.
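To illustrate what collecting placeholder names and filling them in involves, here is a standard-library sketch. It is not the library's implementation: `find_placeholders` and `fill` are hypothetical helpers, and leaving unfilled placeholders untouched is a choice made for this sketch rather than documented PromptTemplate behavior.

```python
import re
from typing import Set

# Placeholder names cannot contain spaces, so \w+ is sufficient for this sketch.
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def find_placeholders(template: str) -> Set[str]:
    """Return the set of placeholder names found in the template string."""
    return set(_PLACEHOLDER.findall(template))


def fill(template: str, **values: str) -> str:
    """Replace each {name} that has a value; leave the others untouched."""
    return _PLACEHOLDER.sub(
        lambda match: str(values.get(match.group(1), match.group(0))),
        template,
    )


template_str = "Hello, {name}! Today is {day}. How are you?"
print(find_placeholders(template_str))
print(fill(template_str, name="John", day="Monday"))
print(fill(template_str, name="John"))
```

Returning a set rather than a list means repeated placeholders are reported once, which is convenient when comparing against dataset column names.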

Package Functions

make_metric

make_metric(
    name: str,
    metric_function: typing.Callable[
        [typing.Dict[str, typing.Any]], typing.Dict[str, typing.Any]
    ],
) -> vertexai.preview.evaluation.metrics._base.CustomMetric

Makes a custom metric.

Parameters

  • name: The name of the metric.
  • metric_function: The evaluation function. It must take a dataset row/instance as its input and return the per-instance metric result as a dictionary. The metric score must be mapped to CustomMetric.name as the key.
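A metric function that satisfies this contract might look like the sketch below. The metric name `response_word_count` and the function `word_count_metric` are hypothetical, and the row is shown as a plain dictionary with a "response" key, which is an assumption about how instances are passed in. The make_metric call is left as a comment so the sketch runs without the SDK installed.

```python
from typing import Any, Dict

METRIC_NAME = "response_word_count"  # hypothetical metric name


def word_count_metric(instance: Dict[str, Any]) -> Dict[str, Any]:
    """Score one dataset row; the score is keyed by the metric name."""
    response = str(instance.get("response", ""))
    return {METRIC_NAME: len(response.split())}


# One row of an evaluation dataset, represented as a plain dictionary.
row = {"content": "Summarize the article.", "response": "A short summary."}
print(word_count_metric(row))

# With the SDK installed, the function would be wrapped roughly like this,
# assuming make_metric is importable from the package documented here:
# from vertexai.preview.evaluation import make_metric
# custom_metric = make_metric(name=METRIC_NAME, metric_function=word_count_metric)
```

The key in the returned dictionary matches the name passed to make_metric, which is what lets the score be looked up per instance.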