Tutorial: Perform evaluation using the Vertex Gen AI SDK

This tutorial shows you how to evaluate your generative AI models and applications using the Vertex Gen AI Eval SDK for Python.

This tutorial covers the following topics:

  • Define your metric: Learn how to use prebuilt metrics, create custom LLM-based metrics, or implement computation-based metrics for your evaluation.
  • Prepare your evaluation dataset: Understand the supported data formats, including Pandas DataFrames and JSONL files from Vertex AI or third-party services.
  • Run evaluation: Execute single-model or multi-model comparison evaluations, including asynchronous options for large datasets.
  • Visualization: Inspect and analyze your inference and evaluation results using interactive reports.

You can use the Vertex Gen AI Eval SDK for the Gen AI evaluation service to measure the performance of prompts, foundation models, and complex AI agents against the criteria most relevant to your use case. You can evaluate any number of candidates at once to fine-tune prompts, select the best model, or iterate on complex agents.

With the Vertex Gen AI SDK, you can do the following:

  • Compare multiple models or configurations side-by-side in a single run, using win-rate calculations to guide your decisions.
  • Use built-in support to evaluate and benchmark against popular third-party models without complex integrations.
  • Efficiently handle large datasets and offload large-scale evaluation tasks using asynchronous batch evaluation.

End-to-end example

The following end-to-end example demonstrates the two-step workflow of the Vertex Gen AI Eval SDK: generating model responses and then evaluating them.

  1. Install the Vertex Gen AI Eval SDK:

    pip install --upgrade google-cloud-aiplatform[evaluation]
    
  2. Prepare the evaluation dataset:

    import pandas as pd
    from vertexai import Client, types
    
    client = Client(project="your-project-id", location="us-central1")
    
    prompts_df = pd.DataFrame({"prompt": ["How does AI work?"]})
    
  3. Run the evaluation:

    # Evaluating a single model. 
    eval_dataset = client.evals.run_inference(
        model="gemini-2.5-flash",
        src=prompts_df,
    )
    eval_result = client.evals.evaluate(
        dataset=eval_dataset,
        metrics=[types.PrebuiltMetric.TEXT_QUALITY]
    )
    eval_result.show()
    
  4. Compare multiple candidates:

    # Comparing multiple candidates. 
    candidate_1 = client.evals.run_inference(
        model="gemini-2.0-flash", src=prompts_df
    )
    candidate_2 = client.evals.run_inference(
        model="gemini-2.5-flash", src=prompts_df
    )
    comparison_result = client.evals.evaluate(
        dataset=[candidate_1, candidate_2],
        metrics=[types.PrebuiltMetric.TEXT_QUALITY]
    )
    comparison_result.show()
    

Define your metric

The Vertex Gen AI Eval SDK supports several types of metrics to suit different evaluation needs. You can use prebuilt metrics, create custom metrics judged by an LLM, or implement your own computation-based logic.

| Metric type | Description | Use case |
| --- | --- | --- |
| LLM-based metrics | Uses a large language model (LLM) as a "judge" to score responses based on a prompt with nuanced criteria. | Evaluating subjective qualities like writing style, creativity, safety, or instruction following. |
| Computation-based metrics | Mathematically compares a model's output against a known "ground truth" reference. | Measuring objective correctness with algorithms like exact_match, bleu, or rouge. Requires a reference answer. |
| Custom function metrics | Executes a custom Python function to implement bespoke evaluation logic for each row in the dataset. | Implementing highly specific checks that aren't covered by other metric types, such as checking for a specific keyword or format. |

LLM-based metrics

LLM-based metrics use a large language model (LLM) as a "judge" to evaluate nuanced criteria such as style or writing quality, which are difficult to measure with algorithms alone.

Use prebuilt LLM metrics

The Vertex Gen AI Eval SDK provides a variety of ready-to-use, model-based metrics like TEXT_QUALITY, SAFETY, and INSTRUCTION_FOLLOWING. You can access these through the PrebuiltMetric class. Prebuilt metric definitions are loaded on-demand from a centralized library, which ensures consistency across Vertex Gen AI Eval SDK versions.

# Assumes 'eval_dataset' is an EvaluationDataset object created via run_inference()

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
    ]
)

To ensure consistent results across different SDK versions, you can pin a metric to a specific version. By default, the latest version is used.

# Pin a prebuilt metric to a specific version
instruction_following_v1 = types.PrebuiltMetric.INSTRUCTION_FOLLOWING(version='v1')
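A pinned metric can then be passed to evaluate() like any other metric. The following is a minimal sketch that reuses the evaluate() call shown earlier and assumes eval_dataset is an EvaluationDataset created with run_inference():

# Use the pinned metric version in an evaluation
# (assumes eval_dataset was created with run_inference(), as shown earlier)
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[instruction_following_v1],
)
eval_result.show()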

Customize LLM metrics

For use cases that require specialized criteria, you can define your own LLM-based metric by instantiating the LLMMetric class. This gives you control over the evaluation prompt template, judge model, and other parameters.

The MetricPromptBuilder helper class creates a structured prompt template for the judge model where you define the instruction, criteria, and rating_scores separately.

# Define a custom metric to evaluate language simplicity
simplicity_metric = types.LLMMetric(
    name='language_simplicity',
    prompt_template=types.MetricPromptBuilder(
        instruction="Evaluate the story's simplicity for a 5-year-old.",
        criteria={
            "Vocabulary": "Uses simple words.",
            "Sentences": "Uses short sentences.",
        },
        rating_scores={
            "5": "Excellent: Very simple, ideal for a 5-year-old.",
            "4": "Good: Mostly simple, with minor complex parts.",
            "3": "Fair: Mix of simple and complex; may be challenging for a 5-year-old.",
            "2": "Poor: Largely too complex, with difficult words/sentences.",
            "1": "Very Poor: Very complex, unsuitable for a 5-year-old."
        }
    )
)
# Use the custom metric in an evaluation
eval_result = client.evals.evaluate(
    dataset=inference_results,
    metrics=[simplicity_metric]
)

Computation-based and custom function metrics

Computation-based metrics mathematically compare a model's output to a ground truth or reference. These metrics are calculated with code using the base Metric class and support predefined algorithms such as exact_match, bleu, and rouge_1. To use a computation-based metric, instantiate the Metric class with the metric's name. The metric requires a reference column in your dataset for comparison.

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.Metric(name='exact_match'),
        types.Metric(name='bleu'),
        types.Metric(name='rouge_1'),
    ]
)

Implement a custom function metric

You can also implement custom evaluation logic by passing a custom Python function to the custom_function parameter. The Vertex Gen AI Eval SDK executes this function for each row of your dataset.

# Define a custom function to check for the presence of a keyword
def contains_keyword(instance: dict) -> dict:
    keyword = "magic"
    response_text = instance.get("response", "")
    score = 1.0 if keyword in response_text.lower() else 0.0
    return {"score": score}

keyword_metric = types.Metric(
    name="keyword_check",
    custom_function=contains_keyword
)


eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[keyword_metric]
)

Prepare your evaluation dataset

The Vertex Gen AI Eval SDK automatically detects and handles several common data formats. This means you can often use your data as-is without needing to perform manual conversions, whether you're generating new responses with run_inference or evaluating existing ones with evaluate.

| Format | Description | Use case |
| --- | --- | --- |
| Pandas DataFrame | A simple, in-memory table structure using the pandas library. The SDK automatically detects common column names like prompt and reference. | Best for straightforward, local evaluations with smaller datasets that fit comfortably in memory. |
| Gemini batch prediction format | JSONL files from a Vertex AI batch prediction job, where each line contains a request and response object. | Ideal for evaluating pre-generated responses from a large-scale Vertex AI batch job without reformatting. |
| OpenAI Chat Completion format | JSONL files where each line is a JSON object structured like an OpenAI API request. | Convenient for evaluating or comparing with third-party models (e.g., from OpenAI) by using their native request format. |

The Vertex Gen AI Eval SDK supports the following formats:

Pandas DataFrame (flattened format)

For local evaluations, you can use a pandas.DataFrame. The Vertex Gen AI Eval SDK looks for common column names like prompt, response, and reference. This format is fully backward-compatible.

import pandas as pd

# Simple DataFrame with prompts and ground truth references
prompts_df = pd.DataFrame({
    "prompt": [
        "What is the capital of France?",
        "Who wrote 'Hamlet'?",
    ],
    "reference": [
        "Paris",
        "William Shakespeare",
    ]
})

# Generate responses using the DataFrame as a source
inference_results = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df
)
inference_results.show()
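Because this DataFrame includes a reference column, you can also score the generated responses against the ground truth. The following is a minimal sketch that reuses the evaluate() call with the exact_match computation-based metric described later in this tutorial:

# Score the generated responses against the ground-truth references.
# exact_match requires the reference column shown in the DataFrame above.
eval_result = client.evals.evaluate(
    dataset=inference_results,
    metrics=[types.Metric(name='exact_match')],
)
eval_result.show()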

Gemini batch prediction format

You can directly use the output files of a Vertex AI batch prediction job, which are typically JSONL files stored in Cloud Storage where each line contains a request and response object. The Vertex Gen AI Eval SDK parses this structure automatically, which provides seamless integration with other Vertex AI services.

An example of a single line in a JSONL file:

{"request": {"contents": [{"role": "user", "parts": [{"text": "Why is the sky blue?"}]}]}, "response": {"candidates": [{"content": {"role": "model", "parts": [{"text": "The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering."}]}}]}}

You can then evaluate pre-generated responses from a batch job directly:

# Cloud Storage path to your batch prediction output file 
batch_job_output_uri = "gs://path/to/your/batch_output.jsonl"

# Evaluate the results directly from Cloud Storage
eval_result = client.evals.evaluate(
    dataset=batch_job_output_uri,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.FLUENCY,
    ]
)
eval_result.show()

OpenAI Chat Completion format

For evaluating or comparing with third-party models, the Vertex Gen AI Eval SDK supports the OpenAI Chat Completion format. You can supply a dataset where each row is a JSON object structured like an OpenAI API request. The Vertex Gen AI Eval SDK automatically detects this format.

An example of a single line in this format:

{"request": {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's the capital of France?"}], "model": "gpt-4o"}}

You can use this data to generate responses from a third-party model and evaluate the responses:

# Ensure your third-party API key is set
# e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key' 

openai_request_uri = "gs://path/to/your/openai_requests.jsonl"

# Generate responses using a LiteLLM-supported model string
openai_responses = client.evals.run_inference(
    model="gpt-4o",
    src=openai_request_uri,
)

# The resulting dataset can then be evaluated 
eval_result = client.evals.evaluate(
    dataset=openai_responses,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.FLUENCY,
    ]
)

eval_result.show()

Run evaluation

The evaluation process with the Vertex Gen AI Eval SDK involves two main steps:

  1. run_inference(): Generate responses from your model for a given set of prompts.
  2. evaluate(): Compute metrics on the generated responses.

eval_dataset = client.evals.run_inference(
    model="gemini-2.5-flash",
    src="gs://vertex-evaluation-llm-dataset-us-central1/genai_eval_sdk/test_prompts.jsonl",
)
eval_dataset.show()


eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.QUESTION_ANSWERING_QUALITY,
        types.Metric(name='bleu'),
        types.Metric(name='rouge_1'),
    ]
)
eval_result.show()

To analyze the performance of multiple AI models or systems in a single evaluation, generate a response for each candidate and pass them in a list to the evaluate() method:

inference_result_1 = client.evals.run_inference(
    model="gemini-2.0-flash",
    src=prompts_df,
)
inference_result_2 = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df,
)

# Compare the responses against each other
comparison_result = client.evals.evaluate(
    dataset=[inference_result_1, inference_result_2],
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
    ]
)

comparison_result.show()

Asynchronous and large-scale evaluation

For large datasets, the Vertex Gen AI Eval SDK provides an asynchronous, long-running batch evaluation method. This is ideal for scenarios where you don't need immediate results and want to offload the computation.

The batch_evaluate() method returns an operation object that you can poll to track its progress. The parameters are compatible with the evaluate() method.

GCS_DEST_BUCKET = "gs://your-gcs-bucket/batch_eval_results/"

inference_result_saved = client.evals.run_inference(
    model="gemini-2.0-flash",
    src=prompts_df,
    config={'dest': GCS_DEST_BUCKET}
)
print(f"Eval dataset uploaded to: {inference_result_saved.gcs_source}")

batch_eval_job = client.evals.batch_evaluate(
    dataset=inference_result_saved,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
        types.PrebuiltMetric.FLUENCY,
        types.Metric(name='bleu'),
    ],
    dest=GCS_DEST_BUCKET,
)
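You can then poll the returned operation until the job completes. The following is a minimal polling sketch; the done() call is an assumption based on the common long-running operation pattern rather than a confirmed method of the returned object, so check the SDK reference for the exact interface:

import time

# Hypothetical polling loop: assumes the returned operation object exposes
# done(), following the standard long-running operation pattern.
while not batch_eval_job.done():
    print("Batch evaluation is still running...")
    time.sleep(60)

# Results are written to the Cloud Storage destination passed as `dest`.
print(f"Batch evaluation finished. Results are in: {GCS_DEST_BUCKET}")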

Evaluating third-party models

You can use the Vertex Gen AI Eval SDK to evaluate and compare models from providers such as OpenAI. The SDK uses the litellm library to call third-party model APIs. To run an evaluation, pass the model name string to the run_inference method.

Set the required API key as an environment variable (OPENAI_API_KEY):

import os

# Set your third-party model API key
os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY'

# Run inference on an OpenAI model
gpt_response = client.evals.run_inference(
    model='gpt-4o',
    src=prompts_df
)

# You can now evaluate the responses
eval_result = client.evals.evaluate(
    dataset=gpt_response,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)

eval_result.show()

Visualization

You can use the Vertex Gen AI Eval SDK to visualize your results directly within your development environment, such as a Colab or Jupyter notebook. The .show() method, available on both EvaluationDataset and EvaluationResult objects, renders an interactive HTML report for analysis.

Visualizing inference results

After generating responses with run_inference(), you can call .show() on the resulting EvaluationDataset object to inspect the model's outputs alongside your original prompts and references. This is useful for a quick quality check before running a full evaluation.

# First, run inference to get an EvaluationDataset
gpt_response = client.evals.run_inference(
    model='gpt-4o',
    src=prompts_df
)

# Now, visualize the inference results
gpt_response.show()

A table displays with each prompt, its corresponding reference (if provided), and the newly generated response.

Visualizing evaluation reports

When you call .show() on an EvaluationResult object, a report displays with two main sections:

  • Summary metrics: An aggregated view of all metrics, showing the mean score and standard deviation across the entire dataset.
  • Detailed results: A case-by-case breakdown, allowing you to inspect the prompt, reference, candidate response, and the specific score and explanation for each metric.

# First, run an evaluation on a single candidate
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)

# Visualize the detailed evaluation report
eval_result.show()

The report's format adapts depending on whether you are evaluating a single candidate or comparing multiple candidates. For a multi-candidate evaluation, the report provides a side-by-side view and includes win/tie-rate calculations in the summary table.

For all reports, you can expand a View Raw JSON section to inspect the data for any structured prompt or response.

What's next