Tutorial: Perform evaluation using the Vertex Gen AI SDK

This tutorial shows you how to evaluate your generative AI models and applications using the Vertex Gen AI Eval SDK for Python.

This tutorial covers the following topics:

  • Define your metric: Learn how to use prebuilt metrics, create custom LLM-based metrics, or implement computation-based metrics for your evaluation.
  • Prepare your evaluation dataset: Understand the supported data formats, including Pandas DataFrames and JSONL files from Vertex AI or third-party services.
  • Run evaluation: Execute single-model or multi-model comparison evaluations, including asynchronous options for large datasets.
  • Visualization: Inspect and analyze your inference and evaluation results using interactive reports.

You can use the Vertex Gen AI Eval SDK for the Gen AI evaluation service to measure the performance of prompts, foundation models, and complex AI agents against the criteria most relevant to your use case. You can evaluate any number of candidates at once to fine-tune prompts, select the best model, or iterate on complex agents.

With the Vertex Gen AI SDK, you can do the following:

  • Compare multiple models or configurations side-by-side in a single run, using win-rate calculations to guide your decisions.
  • Use built-in support to evaluate and benchmark against popular third-party models without complex integrations.
  • Efficiently handle large datasets and offload large-scale evaluation tasks using asynchronous batch evaluation.

End-to-end example

The following end-to-end example demonstrates the two-step workflow of the Vertex Gen AI Eval SDK: generating model responses and then evaluating them.

  1. Install the Vertex Gen AI Eval SDK:

    pip install --upgrade google-cloud-aiplatform[evaluation]
    
  2. Prepare the evaluation dataset:

    import pandas as pd
    from vertexai import Client, types
    
    client = Client(project="your-project-id", location="us-central1")
    
    prompts_df = pd.DataFrame({"prompt": ["How does AI work?"]})
    
  3. Run the evaluation:

    # Evaluating a single model. 
    eval_dataset = client.evals.run_inference(
        model="gemini-2.5-flash",
        src=prompts_df,
    )
    eval_result = client.evals.evaluate(
        dataset=eval_dataset,
        metrics=[types.PrebuiltMetric.TEXT_QUALITY]
    )
    eval_result.show()
    
  4. Compare multiple candidates:

    # Comparing multiple candidates. 
    candidate_1 = client.evals.run_inference(
        model="gemini-2.0-flash", src=prompts_df
    )
    candidate_2 = client.evals.run_inference(
        model="gemini-2.5-flash", src=prompts_df
    )
    comparison_result = client.evals.evaluate(
        dataset=[candidate_1, candidate_2],
        metrics=[types.PrebuiltMetric.TEXT_QUALITY]
    )
    comparison_result.show()
    

Define your metric

The Vertex Gen AI Eval SDK supports several types of metrics to suit different evaluation needs. You can use prebuilt metrics, create custom metrics judged by an LLM, or implement your own computation-based logic.

| Metric type | Description | Use case |
| --- | --- | --- |
| LLM-based metrics | Uses a large language model (LLM) as a "judge" to score responses based on a prompt with nuanced criteria. | Evaluating subjective qualities like writing style, creativity, safety, or instruction following. |
| Computation-based metrics | Mathematically compares a model's output against a known "ground truth" reference. | Measuring objective correctness with algorithms like exact_match, bleu, or rouge. Requires a reference answer. |
| Custom function metrics | Executes a custom Python function to implement bespoke evaluation logic for each row in the dataset. | Implementing highly specific checks that aren't covered by other metric types, such as checking for a specific keyword or format. |

LLM-based metrics

LLM-based metrics use a large language model (LLM) as a "judge" to evaluate nuanced criteria such as style or writing quality, which are difficult to measure with algorithms alone.

Use prebuilt LLM metrics

The Vertex Gen AI Eval SDK provides a variety of ready-to-use, model-based metrics like TEXT_QUALITY, SAFETY, and INSTRUCTION_FOLLOWING. You can access these through the PrebuiltMetric class. Prebuilt metric definitions are loaded on-demand from a centralized library, which ensures consistency across Vertex Gen AI Eval SDK versions.

# Assumes 'eval_dataset' is an EvaluationDataset object created via run_inference()

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
    ]
)

To ensure consistent results across different SDK versions, you can pin a metric to a specific version. By default, the latest version is used.

# Pin a prebuilt metric to a specific version
instruction_following_v1 = types.PrebuiltMetric.INSTRUCTION_FOLLOWING(version='v1')
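A pinned metric can then be passed to evaluate() like any other metric. The following is a minimal sketch that reuses the evaluate() call shown earlier and assumes eval_dataset is an EvaluationDataset created with run_inference():

# Use the pinned metric version in an evaluation
# (assumes eval_dataset was created with run_inference(), as shown earlier)
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[instruction_following_v1],
)
eval_result.show()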

Customize LLM metrics

For use cases that require specialized criteria, you can define your own LLM-based metric by instantiating the LLMMetric class. This gives you control over the evaluation prompt template, judge model, and other parameters.

The MetricPromptBuilder helper class creates a structured prompt template for the judge model where you define the instruction, criteria, and rating_scores separately.

# Define a custom metric to evaluate language simplicity
simplicity_metric = types.LLMMetric(
    name='language_simplicity',
    prompt_template=types.MetricPromptBuilder(
        instruction="Evaluate the story's simplicity for a 5-year-old.",
        criteria={
            "Vocabulary": "Uses simple words.",
            "Sentences": "Uses short sentences.",
        },
        rating_scores={
            "5": "Excellent: Very simple, ideal for a 5-year-old.",
            "4": "Good: Mostly simple, with minor complex parts.",
            "3": "Fair: Mix of simple and complex; may be challenging for a 5-year-old.",
            "2": "Poor: Largely too complex, with difficult words/sentences.",
            "1": "Very Poor: Very complex, unsuitable for a 5-year-old."
        }
    )
)
# Use the custom metric in an evaluation
eval_result = client.evals.evaluate(
    dataset=inference_results,
    metrics=[simplicity_metric]
)

Computation-based and custom function metrics

Computation-based metrics mathematically compare a model's output to a ground truth or reference. These metrics are calculated with code using the base Metric class and support predefined algorithms such as exact_match, bleu, and rouge_1. To use a computation-based metric, instantiate the Metric class with the metric's name. The metric requires a reference column in your dataset for comparison.

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.Metric(name='exact_match'),
        types.Metric(name='bleu'),
        types.Metric(name='rouge_1'),
    ]
)

Implement a custom function metric

You can also implement custom evaluation logic by passing a custom Python function to the custom_function parameter. The Vertex Gen AI Eval SDK executes this function for each row of your dataset.

# Define a custom function to check for the presence of a keyword
def contains_keyword(instance: dict) -> dict:
    keyword = "magic"
    response_text = instance.get("response", "")
    score = 1.0 if keyword in response_text.lower() else 0.0
    return {"score": score}

keyword_metric = types.Metric(
    name="keyword_check",
    custom_function=contains_keyword
)


eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[keyword_metric]
)

Prepare your evaluation dataset

The Vertex Gen AI Eval SDK automatically detects and handles several common data formats. This means you can often use your data as-is without needing to perform manual conversions, whether you're generating new responses with run_inference or evaluating existing ones with evaluate.

| Format | Description | Use case |
| --- | --- | --- |
| Pandas DataFrame | A simple, in-memory table structure using the pandas library. The SDK automatically detects common column names like prompt and reference. | Best for straightforward, local evaluations with smaller datasets that fit comfortably in memory. |
| Gemini batch prediction format | JSONL files from a Vertex AI batch prediction job, where each line contains a request and response object. | Ideal for evaluating pre-generated responses from a large-scale Vertex AI batch job without reformatting. |
| OpenAI Chat Completion format | JSONL files where each line is a JSON object structured like an OpenAI API request. | Convenient for evaluating or comparing with third-party models (e.g., from OpenAI) by using their native request format. |

The Vertex Gen AI Eval SDK supports the following formats:

Pandas DataFrame (flattened format)

For local evaluations, you can use a pandas.DataFrame. The Vertex Gen AI Eval SDK looks for common column names like prompt, response, and reference. This format is fully backward-compatible.

import pandas as pd

# Simple DataFrame with prompts and ground truth references
prompts_df = pd.DataFrame({
    "prompt": [
        "What is the capital of France?",
        "Who wrote 'Hamlet'?",
    ],
    "reference": [
        "Paris",
        "William Shakespeare",
    ]
})

# Generate responses using the DataFrame as a source
inference_results = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df
)
inference_results.show()
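Because this DataFrame includes a reference column, you can also score the generated responses against the ground truth. The following is a minimal sketch that reuses the evaluate() call with the exact_match computation-based metric described later in this tutorial:

# Score the generated responses against the ground-truth references.
# exact_match requires the reference column shown in the DataFrame above.
eval_result = client.evals.evaluate(
    dataset=inference_results,
    metrics=[types.Metric(name='exact_match')],
)
eval_result.show()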

Gemini batch prediction format

You can directly use the output files of a Vertex AI batch prediction job, which are typically JSONL files stored in Cloud Storage where each line contains a request and response object. The Vertex Gen AI Eval SDK parses this structure automatically, which provides seamless integration with other Vertex AI services.

An example of a single line in a JSONL file:

{"request": {"contents": [{"role": "user", "parts": [{"text": "Why is the sky blue?"}]}]}, "response": {"candidates": [{"content": {"role": "model", "parts": [{"text": "The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering."}]}}]}}

You can then evaluate pre-generated responses from a batch job directly:

# Cloud Storage path to your batch prediction output file 
batch_job_output_uri = "gs://path/to/your/batch_output.jsonl"

# Evaluate the results directly from Cloud Storage
eval_result = client.evals.evaluate(
    dataset=batch_job_output_uri,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.FLUENCY,
    ]
)
eval_result.show()

OpenAI Chat Completion format

For evaluating or comparing with third-party models, the Vertex Gen AI Eval SDK supports the OpenAI Chat Completion format. You can supply a dataset where each row is a JSON object structured like an OpenAI API request. The Vertex Gen AI Eval SDK automatically detects this format.

An example of a single line in this format:

{"request": {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's the capital of France?"}], "model": "gpt-4o"}}

You can use this data to generate responses from a third-party model and evaluate the responses:

# Ensure your third-party API key is set
# e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key' 

openai_request_uri = "gs://path/to/your/openai_requests.jsonl"

# Generate responses using a LiteLLM-supported model string
openai_responses = client.evals.run_inference(
    model="gpt-4o",
    src=openai_request_uri,
)

# The resulting dataset can then be evaluated 
eval_result = client.evals.evaluate(
    dataset=openai_responses,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.FLUENCY,
    ]
)

eval_result.show()

Run evaluation

The evaluation process with the Vertex Gen AI Eval SDK involves two main steps:

  1. run_inference(): Generate responses from your model for a given set of prompts.
  2. evaluate(): Compute metrics on the generated responses.

eval_dataset = client.evals.run_inference(
    model="gemini-2.5-flash",
    src="gs://vertex-evaluation-llm-dataset-us-central1/genai_eval_sdk/test_prompts.jsonl",
)
eval_dataset.show()


eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.QUESTION_ANSWERING_QUALITY,
        types.Metric(name='bleu'),
        types.Metric(name='rouge_1'),
    ]
)
eval_result.show()

To analyze the performance of multiple AI models or systems in a single evaluation, generate a response for each candidate and pass them in a list to the evaluate() method:

inference_result_1 = client.evals.run_inference(
    model="gemini-2.0-flash",
    src=prompts_df,
)
inference_result_2 = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df,
)

# Compare the responses against each other
comparison_result = client.evals.evaluate(
    dataset=[inference_result_1, inference_result_2],
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
    ]
)

comparison_result.show()

Asynchronous and large-scale evaluation

For large datasets, the Vertex Gen AI Eval SDK provides an asynchronous, long-running batch evaluation method. This is ideal for scenarios where you don't need immediate results and want to offload the computation.

The batch_evaluate() method returns an operation object that you can poll to track its progress. The parameters are compatible with the evaluate() method.

GCS_DEST_BUCKET = "gs://your-gcs-bucket/batch_eval_results/"

inference_result_saved = client.evals.run_inference(
    model="gemini-2.0-flash",
    src=prompts_df,
    config={'dest': GCS_DEST_BUCKET}
)
print(f"Eval dataset uploaded to: {inference_result_saved.gcs_source}")

batch_eval_job = client.evals.batch_evaluate(
    dataset=inference_result_saved,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
        types.PrebuiltMetric.FLUENCY,
        types.Metric(name='bleu'),
    ],
    dest=GCS_DEST_BUCKET,
)
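You can then poll the returned operation until the job completes. The following is a minimal polling sketch; the done() call is an assumption based on the common long-running operation pattern rather than a confirmed method of the returned object, so check the SDK reference for the exact interface:

import time

# Hypothetical polling loop: assumes the returned operation object exposes
# done(), following the standard long-running operation pattern.
while not batch_eval_job.done():
    print("Batch evaluation is still running...")
    time.sleep(60)

# Results are written to the Cloud Storage destination passed as `dest`.
print(f"Batch evaluation finished. Results are in: {GCS_DEST_BUCKET}")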

Evaluating third-party models

You can use the Vertex Gen AI Eval SDK to evaluate and compare models from providers such as OpenAI. The SDK uses the litellm library to call third-party model APIs. To run an evaluation, pass the model name string to the run_inference method.

Set the required API key as an environment variable (OPENAI_API_KEY):

import os

# Set your third-party model API key
os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY'

# Run inference on an OpenAI model
gpt_response = client.evals.run_inference(
    model='gpt-4o',
    src=prompts_df
)

# You can now evaluate the responses
eval_result = client.evals.evaluate(
    dataset=gpt_response,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)

eval_result.show()

Visualization

You can use the Vertex Gen AI Eval SDK to visualize your results directly within your development environment, such as a Colab or Jupyter notebook. The .show() method, available on both EvaluationDataset and EvaluationResult objects, renders an interactive HTML report for analysis.

Visualizing inference results

After generating responses with run_inference(), you can call .show() on the resulting EvaluationDataset object to inspect the model's outputs alongside your original prompts and references. This is useful for a quick quality check before running a full evaluation.

# First, run inference to get an EvaluationDataset
gpt_response = client.evals.run_inference(
    model='gpt-4o',
    src=prompts_df
)

# Now, visualize the inference results
gpt_response.show()

A table displays with each prompt, its corresponding reference (if provided), and the newly generated response.

Visualizing evaluation reports

When you call .show() on an EvaluationResult object, a report displays with two main sections:

  • Summary metrics: An aggregated view of all metrics, showing the mean score and standard deviation across the entire dataset.
  • Detailed results: A case-by-case breakdown, allowing you to inspect the prompt, reference, candidate response, and the specific score and explanation for each metric.

# First, run an evaluation on a single candidate
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)

# Visualize the detailed evaluation report
eval_result.show()

The report's format adapts depending on whether you are evaluating a single candidate or comparing multiple candidates. For a multi-candidate evaluation, the report provides a side-by-side view and includes win/tie-rate calculations in the summary table.

For all reports, you can expand a View Raw JSON section to inspect the data for any structured prompt or response.

What's next