This tutorial shows you how to evaluate your generative AI models and applications using the Vertex Gen AI Eval SDK for Python. It covers defining evaluation metrics, preparing your evaluation dataset, running evaluations, and visualizing the results.

You can use the Vertex Gen AI Eval SDK with the Gen AI evaluation service to measure the performance of prompts, foundation models, and complex AI agents against the criteria most relevant to your use case. You can evaluate any number of candidates at once to fine-tune prompts, select the best model, or iterate on complex agents.
End-to-end example

The following end-to-end example demonstrates the two-step workflow of the Vertex Gen AI Eval SDK: generating model responses and then evaluating them.

Install the Vertex Gen AI Eval SDK:
pip install --upgrade google-cloud-aiplatform[evaluation]
Prepare the evaluation dataset:

import pandas as pd
from vertexai import Client, types
client = Client(project="your-project-id", location="us-central1")
prompts_df = pd.DataFrame({"prompt": ["How does AI work?"]})
Run the evaluation:

# Evaluating a single model.
eval_dataset = client.evals.run_inference(
model="gemini-2.5-flash",
src=prompts_df,
)
eval_result = client.evals.evaluate(
dataset=eval_dataset,
metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)
eval_result.show()
Compare multiple candidates:

# Comparing multiple candidates.
candidate_1 = client.evals.run_inference(
model="gemini-2.0-flash", src=prompts_df
)
candidate_2 = client.evals.run_inference(
model="gemini-2.5-flash", src=prompts_df
)
comparison_result = client.evals.evaluate(
dataset=[candidate_1, candidate_2],
metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)
comparison_result.show()
Define your metric

The Vertex Gen AI Eval SDK supports several types of metrics to suit different evaluation needs. You can use prebuilt metrics, create custom metrics judged by an LLM, or implement your own computation-based logic.
Metric type: LLM-based metrics
Description: Uses a large language model (LLM) as a "judge" to score responses based on a prompt with nuanced criteria.
Use case: Evaluating subjective qualities like writing style, creativity, safety, or instruction following.

Metric type: Computation-based metrics
Description: Mathematically compares a model's output against a known "ground truth" reference.
Use case: Measuring objective correctness with algorithms like exact_match, bleu, or rouge. Requires a reference answer.

Metric type: Custom function metrics
Description: Executes a custom Python function to implement bespoke evaluation logic for each row in the dataset.
Use case: Implementing highly specific checks that aren't covered by other metric types, such as checking for a specific keyword or format.
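As an illustration (not from the original page), the following sketch mixes all three metric types in a single evaluate() call. It assumes that client and an eval_dataset with a reference column already exist, that metric types can be freely combined as in the mixed-metric examples later on this page, and that check_not_empty is a hypothetical helper.

# Minimal sketch: one evaluation that mixes all three metric types.
# Assumes 'client' and 'eval_dataset' (with a 'reference' column) already exist.
def check_not_empty(instance: dict) -> dict:
    # Hypothetical custom check: score 1.0 when the response is non-empty.
    return {"score": 1.0 if instance.get("response", "").strip() else 0.0}

mixed_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,  # LLM-based
        types.Metric(name="exact_match"),  # computation-based
        types.Metric(name="not_empty_check", custom_function=check_not_empty),  # custom function
    ],
)
mixed_result.show()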
LLM-based metrics

LLM-based metrics use a large language model (LLM) as a "judge" to evaluate nuanced criteria such as style or writing quality, which are difficult to measure with algorithms alone.

Use prebuilt LLM metrics

The Vertex Gen AI Eval SDK provides a variety of ready-to-use, model-based metrics such as TEXT_QUALITY, SAFETY, and INSTRUCTION_FOLLOWING. You can access these through the PrebuiltMetric class. Prebuilt metric definitions are loaded on demand from a centralized library, which ensures consistency across Vertex Gen AI Eval SDK versions.

# Assumes 'eval_dataset' is an EvaluationDataset object created via run_inference()
eval_result = client.evals.evaluate(
dataset=eval_dataset,
metrics=[
types.PrebuiltMetric.TEXT_QUALITY,
types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
]
)
To create consistent results across different SDK versions, you can pin a metric to a specific version. By default, the latest version is used.

# Pin to a specific version of a pre-built metric
instruction_following_v1 = types.PrebuiltMetric.INSTRUCTION_FOLLOWING(version='v1')
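The pinned metric can then be passed to evaluate() like any other metric; the following usage sketch assumes the eval_dataset created earlier.

# Use the pinned metric in an evaluation (assumes 'eval_dataset' exists)
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[instruction_following_v1],
)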
Customize LLM metrics

For use cases that require specialized criteria, you can define your own LLM-based metric by instantiating the LLMMetric class. This gives you control over the evaluation prompt template, judge model, and other parameters. The MetricPromptBuilder helper class creates a structured prompt template for the judge model where you define the instruction, criteria, and rating_scores separately.

# Define a custom metric to evaluate language simplicity
simplicity_metric = types.LLMMetric(
name='language_simplicity',
prompt_template=types.MetricPromptBuilder(
instruction="Evaluate the story's simplicity for a 5-year-old.",
criteria={
"Vocabulary": "Uses simple words.",
"Sentences": "Uses short sentences.",
},
rating_scores={
"5": "Excellent: Very simple, ideal for a 5-year-old.",
"4": "Good: Mostly simple, with minor complex parts.",
"3": "Fair: Mix of simple and complex; may be challenging for a 5-year-old.",
"2": "Poor: Largely too complex, with difficult words/sentences.",
"1": "Very Poor: Very complex, unsuitable for a 5-year-old."
}
)
)
# Use the custom metric in an evaluation
eval_result = client.evals.evaluate(
dataset=inference_results,
metrics=[simplicity_metric]
)
Computation-based and custom function metrics

Computation-based metrics mathematically compare a model's output to a ground truth or reference. These metrics are calculated with code using the base Metric class and support predefined algorithms such as exact_match, bleu, and rouge_1. To use a computation-based metric, instantiate the Metric class with the metric's name. The metric requires a reference column in your dataset for comparison.

eval_result = client.evals.evaluate(
dataset=eval_dataset,
metrics=[
types.Metric(name='exact_match'),
types.Metric(name='bleu'),
types.Metric(name='rouge_1'),
]
)
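Because these metrics compare against a ground truth, a common pattern is to evaluate pre-generated responses directly. The following sketch relies on the automatic detection of prompt, response, and reference columns described later on this page; the DataFrame contents are hypothetical.

import pandas as pd

# Hypothetical pre-generated responses with ground-truth references
labeled_df = pd.DataFrame({
    "prompt": ["What is the capital of France?"],
    "response": ["The capital of France is Paris."],
    "reference": ["Paris"],
})

eval_result = client.evals.evaluate(
    dataset=labeled_df,
    metrics=[types.Metric(name="rouge_1")],
)
eval_result.show()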
Implement a custom function metric

You can also implement custom evaluation logic by passing a custom Python function to the custom_function parameter. The Vertex Gen AI Eval SDK executes this function for each row of your dataset.

# Define a custom function to check for the presence of a keyword
def contains_keyword(instance: dict) -> dict:
    keyword = "magic"
    response_text = instance.get("response", "")
    score = 1.0 if keyword in response_text.lower() else 0.0
    return {"score": score}
keyword_metric = types.Metric(
name="keyword_check",
custom_function=contains_keyword
)
eval_result = client.evals.evaluate(
dataset=eval_dataset,
metrics=[keyword_metric]
)
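Because the metric is plain Python, you can sanity-check the function locally on a hand-written row before running a full evaluation; the sample row below is only an illustration.

# Quick local check of the custom function on a sample row
sample_row = {"response": "The wizard cast a magic spell."}
print(contains_keyword(sample_row))  # {'score': 1.0}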
Prepare your evaluation dataset

The Vertex Gen AI Eval SDK automatically detects and handles several common data formats. This means you can often use your data as-is without needing to perform manual conversions, whether you're generating new responses with run_inference or evaluating existing ones with evaluate.

The Vertex Gen AI Eval SDK supports the following formats:
Format: Pandas DataFrame
Description: A simple, in-memory table structure using the pandas library. The SDK automatically detects common column names like prompt and reference.
Use case: Best for straightforward, local evaluations with smaller datasets that fit comfortably in memory.

Format: Gemini batch prediction format
Description: JSONL files from a Vertex AI batch prediction job, where each line contains a request and response object.
Use case: Ideal for evaluating pre-generated responses from a large-scale Vertex AI batch job without reformatting.

Format: OpenAI Chat Completion format
Description: JSONL files where each line is a JSON object structured like an OpenAI API request.
Use case: Convenient for evaluating or comparing with third-party models (for example, from OpenAI) by using their native request format.
Pandas DataFrame (flattened format)

For local evaluations, you can use a pandas.DataFrame. The Vertex Gen AI Eval SDK looks for common column names like prompt, response, and reference. This format is fully backward-compatible.

import pandas as pd
# Simple DataFrame with prompts and ground truth references
prompts_df = pd.DataFrame({
"prompt": [
"What is the capital of France?",
"Who wrote 'Hamlet'?",
],
"reference": [
"Paris",
"William Shakespeare",
]
})
# Generate responses using the DataFrame as a source
inference_results = client.evals.run_inference(
model="gemini-2.5-flash",
src=prompts_df
)
inference_results.show()
Gemini batch prediction format

You can directly use the output of a Vertex AI batch prediction job, which is typically a set of JSONL files stored in Cloud Storage in which each line contains a request and a response object. The Vertex Gen AI Eval SDK parses this structure automatically, which provides seamless integration with other Vertex AI services.

An example of a single line in a JSONL file:

{"request": {"contents": [{"role": "user", "parts": [{"text": "Why is the sky blue?"}]}]}, "response": {"candidates": [{"content": {"role": "model", "parts": [{"text": "The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering."}]}}]}}

You can then evaluate pre-generated responses from a batch job directly:
# Cloud Storage path to your batch prediction output file
batch_job_output_uri = "gs://path/to/your/batch_output.jsonl"
# Evaluate the results directly from Cloud Storage
eval_result = client.evals.evaluate(
dataset=batch_job_output_uri,
metrics=[
types.PrebuiltMetric.TEXT_QUALITY,
types.PrebuiltMetric.FLUENCY,
]
)
eval_result.show()
{"request": {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's the capital of France?"}], "model": "gpt-4o"}}
You can use this data to generate responses from a third-party model and evaluate the responses:

# Ensure your third-party API key is set
# e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key'
openai_request_uri = "gs://path/to/your/openai_requests.jsonl"
# Generate responses using a LiteLLM-supported model string
openai_responses = client.evals.run_inference(
model="gpt-4o",
src=openai_request_uri,
)
# The resulting dataset can then be evaluated
eval_result = client.evals.evaluate(
dataset=openai_responses,
metrics=[
types.PrebuiltMetric.TEXT_QUALITY,
types.PrebuiltMetric.FLUENCY,
]
)
eval_result.show()
Run evaluation

The evaluation process with the Vertex Gen AI Eval SDK involves two main steps:

1. run_inference(): Generate responses from your model for a given set of prompts.
2. evaluate(): Compute metrics on the generated responses.

eval_dataset = client.evals.run_inference(
model="gemini-2.5-flash",
src="gs://vertex-evaluation-llm-dataset-us-central1/genai_eval_sdk/test_prompts.jsonl",
)
eval_dataset.show()
eval_result = client.evals.evaluate(
dataset=eval_dataset,
metrics=[
types.PrebuiltMetric.TEXT_QUALITY,
types.PrebuiltMetric.QUESTION_ANSWERING_QUALITY,
types.Metric(name='bleu'),
types.Metric(name='rouge_1'),
]
)
eval_result.show()
To analyze the performance of multiple AI models or systems in a single evaluation, generate responses for each candidate and pass them as a list to the evaluate() method:

inference_result_1 = client.evals.run_inference(
model="gemini-2.0-flash",
src=prompts_df,
)
inference_result_2 = client.evals.run_inference(
model="gemini-2.5-flash",
src=prompts_df,
)
# Compare the responses against each other
comparison_result = client.evals.evaluate(
dataset=[inference_result_1, inference_result_2],
metrics=[
types.PrebuiltMetric.TEXT_QUALITY,
types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
]
)
comparison_result.show()
Asynchronous and large-scale evaluation

For large datasets, the Vertex Gen AI Eval SDK provides an asynchronous, long-running batch evaluation method. This is ideal for scenarios where you don't need immediate results and want to offload the computation. The batch_evaluate() method returns an operation object that you can poll to track its progress. The parameters are compatible with the evaluate() method.

GCS_DEST_BUCKET = "gs://your-gcs-bucket/batch_eval_results/"
inference_result_saved = client.evals.run_inference(
model="gemini-2.0-flash",
src=prompts_df,
config={'dest': GCS_DEST_BUCKET}
)
print(f"Eval dataset uploaded to: {inference_result_saved.gcs_source}")
batch_eval_job = client.evals.batch_evaluate(
dataset=inference_result_saved,
metrics=[
types.PrebuiltMetric.TEXT_QUALITY,
types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
types.PrebuiltMetric.FLUENCY,
types.Metric(name='bleu'),
],
dest=GCS_DEST_BUCKET
)
Evaluating third-party models

You can use the Vertex Gen AI Eval SDK to evaluate and compare models from providers such as OpenAI. The SDK uses the litellm library to call third-party model APIs. To run an evaluation, pass the model name string to the run_inference method.

Set the required API key as an environment variable (OPENAI_API_KEY):

import os
# Set your third-party model API key
os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY'
# Run inference on an OpenAI model
gpt_response = client.evals.run_inference(
model='gpt-4o',
src=prompts_df
)
# You can now evaluate the responses
eval_result = client.evals.evaluate(
dataset=gpt_response,
metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)
eval_result.show()
Visualization

You can use the Vertex Gen AI Eval SDK to visualize your results directly within your development environment, such as a Colab or Jupyter notebook. The .show() method, available on both EvaluationDataset and EvaluationResult objects, renders an interactive HTML report for analysis.

Visualizing inference results

After generating responses with run_inference(), you can call .show() on the resulting EvaluationDataset object to inspect the model's outputs alongside your original prompts and references. This is useful for a quick quality check before running a full evaluation. A table displays with each prompt, its corresponding reference (if provided), and the newly generated response.

# First, run inference to get an EvaluationDataset
gpt_response = client.evals.run_inference(
model='gpt-4o',
src=prompts_df
)
# Now, visualize the inference results
gpt_response.show()
Visualizing evaluation reports

When you call .show() on an EvaluationResult object, a report displays with two main sections: a summary table of metric scores and a detailed view of the individual results. The report's format adapts depending on whether you are evaluating a single candidate or comparing multiple candidates. For a multi-candidate evaluation, the report provides a side-by-side view and includes win/tie-rate calculations in the summary table. For all reports, you can expand a View Raw JSON section to inspect the data for any structured prompt or response.
# First, run an evaluation on a single candidate
eval_result = client.evals.evaluate(
dataset=eval_dataset,
metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)
# Visualize the detailed evaluation report
eval_result.show()
What's next
Tutorial: Perform evaluation using the Vertex Gen AI SDK