This page shows you how to evaluate your generative AI models and applications across a range of use cases using the Vertex Gen AI Eval SDK for Python.
The Vertex Gen AI Eval SDK for the Gen AI evaluation service lets you measure the performance of prompts, foundation models, and complex AI agents against the criteria most relevant to your use case. You can evaluate any number of candidates at once to fine-tune prompts, select the best model, or iterate on complex agents.
You can do the following with the Vertex Gen AI Eval SDK:
Compare multiple models or configurations side-by-side in a single run, using win-rate calculations to guide your decisions.
Use built-in support to evaluate and benchmark against popular third-party models without complex integrations.
Handle large datasets with more efficiency and offload large-scale evaluation tasks using asynchronous batch evaluation.
End-to-end example
The Vertex Gen AI Eval SDK uses a two-step workflow: generating model responses and evaluating the responses. The following end-to-end example shows how the Vertex Gen AI Eval SDK works:
Install the Vertex Gen AI Eval SDK:
pip install --upgrade google-cloud-aiplatform[evaluation]
Prepare the evaluation dataset:
import pandas as pd
from vertexai import Client, types

client = Client(project="your-project-id", location="us-central1")
prompts_df = pd.DataFrame({"prompt": ["How does AI work?"]})
Run the evaluation:
# Evaluating a single model.
eval_dataset = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df,
)
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)
eval_result.show()
Compare multiple candidates:
# Comparing multiple candidates.
candidate_1 = client.evals.run_inference(
    model="gemini-2.0-flash",
    src=prompts_df
)
candidate_2 = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df
)
comparison_result = client.evals.evaluate(
    dataset=[candidate_1, candidate_2],
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)
comparison_result.show()
Define your metric
Define your metrics as either an LLM-based metric or computation-based metric.
LLM-based metrics
LLM-based metrics use a large language model (LLM) as a "judge" to evaluate nuanced criteria such as style or writing quality, which are difficult to measure with algorithms alone.
Use prebuilt LLM metrics
The Vertex Gen AI Eval SDK provides a variety of ready-to-use, model-based metrics like TEXT_QUALITY, SAFETY, and INSTRUCTION_FOLLOWING. You can access these through the PrebuiltMetric class. Prebuilt metric definitions are loaded on-demand from a centralized library to create consistency across Vertex Gen AI Eval SDK versions.
# Assumes 'eval_dataset' is an EvaluationDataset object created via run_inference()
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
    ]
)
To create consistent results across different SDK versions, you can pin a metric to a specific version. By default, the latest version is used.
# Pin to a specific version of a pre-built metric
instruction_following_v1 = types.PrebuiltMetric.INSTRUCTION_FOLLOWING(version='v1')
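You can then pass the pinned metric instance to evaluate() like any other metric object. A minimal sketch, reusing the eval_dataset from the previous snippet:
# Use the pinned metric like any other metric object
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[instruction_following_v1],
)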
Customize LLM metrics
For use cases requiring specialized criteria, you can define your own LLM-based metric by instantiating the LLMMetric class. This gives you full control over the evaluation prompt template, judge model, and other parameters.
The MetricPromptBuilder helper class creates a structured prompt template for the judge model by letting you define the instruction, criteria, and rating_scores separately.
# Define a custom metric to evaluate language simplicity
simplicity_metric = types.LLMMetric(
    name='language_simplicity',
    prompt_template=types.MetricPromptBuilder(
        instruction="Evaluate the story's simplicity for a 5-year-old.",
        criteria={
            "Vocabulary": "Uses simple words.",
            "Sentences": "Uses short sentences.",
        },
        rating_scores={
            "5": "Excellent: Very simple, ideal for a 5-year-old.",
            "4": "Good: Mostly simple, with minor complex parts.",
            "3": "Fair: Mix of simple and complex; may be challenging for a 5-year-old.",
            "2": "Poor: Largely too complex, with difficult words/sentences.",
            "1": "Very Poor: Very complex, unsuitable for a 5-year-old."
        }
    )
)

# Use the custom metric in an evaluation
eval_result = client.evals.evaluate(
    dataset=inference_results,
    metrics=[simplicity_metric]
)
Computation-based and custom function metrics
Computation-based metrics mathematically compare a model's output to a ground truth or reference. These metrics, calculated with code using the base Metric class, support predefined computation-based algorithms such as exact_match, bleu, and rouge_1. To use a computation-based metric, instantiate the Metric class with the metric's name. The metric requires a reference column in your dataset for comparison.
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.Metric(name='exact_match'),
        types.Metric(name='bleu'),
        types.Metric(name='rouge_1'),
    ]
)
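Because these metrics compare each response against a ground truth, the dataset must contain a reference column, and a response column if you are not generating responses with run_inference. A minimal sketch, assuming evaluate() accepts a flattened pandas DataFrame as described in the dataset formats section below:
import pandas as pd

# Flattened dataset with model responses and ground-truth references
qa_dataset = pd.DataFrame({
    "prompt": ["What is the capital of France?"],
    "response": ["The capital of France is Paris."],
    "reference": ["Paris"],
})

eval_result = client.evals.evaluate(
    dataset=qa_dataset,
    metrics=[
        types.Metric(name='exact_match'),
        types.Metric(name='rouge_1'),
    ]
)
eval_result.show()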
Implement a custom function metric
You can also implement custom evaluation logic by passing a custom Python function to the custom_function parameter for complete control. The Vertex Gen AI Eval SDK executes this function for each row of your dataset.
# Define a custom function to check for the presence of a keyword
def contains_keyword(instance: dict) -> dict:
    keyword = "magic"
    response_text = instance.get("response", "")
    score = 1.0 if keyword in response_text.lower() else 0.0
    return {"score": score}

keyword_metric = types.Metric(
    name="keyword_check",
    custom_function=contains_keyword
)

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[keyword_metric]
)
Prepare your evaluation dataset
The Vertex Gen AI Eval SDK automatically detects and handles several common data formats. This means you can often use your data as-is without needing to perform manual conversions, whether you're generating new responses with run_inference or evaluating existing ones with evaluate.
The Vertex Gen AI Eval SDK supports the following formats:
Pandas DataFrame (flattened format)
For straightforward evaluations, you can use a pandas.DataFrame. The Vertex Gen AI Eval SDK looks for common column names like prompt, response, and reference. This format is fully backward-compatible.
import pandas as pd

# Simple DataFrame with prompts and ground truth references
prompts_df = pd.DataFrame({
    "prompt": [
        "What is the capital of France?",
        "Who wrote 'Hamlet'?",
    ],
    "reference": [
        "Paris",
        "William Shakespeare",
    ]
})

# Generate responses using the DataFrame as a source
inference_results = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df
)
inference_results.show()
Gemini batch prediction format
You can directly use the output of a Vertex AI batch prediction job, which is typically a set of JSONL files stored in Cloud Storage, where each line contains a request and response object. The Vertex Gen AI Eval SDK parses this structure automatically to provide integration with other Vertex AI services.
An example of a single line in a JSONL file:
{"request": {"contents": [{"role": "user", "parts": [{"text": "Why is the sky blue?"}]}]}, "response": {"candidates": [{"content": {"role": "model", "parts": [{"text": "The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering."}]}}]}}
You can then evaluate pre-generated responses from a batch job directly:
# Cloud Storage path to your batch prediction output file
batch_job_output_uri = "gs://path/to/your/batch_output.jsonl"

# Evaluate the results directly from Cloud Storage
eval_result = client.evals.evaluate(
    dataset=batch_job_output_uri,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.FLUENCY,
    ]
)
eval_result.show()
OpenAI Chat Completion format
For evaluating or comparing with third-party models, the Vertex Gen AI Eval SDK supports the OpenAI Chat Completion format. You can supply a dataset where each row is a JSON object structured like an OpenAI API request. The Vertex Gen AI Eval SDK automatically detects this format.
An example of a single line in this format:
{"request": {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's the capital of France?"}], "model": "gpt-4o"}}
You can use this data to generate responses from a third-party model and evaluate the responses:
# Ensure your third-party API key is set
# e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key'
openai_request_uri = "gs://path/to/your/openai_requests.jsonl"

# Generate responses using a LiteLLM-supported model string
openai_responses = client.evals.run_inference(
    model="gpt-4o",
    src=openai_request_uri,
)

# The resulting dataset can then be evaluated
eval_result = client.evals.evaluate(
    dataset=openai_responses,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.FLUENCY,
    ]
)
eval_result.show()
Run evaluation
The Vertex Gen AI Eval SDK uses the following client-based process for running evaluations:
run_inference(): Generate responses from your model for a given set of prompts.
evaluate(): Compute metrics on the generated responses.
eval_dataset = client.evals.run_inference(
    model="gemini-2.5-flash",
    src="gs://vertex-evaluation-llm-dataset-us-central1/genai_eval_sdk/test_prompts.jsonl",
)
eval_dataset.show()

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.QUESTION_ANSWERING_QUALITY,
        types.Metric(name='bleu'),
        types.Metric(name='rouge_1'),
    ]
)
eval_result.show()
To analyze the performance of multiple AI models or systems in a single evaluation, generate a response for each candidate and pass them in a list to the evaluate() method:
inference_result_1 = client.evals.run_inference(
    model="gemini-2.0-flash",
    src=prompts_df,
)
inference_result_2 = client.evals.run_inference(
    model="gemini-2.5-flash",
    src=prompts_df,
)

# Compare the responses against each other
comparison_result = client.evals.evaluate(
    dataset=[inference_result_1, inference_result_2],
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
    ]
)
comparison_result.show()
Asynchronous and large-scale evaluation
For large datasets, the Vertex Gen AI Eval SDK provides an asynchronous, long-running batch evaluation method. This is ideal for scenarios where you don't need immediate results and want to offload the computation.
The batch_evaluate() method returns an operation object that you can poll to track its progress. The parameters are compatible with the evaluate() method.
GCS_DEST_BUCKET = "gs://your-gcs-bucket/batch_eval_results/"
inference_result_saved = client.evals.run_inference(
    model="gemini-2.0-flash",
    src=prompts_df,
    config={'dest': GCS_DEST_BUCKET}
)
print(f"Eval dataset uploaded to: {inference_result_saved.gcs_source}")

batch_eval_job = client.evals.batch_evaluate(
    dataset=inference_result_saved,
    metrics=[
        types.PrebuiltMetric.TEXT_QUALITY,
        types.PrebuiltMetric.INSTRUCTION_FOLLOWING,
        types.PrebuiltMetric.FLUENCY,
        types.Metric(name='bleu'),
    ],
    dest=GCS_DEST_BUCKET
)
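Because the batch job writes its output to the Cloud Storage destination rather than returning results inline, you can inspect that location after the job completes. A minimal sketch using the standard google-cloud-storage client (not part of the Vertex Gen AI Eval SDK); the project ID is a placeholder:
from google.cloud import storage

# Split the gs:// destination into bucket name and object prefix
bucket_name, _, prefix = GCS_DEST_BUCKET.removeprefix("gs://").partition("/")

# List the evaluation output files written by the batch job
storage_client = storage.Client(project="your-project-id")
for blob in storage_client.list_blobs(bucket_name, prefix=prefix):
    print(blob.name)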
Evaluating third-party models
You can use the Vertex Gen AI Eval SDK to evaluate and compare models from providers such as OpenAI by passing the model name string to the run_inference method. The Vertex Gen AI Eval SDK uses the litellm library to call the model API.
Make sure to set the required API key as an environment variable (OPENAI_API_KEY):
import os

# Set your third-party model API key
os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY'

# Run inference on an OpenAI model
gpt_response = client.evals.run_inference(
    model='gpt-4o',
    src=prompts_df
)

# You can now evaluate the responses
eval_result = client.evals.evaluate(
    dataset=gpt_response,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)
eval_result.show()
Visualization
The Vertex Gen AI Eval SDK lets you visualize your results directly within your development environment, such as a Colab or Jupyter notebook. The .show() method, available on both EvaluationDataset and EvaluationResult objects, renders an interactive HTML report for analysis.
Visualizing inference results
After generating responses with run_inference(), you can call .show() on the resulting EvaluationDataset object to inspect the model's outputs alongside your original prompts and references. This is useful for a quick quality check before running a full evaluation.
# First, run inference to get an EvaluationDataset
gpt_response = client.evals.run_inference(
    model='gpt-4o',
    src=prompts_df
)

# Now, visualize the inference results
gpt_response.show()
The output is a table that shows each prompt, its corresponding reference (if provided), and the newly generated response.
Visualizing evaluation reports
When you call .show() on an EvaluationResult object, a report displays with two main sections:
Summary metrics: An aggregated view of all metrics, showing the mean score and standard deviation across the entire dataset.
Detailed results: A case-by-case breakdown, allowing you to inspect the prompt, reference, candidate response, and the specific score and explanation for each metric.
# First, run an evaluation on a single candidate
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.PrebuiltMetric.TEXT_QUALITY]
)

# Visualize the detailed evaluation report
eval_result.show()
The report's format adapts depending on whether you are evaluating a single candidate or comparing multiple candidates. For a multi-candidate evaluation, the report provides a side-by-side view and includes win/tie-rate calculations in the summary table.
For all reports, you can expand a View Raw JSON section to inspect the data for any structured prompt or response.