Starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.
After you build a Generative AI model, you can use it to power an agent, such as a chatbot. The Gen AI evaluation service lets you measure your agent's ability to complete tasks and goals for your use case.
This guide shows you how to evaluate Generative AI agents using the Gen AI evaluation service and covers the following topics:
Evaluation methods: Learn about the two main approaches for agent evaluation: final response and trajectory.
Supported agents: See the types of agents you can evaluate, including those built with Agent Engine, LangChain, or custom functions.
Evaluation metrics: Understand the metrics available for evaluating an agent's final response and its trajectory.
Preparing your dataset: See how to structure your data for both final response and trajectory evaluation.
Running an evaluation: Execute an evaluation task using the Vertex AI SDK and customize metrics for your specific needs.
Evaluation methods
You can evaluate your agent using the following methods. With the Gen AI evaluation service, you can trigger an agent execution and get metrics for both evaluation methods in a single Vertex AI SDK query.

| Evaluation method | Description | Use case |
| --- | --- | --- |
| Final response evaluation | Evaluates only the final output of an agent to determine whether it achieved its goal. | When the end result is the primary concern and the intermediate steps are not important. |
| Trajectory evaluation | Evaluates the entire path (the sequence of tool calls) the agent took to reach the final response. | When the process, reasoning path, and tool usage are critical for debugging, optimization, or ensuring compliance. |
Supported agents
The Gen AI evaluation service supports the following categories of agents:
Agents built with Agent Engine
Agents built with LangChain
A flexible function that takes in a prompt for the agent and returns a response and trajectory in a dictionary (see the sketch after this list)
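The following is a minimal sketch of such a custom function. The function name, the tool logic, and the output keys ("response" and "predicted_trajectory") are illustrative assumptions based on the dataset columns shown later in this guide; adapt them to your agent.

# Minimal sketch of a custom function agent (illustrative only).
def run_agent(prompt: str) -> dict:
    # Hypothetical single tool call recorded as the trajectory.
    trajectory = [
        {"tool_name": "get_user_preferences", "tool_input": {"user_id": "user_y"}},
    ]
    final_response = "Your preferred living room temperature is 23 degrees."
    # Return the final response and the trajectory in a dictionary.
    return {
        "response": final_response,
        "predicted_trajectory": trajectory,
    }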
Evaluation metrics
You can define metrics for final response or trajectory evaluation.
Final response metrics
Final response evaluation follows the same process as model response evaluation. For more information, see Define your evaluation metrics.
Trajectory metrics
The following metrics evaluate the agent's ability to follow the expected trajectory.
| Metric | What it measures | When to use |
| --- | --- | --- |
| trajectory_exact_match | Whether the predicted tool call sequence is identical to the reference sequence. | For strict, non-flexible workflows where the exact sequence and parameters must be followed. |
| trajectory_in_order_match | Whether all reference tool calls are present in the correct order, allowing for extra calls. | When the core sequence is important, but the agent can perform additional helpful steps. |
| trajectory_any_order_match | Whether all reference tool calls are present, regardless of order or extra calls. | When a set of tasks must be completed, but the execution order is not important. |
| trajectory_precision | The proportion of predicted tool calls that are relevant (that is, also in the reference). | To penalize agents that make many irrelevant or unnecessary tool calls. |
| trajectory_recall | The proportion of required (reference) tool calls that the agent actually made. | To ensure the agent performs all necessary steps to complete the task. |
| trajectory_single_tool_use | Whether a specific, single tool was used at least once in the trajectory. | To verify whether a critical tool (for example, a final confirmation or safety check) was part of the process. |
All trajectory metrics, except trajectory_single_tool_use, require a predicted_trajectory and a reference_trajectory as input parameters.
Exact match
The trajectory_exact_match metric returns a score of 1 if the predicted trajectory is identical to the reference trajectory, with the same tool calls in the same order. Otherwise, it returns 0.
In-order match
The trajectory_in_order_match metric returns a score of 1 if the predicted trajectory contains all the tool calls from the reference trajectory in the same order. Extra tool calls are permitted. Otherwise, it returns 0.
Any-order match
The trajectory_any_order_match metric returns a score of 1 if the predicted trajectory contains all the tool calls from the reference trajectory, regardless of their order. Extra tool calls are permitted. Otherwise, it returns 0.
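To make these three match metrics concrete, the following is a minimal sketch of the comparisons they describe, written as plain Python over the trajectory format used later in this guide (a list of tool-call dictionaries). It illustrates the definitions above and is not the service's internal implementation.

def exact_match(predicted, reference):
    # Score 1 if the trajectories are identical: same tool calls, same order.
    return int(predicted == reference)

def in_order_match(predicted, reference):
    # Score 1 if all reference calls appear in the predicted trajectory in order
    # (extra predicted calls are allowed).
    remaining = iter(predicted)
    return int(all(any(call == candidate for candidate in remaining) for call in reference))

def any_order_match(predicted, reference):
    # Score 1 if every reference call appears somewhere in the predicted trajectory
    # (order and extra calls are ignored).
    return int(all(call in predicted for call in reference))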
Precision
The trajectory_precision metric measures how many of the tool calls in the predicted trajectory are relevant according to the reference trajectory. The score is a float in the range of [0,1].
Precision is calculated by dividing the number of actions in the predicted trajectory that also appear in the reference trajectory by the total number of actions in the predicted trajectory.
Recall
The trajectory_recall metric measures how many of the essential tool calls from the reference trajectory are present in the predicted trajectory. The score is a float in the range of [0,1].
Recall is calculated by dividing the number of actions in the reference trajectory that also appear in the predicted trajectory by the total number of actions in the reference trajectory.
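A corresponding illustrative sketch of the precision and recall calculations (again plain Python over the same trajectory format, not the service's implementation):

def trajectory_precision(predicted, reference):
    # Fraction of predicted tool calls that also appear in the reference trajectory.
    if not predicted:
        return 0.0
    return sum(call in reference for call in predicted) / len(predicted)

def trajectory_recall(predicted, reference):
    # Fraction of reference tool calls that also appear in the predicted trajectory.
    if not reference:
        return 0.0
    return sum(call in predicted for call in reference) / len(reference)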
Single tool use
The trajectory_single_tool_use metric checks if a specific tool, specified in the metric spec, is used in the predicted trajectory. It doesn't check the order of tool calls or how many times the tool is used. It returns 1 if the tool is present and 0 if it is absent.
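An illustrative sketch of this check, where tool_name comes from the metric spec:

def single_tool_use(predicted, tool_name):
    # Score 1 if the named tool appears at least once in the predicted trajectory.
    return int(any(call["tool_name"] == tool_name for call in predicted))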
Default performance metrics
The following performance metrics are added to the evaluation results by default. You don't need to specify them in EvalTask.
| Metric | Description |
| --- | --- |
| latency | Time taken by the agent to return a response, in seconds. |
| failure | A boolean value that indicates whether the agent invocation resulted in an error. |
Prepare your evaluation dataset
For computation-based trajectory evaluation, your dataset needs to provide the following information:
predicted_trajectory: The list of tool calls used by the agent to reach the final response.
reference_trajectory: The expected tool use for the agent to satisfy the query. This is not required for the trajectory_single_tool_use metric.
Evaluation dataset examples
The following examples show datasets for trajectory evaluation. reference_trajectory is required for all metrics except trajectory_single_tool_use.
import pandas as pd

reference_trajectory = [
    # example 1
    [
        {"tool_name": "set_device_info", "tool_input": {"device_id": "device_2", "updates": {"status": "OFF"}}},
    ],
    # example 2
    [
        {"tool_name": "get_user_preferences", "tool_input": {"user_id": "user_y"}},
        {"tool_name": "set_temperature", "tool_input": {"location": "Living Room", "temperature": 23}},
    ],
]

predicted_trajectory = [
    # example 1
    [
        {"tool_name": "set_device_info", "tool_input": {"device_id": "device_3", "updates": {"status": "OFF"}}},
    ],
    # example 2
    [
        {"tool_name": "get_user_preferences", "tool_input": {"user_id": "user_z"}},
        {"tool_name": "set_temperature", "tool_input": {"location": "Living Room", "temperature": 23}},
    ],
]

eval_dataset = pd.DataFrame({
    "predicted_trajectory": predicted_trajectory,
    "reference_trajectory": reference_trajectory,
})
Import your evaluation dataset
You can import your dataset in the following formats:
JSONL or CSV file stored in Cloud Storage
BigQuery table
Pandas DataFrame
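As a hedged sketch, each of these sources might be supplied through the EvalTask dataset parameter. The bucket, table, and metric names below are placeholders, the import path assumes the vertexai.preview.evaluation SDK, and exact source support can vary by SDK version.

# Sketch of supplying each dataset source to an evaluation task
# (assumes the vertexai.preview.evaluation SDK; names are placeholders).
from vertexai.preview.evaluation import EvalTask

# Pandas DataFrame (for example, the eval_dataset built above)
eval_task = EvalTask(dataset=eval_dataset, metrics=["trajectory_exact_match"])

# JSONL file stored in Cloud Storage
eval_task = EvalTask(dataset="gs://your-bucket/eval_dataset.jsonl", metrics=["trajectory_exact_match"])

# BigQuery table
eval_task = EvalTask(dataset="bq://your-project.your_dataset.your_table", metrics=["trajectory_exact_match"])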
The Gen AI evaluation service provides example public datasets to demonstrate how you can evaluate your agents. The following code shows how to import the public datasets from a Cloud Storage bucket:
# dataset name to be imported
dataset = "on-device"  # Alternatives: "customer-support", "content-creation"

# copy the tools and dataset file
!gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/{dataset}/tools.py .
!gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/{dataset}/eval_dataset.json .

# load the dataset examples
import json

eval_dataset = json.loads(open('eval_dataset.json').read())

# run the tools file
%run -i tools.py
where dataset is one of the following public datasets:
"on-device" for an On-Device Home Assistant, which controls home devices. The agent helps with queries such as "Schedule the air conditioning in the bedroom so that it is on between 11pm and 8am, and off the rest of the time."
"customer-support" for a Customer Support Agent. The agent helps with queries such as "Can you cancel any pending orders and escalate any open support tickets?"
"content-creation" for a Marketing Content Creation Agent. The agent helps with queries such as "Reschedule campaign X to be a one-time campaign on social media site Y with a 50% reduced budget, only on December 25, 2024."
Run an evaluation
For agent evaluation, you can mix response evaluation metrics and trajectory evaluation metrics in the same task.
You can customize a large language model-based metric for trajectory evaluation using a templated interface or from scratch. For more details, see model-based metrics. The following is a templated example:
from vertexai.preview.evaluation import PointwiseMetric, PointwiseMetricPromptTemplate  # import path can vary by SDK version

response_follows_trajectory_prompt_template = PointwiseMetricPromptTemplate(
    criteria={
        "Follows trajectory": (
            "Evaluate whether the agent's response logically follows from the "
            "sequence of actions it took. Consider these sub-points:\n"
            " - Does the response reflect the information gathered during the trajectory?\n"
            " - Is the response consistent with the goals and constraints of the task?\n"
            " - Are there any unexpected or illogical jumps in reasoning?\n"
            "Provide specific examples from the trajectory and response to support your evaluation."
        )
    },
    rating_rubric={
        "1": "Follows trajectory",
        "0": "Does not follow trajectory",
    },
    input_variables=["prompt", "predicted_trajectory"],
)

response_follows_trajectory_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=response_follows_trajectory_prompt_template,
)
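The following sketch shows how this model-based metric and computation-based trajectory metrics might be combined in a single EvalTask run. It assumes the vertexai.preview.evaluation SDK; the agent object, the experiment name, and the evaluate(runnable=...) call shape are assumptions to adapt to your environment.

# Minimal sketch: mix a model-based response metric with trajectory metrics in one task.
from vertexai.preview.evaluation import EvalTask

eval_task = EvalTask(
    dataset=eval_dataset,  # evaluation dataset prepared earlier in this guide
    metrics=[
        response_follows_trajectory_metric,  # model-based metric defined above
        "trajectory_exact_match",
        "trajectory_precision",
    ],
    experiment="agent-eval-demo",  # placeholder experiment name
)

# Trigger the agent and compute all metrics in a single Vertex AI SDK call.
eval_result = eval_task.evaluate(runnable=agent)  # `agent` is your runnable agent or function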
You can also define a custom computation-based metric for trajectory evaluation or response evaluation.
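As an illustration of the kind of logic a computation-based metric can encode, the following hypothetical function scores whether the agent's final tool call matches the final reference tool call. The instance dictionary keys and the returned dictionary shape are assumptions for illustration; how you register such a function with the SDK depends on your SDK version.

# Hypothetical computation-based metric: does the last predicted tool call
# match the last reference tool call?
def final_tool_call_match(instance: dict) -> dict:
    predicted = instance.get("predicted_trajectory") or []
    reference = instance.get("reference_trajectory") or []
    score = int(bool(predicted) and bool(reference) and predicted[-1] == reference[-1])
    return {"final_tool_call_match": score}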
View evaluation results
The evaluation results are displayed in separate tables for final response metrics and trajectory metrics, and contain the following information:
Final response metrics
Instance-level results
| Column | Description |
| --- | --- |
| response | Final response generated by the agent. |
| latency_in_seconds | Time taken to generate the response, in seconds. |
| failure | Indicates whether a valid response was generated. |
| score | A score calculated for the response, as specified in the metric spec. |
| explanation | The explanation for the score, as specified in the metric spec. |
Aggregate results
| Column | Description |
| --- | --- |
| mean | Average score across all instances. |
| standard deviation | Standard deviation of all the scores. |
Trajectory metrics
Instance-level results
| Column | Description |
| --- | --- |
| predicted_trajectory | Sequence of tool calls followed by the agent to reach the final response. |
| reference_trajectory | Sequence of expected tool calls. |
| score | A score calculated for the predicted trajectory against the reference trajectory, as specified in the metric spec. |
| latency_in_seconds | Time taken to generate the response, in seconds. |
| failure | Indicates whether a valid response was generated. |
Aggregate results
| Column | Description |
| --- | --- |
| mean | Average score across all instances. |
| standard deviation | Standard deviation of all the scores. |
Agent2Agent (A2A) protocol
If you are building a multi-agent system, consider reviewing the A2A Protocol. The A2A Protocol is an open standard that enables seamless communication and collaboration between AI agents, regardless of their underlying frameworks. Google Cloud donated it to the Linux Foundation in June 2025. To use the A2A SDKs or try out the samples, see the GitHub repository.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-21 UTC."],[],[],null,[]]