After you build and evaluate your Gen AI model, you might use the model to build an agent such as a chatbot. The Gen AI evaluation service lets you measure your agent's ability to complete tasks and goals for your use case.
Overview
You have the following options to evaluate your agent:
Final response evaluation: Evaluate the final output of an agent (whether or not the agent achieved its goal).
Trajectory evaluation: Evaluate the path (sequence of tool calls) the agent took to reach the final response.
With the Gen AI evaluation service, you can trigger an agent execution and get metrics for both trajectory evaluation and final response evaluation in one Vertex AI SDK query.
Supported agents
The Gen AI evaluation service supports the following categories of agents:
Supported agents | Description |
---|---|
Agent built with Reasoning Engine's template | Reasoning Engine (LangChain on Vertex AI) is a Google Cloud platform where you can deploy and manage agents. |
LangChain agents built using Reasoning Engine's customizable template | LangChain is an open source platform. |
Custom agent function | A flexible function that takes a prompt as input and returns the agent's response and trajectory in a dictionary. |
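For example, a custom agent function might look like the following sketch. The function name and the output keys ("response" and "predicted_trajectory") are illustrative assumptions; check the Gen AI evaluation service reference for the exact contract your SDK version expects.
def custom_agent(prompt: str) -> dict:
    # Sketch only: call your own agent here and record each tool call it makes.
    # The output keys below are assumptions for illustration.
    predicted_trajectory = [
        {"tool_name": "get_weather", "tool_input": {"location": "Paris"}},
    ]
    response = "It is 21°C and sunny in Paris."
    return {
        "response": response,
        "predicted_trajectory": predicted_trajectory,
    }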
Defining metrics for agent evaluation
Define your metrics for final response or trajectory evaluation:
Final response evaluation
Final response evaluation follows the same process as model response evaluation. For more information, see Define your evaluation metrics.
Trajectory evaluation
The following metrics help you evaluate the agent's ability to follow the expected trajectory:
Exact match
If the predicted trajectory is identical to the reference trajectory, with the exact same tool calls in the exact same order, the trajectory_exact_match metric returns a score of 1; otherwise, it returns 0.
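The service computes this metric for you; conceptually, the comparison is equivalent to the following sketch, where each trajectory is a list of tool-call dictionaries and the helper function is illustrative rather than part of the SDK.
def exact_match_score(predicted: list[dict], reference: list[dict]) -> int:
    # 1 only if both trajectories have the same tool calls in the same order.
    return 1 if predicted == reference else 0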
Metric input parameters
Input parameter | Description |
---|---|
predicted_trajectory | The list of tool calls used by the agent to reach the final response. |
reference_trajectory | The expected tool use for the agent to satisfy the query. |
Output scores
Value | Description |
---|---|
0 | Predicted trajectory doesn't match the reference. |
1 | Predicted trajectory matches the reference. |
In-order match
If the predicted trajectory contains all of the tool calls from the reference trajectory in the same order (extra tool calls are allowed), the trajectory_in_order_match metric returns a score of 1; otherwise, it returns 0.
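Conceptually, this is a subsequence check: every reference tool call must appear in the predicted trajectory in the same relative order. The following sketch is illustrative, not an SDK function.
def in_order_match_score(predicted: list[dict], reference: list[dict]) -> int:
    # 1 if the reference trajectory is an in-order subsequence of the predicted
    # trajectory; extra predicted tool calls are allowed.
    remaining = iter(predicted)
    matched = all(any(call == candidate for candidate in remaining) for call in reference)
    return 1 if matched else 0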
Metric input parameters
Input parameter | Description |
---|---|
predicted_trajectory | The predicted trajectory used by the agent to reach the final response. |
reference_trajectory | The expected trajectory for the agent to satisfy the query. |
Output scores
Value | Description |
---|---|
0 | The tool calls in the predicted trajectory don't match the order in the reference trajectory. |
1 | Predicted trajectory matches the reference. |
Any-order match
If the predicted trajectory contains all of the tool calls from the reference trajectory, in any order and possibly with extra tool calls, the trajectory_any_order_match metric returns a score of 1; otherwise, it returns 0.
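Conceptually, every reference tool call must be matched somewhere in the predicted trajectory, regardless of order, as in this illustrative sketch (not an SDK function).
def any_order_match_score(predicted: list[dict], reference: list[dict]) -> int:
    # 1 if every reference tool call appears in the predicted trajectory, in any
    # order; extra predicted tool calls are allowed.
    unmatched = list(predicted)
    for call in reference:
        if call not in unmatched:
            return 0
        unmatched.remove(call)  # don't match the same predicted call twice
    return 1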
Metric input parameters
Input parameter | Description |
---|---|
predicted_trajectory | The list of tool calls used by the agent to reach the final response. |
reference_trajectory | The expected tool use for the agent to satisfy the query. |
Output scores
Value | Description |
---|---|
0 | Predicted trajectory doesn't contain all the tool calls in the reference trajectory. |
1 | Predicted trajectory matches the reference. |
Precision
The trajectory_precision metric measures how many of the tool calls in the predicted trajectory are actually relevant or correct according to the reference trajectory.
Precision is calculated as follows: Count how many actions in the predicted trajectory also appear in the reference trajectory. Divide that count by the total number of actions in the predicted trajectory.
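The following illustrative sketch applies that calculation directly; the helper function and its handling of an empty predicted trajectory are assumptions, not the SDK implementation.
def precision_score(predicted: list[dict], reference: list[dict]) -> float:
    # Fraction of predicted tool calls that also appear in the reference trajectory.
    if not predicted:
        return 0.0  # assumption for the sketch; the service defines this edge case
    matched = sum(1 for call in predicted if call in reference)
    return matched / len(predicted)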
Metric input parameters
Input parameter | Description |
---|---|
predicted_trajectory | The list of tool calls used by the agent to reach the final response. |
reference_trajectory | The expected tool use for the agent to satisfy the query. |
Output scores
Value | Description |
---|---|
A float in the range [0,1] | The higher the score, the more precise the predicted trajectory. |
Recall
The trajectory_recall metric measures how many of the essential tool calls from the reference trajectory are actually captured in the predicted trajectory.
Recall is calculated as follows: Count how many actions in the reference trajectory also appear in the predicted trajectory. Divide that count by the total number of actions in the reference trajectory.
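As an illustrative sketch (again, not the SDK implementation), the same calculation looks like this:
def recall_score(predicted: list[dict], reference: list[dict]) -> float:
    # Fraction of reference tool calls that are captured in the predicted trajectory.
    if not reference:
        return 0.0  # assumption for the sketch; the service defines this edge case
    matched = sum(1 for call in reference if call in predicted)
    return matched / len(reference)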
Metric input parameters
Input parameter | Description |
---|---|
predicted_trajectory | The list of tool calls used by the agent to reach the final response. |
reference_trajectory | The expected tool use for the agent to satisfy the query. |
Output scores
Value | Description |
---|---|
A float in the range [0,1] | The higher the score, the better the recall of the predicted trajectory. |
Single tool use
The trajectory_single_tool_use metric checks whether a specific tool specified in the metric spec is used in the predicted trajectory. It doesn't check the order of tool calls or how many times the tool is used, only whether it's present.
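Conceptually, the check is a simple membership test on tool names, as in this illustrative sketch (not an SDK function).
def single_tool_use_score(predicted: list[dict], tool_name: str) -> int:
    # 1 if any tool call in the predicted trajectory uses the specified tool.
    return 1 if any(call.get("tool_name") == tool_name for call in predicted) else 0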
Metric input parameters
Input parameter | Description |
---|---|
predicted_trajectory | The list of tool calls used by the agent to reach the final response. |
Output scores
Value | Description |
---|---|
0 | The tool is absent. |
1 | The tool is present. |
In addition, the following two agent performance metrics are added to the evaluation results by default. You don't need to specify them in EvalTask.
latency
Time taken by the agent to return a response.
Value | Description |
---|---|
A float | Calculated in seconds. |
failure
A boolean that indicates whether the agent invocation resulted in an error or succeeded.
Output scores
Value | Description |
---|---|
1 | Error |
0 | Valid response returned |
Prepare your dataset for agent evaluation
Prepare your dataset for final response or trajectory evaluation.
The data schema for final response evaluation is similar to that of model response evaluation.
For computation-based trajectory evaluation, your dataset needs to provide the following information:
Input type | Input field contents |
---|---|
predicted_trajectory | The list of tool calls used by the agent to reach the final response. |
reference_trajectory (not required for the trajectory_single_tool_use metric) | The expected tool use for the agent to satisfy the query. |
Evaluation dataset examples
The following examples show datasets for trajectory evaluation. Note that reference_trajectory is required for all metrics except trajectory_single_tool_use.
reference_trajectory = [
# example 1
[
{
"tool_name": "set_device_info",
"tool_input": {
"device_id": "device_2",
"updates": {
"status": "OFF"
}
}
}
],
# example 2
[
{
"tool_name": "get_user_preferences",
"tool_input": {
"user_id": "user_y"
}
},
{
"tool_name": "set_temperature",
"tool_input": {
"location": "Living Room",
"temperature": 23
}
},
]
]
predicted_trajectory = [
# example 1
[
{
"tool_name": "set_device_info",
"tool_input": {
"device_id": "device_3",
"updates": {
"status": "OFF"
}
}
}
],
# example 2
[
{
"tool_name": "get_user_preferences",
"tool_input": {
"user_id": "user_z"
}
},
{
"tool_name": "set_temperature",
"tool_input": {
"location": "Living Room",
"temperature": 23
}
},
]
]
import pandas as pd

eval_dataset = pd.DataFrame({
"predicted_trajectory": predicted_trajectory,
"reference_trajectory": reference_trajectory,
})
Import your evaluation dataset
You can import your dataset in the following formats:
JSONL or CSV file stored in Cloud Storage
BigQuery table
Pandas DataFrame
The following code demonstrates how to import example datasets from a Cloud Storage bucket:
# dataset name to be imported
dataset = "on-device"  # other available examples: "customer-support", "content-creation"
# copy the tools and dataset file
!gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/{dataset}/tools.py .
!gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/{dataset}/eval_dataset.json .
# load the dataset examples
import json

with open('eval_dataset.json') as f:
    eval_dataset = json.load(f)
# run the tools file
%run -i tools.py
Run agent evaluation
Run an evaluation for trajectory or final response evaluation:
For agent evaluation, you can mix response evaluation metrics and trajectory evaluation metrics, as in the following code:
single_tool_use_metric = TrajectorySingleToolUse(tool_name='tool_name')
eval_task = EvalTask(
dataset=EVAL_DATASET,
metrics=[
"rouge_l_sum",
"bleu",
custom_response_eval_metric,
"trajectory_exact_match",
"trajectory_precision",
single_tool_use_metric,
pointwise_trajectory_eval_metric,  # LLM-based metric
],
)
eval_result = eval_task.evaluate(
runnable=RUNNABLE,
)
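The evaluate() call returns an evaluation result object. Assuming it exposes summary_metrics and metrics_table as in the model evaluation workflow (an assumption to verify against your SDK version), you can inspect both views like this:
# Aggregate scores (for example, mean and standard deviation) across all instances.
print(eval_result.summary_metrics)
# Per-instance results, including trajectories, scores, latency, and failure.
print(eval_result.metrics_table)  # a pandas DataFrame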
Metric customization
You can customize a large language model-based metric for trajectory evaluation using a templated interface or from scratch. You can also define a custom computation-based metric for trajectory evaluation.
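For example, a computation-based trajectory metric can often be expressed as a plain function over a single dataset row. The sketch below is illustrative only: the row keys match the dataset columns shown earlier, but the function name, the returned key, and the way you register the function with EvalTask depend on the SDK's metric customization interface, so check that documentation for the exact wrapper to use.
def tool_name_overlap(instance: dict) -> dict:
    # Illustrative custom metric: the fraction of predicted tool calls whose
    # tool_name also appears in the reference trajectory.
    predicted = instance["predicted_trajectory"]
    reference = instance["reference_trajectory"]
    reference_names = {call["tool_name"] for call in reference}
    if not predicted:
        return {"tool_name_overlap": 0.0}
    matched = sum(1 for call in predicted if call["tool_name"] in reference_names)
    return {"tool_name_overlap": matched / len(predicted)}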
View and interpret results
For trajectory evaluation or final response evaluation, the evaluation results contain the following information:
Final response metrics
Instance-level results
Column | Description |
---|---|
response | Final response generated by the agent. |
latency_in_seconds | Time taken to generate the response. |
failure | Indicates whether a valid response was generated. |
score | A score calculated for the response specified in the metric spec. |
explanation | The explanation for the score specified in the metric spec. |
Aggregate results
Column | Description |
---|---|
mean | Average score for all instances. |
standard deviation | Standard deviation for all the scores. |
Trajectory metrics
Instance-level results
Column | Description |
---|---|
predicted_trajectory | Sequence of tool calls followed by the agent to reach the final response. |
reference_trajectory | Sequence of expected tool calls. |
score | A score calculated for the predicted trajectory and reference trajectory specified in the metric spec. |
latency_in_seconds | Time taken to generate the response. |
failure | Indicates whether a valid response was generated. |
Aggregate results
Column | Description |
---|---|
mean | Average score for all instances. |
standard deviation | Standard deviation for all the scores. |