Prepare your evaluation dataset

This page describes how to prepare your dataset for the Gen AI evaluation service.

Overview

The Gen AI evaluation service automatically detects and handles several common data formats. This means you can often use your data as-is without needing to perform manual conversions.

The fields you need to provide in your dataset depend on your goal:

  • Generate new responses and then evaluate them. Required data: prompt. SDK workflow: run_inference(), then evaluate().

  • Evaluate existing responses. Required data: prompt and response. SDK workflow: evaluate().

When running client.evals.evaluate(), the Gen AI evaluation service automatically looks for the following common fields in your dataset:

  • prompt: (Required) The input to the model that you want to evaluate. For best results, you should provide example prompts that represent the types of inputs that your models process in production.

  • response: (Required) The output generated by the model or application that is being evaluated.

  • reference: (Optional) The ground truth or "golden" answer that you can compare the model's response against. This field is often required for computation-based metrics like bleu and rouge.

  • conversation_history: (Optional) A list of preceding turns in a multi-turn conversation. The Gen AI evaluation service automatically extracts this field from supported formats. For more information, see Handling multi-turn conversations.
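
For example, to evaluate existing responses, your dataset only needs prompt, response, and optionally reference. The following is a minimal sketch of that workflow; it assumes the vertexai SDK's client interface, and the project ID, location, and data values are placeholders:

import pandas as pd
import vertexai
from vertexai import types

# Initialize the client used by client.evals.evaluate()
# (the project ID and location are placeholders)
client = vertexai.Client(project="your-project-id", location="us-central1")

# A dataset with existing responses: prompt, response, and an optional reference
eval_df = pd.DataFrame({
    "prompt": ["What is the capital of France?"],
    "response": ["The capital of France is Paris."],
    "reference": ["Paris"],
})

# Evaluate the existing responses directly
eval_result = client.evals.evaluate(
    dataset=eval_df,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY],
)
eval_result.show()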

Supported data formats

The Gen AI evaluation service supports the following formats:

Pandas DataFrame

For straightforward evaluations, you can use a pandas.DataFrame. The Gen AI evaluation service looks for common column names like prompt, response, and reference. This format is fully backward-compatible.

import pandas as pd

# Example DataFrame with prompts and ground truth references
prompts_df = pd.DataFrame({
    "prompt": [
        "What is the capital of France?",
        "Who wrote 'Hamlet'?",
    ],
    "reference": [
        "Paris",
        "William Shakespeare",
    ]
})

# You can use this DataFrame directly with run_inference or evaluate
eval_dataset = client.evals.run_inference(model="gemini-2.5-flash", src=prompts_df)
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
)
eval_result.show()

Gemini batch prediction format

You can directly use the output of a Vertex AI batch prediction job, which is typically a set of JSONL files stored in Cloud Storage where each line contains a request and a response object. The Gen AI evaluation service parses this structure automatically, which provides integration with other Vertex AI services.

The following is an example of a single line in a JSONL file:

{"request": {"contents": [{"role": "user", "parts": [{"text": "Why is the sky blue?"}]}]}, "response": {"candidates": [{"content": {"role": "model", "parts": [{"text": "The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering."}]}}]}}

You can then evaluate pre-generated responses from a batch job directly:

# Cloud Storage path to your batch prediction output file
batch_job_output_uri = "gs://path/to/your/batch_output.jsonl"

# Evaluate the pre-generated responses directly
eval_result = client.evals.evaluate(
    dataset=batch_job_output_uri,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
)
eval_result.show()

OpenAI Chat Completion format

For evaluating or comparing with third-party models, such as those from OpenAI and Anthropic, the Gen AI evaluation service supports the OpenAI Chat Completion format. You can supply a dataset where each row is a JSON object structured like an OpenAI API request. The Gen AI evaluation service automatically detects this format.

The following is an example of a single line in this format:

{"request": {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's the capital of France?"}], "model": "gpt-4o"}}

You can use this data to generate responses from a third-party model and evaluate the responses:

# Ensure your third-party API key is set
# e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key'

openai_request_uri = "gs://path/to/your/openai_requests.jsonl"

# Generate responses using a LiteLLM-supported model string
openai_responses = client.evals.run_inference(
    model="gpt-4o",  # LiteLLM compatible model string
    src=openai_request_uri,
)

# The resulting openai_responses object can then be evaluated
eval_result = client.evals.evaluate(
    dataset=openai_responses,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
)
eval_result.show()

Handling multi-turn conversations

The Gen AI evaluation service automatically parses multi-turn conversation data from supported formats. When your input data includes a history of exchanges (such as within the request.contents field in the Gemini format, or request.messages in the OpenAI format), the Gen AI evaluation service identifies the previous turns and processes them as conversation_history.

This means you don't need to manually separate the current prompt from the prior conversation, since the evaluation metrics can use the conversation history to understand the context of the model's response.

Consider the following example of a multi-turn conversation in Gemini format:

{
  "request": {
    "contents": [
      {"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]},
      {"role": "model", "parts": [{"text": "That sounds wonderful! What time of year are you going?"}]},
      {"role": "user", "parts": [{"text": "I'm thinking next spring. What are some must-see sights?"}]}
    ]
  },
  "response": {
    "candidates": [
      {"content": {"role": "model", "parts": [{"text": "For spring in Paris, you should definitely visit the Eiffel Tower, the Louvre Museum, and wander through Montmartre."}]}}
    ]
  }
}

The multi-turn conversation is automatically parsed as follows:

  • prompt: The last user message is identified as the current prompt ({"role": "user", "parts": [{"text": "I'm thinking next spring. What are some must-see sights?"}]}).

  • conversation_history: The preceding messages are automatically extracted and made available as the conversation history ([{"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]}, {"role": "model", "parts": [{"text": "That sounds wonderful! What time of year are you going?"}]}]).

  • response: The model's reply is taken from the response field ({"role": "model", "parts": [{"text": "For spring in Paris..."}]}).
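
You can pass multi-turn records to evaluate() in the same way as the earlier examples. The following sketch assumes the same client setup as before and uses a placeholder Cloud Storage path:

# Cloud Storage path to multi-turn records in the Gemini format (placeholder path)
multi_turn_uri = "gs://path/to/your/multi_turn_conversations.jsonl"

# conversation_history is extracted automatically, so no manual preprocessing is needed
eval_result = client.evals.evaluate(
    dataset=multi_turn_uri,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY],
)
eval_result.show()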

What's next