Evaluation dataset

The evaluation dataset typically consists of the model response that you want to evaluate, the input data used to generate that response, and, optionally, the ground truth response. The following inputs are used to construct the evaluation dataset.

  • response: Your LLM inference response to be evaluated.
  • instruction: The inference instruction and prompt that is sent to your LLM.
  • context: The context that your LLM response is based on. For the summarization task, this is the text that the LLM summarizes. For question-answering tasks, this is the background information provided for the LLM to answer the open-book question.
  • reference: The ground truth to compare your LLM response to.
  • baseline_response: The baseline LLM inference response that your LLM response is compared to in the side-by-side evaluation. This is also known as the baseline response.

The required inputs for the evaluation dataset differ based on the evaluation paradigm and metric you choose, as well as the nature of the tasks themselves. For a complete list of metrics and their expected inputs, see Task and Metrics.
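For example, a single record (one line) of a JSONL evaluation dataset for a question-answering task might look like the following. The field values are illustrative only, and the columns that you actually need depend on the metrics that you select.

    {"instruction": "When did the Golden Gate Bridge open? Answer using only the provided context.", "context": "The Golden Gate Bridge opened to vehicle traffic on May 28, 1937.", "response": "The Golden Gate Bridge opened in 1937.", "reference": "1937"}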

How to use the evaluation dataset

After preparing the evaluation dataset, you can use it with the rapid evaluation Python SDK or through the evaluation pipelines service. The dataset can be imported from locations such as Cloud Storage. Vertex AI also provides pre-processed Kaggle datasets so that you can set up your evaluation workflow before your customized dataset is ready to use. For details about how to consume the dataset, see Perform evaluation.
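As a minimal sketch of the SDK path, the following example assumes that the rapid evaluation SDK is available through the vertexai.preview.evaluation module and that the dataset already contains the model responses to evaluate. Exact class names, metric names, and required columns can vary by SDK version, so treat this as a starting point rather than a definitive implementation.

    # Minimal sketch: evaluate pre-generated responses with the rapid evaluation SDK.
    # Assumes vertexai.preview.evaluation.EvalTask; verify against your SDK version.
    import pandas as pd
    import vertexai
    from vertexai.preview.evaluation import EvalTask

    vertexai.init(project="your-project-id", location="us-central1")  # placeholder values

    # Bring-your-own-response dataset: each row already includes the model output.
    eval_dataset = pd.DataFrame(
        {
            "instruction": ["Summarize the following bill."],
            "context": ["The bill establishes a grant program to expand rural broadband access..."],
            "response": ["The bill creates a grant program for rural broadband."],
            "reference": ["Establishes a grant program to expand rural broadband access."],
        }
    )

    eval_task = EvalTask(
        dataset=eval_dataset,
        # Metric names and their required columns depend on your task;
        # see Task and Metrics for the authoritative list.
        metrics=["rouge_l_sum", "fluency"],
    )
    result = eval_task.evaluate()
    print(result.summary_metrics)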

Use a customized dataset

The generative AI evaluation service can consume your evaluation dataset in multiple ways. The Python SDK and the pipelines have different requirements for the evaluation dataset input format. For information about how to import datasets in the Python SDK and pipelines, see the Evaluation examples.

Python SDK
  Supported dataset locations and formats:
  • JSONL or CSV file stored in Cloud Storage
  • BigQuery table
  • Pandas DataFrame
  Required inputs: The format should be consistent with the selected metric's input requirements, as described in Task and Metrics. These columns might be required:
  • response
  • reference
  • instruction
  • context
Computation-based pipeline
  Supported dataset locations and formats:
  • JSONL file stored in Cloud Storage
  Required inputs (see the example record after this table):
  • input_text
  • output_text
AutoSxS pipeline
  Supported dataset locations and formats:
  • JSONL file stored in Cloud Storage
  • BigQuery table
  Required inputs: The format should be consistent with what each model needs for inference and with the parameters expected by the autorater for the evaluation task. Input parameters include the following:
  • ID columns
  • Input text for inference or pre-generated predictions
  • Autorater prompt parameters
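For example, one record (one line) of a JSONL file for the computation-based pipeline provides the input_text and output_text fields. The values below are illustrative only:

    {"input_text": "Summarize the following bill: The bill establishes a grant program to expand rural broadband access...", "output_text": "The bill creates a grant program for rural broadband."}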

Use a Kaggle dataset

If your customized dataset isn't ready to use with the generative AI evaluation service, Vertex AI provides pre-processed Kaggle datasets. These datasets support tasks including text generation, summarization, and question answering, and they have been transformed into the following formats that can be used by the Python SDK and the pipelines.

BillSum
  Supported tasks:
  • General text generation
  • Summarization
  Preprocessed datasets and Cloud Storage URLs:
  • summaries_evaluation.jsonl: gs://cloud-ai-public-datasets/kaggle/akornilo/billsum/evaluation/summaries_evaluation.jsonl
  • summaries_evaluation_autorater.jsonl: gs://cloud-ai-public-datasets/kaggle/akornilo/billsum/evaluation/summaries_evaluation_autorater.jsonl
  • summaries_evaluation_for_sdk.jsonl: gs://cloud-ai-public-datasets/kaggle/akornilo/billsum/evaluation/summaries_evaluation_for_sdk.jsonl
  Supported features:
  • Computation-based pipeline
  • AutoSxS pipeline
  • Rapid evaluation Python SDK
Medical Transcriptions
  Supported tasks:
  • Text classification
  Preprocessed datasets and Cloud Storage URLs:
  • medical_speciality_from_transcription.jsonl: gs://cloud-ai-public-datasets/kaggle/tboyle10/medicaltranscriptions/evaluation/medical_speciality_from_transcription.jsonl
  • medical_speciality_from_transcription_autorater.jsonl: gs://cloud-ai-public-datasets/kaggle/tboyle10/medicaltranscriptions/evaluation/medical_speciality_from_transcription_autorater.jsonl
  Supported features:
  • Computation-based pipeline
  • AutoSxS pipeline

When you use these datasets, you can start by sampling a small portion of rows to test the workflow instead of using the full dataset. The datasets listed above have Requester Pays turned on, which means that you incur data processing and network usage charges when you access them.
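As a sketch, you can read one of the preprocessed JSONL files directly into a Pandas DataFrame and sample a few rows before committing to a full run. This assumes that the gcsfs package is installed so that Pandas can read gs:// paths and that your credentials and billing project allow Requester Pays access.

    # Sample a few rows from a preprocessed Kaggle dataset to test the workflow.
    # Requires gcsfs for gs:// support; because the bucket uses Requester Pays,
    # your project is billed for the request, and you may also need to pass
    # "project": "your-project-id" in storage_options.
    import pandas as pd

    uri = (
        "gs://cloud-ai-public-datasets/kaggle/akornilo/billsum/"
        "evaluation/summaries_evaluation_for_sdk.jsonl"
    )
    df = pd.read_json(uri, lines=True, storage_options={"requester_pays": True})

    # Iterate on the evaluation setup with a small random sample first.
    sample = df.sample(n=10, random_state=42)
    print(sample.head())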

What's next