The evaluation dataset typically consists of the model response that you want to evaluate and the input data used to generate that response; it might also include the ground truth response. The following table describes the inputs required to construct the evaluation dataset.
Input type | Input field contents |
---|---|
response | Your LLM inference response to be evaluated. |
instruction | The inference instruction and prompt that is sent to your LLM. |
context | The context your LLM response is based on. For the summarization task, this is the text the LLM summarizes. For question-answering tasks, this is the background information provided for the LLM to answer the open-book question. |
reference | The ground truth to compare your LLM response to. |
baseline_response | The baseline LLM inference response that your LLM response is compared against in side-by-side evaluation. |
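For example, one record in a question-answering evaluation dataset could look like the following Python dictionary (one JSONL line per record when stored as a file). This is a minimal sketch; all field values are illustrative placeholders.

```python
# One evaluation dataset record for a question-answering task.
# All values are illustrative placeholders, not real dataset content.
record = {
    "instruction": "Answer the question using only the context provided.",
    "context": "The Gateway Arch in St. Louis is 630 feet tall.",
    "response": "The Gateway Arch is 630 feet tall.",  # LLM response to evaluate
    "reference": "The arch stands 630 feet (192 m) tall.",  # ground truth
    "baseline_response": "It is 630 feet.",  # only needed for side-by-side evaluation
}
```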
The required inputs for the evaluation dataset differ based on the evaluation paradigm and metric you choose, as well as the nature of the tasks themselves. For a complete list of metrics and their expected inputs, see Task and Metrics.
How to use the evaluation dataset
After you prepare the evaluation dataset, you can use it with the rapid evaluation Python SDK or through the evaluation pipelines service. The dataset can be imported from locations such as Cloud Storage. Vertex AI also provides pre-processed Kaggle datasets so that you can set up your evaluation workflow before your customized dataset is ready to use. For details about how to consume the dataset, see Perform evaluation.
Use a customized dataset
The generative AI evaluation service can consume your evaluation dataset in multiple ways. The Python SDK and the pipelines have different requirements for the evaluation dataset input format. For information about how to import datasets with the Python SDK and pipelines, see the Evaluation examples.
Generative AI evaluation service features | Supported dataset locations and format | Required inputs |
---|---|---|
Python SDK | JSONL or CSV file stored in Cloud Storage<br>BigQuery table<br>Pandas DataFrame | The format must be consistent with the input requirements of the selected metrics, as described in Task and Metrics. Columns such as response, reference, instruction, and context from the preceding table might be required. |
Computation-based pipeline | JSONL file stored in Cloud Storage | input_text<br>output_text |
AutoSxS pipeline | JSONL file stored in Cloud Storage<br>BigQuery table | The format must be consistent with what each model needs for inference and with the parameters that the autorater expects for the evaluation task. |
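For instance, with the rapid evaluation Python SDK you can pass a Pandas DataFrame whose columns match the input requirements of the metrics you select. The following is a minimal sketch, assuming the vertexai.preview.evaluation module from the google-cloud-aiplatform SDK; the project ID, rows, and metric choices are placeholders.

```python
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask

# Placeholder project and location; replace with your own.
vertexai.init(project="your-project-id", location="us-central1")

# Bring-your-own-response dataset: the columns follow the selected
# metrics' input requirements (reference-based metrics need
# "response" and "reference" columns).
eval_dataset = pd.DataFrame(
    {
        "response": ["The Gateway Arch is 630 feet tall."],
        "reference": ["The arch stands 630 feet (192 m) tall."],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "rouge_l_sum"],  # computation-based metrics
)
result = eval_task.evaluate()
print(result.summary_metrics)
```

Because the dataset already contains a response column, evaluate() runs without calling a model; passing a model instead would generate the responses as part of the evaluation run.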
Use a Kaggle dataset
If your customized dataset isn't ready to use with the generative AI evaluation service, Vertex AI provides pre-processed Kaggle datasets. The datasets support tasks including text generation, summarization, and question answering. The datasets are transformed into the following formats that can be used by the Python SDK and the pipelines.
Kaggle dataset | Supported tasks | Preprocessed dataset | Cloud Storage URL | Supported feature |
---|---|---|---|---|
BillSum | General text generation<br>Summarization | summaries_evaluation.jsonl<br>summaries_evaluation_autorater.jsonl<br>summaries_evaluation_for_sdk.jsonl | gs://cloud-ai-public-datasets/kaggle/akornilo/billsum/evaluation/summaries_evaluation.jsonl<br>gs://cloud-ai-public-datasets/kaggle/akornilo/billsum/evaluation/summaries_evaluation_autorater.jsonl<br>gs://cloud-ai-public-datasets/kaggle/akornilo/billsum/evaluation/summaries_evaluation_for_sdk.jsonl | Computation-based pipeline<br>AutoSxS pipeline<br>Rapid evaluation Python SDK |
Medical Transcriptions | Text classification | medical_speciality_from_transcription.jsonl<br>medical_speciality_from_transcription_autorater.jsonl | gs://cloud-ai-public-datasets/kaggle/tboyle10/medicaltranscriptions/evaluation/medical_speciality_from_transcription.jsonl<br>gs://cloud-ai-public-datasets/kaggle/tboyle10/medicaltranscriptions/evaluation/medical_speciality_from_transcription_autorater.jsonl | Computation-based pipeline<br>AutoSxS pipeline |
When using the datasets, you can start by sampling a small portion of rows to test the workflow instead of using the full dataset. The datasets listed in the table have Requester Pays turned on, which means you incur data processing and network usage charges.
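For example, you can read the first few rows of a pre-processed dataset through the Cloud Storage client while billing access to your own project. This is a minimal sketch, assuming the google-cloud-storage library; the billing project ID is a placeholder, and the whole file is downloaded before sampling.

```python
import json

from google.cloud import storage

# Requester Pays: access is billed to the project you specify here.
client = storage.Client(project="your-billing-project")
bucket = client.bucket("cloud-ai-public-datasets", user_project="your-billing-project")

# Download one of the pre-processed BillSum evaluation files.
blob = bucket.blob("kaggle/akornilo/billsum/evaluation/summaries_evaluation.jsonl")
lines = blob.download_as_text().splitlines()

# Sample a small portion of rows to test the workflow first.
sample = [json.loads(line) for line in lines[:10]]
print(sample[0].keys())
```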
What's next
- Try an evaluation example notebook.
- Learn about generative AI evaluation.
- Learn about online evaluation with rapid evaluation.
- Learn about model-based pairwise evaluation with AutoSxS pipeline.
- Learn about the computation-based evaluation pipeline.
- Learn how to tune a foundation model.