Evaluation notebooks

We provide several examples of how you can use the Generative AI on Vertex AI evaluation service to evaluate your generative AI models.

Evaluate your models in real time

The Vertex AI rapid evaluation service lets you evaluate your generative AI models in real time. To learn how to use rapid evaluation, see Run a rapid evaluation.
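The following sketch shows what a minimal, real-time evaluation might look like with the rapid evaluation SDK. It assumes the preview vertexai.preview.evaluation module and a bring-your-own-response dataset; the project, experiment name, metric names, and data are placeholders to adapt to your environment and SDK version.

```python
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask

# Hypothetical project and location; replace with your own values.
vertexai.init(project="my-project", location="us-central1")

# Bring-your-own-response dataset: metrics are computed over the "response"
# column, with "reference" used by computation-based metrics such as ROUGE.
eval_dataset = pd.DataFrame(
    {
        "response": [
            "Paris is the capital of France.",
            "Tokyo hosted the 2020 Summer Olympics in 2021.",
        ],
        "reference": [
            "The capital of France is Paris.",
            "The 2020 Summer Olympics took place in Tokyo in 2021.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "rouge_l_sum", "fluency"],  # illustrative metric names
    experiment="rapid-eval-quickstart",                 # hypothetical experiment name
)

result = eval_task.evaluate()
print(result.summary_metrics)  # aggregate scores
print(result.metrics_table)    # per-instance scores
```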

Evaluate and optimize prompt template design

Use the rapid evaluation SDK to evaluate the effect of prompt engineering. Examine the statistics corresponding to each prompt template to understand how differences in prompting impact evaluation results.
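As a sketch of what that comparison might look like, the snippet below runs the same evaluation task once per prompt template and records each run as a separate experiment run. The template strings, metric names, and experiment names are illustrative assumptions; adapt them to your data and SDK version.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="my-project", location="us-central1")  # hypothetical values

# The "context" column is referenced by the prompt templates below.
eval_dataset = pd.DataFrame(
    {
        "context": [
            "Support ticket: the customer cannot reset their password.",
            "Support ticket: the invoice total does not match the order.",
        ],
    }
)

prompt_templates = [
    "Summarize the following ticket: {context}",
    "You are a support lead. Summarize the ticket below in one sentence "
    "for a handoff note:\n{context}",
]

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality", "fluency"],  # illustrative metric names
    experiment="prompt-template-comparison",       # hypothetical experiment name
)

model = GenerativeModel("gemini-pro")

# One experiment run per template; compare summary metrics across runs.
results = {}
for i, template in enumerate(prompt_templates):
    results[f"template_{i}"] = eval_task.evaluate(
        model=model,
        prompt_template=template,
        experiment_run_name=f"prompt-template-{i}",
    )

for name, result in results.items():
    print(name, result.summary_metrics)
```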

Evaluate and select LLMs using benchmark metrics

Use the rapid evaluation SDK to score both the Gemini Pro and Text Bison models on a benchmark dataset for a given task.
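A sketch of that comparison is shown below. It evaluates each candidate model against the same dataset and metrics; Text Bison is wrapped in a small callable, which the preview SDK can accept in place of a GenerativeModel (confirm this against your SDK version). The dataset contents, metric names, and experiment names are placeholders.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextGenerationModel
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="my-project", location="us-central1")  # hypothetical values

# A small benchmark slice for a summarization task: "context" feeds the prompt
# template and "reference" holds the ground-truth summary.
eval_dataset = pd.DataFrame(
    {
        "context": ["<article 1 text>", "<article 2 text>"],
        "reference": ["<reference summary 1>", "<reference summary 2>"],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["rouge_l_sum", "summarization_quality", "fluency"],  # illustrative
    experiment="model-selection",  # hypothetical experiment name
)

gemini = GenerativeModel("gemini-pro")
text_bison = TextGenerationModel.from_pretrained("text-bison@002")

def text_bison_fn(prompt: str) -> str:
    # Wrap Text Bison in a prompt-in, text-out callable for the eval SDK.
    return text_bison.predict(prompt).text

candidates = {"gemini-pro": gemini, "text-bison": text_bison_fn}

for name, candidate in candidates.items():
    result = eval_task.evaluate(
        model=candidate,
        prompt_template="Summarize the following article:\n{context}",
        experiment_run_name=f"benchmark-{name}",
    )
    print(name, result.summary_metrics)
```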

Evaluate and select model-generation settings

Use the rapid evaluation SDK to adjust the temperature of Gemini Pro on a summarization task and to evaluate quality, fluency, safety, and verbosity.
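The sketch below illustrates one way to do this: re-run the same evaluation task while varying only the temperature in the model's generation config. The metric names, dataset, and experiment names are illustrative; adjust them to the metrics available in your SDK version.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="my-project", location="us-central1")  # hypothetical values

eval_dataset = pd.DataFrame({"context": ["<article 1 text>", "<article 2 text>"]})

eval_task = EvalTask(
    dataset=eval_dataset,
    # Illustrative metric names covering quality, fluency, safety, and verbosity.
    metrics=["summarization_quality", "fluency", "safety", "summarization_verbosity"],
    experiment="generation-settings",  # hypothetical experiment name
)

# Evaluate the same prompt and dataset at several temperatures.
for temperature in (0.0, 0.4, 0.8):
    model = GenerativeModel(
        "gemini-pro",
        generation_config=GenerationConfig(temperature=temperature),
    )
    result = eval_task.evaluate(
        model=model,
        prompt_template="Summarize the following article:\n{context}",
        experiment_run_name=f"temp-{str(temperature).replace('.', '-')}",
    )
    print(temperature, result.summary_metrics)
```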

Define your metrics

Use the rapid evaluation SDK to evaluate multiple prompt templates with your own custom-defined metrics.
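For example, you might define a metric as a local Python function and combine it with built-in metrics, as in the sketch below. It assumes the CustomMetric helper from the preview SDK, which wraps a function that receives one evaluation instance and returns a score; the keyword list, templates, and names are hypothetical.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import CustomMetric, EvalTask

vertexai.init(project="my-project", location="us-central1")  # hypothetical values

# Hypothetical task requirement: replies should mention a refund and an apology.
REQUIRED_KEYWORDS = ["refund", "apology"]

def keyword_coverage(instance: dict) -> dict:
    # Fraction of required keywords present in the generated response.
    response = instance["response"].lower()
    hits = sum(1 for keyword in REQUIRED_KEYWORDS if keyword in response)
    return {"keyword_coverage": hits / len(REQUIRED_KEYWORDS)}

keyword_metric = CustomMetric(name="keyword_coverage", metric_function=keyword_coverage)

eval_dataset = pd.DataFrame({"context": ["<complaint email 1>", "<complaint email 2>"]})

prompt_templates = [
    "Draft a reply to this customer email:\n{context}",
    "Draft a short, empathetic reply that offers a refund:\n{context}",
]

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[keyword_metric, "fluency"],  # mix custom and built-in metrics
    experiment="custom-metrics",          # hypothetical experiment name
)

for i, template in enumerate(prompt_templates):
    result = eval_task.evaluate(
        model=GenerativeModel("gemini-pro"),
        prompt_template=template,
        experiment_run_name=f"custom-metric-template-{i}",
    )
    print(template, result.summary_metrics)
```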

Evaluate tool use and function calling

Use the rapid evaluation SDK to define an API function and a tool for the Gemini model. You can also use the SDK to evaluate tool use and function-calling quality for Gemini.
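The sketch below shows the general shape of a tool-use evaluation: each row pairs a predicted tool call with a reference tool call, and the tool-use metrics compare them. The JSON record shape is illustrative only; see the notebook for the exact schema each metric expects, and treat the project, experiment, and data values as placeholders.

```python
import json

import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="my-project", location="us-central1")  # hypothetical values

def tool_call_record(name: str, arguments: dict) -> str:
    # Illustrative tool-call record; confirm the schema expected by your SDK version.
    return json.dumps({"content": "", "tool_calls": [{"name": name, "arguments": arguments}]})

eval_dataset = pd.DataFrame(
    {
        "response": [
            tool_call_record("get_weather", {"city": "Paris"}),
            tool_call_record("book_flight", {"from": "SFO", "to": "JFK"}),
        ],
        "reference": [
            tool_call_record("get_weather", {"city": "Paris"}),
            tool_call_record("book_flight", {"from": "SFO", "to": "JFK", "date": "2024-07-01"}),
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "tool_call_valid",
        "tool_name_match",
        "tool_parameter_key_match",
        "tool_parameter_kv_match",
    ],
    experiment="tool-use-eval",  # hypothetical experiment name
)

result = eval_task.evaluate()
print(result.summary_metrics)
```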

Evaluate generated answers from RAG for question answering

Use the rapid evaluation SDK to evaluate answers generated by Retrieval-Augmented Generation (RAG) for a question-answering task.
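A minimal sketch of such an evaluation is shown below: each row pairs a question, the passages retrieved by your RAG pipeline, and the answer it generated, and the task is scored with question-answering and groundedness metrics. The column and metric names follow the preview rapid evaluation SDK and may differ in your SDK version; the data and experiment name are placeholders.

```python
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="my-project", location="us-central1")  # hypothetical values

# Bring-your-own-response dataset: the question ("instruction"), the retrieved
# passages ("context"), and the answer produced by the RAG pipeline ("response").
eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "What is the maximum file size for uploads?",
            "How do I rotate my API key?",
        ],
        "context": [
            "<retrieved documentation chunk 1>",
            "<retrieved documentation chunk 2>",
        ],
        "response": [
            "<answer generated by the RAG pipeline>",
            "<answer generated by the RAG pipeline>",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "question_answering_quality",
        "question_answering_relevance",
        "groundedness",
    ],
    experiment="rag-qa-eval",  # hypothetical experiment name
)

result = eval_task.evaluate()
print(result.summary_metrics)
print(result.metrics_table)
```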

Evaluate an LLM in Vertex AI Model Registry against a third-party model

Use AutoSxS to evaluate responses between two models and determine a winner. You can either provide the responses or generate them using Vertex AI Batch Predictions.
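AutoSxS runs as a Vertex AI Pipelines job. The sketch below shows roughly how such a job can be submitted with the Vertex AI SDK when both models' responses are already in the evaluation dataset; the pipeline template URI, parameter names, and Cloud Storage paths are placeholders modeled on the AutoSxS documentation and should be checked against the current pipeline version.

```python
from google.cloud import aiplatform

# Hypothetical project, location, and paths.
aiplatform.init(project="my-project", location="us-central1")

parameters = {
    # Dataset with one row per example, including pre-generated responses.
    "evaluation_dataset": "gs://my-bucket/autosxs/eval_dataset.jsonl",
    "id_columns": ["id"],
    "task": "summarization",
    # Columns holding each model's responses. Alternatively, configure the
    # pipeline to generate responses with Vertex AI Batch Predictions.
    "response_column_a": "response_a",
    "response_column_b": "response_b",
    # Tells the autorater which dataset columns to read when judging responses.
    "autorater_prompt_parameters": {
        "inference_context": {"column": "document"},
        "inference_instruction": {"template": "Summarize the document."},
    },
}

job = aiplatform.PipelineJob(
    display_name="autosxs-model-comparison",
    pipeline_root="gs://my-bucket/autosxs/pipeline-root",
    # Placeholder: use the AutoSxS pipeline template URI from the AutoSxS docs.
    template_path="<AutoSxS pipeline template URI>",
    parameter_values=parameters,
)
job.run()
```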

Check autorater alignment against a human-preference dataset

Use AutoSxS to check how well autorater ratings align with a set of human ratings you provide for a particular task. Determine whether AutoSxS is sufficient for your use case or whether it needs further customization.
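Building on the pipeline sketch above, the alignment check is typically just one extra pipeline parameter that points at your column of human preference labels; the parameter and column names below are assumptions to verify against the AutoSxS documentation for your pipeline version.

```python
# Column in the evaluation dataset that records which response the human
# raters preferred; the pipeline then reports autorater-human alignment.
parameters["human_preference_column"] = "actual_preference"
```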

Evaluate LangChain chains

Use the rapid evaluation SDK to evaluate your LangChain chains. Prepare your data, set up your LangChain chain, and run your evaluation.
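One way to structure this is the bring-your-own-response pattern sketched below: run the chain over your inputs, collect its outputs in a response column, and evaluate that table. The chain itself is a minimal placeholder, and the package names, metric names, and column names are assumptions to adjust to your LangChain and SDK versions.

```python
import pandas as pd
import vertexai
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_vertexai import ChatVertexAI
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="my-project", location="us-central1")  # hypothetical values

# Set up a minimal LangChain chain; swap in the chain you want to evaluate.
prompt = ChatPromptTemplate.from_template("Summarize the following ticket:\n{context}")
chain = prompt | ChatVertexAI(model_name="gemini-pro") | StrOutputParser()

# Prepare the data: run the chain and store its outputs as responses.
contexts = ["<ticket 1 text>", "<ticket 2 text>"]
responses = [chain.invoke({"context": context}) for context in contexts]

eval_dataset = pd.DataFrame(
    {
        "instruction": ["Summarize the following ticket."] * len(contexts),
        "context": contexts,
        "response": responses,
    }
)

# Run the evaluation on the chain's outputs.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality", "fluency", "safety"],  # illustrative metrics
    experiment="langchain-eval",  # hypothetical experiment name
)

result = eval_task.evaluate()
print(result.summary_metrics)
```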

What's next