The Gen AI Evaluation Service in Vertex AI lets you evaluate any generative model or application and benchmark the evaluation results against your own judgment, using your own evaluation criteria.
While leaderboards and reports offer insights into overall model performance, they don't reveal how a model handles your specific needs. The Gen AI Evaluation Service helps you define your own evaluation criteria, ensuring a clear understanding of how well generative AI models and applications align with your unique use case.
Evaluation is important at every step of your Gen AI development process, including model selection, prompt engineering, and model customization. Gen AI evaluation is integrated into Vertex AI so that you can launch and reuse evaluations as needed.
Gen AI Evaluation Service capabilities
The Gen AI Evaluation Service can help you with the following tasks:
Model selection: Choose the best pre-trained model for your task based on benchmark results and its performance on your specific data.
Generation settings: Tweak model parameters (like temperature) to optimize output for your needs.
Prompt engineering: Craft effective prompts and prompt templates to guide the model towards your preferred behavior and responses.
Improve and safeguard fine-tuning: Fine-tune a model to improve performance for your use case, while avoiding biases or undesirable behaviors.
RAG optimization: Select the most effective Retrieval Augmented Generation (RAG) architecture to enhance performance for your application.
Migration: Continuously assess and improve the performance of your AI solution by migrating to newer models when they provide a clear advantage for your specific use case.
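As a rough illustration of the model selection and generation settings tasks above, the following sketch picks the candidate with the highest mean score. The candidate names and scores are made-up placeholders, and `pick_best_candidate` is a hypothetical helper rather than part of the Vertex AI SDK. In practice, the per-example scores would come from your own evaluation results.

```python
from statistics import mean


def pick_best_candidate(scores_by_candidate: dict[str, list[float]]) -> tuple[str, float]:
    """Return the candidate with the highest mean per-example score."""
    if not scores_by_candidate:
        raise ValueError("at least one candidate is required")
    means = {name: mean(scores) for name, scores in scores_by_candidate.items()}
    best = max(means, key=means.get)
    return best, means[best]


# Placeholder scores (1-5 scale) for the same prompts at two temperature settings.
example_scores = {
    "temperature=0.2": [4.0, 5.0, 4.0],
    "temperature=0.9": [3.0, 5.0, 3.0],
}
best_name, best_mean = pick_best_candidate(example_scores)
```

A mean is only one possible aggregate. Depending on your use case, a worst-case score or a pairwise win rate might be a better basis for the decision.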
Evaluation process
The Gen AI Evaluation Service lets you evaluate any Gen AI model or application on your evaluation criteria by following these steps:
Define your evaluation metrics.
- Learn how to tailor model-based metrics to your business criteria.
- Evaluate a single model (pointwise) or determine the winner when comparing two models (pairwise).
- Include computation-based metrics for additional insights.
Prepare your evaluation dataset.
- Provide a dataset that reflects your specific use case.
Run an evaluation.
- Start from scratch, use a template, or adapt existing examples.
- Define candidate models and create an EvalTask to reuse your evaluation logic through Vertex AI.
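The steps above can be sketched roughly as follows, using the Vertex AI SDK for Python. This is a minimal sketch, not a definitive implementation: `build_eval_dataset` and `run_evaluation` are hypothetical helper names, the model ID `gemini-1.5-pro` and the chosen metrics are assumptions, and the code assumes that `vertexai.init(project=..., location=...)` has already been called for a project with the service enabled.

```python
from typing import Any


def build_eval_dataset(prompts: list[str], references: list[str]) -> dict[str, list[str]]:
    """Pair each prompt with its reference answer, stored column by column."""
    if len(prompts) != len(references):
        raise ValueError("prompts and references must have the same length")
    return {"prompt": list(prompts), "reference": list(references)}


def run_evaluation(dataset: dict[str, list[str]], experiment: str) -> Any:
    """Evaluate one candidate model (assumed SDK usage; needs a configured project)."""
    # Imported lazily so the dataset helper above works without the SDK installed.
    import pandas as pd
    from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
    from vertexai.generative_models import GenerativeModel

    eval_task = EvalTask(
        dataset=pd.DataFrame(dataset),
        metrics=[
            MetricPromptTemplateExamples.Pointwise.FLUENCY,  # model-based, pointwise
            "exact_match",  # computation-based, compares response to reference
        ],
        experiment=experiment,
    )
    # The same EvalTask can be reused with a different candidate model.
    return eval_task.evaluate(model=GenerativeModel("gemini-1.5-pro"))
```

Separating dataset preparation from the evaluation run keeps the evaluation logic reusable: the same `EvalTask` can be run again with another candidate model or prompt template.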
Notebooks for evaluation use cases
The following table lists Vertex AI SDK for Python notebooks for various generative AI evaluation use cases:
Use case | Description | Links to notebooks |
---|---|---|
Evaluate models | Quickstart: Introduction to Gen AI Evaluation Service SDK. | Getting Started with Gen AI Evaluation Service SDK |
Evaluate models | Evaluate and select first-party (1P) foundation models for your task. | Evaluate and select first-party (1P) foundation models for your task |
Evaluate models | Evaluate and select generative AI model settings: Adjust temperature, output token limit, safety settings, and other model generation configurations of Gemini models on a summarization task, and compare the evaluation results from different model settings on several metrics. | Compare different model parameter settings for Gemini |
Evaluate models | Migrate from PaLM to Gemini model with Gen AI Evaluation Service SDK. This notebook guides you through evaluating PaLM and Gemini foundation models using multiple evaluation metrics to support decisions around migrating from one model to another. It visualizes these metrics to show the strengths and weaknesses of each model, helping you decide which one best fits the requirements of your use case. | Compare and migrate from PaLM to Gemini model |
Evaluate prompt templates | Prompt engineering and prompt evaluation with Gen AI Evaluation Service SDK. | Evaluate and Optimize Prompt Template Design for Better Results |
Evaluate Gen AI applications | Evaluate Gemini model tool use and function calling capabilities. | Evaluate Gemini Model Tool Use |
Evaluate Gen AI applications | Evaluate generated answers from Retrieval-Augmented Generation (RAG) for a question-answering task with Gen AI Evaluation Service SDK. | Evaluate Generated Answers from Retrieval-Augmented Generation (RAG) |
Metric customization | Customize model-based metrics and evaluate a generative AI model according to your specific criteria. | Customize Model-based Metrics to evaluate a Gen AI model |
Metric customization | Evaluate generative AI models with your locally defined custom metric, and bring your own judge model to perform model-based metric evaluation. | Bring-Your-Own-Autorater using Custom Metric |
Metric customization | Define your own computation-based custom metric functions, and use them for evaluation with Gen AI Evaluation Service SDK. | Bring your own computation-based Custom Metric |
Other topics | Gen AI Evaluation Service SDK Preview-to-GA Migration Guide. This tutorial guides you through the migration process from the Preview version to the latest GA version of the Vertex AI SDK for Python for Gen AI Evaluation Service. The guide also shows how to use the GA version SDK to evaluate Retrieval-Augmented Generation (RAG) and compare two models using pairwise evaluation. | Gen AI Evaluation Service SDK Preview-to-GA Migration Guide |
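To give a feel for the computation-based custom metrics mentioned in the table, here is a minimal sketch of a metric function. It assumes that the SDK passes each evaluation row as a dictionary with a `response` key and expects a dictionary keyed by the metric name in return. The name `word_count_metric` is hypothetical, and the listed notebooks remain the authoritative reference.

```python
def word_count_metric(instance: dict[str, str]) -> dict[str, float]:
    """Score one evaluation row by the number of words in its response."""
    response = instance["response"]
    # str.split() with no arguments splits on any run of whitespace.
    return {"word_count": float(len(response.split()))}


# Assumed registration with the SDK (not executed here):
#   from vertexai.evaluation import CustomMetric
#   metric = CustomMetric(name="word_count", metric_function=word_count_metric)
```

Because the function is plain Python, you can unit test it locally before using it in an evaluation run.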
Supported models
The Gen AI Evaluation Service supports Google's foundation models, third-party models, and open models. You can provide pre-generated predictions directly, or automatically generate candidate model responses in the following ways:
Automatically generate responses for Google's foundation models (such as Gemini 1.5 Pro) and any model deployed in Vertex AI Model Registry.
Integrate with SDK text generation APIs from third-party and open models.
Wrap model endpoints from other providers using the Vertex AI SDK.
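One way to picture wrapping a model endpoint from another provider is to adapt it into a plain prompt-to-response function and use it to fill in the responses of your dataset. The sketch below is illustrative only: `wrap_endpoint`, `add_responses`, the payload shape, and `fake_backend` are all hypothetical, and it assumes that the evaluation step can work with either such a function or a dataset that already contains a `response` column.

```python
from typing import Callable


def wrap_endpoint(send_request: Callable[[dict], dict]) -> Callable[[str], str]:
    """Adapt a provider-specific client into a prompt -> response function."""

    def generate(prompt: str) -> str:
        # Hypothetical payload shape; real providers define their own schema.
        reply = send_request({"prompt": prompt})
        return str(reply["text"])

    return generate


def add_responses(
    dataset: dict[str, list[str]], model_fn: Callable[[str], str]
) -> dict[str, list[str]]:
    """Return a copy of the dataset with a 'response' column from model_fn."""
    responses = [model_fn(prompt) for prompt in dataset["prompt"]]
    return {**dataset, "response": responses}


def fake_backend(payload: dict) -> dict:
    """Stand-in for a remote endpoint; echoes the prompt in upper case."""
    return {"text": payload["prompt"].upper()}


dataset_with_responses = add_responses({"prompt": ["hello"]}, wrap_endpoint(fake_backend))
```

Generating the responses up front corresponds to the pre-generated predictions option, while passing the wrapped function along corresponds to letting the service generate candidate responses for you.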
What's next
Try the evaluation quickstart.
Learn how to tune a foundation model.