Resource: EvaluationRun
EvaluationRun is a resource that represents a single evaluation run, which includes a set of prompts, model responses, the evaluation configuration, and the resulting metrics.
name
string
Identifier. The resource name of the EvaluationRun. This is a unique identifier. Format: projects/{project}/locations/{location}/evaluationRuns/{evaluationRun}
displayName
string
Required. The display name of the Evaluation Run.
metadata
value (Value format)
Optional. Metadata about the evaluation run. Can be used by the caller to store additional tracking information about the evaluation run.
labels
map (key: string, value: string)
Optional. Labels for the evaluation run.
dataSource
object (DataSource)
Required. The data source for the evaluation run.
inferenceConfigs
map (key: string, value: object (InferenceConfig))
Optional. The candidate-to-inference-config map for the evaluation run. A candidate name can be up to 128 characters long and can consist of any UTF-8 characters.
evaluationConfig
object (EvaluationConfig)
Required. The configuration used for the evaluation.
state
enum (State)
Output only. The state of the evaluation run.
error
object (Status)
Output only. The error of the evaluation run. Only populated when the evaluation run's state is FAILED or CANCELLED.
evaluationResults
object (EvaluationResults)
Output only. The results of the evaluation run. Only populated when the evaluation run's state is SUCCEEDED.
createTime
string (Timestamp format)
Output only. Time when the evaluation run was created.
Uses RFC 3339, where generated output will always be Z-normalized and use 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
completionTime
string (Timestamp format)
Output only. Time when the evaluation run was completed.
Uses RFC 3339, where generated output will always be Z-normalized and use 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
evaluationSetSnapshot
string
Output only. The specific evaluation set of the evaluation run. For runs with an evaluation set input, this is that same set. For runs with BigQuery input, it is the evaluation set created from the sampled BigQuery data.
JSON representation

{
  "name": string,
  "displayName": string,
  "metadata": value,
  "labels": { string: string, ... },
  "dataSource": { object (DataSource) },
  "inferenceConfigs": { string: { object (InferenceConfig) }, ... },
  "evaluationConfig": { object (EvaluationConfig) },
  "state": enum (State),
  "error": { object (Status) },
  "evaluationResults": { object (EvaluationResults) },
  "createTime": string,
  "completionTime": string,
  "evaluationSetSnapshot": string
}
DataSource
The data source for the evaluation run.
source
Union type
source
can be only one of the following:
evaluationSet
string
The EvaluationSet resource name. Format: projects/{project}/locations/{location}/evaluationSets/{evaluationSet}
bigqueryRequestSet
object (BigQueryRequestSet)
Evaluation data in BigQuery.
JSON representation

{
  // Union field source can be only one of the following:
  "evaluationSet": string,
  "bigqueryRequestSet": { object (BigQueryRequestSet) }
  // End of list of possible types for union field source.
}
BigQueryRequestSet
The request set for the evaluation run.
uri
string
Required. The URI of a BigQuery table. e.g. bq://projectId.bqDatasetId.bqTableId
promptColumn
string
Optional. The name of the column that contains the requests to evaluate. This will be in evaluationItem.EvalPrompt format.
rubricsColumn
string
Optional. The name of the column that contains the rubrics. This is in evaluation_rubric.RubricGroup format.
candidateResponseColumns
map (key: string, value: string)
Optional. Map of candidate name to candidate response column name. The column will be in evaluationItem.CandidateResponse format.
samplingConfig
object (SamplingConfig)
Optional. The sampling config for the BigQuery resource.
JSON representation

{
  "uri": string,
  "promptColumn": string,
  "rubricsColumn": string,
  "candidateResponseColumns": { string: string, ... },
  "samplingConfig": { object (SamplingConfig) }
}
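As a sketch, a bigqueryRequestSet that reads prompts from one column and maps a single candidate's responses might look like this; the project, dataset, table, column, and candidate names are hypothetical:

{
  "uri": "bq://my-project.my_dataset.my_table",
  "promptColumn": "prompt",
  "candidateResponseColumns": {
    "candidate_1": "model_response"
  },
  "samplingConfig": {
    "samplingCount": 100,
    "samplingMethod": "RANDOM"
  }
}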
SamplingConfig
The sampling config.
samplingCount
integer
Optional. The total number of logged records to import. If the available data is less than the sampling count, all data will be imported. Default is 100.
samplingMethod
enum (SamplingMethod)
Optional. The sampling method to use.
samplingDuration
string (Duration format)
Optional. How long to wait before sampling data from the BigQuery table. If not specified, defaults to 0.
A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".
JSON representation

{
  "samplingCount": integer,
  "samplingMethod": enum (SamplingMethod),
  "samplingDuration": string
}
SamplingMethod
The sampling method to use.
| Enums | |
|---|---|
| SAMPLING_METHOD_UNSPECIFIED | Unspecified sampling method. |
| RANDOM | Random sampling. |
InferenceConfig
An inference config used for model inference during the evaluation run.
model
string
Required. The fully qualified name of the publisher model or endpoint to use.
Publisher model format: projects/{project}/locations/{location}/publishers/*/models/*
Endpoint format: projects/{project}/locations/{location}/endpoints/{endpoint}
model_config
Union type
model_config
can be only one of the following:
generationConfig
object (GenerationConfig)
Optional. Generation config.
JSON representation

{
  "model": string,
  // Union field model_config can be only one of the following:
  "generationConfig": { object (GenerationConfig) }
  // End of list of possible types for union field model_config.
}
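For illustration, one entry of the candidate-to-inference-config map described earlier might be sketched as follows; the candidate name and model path are placeholders:

{
  "inferenceConfigs": {
    "candidate_1": {
      "model": "projects/my-project/locations/us-central1/publishers/google/models/example-model",
      "generationConfig": {
        "temperature": 0.2,
        "maxOutputTokens": 1024
      }
    }
  }
}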
GenerationConfig
Generation config.
stopSequences[]
string
Optional. Stop sequences.
responseMimeType
string
Optional. Output response MIME type of the generated candidate text. Supported MIME types:
- text/plain: (default) Text output.
- application/json: JSON response in the candidates.
The model needs to be prompted to output the appropriate response type; otherwise the behavior is undefined. This is a preview feature.
responseModalities[]
enum (Modality)
Optional. The modalities of the response.
thinkingConfig
object (ThinkingConfig)
Optional. Config for thinking features. An error will be returned if this field is set for models that don't support thinking.
modelConfig
object (ModelConfig)
Optional. Config for model selection.
temperature
number
Optional. Controls the randomness of predictions.
topP
number
Optional. If specified, nucleus sampling will be used.
topK
number
Optional. If specified, top-k sampling will be used.
candidateCount
integer
Optional. Number of candidates to generate.
maxOutputTokens
integer
Optional. The maximum number of output tokens to generate per message.
responseLogprobs
boolean
Optional. If true, export the logprobs results in response.
logprobs
integer
Optional. Logit probabilities.
presencePenalty
number
Optional. Positive penalties.
frequencyPenalty
number
Optional. Frequency penalties.
seed
integer
Optional. Seed.
responseSchema
object (Schema)
Optional. The Schema object allows the definition of input and output data types. These types can be objects, but also primitives and arrays. Represents a select subset of an OpenAPI 3.0 schema object. If set, a compatible responseMimeType must also be set. Compatible MIME types: application/json: schema for JSON response.
responseJsonSchema
value (Value format)
Optional. Output schema of the generated response. This is an alternative to responseSchema that accepts JSON Schema.
If set, responseSchema must be omitted, but responseMimeType is required.
While the full JSON Schema may be sent, not all features are supported. Specifically, only the following properties are supported:
- $id
- $defs
- $ref
- $anchor
- type
- format
- title
- description
- enum (for strings and numbers)
- items
- prefixItems
- minItems
- maxItems
- minimum
- maximum
- anyOf
- oneOf (interpreted the same as anyOf)
- properties
- additionalProperties
- required
The non-standard propertyOrdering property may also be set.
Cyclic references are unrolled to a limited degree and, as such, may only be used within non-required properties. (Nullable properties are not sufficient.) If $ref is set on a sub-schema, no other properties, except for those starting with a $, may be set. A schema using this subset is sketched after the JSON representation below.
routingConfig
object (RoutingConfig)
Optional. Routing configuration.
audioTimestamp
boolean
Optional. If enabled, audio timestamp will be included in the request to the model.
mediaResolution
enum (MediaResolution)
Optional. If specified, the media resolution specified will be used.
speechConfig
object (SpeechConfig)
Optional. The speech generation config.
enableAffectiveDialog
boolean
Optional. If enabled, the model will detect emotions and adapt its responses accordingly.
JSON representation

{
  "stopSequences": [ string ],
  "responseMimeType": string,
  "responseModalities": [ enum (Modality) ],
  "thinkingConfig": { object (ThinkingConfig) },
  "modelConfig": { object (ModelConfig) },
  "temperature": number,
  "topP": number,
  "topK": number,
  "candidateCount": integer,
  "maxOutputTokens": integer,
  "responseLogprobs": boolean,
  "logprobs": integer,
  "presencePenalty": number,
  "frequencyPenalty": number,
  "seed": integer,
  "responseSchema": { object (Schema) },
  "responseJsonSchema": value,
  "routingConfig": { object (RoutingConfig) },
  "audioTimestamp": boolean,
  "mediaResolution": enum (MediaResolution),
  "speechConfig": { object (SpeechConfig) },
  "enableAffectiveDialog": boolean
}
RoutingConfig
The configuration for routing the request to a specific model.
routing_config
Union type
routing_config
can be only one of the following:
autoMode
object (AutoRoutingMode)
Automated routing.
manualMode
object (ManualRoutingMode)
Manual routing.
JSON representation

{
  // Union field routing_config can be only one of the following:
  "autoMode": { object (AutoRoutingMode) },
  "manualMode": { object (ManualRoutingMode) }
  // End of list of possible types for union field routing_config.
}
AutoRoutingMode
When automated routing is specified, the routing will be determined by the pretrained routing model and the customer-provided model routing preference.
modelRoutingPreference
enum (ModelRoutingPreference)
The model routing preference.
JSON representation

{
  "modelRoutingPreference": enum (ModelRoutingPreference)
}
ModelRoutingPreference
The model routing preference.
| Enums | |
|---|---|
| UNKNOWN | Unspecified model routing preference. |
| PRIORITIZE_QUALITY | Prefer higher quality over lower cost. |
| BALANCED | Balanced model routing preference. |
| PRIORITIZE_COST | Prefer lower cost over higher quality. |
ManualRoutingMode
When manual routing is set, the specified model will be used directly.
modelName
string
The model name to use. Only the public LLM models are accepted. See Supported models.
JSON representation

{ "modelName": string }
Modality
The modalities of the response.
| Enums | |
|---|---|
| MODALITY_UNSPECIFIED | Unspecified modality. Will be processed as text. |
| TEXT | Text modality. |
| IMAGE | Image modality. |
| AUDIO | Audio modality. |
MediaResolution
Media resolution for the input media.
| Enums | |
|---|---|
| MEDIA_RESOLUTION_UNSPECIFIED | Media resolution has not been set. |
| MEDIA_RESOLUTION_LOW | Media resolution set to low (64 tokens). |
| MEDIA_RESOLUTION_MEDIUM | Media resolution set to medium (256 tokens). |
| MEDIA_RESOLUTION_HIGH | Media resolution set to high (zoomed reframing with 256 tokens). |
SpeechConfig
The speech generation config.
voiceConfig
object (VoiceConfig)
The configuration for the speaker to use.
languageCode
string
Optional. Language code (ISO 639, e.g. "en-US") for speech synthesis.
JSON representation

{
  "voiceConfig": { object (VoiceConfig) },
  "languageCode": string
}
VoiceConfig
The configuration for the voice to use.
voice_config
Union type
voice_config
can be only one of the following:
prebuiltVoiceConfig
object (PrebuiltVoiceConfig)
The configuration for the prebuilt voice to use.
JSON representation

{
  // Union field voice_config can be only one of the following:
  "prebuiltVoiceConfig": { object (PrebuiltVoiceConfig) }
  // End of list of possible types for union field voice_config.
}
PrebuiltVoiceConfig
The configuration for the prebuilt speaker to use.
voiceName
string
The name of the preset voice to use.
JSON representation

{ "voiceName": string }
ThinkingConfig
Config for thinking features.
includeThoughts
boolean
Optional. Indicates whether to include thoughts in the response. If true, thoughts are returned only when available.
thinkingBudget
integer
Optional. Indicates the thinking budget in tokens.
JSON representation

{ "includeThoughts": boolean, "thinkingBudget": integer }
ModelConfig
Config for model selection.
featureSelectionPreference
enum (FeatureSelectionPreference)
Required. Feature selection preference.
JSON representation

{
  "featureSelectionPreference": enum (FeatureSelectionPreference)
}
FeatureSelectionPreference
Options for feature selection preference.
| Enums | |
|---|---|
| FEATURE_SELECTION_PREFERENCE_UNSPECIFIED | Unspecified feature selection preference. |
| PRIORITIZE_QUALITY | Prefer higher quality over lower cost. |
| BALANCED | Balanced feature selection preference. |
| PRIORITIZE_COST | Prefer lower cost over higher quality. |
EvaluationConfig
The evaluation configuration used for the evaluation run.
metrics[]
object (EvaluationRunMetric)
Required. The metrics to be calculated in the evaluation run.
rubricConfigs[]
object (EvaluationRubricConfig)
Optional. The rubric configs for the evaluation run. They are used to generate rubrics which can be used by rubric-based metrics. Multiple rubric configs can be specified for rubric generation, but only one rubric config can be used for a rubric-based metric. If more than one rubric config is provided, the evaluation metric must specify a rubric group key. Note that if a generation spec is specified on both a rubric config and an evaluation metric, the rubrics generated for the metric will be used for evaluation.
outputConfig
object (OutputConfig)
Optional. The output config for the evaluation run.
autoraterConfig
object (AutoraterConfig)
Optional. The autorater config for the evaluation run.
promptTemplate
object (PromptTemplate)
The prompt template used for inference. The values for variables in the prompt template are defined in EvaluationItem.EvaluationPrompt.PromptTemplateData.values.
JSON representation

{
  "metrics": [ { object (EvaluationRunMetric) } ],
  "rubricConfigs": [ { object (EvaluationRubricConfig) } ],
  "outputConfig": { object (OutputConfig) },
  "autoraterConfig": { object (AutoraterConfig) },
  "promptTemplate": { object (PromptTemplate) }
}
EvaluationRunMetric
The metric used for evaluation runs.
metric
string
Required. The name of the metric.
metric_spec
Union type
metric_spec
can be only one of the following:
rubricBasedMetricSpec
object (RubricBasedMetricSpec)
Spec for a rubric-based metric.
predefinedMetricSpec
object (PredefinedMetricSpec)
Spec for a pre-defined metric.
llmBasedMetricSpec
object (LLMBasedMetricSpec)
Spec for an LLM-based metric.
JSON representation

{
  "metric": string,
  // Union field metric_spec can be only one of the following:
  "rubricBasedMetricSpec": { object (RubricBasedMetricSpec) },
  "predefinedMetricSpec": { object (PredefinedMetricSpec) },
  "llmBasedMetricSpec": { object (LLMBasedMetricSpec) }
  // End of list of possible types for union field metric_spec.
}
RubricBasedMetricSpec
Specification for a metric that is based on rubrics.
metricPromptTemplate
string
Optional. Template for the prompt used by the judge model to evaluate against rubrics.
rubrics_source
Union type
rubrics_source
can be only one of the following:
inlineRubrics
object (RepeatedRubrics)
Use rubrics provided directly in the spec.
rubricGroupKey
string
Use a pre-defined group of rubrics associated with the input content. This refers to a key in the rubricGroups map of RubricEnhancedContents.
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics for evaluation using this specification.
judgeAutoraterConfig
object (AutoraterConfig)
Optional. Configuration for the judge LLM (Autorater).
JSON representation

{
  "metricPromptTemplate": string,
  // Union field rubrics_source can be only one of the following:
  "inlineRubrics": { object (RepeatedRubrics) },
  "rubricGroupKey": string,
  "rubricGenerationSpec": { object (RubricGenerationSpec) },
  // End of list of possible types for union field rubrics_source.
  "judgeAutoraterConfig": { object (AutoraterConfig) }
}
RepeatedRubrics
A list of rubrics provided inline in the metric spec.
RubricGenerationSpec
Specification for how rubrics should be generated.
promptTemplate
string
Optional. Template for the prompt used to generate rubrics. The details should be updated based on the most-recent recipe requirements.
rubricContentType
enum (RubricContentType)
Optional. The type of rubric content to be generated.
rubricTypeOntology[]
string
Optional. An optional, pre-defined list of allowed types for generated rubrics. If this field is provided, it implies include_rubric_type
should be true, and the generated rubric types should be chosen from this ontology.
modelConfig
object (AutoraterConfig)
Optional. Configuration for the model used in rubric generation. Configs including sampling count and base model can be specified here. Flipping is not supported for rubric generation.
JSON representation

{
  "promptTemplate": string,
  "rubricContentType": enum (RubricContentType),
  "rubricTypeOntology": [ string ],
  "modelConfig": { object (AutoraterConfig) }
}
AutoraterConfig
The autorater config used for the evaluation run.
autoraterModel
string
Optional. The fully qualified name of the publisher model or tuned autorater endpoint to use.
Publisher model format: projects/{project}/locations/{location}/publishers/*/models/*
Tuned model endpoint format: projects/{project}/locations/{location}/endpoints/{endpoint}
generationConfig
object (GenerationConfig)
Optional. Configuration options for model generation and outputs.
sampleCount
integer
Optional. Number of samples for each instance in the dataset. If not specified, the default is 4. Minimum value is 1, maximum value is 32.
JSON representation

{
  "autoraterModel": string,
  "generationConfig": { object (GenerationConfig) },
  "sampleCount": integer
}
RubricContentType
Specifies the type of rubric content to generate.
| Enums | |
|---|---|
| RUBRIC_CONTENT_TYPE_UNSPECIFIED | The content type to generate is not specified. |
| PROPERTY | Generate rubrics based on properties. |
| NL_QUESTION_ANSWER | Generate rubrics in an NL question-answer format. |
| PYTHON_CODE_ASSERTION | Generate rubrics in a unit-test format. |
PredefinedMetricSpec
Specification for a pre-defined metric.
metricSpecName
string
Required. The name of a pre-defined metric, such as "instruction_following_v1" or "text_quality_v1".
parameters
object (Struct format)
Optional. The parameters needed to run the pre-defined metric.
JSON representation

{ "metricSpecName": string, "parameters": { object } }
LLMBasedMetricSpec
Specification for an LLM based metric.
rubrics_source
Union type
rubrics_source
can be only one of the following:
rubricGroupKey
string
Use a pre-defined group of rubrics associated with the input. Refers to a key in the rubricGroups map of EvaluationInstance.
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics using this specification.
predefinedRubricGenerationSpec
object (PredefinedMetricSpec)
Dynamically generate rubrics using a predefined spec.
metricPromptTemplate
string
Required. Template for the prompt sent to the judge model.
systemInstruction
string
Optional. System instructions for the judge model.
judgeAutoraterConfig
object (AutoraterConfig)
Optional. Configuration for the judge LLM (Autorater).
additionalConfig
object (Struct format)
Optional. Additional configuration for the metric.
JSON representation

{
  // Union field rubrics_source can be only one of the following:
  "rubricGroupKey": string,
  "rubricGenerationSpec": { object (RubricGenerationSpec) },
  "predefinedRubricGenerationSpec": { object (PredefinedMetricSpec) },
  // End of list of possible types for union field rubrics_source.
  "metricPromptTemplate": string,
  "systemInstruction": string,
  "judgeAutoraterConfig": { object (AutoraterConfig) },
  "additionalConfig": { object }
}
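A sketch of an LLM-based metric that generates rubrics dynamically and uses a custom judge model might look like the following; the templates and model path are placeholders, and judgeAutoraterConfig is the field name assumed above:

{
  "metricPromptTemplate": "Evaluate the response against each rubric.\nRubrics: {rubrics}\nResponse: {response}",
  "systemInstruction": "You are a strict, impartial grader.",
  "rubricGenerationSpec": {
    "rubricContentType": "NL_QUESTION_ANSWER"
  },
  "judgeAutoraterConfig": {
    "autoraterModel": "projects/my-project/locations/us-central1/publishers/google/models/example-model",
    "sampleCount": 4
  }
}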
EvaluationRubricConfig
Configuration for a rubric group to be generated/saved for evaluation.
rubricGroupKey
string
Required. The key used to save the generated rubrics. If a generation spec is provided, this key will be used for the name of the generated rubric group. Otherwise, this key will be used to look up the existing rubric group on the evaluation item. Note that if a rubric group key is specified on both a rubric config and an evaluation metric, the key from the metric will be used to select the rubrics for evaluation.
generation_config
Union type
generation_config
can be only one of the following:
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics using this specification.
predefinedRubricGenerationSpec
object (PredefinedMetricSpec)
Dynamically generate rubrics using a predefined spec.
JSON representation

{
  "rubricGroupKey": string,
  // Union field generation_config can be only one of the following:
  "rubricGenerationSpec": { object (RubricGenerationSpec) },
  "predefinedRubricGenerationSpec": { object (PredefinedMetricSpec) }
  // End of list of possible types for union field generation_config.
}
OutputConfig
The output config for the evaluation run.
bigqueryDestination
object (BigQueryDestination)
BigQuery destination for evaluation output.
gcsDestination
object (GcsDestination)
Cloud Storage destination for evaluation output.
JSON representation

{
  "bigqueryDestination": { object (BigQueryDestination) },
  "gcsDestination": { object (GcsDestination) }
}
BigQueryDestination
The BigQuery location for the output content.
outputUri
string
Required. BigQuery URI to a project or table, up to 2000 characters long.
When only the project is specified, the dataset and table are created. When the full table reference is specified, the dataset must exist and the table must not exist.
Accepted forms:
- BigQuery path. For example: bq://projectId or bq://projectId.bqDatasetId or bq://projectId.bqDatasetId.bqTableId.
JSON representation

{ "outputUri": string }
PromptTemplate
Prompt template used for inference.
source
Union type
source
can be only one of the following:
promptTemplate
string
Inline prompt template. Template variables should be in the format "{var_name}". Example: "Translate the following from {source_lang} to {target_lang}: {text}"
gcsUri
string
Prompt template stored in Cloud Storage. Format: "gs://my-bucket/file-name.txt".
JSON representation

{
  // Union field source can be only one of the following:
  "promptTemplate": string,
  "gcsUri": string
  // End of list of possible types for union field source.
}
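For illustration, the two source variants look like this, reusing the examples from the field descriptions above:

// Inline template:
{ "promptTemplate": "Translate the following from {source_lang} to {target_lang}: {text}" }

// Template stored in Cloud Storage:
{ "gcsUri": "gs://my-bucket/file-name.txt" }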
State
The state of the evaluation run.
| Enums | |
|---|---|
| STATE_UNSPECIFIED | Unspecified state. |
| PENDING | The evaluation run is pending. |
| RUNNING | The evaluation run is running. |
| SUCCEEDED | The evaluation run has succeeded. |
| FAILED | The evaluation run has failed. |
| CANCELLED | The evaluation run has been cancelled. |
| INFERENCE | The evaluation run is performing inference. |
| GENERATING_RUBRICS | The evaluation run is performing rubric generation. |
EvaluationResults
The results of the evaluation run.
summaryMetrics
object (SummaryMetrics)
Optional. The summary metrics for the evaluation run.
evaluationSet
string
The evaluation set where item level results are stored.
JSON representation

{
  "summaryMetrics": { object (SummaryMetrics) },
  "evaluationSet": string
}
SummaryMetrics
The summary metrics for the evaluation run.
metrics
map (key: string, value: value (Value format))
Optional. Map of metric name to metric value.
totalItems
integer
Optional. The total number of items that were evaluated.
failedItems
integer
Optional. The number of items that failed to be evaluated.
JSON representation

{ "metrics": { string: value, ... }, "totalItems": integer, "failedItems": integer }
| Methods | |
|---|---|
| cancel | Cancels an Evaluation Run. |
| create | Creates an Evaluation Run. |
| delete | Deletes an Evaluation Run. |
| get | Gets an Evaluation Run. |
| list | Lists Evaluation Runs. |