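Starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.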

Gen AI Evaluation Service API

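You can use the Gen AI evaluation service to evaluate your large language models (LLMs) across several metrics with your own criteria. You provide inference-time inputs, LLM responses, and additional parameters, and the Gen AI evaluation service returns metrics specific to the evaluation task.

Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and in-memory computed metrics, such as rouge, bleu, and tool function-call metrics. PointwiseMetric and PairwiseMetric are generic model-based metrics that you can customize with your own criteria. Because the service takes the prediction results directly from models as input, the evaluation service can perform both inference and subsequent evaluation on all models supported by Vertex AI.

For more information on evaluating a model, see Gen AI evaluation service overview.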

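Limitations

The following are limitations of the evaluation service:

The evaluation service may have a propagation delay on your first call.
Most model-based metrics consume gemini-2.0-flash quota, because the Gen AI evaluation service uses gemini-2.0-flash as the underlying judge model to compute these model-based metrics.
Some model-based metrics, such as MetricX and COMET, use different machine learning models, so they don't consume gemini-2.0-flash quota.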
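
Because the first call may be affected by a propagation delay, a caller might wrap the request in a small retry loop. The following is a minimal sketch under that assumption, not part of the service's API: `call_with_retry`, its parameters, and the decision to retry on any exception are hypothetical choices made for illustration.

```python
import time


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds, waiting longer after each failure.

    All names here are hypothetical. `send` is any zero-argument callable
    that raises an exception when the request fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

In practice you would likely retry only on errors that indicate a transient condition, and pass a function that performs the actual `evaluateInstances` call as `send`.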

Example syntax

Syntax to send an evaluation call.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances" \
  -d '{
  "pointwise_metric_input" : {
    "metric_spec" : {
      ...
    },
    "instance": {
      ...
    }
  }
}'

Python

import json

from google import auth
from google.api_core import exceptions
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

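# PROJECT_ID and LOCATION must be defined beforehand, for example:
# PROJECT_ID = "your-project-id"; LOCATION = "us-central1"
uri = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances'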
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))

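Parameter list

Parameters
`exact_match_input`	Optional: `ExactMatchInput`. Input to assess whether the prediction matches the reference exactly.
`bleu_input`	Optional: `BleuInput`. Input to compute the BLEU score by comparing the prediction against the reference.
`rouge_input`	Optional: `RougeInput`. Input to compute `rouge` scores by comparing the prediction against the reference. `rouge_type` selects among the supported `rouge` scores.
`fluency_input`	Optional: `FluencyInput`. Input to assess a single response's language mastery.
`coherence_input`	Optional: `CoherenceInput`. Input to assess a single response's ability to provide a coherent, easy-to-follow reply.
`safety_input`	Optional: `SafetyInput`. Input to assess a single response's level of safety.
`groundedness_input`	Optional: `GroundednessInput`. Input to assess a single response's ability to provide or reference information included in the input text.
`fulfillment_input`	Optional: `FulfillmentInput`. Input to assess a single response's ability to completely fulfill instructions.
`summarization_quality_input`	Optional: `SummarizationQualityInput`. Input to assess a single response's overall ability to summarize text.
`pairwise_summarization_quality_input`	Optional: `PairwiseSummarizationQualityInput`. Input to compare two responses' overall summarization quality.
`summarization_helpfulness_input`	Optional: `SummarizationHelpfulnessInput`. Input to assess a single response's ability to provide a summary that contains the details necessary to substitute for the original text.
`summarization_verbosity_input`	Optional: `SummarizationVerbosityInput`. Input to assess a single response's ability to provide a succinct summary.
`question_answering_quality_input`	Optional: `QuestionAnsweringQualityInput`. Input to assess a single response's overall ability to answer questions, given a body of text to reference.
`pairwise_question_answering_quality_input`	Optional: `PairwiseQuestionAnsweringQualityInput`. Input to compare two responses' overall ability to answer questions, given a body of text to reference.
`question_answering_relevance_input`	Optional: `QuestionAnsweringRelevanceInput`. Input to assess a single response's ability to respond with relevant information when asked a question.
`question_answering_helpfulness_input`	Optional: `QuestionAnsweringHelpfulnessInput`. Input to assess a single response's ability to provide key details when answering a question.
`question_answering_correctness_input`	Optional: `QuestionAnsweringCorrectnessInput`. Input to assess a single response's ability to correctly answer a question.
`pointwise_metric_input`	Optional: `PointwiseMetricInput`. Input for a generic pointwise evaluation.
`pairwise_metric_input`	Optional: `PairwiseMetricInput`. Input for a generic pairwise evaluation.
`tool_call_valid_input`	Optional: `ToolCallValidInput`. Input to assess a single response's ability to predict a valid tool call.
`tool_name_match_input`	Optional: `ToolNameMatchInput`. Input to assess a single response's ability to predict a tool call with the right tool name.
`tool_parameter_key_match_input`	Optional: `ToolParameterKeyMatchInput`. Input to assess a single response's ability to predict a tool call with correct parameter names.
`tool_parameter_kv_match_input`	Optional: `ToolParameterKVMatchInput`. Input to assess a single response's ability to predict a tool call with correct parameter names and values.
`comet_input`	Optional: `CometInput`. Input to evaluate using COMET.
`metricx_input`	Optional: `MetricxInput`. Input to evaluate using MetricX.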

`ExactMatchInput`

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

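Parameters
`metric_spec`	Optional: `ExactMatchSpec`. Metric spec, defining the metric's behavior.
`instances`	Optional: `ExactMatchInstance[]`. Evaluation input, consisting of LLM response and reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.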

`ExactMatchResults`

{
  "exact_match_results": {
    "exact_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`exact_match_metric_values`	`ExactMatchMetricValue[]`. Evaluation results per instance input.
`exact_match_metric_values.score`	`float`. One of the following: `0`: the instance was not an exact match; `1`: exact match.
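
The request and response shapes above can be handled with two small helpers. This is a sketch only: the function names are hypothetical, and `sample_response` is hand-written in the documented shape rather than real service output.

```python
import json


def build_exact_match_request(pairs):
    """Build an exact_match_input body from (prediction, reference) pairs."""
    return {
        "exact_match_input": {
            "metric_spec": {},
            "instances": [
                {"prediction": prediction, "reference": reference}
                for prediction, reference in pairs
            ],
        }
    }


def exact_match_scores(response):
    """Extract per-instance scores from an exact_match_results payload."""
    values = response["exact_match_results"]["exact_match_metric_values"]
    return [value["score"] for value in values]


body = build_exact_match_request([("Paris", "Paris"), ("Lyon", "Paris")])
print(json.dumps(body, indent=2))

# Hand-written response in the documented shape, not real service output.
sample_response = {
    "exact_match_results": {
        "exact_match_metric_values": [{"score": 1}, {"score": 0}]
    }
}
print(exact_match_scores(sample_response))
```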

`BleuInput`

{
  "bleu_input": {
    "metric_spec": {
      "use_effective_order": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

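Parameters
`metric_spec`	Optional: `BleuSpec`. Metric spec, defining the metric's behavior.
`metric_spec.use_effective_order`	Optional: `bool`. Whether to take into account n-gram orders without any match.
`instances`	Optional: `BleuInstance[]`. Evaluation input, consisting of LLM response and reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.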

`BleuResults`

{
  "bleu_results": {
    "bleu_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`bleu_metric_values`	`BleuMetricValue[]`. Evaluation results per instance input.
`bleu_metric_values.score`	`float`: `[0, 1]`, where higher scores mean the prediction is closer to the reference.

`RougeInput`

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

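Parameters
`metric_spec`	Optional: `RougeSpec`. Metric spec, defining the metric's behavior.
`metric_spec.rouge_type`	Optional: `string`. Acceptable values: `rougen[1-9]`: compute `rouge` scores based on the overlap of n-grams between the prediction and the reference. `rougeL`: compute `rouge` scores based on the longest common subsequence (LCS) between the prediction and the reference. `rougeLsum`: first split the prediction and the reference into sentences, then compute the LCS for each tuple. The final `rougeLsum` score is the average of these individual LCS scores.
`metric_spec.use_stemmer`	Optional: `bool`. Whether the Porter stemmer should be used to strip word suffixes to improve matching.
`metric_spec.split_summaries`	Optional: `bool`. Whether to add newlines between sentences for rougeLsum.
`instances`	Optional: `RougeInstance[]`. Evaluation input, consisting of LLM response and reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.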
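
To make the LCS idea behind `rougeL` concrete, here is a small local illustration. It is not the service's implementation: whitespace tokenization, the absence of stemming, and the use of an F1 combination of precision and recall are all assumptions made for this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """LCS-based F1 over whitespace tokens (illustrative only)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# The LCS of the two token lists is "a c d", so precision = recall = 0.75.
print(rouge_l_f1("a b c d", "a c d e"))  # 0.75
```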

`RougeResults`

{
  "rouge_results": {
    "rouge_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`rouge_metric_values`	`RougeValue[]`. Evaluation results per instance input.
`rouge_metric_values.score`	`float`: `[0, 1]`, where higher scores mean the prediction is closer to the reference.

`FluencyInput`

{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `FluencySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `FluencyInstance`. Evaluation input, consisting of LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`FluencyResult`

{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

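Output
`score`	`float`. One of the following: `1`: Inarticulate; `2`: Somewhat inarticulate; `3`: Neutral; `4`: Somewhat fluent; `5`: Fluent.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.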
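
A result in this shape can be turned into a readable label on the client side. The sketch below assumes the five-point scale listed above; `describe_fluency` and the `min_confidence` threshold are hypothetical and chosen only for illustration.

```python
FLUENCY_LABELS = {
    1: "Inarticulate",
    2: "Somewhat inarticulate",
    3: "Neutral",
    4: "Somewhat fluent",
    5: "Fluent",
}


def describe_fluency(response, min_confidence=0.5):
    """Summarize a fluency_result payload as (label, trusted, explanation).

    min_confidence is a hypothetical threshold, not a service parameter.
    """
    result = response["fluency_result"]
    label = FLUENCY_LABELS[int(result["score"])]
    trusted = result["confidence"] >= min_confidence
    return label, trusted, result["explanation"]
```

The same pattern applies to the other model-based results that return `score`, `explanation`, and `confidence`, with a label table matching each metric's scale.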

`CoherenceInput`

{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `CoherenceSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `CoherenceInstance`. Evaluation input, consisting of LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`CoherenceResult`

{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

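Output
`score`	`float`. One of the following: `1`: Incoherent; `2`: Somewhat incoherent; `3`: Neutral; `4`: Somewhat coherent; `5`: Coherent.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.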

`SafetyInput`

{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SafetySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SafetyInstance`. Evaluation input, consisting of LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`SafetyResult`

{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Unsafe; `1`: Safe.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`GroundednessInput`

{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}

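Parameters	Description
`metric_spec`	Optional: `GroundednessSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `GroundednessInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.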

`GroundednessResult`

{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Ungrounded; `1`: Grounded.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`FulfillmentInput`

{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}

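Parameters
`metric_spec`	Optional: `FulfillmentSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `FulfillmentInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.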

`FulfillmentResult`

{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

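Output
`score`	`float`. One of the following: `1`: No fulfillment; `2`: Poor fulfillment; `3`: Some fulfillment; `4`: Good fulfillment; `5`: Complete fulfillment.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.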

`SummarizationQualityInput`

{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

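Parameters
`metric_spec`	Optional: `SummarizationQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationQualityInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.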

`SummarizationQualityResult`

{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Very bad; `2`: Bad; `3`: OK; `4`: Good; `5`: Very good.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`PairwiseSummarizationQualityInput`

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

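Parameters
`metric_spec`	Optional: `PairwiseSummarizationQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `PairwiseSummarizationQualityInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.baseline_prediction`	Optional: `string`. Baseline model LLM response.
`instance.prediction`	Optional: `string`. Candidate model LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.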

`PairwiseSummarizationQualityResult`

{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

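Output
`pairwise_choice`	`PairwiseChoice`. Enum with the following possible values: `BASELINE`: the baseline prediction is better; `CANDIDATE`: the candidate prediction is better; `TIE`: tie between the baseline and candidate predictions.
`explanation`	`string`. Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.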
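
When many pairs are compared, the individual `pairwise_choice` values are usually aggregated. The following sketch tallies a list of responses in the shape shown above; `candidate_win_rate` is a hypothetical helper, and excluding ties from the denominator is a design choice of this example rather than something the service defines.

```python
from collections import Counter


def candidate_win_rate(responses):
    """Share of non-tie comparisons won by the candidate, plus raw counts."""
    choices = Counter(
        r["pairwise_summarization_quality_result"]["pairwise_choice"]
        for r in responses
    )
    decided = choices["BASELINE"] + choices["CANDIDATE"]
    # None signals that every comparison was a tie (or the list was empty).
    rate = choices["CANDIDATE"] / decided if decided else None
    return rate, dict(choices)
```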

`SummarizationHelpfulnessInput`

{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

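Parameters
`metric_spec`	Optional: `SummarizationHelpfulnessSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationHelpfulnessInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.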

`SummarizationHelpfulnessResult`

{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

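Output
`score`	`float`. One of the following: `1`: Unhelpful; `2`: Somewhat unhelpful; `3`: Neutral; `4`: Somewhat helpful; `5`: Helpful.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.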

`SummarizationVerbosityInput`

{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

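Parameters
`metric_spec`	Optional: `SummarizationVerbositySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationVerbosityInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.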

`SummarizationVerbosityResult`

{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

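Output
`score`	`float`. One of the following: `-2`: Terse; `-1`: Somewhat terse; `0`: Optimal; `1`: Somewhat verbose; `2`: Verbose.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.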

`QuestionAnsweringQualityInput`

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

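Parameters
`metric_spec`	Optional: `QuestionAnsweringQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringQualityInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.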

`QuestionAnsweringQualityResult`

{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Very bad; `2`: Bad; `3`: OK; `4`: Good; `5`: Very good.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`PairwiseQuestionAnsweringQualityInput`

{
  "pairwise_question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

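Parameters
`metric_spec`	Optional: `PairwiseQuestionAnsweringQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `PairwiseQuestionAnsweringQualityInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.baseline_prediction`	Optional: `string`. Baseline model LLM response.
`instance.prediction`	Optional: `string`. Candidate model LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.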

`PairwiseQuestionAnsweringQualityResult`

{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

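Output
`pairwise_choice`	`PairwiseChoice`. Enum with the following possible values: `BASELINE`: the baseline prediction is better; `CANDIDATE`: the candidate prediction is better; `TIE`: tie between the baseline and candidate predictions.
`explanation`	`string`. Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.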

`QuestionAnsweringRelevanceInput`

{
  "question_answering_relevance_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

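Parameters
`metric_spec`	Optional: `QuestionAnsweringRelevanceSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringRelevanceInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.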

`QuestionAnsweringRelevancyResult`

{
  "question_answering_relevancy_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

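Output
`score`	`float`. One of the following: `1`: Irrelevant; `2`: Somewhat irrelevant; `3`: Neutral; `4`: Somewhat relevant; `5`: Relevant.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.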

`QuestionAnsweringHelpfulnessInput`

{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

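Parameters
`metric_spec`	Optional: `QuestionAnsweringHelpfulnessSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringHelpfulnessInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.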

`QuestionAnsweringHelpfulnessResult`

{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

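Output
`score`	`float`. One of the following: `1`: Unhelpful; `2`: Somewhat unhelpful; `3`: Neutral; `4`: Somewhat helpful; `5`: Helpful.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.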

`QuestionAnsweringCorrectnessInput`

{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}

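Parameters
`metric_spec`	Optional: `QuestionAnsweringCorrectnessSpec`. Metric spec, defining the metric's behavior.
`metric_spec.use_reference`	Optional: `bool`. Whether the reference is used in the evaluation.
`instance`	Optional: `QuestionAnsweringCorrectnessInstance`. Evaluation input, consisting of inference inputs and corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.reference`	Optional: `string`. Golden LLM response for reference.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.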

`QuestionAnsweringCorrectnessResult`

{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Incorrect; `1`: Correct.
`explanation`	`string`. Justification for score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`PointwiseMetricInput`

{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

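Parameters
`metric_spec`	Required: `PointwiseMetricSpec`. Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string`. A prompt template defining the metric. It is rendered with the key-value pairs in `instance.json_instance`.
`instance`	Required: `PointwiseMetricInstance`. Evaluation input, consisting of `json_instance`.
`instance.json_instance`	Optional: `string`. The key-value pairs in JSON format. For example: {"key_1": "value_1", "key_2": "value_2"}. Used to render `metric_spec.metric_prompt_template`.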
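
Note that `json_instance` is a string containing JSON, not a nested object. The sketch below builds such a request body with `json.dumps`. The helper name, the template text, and the `{response}` placeholder syntax are assumptions for illustration; the local `str.format` preview only approximates how a template might be rendered and is not the service's rendering logic.

```python
import json


def build_pointwise_request(template, fields):
    """Build a pointwise_metric_input body; json_instance is a JSON string."""
    return {
        "pointwise_metric_input": {
            "metric_spec": {"metric_prompt_template": template},
            "instance": {"json_instance": json.dumps(fields)},
        }
    }


# Hypothetical template and fields, for illustration only.
template = "Rate the politeness of this reply from 1 to 5: {response}"
fields = {"response": "Thanks for reaching out, happy to help!"}

request_body = build_pointwise_request(template, fields)

# Local preview of the rendered prompt; the service does its own rendering.
preview = template.format(**fields)
print(preview)
```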

`PointwiseMetricResult`

{
  "pointwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`. A score for the pointwise metric evaluation result.
`explanation`	`string`. Justification for score assignment.

`PairwiseMetricInput`

{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

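Parameters
`metric_spec`	Required: `PairwiseMetricSpec`. Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string`. A prompt template defining the metric. It is rendered with the key-value pairs in `instance.json_instance`.
`instance`	Required: `PairwiseMetricInstance`. Evaluation input, consisting of `json_instance`.
`instance.json_instance`	Optional: `string`. The key-value pairs in JSON format. For example: {"key_1": "value_1", "key_2": "value_2"}. Used to render `metric_spec.metric_prompt_template`.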

`PairwiseMetricResult`

{
  "pairwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`. A score for the pairwise metric evaluation result.
`explanation`	`string`. Justification for score assignment.

`ToolCallValidInput`

{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	Optional: `ToolCallValidSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolCallValidInstance`. Evaluation input, consisting of LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized list of tool calls. An example is: { "content": "", "tool_calls": [ { "name": "book_tickets", "arguments": { "movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2" } } ] }
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.
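
Since both `prediction` and `reference` are JSON-serialized strings, it can help to build them from Python objects rather than by hand. This is a minimal sketch following the example format above; `serialize_tool_calls` is a hypothetical helper name.

```python
import json


def serialize_tool_calls(tool_calls, content=""):
    """Serialize model output into a string with content and tool_calls keys."""
    return json.dumps({"content": content, "tool_calls": tool_calls})


call = {
    "name": "book_tickets",
    "arguments": {
        "movie": "Mission Impossible Dead Reckoning Part 1",
        "num_tix": "2",
    },
}

prediction = serialize_tool_calls([call])
reference = serialize_tool_calls([call])

request_body = {
    "tool_call_valid_input": {
        "metric_spec": {},
        "instance": {"prediction": prediction, "reference": reference},
    }
}
```

The same two strings can be reused for the other tool-call metrics by changing the top-level key of the request body.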

`ToolCallValidResults`

{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_call_valid_metric_values`	Repeated `ToolCallValidMetricValue`. Evaluation results per instance input.
`tool_call_valid_metric_values.score`	`float`. One of the following: `0`: invalid tool call; `1`: valid tool call.

`ToolNameMatchInput`

{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

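Parameters
`metric_spec`	Optional: `ToolNameMatchSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolNameMatchInstance`. Evaluation input, consisting of LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.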

`ToolNameMatchResults`

{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

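Output
`tool_name_match_metric_values`	Repeated `ToolNameMatchMetricValue`. Evaluation results per instance input.
`tool_name_match_metric_values.score`	`float`. One of the following: `0`: the tool call name doesn't match the reference; `1`: the tool call name matches the reference.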

`ToolParameterKeyMatchInput`

{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

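Parameters
`metric_spec`	Optional: `ToolParameterKeyMatchSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolParameterKeyMatchInstance`. Evaluation input, consisting of LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.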

`ToolParameterKeyMatchResults`

{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_parameter_key_match_metric_values`	Repeated `ToolParameterKeyMatchMetricValue`. Evaluation results per instance input.
`tool_parameter_key_match_metric_values.score`	`float`: `[0, 1]`, where higher scores mean more parameters match the reference parameters' names.

`ToolParameterKVMatchInput`

{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

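Parameters
`metric_spec`	Optional: `ToolParameterKVMatchSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolParameterKVMatchInstance`. Evaluation input, consisting of LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.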

`ToolParameterKVMatchResults`

{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

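Output
`tool_parameter_kv_match_metric_values`	Repeated `ToolParameterKVMatchMetricValue`. Evaluation results per instance input.
`tool_parameter_kv_match_metric_values.score`	`float`: `[0, 1]`, where higher scores mean more parameters match the reference parameters' names and values.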

`CometInput`

{
  "comet_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

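Parameters
`metric_spec`	Optional: `CometSpec`. Metric spec, defining the metric's behavior.
`metric_spec.version`	Optional: `string`. `COMET_22_SRC_REF`: COMET 22 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string`. Source language in BCP-47 format. For example, "es".
`metric_spec.target_language`	Optional: `string`. Target language in BCP-47 format. For example, "es".
`instance`	Optional: `CometInstance`. Evaluation input, consisting of LLM response and reference. The exact fields used for evaluation depend on the COMET version.
`instance.prediction`	Optional: `string`. Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string`. Source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string`. Ground truth used to compare against the prediction. This is in the same language as the prediction.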

`CometResult`

{
  "comet_result" : {
    "score": float
  }
}

Output
`score`	`float`: `[0, 1]`, where 1 represents a perfect translation.

`MetricxInput`

{
  "metricx_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

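Parameters
`metric_spec`	Optional: `MetricxSpec`. Metric spec, defining the metric's behavior.
`metric_spec.version`	Optional: `string`. One of the following: `METRICX_24_REF`: MetricX 24 for translation and reference. It evaluates the prediction (translation) by comparing it with the provided reference text. `METRICX_24_SRC`: MetricX 24 for translation and source. It evaluates the translation (prediction) by Quality Estimation (QE), without a reference text input. `METRICX_24_SRC_REF`: MetricX 24 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string`. Source language in BCP-47 format. For example, "es".
`metric_spec.target_language`	Optional: `string`. Target language in BCP-47 format. For example, "es".
`instance`	Optional: `MetricxInstance`. Evaluation input, consisting of LLM response and reference. The exact fields used for evaluation depend on the MetricX version.
`instance.prediction`	Optional: `string`. Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string`. Source text of the prediction, in the original language that it was translated from.
`instance.reference`	Optional: `string`. Ground truth used to compare against the prediction. This is in the same language as the prediction.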
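
Because the fields that matter depend on the MetricX version, a client can check them before sending a request. The mapping below is inferred from the version descriptions above and is an assumption of this sketch, as is the helper name `build_metricx_request`; the service itself may accept or ignore additional fields.

```python
# Fields each version is described as using (inferred, not an official list).
REQUIRED_FIELDS = {
    "METRICX_24_REF": ("prediction", "reference"),
    "METRICX_24_SRC": ("prediction", "source"),
    "METRICX_24_SRC_REF": ("prediction", "source", "reference"),
}


def build_metricx_request(version, **instance):
    """Build a metricx_input body, checking the fields implied by the version."""
    missing = [f for f in REQUIRED_FIELDS[version] if not instance.get(f)]
    if missing:
        raise ValueError(f"{version} needs: {', '.join(missing)}")
    return {
        "metricx_input": {
            "metric_spec": {"version": version},
            # Only send the fields that the chosen version is described as using.
            "instance": {f: instance[f] for f in REQUIRED_FIELDS[version]},
        }
    }
```

For example, `build_metricx_request("METRICX_24_SRC", prediction="Hola", source="Hello")` returns a body without a `reference` field, while requesting `METRICX_24_REF` without a reference raises `ValueError`.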

`MetricxResult`

{
  "metricx_result" : {
    "score": float
  }
}

Output
`score`	`float`: `[0, 25]`, where 0 represents a perfect translation.
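
MetricX scores run in the opposite direction from COMET scores: lower is better, on a `[0, 25]` scale. If you want a "higher is better" number for display, one option is a simple linear rescale. This is only a presentation convenience and an assumption of this sketch; it does not make MetricX and COMET scores comparable.

```python
def metricx_to_quality(score):
    """Map a MetricX score in [0, 25] (0 is best) to [0, 1] (1 is best)."""
    if not 0 <= score <= 25:
        raise ValueError("MetricX score must be in [0, 25]")
    return 1 - score / 25
```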

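Examples

Evaluate an output

The following example demonstrates how to call the Gen AI Evaluation API to evaluate the output of an LLM using a variety of evaluation metrics, including the following: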

summarization_quality
groundedness
fulfillment
summarization_helpfulness
summarization_verbosity

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")

for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count:      2
# summarization_quality/mean:     3.5
# summarization_quality/std:      2.1213203435596424
# ...

Go

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	// evaluate the pre-generated model response against the reference (ground truth)
	responseToEvaluate := `
The city is undertaking a major project to revamp its public transportation system.
This initiative is designed to improve efficiency, reduce carbon emissions, and promote
eco-friendly commuting. The city expects that this investment will enhance accessibility
and usher in a new era of sustainable urban transportation.
`
	reference := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		// Check the API reference for a full list of supported metric inputs:
		// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
			GroundednessInput: &aiplatformpb.GroundednessInput{
				MetricSpec: &aiplatformpb.GroundednessSpec{},
				Instance: &aiplatformpb.GroundednessInstance{
					Context:    &reference,
					Prediction: &responseToEvaluate,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetGroundednessResult()
	fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// score: 1.00
	// confidence: 1.00
	// explanation:
	// STEP 1: All aspects of the response are found in the context.
	// The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
	// STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.

	return nil
}

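Evaluate an output: pairwise summarization quality

The following example shows how to call the Gen AI evaluation service API to evaluate the output of an LLM using a pairwise summarization quality comparison.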

REST

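Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: The LLM response.
BASELINE_PREDICTION: The baseline model LLM response.
INSTRUCTION: The instruction used at inference time.
CONTEXT: Inference-time text containing all relevant information that can be used in the LLM response.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body: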

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "PREDICTION",
      "baseline_prediction": "BASELINE_PREDICTION",
      "instruction": "INSTRUCTION",
      "context": "CONTEXT"
    }
  }
}
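
The request body can also be generated programmatically, which avoids hand-editing mistakes such as trailing commas. The following is a minimal sketch that uses only the Python standard library. The helper names (build_pairwise_summarization_request, evaluate_instances_url, write_request_file) are hypothetical and are not part of any Google SDK; the field names and URL pattern are taken from the REST example on this page.

```python
import json


def build_pairwise_summarization_request(
    prediction: str,
    baseline_prediction: str,
    instruction: str,
    context: str,
) -> dict:
    """Returns a request body for pairwise_summarization_quality_input."""
    return {
        "pairwise_summarization_quality_input": {
            "metric_spec": {},
            "instance": {
                "prediction": prediction,
                "baseline_prediction": baseline_prediction,
                "instruction": instruction,
                "context": context,
            },
        }
    }


def evaluate_instances_url(project_id: str, location: str) -> str:
    """Formats the evaluateInstances URL for a project and region."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project_id}/locations/{location}:evaluateInstances"
    )


def write_request_file(body: dict, path: str = "request.json") -> None:
    """Serializes the body as strict JSON so it can be passed to curl."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(body, f, indent=2, ensure_ascii=False)
```

Calling write_request_file(build_pairwise_summarization_request(...)) produces a request.json file that can be sent with the curl or PowerShell commands in this section.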

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

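Note: The following command assumes that you have logged in to the gcloud CLI by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: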

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

Vertex AI SDK for Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

prompt = """
Summarize the text such that a five-year-old can understand.

# Text

As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city\'s carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""

eval_dataset = pd.DataFrame({"prompt": [prompt]})

# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")

# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
    "gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)

prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_summarization_quality"
)

summarization_quality_metric = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=prompt_template,
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[summarization_quality_metric],
    experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)

baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
    "pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
    "pairwise_summarization_quality/explanation"
].iloc[0]

print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...
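
When the evaluation dataset contains more than one row, the per-row choices are often summarized as win rates. The following is a minimal sketch of that tally in plain Python. The win_rates helper is hypothetical and is not part of the Vertex AI SDK; it treats the choice labels as opaque strings, so labels such as BASELINE and CANDIDATE (the ones shown in the sample outputs on this page) and any others are counted the same way.

```python
from collections import Counter


def win_rates(choices):
    """Returns the fraction of rows won by each label in `choices`."""
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        # No rows were evaluated, so there is nothing to report.
        return {}
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    # Illustrative labels only; these are not real evaluation results.
    sample = ["CANDIDATE", "BASELINE", "CANDIDATE", "CANDIDATE"]
    for label, rate in win_rates(sample).items():
        print(f"{label}: {rate:.2f}")
```

With the SDK sample above, the input could be taken from the metrics table, for example result.metrics_table["pairwise_summarization_quality/pairwise_choice"].tolist().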

Go

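Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.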

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	context := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	instruction := "Summarize the text such that a five-year-old can understand."
	baselineResponse := `
The city wants to make it easier for people to get around without using cars.
They're going to make the buses and trains better and faster, so people will want to
use them more. This will help the air be cleaner and make the city a better place to live.
`
	candidateResponse := `
The city is making big changes to how people get around. They want to make the buses and
trains work better and be easier for everyone to use. This will also help the environment
by getting people to use less gas. The city thinks these changes will make it easier for
everyone to get where they need to go.
`

	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
			PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
				MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
				Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
					Context:            &context,
					Instruction:        &instruction,
					Prediction:         &candidateResponse,
					BaselinePrediction: &baselineResponse,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetPairwiseSummarizationQualityResult()
	fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// choice: BASELINE
	// confidence: 0.50
	// explanation:
	// BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...

	return nil
}

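Get the ROUGE score

The following example calls the Gen AI evaluation service API to get the ROUGE score of predictions generated from multiple inputs. The ROUGE input uses metric_spec, which determines the metric's behavior.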
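
To give a feel for what a ROUGE score measures, the following is a minimal sketch of ROUGE-1 as a unigram-overlap F-measure. It is an illustration only, not the service's implementation: it assumes lowercase alphanumeric tokenization with no stemming, so its scores are not expected to match the values returned by the API. The function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercases the text and splits it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_1(prediction, reference):
    """Computes unigram-overlap precision, recall, and F1."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Clipped overlap: each token counts at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_1("the cat sat", "the cat sat on the mat")
    print(scores)
```

In this example every predicted token appears in the reference (precision 1.0), but only half of the reference tokens are covered (recall 0.5), which is why the F1 value lands between the two.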

REST

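Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: The LLM response.
REFERENCE: The golden LLM response for reference.
ROUGE_TYPE: The calculation used to determine the ROUGE score. For acceptable values, see metric_spec.rouge_type.
USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see metric_spec.use_stemmer.
SPLIT_SUMMARIES: Determines whether new lines are added between rougeLsum sentences. For acceptable values, see metric_spec.split_summaries.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body: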

{
  "rouge_input": {
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ],
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    }
  }
}
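
Because a ROUGE request can carry several prediction/reference pairs, it can be convenient to generate the body from a list. The following is a minimal sketch that uses only the Python standard library. The build_rouge_request helper is hypothetical; it assumes that instances is a list (matching the Instances slice in the Go sample on this page) and that use_stemmer and split_summaries are JSON booleans. The default "rouge1" is the value used in the Go sample.

```python
import json


def build_rouge_request(
    pairs,
    rouge_type="rouge1",
    use_stemmer=False,
    split_summaries=False,
):
    """Builds a rouge_input body from (prediction, reference) pairs."""
    instances = [
        {"prediction": prediction, "reference": reference}
        for prediction, reference in pairs
    ]
    return {
        "rouge_input": {
            "instances": instances,
            "metric_spec": {
                "rouge_type": rouge_type,
                # Python booleans serialize to JSON true/false.
                "use_stemmer": bool(use_stemmer),
                "split_summaries": bool(split_summaries),
            },
        }
    }


if __name__ == "__main__":
    body = build_rouge_request(
        [("first prediction", "first reference"),
         ("second prediction", "second reference")],
        use_stemmer=True,
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be saved as request.json and sent with the commands that follow.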

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

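Note: The following command assumes that you have logged in to the gcloud CLI by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: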

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

Vertex AI SDK for Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

reference_summarization = """
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""

# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
    {
        "response": [
            """The Great Barrier Reef, the world's largest coral reef system located
        in Australia, is a vast and diverse ecosystem. However, it faces serious
        threats from climate change, ocean acidification, and coral bleaching,
        endangering its rich marine life.""",
            """The Great Barrier Reef, a vast coral reef system off the coast of
        Queensland, Australia, is the world's largest. It's a complex ecosystem
        supporting diverse marine life, including endangered species. However,
        climate change, ocean acidification, and coral bleaching are serious
        threats to its survival.""",
            """The Great Barrier Reef, the world's largest coral reef system off the
        coast of Australia, is a vast and diverse ecosystem with thousands of
        reefs and islands. It is home to a multitude of marine life, including
        endangered species, but faces serious threats from climate change, ocean
        acidification, and coral bleaching.""",
        ],
        "reference": [reference_summarization] * 3,
    }
)
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_1",
        "rouge_2",
        "rouge_l",
        "rouge_l_sum",
    ],
)
result = eval_task.evaluate()

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count:      3
# rouge_1/mean:   0.7191161666666667
# rouge_1/std:    0.06765143922270488
# rouge_2/mean:   0.5441118566666666
# ...
# Metrics Table:
#
#                                        response                         reference  ...  rouge_l/score  rouge_l_sum/score
# 0  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.577320           0.639175
# 1  The Great Barrier Reef, a vast coral...  \n    The Great Barrier Reef, the ...  ...       0.552381           0.666667
# 2  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.774775           0.774775

Go

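Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.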

import (
	"context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	modelResponse := `
The Great Barrier Reef, the world's largest coral reef system located in Australia,
is a vast and diverse ecosystem. However, it faces serious threats from climate change,
ocean acidification, and coral bleaching, endangering its rich marine life.
`
	reference := `
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
			RougeInput: &aiplatformpb.RougeInput{
				// Check the API reference for the list of supported ROUGE metric types:
				// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
				MetricSpec: &aiplatformpb.RougeSpec{
					RougeType: "rouge1",
				},
				Instances: []*aiplatformpb.RougeInstance{
					{
						Prediction: &modelResponse,
						Reference:  &reference,
					},
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	fmt.Fprintln(w, "evaluation results:")
	fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
	// Example response:
	// [score:0.6597938]

	return nil
}

What's next

For detailed documentation, see Run an evaluation.
