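Starting April 29, 2025, the Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see "Model versions and lifecycle".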


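Evaluate a judge model

For model-based metrics, the Gen AI evaluation service evaluates your models with a foundation model, such as Gemini, that has been configured as a judge model. To learn more about the judge model, see the "Advanced judge model customization series", which describes additional tools for evaluating and configuring the judge model.

For the basic evaluation workflow, see the Gen AI evaluation service quickstart. The Advanced judge model customization series includes the following pages:

Evaluate a judge model (current page)
Prompting for judge model customization
Configure a judge model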

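Overview

Using human judges to evaluate large language models (LLMs) can be labor-intensive and time-consuming. Using a judge model is a more scalable way to evaluate LLMs. By default, the Gen AI evaluation service uses a configured Gemini 2.0 Flash model as the judge model, with customizable prompts for evaluating your model across various use cases.

The following sections show how to evaluate a customized judge model so that you can check whether it suits your use case.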

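Prepare the dataset

To evaluate model-based metrics, prepare an evaluation dataset with human ratings as the ground truth. The goal is to compare the scores from the model-based metrics with the human ratings and see whether the model-based metrics reach the quality that your use case requires.

For PointwiseMetric, prepare the {metric_name}/human_rating column in the dataset as the ground truth for the {metric_name}/score result generated by the model-based metric.
For PairwiseMetric, prepare the {metric_name}/human_pairwise_choice column in the dataset as the ground truth for the {metric_name}/pairwise_choice result generated by the model-based metric.

Use the following dataset schema: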

Model-based metric	Human rating column
`PointwiseMetric`	{metric_name}/human_rating
`PairwiseMetric`	{metric_name}/human_pairwise_choice
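
As a minimal sketch of the schema above, the following standard-library Python derives the human-rating column name for a metric and reports which rows of a dataset still lack a human rating. The helper names `human_rating_column` and `missing_human_ratings`, the metric name `fluency`, and the sample rows are hypothetical and are not part of the Vertex AI SDK; only the column-name patterns come from this page.

```python
# Hypothetical helpers: only the column-name patterns come from this page.
HUMAN_COLUMN_SUFFIX = {
    "pointwise": "human_rating",
    "pairwise": "human_pairwise_choice",
}


def human_rating_column(metric_name: str, metric_kind: str) -> str:
    """Return the ground-truth column name for a model-based metric."""
    try:
        suffix = HUMAN_COLUMN_SUFFIX[metric_kind]
    except KeyError:
        raise ValueError(f"unknown metric kind: {metric_kind!r}") from None
    return f"{metric_name}/{suffix}"


def missing_human_ratings(rows: list[dict], column: str) -> list[int]:
    """Return the indexes of rows that have no human rating in `column`."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


# Illustrative rows for a pointwise metric named "fluency".
rows = [
    {"prompt": "p1", "response": "r1", "fluency/human_rating": 1},
    {"prompt": "p2", "response": "r2"},  # human rating not collected yet
]

column = human_rating_column("fluency", "pointwise")
print(column)                               # fluency/human_rating
print(missing_human_ratings(rows, column))  # [1]
```

Running a check like this before the evaluation makes it easier to spot rows that cannot contribute to the comparison with human ratings.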

Available metrics

For a PointwiseMetric that returns only 2 scores (such as 0 and 1), and a PairwiseMetric that has only 2 preference types (Model A or Model B), the following metrics are available:

Metric	Calculation
2-class balanced accuracy	\( (1/2) * (True Positive Rate + True Negative Rate) \)
2-class balanced F1 score	\( ∑_{i=0,1} (cnt_i/sum) * f1(class_i) \)
Confusion matrix	Use the `confusion_matrix` and `confusion_matrix_labels` fields to calculate metrics such as the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).

For example, the following result:

confusion_matrix = [[20, 31, 15], [10, 11, 3], [3, 2, 2]]
confusion_matrix_labels = ['BASELINE', 'CANDIDATE', 'TIE']

translates to the following confusion matrix:

	BASELINE	CANDIDATE	TIE
BASELINE	20	31	15
CANDIDATE	10	11	3
TIE	3	2	2
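
The two-class rates named above can be sketched in plain Python. This sketch assumes that the rows of the confusion matrix hold the human (ground-truth) labels, the columns hold the judge model's labels, and index 1 is the positive class. This page does not state the orientation, so check it against your own results. The function name `binary_rates` and the counts are made up for illustration.

```python
def binary_rates(confusion_matrix: list[list[int]]) -> dict[str, float]:
    """Compute TPR, TNR, FPR, FNR, and balanced accuracy from a 2x2 matrix.

    Assumed layout (not confirmed by this page): rows are ground-truth
    labels, columns are predicted labels, index 0 is the negative class,
    and index 1 is the positive class.
    """
    (tn, fp), (fn, tp) = confusion_matrix
    positives = tp + fn  # ground-truth positives
    negatives = tn + fp  # ground-truth negatives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must appear in the ground truth")
    tpr = tp / positives
    tnr = tn / negatives
    return {
        "TPR": tpr,
        "TNR": tnr,
        "FPR": fp / negatives,
        "FNR": fn / positives,
        "balanced_accuracy": 0.5 * (tpr + tnr),
    }


# Made-up counts: 50 ground-truth negatives and 50 ground-truth positives.
rates = binary_rates([[40, 10], [5, 45]])
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```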

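For a PointwiseMetric that returns more than 2 scores (such as 1 through 5), and a PairwiseMetric that has more than 2 preference types (Model A, Model B, or Tie), the following metrics are available: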

Metric	Calculation
Multi-class balanced accuracy	\( (1/n) * ∑_{i=1...n} recall(class_i) \)
Multi-class balanced F1 score	\( ∑_{i=1...n} (cnt_i/sum) * f1(class_i) \)

Where:

\( f1 = 2 * precision * recall / (precision + recall) \)
- \( precision = True Positives / (True Positives + False Positives) \)
- \( recall = True Positives / (True Positives + False Negatives) \)
\( n \): number of classes
\( cnt_i \): number of \( class_i \) items in the ground truth data
\( sum \): number of elements in the ground truth data
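
The multi-class formulas above can be worked through on the example confusion matrix from this page. The sketch below is plain Python and rests on two assumptions that this page does not confirm: rows are ground-truth labels and columns are judge model labels, and a class with no ground-truth or predicted items gets a score of 0. The function names are illustrative.

```python
def per_class_scores(matrix: list[list[int]]) -> list[tuple[float, float, int]]:
    """Return (recall, f1, cnt) for each class of a square confusion matrix.

    Assumes rows are ground-truth labels and columns are predicted labels.
    """
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        cnt = sum(matrix[i])                       # ground-truth items in class i
        predicted = sum(row[i] for row in matrix)  # items predicted as class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / cnt if cnt else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((recall, f1, cnt))
    return scores


def balanced_accuracy(matrix: list[list[int]]) -> float:
    """(1/n) * sum of per-class recall."""
    scores = per_class_scores(matrix)
    return sum(recall for recall, _, _ in scores) / len(scores)


def balanced_f1(matrix: list[list[int]]) -> float:
    """Sum of (cnt_i / sum) * f1(class_i) over all classes."""
    scores = per_class_scores(matrix)
    total = sum(cnt for _, _, cnt in scores)
    return sum(cnt / total * f1 for _, f1, cnt in scores)


# Example matrix from this page, labels ['BASELINE', 'CANDIDATE', 'TIE'].
example = [[20, 31, 15], [10, 11, 3], [3, 2, 2]]
print(f"balanced accuracy: {balanced_accuracy(example):.3f}")
print(f"balanced F1:       {balanced_f1(example):.3f}")
```

Under the assumed orientation, the per-class recalls are 20/66, 11/24, and 2/7, so the balanced accuracy is their mean.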

To calculate other metrics, you can use open-source libraries.
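
One example of an additional metric is Cohen's kappa, which measures agreement between the judge model and the human raters while correcting for agreement expected by chance. scikit-learn provides it as `sklearn.metrics.cohen_kappa_score`. The sketch below re-implements the same formula with the standard library only, so it runs without extra dependencies. The function name `cohen_kappa` and the label lists are made up, and this metric is an illustration rather than something the Gen AI evaluation service reports.

```python
import math
from collections import Counter


def cohen_kappa(human: list[str], model: list[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equally long label lists."""
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    total = len(human)
    # p_o: observed fraction of items where both raters agree.
    observed = sum(h == m for h, m in zip(human, model)) / total
    # p_e: agreement expected if both raters labeled independently.
    human_counts = Counter(human)
    model_counts = Counter(model)
    expected = sum(
        human_counts[label] * model_counts[label] for label in human_counts
    ) / (total * total)
    if expected == 1.0:
        return math.nan  # undefined when both raters use one identical label
    return (observed - expected) / (1.0 - expected)


# Made-up pairwise choices from human raters and from the judge model.
human_choices = ["A", "A", "B", "B", "A", "B"]
model_choices = ["A", "B", "B", "B", "A", "A"]
print(f"kappa: {cohen_kappa(human_choices, model_choices):.3f}")
```

In this made-up sample the raters agree on 4 of 6 items and chance agreement is 0.5, so kappa is (4/6 - 0.5) / 0.5, or about 0.333.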

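Evaluate the model-based metric

The following example updates the model-based metric with a custom definition of fluency, then evaluates the quality of the metric.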

import pandas as pd

from vertexai.preview.evaluation import (
    EvalTask,
    PairwiseMetric,
)
from vertexai.preview.evaluation.autorater_utils import evaluate_autorater


# Step 1: Prepare the evaluation dataset with the human rating data column.
# PROMPT_*, RESPONSE_*, and BASELINE_MODEL_RESPONSE_* are placeholders
# for your own strings.
human_rated_dataset = pd.DataFrame({
    "prompt": [PROMPT_1, PROMPT_2],
    "response": [RESPONSE_1, RESPONSE_2],
    "baseline_model_response": [BASELINE_MODEL_RESPONSE_1, BASELINE_MODEL_RESPONSE_2],
    "pairwise_fluency/human_pairwise_choice": ["model_A", "model_B"],
})

# Step 2: Get the results from the model-based metric.
pairwise_fluency = PairwiseMetric(
    metric="pairwise_fluency",
    metric_prompt_template="please evaluate pairwise fluency..."
)

eval_result = EvalTask(
    dataset=human_rated_dataset,
    metrics=[pairwise_fluency],
).evaluate()

# Step 3: Calibrate the model-based metric results against human preferences.
# eval_result.metrics_table carries over the human ratings from human_rated_dataset.
evaluate_autorater_result = evaluate_autorater(
    evaluate_autorater_input=eval_result.metrics_table,
    eval_metrics=[pairwise_fluency],
)
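
To build intuition for what the calibration step compares, the following sketch tallies human choices against judge model choices into a labeled confusion matrix with the same shape as the `confusion_matrix` and `confusion_matrix_labels` fields described on this page. It is not the implementation of `evaluate_autorater`. The helper names, the row and column orientation (rows are human choices, columns are judge model choices), and the sample choices are assumptions made for illustration.

```python
def build_confusion_matrix(
    human: list[str], model: list[str], labels: list[str]
) -> list[list[int]]:
    """Count (human choice, judge model choice) pairs.

    Assumed orientation: rows are human choices, columns are model choices.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for h, m in zip(human, model):
        matrix[index[h]][index[m]] += 1
    return matrix


def format_confusion_matrix(matrix: list[list[int]], labels: list[str]) -> str:
    """Render the matrix as aligned text with row and column labels."""
    width = max(len(label) for label in labels) + 2
    lines = [" " * width + "".join(label.rjust(width) for label in labels)]
    for label, row in zip(labels, matrix):
        lines.append(label.ljust(width) + "".join(str(v).rjust(width) for v in row))
    return "\n".join(lines)


labels = ["BASELINE", "CANDIDATE", "TIE"]
# Made-up choices for four prompts.
human_choices = ["BASELINE", "CANDIDATE", "CANDIDATE", "TIE"]
model_choices = ["BASELINE", "BASELINE", "CANDIDATE", "TIE"]

matrix = build_confusion_matrix(human_choices, model_choices, labels)
print(format_confusion_matrix(matrix, labels))

# Raw agreement: diagonal entries divided by the number of rated items.
agreement = sum(matrix[i][i] for i in range(len(labels))) / len(human_choices)
print(f"agreement: {agreement:.2f}")
```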
