准备评估数据集

对于 Gen AI Evaluation Service，评估数据集通常包含要评估的模型回答、用于生成回答的输入数据，以及可能的标准答案回答。

评估数据集架构

对于基于模型的典型指标应用场景，您的数据集需要提供以下信息：

输入类型	输入字段内容
提示	生成式 AI 模型或应用的用户输入。在某些情况下，它是可选项。
回答	需要评估的 LLM 推理回答。
baseline_model_response（为成对指标所需要）	基准 LLM 推理回答，用于在成对评估中比较 LLM 回答

如果您使用 Vertex AI SDK for Python 的生成式 AI 评估模块，Gen AI Evaluation Service 可以使用您指定的模型自动生成 response 和 baseline_model_response。

对于其他评估应用场景，您可能需要提供更多信息：

多轮对话或聊天

输入类型	输入字段内容
历史记录	当前回合之前用户和模型之间的对话历史记录。
提示	在当前回合中生成式 AI 模型或应用的用户输入。
回答	需要评估的 LLM 推理回答，该回答基于历史记录和当前回合提示。
baseline_model_response（为成对指标所需要）	基准 LLM 推理回答，用于在成对评估中比较 LLM 回答（基于历史记录和当前回合提示）。

基于计算的指标

您的数据集需要同时提供大语言模型的回答和用于比较的参考答案。

输入类型	输入字段内容
回答	需要评估的 LLM 推理回答。
引用	与 LLM 回答进行比较的标准答案。

根据您的应用场景，您还可以将输入用户提示细分为精细的部分（例如 instruction 和 context），并通过提供提示模板来组合它们以进行推断。如果需要，您还可以提供参考或标准答案信息：

输入类型	输入字段内容
指令	输入用户提示的一部分。它是指发送到 LLM 的推理说明。例如，“请总结以下文本”就是一条指令。
上下文	在当前回合中生成式 AI 模型或应用的用户输入。
引用	与 LLM 回答进行比较的标准答案。

评估数据集的必需输入应与您的指标一致。如需详细了解自定义指标，请参阅定义评估指标和运行评估。如需详细了解如何在基于模型的指标中添加参考数据，请参阅根据输入数据调整指标提示模板。

导入评估数据集

您可以使用以下格式导入数据集：

存储在 Cloud Storage 中的 JSONL 或 CSV 文件
BigQuery 表
Pandas DataFrame

评估数据集示例

本部分展示了使用 Pandas Dataframe 格式的数据集示例。请注意，此处仅以多条数据记录为例，评估数据集通常包含 100 个或更多数据点。如需了解准备数据集的最佳实践，请参阅最佳做法部分。

基于模型的逐点指标

以下摘要案例演示了基于模型的逐点指标的示例数据集：

prompts = [
    # Example 1
    (
        "Summarize the text in one sentence: As part of a comprehensive"
        " initiative to tackle urban congestion and foster sustainable urban"
        " living, a major city has revealed ambitious plans for an extensive"
        " overhaul of its public transportation system. The project aims not"
        " only to improve the efficiency and reliability of public transit but"
        " also to reduce the city's carbon footprint and promote eco-friendly"
        " commuting options. City officials anticipate that this strategic"
        " investment will enhance accessibility for residents and visitors"
        " alike, ushering in a new era of efficient, environmentally conscious"
        " urban transportation."
    ),
    # Example 2
    (
        "Summarize the text such that a five-year-old can understand: A team of"
        " archaeologists has unearthed ancient artifacts shedding light on a"
        " previously unknown civilization. The findings challenge existing"
        " historical narratives and provide valuable insights into human"
        " history."
    ),
]

responses = [
    # Example 1
    (
        "A major city is revamping its public transportation system to fight"
        " congestion, reduce emissions, and make getting around greener and"
        " easier."
    ),
    # Example 2
    (
        "Some people who dig for old things found some very special tools and"
        " objects that tell us about people who lived a long, long time ago!"
        " What they found is like a new puzzle piece that helps us understand"
        " how people used to live."
    ),
]

eval_dataset = pd.DataFrame({
    "prompt": prompts,
    "response": responses,
})

基于模型的成对指标

以下示例展示了一个开放式问答场景，以演示基于模型的成对指标的示例数据集。

prompts = [
    # Example 1
    (
        "Based on the context provided, what is the hardest material? Context:"
        " Some might think that steel is the hardest material, or even"
        " titanium. However, diamond is actually the hardest material."
    ),
    # Example 2
    (
        "Based on the context provided, who directed The Godfather? Context:"
        " Mario Puzo and Francis Ford Coppola co-wrote the screenplay for The"
        " Godfather, and the latter directed it as well."
    ),
]

responses = [
    # Example 1
    "Diamond is the hardest material. It is harder than steel or titanium.",
    # Example 2
    "Francis Ford Coppola directed The Godfather.",
]

baseline_model_responses = [
    # Example 1
    "Steel is the hardest material.",
    # Example 2
    "John Smith.",
]

eval_dataset = pd.DataFrame(
  {
    "prompt":  prompts,
    "response":  responses,
    "baseline_model_response": baseline_model_responses,
  }
)

基于计算的指标

对于基于计算的指标，通常需要 reference。

eval_dataset = pd.DataFrame({
  "response": ["The Roman Senate was filled with exuberance due to Pompey's defeat in Asia."],
  "reference": ["The Roman Senate was filled with exuberance due to successes against Catiline."],
})

工具使用（函数调用）指标

以下示例展示了基于计算的工具使用指标的输入数据：

json_responses = ["""{
    "content": "",
    "tool_calls":[{
      "name":"get_movie_info",
      "arguments": {"movie":"Mission Impossible", "time": "today 7:30PM"}
    }]
  }"""]

json_references = ["""{
    "content": "",
    "tool_calls":[{
      "name":"book_tickets",
      "arguments":{"movie":"Mission Impossible", "time": "today 7:30PM"}
      }]
  }"""]

eval_dataset = pd.DataFrame({
    "response": json_responses,
    "reference": json_references,
})

最佳做法

定义评估数据集时，请遵循以下最佳做法：

提供代表输入类型的样本，您的模型在生产环境中会处理这些样本。
您的数据集必须至少包含一个评估样本。我们建议使用约 100 个样本，以确保获得高质量的汇总指标和具有统计显著性的结果。此规模有助于在汇总评估结果中建立更高的置信度，从而最大限度地减少离群值的影响，并确保性能指标反映了模型在不同场景中的真实能力。如果提供的示例超过 400 个，则聚合指标质量的提升往往会降低。

后续步骤

运行评估。
试用评估示例笔记本。

准备评估数据集 使用集合让一切井井有条 根据您的偏好保存内容并对其进行分类。

评估数据集架构

多轮对话或聊天

基于计算的指标

导入评估数据集

评估数据集示例

基于模型的逐点指标

基于模型的成对指标

基于计算的指标

工具使用（函数调用）指标

最佳做法

后续步骤

准备评估数据集