Quickstart: Gen AI Evaluation Service workflow

This page shows how to run a model-based evaluation with the Gen AI Evaluation Service using the Vertex AI SDK for Python.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

    In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

    Make sure that billing is enabled for your Google Cloud project.

  2. Install the Vertex AI SDK for Python with the Gen AI Evaluation Service dependencies:

    !pip install "google-cloud-aiplatform[evaluation]"
    
  3. Set up your credentials. If you are running this quickstart in Colaboratory, run the following:

    from google.colab import auth
    auth.authenticate_user()
    

    For other environments, see Authenticate to Vertex AI.

Import libraries

Import your libraries and set your project and location.

import pandas as pd

import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric, PointwiseMetricPromptTemplate
from google.cloud import aiplatform

PROJECT_ID = "PROJECT_ID"
LOCATION = "LOCATION"
EXPERIMENT_NAME = "EXPERIMENT_NAME"

vertexai.init(
    project=PROJECT_ID,
    location=LOCATION,
)

Note that EXPERIMENT_NAME can contain only lowercase alphanumeric characters and hyphens, and can be at most 127 characters long.
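The naming rule above can be checked locally before the experiment name is passed to the SDK. The following is a minimal sketch, not part of the SDK: the helper name `is_valid_experiment_name` is hypothetical, and the pattern encodes only the rule stated on this page (lowercase letters, digits, and hyphens, 1 to 127 characters). The service may apply further constraints that this check doesn't cover.

```python
import re

# Encodes only the rule stated above: lowercase letters, digits, and hyphens,
# between 1 and 127 characters. Assumption: the service may enforce more.
_EXPERIMENT_NAME_PATTERN = re.compile(r"[a-z0-9-]{1,127}")


def is_valid_experiment_name(name: str) -> bool:
    """Return True if `name` satisfies the documented naming rule."""
    return _EXPERIMENT_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_experiment_name("my-eval-experiment-01"))  # True
print(is_valid_experiment_name("My_Experiment"))  # False
```

Running a check like this before creating the evaluation task surfaces a bad name immediately, rather than after the request has been sent.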

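Set up evaluation metrics based on your criteria

The following metric definition evaluates the quality of text generated by a large language model against two criteria: Fluency and Entertaining. The code defines a metric named custom_text_quality that uses these two criteria: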

custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": (
                "Sentences flow smoothly and are easy to read, avoiding awkward"
                " phrasing or run-on sentences. Ideas and sentences connect"
                " logically, using transitions effectively where needed."
            ),
            "entertaining": (
                "Short, amusing text that incorporates emojis, exclamations and"
                " questions to convey quick and spontaneous communication and"
                " diversion."
            ),
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria",
            "-1": "The response falls short on both criteria",
        },
    ),
)

Prepare your dataset

Add the following code to prepare your dataset:

responses = [
    # An example of good custom_text_quality
    "Life is a rollercoaster, full of ups and downs, but it's the thrill that keeps us coming back for more!",
    # An example of medium custom_text_quality
    "The weather is nice today, not too hot, not too cold.",
    # An example of poor custom_text_quality
    "The weather is, you know, whatever.",
]

eval_dataset = pd.DataFrame({
    "response" : responses,
})
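The dataset here consists of a single `response` column, so each entry should be a non-empty string. The following is a minimal sketch of a pre-flight check on the raw list before it goes into the DataFrame. `find_invalid_responses` is a hypothetical helper for illustration, not an SDK function.

```python
def find_invalid_responses(responses):
    """Return the indexes of entries that are not non-empty strings."""
    return [
        index
        for index, text in enumerate(responses)
        if not isinstance(text, str) or not text.strip()
    ]


# Illustrative input: one usable entry followed by three unusable ones.
sample = ["A complete sentence.", "", None, "   "]
print(find_invalid_responses(sample))  # [1, 2, 3]
```

An empty list from this check means every entry is a usable string.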

Run an evaluation with your dataset

Run the evaluation:

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[custom_text_quality],
    experiment=EXPERIMENT_NAME
)

pointwise_result = eval_task.evaluate()

View the evaluation results for each response in the metrics_table pandas DataFrame:

pointwise_result.metrics_table
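For a quick overview of how the responses were rated, the scores can be tallied against the rubric values (1, 0, -1). The sketch below works on a plain list of scores. `tally_scores` is a hypothetical helper, and the column name in the final comment, `custom_text_quality/score`, is an assumption about how the result table names the score column, so confirm it against `pointwise_result.metrics_table.columns` before relying on it.

```python
from collections import Counter

# Labels mirror the rating rubric defined for custom_text_quality.
RUBRIC_LABELS = {1: "good", 0: "medium", -1: "poor"}


def tally_scores(scores):
    """Count how many responses received each rubric score."""
    counts = Counter(int(score) for score in scores)
    return {label: counts.get(value, 0) for value, label in RUBRIC_LABELS.items()}


# Illustrative scores only; they are not results from an actual run.
print(tally_scores([1.0, 0.0, -1.0, 1.0]))
# {'good': 2, 'medium': 1, 'poor': 1}

# Assumed usage with the evaluation result (column name is an assumption):
# tally_scores(pointwise_result.metrics_table["custom_text_quality/score"].tolist())
```

The helper assumes every score is numeric. If a row failed to evaluate and holds a missing value, filter it out first.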

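Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the ExperimentRun created by the evaluation: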

aiplatform.ExperimentRun(
    run_name=pointwise_result.metadata["experiment_run"],
    experiment=pointwise_result.metadata["experiment"],
).delete()

What's next