For model-based metrics, the Gen AI evaluation service evaluates your models with a foundation model, such as Gemini, that is configured as a judge model. This page describes how to improve the quality of the judge model and customize it to your needs with prompt engineering techniques.
For the basic evaluation workflow, see the Gen AI evaluation service quickstart. This page is part of the advanced judge model customization series.
Overview
Evaluating large language models (LLMs) with human raters can be time-consuming and expensive. Using a judge model is a more scalable way to evaluate LLMs.
The Gen AI evaluation service uses Gemini 2.0 Flash by default as the judge model, with customizable prompts for evaluating your models across a range of use cases. The model-based metric templates cover many basic use cases, but you can use the following process to customize the judge model for needs beyond the basic ones (a code sketch follows the list):
1. Create a dataset with prompts that are representative of your use case. The recommended dataset size is between 100 and 1,000 prompts.
2. Use the prompts to modify the judge model with prompt engineering techniques.
3. Run an evaluation with the judge model.
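The sketch below shows one way to wire these three steps together, assuming the Vertex AI SDK for Python. The dataset contents, the metric name, and the judge prompt are illustrative placeholders, not fixed requirements:

```python
# A minimal sketch of the customization workflow, assuming the Vertex AI SDK
# for Python (google-cloud-aiplatform). Dataset contents, the metric name, and
# the judge prompt are illustrative placeholders.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, PairwiseMetric

vertexai.init(project="your-project-id", location="us-central1")

# Step 1: a dataset of prompts representative of your use case
# (100 to 1,000 prompts recommended), plus responses from both models.
eval_dataset = pd.DataFrame({
    "prompt": ["Summarize the attached article in two sentences."],
    "baseline_model_response": ["Response from model A ..."],
    "response": ["Response from model B ..."],
})

# Step 2: a judge prompt refined with the prompt engineering techniques on
# this page. Placeholders in braces are filled from the dataset columns.
judge_prompt = """Your mission is to judge responses from two AI models,
Model A and Model B, and decide which is better.

User query: {prompt}
Response A: {baseline_model_response}
Response B: {response}

Express your final verdict as one of [[A>B]], [[A=B]], or [[B>A]].
"""

# Step 3: run an evaluation with the customized judge model.
custom_metric = PairwiseMetric(
    metric="custom_pairwise_quality",  # hypothetical metric name
    metric_prompt_template=judge_prompt,
)
eval_result = EvalTask(dataset=eval_dataset, metrics=[custom_metric]).evaluate()
print(eval_result.metrics_table)
```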
Prompt engineering techniques
This section lists prompt engineering techniques that you can use to modify the judge model. The examples use zero-shot prompts, but you can also use few-shot examples in the prompt to improve model quality.
Start with a prompt that applies to the entire evaluation dataset. The prompt should include high-level evaluation criteria and rating rubrics, and it should ask the judge model for a final verdict. For examples of evaluation criteria and rating rubrics for various use cases, see Metric prompt templates.
Use chain-of-thought prompting
Prompt the judge model to evaluate the candidate models through a sequence of logically coherent actions or steps.
For example, you can use the following step-by-step instructions:
"Please first list down the instructions in the user query."
"Please highlight such specific keywords."
"After listing down instructions, you should rank the instructions in the order of importance."
"After that, INDEPENDENTLY check if response A and response B for meeting each of the instructions."
"Writing quality/style should NOT be used to judge the response quality unless it was requested by the user."
"When evaluating the final response quality, please value Instruction Following a more important rubrics than Truthfulness."
The following sample prompt asks the judge model to evaluate a text task using chain-of-thought prompting:
# Rubrics
Your mission is to judge responses from two AI models, Model A and Model B, and decide which is better. You will be given the previous conversations between the user and the model, a prompt, and responses from both models.
Please use the following rubric criteria to judge the responses:
<START OF RUBRICS>
Your task is to first analyze each response based on the two rubric criteria: instruction_following and truthfulness (factual correctness). Start your analysis with "Analysis".
(1) Instruction Listing
Please first list the instructions in the user query. In general, an instruction is VERY important if it is specifically asked for in the prompt and deviates from the norm. Please highlight such specific keywords.
You should also derive the task type from the prompt and include the task-specific implied instructions.
Sometimes, no instruction is available in the prompt. It is your job to infer whether the instruction is to auto-complete the prompt or to ask the LLM for follow-ups.
After listing the instructions, you should rank them in order of importance.
After that, INDEPENDENTLY check whether response A and response B meet each of the instructions. For each instruction, you should itemize with reasoning whether the response meets, partially meets, or does not meet the requirement. You should start with reasoning before reaching a conclusion about whether the response satisfies the requirement. Citing examples while reasoning is preferred.
(2) Truthfulness
Compare response A and response B for factual correctness. The one with fewer hallucination issues is better.
If the response is in sentences and not too long, you should check every sentence separately.
For longer responses, to check factual correctness, focus specifically on the places where response A and response B differ. Find the correct information in the text to decide whether one is more truthful than the other or they are about the same.
If you cannot determine the validity of claims made in the response, or the response is a punt ("I am not able to answer that type of question"), the response has no truthfulness issues.
Truthfulness checks are not applicable in the majority of creative writing cases ("write me a story about a unicorn on a parade").
Writing quality/style should NOT be used to judge the response quality unless it was requested by the user.
In the end, express your final verdict in one of the following choices:
1. Response A is better: [[A>B]]
2. Tie, relatively the same: [[A=B]]
3. Response B is better: [[B>A]]
Example of final verdict: "My final verdict is tie, relatively the same: [[A=B]]".
When evaluating the final response quality, please treat Instruction Following as a more important rubric than Truthfulness.
When both responses fully meet the instruction following and truthfulness criteria, it is a tie.
<END OF RUBRICS>
Guide model reasoning with rating guidelines
Use rating guidelines to help the judge model evaluate model reasoning. Rating guidelines are different from rating criteria.
For example, the following prompt uses rating criteria to instruct the judge model to evaluate the "instruction following" task against the "major issues," "minor issues," and "no issues" rating scales:
Your task is to first analyze each response based on the four rubric criteria: verbosity, instruction_following, truthfulness (code correctness), and (coding) executability. Please note that the model responses should follow "response system instruction" (if provided). Format your judgment in the following way:
Response A - verbosity:too short|too verbose|just right
Response A - instruction_following:major issues|minor issues|no issues
Response A - truthfulness:major issues|minor issues|no issues
Response A - executability:no|no code present|yes-fully|yes-partially
Then do the same for response B.
After the rubric judgments, you should also give a brief rationale summarizing your evaluation, considering each individual criterion as well as the overall quality, in a new paragraph starting with "Reason: ".
In the last line, express your final judgment in the format of: "Which response is better: [[verdict]]" where "verdict" is one of {Response A is much better, Response A is better, Response A is slightly better, About the same, Response B is slightly better, Response B is better, Response B is much better}. Do not use markdown format or output anything else.
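The seven-level verdict in this format lends itself to a signed numeric score, which makes per-example judgments easy to average across a dataset. The mapping below is an assumption for illustration; the service does not define these values:

```python
import re

# Hypothetical mapping from the seven-level verdict to a signed score;
# negative favors response A, positive favors response B.
VERDICT_SCORES = {
    "Response A is much better": -3,
    "Response A is better": -2,
    "Response A is slightly better": -1,
    "About the same": 0,
    "Response B is slightly better": 1,
    "Response B is better": 2,
    "Response B is much better": 3,
}

def score_verdict(judge_output: str) -> int:
    """Extracts 'Which response is better: [[verdict]]' and maps it to a score."""
    match = re.search(r"Which response is better: \[\[(.+?)\]\]", judge_output)
    if match is None:
        raise ValueError("No verdict found in judge output")
    return VERDICT_SCORES[match.group(1)]
```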
The following prompt uses rating guidelines to help the judge model evaluate the "instruction following" task:
You are a judge for coding related tasks for LLMs. You will be provided with a coding prompt, and two responses (Response A and Response B) attempting to answer the prompt. Your task is to evaluate each response based on the following criteria:
Correctness: Does the code produce the correct output and solve the problem as stated?
Executability: Does the code run without errors?
Instruction Following: Does the code adhere to the given instructions and constraints?
Please think about the three criteria, and provide a side-by-side comparison rating to indicate which one is better.
Calibrate the judge model with reference answers
You can calibrate the judge model with reference answers for some or all of the prompts.
The following prompt guides the judge model on how to use the reference answers:
"Note that you can compare the responses with the reference answer to make your judgment, but the reference answer may not be the only correct answer to the query."
The following example also uses reasoning, chain-of-thought prompting, and rating guidelines to guide the evaluation process for the "instruction following" task:
# Rubrics
Your mission is to judge responses from two AI models, Model A and Model B, and decide which is better. You will be given a user query, source summaries, and responses from both models. A reference answer may also be provided. Note that you can compare the responses with the reference answer to make your judgment, but the reference answer may not be the only correct answer to the query.
Please use the following rubric criteria to judge the responses:
<START OF RUBRICS>
Your task is to first analyze each response based on the three rubric criteria: grounding, completeness, and instruction_following. Start your analysis with "Analysis".
(1) Grounding
Please first read through all the given sources in the source summaries carefully and make sure you understand the key points in each one.
After that, INDEPENDENTLY check whether response A and response B use ONLY the given sources in the source summaries to answer the user query. It is VERY important to check that all statements in the response are traceable back to the source summaries and ACCURATELY cited.
(2) Completeness
Please first list the aspects in the user query. After that, INDEPENDENTLY check whether response A and response B cover each of the aspects by using ALL RELEVANT information from the sources.
(3) Instruction Following
Please read through the following instruction following rubrics carefully. After that, INDEPENDENTLY check whether response A and response B successfully follow each of the instruction following rubrics.
* Does the response provide a final answer based on summaries of 3 potential answers to a user query?
* Does the response only use the technical sources provided that are relevant to the query?
* Does the response use only information from sources provided?
* Does the response select all the sources that provide helpful details to answer the question in the Technical Document?
* If the sources have significant overlapping or duplicate details, does the response select sources which are most detailed and comprehensive?
* For each selected source, does the response prepend source citations?
* Does the response use the format: "Source X" where X represents the order in which the technical source appeared in the input?
* Does the response use the original source(s) directly in its response, presenting each source in its entirety, word-for-word, without omitting or altering any details?
* Does the response create a coherent technical final answer from selected Sources without inter-mixing text from any of the Sources?
Writing quality/style can be considered, but should NOT be used as a critical rubric criterion to judge the response quality.
In the end, express your final verdict in one of the following choices:
1. Response A is better: [[A>B]]
2. Tie, relatively the same: [[A=B]]
3. Response B is better: [[B>A]]
Example of final verdict: "My final verdict is tie, relatively the same: [[A=B]]".
When both responses fully meet the grounding, completeness, and instruction following criteria, it is a tie.
<END OF RUBRICS>