Metric prompt templates for model-based evaluation

This page lists templates that you can use for model-based evaluation with the Gen AI Evaluation Service. For more information about model-based metrics, see Define custom metrics.

Overview

For model-based evaluation, we send a prompt to the judge model, which generates the metric score based on specific criteria, rating rubrics, and other instructions.

The following table gives an overview of the available example metric prompt templates:

	Text use case	Multi-turn chat use case	Other key use cases
Pointwise	Fluency, Coherence, Groundedness, Safety, Instruction Following, Verbosity, Text Quality	Multi-turn Chat Quality, Multi-turn Safety	Summarization Quality, Question Answering Quality
Pairwise	Fluency, Coherence, Groundedness, Safety, Instruction Following, Verbosity, Text Quality	Multi-turn Chat Quality, Multi-turn Safety	Summarization Quality, Question Answering Quality

Structure a metric prompt template

A metric prompt template must include the following main sections:

Instruction
Evaluation
User inputs and AI-generated response

Each section can contain subsections.
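The three-section layout can be sketched in plain Python. This is only an illustration under stated assumptions: `build_metric_prompt` is a hypothetical helper, not part of the Gen AI Evaluation Service SDK, and the heading names simply mirror the example templates on this page.

```python
def build_metric_prompt(
    instruction: str,
    evaluation: dict[str, str],
    input_variables: list[str],
) -> str:
    """Assemble a metric prompt template from its three main sections.

    Hypothetical helper for illustration; not an SDK function.
    """
    parts = ["# Instruction", instruction, "", "# Evaluation"]
    # Each evaluation subsection (criteria, rating rubric, ...) gets its own heading.
    for heading, body in evaluation.items():
        parts += [f"## {heading}", body, ""]
    parts += ["# User Inputs and AI-generated Response", "## User Inputs"]
    # Every input variable except the response becomes a {placeholder}.
    for name in input_variables:
        if name != "response":
            parts += [f"### {name.capitalize()}", "{" + name + "}", ""]
    parts += ["## AI-generated Response", "{response}"]
    return "\n".join(parts)


template = build_metric_prompt(
    instruction=(
        "You are an expert evaluator. Your task is to evaluate the quality "
        "of the responses generated by AI models."
    ),
    evaluation={
        "Criteria": "Fluency: The response is well-organized and easy to read.",
        "Rating Rubric": "5: (Very good).\n1: (Very bad).",
    },
    input_variables=["prompt", "response"],
)
print(template)
```

Keeping the sections as separate arguments makes it easy to swap in different criteria or rubrics while leaving the instruction and input layout unchanged.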

Instruction

Component	Function	Type	Example
Instruction	Includes a persona for the judge model and a brief description of its task.	Default value	You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user input and AI-generated responses. You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below. You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

Evaluation

Component	Function	Type	Example
Metric definition	Specifies the name and definition of the metric.	Optional user input	`You will be assessing a metric called SummarizationQuality, which measures the overall ability to summarize text`
Criteria	Defines the criteria (and, optionally, sub-criteria) for the metric.	Required user input	`Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements. Groundedness: The response contains information included only in the context. The response does not reference any outside information.`
Rating rubric	Specifies the scoring scale for the metric, with an explanation of what each score means.	Required user input	`5: (Very good). The summary follows instructions, is grounded, is concise, and fluent. 4: (Good). The summary follows instructions, is grounded, concise, and fluent. 3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent. 2: (Bad). The summary is grounded, but does not follow the instructions. 1: (Very bad). The summary is not grounded.`
Few-shot examples	Examples of the task.	Optional user input. Note: few-shot examples can improve not only the judge model's performance but also the formatting of its response. We recommend starting with 5-10 few-shot examples.	`RESPONSE: Purple monkeys jumped onto the submarine while Beethoven's Fifth Symphony played loudly and the chef cooked spaghetti with meatballs. EXPLANATION: The provided response is a single sentence lacking any discernible structure or connections between the ideas presented. There's no logical flow to assess, no organization, and the juxtaposition of elements (monkeys, submarine, symphony, spaghetti) creates jarring incoherence. SCORE: 1`
Evaluation steps	Step-by-step instructions on how to carry out the task.	Optional user input. Note: you can specify the ranking of the criteria in the evaluation steps.	`STEP 1: Assess the response in aspects of instruction following, groundedness, helpfulness, and verbosity according to the criteria. STEP 2: Score based on the rubrics.`

User inputs

Component	Function	Type	Example
Input variables	The inputs that users must provide to complete the prompt for the autorater and receive a response.	Required user input	`## User Inputs ### Prompt {prompt} ## AI-generated Response {response}`
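As a rough sketch of how input variables complete the prompt, the placeholders can be treated as Python format fields and filled from one dataset row. This assumes the template contains no other literal braces; `fill_input_variables` is a hypothetical name, and the sample row is invented for illustration.

```python
USER_INPUT_SECTION = (
    "## User Inputs\n"
    "### Prompt\n"
    "{prompt}\n\n"
    "## AI-generated Response\n"
    "{response}"
)


def fill_input_variables(template: str, row: dict[str, str]) -> str:
    """Substitute each {variable} in the template with the row's value.

    Raises KeyError if the row lacks a variable that the template uses.
    """
    return template.format(**row)


# Illustrative row; real rows come from your evaluation dataset.
row = {
    "prompt": "Summarize the following article in one sentence.",
    "response": "The article describes a new recycling program.",
}
filled = fill_input_variables(USER_INPUT_SECTION, row)
print(filled)
```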

In addition, if the column names in your data do not match the input variables and you don't want to rename your data, you can provide a mapping:

Component	Function	Type	Example
Metric column mapping	A mapping from the input variables in the prompt to the columns in the user data.	Optional user input. Note: `prompt`, `response`, and `baseline_model_response` do not support mapping if `evaluate()` performs model inference.	`metric_column_mapping = {"reference":"ground_truth"}`
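The effect of such a mapping can be illustrated without the SDK. The sketch below is an assumption about the behavior, not the service's implementation: `resolve_inputs` is a hypothetical helper that looks up each input variable under its mapped column name and falls back to the variable name itself when no mapping is given.

```python
from typing import Optional


def resolve_inputs(
    row: dict[str, str],
    input_variables: list[str],
    metric_column_mapping: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Return {input variable: value}, reading each value from its mapped column."""
    mapping = metric_column_mapping or {}
    resolved = {}
    for variable in input_variables:
        # Use the mapped column if one is given, else the variable's own name.
        column = mapping.get(variable, variable)
        if column not in row:
            raise KeyError(f"column {column!r} for variable {variable!r} not found")
        resolved[variable] = row[column]
    return resolved


# The dataset stores the reference under "ground_truth", not "reference".
row = {"prompt": "Summarize ...", "response": "A summary.", "ground_truth": "Gold summary."}
resolved = resolve_inputs(
    row,
    input_variables=["prompt", "response", "reference"],
    metric_column_mapping={"reference": "ground_truth"},
)
print(resolved["reference"])
```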

Adapt a metric prompt template to your input data

To adapt a template to your specific data and evaluation criteria:

Identify missing criteria: determine which criteria the existing template does not adequately cover.
Add new criteria: include the missing criteria in the prompt, clearly defining what you expect the model to consider.
Modify the user inputs: if your evaluation dataset has additional columns that you want to use for evaluation, add them to the user input fields and tell the judge model how to use each field.
Update the rating rubric: modify the rating rubric so that it reflects the new criteria and their relative importance.

For example, to evaluate a summarization model on how well its summary aligns with a reference summary, you can add a new criterion named "Reference alignment" and add the reference data as part of User Inputs:

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information without being too verbose or terse.
Fluency: The response is well-organized and easy to read.
Reference alignment: The response is consistent and aligned with the reference response.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, concise, fluent and aligned with reference summary.
4: (Good). The summary follows instructions, is grounded, concise, and fluent but not aligned with reference summary.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent and is not aligned with reference summary.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, fluency and reference alignment according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Reference
{reference}

### Prompt
{prompt}

## AI-generated Response
{response}
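When a template gains a new input variable such as `{reference}`, it can help to check that the evaluation dataset actually provides a column for every placeholder. A minimal sketch using only the standard library follows; `template_variables` and `missing_columns` are hypothetical helpers, and the check assumes the template's braces are used only for placeholders.

```python
from string import Formatter

USER_INPUTS = (
    "## User Inputs\n"
    "### Reference\n"
    "{reference}\n\n"
    "### Prompt\n"
    "{prompt}\n\n"
    "## AI-generated Response\n"
    "{response}"
)


def template_variables(template: str) -> set[str]:
    """Collect the names of all {placeholders} in the template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def missing_columns(template: str, columns: list[str]) -> list[str]:
    """List the placeholders that have no matching dataset column."""
    return sorted(template_variables(template) - set(columns))


# A dataset that still lacks the new reference column.
print(missing_columns(USER_INPUTS, ["prompt", "response"]))
```

Running this before an evaluation surfaces a missing column early instead of partway through scoring.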

Provide few-shot examples to improve quality

Few-shot examples can significantly improve the quality and consistency of evaluation responses by guiding the model toward your chosen output formats and styles. We recommend starting with 5-10 few-shot examples.

To incorporate few-shot examples:

Identify relevant examples: select examples that are similar to the type of input data you will evaluate.
Include examples in the prompt: insert the examples directly in the evaluation prompt, before the task or context.
Format the examples: make sure the examples follow your chosen output format and style.

For example, you can provide few-shot examples for the coherence metric and add an instruction to use them, as follows:

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps as shown in few shot examples. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
...

## Criteria
...

## Rating Rubric
...

## Few-shot Examples
Response: Purple monkeys jumped onto the submarine while Beethoven's Fifth Symphony played loudly and the chef cooked spaghetti with meatballs.
Explanation: The provided response is a single sentence lacking any discernible structure or connections between the ideas presented. There's no logical flow to assess, no organization, and the juxtaposition of elements (monkeys, submarine, symphony, spaghetti) creates jarring incoherence.
Score: 1

Response: Learning a new language can be a rewarding experience for children, opening doors to different cultures and expanding their understanding of the world. There are many resources available to help children learn languages, from online courses and apps to language exchange programs and immersion schools.
Explanation: The response presents two related ideas: the benefits of learning a new language for children and the resources available to aid in that process. However, there is no clear transition or connection between these two distinct points. While both sentences are relevant to the topic of language acquisition in children, the relationship between them could be made more explicit.
Score: 3

Response: Although the internet has revolutionized communication and information sharing, it has also created echo chambers where individuals are only exposed to opinions and beliefs that align with their own. This polarization can lead to increased hostility and misunderstanding between different groups, making it difficult to find common ground on important issues. Consequently, fostering media literacy and critical thinking skills is essential for navigating the vast and often biased landscape of online information. By teaching individuals to evaluate sources, identify biases, and consider diverse perspectives, we can empower them to break free from echo chambers and engage in meaningful dialogue with those who hold differing views.
Explanation: The response exhibits a clear and logical flow of ideas. The transition words 'although' and 'consequently' effectively signal the relationship between the internet's advantages, its drawbacks (echo chambers), and the proposed solution (media literacy). The text maintains cohesion through consistent focus on the central theme of online polarization and its remedies.
Score: 5

## Evaluation Steps
...

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
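If the few-shot examples are kept as structured data, the `## Few-shot Examples` section can be generated so that every example follows the same Response / Explanation / Score format. This is a sketch only: `FewShotExample` and `format_few_shot_section` are hypothetical names, and the explanation text is shortened for illustration.

```python
from dataclasses import dataclass


@dataclass
class FewShotExample:
    response: str
    explanation: str
    score: int


def format_few_shot_section(examples: list[FewShotExample]) -> str:
    """Render examples in a uniform Response / Explanation / Score layout."""
    blocks = [
        f"Response: {ex.response}\nExplanation: {ex.explanation}\nScore: {ex.score}"
        for ex in examples
    ]
    # A blank line separates consecutive examples.
    return "## Few-shot Examples\n" + "\n\n".join(blocks)


examples = [
    FewShotExample(
        response="Purple monkeys jumped onto the submarine while Beethoven's "
        "Fifth Symphony played loudly and the chef cooked spaghetti with meatballs.",
        explanation="A single sentence with no logical flow between its ideas.",
        score=1,
    ),
    FewShotExample(
        response="Learning a new language can be a rewarding experience for children.",
        explanation="Relevant ideas, but the connection between them is not explicit.",
        score=3,
    ),
]
section = format_few_shot_section(examples)
print(section)
```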