# The ML.EVALUATE function

This document describes the `ML.EVALUATE`

function, which lets you
evaluate model metrics.

## Syntax

# Remote models over LLMs: ML.EVALUATE( MODEL `project_id.dataset.model`, { TABLE `project_id.dataset.table` | (query_statement) }, STRUCT( [task_type AS task_type] [, max_output_tokens AS max_output_tokens] [, temperature AS temperature] [, top_k AS top_k] [, top_p AS top_p]) ) # ARIMA_PLUS and ARIMA_PLUS_XREG models: ML.EVALUATE( MODEL `project_id.dataset.model` [, { TABLE `project_id.dataset.table` | (query_statement) }], STRUCT( [threshold_value AS threshold] [, perform_aggregation AS perform_aggregation] [, horizon_value AS horizon] [, confidence_level AS confidence_level] [, trial_id AS trial_id]) ) # All other types of models: ML.EVALUATE( MODEL `project_id.dataset.model` [, { TABLE `project_id.dataset.table` | (query_statement) }], STRUCT( [threshold_value AS threshold] [, trial_id AS trial_id]) )

### Arguments

`ML.EVALUATE`

takes the following arguments:

`project_id`

: your project ID.`dataset`

: the BigQuery dataset that contains the model.`model`

: the name of the model.This function works with all model types except for imported TensorFlow models and remote models over Cloud AI services. If you use

`ML.EVALUATE`

with a remote model over a Vertex AI large language model (LLM), the remote model must use one of the following LLMs:`gemini-pro`

`text-bison`

`text-unicorn`

Using

`ML.EVALUATE`

with a remote model over an LLM is in preview.`table`

: the name of the input table that contains the evaluation data.If

`table`

is specified, the input column names in the table must match the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules.The following column naming requirements apply:

**For remote models over tuned**:`text-bison`

LLMs- The table must have a column whose name matches the prompt column name
that is provided during model training. You can provide this value
by using the
`prompt_col`

option during model training. If`prompt_col`

is unspecified, the column named`prompt`

in the training data is used. An error is returned if there is no column named`prompt`

. - The table must have a column whose name matches the label column name that
is provided during model training. You can provide this value by using
the
`input_label_cols`

option during model training. If`input_label_cols`

is unspecified, the column named`label`

in the training data is used. An error is returned if there is no column named`label`

.

You can find information about the label and prompt columns by looking at the model schema information in the Google Cloud console.

For more information, see

`AS SELECT`

.- The table must have a column whose name matches the prompt column name
that is provided during model training. You can provide this value
by using the
**For remote models over pre-trained**:`text-bison`

,`text-unicorn`

, or`gemini-pro`

models- The table must have a column named
`input_text`

that contains the prompt text to use when evaluating the model. - The table must have a column named
`output_text`

that contains the generated text that you would expect to be returned by the model.

- The table must have a column named
**For classification and regression models**: The input must have a column that matches the label column name that is provided during model training. You can provide this value by using the`input_label_cols`

option during model training. If`input_label_cols`

is unspecified, the column named`label`

in the training data is used.

`query_statement`

: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the`query_statement`

clause in GoogleSQL, see Query syntax.If you used the

`TRANSFORM`

clause in the`CREATE MODEL`

statement that created the model, then only the input columns present in the`TRANSFORM`

clause must appear in`query_statement`

.The following column naming requirements apply:

**For remote models over tuned**:`text-bison`

LLMs- The query must contain a column whose name matches the prompt column name
that is provided during model training. You can provide this value
by using the
`prompt_col`

option during model training. If`prompt_col`

is unspecified, the column named`prompt`

in the training data is used. An error is returned if there is no column named`prompt`

. - The query must contain a column whose name matches the label column name that
is provided during model training. You can provide this value by using
the
`input_label_cols`

option during model training. If`input_label_cols`

is unspecified, the column named`label`

in the training data is used. An error is returned if there is no column named`label`

.

You can find information about the label and prompt columns by looking at the model schema information in the Google Cloud console.

For more information, see

`AS SELECT`

.- The query must contain a column whose name matches the prompt column name
that is provided during model training. You can provide this value
by using the
**For remote models over pre-trained**:`text-bison`

,`text-unicorn`

, or`gemini-pro`

models- The query must contain a column named
`input_text`

that contains the prompt text to use when evaluating the model. - The query must contain a column named
`output_text`

that contains the generated text that you would expect to be returned by the model.

- The query must contain a column named
**For classification and regression models**: The input must have a column that matches the label column name that is provided during model training. You can provide this value by using the`input_label_cols`

option during model training. If`input_label_cols`

is unspecified, the column named`label`

in the training data is used.

`threshold`

: a`FLOAT64`

value that specifies a custom threshold for the binary-class classification model to use for evaluation. The default value is`0.5`

.A

`0`

value for precision or recall means that the selected threshold produced no true positive labels. A`NaN`

value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives.If both

`table_name`

and`query_statement`

are unspecified, you can't use a threshold.You can only use

`threshold`

with binary-class classification models.`perform_aggregation`

: a`BOOL`

value that indicates the level of evaluation for forecasting accuracy. If you specify`TRUE`

, then the forecasting accuracy is on the time series level. If you specify`FALSE`

, the forecasting accuracy is on the timestamp level. The default value is`TRUE`

.`horizon`

: an`INT64`

value that specifies the number of forecasted time points against which the evaluation metrics are computed. The default value is the horizon value specified in the`CREATE MODEL`

statement for the time series model, or`1000`

if unspecified. When evaluating multiple time series at the same time, this parameter applies to each time series.You can only use

`horizon`

when the model type is`ARIMA_PLUS`

and either`table_name`

or`query_statement`

is specified.`confidence_level`

: a`FLOAT64`

value that specifies the percentage of the future values that fall in the prediction interval. The default value is`0.95`

. The valid input range is`[0, 1)`

.You can only use

`confidence_level`

when the model type is`ARIMA_PLUS`

, either`table_name`

or`query_statement`

is specified, and`perform_aggregation`

is set to`FALSE`

. The value of`confidence_level`

affects the`upper_bound`

and`lower_bound`

values in the output.`trial_id`

: an`INT64`

value that identifies the hyperparameter tuning trial that you want the function to evaluate. The function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.`task_type`

: a`STRING`

value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:`TEXT_GENERATION`

`CLASSIFICATION`

`SUMMARIZATION`

`QUESTION_ANSWERING`

The default value is

`TEXT_GENERATION`

.You can only use this option with a remote model that targets a Vertex AI LLM.

`max_output_tokens`

: an`INT64`

value that sets the maximum number of tokens output by the model. This value must be in the range`[1,1024]`

. Specify a lower value for shorter responses and a higher value for longer responses. The default value is`128`

.A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words.

You can only use this option with a remote model that targets a Vertex AI LLM.

`temperature`

: a`FLOAT64`

value in the range`[0.0,1.0]`

that is used for sampling during the response generation, which occurs when`top_k`

and`top_p`

are applied. It controls the degree of randomness in token selection. Lower`temperature`

values are good for prompts that require a more deterministic and less open-ended or creative response, while higher`temperature`

values can lead to more diverse or creative results. A`temperature`

value of`0`

is deterministic, meaning that the highest probability response is always selected. The default is`0`

.You can only use this option with a remote model that targets a Vertex AI LLM.

`top_k`

: an`INT64`

value in the range`[1,40]`

that changes how the model selects tokens for output. Specify a lower value for less random responses and a higher value for more random responses. The default is`40`

.A

`top_k`

value of`1`

means the next selected token is the most probable among all tokens in the model's vocabulary, while a`top_k`

value of`3`

means that the next token is selected from among the three most probable tokens by using the`temperature`

value.For each token selection step, the

`top_k`

tokens with the highest probabilities are sampled. Then tokens are further filtered based on the`top_p`

value, with the final token selected using temperature sampling.You can only use this option with a remote model that targets a Vertex AI LLM.

`top_p`

: a`FLOAT64`

value in the range`[0.0,1.0]`

that changes how the model selects tokens for output. Specify a lower value for less random responses and a higher value for more random responses. The default is`0.95`

.Tokens are selected from the most (based on the

`top_k`

value) to least probable until the sum of their probabilities equals the`top_p`

value. For example, if tokens A, B, and C have a probability of`0.3`

,`0.2`

, and`0.1`

and the`top_p`

value is`0.5`

, then the model selects either A or B as the next token by using the`temperature`

value and doesn't consider C.You can only use this option with a remote model that targets a Vertex AI LLM.

## Output

`ML.EVALUATE`

returns a single row of metrics applicable to the
type of model specified.

For models that return them, the `precision`

, `recall`

, `f1_score`

, `log_loss`

,
and `roc_auc`

metrics are macro-averaged for all of the class labels. For a
macro-average, metrics are calculated for each label and then an unweighted
average is taken of those values.

For models that return the `accuracy`

metric, `accuracy`

is computed as a global
total or micro-average. For a micro-average, the metric is calculated globally
by counting the total number of correctly predicted rows.

### Regression models

Regression models include the following:

- Linear regression
- Boosted tree regressor
- Random forest regressor
- Deep neural network (DNN) regressor
- Wide & Deep regressor
- AutoML Tables regressor

`ML.EVALUATE`

returns the following columns for regression models:

`trial_id`

: an`INT64`

value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model. This column doesn't apply for AutoML Tables models.`mean_absolute_error`

: a`FLOAT64`

value that contains the mean absolute error for the model.`mean_squared_error`

: a`FLOAT64`

value that contains the mean squared error for the model.`mean_squared_log_error`

: a`FLOAT64`

value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.`median_absolute_error`

: a`FLOAT64`

value that contains the median absolute error for the model.`r2_score`

: a`FLOAT64`

value that contains the R2 score for the model.`explained_variance`

: a`FLOAT64`

value that contains the explained variance for the model.

### Classification models

Classification models include the following:

- Logistic regressor
- Boosted tree classifier
- Random forest classifier
- DNN classifier
- Wide & Deep classifier
- AutoML Tables classifier

`ML.EVALUATE`

returns the following columns for classification models:

`trial_id`

: an`INT64`

value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model. This column doesn't apply for AutoML Tables models.`precision`

: a`FLOAT64`

value that contains the precision for the model.`recall`

: a`FLOAT64`

value that contains the recall for the model.`accuracy`

: a`FLOAT64`

value that contains the accuracy for the model.`f1_score`

: a`FLOAT64`

value that contains the F1 score for the model.`log_loss`

: a`FLOAT64`

value that contains the logistic loss for the model.`roc_auc`

: a`FLOAT64`

value that contains the area under the receiver operating characteristic curve for the model.

### K-means models

`ML.EVALUATE`

returns the following columns for k-means models:

`trial_id`

: an`INT64`

value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.`davies_bouldin_index`

: a`FLOAT64`

value that contains the Davies-Bouldin Index for the model.`mean_squared_distance`

: a`FLOAT64`

value that contains the mean squared distance for the model, which is the average of the distances between training data points to their closest centroid.

### Matrix factorization models

`ML.EVALUATE`

returns the following columns for matrix factorization models
with implicit feedback:

`trial_id`

: an`INT64`

value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.`recall`

: a`FLOAT64`

value that contains the recall for the model.`mean_squared_error`

: a`FLOAT64`

value that contains the mean squared error for the model.`normalized_discounted_cumulative_gain`

: a`FLOAT64`

value that contains the normalized discounted cumulative gain for the model.`average_rank`

: a`FLOAT64`

value that contains the average rank (PDF download) for the model.

`ML.EVALUATE`

returns the following columns for matrix factorization models
with explicit feedback:

`trial_id`

: an`INT64`

value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.`mean_absolute_error`

: a`FLOAT64`

value that contains the mean absolute error for the model.`mean_squared_error`

: a`FLOAT64`

value that contains the mean squared error for the model.`mean_squared_log_error`

: a`FLOAT64`

value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.`mean_absolute_error`

: a`FLOAT64`

value that contains the mean absolute error for the model.`r2_score`

: a`FLOAT64`

value that contains the R2 score for the model.`explained_variance`

: a`FLOAT64`

value that contains the explained variance for the model.

### PCA models

`ML.EVALUATE`

returns the following column for PCA models:

`total_explained_variance_ratio`

: a`FLOAT64`

value that contains the percentage of the cumulative variance explained by all the returned principal components. For more information, see the`ML.PRINCIPAL_COMPONENT_INFO`

function.

### Time series models

`ML.EVALUATE`

returns the following columns for `ARIMA_PLUS`

or
`ARIMA_PLUS_XREG`

models when input data is provided and `perform_aggregation`

is `FALSE`

:

`time_series_id_col`

or`time_series_id_cols`

: a value that contains the identifiers of a time series.`time_series_id_col`

can be an`INT64`

or`STRING`

value.`time_series_id_cols`

can be an`ARRAY<INT64>`

or`ARRAY<STRING>`

value. Only present when forecasting multiple time series at once. The column names and types are inherited from the`TIME_SERIES_ID_COL`

option as specified in the`CREATE MODEL`

statement.`ARIMA_PLUS_XREG`

models don't support this column.`time_series_timestamp_col`

: a`STRING`

value that contains the timestamp column for a time series. The column name and type are inherited from the`TIME_SERIES_TIMESTAMP_COL`

option as specified in the`CREATE MODEL`

statement.`time_series_data_col`

: a`STRING`

value that contains the data column for a time series. The column name and type are inherited from the`TIME_SERIES_DATA_COL`

option as specified in the`CREATE MODEL`

statement.`forecasted_time_series_data_col`

: a`STRING`

value that contains the same data as`time_series_data_col`

but with`forecasted_`

prefixed to the column name.`lower_bound`

: a`FLOAT64`

value that contains the lower bound of the prediction interval.`upper_bound`

: a`FLOAT64`

value that contains the upper bound of the prediction interval.`absolute_error`

: a`FLOAT64`

value that contains the absolute value of the difference between the forecasted value and the actual data value.`absolute_percentage_error`

: a`FLOAT64`

value that contains the absolute value of the absolute error divided by the actual value.

`ML.EVALUATE`

returns the following columns for `ARIMA_PLUS`

or
`ARIMA_PLUS_XREG`

models when input data is provided and `perform_aggregation`

is `TRUE`

:

`time_series_id_col`

or`time_series_id_cols`

: the identifiers of a time series. Only present when forecasting multiple time series at once. The column names and types are inherited from the`TIME_SERIES_ID_COL`

option as specified in the`CREATE MODEL`

statement.`ARIMA_PLUS_XREG`

models don't support this column.`mean_absolute_error`

: a`FLOAT64`

value that contains the mean absolute error for the model.`mean_squared_error`

: a`FLOAT64`

value that contains the mean squared error for the model.`root_mean_squared_error`

: a`FLOAT64`

value that contains the root mean squared error for the model.`mean_absolute_percentage_error`

: a`FLOAT64`

value that contains the mean absolute percentage error for the model.`symmetric_mean_absolute_percentage_error`

: a`FLOAT64`

value that contains the symmetric mean absolute percentage error for the model.

`ML.EVALUATE`

returns the following columns for an `ARIMA_PLUS`

model when
input data isn't provided:

`time_series_id_col`

or`time_series_id_cols`

: the identifiers of a time series. Only present when forecasting multiple time series at once. The column names and types are inherited from the`TIME_SERIES_ID_COL`

option as specified in the`CREATE MODEL`

statement.`non_seasonal_p`

: an`INT64`

value that contains the order for the autoregressive model. For more information, see Autoregressive integrated moving average.`non_seasonal_d`

: an`INT64`

that contains the degree of differencing for the non-seasonal model. For more information, see Autoregressive integrated moving average.`non_seasonal_q`

: an`INT64`

that contains the order for the moving-average model. For more information, see Autoregressive integrated moving average.`has_drift`

: a`BOOL`

value that indicates whether the model includes a linear drift term.`log_likelihood`

: a`FLOAT64`

value that contains the log likelihood for the model.`aic`

: a`FLOAT64`

value that contains the Akaike information criterion for the model.`variance`

: a`FLOAT64`

value that measures how far the observed value differs from the predicted value mean.`seasonal_periods`

: a`STRING`

value that contains the seasonal period for the model.`has_holiday_effect`

: a`BOOL`

value that indicates whether the model includes any holiday effects.`has_spikes_and_dips`

: a`BOOL`

value that indicates whether the model performs automatic spikes and dips detection and cleanup.`has_step_changes`

: a`BOOL`

value that indicates whether the model has step changes.

### Autoencoder models

`ML.EVALUATE`

returns the following columns for autoencoder models:

`mean_absolute_error`

: a`FLOAT64`

value that contains the mean absolute error for the model.`mean_squared_error`

: a`FLOAT64`

value that contains the mean squared error for the model.`mean_squared_log_error`

: a`FLOAT64`

value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.

### Remote models over Vertex AI endpoints

`ML.EVALUATE`

returns the following column:

`remote_eval_metrics`

: a`JSON`

column containing appropriate metrics for the model type.

### Remote models over Vertex AI LLMs

`ML.EVALUATE`

returns different columns for remote models over
Vertex AI LLMs, depending on the `task_type`

value
that you specify.

When you specify the `TEXT_GENERATION`

task type, the following columns are
returned:

`bleu4_score`

: a`FLOAT64`

column that contains the bilingual evaluation understudy (BLEU4) score for the model.`rouge-l_precision`

: a`FLOAT64`

column that contains the Recall-oriented understudy for gisting evaluation (ROUGE-L) precision for the model .`rouge-l_recall`

: a`FLOAT64`

column that contains the ROUGE-L recall for the model.`rouge-l_f1`

: a`FLOAT64`

column that contains the ROUGE-L F1 score for the model.`evaluation_status`

: a`STRING`

column in JSON format that contains the following elements:`num_successful_rows`

: the number of successful inference rows returned from Vertex AI.`num_total_rows`

: the number of total input rows.

When you specify the `CLASSIFICATION`

task type, the following columns are
returned:

`precision`

: a`FLOAT64`

column that contains the precision for the model .`recall`

: a`FLOAT64`

column that contains the recall for the model.`f1`

: a`FLOAT64`

column that contains the F1 score for the model.`label`

: a`STRING`

column that contains the label generated for the input data.`evaluation_status`

: a`STRING`

column in JSON format that contains the following elements:`num_successful_rows`

: the number of successful inference rows returned from Vertex AI.`num_total_rows`

: the number of total input rows.

When you specify the `SUMMARIZATION`

task type, the following columns are
returned:

`rouge-l_precision`

: a`FLOAT64`

column that contains the Recall-oriented understudy for gisting evaluation (ROUGE-L) precision for the model.`rouge-l_recall`

: a`FLOAT64`

column that contains the ROUGE-L recall for the model.`rouge-l_f1`

: a`FLOAT64`

column that contains the ROUGE-L F1 score for the model.`evaluation_status`

: a`STRING`

column in JSON format that contains the following elements:`num_successful_rows`

: the number of successful inference rows returned from Vertex AI.`num_total_rows`

: the number of total input rows.

When you specify the `QUESTION_ANSWERING`

task type, the following columns are
returned:

`exact_match`

: a`FLOAT64`

column that indicates if the generated text exactly matches the ground truth. This value is`1`

if the generated text equals the ground truth, otherwise it is`0`

. This metric is an average across all of the input rows.`evaluation_status`

: a`STRING`

column in JSON format that contains the following elements:`num_successful_rows`

: the number of successful inference rows returned from Vertex AI.`num_total_rows`

: the number of total input rows.

## Limitations

`ML.EVALUATE`

is subject to the following limitations:

`ML.EVALUATE`

doesn't support imported TensorFlow models or remote models over Cloud AI services.- For remote models over Vertex AI endpoints,
`ML.EVALUATE`

fetches evaluation result from the Vertex AI endpoint and doesn't take any input data.

## Costs

When used with remote models over Vertex AI LLMs,
`ML.EVALUATE`

costs are calculated based on the following:

- The bytes processed from the input table. These charges are billed from BigQuery to your project. For more information, see BigQuery pricing.
- The input to and output from the LLM. These charges are billed from Vertex AI to your project. For more information, see Vertex AI pricing.

## Examples

The following examples show how to use `ML.EVALUATE`

.

`ML.EVALUATE`

with no input data specified

The following query evaluates a model with no input data specified:

SELECT * FROM ML.EVALUATE(MODEL `mydataset.mymodel`)

`ML.EVALUATE`

with a custom threshold and input data

The following query evaluates a model with input data and a custom
threshold of `0.55`

:

SELECT * FROM ML.EVALUATE(MODEL `mydataset.mymodel`, ( SELECT custom_label, column1, column2 FROM `mydataset.mytable`), STRUCT(0.55 AS threshold))

`ML.EVALUATE`

to calculate forecasting accuracy of a time series

The following query evaluates the 30-point forecasting accuracy for a time series model:

SELECT * FROM ML.EVALUATE(MODEL `mydataset.my_arima_model`, ( SELECT timeseries_date, timeseries_metric FROM `mydataset.mytable`), STRUCT(TRUE AS perform_aggregation, 30 AS horizon))

`ML.EVALUATE`

to calculate ARIMA_PLUS forecasting accuracy for each forecasted timestamp

The following query evaluates the forecasting accuracy for each of the 30
forecasted points of a time series model. It also computes the prediction
interval based on a confidence level of `0.9`

.

SELECT * FROM ML.EVALUATE(MODEL `mydataset.my_arima_model`, ( SELECT timeseries_date, timeseries_metric FROM `mydataset.mytable`), STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level, 30 AS horizon))

`ML.EVALUATE`

to calculate ARIMA_PLUS_XREG forecasting accuracy for each forecasted timestamp

The following query evaluates the forecasting accuracy for each of the 30
forecasted points of a time series model. It also computes the prediction
interval based on a confidence level of `0.9`

. Note that you need to include the
side features for the evaluation data.

SELECT * FROM ML.EVALUATE(MODEL `mydataset.my_arima_xreg_model`, ( SELECT timeseries_date, timeseries_metric, feature1, feature2 FROM `mydataset.mytable`), STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level, 30 AS horizon))

`ML.EVALUATE`

to calculate LLM text generation accuracy

The following query evaluates the LLM text generation accuracy for the classification task type for each label from the evaluation table.

SELECT * FROM ML.EVALUATE(MODEL `mydataset.my_llm`, ( SELECT prompt, label FROM `mydataset.mytable`), STRUCT('classification' AS task_type))

## What's next

- For information about model evaluation, see BigQuery ML model evaluation overview.
- For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.