The ML.DETECT_ANOMALIES function
This document describes the ML.DETECT_ANOMALIES
function, which lets you
perform anomaly detection in BigQuery ML.
You can use the following types of models with ML.DETECT_ANOMALIES
, depending
on the type of input data you want to analyze:
- For time series data, use one of the following models:
- For independent and identically distributed random variables (IID) data, use one of the following models:
Syntax
# ARIMA_PLUS and ARIMA_PLUS_XREG models: ML.DETECT_ANOMALIES( MODEL `project_id.dataset.model` [, STRUCT(anomaly_prob_threshold AS anomaly_prob_threshold)] [, { TABLE `project_id.dataset.table` | (query_statement) }] ) #Autoencoder, k-means, or PCA models: ML.DETECT_ANOMALIES( MODEL `project_id.dataset.model`, STRUCT(contamination AS contamination), { TABLE `project_id.dataset.table` | (query_statement) } )
Arguments
ML.DETECT_ANOMALIES
takes the following arguments:
project_id
: Your project ID.dataset
: The BigQuery dataset that contains the model.model
: The name of the model.table
: The name of the table to use to perform anomaly detection.query_statement
: The GoogleSQL query that generates the data to use to perform anomaly detection. For the supported SQL syntax for thequery_statement
clause in GoogleSQL, see Query syntax.anomaly_prob_threshold
: aFLOAT64
value that identifies the custom threshold to use for anomaly detection. The value must be in the range[0, 1)
, with a default value of0.95
.The value of the anomaly probability at each timestamp is calculated using the actual time series data value and the values of the predicted time series data and the variance from the model training. The actual time series data value at a specific timestamp is identified as anomalous if the anomaly probability exceeds the
anomaly_prob_threshold
value. Theanomaly_prob_threshold
value also determines the lower and upper bounds, where a larger threshold value results in a larger interval size.contamination
: aFLOAT64
value that identifies the proportion of anomalies in the training dataset that are used to create the autoencoder, k-means, or PCA input models. The value must be in the range[0, 0.5]
.For example,
contamination
value of0.1
means that 10% of the training data that was used to create the input model is anomalous. Thecontamination
value determines the cutoff threshold of the target metric to become anomalous, and any input data with a target metric greater than the cutoff threshold is identified as anomalous. The target metric is mean squared error for autoencoder and PCA models, and the target metric is normalized distance for k-means models. For more information on normalized distance, see K-means model output.
Input
The input requirements for the ML.DETECT_ANOMALIES
function depend upon the
input model type.
Time series model input
Anomaly detection with ARIMA_PLUS
and ARIMA_PLUS_XREG
models has the
following requirements:
- To detect anomalies in historical time-series data, the
DECOMPOSE_TIME_SERIES
training option must be set as its default value ofTRUE
when the input model is created. Neithertable_name
norquery_statement
is accepted. - The
anomaly_prob_threshold
value must be specified to detect anomalies in new time-series data. - The column names of either the
table_name
input table or thequery_statement
clause must match the column names that are used to create the input model. - The data types of the
TIME_SERIES_ID_COL
columns must match the data types of the columns that are used to create the input model.
For a list of supported data types, see
TIME_SERIES_TIMESTAMP_COL
and
TIME_SERIES_DATA_COL
.
Autoencoder, k-means, or PCA model input
Anomaly detection with autoencoder, k-means, or PCA models has the following requirements:
- The column names of the input data from either the
table
or thequery_statement
argument must match the column names of the model. The column data types must be compatible according to BigQuery implicit coercion rules. - If you used the
TRANSFORM
clause in theCREATE MODEL
statement that created the model, then only the input columns present in theTRANSFORM
clause can appear inquery_statement
.
Output
ML.DETECT_ANOMALIES
always returns the is_anomaly
column that contains the
anomaly detection results. Other output columns differ based upon the
input model type and input data table.
Time series model output
ARIMA_PLUS
and ARIMA_PLUS_XREG
model output includes the following columns,
followed by the input table columns, if present. Output can include the
following:
time_series_id_col
ortime_series_id_cols
: the identifiers of a time series. Only present when forecasting multiple time series at once. The column names and types are inherited from theTIME_SERIES_ID_COL
option as specified in the model creation query.time_series_timestamp
: aSTRING
value that contains the timestamp column for a time series. The column name is inherited from theTIME_SERIES_TIMESTAMP_COL
option as specified in theCREATE MODEL
statement. The column has a type ofTIMESTAMP
, regardless of theTIME_SERIES_TIMESTAMP_COL
input column data type.time_series_data
: aSTRING
value that contains the data column for a time series. The column name is inherited from theTIME_SERIES_DATA_COL
option as specified in theCREATE MODEL
statement. The column has a type ofFLOAT64
, regardless of theTIME_SERIES_DATA_COL
input column data type.is_anomaly
: aBOOL
value that indicates whether the value at a specific timestamp is an anomaly. If theanomaly_probability
value is above theanomaly_prob_threshold
value, then thetime_series_data
value is out of the range for the lower and upper bounds and theis_anomaly
value isTRUE
.lower_bound
: aFLOAT64
value that contains the lower bound of the prediction result.upper_bound
: aFLOAT64
value that that contains the upper bound of the prediction result.anomaly_probability
: aFLOAT64
value that contains the probability that this point is an anomaly. For example, ananomaly_probability
value of0.95
means that, among all possible values at the given timestamp, there is a 95% chance that the value is closer to the predicted value than it is to the given time series data value. This indicates a 95% probability that the given time series data value is an anomaly.
ML.DETECT_ANOMALIES
output for time series models has the
following properties:
- The function returns
NULL
values in theis_anomaly
,upper_bound
,lower_bound
andanomaly_probability
columns for rows with invalid input, which include the following cases:
Autoencoder and PCA model output
Autoencoder and PCA model output includes the following columns, followed by the input table columns:
is_anomaly
: aBOOL
value that indicates whether the value is anomalous.mean_squared_error
: aFLOAT64
value that contains the mean squared error.
K-means model output
K-means model output includes the following, followed by the input table columns:
is_anomaly
: aBOOL
value that indicates whether the value is anomalous.normalized_distance
: aFLOAT64
value that contains the shortest distance among the normalized distances from the input data to each cluster centroid. Normalized distances are computed as the absolute distance from the input data to a cluster centroid, divided by the cluster's radius. The cluster radius is defined as the root mean square of all of the distances from each cluster's assigned data points to its centroid. Normalized distance is used in favor of absolute distance to determine anomalies because anomalies might not be detected as effectively using absolute distances, since they don't account for cluster radius. The distance type is determined by theDISTANCE_TYPE
value specified during model training.centroid_id
: anINT64
value that contains the centroid ID.
Examples
The following examples show how to use ML.DETECT_ANOMALIES
with different
input models and settings.
ARIMA_PLUS
model without specified settings
The following example detects anomalies using an ARIMA_PLUS
model that has the
DECOMPOSE_TIME_SERIES
training option set to its default value of TRUE
, without specifying the
anomaly_prob_threshold
argument.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`)
If the time series input column names are ts_timestamp
and ts_data
, then
this query returns results similar to the following:
+-------------------------+----------+------------+-------------+-------------+---------------------+ | ts_timestamp | ts_data | is_anomaly | lower_bound | upper_bound | anomaly_probability | +-------------------------+----------+------------+-------------+-------------+---------------------+ | 2021-01-01 00:00:01 UTC | 125.3 | FALSE | 123.5 | 139.1 | 0.93 | | 2021-01-02 00:00:01 UTC | 145.3 | TRUE | 128.5 | 143.1 | 0.96 | +-------------------------+----------+------------+-------------+-------------+---------------------+
ARIMA_PLUS
model with a custom anomaly_prob_threshold
value
The following example detects anomalies using an ARIMA_PLUS
model that has the
DECOMPOSE_TIME_SERIES
training option set to its default value of TRUE
, using a custom
anomaly_prob_threshold
value of 0.8
:
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`, STRUCT(0.8 AS anomaly_prob_threshold))
If the time series input column names are ts_timestamp
and ts_data
, then
this query returns results similar to the following:
+-------------------------+----------+------------+-------------+-------------+---------------------+ | ts_timestamp | ts_data | is_anomaly | lower_bound | upper_bound | anomaly_probability | +-------------------------+----------+------------+-------------+-------------+---------------------+ | 2021-01-01 00:00:01 UTC | 125.3 | TRUE | 129.5 | 133.6 | 0.93 | | 2021-01-02 00:00:01 UTC | 145.3 | TRUE | 131.5 | 136.6 | 0.96 | +-------------------------+----------+------------+-------------+-------------+---------------------+
ARIMA_PLUS
model with input data as a query statement
The following example detects anomalies using an ARIMA_PLUS
model, using a
custom anomaly_prob_threshold
value of 0.9
and passing an input data table
into the query:
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`, STRUCT(0.9 AS anomaly_prob_threshold), ( SELECT state, city, date, temperature, weather FROM `mydataset.my_time_series_data_table`))
This example uses the following column values:
TIME_SERIES_ID_COL
isstate
,city
.TIME_SERIES_TIMESTAMP_COL
isdate
.TIME_SERIES_DATA_COL
istemperature
.
This example returns results similar to the following:
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+ | state | city | date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather | +-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+ | "WA" | "Kirkland" | 2021-01-01 00:00:00 UTC | 38.1 | FALSE | 36.4 | 42.0 | 0.8293 | "sunny" | | "WA" | "Kirkland" | 2021-01-02 00:00:00 UTC | 37.1 | TRUE | 37.4 | 43.3 | 0.9124 | "rainy" | +-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
ARIMA_PLUS
model with input data as a table
The following example detects anomalies using an ARIMA_PLUS
model, using
a custom anomaly_prob_threshold
value of 0.9
and passing an input data table
into the query:
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_arima_plus_model`, STRUCT(0.9 AS anomaly_prob_threshold), TABLE `mydataset.my_time_series_data_table`)
If the TIME_SERIES_ID_COL
column names are state
, city
, and
TIME_SERIES_TIMESTAMP_COL
, and the TIME_SERIES_DATA_COL
column names are
date
and temperature
, and one additional column weather
is in the input
data table, then this query returns results similar to the following:
+-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+ | state | city | date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | weather | +-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+ | "WA" | "Kirkland" | 2021-01-01 00:00:00 UTC | 38.1 | FALSE | 36.4 | 42.0 | 0.8293 | "sunny" | | "WA" | "Kirkland" | 2021-01-02 00:00:00 UTC | 37.1 | TRUE | 37.4 | 43.3 | 0.9124 | "rainy" | +-------+------------+-------------------------+-------------+------------+-------------+-------------+---------------------+---------+
ARIMA_PLUS_XREG
model with a custom anomaly_prob_threshold
value
The following example detects anomalies using an ARIMA_PLUS_XREG
model that
uses a custom anomaly_prob_threshold
value of 0.6
:
SELECT * FROM ML.DETECT_ANOMALIES ( MODEL `mydataset.my arima_plus_xreg_model`, STRUCT(0.6 AS anomaly_prob_threshold) ) ORDER BY date ASC;
If the time series input column names are date
and temperature
, then
this query returns results similar to the following:
+-------------------------+-------------+------------+---------------------+---------------------+----------------------+ | date | temperature | is_anomaly | lower_bound | upper_bound | anomaly_probability | +-------------------------+-------------+------------+---------------------+---------------------+----------------------+ | 2009-08-11 00:00:00 UTC | 70.1 | false | 67.65879917809896 | 72.541200821901029 | 0.0 | | 2009-08-12 00:00:00 UTC | 73.4 | false | 71.714971312549849 | 76.597372956351919 | 0.20573021642489953 | | 2009-08-13 00:00:00 UTC | 64.6 | true | 67.7428898975034 | 72.625291541305472 | 0.94632610424009034 | +-------------------------+-------------+------------+---------------------+---------------------+----------------------+
Autoencoder model
The following example detects anomalies using an autoencoder model and a
contamination
value of 0.1
.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_autoencoder_model`, STRUCT(0.1 AS contamination), TABLE `mydataset.mytable`)
If the feature column names are f1
and f2
, then this query returns results
similar to the following:
+------------+--------------------+---------+--------+ | is_anomaly | mean_squared_error | f1 | f2 | +------------+--------------------+---------+--------+ | FALSE | 0.63456 | 120 | "a" | | TRUE | 11.342 | 15000 | "b" | +------------+--------------------+---------+--------+
K-means model
The following example detects anomalies using a k-means model and a
contamination
value of 0.2
.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_kmeans_model`, STRUCT(0.2 AS contamination), ( SELECT f1, f2 FROM `mydataset.mytable`))
This query returns results similar to the following:
+------------+---------------------+-------------+--------+--------+ | is_anomaly | normalized_distance | centroid_id | f1 | f2 | +------------+---------------------+-------------+--------+--------+ | FALSE | 0.63456 | 1 | 120 | "a" | | TRUE | 6.3243 | 2 | 15000 | "b" | +------------+---------------------+-------------+--------+--------+
PCA model
The following example detects anomalies using a PCA model and a
contamination
value of 0.1
.
SELECT * FROM ML.DETECT_ANOMALIES(MODEL `mydataset.my_pca_model`, STRUCT(0.1 AS contamination), TABLE `mydataset.mytable`)
If the feature column names are f1
, f2
and f3
, then this query returns
results similar to the following:
+------------+--------------------+---------+--------+------+ | is_anomaly | mean_squared_error | f1 | f2 | f3 | +------------+--------------------+---------+--------+------+ | FALSE | 0.63456 | 120 | "a" | 0.9 | | TRUE | 11.342 | 15000 | "b" | 25 | +------------+--------------------+---------+--------+------+
Pricing
All queries that use the ML.DETECT_ANOMALIES
function are billable, regardless
of the pricing model.
What's next
- For information about model inference, see Model inference overview.
- For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.