The CREATE MODEL statement for contribution analysis models
This document describes the CREATE MODEL
statement for creating
contribution analysis models in
BigQuery. You can use contribution analysis models to
generate insights about changes to key metrics in your multi-dimensional data.
After you have created a contribution analysis model, you can use the
ML.GET_INSIGHTS
function
to retrieve the metric information calculated by the model.
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} `project_id.dataset.model_name` OPTIONS( MODEL_TYPE = 'CONTRIBUTION_ANALYSIS', CONTRIBUTION_METRIC = 'contribution_metric', DIMENSION_ID_COLS = dimension_column_array, IS_TEST_COL = 'is_test_col' [, MIN_APRIORI_SUPPORT = min_apriori_support] ) AS query_statement;
CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name
exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model doesn't exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name
is not case-sensitive.
If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example, `myproject.mydataset.mymodel`.
MODEL_TYPE
Syntax
MODEL_TYPE = 'CONTRIBUTION_ANALYSIS'
Description
Specify the model type. This option is required.
CONTRIBUTION_METRIC
Syntax
CONTRIBUTION_METRIC = 'contribution_metric'
Description
Provides the expression to use to calculate the metric you are analyzing.
To calculate a summable metric,
the expression must be in the form SUM(metric_column_name)
.
To calculate a
summable ratio metric,
the expression must be in the form
SUM(numerator_metric_column_name)/SUM(denominator_metric_column_name)
.
The expression is case insensitive.
You can't use any additional numerical computations in the contribution metric
expression. For example, neither SUM(AVG(metric_col))
nor
AVG(SUM(round(metric_col_numerator))/(SUM(metric_col_denominator))
is valid.
You can perform additional computations in the
query_statement
if necessary.
The values in the metrics columns that you use in the CONTRIBUTION_METRIC
option must be positive, unless you specify 0
for the MIN_APRIORI_SUPPORT
value.
Arguments
A STRING
value. There is no default value.
DIMENSION_ID_COLS
Syntax
DIMENSION_ID_COLS = dimension_column_array
Description
Provides the names of the columns to use as dimensions when summarizing the
metric specified in the CONTRIBUTION_METRIC
option.
The dimension columns that you specify must have an INT64
, BOOL
, or STRING
data type.
If you want to use a
numerical
column that has a data type other than INT64
as a dimension, you can use a
numerical preprocessing function
in the query_statement
to transform continuous values into categorical values.
You can only use a preprocessing function that outputs a STRING
value,
for example
ML.BUCKETIZE
.
Using fewer dimensions helps reduce the query runtime when you use the model
with the ML.GET_INSIGHTS
function.
Arguments
An ARRAY<STRING>
value. For example,
['dimension_column_1','dimension_column_2','dimension_column_3']
.
The array can have 13 or fewer array elements when the MIN_APRIORI_SUPPORT
option value is 0
, and 50 or fewer array elements when the
MIN_APRIORI_SUPPORT
option value is greater than 0
. There is no
default value.
IS_TEST_COL
Syntax
IS_TEST_COL = 'is_test_col'
Description
Provides the name of the column to use to determine whether a given row is
test data or control data. The column that you specify must have a BOOL
data type.
Arguments
A STRING
value. There is no default value.
MIN_APRIORI_SUPPORT
Syntax
MIN_APRIORI_SUPPORT = min_apriori_support
Description
Provide the minimum
apriori support threshold for including
segments in the model output. All segments whose
apriori_support
value
is less than the MIN_APRIORI_SUPPORT
value that you specify are excluded.
Arguments
A FLOAT64
value in the range [0, 1]
. The default value is 0.1
.
query_statement
The
GoogleSQL query
that contains the test and control data to analyze. The table of input data
that you specify in the query must contain the columns that you reference in the
CONTRIBUTION_METRIC
, DIMENSION_ID_COLS
, and IS_TEST_COL
options.
Use a summable metric
A summable metric contribution analysis model summarizes the values of the a metric column that you specify, and then determines a total for each segment of the data. For example, you might summarize by revenue, or by number of items sold.
An example input dataset for a summable metric contribution analysis model might look similar to the following table:
product | store | city | revenue | is_test |
---|---|---|---|---|
shoe_1 | store_2 | Mountain View | 100 | true |
shoe_1 | store_3 | Sunnyvale | 200 | true |
shoe_2 | store_1 | San Francisco | 85 | true |
shoe_1 | store_2 | Mountain View | 100 | false |
shoe_1 | store_3 | Sunnyvale | 150 | false |
shoe_2 | store_1 | San Francisco | 175 | false |
In this case, the numerical revenue
column provides the metric to
analyze, so that is the column you would use in the CONTRIBUTION_METRIC
option. The product
, store
, and city
columns are the
dimensions to analyze as contributors to the revenue, so those are the columns
you would specify in the DIMENSION_ID_COLS
option. The is_test
column
value indicates whether the row belongs to the test set or the control set.
Use a summable ratio metric
A summable ratio metric contribution analysis model summarizes the values of two numeric columns that you specify, and determines the ratio between them for each segment of the data. For example, you might analyze the ratio between cost and clicks values to determine the cost per click in an ad campaign.
An example input dataset for a summable ratio metric contribution analysis model might look similar to the following table:
browser | device | cost | clicks | is_test |
---|---|---|---|---|
Chrome | iPad | 100.00 | 30 | true |
Firefox | Pixel | 100.00 | 10 | true |
Edge | Pixel | 250.00 | 40 | true |
Chrome | iPad | 100.00 | 40 | false |
Firefox | Pixel | 50.00 | 5 | false |
In this case, the numerical cost
and clicks
columns provide the metrics to
analyze, so those are the columns you would use in the CONTRIBUTION_METRIC
option. The browser
and device
columns are the dimensions to analyze as
contributors to the cost and click values, so those are the columns
you would specify in the DIMENSION_ID_COLS
option. The is_test
column
value indicates whether the row belongs to the test set or the control set.
Use an apriori support threshold
You can optimize a contribution analysis model by specifying an apriori support
threshold. The model uses the threshold value that you specify to prune the
search space and include only larger segments. It does this by comparing the
size of each segment relative to the rest of the population, and dropping the
segments where this ratio is less than the model's MIN_APRIORI_SUPPORT
value.
This pruning lets you analyze only the segments of data that are large enough to
be of interest, and also reduces model creation time.
Examples
The following examples show how to create contribution analysis models.
Example 1
The following example creates a contribution analysis model that uses a summable metric:
CREATE (OR REPLACE) MODEL `myproject.mydataset.ca_model` OPTIONS( MODEL_TYPE = 'CONTRIBUTION_ANALYSIS', CONTRIBUTION_METRIC = 'SUM(house_price)', DIMENSION_ID_COLS = ['city', 'floor_space'], IS_TEST_COL = 'dataset_type' ) AS SELECT * FROM mydataset.house_sales WHERE state = 'WA';
Example 2
The following example creates a contribution analysis model that uses a summable ratio metric:
CREATE (OR REPLACE) MODEL `myproject.mydataset.ca_model` OPTIONS( MODEL_TYPE = 'CONTRIBUTION_ANALYSIS', CONTRIBUTION_METRIC = 'SUM(concessions_sold)/SUM(attendee_count)', DIMENSION_ID_COLS = ['venue_ID','event_date','concession_type'], IS_TEST_COL = 'test_col', MIN_APRIORI_SUPPORT = .08 ) AS SELECT concessions_sold, attendee_count, venue_ID, event_date, concession_type, test_col FROM mydataset.regional_sales;
What's next
Get data insights from a contribution analysis model.