The CREATE MODEL statement for contribution analysis models

This document describes the `CREATE MODEL` statement for creating contribution analysis models in BigQuery. You can use contribution analysis models to generate insights about changes to key metrics in your multi-dimensional data.

After you have created a contribution analysis model, you can use the `ML.GET_INSIGHTS` function to retrieve the metric information calculated by the model.
CREATE MODEL syntax

```sql
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
`project_id.dataset.model_name`
OPTIONS(
  MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
  CONTRIBUTION_METRIC = 'contribution_metric',
  DIMENSION_ID_COLS = dimension_column_array,
  IS_TEST_COL = 'is_test_col'
  [, MIN_APRIORI_SUPPORT = min_apriori_support]
  [, TOP_K_INSIGHTS_BY_APRIORI_SUPPORT = top_k_insights_by_apriori_support]
  [, PRUNING_METHOD = {'NO_PRUNING' | 'PRUNE_REDUNDANT_INSIGHTS'}]
)
AS query_statement;
```
CREATE MODEL
Creates and trains a new model in the specified dataset. If a model with the same name already exists, `CREATE MODEL` returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model doesn't exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
The name of the model that you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (uppercase or lowercase), numbers, and underscores

`model_name` is not case-sensitive.

If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]`. For example, `myproject.mydataset.mymodel`.
MODEL_TYPE
Syntax
MODEL_TYPE = 'CONTRIBUTION_ANALYSIS'
Description
Specifies the model type. This option is required.
CONTRIBUTION_METRIC
Syntax
CONTRIBUTION_METRIC = 'contribution_metric'
Description
Provides the expression to use to calculate the metric that you're analyzing.

To calculate a summable metric, the expression must be in the form `SUM(metric_column_name)`, where `metric_column_name` has a numeric data type.

To calculate a summable ratio metric, the expression must be in the form `SUM(numerator_metric_column_name)/SUM(denominator_metric_column_name)`, where `numerator_metric_column_name` and `denominator_metric_column_name` have numeric data types.

To calculate a summable by category metric, the expression must be in the form `SUM(metric_sum_column_name)/COUNT(DISTINCT categorical_column_name)`. The summed column must have a numeric data type. The categorical column must have a `BOOL`, `DATE`, `DATETIME`, `TIME`, `TIMESTAMP`, `STRING`, or `INT64` data type.

The expression is case-insensitive.

You can't use any additional numerical computations in the contribution metric expression. For example, neither `SUM(AVG(metric_col))` nor `AVG(SUM(ROUND(metric_col_numerator))/SUM(metric_col_denominator))` is valid. You can perform additional computations in the `query_statement` if necessary.
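For instance, instead of nesting functions in the contribution metric expression, you can compute a derived metric column in the query statement and reference it with a plain `SUM()`. The following is a hypothetical sketch; the table and column names (`mydataset.sales`, `gross_revenue`, `refunds`, and so on) are illustrative:

```sql
-- Hypothetical sketch: precompute a derived metric column in the query
-- statement, then reference it with a plain SUM() in CONTRIBUTION_METRIC.
CREATE OR REPLACE MODEL `myproject.mydataset.ca_model`
OPTIONS(
  MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
  CONTRIBUTION_METRIC = 'SUM(net_revenue)',  -- no nested functions needed
  DIMENSION_ID_COLS = ['region'],
  IS_TEST_COL = 'is_test'
) AS
SELECT
  region,
  is_test,
  ROUND(gross_revenue - refunds, 2) AS net_revenue  -- computation done here
FROM mydataset.sales;
```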
The values in the metric columns that you use in the `CONTRIBUTION_METRIC` option must be non-negative, unless you specify `0` for the `MIN_APRIORI_SUPPORT` value.
Arguments
A `STRING` value. There is no default value.
DIMENSION_ID_COLS
Syntax
DIMENSION_ID_COLS = dimension_column_array
Description
Provides the names of the columns to use as dimensions when summarizing the metric specified in the `CONTRIBUTION_METRIC` option. The dimension columns that you specify must have an `INT64`, `BOOL`, or `STRING` data type.
Any rows in the dimension columns that contain `NULL` values are removed. If you have `NULL` values in your data, you can preprocess the data to fill in the `NULL` values by using the `ML.IMPUTER` function, or by using the `IFNULL` expression to replace `NULL` values with a custom value.
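As a sketch, the following hypothetical query statement uses `IFNULL` to replace `NULL` dimension values with a placeholder string so that those rows aren't dropped; the table and column names are illustrative:

```sql
-- Hypothetical sketch: replace NULL dimension values with a custom
-- placeholder before model creation.
SELECT
  IFNULL(city, 'unknown') AS city,
  IFNULL(store, 'unknown') AS store,
  revenue,
  is_test
FROM mydataset.sales;
```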
If you want to use a numerical column that has a data type other than `INT64` as a dimension, you can use any function that outputs an `INT64`, `BOOL`, or `STRING` value in the `query_statement` to transform continuous values. For example, you could use one of the numerical preprocessing functions, such as `ML.BUCKETIZE`.
Using fewer dimensions helps reduce the model creation runtime.
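For example, a hypothetical query statement might bucketize a continuous `FLOAT64` column into a `STRING` dimension by using `ML.BUCKETIZE`; the table name, column names, and split points are illustrative:

```sql
-- Hypothetical sketch: convert a continuous column into a categorical
-- dimension by bucketizing it at chosen split points.
SELECT
  ML.BUCKETIZE(house_price, [100000.0, 250000.0, 500000.0]) AS price_bucket,
  city,
  revenue,
  is_test
FROM mydataset.house_sales;
```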
Arguments
An `ARRAY<STRING>` value. For example, `['dimension_column_1','dimension_column_2','dimension_column_3']`. The array can have 13 or fewer elements when the `MIN_APRIORI_SUPPORT` option value is `0`, and 50 or fewer elements when the `MIN_APRIORI_SUPPORT` option value is greater than `0`. There is no default value.
IS_TEST_COL
Syntax
IS_TEST_COL = 'is_test_col'
Description
Provides the name of the column to use to determine whether a given row is test data or control data. The column that you specify must have a `BOOL` data type.
Arguments
A `STRING` value. There is no default value.
MIN_APRIORI_SUPPORT
Syntax
MIN_APRIORI_SUPPORT = min_apriori_support
Description
Provides the minimum apriori support threshold for including segments in the model output. All segments whose `apriori_support` value is less than the `MIN_APRIORI_SUPPORT` value that you specify are excluded. The `MIN_APRIORI_SUPPORT` option can't be used together with `TOP_K_INSIGHTS_BY_APRIORI_SUPPORT`.
Arguments
A `FLOAT64` value in the range `[0, 1]`. The default value is `0.1`.
TOP_K_INSIGHTS_BY_APRIORI_SUPPORT
Syntax
TOP_K_INSIGHTS_BY_APRIORI_SUPPORT = top_k_insights_by_apriori_support
Description
Provides the maximum number of insights to output, ranked by apriori support value. The model automatically retrieves the insights with the highest apriori support values and prunes insights with low apriori support, decreasing query runtime compared to using no apriori support thresholding. The `TOP_K_INSIGHTS_BY_APRIORI_SUPPORT` option can't be used together with `MIN_APRIORI_SUPPORT`.

Arguments

An `INT64` value between 1 and 1,000,000. There is no default value. If `TOP_K_INSIGHTS_BY_APRIORI_SUPPORT` is not specified, a `MIN_APRIORI_SUPPORT` threshold of `0.1` is applied.
PRUNING_METHOD
Syntax
PRUNING_METHOD = {'NO_PRUNING' | 'PRUNE_REDUNDANT_INSIGHTS'}
Description
The pruning method used to filter insights generated by the model.
Arguments
- `NO_PRUNING`: All insights are returned, respecting any apriori support thresholds specified by `MIN_APRIORI_SUPPORT` or `TOP_K_INSIGHTS_BY_APRIORI_SUPPORT`.
- `PRUNE_REDUNDANT_INSIGHTS`: Redundant insights are pruned from the model. Two rows are considered to be redundant insights if the following conditions are met:
  - Their metric values are equal.
  - The dimension IDs and corresponding values of one row are a subset of the dimension IDs and corresponding values of the other. In this case, the row with more dimension IDs (the more descriptive row) is kept.

The `all` insight row is never pruned.

The default value is `NO_PRUNING`.
query_statement

The GoogleSQL query that contains the test and control data to analyze. The table of input data that you specify in the query must contain exactly the columns that you reference in the `CONTRIBUTION_METRIC`, `DIMENSION_ID_COLS`, and `IS_TEST_COL` options.
Choose test and control data
Contribution analysis models require test and control data in a single table as input. The control set is used as a baseline to compare against the test set, which contains the data points of interest. The following are examples of test and control sets:
- Periods of time: Compare two different months of revenue data.
- Geographic regions: Compare retention rate across different countries.
- Product types: Compare sales per user across different product types.
- Campaign or promotion: Compare website engagement across different ad campaigns.
To set up the input data, you can create tables of test and control data separately and take the union of the two tables. The final input table must contain your dimension columns, numeric metric columns, and a categorical column if you're using a summable by category metric.

For best results, we recommend having roughly equal numbers of test and control rows when using the summable and summable ratio metrics, to avoid biased results. If the numbers of test and control rows are unbalanced, then the summable by category metric might reduce bias by normalizing the numeric metric against categorical values, allowing for easier comparisons between contributors.
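For example, the union of separate test and control tables might be set up as follows. This is a hypothetical sketch; the table names and the choice of December versus November data are illustrative:

```sql
-- Hypothetical sketch: label and combine separate test and control tables
-- into the single input table that the model expects.
SELECT product, store, city, revenue, TRUE AS is_test
FROM mydataset.sales_december   -- test set
UNION ALL
SELECT product, store, city, revenue, FALSE AS is_test
FROM mydataset.sales_november;  -- control set
```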
Use a summable metric
A summable metric contribution analysis model summarizes the values of a metric column that you specify, and then determines a total for each segment of the data. For example, you might summarize by revenue, or by number of items sold.
An example input dataset for a summable metric contribution analysis model might look similar to the following table:

| product | store   | city          | revenue | is_test |
|---------|---------|---------------|---------|---------|
| shoe_1  | store_2 | Mountain View | 100     | true    |
| shoe_1  | store_3 | Sunnyvale     | 200     | true    |
| shoe_2  | store_1 | San Francisco | 85      | true    |
| shoe_1  | store_2 | Mountain View | 100     | false   |
| shoe_1  | store_3 | Sunnyvale     | 150     | false   |
| shoe_2  | store_1 | San Francisco | 175     | false   |
In this example, the numerical `revenue` column provides the metric to analyze. The `product`, `store`, and `city` columns are the dimensions to analyze as contributors to total revenue. The `is_test` column value indicates whether the row belongs to the test set or the control set.

Your `CREATE MODEL` statement would include the following:

```sql
CONTRIBUTION_METRIC = 'SUM(revenue)',
DIMENSION_ID_COLS = ['product', 'store', 'city'],
IS_TEST_COL = 'is_test',
```
Use a summable ratio metric
A summable ratio metric contribution analysis model summarizes the values of two numeric columns that you specify, and determines the ratio between them for each segment of the data. For example, you might analyze the ratio between cost and clicks values to determine the cost per click in an ad campaign.
An example input dataset for a summable ratio metric contribution analysis model might look similar to the following table:

| browser | device | cost   | clicks | is_test |
|---------|--------|--------|--------|---------|
| Chrome  | iPad   | 100.00 | 30     | true    |
| Firefox | Pixel  | 100.00 | 10     | true    |
| Edge    | Pixel  | 250.00 | 40     | true    |
| Chrome  | iPad   | 100.00 | 40     | false   |
| Firefox | Pixel  | 50.00  | 5      | false   |
In this case, the numerical `cost` and `clicks` columns provide the metrics to analyze. The `browser` and `device` columns are the dimensions to analyze as contributors to the cost and click values. The `is_test` column value indicates whether the row belongs to the test set or the control set.

Your `CREATE MODEL` statement would include the following:

```sql
CONTRIBUTION_METRIC = 'SUM(cost)/SUM(clicks)',
DIMENSION_ID_COLS = ['browser', 'device'],
IS_TEST_COL = 'is_test',
```
Use a summable by category metric
A summable by category metric contribution analysis model sums the value of a numeric column and divides it by the number of distinct values from a categorical column, defined as a column with a `BOOL`, `DATE`, `DATETIME`, `TIME`, `TIMESTAMP`, `STRING`, or `INT64` data type.
For example, if `sales` is your numeric column and `month` is your categorical column, then the metric would analyze the contributions of each segment to monthly sales.
An example input dataset for a summable by category metric contribution analysis model might look similar to the following table:

| item_id | category | sales | month    | is_test |
|---------|----------|-------|----------|---------|
| 1       | shirts   | 150   | December | true    |
| 2       | jackets  | 100   | January  | true    |
| 3       | hats     | 50    | April    | true    |
| 1       | shirts   | 125   | February | false   |
| 4       | hats     | 10    | February | false   |
| 2       | jackets  | 75    | November | false   |
In this case, the numerical `sales` and categorical `month` columns provide the metrics to analyze. The `item_id` and `category` columns are the dimensions to analyze as contributors to the sales per month values. The `is_test` column value indicates whether the row belongs to the test set or the control set.

Your `CREATE MODEL` statement would include the following:

```sql
CONTRIBUTION_METRIC = 'SUM(sales)/COUNT(DISTINCT month)',
DIMENSION_ID_COLS = ['item_id', 'category'],
IS_TEST_COL = 'is_test',
```
Use an apriori support threshold
You can optimize a contribution analysis model by specifying an apriori support threshold. The model uses the threshold value that you specify to prune the search space and include only larger segments. Apriori support is measured by comparing the size of each segment relative to the rest of the population, and dropping the segments where this ratio is less than the model's apriori support threshold value. This pruning lets you analyze only the segments of data that are large enough to be of interest, and also reduces model creation time.

There are two ways that you can use an apriori support threshold in a contribution analysis model:
- Tune your own apriori support value with `MIN_APRIORI_SUPPORT`.
- Provide the maximum number of insights to retrieve using `TOP_K_INSIGHTS_BY_APRIORI_SUPPORT`, which automatically fetches the `k` insights with the highest apriori support values.
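To build intuition for what the threshold measures, the following hypothetical query approximates each single-dimension segment's share of the test-set metric total. The model computes apriori support internally across multi-dimension segments, so this is only an illustration, and the table and column names are assumptions:

```sql
-- Hypothetical illustration: approximate each city segment's share of the
-- test-set revenue total. Segments whose share falls below the threshold
-- (for example, 0.1) would be pruned.
SELECT
  city,
  SUM(revenue) / SUM(SUM(revenue)) OVER () AS approx_apriori_support
FROM mydataset.sales
WHERE is_test
GROUP BY city;
```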
Examples
The following examples show how to create contribution analysis models.
Example 1
The following example creates a contribution analysis model that uses a summable metric:
```sql
CREATE OR REPLACE MODEL `myproject.mydataset.ca_model`
OPTIONS(
  MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
  CONTRIBUTION_METRIC = 'SUM(house_price)',
  DIMENSION_ID_COLS = ['city', 'floor_space'],
  IS_TEST_COL = 'dataset_type'
) AS
SELECT * FROM mydataset.house_sales
WHERE state = 'WA';
```
Example 2
The following example creates a contribution analysis model that uses a summable ratio metric:
```sql
CREATE OR REPLACE MODEL `myproject.mydataset.ca_model`
OPTIONS(
  MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
  CONTRIBUTION_METRIC = 'SUM(concessions_sold)/SUM(attendee_count)',
  DIMENSION_ID_COLS = ['venue_ID', 'event_date', 'concession_type'],
  IS_TEST_COL = 'test_col',
  MIN_APRIORI_SUPPORT = 0.08
) AS
SELECT
  concessions_sold,
  attendee_count,
  venue_ID,
  event_date,
  concession_type,
  test_col
FROM mydataset.regional_sales;
```
Example 3
The following example creates a contribution analysis model that uses a summable by category metric:
```sql
CREATE OR REPLACE MODEL `myproject.mydataset.ca_model`
OPTIONS(
  MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
  CONTRIBUTION_METRIC = 'SUM(concessions_sold)/COUNT(DISTINCT sales_date)',
  DIMENSION_ID_COLS = ['venue_ID', 'event_date', 'attendee_count', 'concession_type'],
  IS_TEST_COL = 'test_col',
  TOP_K_INSIGHTS_BY_APRIORI_SUPPORT = 25,
  PRUNING_METHOD = 'PRUNE_REDUNDANT_INSIGHTS'
) AS
SELECT
  concessions_sold,
  attendee_count,
  venue_ID,
  event_date,
  concession_type,
  test_col,
  sales_date
FROM mydataset.regional_sales;
```
What's next

- Learn how to call the `ML.GET_INSIGHTS` function to get insights from your contribution analysis model.
- Try a tutorial on how to get data insights from a contribution analysis model using a summable metric.