The CREATE MODEL statement for contribution analysis models

This document describes the CREATE MODEL statement for creating contribution analysis models in BigQuery. You can use contribution analysis models to generate insights about changes to key metrics in your multi-dimensional data.

After you have created a contribution analysis model, you can use the ML.GET_INSIGHTS function to retrieve the metric information calculated by the model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} 
`project_id.dataset.model_name`
OPTIONS(
  MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
  CONTRIBUTION_METRIC = 'contribution_metric',
  DIMENSION_ID_COLS = dimension_column_array,
  IS_TEST_COL = 'is_test_col'
  [, MIN_APRIORI_SUPPORT = min_apriori_support]
)
AS query_statement;

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model doesn't exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

model_name

The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.

MODEL_TYPE

Syntax

MODEL_TYPE = 'CONTRIBUTION_ANALYSIS'

Description

Specify the model type. This option is required.

CONTRIBUTION_METRIC

Syntax

CONTRIBUTION_METRIC = 'contribution_metric'

Description

Provides the expression to use to calculate the metric you are analyzing.

To calculate a summable metric, the expression must be in the form SUM(metric_column_name).

To calculate a summable ratio metric, the expression must be in the form SUM(numerator_metric_column_name)/SUM(denominator_metric_column_name).

The expression is case insensitive.

You can't use any additional numerical computations in the contribution metric expression. For example, neither SUM(AVG(metric_col)) nor AVG(SUM(round(metric_col_numerator))/(SUM(metric_col_denominator)) is valid. You can perform additional computations in the query_statement if necessary.

The values in the metrics columns that you use in the CONTRIBUTION_METRIC option must be positive, unless you specify 0 for the MIN_APRIORI_SUPPORT value.

Arguments

A STRING value. There is no default value.

DIMENSION_ID_COLS

Syntax

DIMENSION_ID_COLS = dimension_column_array

Description

Provides the names of the columns to use as dimensions when summarizing the metric specified in the CONTRIBUTION_METRIC option. The dimension columns that you specify must have an INT64, BOOL, or STRING data type.

If you want to use a numerical column that has a data type other than INT64 as a dimension, you can use a numerical preprocessing function in the query_statement to transform continuous values into categorical values. You can only use a preprocessing function that outputs a STRING value, for example ML.BUCKETIZE.

Using fewer dimensions helps reduce the query runtime when you use the model with the ML.GET_INSIGHTS function.

Arguments

An ARRAY<STRING> value. For example, ['dimension_column_1','dimension_column_2','dimension_column_3']. The array can have 13 or fewer array elements when the MIN_APRIORI_SUPPORT option value is 0, and 50 or fewer array elements when the MIN_APRIORI_SUPPORT option value is greater than 0. There is no default value.

IS_TEST_COL

Syntax

IS_TEST_COL = 'is_test_col'

Description

Provides the name of the column to use to determine whether a given row is test data or control data. The column that you specify must have a BOOL data type.

Arguments

A STRING value. There is no default value.

MIN_APRIORI_SUPPORT

Syntax

MIN_APRIORI_SUPPORT = min_apriori_support

Description

Provide the minimum apriori support threshold for including segments in the model output. All segments whose apriori_support value is less than the MIN_APRIORI_SUPPORT value that you specify are excluded.

Arguments

A FLOAT64 value in the range [0, 1]. The default value is 0.1.

query_statement

The GoogleSQL query that contains the test and control data to analyze. The table of input data that you specify in the query must contain the columns that you reference in the CONTRIBUTION_METRIC, DIMENSION_ID_COLS, and IS_TEST_COL options.

Use a summable metric

A summable metric contribution analysis model summarizes the values of the a metric column that you specify, and then determines a total for each segment of the data. For example, you might summarize by revenue, or by number of items sold.

An example input dataset for a summable metric contribution analysis model might look similar to the following table:

product store city revenue is_test
shoe_1 store_2 Mountain View 100 true
shoe_1 store_3 Sunnyvale 200 true
shoe_2 store_1 San Francisco 85 true
shoe_1 store_2 Mountain View 100 false
shoe_1 store_3 Sunnyvale 150 false
shoe_2 store_1 San Francisco 175 false

In this case, the numerical revenue column provides the metric to analyze, so that is the column you would use in the CONTRIBUTION_METRIC option. The product, store, and city columns are the dimensions to analyze as contributors to the revenue, so those are the columns you would specify in the DIMENSION_ID_COLS option. The is_test column value indicates whether the row belongs to the test set or the control set.

Use a summable ratio metric

A summable ratio metric contribution analysis model summarizes the values of two numeric columns that you specify, and determines the ratio between them for each segment of the data. For example, you might analyze the ratio between cost and clicks values to determine the cost per click in an ad campaign.

An example input dataset for a summable ratio metric contribution analysis model might look similar to the following table:

browser device cost clicks is_test
Chrome iPad 100.00 30 true
Firefox Pixel 100.00 10 true
Edge Pixel 250.00 40 true
Chrome iPad 100.00 40 false
Firefox Pixel 50.00 5 false

In this case, the numerical cost and clicks columns provide the metrics to analyze, so those are the columns you would use in the CONTRIBUTION_METRIC option. The browser and device columns are the dimensions to analyze as contributors to the cost and click values, so those are the columns you would specify in the DIMENSION_ID_COLS option. The is_test column value indicates whether the row belongs to the test set or the control set.

Use an apriori support threshold

You can optimize a contribution analysis model by specifying an apriori support threshold. The model uses the threshold value that you specify to prune the search space and include only larger segments. It does this by comparing the size of each segment relative to the rest of the population, and dropping the segments where this ratio is less than the model's MIN_APRIORI_SUPPORT value. This pruning lets you analyze only the segments of data that are large enough to be of interest, and also reduces model creation time.

Examples

The following examples show how to create contribution analysis models.

Example 1

The following example creates a contribution analysis model that uses a summable metric:

CREATE (OR REPLACE) MODEL `myproject.mydataset.ca_model`
  OPTIONS(
    MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
    CONTRIBUTION_METRIC = 'SUM(house_price)',
    DIMENSION_ID_COLS = ['city', 'floor_space'],
    IS_TEST_COL = 'dataset_type'
) AS
SELECT * FROM mydataset.house_sales WHERE state = 'WA';

Example 2

The following example creates a contribution analysis model that uses a summable ratio metric:

CREATE (OR REPLACE) MODEL `myproject.mydataset.ca_model`
  OPTIONS(
    MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
    CONTRIBUTION_METRIC = 'SUM(concessions_sold)/SUM(attendee_count)',
    DIMENSION_ID_COLS = ['venue_ID','event_date','concession_type'],
    IS_TEST_COL = 'test_col',
    MIN_APRIORI_SUPPORT = .08
) AS
SELECT concessions_sold, attendee_count, venue_ID,
  event_date, concession_type, test_col
FROM mydataset.regional_sales;

What's next

Get data insights from a contribution analysis model.