Single time-series forecasting from Google Analytics data


In this tutorial, you will learn how to create a time series model to perform single time-series forecasts using the google_analytics_sample.ga_sessions sample table.

The ga_sessions table contains information about a slice of session data collected by Google Analytics 360 and sent to BigQuery.

Objectives

In this tutorial, you use the following:

  • The CREATE MODEL statement: to create a time series model.
  • The ML.ARIMA_EVALUATE function: to evaluate the model.
  • The ML.ARIMA_COEFFICIENTS function: to inspect the model coefficients.
  • The ML.FORECAST function: to forecast daily total visits.
  • The ML.EXPLAIN_FORECAST function: to retrieve various components of the time series (such as seasonality and trend) that can be used to explain the forecast results.
  • Looker Studio: to visualize the forecasting results.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • BigQuery
  • BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

For more information about BigQuery ML costs, see BigQuery ML pricing.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  5. Make sure that billing is enabled for your Google Cloud project.

  6. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to

    Enable the BigQuery API.

    Enable the API

Step one: Create your dataset

Create a BigQuery dataset to store your ML model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

    Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

      The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.

    • Leave the remaining default settings as they are, and click Create dataset.

      Create dataset page.

Step two (optional): Visualize the time series you want to forecast

Before creating the model, it is useful to see what your input time series looks like.

SQL

You can do this by using Looker Studio.

In the following GoogleSQL query, the FROM bigquery-public-data.google_analytics_sample.ga_sessions_* clause indicates that you are querying the ga_sessions_* tables in the google_analytics_sample dataset. These tables are partitioned tables.

In the SELECT statement, the query parses the date column from the input table to the TIMESTAMP type and renames it to parsed_date. The query uses the SUM(...) clause and the GROUP BY date clause to accumulate totals.visits on a daily basis.

#standardSQL
SELECT
PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date,
SUM(totals.visits) AS total_visits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
GROUP BY date

To run the query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
    PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date,
    SUM(totals.visits) AS total_visits
    FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    GROUP BY date
  3. Click Run.

    The query takes about 7 seconds to complete. After this query runs, the output is similar to the following screenshot. In the screenshot, you can see that this time series has 366 data points. Click the Explore data button and then Explore with Looker Studio. Looker Studio opens in a new tab. Complete the following steps in the new tab.

    Query output

    In the Chart panel, choose Time series chart:

    Time_series_chart

    In the Data panel, below the Chart panel, go to the Metric section. Add the total_visits field and then remove the default metric Record Count. This is shown in the following figure.

    Time_series_data_fields

    After you complete these steps, the following plot appears. The plot shows that that the input time series has a weekly seasonal pattern.

    Result_visualization

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import bigframes.pandas as bpd

# Start by loading the historical data from BigQuerythat you want to analyze and forecast.
# This clause indicates that you are querying the ga_sessions_* tables in the google_analytics_sample dataset.
# Read and visualize the time series you want to forecast.
df = bpd.read_gbq("bigquery-public-data.google_analytics_sample.ga_sessions_*")
parsed_date = bpd.to_datetime(df.date, format="%Y%m%d", utc=True)
visits = df["totals"].struct.field("visits")
total_visits = visits.groupby(parsed_date).sum()

# Expected output: total_visits.head()
# date
# 2016-08-01 00:00:00+00:00    1711
# 2016-08-02 00:00:00+00:00    2140
# 2016-08-03 00:00:00+00:00    2890
# 2016-08-04 00:00:00+00:00    3161
# 2016-08-05 00:00:00+00:00    2702
# Name: visits, dtype: Int64

total_visits.plot.line()

The result is similar to the following: Result_visualization

Step three: Create your time series model

Next, create a time series model using the Google Analytics 360 data. The following GoogleSQL query creates a model used to forecast totals.visits.

The CREATE MODEL clause creates and trains a model named bqml_tutorial.ga_arima_model.

#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.ga_arima_model`
OPTIONS
  (model_type = 'ARIMA_PLUS',
   time_series_timestamp_col = 'parsed_date',
   time_series_data_col = 'total_visits',
   auto_arima = TRUE,
   data_frequency = 'AUTO_FREQUENCY',
   decompose_time_series = TRUE
  ) AS
SELECT
  PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date,
  SUM(totals.visits) AS total_visits
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
GROUP BY date

The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating a ARIMA-based time-series model. By default, auto_arima=TRUE, so the auto.ARIMA algorithm automatically tunes the hyper-parameters in ARIMA_PLUS models. The algorithm fits dozens of candidate models and chooses the best one with the lowest Akaike information criterion (AIC). Additionally, because the default is data_frequency='AUTO_FREQUENCY', the training process automatically infers the data frequency of the input time series. Lastly, the CREATE MODEL statement uses decompose_time_series=TRUE by default, and users can further understand how the time series is forecasted by fetching the separate time series components, such as seasonality and holiday effect.

Run the CREATE MODEL query to create and train your model:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    CREATE OR REPLACE MODEL `bqml_tutorial.ga_arima_model`
    OPTIONS
     (model_type = 'ARIMA_PLUS',
      time_series_timestamp_col = 'parsed_date',
      time_series_data_col = 'total_visits',
      auto_arima = TRUE,
      data_frequency = 'AUTO_FREQUENCY',
      decompose_time_series = TRUE
    ) AS
    SELECT
     PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date,
     SUM(totals.visits) AS total_visits
    FROM
     `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    GROUP BY date
  3. Click Run.

    The query takes about 43 seconds to complete, after which your model (ga_arima_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.

Step four: Inspect the evaluation metrics of all evaluated models

After creating your model, you can use the ML.ARIMA_EVALUATE function to see the evaluation metrics of all the candidate models evaluated during the process of automatic hyperparameter tuning.

In the following GoogleSQL query, the FROM clause uses the ML.ARIMA_EVALUATE function against your model, bqml_tutorial.ga_arima_model. By default this query returns the evaluation metrics of all the candidate models.

To run the ML.ARIMA_EVALUATE query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.ARIMA_EVALUATE(MODEL `bqml_tutorial.ga_arima_model`)
  3. Click Run.

  4. The query takes less than a second to complete. When the query is complete, click the Results tab below the query text area. The results should look similar to the following screenshot:

    ML.ARIMA_EVALUATE output.

    The results include the following columns:

    • non_seasonal_p
    • non_seasonal_d
    • non_seasonal_q
    • has_drift
    • log_likelihood
    • AIC
    • variance
    • seasonal_periods
    • has_holiday_effect
    • has_spikes_and_dips
    • has_step_changes
    • error_message

    The following four columns (non_seasonal_{p,d,q} and has_drift) define an ARIMA model in the training pipeline. The three metrics after that (log_likelihood, AIC, and variance) are relevant to the ARIMA model fitting process.

    The auto.ARIMA algorithm first uses the KPSS test to decide that the best value for non_seasonal_d is 1. When non_seasonal_d is 1, auto.ARIMA then trains 42 different candidate ARIMA models in parallel. Note that when non_seasonal_d is not 1, auto.ARIMA trains 21 different candidate models. In this example, all 42 candidate models are valid. Therefore, the output contains 42 rows, where each row is associated with a candidate ARIMA model. Note that for some time series, some candidate models are invalid as they are either non-invertible or non-stationary. These invalid models are excluded from the output, which will make the output have less than 42 rows. These candidate models are ordered by AIC in the ascending order. The model in the first row has the lowest AIC, and is considered as the best model. This best model is saved as the final model and is used when you call ML.EXPLAIN_FORECAST, ML.FORECAST, and ML.ARIMA_COEFFICIENTS as shown in the following steps.

    The seasonal_periods column is about the seasonal pattern inside the input time series. It has nothing to do with the ARIMA modeling, therefore it has the same value across all output rows. It reports a weekly pattern, which is within our expectation as described in step two above.

    The has_holiday_effect, has_spikes_and_dips, and has_step_changes columns are only populated when decompose_time_series=TRUE. They are about the holiday effect, spikes and dips, and step changes inside the input time series, which are not related to the ARIMA modeling. Therefore they are all the same across all output rows, except for those failed models.

    The error_message column shows that the possible error incurred during the auto.ARIMA fitting process. A possible reason might be that the selected non_seasonal_p, non_seasonal_d, non_seasonal_q, and has_drift columns are not able to stabilize the time series. To retrieve the possible error message of all the candidate models, set show_all_candidate_models=true.

Step five: Inspect the coefficients of your model

The ML.ARIMA_COEFFICIENTS function retrieves the model coefficients of your ARIMA_PLUS model, bqml_tutorial.ga_arima_model. ML.ARIMA_COEFFICIENTS takes the model as the only input.

Run the ML.ARIMA_COEFFICIENTS query:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.ARIMA_COEFFICIENTS(MODEL `bqml_tutorial.ga_arima_model`)
  3. Click Run.

    The query takes less than a second to complete. The results should look like the following:

    ML.ARIMA_COEFFICIENTS output.

    The results include the following columns:

    • ar_coefficients
    • ma_coefficients
    • intercept_or_drift

    ar_coefficients shows the model coefficients of the autoregressive (AR) part of the ARIMA model. Similarly, ma_coefficients shows the model coefficients of moving-average (MA) part. They are both arrays, whose lengths are equal to non_seasonal_p and non_seasonal_q, respectively. From the output of ML.ARIMA_EVALUATE, the best model as in the top row has a non_seasonal_p of 2 and a non_seasonal_q of 3. Therefore, ar_coefficients is a length-2 array and ma_coefficients is a length-3 array. The intercept_or_drift is the constant term in the ARIMA model.

Step six: Use your model to forecast the time series

The ML.FORECAST function forecasts future time series values with a prediction interval using your model: bqml_tutorial.ga_arima_model.

In the following GoogleSQL query, the STRUCT(30 AS horizon, 0.8 AS confidence_level) clause indicates that the query forecasts 30 future time points, and generates a prediction interval with a 80% confidence level. ML.FORECAST takes the model, as well as a couple of optional arguments.

To run the ML.FORECAST query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.FORECAST(MODEL `bqml_tutorial.ga_arima_model`,
                 STRUCT(30 AS horizon, 0.8 AS confidence_level))
  3. Click Run.

    The query takes less than a second to complete. The results should look like the following:

    ML.FORECAST output.

    The results include the following columns:

    • forecast_timestamp
    • forecast_value
    • standard_error
    • confidence_level
    • prediction_interval_lower_bound
    • prediction_interval_upper_bound
    • confidence_interval_lower_bound (soon to be deprecated)
    • confidence_interval_upper_bound (soon to be deprecated)

    The output rows are ordered in the chronological order of forecast_timestamp. In time series forecasting, the prediction interval, which is captured by the lower and upper bounds, is as important as the forecast_value. The forecast_value is the middle point of the prediction interval. The prediction interval depends on the standard_error and confidence_level.

Step seven: Explain and visualize the forecasting results

To understand how the time series is forecasted, and to visualize the forecasted time series together with the history time series and all the separate components, the ML.EXPLAIN_FORECAST function forecasts future time series values with a prediction interval using your model, bqml_tutorial.ga_arima_model, and at the same time returns all the separate components of the time series.

Like the ML.FORECAST function, the STRUCT(30 AS horizon, 0.8 AS confidence_level) clause indicates that the query forecasts 30 future time points and generates a prediction interval with 80% confidence. The ML.EXPLAIN_FORECAST function takes the model, as well as a couple of optional arguments.

To run the ML.EXPLAIN_FORECAST query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.EXPLAIN_FORECAST(MODEL `bqml_tutorial.ga_arima_model`,
                         STRUCT(30 AS horizon, 0.8 AS confidence_level))
  3. Click Run.

    The query takes less than a second to complete. The results should look like the following:

    ML.EXPLAIN_FORECAST output1. ML.EXPLAIN_FORECAST output2.

    The results include the following columns:

    • time_series_timestamp
    • time_series_type
    • time_series_data
    • time_series_adjusted_data
    • standard_error
    • confidence_level
    • prediction_interval_lower_bound
    • prediction_interval_lower_bound
    • trend
    • seasonal_period_yearly
    • seasonal_period_quarterly
    • seasonal_period_monthly
    • seasonal_period_weekly
    • seasonal_period_daily
    • holiday_effect
    • spikes_and_dips
    • step_changes
    • residual

    The output rows are ordered in the chronological order of time_series_timestamp. Different components are listed as columns of the output. For more information, see the definition of ML.EXPLAIN_FORECAST.

  4. After the query completes, click the Explore data button, and then click Explore with Looker Studio. Looker Studio opens in a new tab.

    Union_all_query output

  5. In the Chart panel, choose Time series chart:

    Time_series_chart

  6. In the Data panel, do the following:

    1. In the Date Range Dimension section, select time_series_timestamp (Date).
    2. In the Dimension section, select time_series_timestamp (Date).
    3. In the Metric section, remove the default metric Record Count, and add the following:
      • time_series_data
      • prediction_interval_lower_bound
      • prediction_interval_upper_bound
      • trend
      • seasonal_period_weekly
      • step_changes

    Data_panel

  7. In the Style panel, scroll down to the Missing Data option, and use Line Breaks instead of Line to Zero.

    Style_section

    The following plot appears:

    Result_visualization

Step eight (optional): Visualize the forecasting results without having decompose_time_series enabled

If decompose_time_series is set to false in the ARIMA_PLUS training, then you can concatenate the history time series and the forecasted time series using the UNION ALL clause and the ML.FORECAST function.

In the following query, the SQL before the UNION ALL clause forms the history time series. The SQL after the UNION ALL clause uses the ML.FORECAST function to generate the forecasted time series as well as the prediction interval. The query uses different fields for history_value and forecasted_value to plot them in different colors.

To run the query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
     history_timestamp AS timestamp,
     history_value,
     NULL AS forecast_value,
     NULL AS prediction_interval_lower_bound,
     NULL AS prediction_interval_upper_bound
    FROM
     (
       SELECT
         PARSE_TIMESTAMP("%Y%m%d", date) AS history_timestamp,
         SUM(totals.visits) AS history_value
       FROM
         `bigquery-public-data.google_analytics_sample.ga_sessions_*`
       GROUP BY date
       ORDER BY date ASC
     )
    UNION ALL
    SELECT
     forecast_timestamp AS timestamp,
     NULL AS history_value,
     forecast_value,
     prediction_interval_lower_bound,
     prediction_interval_upper_bound
    FROM
     ML.FORECAST(MODEL `bqml_tutorial.ga_arima_model`,
                 STRUCT(30 AS horizon, 0.8 AS confidence_level))
  3. Click Run.

  4. After the query completes, click the Explore data button, and then click Explore with Looker Studio. Looker Studio opens in a new tab. Complete the following steps in the new tab.

    Union_all_query output

  5. In the Chart panel, choose Time series chart:

    Time_series_chart

  6. In the Data panel, below the Chart panel, go to the Metric section. Add the following metrics: history_value, forecast_value, prediction_interval_lower_bound, and prediction_interval_upper_bound. Then, remove the default metric Record Count.

    Data_section

  7. In the Style panel, scroll down to the Missing Data option and use Line Breaks instead of Line to Zero.

    Style_section

    After you complete these steps, the following plot appears in the left panel. The input history time series is in blue, while the forecasted series is in green. The prediction interval is the region between the lower bound series and the upper bound series.

    Result_visualization

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.

  4. In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Delete your project

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next