In this tutorial, you will learn how to create a time series model to perform single time-series forecasts using the `google_analytics_sample.ga_sessions` sample table. The `ga_sessions` table contains information about a slice of session data collected by Google Analytics 360 and sent to BigQuery.
Objectives
In this tutorial, you use the following:
- The `CREATE MODEL` statement: to create a time series model.
- The `ML.ARIMA_EVALUATE` function: to evaluate the model.
- The `ML.ARIMA_COEFFICIENTS` function: to inspect the model coefficients.
- The `ML.FORECAST` function: to forecast daily total visits.
- The `ML.EXPLAIN_FORECAST` function: to retrieve various components of the time series (such as seasonality and trend) that can be used to explain the forecast results.
- Looker Studio: to visualize the forecasting results.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
For more information about BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Step one: Create your dataset
Create a BigQuery dataset to store your ML model:
1. In the Google Cloud console, go to the BigQuery page.
2. In the Explorer pane, click your project name.
3. Click View actions > Create dataset.
4. On the Create dataset page, do the following:
   - For Dataset ID, enter `bqml_tutorial`.
   - For Location type, select Multi-region, and then select US (multiple regions in United States). The public datasets are stored in the `US` multi-region; for simplicity, store your dataset in the same location.
   - Leave the remaining default settings as they are, and click Create dataset.
Step two (optional): Visualize the time series you want to forecast
Before creating the model, it is useful to see what your input time series looks like.
You can do this by using Looker Studio.
In the following GoogleSQL query, the `FROM bigquery-public-data.google_analytics_sample.ga_sessions_*` clause indicates that you are querying the `ga_sessions_*` tables in the `google_analytics_sample` dataset. These tables are partitioned tables.

In the `SELECT` statement, the query parses the `date` column from the input table to the `TIMESTAMP` type and renames it to `parsed_date`. The query uses the `SUM(...)` clause and the `GROUP BY date` clause to accumulate `totals.visits` on a daily basis.

```sql
#standardSQL
SELECT
  PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date,
  SUM(totals.visits) AS total_visits
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
GROUP BY date
```
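The parse-then-aggregate logic of this query can be sketched in plain Python, which may help if you are more familiar with procedural code than SQL. The session rows below are invented for illustration; only the shape of the computation mirrors the query.

```python
from collections import defaultdict
from datetime import datetime

# Each dict stands in for one exported Google Analytics session row
# (values invented for illustration).
sessions = [
    {"date": "20170801", "visits": 1},
    {"date": "20170801", "visits": 1},
    {"date": "20170802", "visits": 1},
]

daily_totals = defaultdict(int)
for row in sessions:
    # Equivalent of PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date
    parsed_date = datetime.strptime(row["date"], "%Y%m%d")
    # Equivalent of SUM(totals.visits) ... GROUP BY date
    daily_totals[parsed_date] += row["visits"]
```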
To run the query, use the following steps:
1. In the Google Cloud console, click the Compose new query button.
2. Enter the following GoogleSQL query in the Query editor text area.

   ```sql
   #standardSQL
   SELECT
     PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date,
     SUM(totals.visits) AS total_visits
   FROM
     `bigquery-public-data.google_analytics_sample.ga_sessions_*`
   GROUP BY date
   ```
3. Click Run.
The query takes about 7 seconds to complete. After this query runs, the output shows that this time series has 366 data points. Click the Explore data button, and then click Explore with Looker Studio. Looker Studio opens in a new tab. Complete the following steps in the new tab.
In the Chart panel, choose Time series chart:
In the Data panel, below the Chart panel, go to the Metric section. Add the `total_visits` field, and then remove the default metric Record Count.
After you complete these steps, the following plot appears. The plot shows that the input time series has a weekly seasonal pattern.
Step three: Create your time series model
Next, create a time series model using the Google Analytics 360 data.
The following GoogleSQL query creates a model used to forecast `totals.visits`. The `CREATE MODEL` clause creates and trains a model named `bqml_tutorial.ga_arima_model`.

```sql
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.ga_arima_model`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'parsed_date',
  time_series_data_col = 'total_visits',
  auto_arima = TRUE,
  data_frequency = 'AUTO_FREQUENCY',
  decompose_time_series = TRUE
) AS
SELECT
  PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date,
  SUM(totals.visits) AS total_visits
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
GROUP BY date
```
The `OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='parsed_date', ...)` clause indicates that you are creating an ARIMA-based time-series model. By default, `auto_arima=TRUE`, so the auto.ARIMA algorithm automatically tunes the hyperparameters in `ARIMA_PLUS` models. The algorithm fits dozens of candidate models and chooses the one with the lowest Akaike information criterion (AIC). Additionally, because the default is `data_frequency='AUTO_FREQUENCY'`, the training process automatically infers the data frequency of the input time series. Lastly, the `CREATE MODEL` statement uses `decompose_time_series=TRUE` by default, so you can further understand how the time series is forecasted by fetching the separate time series components, such as seasonality and holiday effect.
Run the `CREATE MODEL` query to create and train your model:

1. In the Google Cloud console, click the Compose new query button.
2. Enter the following GoogleSQL query in the Query editor text area.

   ```sql
   #standardSQL
   CREATE OR REPLACE MODEL `bqml_tutorial.ga_arima_model`
   OPTIONS (
     model_type = 'ARIMA_PLUS',
     time_series_timestamp_col = 'parsed_date',
     time_series_data_col = 'total_visits',
     auto_arima = TRUE,
     data_frequency = 'AUTO_FREQUENCY',
     decompose_time_series = TRUE
   ) AS
   SELECT
     PARSE_TIMESTAMP("%Y%m%d", date) AS parsed_date,
     SUM(totals.visits) AS total_visits
   FROM
     `bigquery-public-data.google_analytics_sample.ga_sessions_*`
   GROUP BY date
   ```
3. Click Run.
The query takes about 43 seconds to complete, after which your model (`ga_arima_model`) appears in the navigation panel. Because the query uses a `CREATE MODEL` statement to create a model, you do not see query results.
Step four: Inspect the evaluation metrics of all evaluated models
After creating your model, you can use the `ML.ARIMA_EVALUATE` function to see the evaluation metrics of all the candidate models evaluated during the process of automatic hyperparameter tuning.

In the following GoogleSQL query, the `FROM` clause uses the `ML.ARIMA_EVALUATE` function against your model, `bqml_tutorial.ga_arima_model`. By default, this query returns the evaluation metrics of all the candidate models.

To run the `ML.ARIMA_EVALUATE` query, use the following steps:

1. In the Google Cloud console, click the Compose new query button.
2. Enter the following GoogleSQL query in the Query editor text area.

   ```sql
   #standardSQL
   SELECT * FROM ML.ARIMA_EVALUATE(MODEL `bqml_tutorial.ga_arima_model`)
   ```
3. Click Run.
The query takes less than a second to complete. When the query is complete, click the Results tab below the query text area. The results should look similar to the following screenshot:
The results include the following columns:

- `non_seasonal_p`
- `non_seasonal_d`
- `non_seasonal_q`
- `has_drift`
- `log_likelihood`
- `AIC`
- `variance`
- `seasonal_periods`
- `has_holiday_effect`
- `has_spikes_and_dips`
- `has_step_changes`
- `error_message`
The first four columns (`non_seasonal_{p,d,q}` and `has_drift`) define an ARIMA model in the training pipeline. The three metrics after that (`log_likelihood`, `AIC`, and `variance`) are relevant to the ARIMA model fitting process.

The auto.ARIMA algorithm first uses the KPSS test to decide that the best value for `non_seasonal_d` is 1. When `non_seasonal_d` is 1, auto.ARIMA trains 42 different candidate ARIMA models in parallel; when `non_seasonal_d` is not 1, it trains 21 candidate models. In this example, all 42 candidate models are valid, so the output contains 42 rows, each associated with one candidate ARIMA model. For some time series, some candidate models are invalid because they are either non-invertible or non-stationary; these invalid models are excluded from the output, which leaves the output with fewer than 42 rows. The candidate models are ordered by AIC in ascending order. The model in the first row has the lowest AIC and is considered the best model. The best model is saved as the final model and is used when you call `ML.EXPLAIN_FORECAST`, `ML.FORECAST`, and `ML.ARIMA_COEFFICIENTS`, as shown in the following steps.

The `seasonal_periods` column describes the seasonal pattern inside the input time series. It has nothing to do with the ARIMA modeling, so it has the same value across all output rows. It reports a weekly pattern, which matches the expectation described in step two above.

The `has_holiday_effect`, `has_spikes_and_dips`, and `has_step_changes` columns are only populated when `decompose_time_series=TRUE`. They describe the holiday effect, spikes and dips, and step changes inside the input time series, which are not related to the ARIMA modeling. Therefore, they are the same across all output rows, except for failed models.

The `error_message` column shows any error incurred during the auto.ARIMA fitting process. One possible cause is that the selected `non_seasonal_p`, `non_seasonal_d`, `non_seasonal_q`, and `has_drift` values are not able to stabilize the time series. To retrieve the error message of all the candidate models, set `show_all_candidate_models=true`.
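The AIC-based selection described above can be sketched in a few lines of Python. The candidate rows and their AIC values below are invented; only the ordering and selection logic mirrors how the `ML.ARIMA_EVALUATE` output is presented.

```python
# Invented candidates; each dict mimics one row of ML.ARIMA_EVALUATE output.
candidates = [
    {"non_seasonal_p": 1, "non_seasonal_d": 1, "non_seasonal_q": 2, "AIC": 2815.1},
    {"non_seasonal_p": 2, "non_seasonal_d": 1, "non_seasonal_q": 3, "AIC": 2810.4},
    {"non_seasonal_p": 0, "non_seasonal_d": 1, "non_seasonal_q": 1, "AIC": 2834.7},
]

# Order by AIC ascending, as the ML.ARIMA_EVALUATE output is ordered;
# the first row is then the model that is kept as the final model.
candidates.sort(key=lambda row: row["AIC"])
best_model = candidates[0]
```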
Step five: Inspect the coefficients of your model
The `ML.ARIMA_COEFFICIENTS` function retrieves the model coefficients of your `ARIMA_PLUS` model, `bqml_tutorial.ga_arima_model`. `ML.ARIMA_COEFFICIENTS` takes the model as its only input.

To run the `ML.ARIMA_COEFFICIENTS` query, use the following steps:

1. In the Google Cloud console, click the Compose new query button.
2. Enter the following GoogleSQL query in the Query editor text area.

   ```sql
   #standardSQL
   SELECT * FROM ML.ARIMA_COEFFICIENTS(MODEL `bqml_tutorial.ga_arima_model`)
   ```
3. Click Run.
The query takes less than a second to complete. The results should look like the following:
The results include the following columns:

- `ar_coefficients`
- `ma_coefficients`
- `intercept_or_drift`

`ar_coefficients` shows the model coefficients of the autoregressive (AR) part of the ARIMA model. Similarly, `ma_coefficients` shows the model coefficients of the moving-average (MA) part. Both are arrays, whose lengths are equal to `non_seasonal_p` and `non_seasonal_q`, respectively. From the output of `ML.ARIMA_EVALUATE`, the best model, in the top row, has a `non_seasonal_p` of 2 and a `non_seasonal_q` of 3. Therefore, `ar_coefficients` is a length-2 array and `ma_coefficients` is a length-3 array. `intercept_or_drift` is the constant term in the ARIMA model.
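To make the coefficient arrays concrete, here is a minimal Python sketch of the ARMA recurrence they parameterize: the value at each step is the intercept, plus the AR coefficients applied to the most recent values, plus the MA coefficients applied to the most recent error terms. The coefficient and data values below are invented, and this is not BigQuery ML's internal implementation.

```python
# Invented coefficients shaped like ML.ARIMA_COEFFICIENTS output for a model
# with non_seasonal_p = 2 and non_seasonal_q = 3.
ar_coefficients = [0.6, -0.2]        # autoregressive (AR) part, length p
ma_coefficients = [0.3, 0.1, -0.05]  # moving-average (MA) part, length q
intercept_or_drift = 10.0

def one_step_forecast(history, errors):
    """Point forecast of the next value; the future error term is taken as 0.

    ar_coefficients[i] weights the value i+1 steps back, and
    ma_coefficients[j] weights the error term j+1 steps back.
    """
    ar_part = sum(c * y for c, y in zip(ar_coefficients, reversed(history)))
    ma_part = sum(c * e for c, e in zip(ma_coefficients, reversed(errors)))
    return intercept_or_drift + ar_part + ma_part

# Last p observed values and last q one-step errors (invented):
forecast = one_step_forecast(history=[102.0, 101.0], errors=[1.5, -0.8, 0.4])
```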
Step six: Use your model to forecast the time series
The `ML.FORECAST` function forecasts future time series values with a prediction interval using your model, `bqml_tutorial.ga_arima_model`.

In the following GoogleSQL query, the `STRUCT(30 AS horizon, 0.8 AS confidence_level)` clause indicates that the query forecasts 30 future time points and generates a prediction interval with an 80% confidence level. `ML.FORECAST` takes the model, as well as a couple of optional arguments.

To run the `ML.FORECAST` query, use the following steps:

1. In the Google Cloud console, click the Compose new query button.
2. Enter the following GoogleSQL query in the Query editor text area.

   ```sql
   #standardSQL
   SELECT *
   FROM ML.FORECAST(MODEL `bqml_tutorial.ga_arima_model`,
                    STRUCT(30 AS horizon, 0.8 AS confidence_level))
   ```
3. Click Run.
The query takes less than a second to complete. The results should look like the following:
The results include the following columns:

- `forecast_timestamp`
- `forecast_value`
- `standard_error`
- `confidence_level`
- `prediction_interval_lower_bound`
- `prediction_interval_upper_bound`
- `confidence_interval_lower_bound` (soon to be deprecated)
- `confidence_interval_upper_bound` (soon to be deprecated)
The output rows are ordered chronologically by `forecast_timestamp`. In time series forecasting, the prediction interval, which is captured by the lower and upper bounds, is as important as the `forecast_value`. The `forecast_value` is the middle point of the prediction interval. The prediction interval depends on the `standard_error` and `confidence_level`.
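Assuming a normal error distribution, the relationship between these columns can be sketched with the Python standard library. The `forecast_value` and `standard_error` below are invented; only the interval arithmetic is the point of the sketch.

```python
from statistics import NormalDist

forecast_value = 2500.0   # invented point forecast
standard_error = 120.0    # invented standard error
confidence_level = 0.8

# z-score that leaves (1 - confidence_level) / 2 in each tail
z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
prediction_interval_lower_bound = forecast_value - z * standard_error
prediction_interval_upper_bound = forecast_value + z * standard_error

# forecast_value sits at the midpoint of the prediction interval
midpoint = (prediction_interval_lower_bound + prediction_interval_upper_bound) / 2
```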
Step seven: Explain and visualize the forecasting results
To understand how the time series is forecasted, and to visualize the forecasted time series together with the history time series and all the separate components, use the `ML.EXPLAIN_FORECAST` function. This function forecasts future time series values with a prediction interval using your model, `bqml_tutorial.ga_arima_model`, and at the same time returns all the separate components of the time series.
As with the `ML.FORECAST` function, the `STRUCT(30 AS horizon, 0.8 AS confidence_level)` clause indicates that the query forecasts 30 future time points and generates a prediction interval with 80% confidence. The `ML.EXPLAIN_FORECAST` function takes the model, as well as a couple of optional arguments.

To run the `ML.EXPLAIN_FORECAST` query, use the following steps:

1. In the Google Cloud console, click the Compose new query button.
2. Enter the following GoogleSQL query in the Query editor text area.

   ```sql
   #standardSQL
   SELECT *
   FROM ML.EXPLAIN_FORECAST(MODEL `bqml_tutorial.ga_arima_model`,
                            STRUCT(30 AS horizon, 0.8 AS confidence_level))
   ```
3. Click Run.
The query takes less than a second to complete. The results should look like the following:
The results include the following columns:

- `time_series_timestamp`
- `time_series_type`
- `time_series_data`
- `time_series_adjusted_data`
- `standard_error`
- `confidence_level`
- `prediction_interval_lower_bound`
- `prediction_interval_upper_bound`
- `trend`
- `seasonal_period_yearly`
- `seasonal_period_quarterly`
- `seasonal_period_monthly`
- `seasonal_period_weekly`
- `seasonal_period_daily`
- `holiday_effect`
- `spikes_and_dips`
- `step_changes`
- `residual`
The output rows are ordered chronologically by `time_series_timestamp`. Different components are listed as columns of the output. For more information, see the definition of `ML.EXPLAIN_FORECAST`.

After the query completes, click the Explore data button, and then click Explore with Looker Studio. Looker Studio opens in a new tab.
In the Chart panel, choose Time series chart:
In the Data panel, do the following:

- In the Date Range Dimension section, select `time_series_timestamp (Date)`.
- In the Dimension section, select `time_series_timestamp (Date)`.
- In the Metric section, remove the default metric `Record Count`, and add the following:
  - `time_series_data`
  - `prediction_interval_lower_bound`
  - `prediction_interval_upper_bound`
  - `trend`
  - `seasonal_period_weekly`
  - `step_changes`

In the Style panel, scroll down to the Missing Data option, and use Line Breaks instead of Line to Zero.
The following plot appears:
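As a rough sketch of how the component columns fit together for a history row, assuming a purely additive decomposition (the values below are invented; see the `ML.EXPLAIN_FORECAST` documentation for the exact column definitions):

```python
# One invented history row of ML.EXPLAIN_FORECAST output (additive model assumed).
row = {
    "time_series_data": 2510.0,       # observed daily total_visits
    "trend": 2400.0,
    "seasonal_period_weekly": 80.0,   # the weekly pattern found in step two
    "holiday_effect": 0.0,
    "spikes_and_dips": 0.0,
    "step_changes": 25.0,
    "residual": 5.0,
}

components = ["trend", "seasonal_period_weekly", "holiday_effect",
              "spikes_and_dips", "step_changes", "residual"]

# Under an additive decomposition, the components sum back to the observed value.
reconstructed = sum(row[name] for name in components)
```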
Step eight (optional): Visualize the forecasting results without `decompose_time_series` enabled
If `decompose_time_series` is set to false in the `ARIMA_PLUS` training, you can concatenate the history time series and the forecasted time series using the `UNION ALL` clause and the `ML.FORECAST` function.
In the following query, the SQL before the `UNION ALL` clause forms the history time series, and the SQL after the `UNION ALL` clause uses the `ML.FORECAST` function to generate the forecasted time series as well as the prediction interval. The query uses different fields for `history_value` and `forecast_value` so that they can be plotted in different colors.

To run the query, use the following steps:

1. In the Google Cloud console, click the Compose new query button.
2. Enter the following GoogleSQL query in the Query editor text area.

   ```sql
   #standardSQL
   SELECT
     history_timestamp AS timestamp,
     history_value,
     NULL AS forecast_value,
     NULL AS prediction_interval_lower_bound,
     NULL AS prediction_interval_upper_bound
   FROM (
     SELECT
       PARSE_TIMESTAMP("%Y%m%d", date) AS history_timestamp,
       SUM(totals.visits) AS history_value
     FROM
       `bigquery-public-data.google_analytics_sample.ga_sessions_*`
     GROUP BY date
     ORDER BY date ASC
   )
   UNION ALL
   SELECT
     forecast_timestamp AS timestamp,
     NULL AS history_value,
     forecast_value,
     prediction_interval_lower_bound,
     prediction_interval_upper_bound
   FROM
     ML.FORECAST(MODEL `bqml_tutorial.ga_arima_model`,
                 STRUCT(30 AS horizon, 0.8 AS confidence_level))
   ```
3. Click Run.
After the query completes, click the Explore data button, and then click Explore with Looker Studio. Looker Studio opens in a new tab. Complete the following steps in the new tab.
In the Chart panel, choose Time series chart:
In the Data panel, below the Chart panel, go to the Metric section. Add the following metrics: `history_value`, `forecast_value`, `prediction_interval_lower_bound`, and `prediction_interval_upper_bound`. Then, remove the default metric `Record Count`.

In the Style panel, scroll down to the Missing Data option, and use Line Breaks instead of Line to Zero.
After you complete these steps, the following plot appears in the left panel. The input history time series is in blue, while the forecasted series is in green. The prediction interval is the region between the lower bound series and the upper bound series.
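The `UNION ALL` padding pattern in the query above can be sketched in plain Python: each side of the union fills the other side's columns with NULL (`None`) so that history rows and forecast rows share one schema. The timestamps and values below are invented.

```python
# History rows: (timestamp, total_visits). Forecast rows:
# (timestamp, forecast_value, lower_bound, upper_bound). All values invented.
history = [("2017-07-30", 2048), ("2017-07-31", 2163)]
forecast = [("2017-08-01", 2201.5, 2100.2, 2302.8)]

# Shared schema: (timestamp, history_value, forecast_value, lower, upper)
combined = (
    [(ts, value, None, None, None) for ts, value in history]
    + [(ts, None, value, lo, hi) for ts, value, lo, hi in forecast]
)
```

Plotting tools can then draw `history_value` and `forecast_value` as separate series, which is exactly why the query keeps them in different fields.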
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (`bqml_tutorial`), and then click Delete.
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Learn how to perform multiple time-series forecasting with a single query from NYC Citi Bike trips data
- Learn how to accelerate ARIMA_PLUS to enable forecasting of 1 million time series within hours.
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.