In this tutorial, you will learn how to create a multivariate time series model (ARIMA_PLUS_XREG) to perform time-series forecasting using the following sample tables from the epa_historical_air_quality dataset:
- epa_historical_air_quality.pm25_nonfrm_daily_summary
- epa_historical_air_quality.wind_daily_summary
- epa_historical_air_quality.temperature_daily_summary
The epa_historical_air_quality dataset contains daily PM 2.5, temperature, and wind speed information collected from multiple US cities.
Objectives
In this tutorial, you use the following:
- The CREATE MODEL statement: to create a time series model.
- The ML.ARIMA_EVALUATE function: to inspect the ARIMA-related evaluation information in the model.
- The ML.ARIMA_COEFFICIENTS function: to inspect the model coefficients.
- The ML.FORECAST function: to forecast daily PM 2.5.
- The ML.EVALUATE function: to evaluate the model against actual data.
- The ML.EXPLAIN_FORECAST function: to retrieve various components of the time series (such as seasonality, trend, and feature attributions) that you can use to explain the forecast results.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
For more information about BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Step one: Create your dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
For Dataset ID, enter bqml_tutorial.
For Location type, select Multi-region, and then select US (multiple regions in United States). The public datasets are stored in the US multi-region; for simplicity, store your dataset in the same location.
Leave the remaining default settings as they are, and click Create dataset.
Step two: Create a time series table with extra features
The PM2.5, temperature, and wind speed data are in separate tables.
To simplify the following queries, create a new table, bqml_tutorial.seattle_air_quality_daily, by joining those tables. The new table has the following columns:
- date: the date of the observation
- pm25: the average PM2.5 value for each day
- wind_speed: the average wind speed for each day
- temperature: the highest temperature for each day
The new table has daily data from 2009-08-11 to 2022-01-31.
In the following GoogleSQL query, the FROM bigquery-public-data.epa_historical_air_quality.*_daily_summary clause indicates that you are querying the *_daily_summary tables in the epa_historical_air_quality dataset. These tables are partitioned tables.
#standardSQL
CREATE TABLE `bqml_tutorial.seattle_air_quality_daily` AS
WITH
  pm25_daily AS (
    SELECT
      AVG(arithmetic_mean) AS pm25,
      date_local AS date
    FROM `bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary`
    WHERE
      city_name = 'Seattle'
      AND parameter_name = 'Acceptable PM2.5 AQI & Speciation Mass'
    GROUP BY date_local
  ),
  wind_speed_daily AS (
    SELECT
      AVG(arithmetic_mean) AS wind_speed,
      date_local AS date
    FROM `bigquery-public-data.epa_historical_air_quality.wind_daily_summary`
    WHERE
      city_name = 'Seattle'
      AND parameter_name = 'Wind Speed - Resultant'
    GROUP BY date_local
  ),
  temperature_daily AS (
    SELECT
      AVG(first_max_value) AS temperature,
      date_local AS date
    FROM `bigquery-public-data.epa_historical_air_quality.temperature_daily_summary`
    WHERE
      city_name = 'Seattle'
      AND parameter_name = 'Outdoor Temperature'
    GROUP BY date_local
  )
SELECT
  pm25_daily.date AS date,
  pm25,
  wind_speed,
  temperature
FROM pm25_daily
JOIN wind_speed_daily USING (date)
JOIN temperature_daily USING (date)
To run the query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.
Step three (optional): Visualize the time series you want to forecast
Before creating the model, it is useful to see what your input time series looks like. You can do this by using Looker Studio.
In the following GoogleSQL query, the FROM bqml_tutorial.seattle_air_quality_daily clause indicates that you are querying the seattle_air_quality_daily table in the bqml_tutorial dataset you just created.
#standardSQL
SELECT * FROM `bqml_tutorial.seattle_air_quality_daily`
To run the query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.
After this query runs, the output is similar to the following screenshot. In the screenshot, you can see that this time series has 3,960 data points. Click the Explore data button, and then click Explore with Looker Studio. Looker Studio opens in a new tab. Complete the following steps in the new tab.
In the Chart panel, choose Time series chart:
In the SETUP panel, below the Chart panel, go to the Metric section. Add the pm25, temperature, and wind_speed fields, and then remove the default metric Record Count. You can also set a custom date range, for example, January 1, 2019 to December 31, 2021, to make the time series shorter. This is shown in the following figure.
After you complete these steps, the following plot appears. The plot shows that the input time series has a weekly seasonal pattern.
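If you prefer to check for weekly seasonality without a chart, the autocorrelation at lag 7 is a simple signal: in a daily series with a weekly pattern, values seven days apart are highly correlated. The following sketch is independent of BigQuery ML and uses a synthetic series purely for illustration:

```python
# Check for weekly seasonality via autocorrelation at lag 7.
# The series below is synthetic, for illustration only.
import math

def autocorr(series, lag):
    # Sample autocorrelation of `series` at the given lag.
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var

# Synthetic daily series with a period-7 (weekly) cycle
daily = [10 + 3 * math.sin(2 * math.pi * t / 7) for t in range(140)]
print(round(autocorr(daily, 7), 3))   # high value indicates a weekly pattern
```

For a real series such as the Seattle pm25 data, you would pass the daily values in date order instead of the synthetic list.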
Step four: Create your time series model
Next, create a time series model using the above air quality data.
The following GoogleSQL query creates a model used to forecast pm25. The CREATE MODEL clause creates and trains a model named bqml_tutorial.seattle_pm25_xreg_model.
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.seattle_pm25_xreg_model`
OPTIONS (
  MODEL_TYPE = 'ARIMA_PLUS_XREG',
  time_series_timestamp_col = 'date',
  time_series_data_col = 'pm25')
AS
SELECT
  date,
  pm25,
  temperature,
  wind_speed
FROM `bqml_tutorial.seattle_air_quality_daily`
WHERE date BETWEEN DATE('2012-01-01') AND DATE('2020-12-31')
The OPTIONS(model_type='ARIMA_PLUS_XREG', time_series_timestamp_col='date', ...) clause indicates that you are creating an ARIMA model with external regressors. By default, auto_arima=TRUE, so the auto.ARIMA algorithm automatically tunes the hyperparameters in ARIMA_PLUS_XREG models. The algorithm fits dozens of candidate models and chooses the best one, that is, the one with the lowest Akaike information criterion (AIC).
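As a rough illustration of this selection criterion (not BigQuery ML's internal code), the AIC of each candidate can be computed from its log-likelihood and parameter count, and the candidate with the lowest AIC wins. The log-likelihood values below are made up:

```python
# AIC from its standard definition: AIC = 2k - 2 * log_likelihood,
# where k is the number of estimated parameters. Lower is better:
# it rewards fit but penalizes model complexity.
def aic(num_params, log_likelihood):
    return 2 * num_params - 2 * log_likelihood

# (p, d, q, has_drift) -> hypothetical log-likelihood of each fitted candidate
candidates = {
    (0, 1, 5, False): -9070.0,
    (2, 1, 2, False): -9072.5,
    (1, 1, 0, True):  -9090.1,
}

# Score each candidate (counting p + q + drift as the parameters here)
# and pick the one with the lowest AIC.
scored = sorted(
    (aic(p + q + int(drift), ll), (p, d, q, drift))
    for (p, d, q, drift), ll in candidates.items()
)
best_aic, best_model = scored[0]
print(best_model, best_aic)
```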
Additionally, because the default is data_frequency='AUTO_FREQUENCY', the training process automatically infers the data frequency of the input time series.
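One plausible way to infer data frequency, sketched here purely for intuition (this is not BigQuery ML's actual implementation), is to take the most common gap between consecutive timestamps:

```python
# Infer a series' frequency as the most common gap between consecutive
# timestamps. Illustration only, not BigQuery ML's actual algorithm.
from collections import Counter
from datetime import date, timedelta

# Daily timestamps with one missing day (day 4), as real data often has
timestamps = [date(2022, 1, 1) + timedelta(days=d) for d in (0, 1, 2, 3, 5, 6)]

gaps = Counter(b - a for a, b in zip(timestamps, timestamps[1:]))
most_common_gap, _ = gaps.most_common(1)[0]
print(most_common_gap)   # a one-day gap implies daily frequency
```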
Run the CREATE MODEL query to create and train your model:
In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.
The query takes about 20 seconds to complete, after which your model (seattle_pm25_xreg_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.
Step five: Inspect the evaluation metrics of all evaluated models
After creating your model, you can use the ML.ARIMA_EVALUATE function to see the evaluation metrics of all the candidate models evaluated during the process of automatic hyperparameter tuning.
In the following GoogleSQL query, the FROM clause uses the ML.ARIMA_EVALUATE function against your model, bqml_tutorial.seattle_pm25_xreg_model. By default, this query returns the evaluation metrics of all the candidate models.
To run the ML.ARIMA_EVALUATE query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
SELECT *
FROM ML.ARIMA_EVALUATE(MODEL `bqml_tutorial.seattle_pm25_xreg_model`)
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look similar to the following screenshot:
The results include the following columns:
non_seasonal_p
non_seasonal_d
non_seasonal_q
has_drift
log_likelihood
AIC
variance
seasonal_periods
has_holiday_effect
has_spikes_and_dips
has_step_changes
error_message
The first four columns (non_seasonal_{p,d,q} and has_drift) define an ARIMA model in the training pipeline. The three metrics after them (log_likelihood, AIC, and variance) are relevant to the ARIMA model fitting process.
The auto.ARIMA algorithm first uses the KPSS test to decide that the best value for non_seasonal_d is 1. When non_seasonal_d is 1, auto.ARIMA trains 42 different candidate ARIMA models in parallel; when non_seasonal_d is not 1, it trains 21 different candidate models. In this example, all 42 candidate models are valid, so the output contains 42 rows, each associated with a candidate ARIMA model. For some time series, some candidate models are invalid because they are either non-invertible or non-stationary; those models are excluded from the output, which then has fewer than 42 rows. The candidate models are ordered by AIC in ascending order. The model in the first row has the lowest AIC and is considered the best model. The best model is saved as the final model and is used when you call ML.FORECAST, ML.EVALUATE, and ML.ARIMA_COEFFICIENTS, as shown in the following steps.
The seasonal_periods column describes the seasonal pattern inside the input time series. It has nothing to do with the ARIMA modeling, so it has the same value across all output rows. It reports a weekly pattern, which matches our expectation from the visualization in step three above.
The has_holiday_effect, has_spikes_and_dips, and has_step_changes columns are only populated when decompose_time_series=TRUE. They describe the holiday effect, spikes and dips, and step changes inside the input time series, which are not related to the ARIMA modeling. Therefore, they are the same across all output rows, except for failed models.
The error_message column shows any error incurred during the auto.ARIMA fitting process. One possible cause is that the selected non_seasonal_p, non_seasonal_d, non_seasonal_q, and has_drift values are not able to stabilize the time series. To retrieve the error messages of all the candidate models, set show_all_candidate_models=true.
Step six: Inspect the coefficients of your model
The ML.ARIMA_COEFFICIENTS function retrieves the model coefficients of your ARIMA_PLUS_XREG model, bqml_tutorial.seattle_pm25_xreg_model. ML.ARIMA_COEFFICIENTS takes the model as its only input.
Run the ML.ARIMA_COEFFICIENTS query:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
SELECT *
FROM ML.ARIMA_COEFFICIENTS(MODEL `bqml_tutorial.seattle_pm25_xreg_model`)
Click Run.
The results should look like the following:
The results include the following columns:
ar_coefficients
ma_coefficients
intercept_or_drift
processed_input
weight
category_weights.category
category_weights.weight
ar_coefficients shows the model coefficients of the autoregressive (AR) part of the ARIMA model. Similarly, ma_coefficients shows the model coefficients of the moving-average (MA) part. Both are arrays, whose lengths are equal to non_seasonal_p and non_seasonal_q, respectively. From the output of ML.ARIMA_EVALUATE, the best model (in the top row) has a non_seasonal_p of 0 and a non_seasonal_q of 5. Therefore, ar_coefficients is an empty array and ma_coefficients is a length-5 array. intercept_or_drift is the constant term in the ARIMA model.
processed_input and the corresponding weight and category_weights columns show the weights for each feature and the intercept in the linear regression model. If the feature is numerical, the weight is in the weight column. If the feature is categorical, category_weights is an ARRAY of STRUCT, where each STRUCT contains the name and weight of a category.
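To build intuition for how the MA coefficients enter a forecast, the sketch below combines a hypothetical intercept_or_drift value with made-up MA(5) coefficients and recent forecast errors. It illustrates the standard moving-average formula, not the model's exact computation:

```python
# Standard MA(q) contribution to a forecast:
#   y_hat = c + sum(theta_i * epsilon_{t-i} for i in 1..q)
# All numbers below are hypothetical, not outputs of the tutorial's model.
c = 10.0                                        # intercept_or_drift
ma_coefficients = [0.4, 0.2, 0.1, 0.05, 0.02]   # length non_seasonal_q = 5
recent_errors = [1.2, -0.8, 0.5, 0.3, -0.1]     # last 5 one-step forecast errors

# Weighted sum of recent forecast errors
ma_part = sum(theta * eps for theta, eps in zip(ma_coefficients, recent_errors))
forecast = c + ma_part
print(round(forecast, 3))
```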
Step seven: Use your model to forecast the time series
The ML.FORECAST function forecasts future time series values with a prediction interval, using your model bqml_tutorial.seattle_pm25_xreg_model and future feature values.
In the following GoogleSQL query, the STRUCT(30 AS horizon, 0.8 AS confidence_level) clause indicates that the query forecasts 30 future time points and generates a prediction interval with an 80% confidence level. ML.FORECAST takes the model, the future feature values, and a couple of optional arguments.
To run the ML.FORECAST query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
SELECT *
FROM ML.FORECAST(
  MODEL `bqml_tutorial.seattle_pm25_xreg_model`,
  STRUCT(30 AS horizon, 0.8 AS confidence_level),
  (
    SELECT
      date,
      temperature,
      wind_speed
    FROM `bqml_tutorial.seattle_air_quality_daily`
    WHERE date > DATE('2020-12-31')
  ))
Click Run.
The results should look like the following:
The results include the following columns:
forecast_timestamp
forecast_value
standard_error
confidence_level
prediction_interval_lower_bound
prediction_interval_upper_bound
The output rows are ordered chronologically by forecast_timestamp. In time series forecasting, the prediction interval, which is captured by the lower and upper bounds, is as important as the forecast_value. The forecast_value is the midpoint of the prediction interval. The prediction interval depends on the standard_error and confidence_level.
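Assuming normally distributed forecast errors, a symmetric prediction interval can be derived from the point forecast, standard error, and confidence level as sketched below. This illustrates the relationship between the columns, not BigQuery ML's exact internals:

```python
# Derive a symmetric two-sided prediction interval from a point forecast,
# its standard error, and a confidence level, assuming normal errors.
from statistics import NormalDist

def prediction_interval(forecast_value, standard_error, confidence_level):
    # For a two-sided interval, z is the (1 + confidence_level) / 2 quantile
    # of the standard normal distribution.
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    half_width = z * standard_error
    return forecast_value - half_width, forecast_value + half_width

# Hypothetical forecast of 8.5 with standard error 2.0 at 80% confidence
lower, upper = prediction_interval(8.5, 2.0, 0.8)
print(round(lower, 3), round(upper, 3))
```

Note that the forecast value sits exactly at the midpoint of the interval, matching the description above.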
Step eight: Evaluate forecasting accuracy with actual data
To evaluate the forecasting accuracy against actual data, you can use the ML.EVALUATE function with your model, bqml_tutorial.seattle_pm25_xreg_model, and the actual data table.
To run the ML.EVALUATE query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
SELECT *
FROM ML.EVALUATE(
  MODEL `bqml_tutorial.seattle_pm25_xreg_model`,
  (
    SELECT
      date,
      pm25,
      temperature,
      wind_speed
    FROM `bqml_tutorial.seattle_air_quality_daily`
    WHERE date > DATE('2020-12-31')
  ),
  STRUCT(
    TRUE AS perform_aggregation,
    30 AS horizon))
The second argument is the actual data together with the future feature values; the features are used to forecast values that are then compared with the actual data. The third argument is a STRUCT of options for this function.
Click Run.
The results should look like the following:
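The aggregate accuracy metrics reported in such results can be understood through their standard definitions. The sketch below computes mean absolute error, mean squared error, and mean absolute percentage error from hypothetical forecasts and actuals:

```python
# Standard forecasting-accuracy metrics computed from hypothetical
# forecasts and actuals (not the tutorial's real output values).
actual   = [10.0, 12.0, 9.0, 11.0]
forecast = [ 9.0, 13.0, 9.5, 10.0]

n = len(actual)
errors = [f - a for f, a in zip(forecast, actual)]

mae  = sum(abs(e) for e in errors) / n                       # mean absolute error
mse  = sum(e * e for e in errors) / n                        # mean squared error
mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n  # mean abs. % error

print(round(mae, 4), round(mse, 4), round(mape, 4))
```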
Step nine: Explain the forecasting results
To understand how the time series is forecast, you can use the ML.EXPLAIN_FORECAST function. It forecasts future time series values with a prediction interval using your model, bqml_tutorial.seattle_pm25_xreg_model, and at the same time returns all the separate components of the time series.
Like the ML.FORECAST function, the STRUCT(30 AS horizon, 0.8 AS confidence_level) clause indicates that the query forecasts 30 future time points and generates a prediction interval with an 80% confidence level. The ML.EXPLAIN_FORECAST function takes the model, the future feature values, and a couple of optional arguments as input.
To run the ML.EXPLAIN_FORECAST query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
SELECT *
FROM ML.EXPLAIN_FORECAST(
  MODEL `bqml_tutorial.seattle_pm25_xreg_model`,
  STRUCT(30 AS horizon, 0.8 AS confidence_level),
  (
    SELECT
      date,
      temperature,
      wind_speed
    FROM `bqml_tutorial.seattle_air_quality_daily`
    WHERE date > DATE('2020-12-31')
  ))
Click Run.
The query takes less than a second to complete. The results should look like the following:
The results include the following columns:
time_series_timestamp
time_series_type
time_series_data
time_series_adjusted_data
standard_error
confidence_level
prediction_interval_lower_bound
prediction_interval_upper_bound
trend
seasonal_period_yearly
seasonal_period_quarterly
seasonal_period_monthly
seasonal_period_weekly
seasonal_period_daily
holiday_effect
spikes_and_dips
step_changes
residual
attribution_temperature
attribution_wind_speed
attribution___INTERCEPT__
The output rows are ordered chronologically by time_series_timestamp. Different components are listed as columns of the output. For more information, see ML.EXPLAIN_FORECAST.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial), and then click Delete.
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Learn how to perform multiple time-series forecasting with a single query from NYC Citi Bike trips data
- Learn how to accelerate ARIMA_PLUS to enable forecasting of 1 million time series within hours.
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.