In this tutorial, you will learn how to significantly accelerate the training of a set of time-series models that perform multiple time-series forecasts with a single query. You will also learn how to evaluate forecasting accuracy.
For all steps but the last one, you will use the new_york.citibike_trips data. This data contains information about Citi Bike trips in New York City. The dataset contains only a few hundred time series and is used to illustrate various strategies to accelerate model training.
For the last step, you will use the iowa_liquor_sales.sales data to forecast more than 1 million time series.
Before reading this tutorial, you should read Perform multiple time-series forecasting with a single query from NYC Citi Bike trips data. You should also read Large-scale time series forecasting best practices.
Objectives
In this tutorial, you use the following:
- The CREATE MODEL statement: to create a time-series model or a set of time-series models.
- The ML.EVALUATE function: to evaluate the forecasting accuracy.
- The AUTO_ARIMA_MAX_ORDER, TIME_SERIES_LENGTH_FRACTION, MIN_TIME_SERIES_LENGTH, and MAX_TIME_SERIES_LENGTH training options: to significantly reduce the model training time.
For simplicity, this tutorial doesn't cover how to use ML.FORECAST
or ML.EXPLAIN_FORECAST
to generate (explainable) forecasts. To learn how to use those functions, see
Performing multiple time-series forecasting with a single query from NYC Citi Bike trips data.
Costs
This tutorial uses billable components of Google Cloud, including:
- BigQuery
- BigQuery ML
For more information about costs, see the BigQuery pricing page and the BigQuery ML pricing page.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects.
To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Step one: Create dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
For Dataset ID, enter bqml_tutorial.
For Location type, select Multi-region, and then select US (multiple regions in United States).
The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.
Leave the remaining default settings as they are, and click Create dataset.
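The same dataset can also be created with a DDL statement instead of the console. The following is a minimal sketch, assuming you run it in the query editor of the project that should own the dataset. The dataset name and location mirror the console steps above.

```sql
-- Create the tutorial dataset in the US multi-region, where the
-- public datasets used in this tutorial are stored.
-- IF NOT EXISTS makes the statement safe to run more than once.
CREATE SCHEMA IF NOT EXISTS bqml_tutorial
OPTIONS (location = 'US');
```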
Step two: Create the time series to forecast
In the following query, the FROM bigquery-public-data.new_york.citibike_trips clause indicates that you are querying the citibike_trips table in the new_york dataset.
CREATE OR REPLACE TABLE `bqml_tutorial.nyc_citibike_time_series` AS WITH input_time_series AS ( SELECT start_station_name, EXTRACT(DATE FROM starttime) AS date, COUNT(*) AS num_trips FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY start_station_name, date ) SELECT table_1.* FROM input_time_series AS table_1 INNER JOIN ( SELECT start_station_name, COUNT(*) AS num_points FROM input_time_series GROUP BY start_station_name) table_2 ON table_1.start_station_name = table_2.start_station_name WHERE num_points > 400
To run the query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.
The SELECT statement in the query uses the EXTRACT function to extract the date information from the starttime column. The query uses the COUNT(*) function to get the daily total number of Citi Bike trips.
table_1 has 679 time series. The query uses additional INNER JOIN logic to select only those time series that have more than 400 time points, resulting in a total of 383 time series.
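To confirm that the table was built as described, you can count the time series and the points per series. This is only a sanity-check sketch against the table created above; the output column names are illustrative.

```sql
-- Count how many points each station's time series has.
WITH points_per_series AS (
  SELECT
    start_station_name,
    COUNT(*) AS num_points
  FROM `bqml_tutorial.nyc_citibike_time_series`
  GROUP BY start_station_name
)
-- Summarize: number of series and the shortest and longest series.
SELECT
  COUNT(*) AS num_time_series,
  MIN(num_points) AS min_points_per_series,
  MAX(num_points) AS max_points_per_series
FROM points_per_series;
```

If the table matches the description above, num_time_series is 383 and min_points_per_series is greater than 400.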
Step three: Simultaneously forecast multiple time-series with default parameters
In this step you forecast the daily total number of trips starting from
different Citi Bike stations. To do this, you must forecast many time series.
You could write multiple CREATE MODEL queries, but that can be a tedious and time-consuming process, especially when you have a large number of time series.
To improve this process, BigQuery ML lets you create a set of time series models to forecast multiple time series using a single query. Additionally, all time series models are fit simultaneously.
In the following GoogleSQL query, the CREATE MODEL statement creates and trains a set of models named bqml_tutorial.nyc_citibike_arima_model_default.
CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_default` OPTIONS (model_type = 'ARIMA_PLUS', time_series_timestamp_col = 'date', time_series_data_col = 'num_trips', time_series_id_col = 'start_station_name' ) AS SELECT * FROM bqml_tutorial.nyc_citibike_time_series WHERE date < '2016-06-01'
To run the CREATE MODEL
query to create and train your model, use the
following steps:
In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.
The query takes about 14 minutes 25 seconds to complete.
The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating a set of ARIMA-based time-series models of type ARIMA_PLUS. In addition to time_series_timestamp_col and time_series_data_col, you must specify time_series_id_col, which is used to distinguish the different input time series.
This example leaves out the time points on or after 2016-06-01 so that those time points can be used later to evaluate the forecasting accuracy by using the ML.EVALUATE function.
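Before training, it can help to check how the date filter splits the data. The following sketch counts the rows on each side of the cutoff used in the CREATE MODEL query; the output column names are illustrative.

```sql
SELECT
  -- Rows that the CREATE MODEL query trains on.
  COUNTIF(date < '2016-06-01') AS num_training_points,
  -- Rows held out for evaluation.
  COUNTIF(date >= '2016-06-01') AS num_holdout_points,
  -- Stations that have at least one held-out point.
  COUNT(DISTINCT IF(date >= '2016-06-01', start_station_name, NULL))
    AS num_series_with_holdout
FROM `bqml_tutorial.nyc_citibike_time_series`;
```

A station with no held-out points cannot contribute to the accuracy metrics that are computed later.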
Step four: Evaluate forecasting accuracy for each time series
In this step, you evaluate the forecasting accuracy for each time series by using the following ML.EVALUATE
query.
SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_default`, TABLE `bqml_tutorial.nyc_citibike_time_series`, STRUCT(7 AS horizon, TRUE AS perform_aggregation))
To run the above query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.
This query reports several forecasting metrics, including:
- mean absolute error
- mean squared error
- mean absolute percentage error
- symmetric mean absolute percentage error
The results should look like the following:
ML.EVALUATE
takes the ARIMA_PLUS model that was trained in the previous step as its first argument.
The second argument is a data table containing the ground truth data. The model's forecasts are compared to the ground truth data to compute accuracy metrics. In this case, the nyc_citibike_time_series table contains both the time series points before 2016-06-01 and the points from 2016-06-01 onward. The points from 2016-06-01 onward are the ground truth data. The points before 2016-06-01 were used to train the model to generate forecasts from that date onward. Only the points from 2016-06-01 onward are necessary to compute the metrics. The points before 2016-06-01 are ignored in the metrics calculation.
The third argument is a STRUCT that contains two parameters. The horizon is 7, which means that the query calculates the forecasting accuracy based on the 7-point forecast. If the ground truth data has fewer than 7 points for the comparison, then the accuracy metrics are computed based on the available points only. perform_aggregation has a value of TRUE, which means that the forecasting accuracy metrics are aggregated over the forecasted time points of each time series. If you specify perform_aggregation as FALSE, forecasting accuracy is returned for each forecasted time point.
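For comparison, the following sketch runs the same evaluation with perform_aggregation set to FALSE. It uses SELECT * rather than naming output columns, because the exact per-point column names are not listed in this tutorial.

```sql
-- Same model, ground truth table, and horizon as above, but the
-- accuracy is reported for each forecasted time point instead of
-- being aggregated per time series.
SELECT *
FROM ML.EVALUATE(
  MODEL `bqml_tutorial.nyc_citibike_arima_model_default`,
  TABLE `bqml_tutorial.nyc_citibike_time_series`,
  STRUCT(7 AS horizon, FALSE AS perform_aggregation));
```

With a horizon of 7 and 383 time series, expect up to 7 rows per time series rather than one.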
Step five: Evaluate the overall forecasting accuracy for all the time series
In this step, you evaluate the forecasting accuracy across all 383 time series by using the following query:
SELECT AVG(mean_absolute_percentage_error) AS MAPE, AVG(symmetric_mean_absolute_percentage_error) AS sMAPE FROM ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_default`, TABLE `bqml_tutorial.nyc_citibike_time_series`, STRUCT(7 AS horizon, TRUE AS perform_aggregation))
Of the forecasting metrics returned by ML.EVALUATE, only mean absolute percentage error and symmetric mean absolute percentage error are independent of the scale of the time series values. Therefore, to evaluate the overall forecasting accuracy of the set of time series, only the aggregate of these two metrics is meaningful.
This query returns the following results: MAPE is 0.3471, sMAPE is 0.2563.
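The scale independence of the percentage metrics can be illustrated on made-up numbers. The sketch below is not part of the tutorial data: the two series and their forecasts are hypothetical, and the sMAPE formula shown (absolute error divided by the mean of the absolute actual and forecast values) is one common definition that is assumed here, which may differ from the variant that ML.EVALUATE uses.

```sql
-- Two hypothetical series with the same relative errors,
-- where 'large' is 'small' scaled by 100.
WITH points AS (
  SELECT *
  FROM UNNEST([
    STRUCT('small' AS series, 10.0 AS actual, 8.0 AS forecast),
    ('small', 20.0, 25.0),
    ('large', 1000.0, 800.0),
    ('large', 2000.0, 2500.0)
  ])
)
SELECT
  series,
  -- Mean absolute error: grows with the scale of the values.
  AVG(ABS(actual - forecast)) AS mae,
  -- Mean absolute percentage error: unaffected by the scale.
  AVG(ABS(actual - forecast) / ABS(actual)) AS mape,
  -- Symmetric MAPE (assumed variant): also unaffected by the scale.
  AVG(2 * ABS(actual - forecast) / (ABS(actual) + ABS(forecast))) AS smape
FROM points
GROUP BY series
ORDER BY series;
```

The mae column differs by a factor of 100 between the two series, while mape and smape match, which is why only the percentage metrics can be averaged meaningfully across stations with very different trip volumes.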
Step six: Forecast many time-series simultaneously using a smaller hyperparameter search space
In step three, you used the default values for all of the training options, including auto_arima_max_order. This option controls the search space for hyperparameter tuning in the auto.ARIMA algorithm.
In this step, you use a smaller search space for the hyperparameters.
CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2` OPTIONS (model_type = 'ARIMA_PLUS', time_series_timestamp_col = 'date', time_series_data_col = 'num_trips', time_series_id_col = 'start_station_name', auto_arima_max_order = 2 ) AS SELECT * FROM `bqml_tutorial.nyc_citibike_time_series` WHERE date < '2016-06-01'
This query reduces auto_arima_max_order
from 5 (the default value) to 2.
To run the query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.
The query takes about 1 minute 45 seconds to complete. Recall that the query takes 14 minutes 25 seconds to complete when auto_arima_max_order is 5. So setting auto_arima_max_order to 2 gives a speed gain of about 8x (865 seconds versus 105 seconds). The speed gain is larger than 5/2 = 2.5x because increasing auto_arima_max_order increases not only the number of candidate models but also their complexity, and hence the training time of each model.
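To get a feel for how the number of candidate models grows, the following sketch counts the (p, q) combinations for each setting. It assumes that auto_arima_max_order bounds the sum of the non-seasonal p and q orders. It ignores the differencing order d and the drift term, so it illustrates the growth rather than reproducing the exact search space.

```sql
-- Count the (p, q) pairs with p + q <= max_order for both settings.
SELECT
  max_order,
  COUNT(*) AS num_pq_pairs
FROM
  UNNEST([2, 5]) AS max_order,
  UNNEST(GENERATE_ARRAY(0, max_order)) AS p,
  UNNEST(GENERATE_ARRAY(0, max_order)) AS q
WHERE p + q <= max_order
GROUP BY max_order
ORDER BY max_order;
```

Under that assumption, the count is 6 for a maximum order of 2 and 21 for a maximum order of 5. The candidate count alone grows 3.5x, and the higher-order candidates are also more expensive to fit.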
Step seven: Evaluate forecast accuracy based on a smaller hyperparameter search space
SELECT AVG(mean_absolute_percentage_error) AS MAPE, AVG(symmetric_mean_absolute_percentage_error) AS sMAPE FROM ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2`, TABLE `bqml_tutorial.nyc_citibike_time_series`, STRUCT(7 AS horizon, TRUE AS perform_aggregation))
This query returns the following results: MAPE is 0.3337 and sMAPE is 0.2337.
In step five, the larger hyperparameter search space, auto_arima_max_order = 5, resulted in a MAPE of 0.3471 and an sMAPE of 0.2563.
Therefore, in this case a smaller hyperparameter search space actually gives higher forecasting accuracy. One reason is that the auto.ARIMA algorithm only performs hyperparameter tuning for the trend module of the entire modeling pipeline. The best ARIMA model selected by the auto.ARIMA algorithm might not generate the best forecasting results for the entire pipeline.
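The averages hide how individual stations are affected. The following sketch compares the two models station by station. It assumes that the aggregated ML.EVALUATE output carries the time series ID column under its original name, start_station_name; the other output column names are illustrative.

```sql
WITH default_model AS (
  -- Per-station sMAPE for the model trained with default options.
  SELECT
    start_station_name,
    symmetric_mean_absolute_percentage_error AS smape_default
  FROM ML.EVALUATE(
    MODEL `bqml_tutorial.nyc_citibike_arima_model_default`,
    TABLE `bqml_tutorial.nyc_citibike_time_series`,
    STRUCT(7 AS horizon, TRUE AS perform_aggregation))
),
max_order_2 AS (
  -- Per-station sMAPE for the model trained with auto_arima_max_order = 2.
  SELECT
    start_station_name,
    symmetric_mean_absolute_percentage_error AS smape_max_order_2
  FROM ML.EVALUATE(
    MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2`,
    TABLE `bqml_tutorial.nyc_citibike_time_series`,
    STRUCT(7 AS horizon, TRUE AS perform_aggregation))
)
-- Count the stations where the smaller search space helps or hurts.
SELECT
  COUNT(*) AS num_time_series,
  COUNTIF(smape_max_order_2 < smape_default) AS num_better_with_max_order_2,
  COUNTIF(smape_max_order_2 > smape_default) AS num_worse_with_max_order_2
FROM default_model
INNER JOIN max_order_2 USING (start_station_name);
```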
Step eight: Forecast many time-series simultaneously with a smaller hyperparameter search space and smart fast training strategies
In this step, you use both a smaller hyperparameter search space and the smart fast training strategy by using one or more of the max_time_series_length, min_time_series_length, or time_series_length_fraction training options.
While periodic modeling such as seasonality requires a certain number of time points, trend modeling requires fewer time points. Meanwhile, trend modeling is much more computationally expensive than other time series components such as seasonality. By using the fast training options above, you can efficiently model the trend component with a subset of the time series, while the other time series components use the entire time series.
This example uses max_time_series_length
to achieve fast training.
CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2_fast_training` OPTIONS (model_type = 'ARIMA_PLUS', time_series_timestamp_col = 'date', time_series_data_col = 'num_trips', time_series_id_col = 'start_station_name', auto_arima_max_order = 2, max_time_series_length = 30 ) AS SELECT * FROM `bqml_tutorial.nyc_citibike_time_series` WHERE date < '2016-06-01'
The max_time_series_length option has a value of 30, so for each of the 383 time series, only the 30 most recent time points are used to model the trend component. The entire time series is still used to model the non-trend components.
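To see which rows the 30 most recent time points correspond to, you can approximate the selection with a window function. This is only an illustration of the idea on the raw training rows. It does not reproduce any preprocessing that BigQuery ML applies internally, and the output column names are illustrative.

```sql
-- For each station, keep the 30 most recent training rows and
-- report the date range that they cover.
SELECT
  start_station_name,
  MIN(date) AS first_trend_date,
  MAX(date) AS last_trend_date,
  COUNT(*) AS num_trend_points
FROM (
  SELECT start_station_name, date
  FROM `bqml_tutorial.nyc_citibike_time_series`
  WHERE date < '2016-06-01'
  -- Rank rows from newest to oldest within each station.
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY start_station_name ORDER BY date DESC) <= 30
)
GROUP BY start_station_name
ORDER BY start_station_name;
```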
To run the query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.
The query takes about 35 seconds to complete. This is 3x faster than the training query that doesn't use the fast training strategy, which takes 1 minute 45 seconds. Because the non-training part of the query, such as data preprocessing, has a constant time overhead, the speed gain is much higher when the number of time series is much larger than in this case. For a million time series, the speed gain approaches the ratio of the time series length to the value of max_time_series_length. In that case, the speed gain is greater than 10x.
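The other fast training options can be used in place of max_time_series_length. The following is an untimed sketch, not a query from this tutorial: the model name, the fraction of 0.5, and the minimum length of 30 are hypothetical choices, and it assumes that time_series_length_fraction and min_time_series_length are used together and without max_time_series_length.

```sql
-- Hypothetical variant: use the most recent half of each time series
-- for trend modeling, but never fewer than 30 points.
CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2_fraction`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'date',
  time_series_data_col = 'num_trips',
  time_series_id_col = 'start_station_name',
  auto_arima_max_order = 2,
  time_series_length_fraction = 0.5,
  min_time_series_length = 30
) AS
SELECT *
FROM `bqml_tutorial.nyc_citibike_time_series`
WHERE date < '2016-06-01';
```

A fraction scales with the length of each series, which may suit a set of time series whose lengths vary widely better than a single fixed cap.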
Step nine: Evaluate forecasting accuracy for a model with smaller hyperparameter search space and smart fast training strategies
SELECT AVG(mean_absolute_percentage_error) AS MAPE, AVG(symmetric_mean_absolute_percentage_error) AS sMAPE FROM ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2_fast_training`, TABLE `bqml_tutorial.nyc_citibike_time_series`, STRUCT(7 AS horizon, TRUE AS perform_aggregation))
This query returns the following results: MAPE is 0.3515, and sMAPE is 0.2473.
Recall that without the fast training strategies, the forecasting accuracy results are a MAPE of 0.3337 and an sMAPE of 0.2337. The differences between the two sets of metric values are less than 0.02 (within 3 percentage points), which is statistically insignificant.
In short, you have used a smaller hyperparameter search space and smart fast training strategies to make your model training more than 20x faster without sacrificing forecasting accuracy. As mentioned earlier, with more time series, the speed gain by the smart fast training strategies can be significantly higher. Additionally, the underlying ARIMA library used by ARIMA_PLUS has been optimized to run 5x faster than before. Together, these gains enable the forecasting of millions of time series within hours.
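The speed gains can be recomputed from the training times reported in steps three, six, and eight. The following sketch does only that arithmetic; the column names are illustrative.

```sql
WITH timings AS (
  SELECT
    14 * 60 + 25 AS default_seconds,       -- step three: 14 min 25 sec
    1 * 60 + 45 AS max_order_2_seconds,    -- step six: 1 min 45 sec
    35 AS fast_training_seconds            -- step eight: 35 sec
)
SELECT
  ROUND(default_seconds / max_order_2_seconds, 1)
    AS speedup_smaller_search_space,
  ROUND(max_order_2_seconds / fast_training_seconds, 1)
    AS speedup_fast_training,
  ROUND(default_seconds / fast_training_seconds, 1)
    AS speedup_combined
FROM timings;
```

The query returns 8.2, 3.0, and 24.7, so the combined gain of both techniques on this dataset is about 25x.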
Step ten: Forecast over a million time series
In this step, you forecast liquor sales for over 1 million liquor products in different stores using the public Iowa liquor sales data.
CREATE OR REPLACE MODEL `bqml_tutorial.liquor_forecast_by_product` OPTIONS( MODEL_TYPE = 'ARIMA_PLUS', TIME_SERIES_TIMESTAMP_COL = 'date', TIME_SERIES_DATA_COL = 'total_bottles_sold', TIME_SERIES_ID_COL = ['store_number', 'item_description'], HOLIDAY_REGION = 'US', AUTO_ARIMA_MAX_ORDER = 2, MAX_TIME_SERIES_LENGTH = 30 ) AS SELECT store_number, item_description, date, SUM(bottles_sold) as total_bottles_sold FROM `bigquery-public-data.iowa_liquor_sales.sales` WHERE date BETWEEN DATE("2015-01-01") AND DATE("2021-12-31") GROUP BY store_number, item_description, date
The model training still uses a small hyperparameter search space as well as the smart fast training strategy. The query takes about 1 hour 16 minutes to complete.
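To check how many time series the training query produces, you can count the distinct combinations of the two ID columns over the same date range. This is a sketch that reuses the filter from the CREATE MODEL query above; the output column name is illustrative.

```sql
-- Each distinct (store_number, item_description) pair is one time series.
SELECT COUNT(*) AS num_time_series
FROM (
  SELECT DISTINCT store_number, item_description
  FROM `bigquery-public-data.iowa_liquor_sales.sales`
  WHERE date BETWEEN DATE("2015-01-01") AND DATE("2021-12-31")
);
```

Because the public table is updated over time, the exact count can vary, but per the description above it should exceed 1 million.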
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset to delete the dataset, the table, and all of the data.
In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.
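The dataset can also be deleted with a DDL statement. This is a minimal sketch, assuming the dataset name used in this tutorial. The deletion is permanent, so run it only when you no longer need the tables and models.

```sql
-- CASCADE removes the dataset together with the resources inside it.
-- Without it, the statement fails if the dataset is not empty.
DROP SCHEMA IF EXISTS bqml_tutorial CASCADE;
```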
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.