Hierarchical time series forecasting
This tutorial teaches you how to generate a hierarchical time series forecast. You create two time series models over the same data: one that uses hierarchical forecasting and one that does not. This lets you compare the results returned by the two models.
You use the bigquery-public-data.iowa_liquor_sales.sales data to train the models in this tutorial. This public dataset contains Iowa liquor sales information for more than 1 million liquor products across different stores.
Before following this tutorial, you should be familiar with multiple time series forecasting. Complete the Multiple time series forecasting from Google Analytics data tutorial for an introduction to this topic.
Required permissions
- To create the dataset, you need the bigquery.datasets.create IAM permission.
- To create the connection resource, you need the following permissions:
  - bigquery.connections.create
  - bigquery.connections.get
- To create the model, you need the following permissions:
  - bigquery.jobs.create
  - bigquery.models.create
  - bigquery.models.getData
  - bigquery.models.updateData
  - bigquery.connections.delegate
- To run inference, you need the following permissions:
  - bigquery.models.getData
  - bigquery.jobs.create
For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.
Objectives
In this tutorial, you use the following:
- The CREATE MODEL statement: to create a time series model and a hierarchical time series model.
- The ML.FORECAST function: to forecast daily total sales.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
For more information about BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the BigQuery API.
Create a dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
For Dataset ID, enter bqml_tutorial.
For Location type, select Multi-region, and then select US (multiple regions in United States). The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.
Leave the remaining default settings as they are, and click Create dataset.
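If you prefer SQL, you can create the dataset with a DDL statement in the query editor instead. The following is a minimal sketch; it assumes you want the same bqml_tutorial dataset in the US multi-region:

#standardSQL
-- Create the bqml_tutorial dataset in the US multi-region, the same
-- location as the public Iowa liquor sales data.
CREATE SCHEMA IF NOT EXISTS bqml_tutorial
  OPTIONS (location = 'US');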
Hierarchical reconciliation
Time series forecasts can often be broken down or aggregated along different dimensions of interest. These are known as hierarchical time series. For example, census data that reports the total population per state can be broken down by city and ZIP code. Conversely, you can aggregate that data by country or continent.
There are several techniques that you can use to generate and reconcile hierarchical forecasts. Take the following example, which shows a simplified hierarchical structure for liquor sales in the state of Iowa:
The lowest level is the store level, followed by the ZIP code, city, and county levels, and finally the state level. The goal of hierarchical forecasting is to ensure that the forecasts reconcile across every level of the hierarchy. In the earlier figure, this means that the forecasts for Clive and Des Moines must add up to the forecast for Polk, and the forecasts for Polk, Linn, and Scott must add up to the forecast for Iowa.
There are several common techniques that you can use to generate reconciled forecasts for each level. One of them is the bottom-up approach, in which forecasts are generated at the lowest level of the hierarchy and then summed to produce the forecasts for each higher level. In the earlier example, the store-level forecasts are aggregated to produce the forecasts for the other levels (first by ZIP code, then by city, and so on).
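As an illustration of the bottom-up idea in GoogleSQL, the following sketch sums hypothetical store-level forecasts up to the ZIP code level, and then sums those up to the city level. The store_level_forecasts table and its columns are illustrative only and are not part of this tutorial's data:

#standardSQL
-- Hypothetical bottom-up rollup: sum the lowest-level (store) forecasts
-- to produce reconciled ZIP-code-level and city-level forecasts.
WITH zip_level AS (
  SELECT
    city,
    zip_code,
    forecast_date,
    SUM(forecast_value) AS forecast_value
  FROM
    store_level_forecasts  -- hypothetical table of per-store forecasts
  GROUP BY
    city, zip_code, forecast_date
)
SELECT
  city,
  forecast_date,
  SUM(forecast_value) AS forecast_value
FROM
  zip_level
GROUP BY
  city, forecast_date;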
Create a time series model
First, create a time series model using the Iowa liquor sales data.
The following GoogleSQL query creates a model that forecasts the daily total number of bottles sold in 2015 in Polk, Linn, and Scott counties. The CREATE MODEL statement creates and trains a model named bqml_tutorial.liquor_forecast.
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.liquor_forecast`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'date',
    TIME_SERIES_DATA_COL = 'total_bottles_sold',
    TIME_SERIES_ID_COL = ['store_number', 'zip_code', 'city', 'county'],
    HOLIDAY_REGION = 'US')
AS
SELECT
  store_number,
  zip_code,
  city,
  county,
  date,
  SUM(bottles_sold) AS total_bottles_sold
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date BETWEEN DATE('2015-01-01') AND DATE('2015-12-31')
  AND county IN ('POLK', 'LINN', 'SCOTT')
GROUP BY
  store_number, date, city, zip_code, county;
The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating an ARIMA-based time series model. By default, auto_arima=TRUE, so the auto.ARIMA algorithm automatically tunes the hyperparameters in ARIMA_PLUS models. The algorithm fits dozens of candidate models and chooses the one with the lowest Akaike information criterion (AIC). Setting the holiday_region option to US enables more accurate modeling of United States holiday time points when the time series contains United States holiday patterns.
Run the CREATE MODEL query to create and train your model:
In the Google Cloud console, click the Run a query in BigQuery button.
Enter the following GoogleSQL query in the query editor.
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.liquor_forecast`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'date',
    TIME_SERIES_DATA_COL = 'total_bottles_sold',
    TIME_SERIES_ID_COL = ['store_number', 'zip_code', 'city', 'county'],
    HOLIDAY_REGION = 'US')
AS
SELECT
  store_number,
  zip_code,
  city,
  county,
  date,
  SUM(bottles_sold) AS total_bottles_sold
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date BETWEEN DATE('2015-01-01') AND DATE('2015-12-31')
  AND county IN ('POLK', 'LINN', 'SCOTT')
GROUP BY
  store_number, date, city, zip_code, county;
Click Run.
The query takes approximately 37 seconds to complete, after which your model (liquor_forecast) appears in the Explorer pane. Because the query uses a CREATE MODEL statement to create a model, there are no query results.
Inspect the results of the time series model
After creating your model, you can use the ML.FORECAST function to see the forecast results. You run the ML.FORECAST function against the liquor_forecast model by specifying the model in the FROM clause.
By default, this query returns forecasts for every unique time series in the data, which are identified by the store_number, zip_code, city, and county columns.
To run the ML.FORECAST query, use the following steps:
In the Google Cloud console, click the Run a query in BigQuery button.
Enter the following GoogleSQL query in the query editor.
#standardSQL
SELECT
  *
FROM
  ML.FORECAST(
    MODEL `bqml_tutorial.liquor_forecast`,
    STRUCT(20 AS horizon, 0.8 AS confidence_level))
ORDER BY
  store_number, county, city, zip_code, forecast_timestamp
Click Run.
The query should take around 5 seconds to execute. After the query runs, the output shows the following results.
Notice how the forecast for the first time series is displayed (store_number=2190, zip_code=50314, city=DES MOINES, county=POLK). If you look at the rest of the rows, you can see the forecasts for the other groups.
In general, forecasts are generated for each unique time series. To generate forecasts for aggregated levels, such as the forecast for a specific county, you need to generate a hierarchical forecast.
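For example, to approximate a county-level forecast with the liquor_forecast model, you would have to sum its per-store forecasts yourself. The following is a minimal sketch of that manual rollup; the hierarchical model that you create in the next section returns aggregated forecasts directly instead:

#standardSQL
-- Manually roll up the per-store forecasts to the county level.
-- The hierarchical model created in the next section does this for you.
SELECT
  county,
  forecast_timestamp,
  SUM(forecast_value) AS total_forecast_value
FROM
  ML.FORECAST(
    MODEL `bqml_tutorial.liquor_forecast`,
    STRUCT(20 AS horizon, 0.8 AS confidence_level))
GROUP BY
  county,
  forecast_timestamp
ORDER BY
  county, forecast_timestamp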
Create a time series model with hierarchical forecasts
Create a hierarchical time series forecast using the Iowa liquor sales data.
The following GoogleSQL query creates a model that generates hierarchical forecasts for the daily total number of bottles sold in 2015 in Polk, Linn, and Scott counties. The CREATE MODEL statement creates and trains a model named bqml_tutorial.liquor_forecast_hierarchical.
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.liquor_forecast_hierarchical`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'date',
    TIME_SERIES_DATA_COL = 'total_bottles_sold',
    TIME_SERIES_ID_COL = ['store_number', 'zip_code', 'city', 'county'],
    HIERARCHICAL_TIME_SERIES_COLS = ['zip_code', 'store_number'],
    HOLIDAY_REGION = 'US')
AS
SELECT
  store_number,
  zip_code,
  city,
  county,
  date,
  SUM(bottles_sold) AS total_bottles_sold
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date BETWEEN DATE('2015-01-01') AND DATE('2015-12-31')
  AND county IN ('POLK', 'LINN', 'SCOTT')
GROUP BY
  store_number, date, city, zip_code, county;
The HIERARCHICAL_TIME_SERIES_COLS parameter indicates that you are creating a hierarchical forecast based on a set of columns. Each of these columns is rolled up and aggregated. For example, in the earlier query, store_number is rolled up to show forecasts for each county, city, and zip_code. Separately, both zip_code and store_number are rolled up to show forecasts for each county and city. The column order is important because it defines the structure of the hierarchy.
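After you create the model in the following steps, the rolled-up forecasts appear in the ML.FORECAST output as additional rows in which the aggregated ID columns are NULL. As a minimal sketch, a query like the following would return only the city-level aggregated forecasts, in which both store_number and zip_code have been rolled up:

#standardSQL
-- Sketch: return only the city-level aggregated forecasts, in which
-- store_number and zip_code have been rolled up (and are therefore NULL).
SELECT
  county,
  city,
  forecast_timestamp,
  forecast_value
FROM
  ML.FORECAST(
    MODEL `bqml_tutorial.liquor_forecast_hierarchical`,
    STRUCT(30 AS horizon, 0.8 AS confidence_level))
WHERE
  store_number IS NULL
  AND zip_code IS NULL
ORDER BY
  county, city, forecast_timestamp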
Run the CREATE MODEL query to create and train your model:
In the Google Cloud console, click the Run a query in BigQuery button.
Enter the following GoogleSQL query in the query editor.
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.liquor_forecast_hierarchical`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'date',
    TIME_SERIES_DATA_COL = 'total_bottles_sold',
    TIME_SERIES_ID_COL = ['store_number', 'zip_code', 'city', 'county'],
    HIERARCHICAL_TIME_SERIES_COLS = ['zip_code', 'store_number'],
    HOLIDAY_REGION = 'US')
AS
SELECT
  store_number,
  zip_code,
  city,
  county,
  date,
  SUM(bottles_sold) AS total_bottles_sold
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date BETWEEN DATE('2015-01-01') AND DATE('2015-12-31')
  AND county IN ('POLK', 'LINN', 'SCOTT')
GROUP BY
  store_number, date, city, zip_code, county;
Click Run.
The query takes approximately 45 seconds to complete, after which your model (bqml_tutorial.liquor_forecast_hierarchical) appears in the Explorer pane. Because the query uses a CREATE MODEL statement to create a model, there are no query results.
Inspect the results of the hierarchical time series model
In the Google Cloud console, click the Run a query in BigQuery button.
Enter the following GoogleSQL query in the query editor.
#standardSQL
SELECT
  *
FROM
  ML.FORECAST(
    MODEL `bqml_tutorial.liquor_forecast_hierarchical`,
    STRUCT(30 AS horizon, 0.8 AS confidence_level))
WHERE
  city = 'LECLAIRE'
ORDER BY
  county, city, zip_code, store_number, forecast_timestamp
Click Run.
The query should take around 5 seconds to execute. After the query runs, the output shows the following results.
Notice how the aggregated forecast is displayed for the city of LeClaire (store_number=NULL, zip_code=NULL, city=LECLAIRE, county=SCOTT). If you look at the rest of the rows, you can see the forecasts for the other subgroups. For example, the following image shows the forecasts aggregated for the ZIP code 52753 (store_number=NULL, zip_code=52753, city=LECLAIRE, county=SCOTT):
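To confirm that the forecasts reconcile, you can compare an aggregated row with the sum of its children. The following is a minimal sketch that sums the store-level forecasts for ZIP code 52753 and joins them to the corresponding ZIP-code-level rows; it assumes that zip_code is a STRING column, as in the public dataset, and the two values should match apart from floating-point rounding:

#standardSQL
-- Compare the ZIP-code-level aggregated forecast for ZIP code 52753
-- with the sum of the store-level forecasts in that ZIP code.
WITH forecasts AS (
  SELECT
    *
  FROM
    ML.FORECAST(
      MODEL `bqml_tutorial.liquor_forecast_hierarchical`,
      STRUCT(30 AS horizon, 0.8 AS confidence_level))
  WHERE
    zip_code = '52753'  -- assumes zip_code is a STRING column
)
SELECT
  agg.forecast_timestamp,
  agg.forecast_value AS zip_level_forecast,
  SUM(store.forecast_value) AS sum_of_store_forecasts
FROM
  forecasts AS agg
JOIN
  forecasts AS store
  ON agg.forecast_timestamp = store.forecast_timestamp
WHERE
  agg.store_number IS NULL
  AND store.store_number IS NOT NULL
GROUP BY
  agg.forecast_timestamp,
  agg.forecast_value
ORDER BY
  agg.forecast_timestamp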
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial), and then click Delete.
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Learn how to perform multiple time-series forecasting with a single query from NYC Citi Bike trips data.
- Learn how to accelerate ARIMA_PLUS so that you can forecast 1 million time series within hours.
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.