Hierarchical time series forecasting

This tutorial teaches you how to generate a hierarchical time series. In this tutorial, you create two time series models over the same data, one of which uses hierarchical forecasting and one which does not. This allows you to compare the results returned by the models.

You use the iowa_liquor.sales.sales data to train the models in this tutorial. This dataset contains information for over 1 million liquor products in different stores using public Iowa liquor sales data.

Before following this tutorial, you should be familiar with multiple time series forecasting. Complete the Multiple time series forecasting from Google Analytics data tutorial for an introduction to this topic.

Required Permissions

  • To create the dataset, you need the bigquery.datasets.create IAM permission.
  • To create the connection resource, you need the following permissions:

    • bigquery.connections.create
    • bigquery.connections.get
  • To create the model, you need the following permissions:

    • bigquery.jobs.create
    • bigquery.models.create
    • bigquery.models.getData
    • bigquery.models.updateData
    • bigquery.connections.delegate
  • To run inference, you need the following permissions:

    • bigquery.models.getData
    • bigquery.jobs.create

For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.

Objectives

In this tutorial, you use the following:

  • The CREATE MODEL statement: to create a time series model and hierarchical time series model.
  • The ML.FORECAST function: to forecast daily total sales.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • BigQuery
  • BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

For more information about BigQuery ML costs, see BigQuery ML pricing.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the BigQuery API.

    Enable the API

  5. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  6. Make sure that billing is enabled for your Google Cloud project.

  7. Enable the BigQuery API.

    Enable the API

Create a dataset

Create a BigQuery dataset to store your ML model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

    Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

      The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.

    • Leave the remaining default settings as they are, and click Create dataset.

      Create dataset page.

Hierarchical Reconciliation

Time series forecasts can often be broken down or aggregated based on different dimensions of interest. These are known as hierarchical time series. For example, census data that reveals the total population per state can be broken down by city and zip. Conversely, we can aggregate that data for each country or continent.

There are several techniques that you can use to generate and reconcile hierarchical forecasts. Take the following example, which shows a simplified hierarchical structure for liquor sales in the state of Iowa:

Hierarchical Time Series.

The lowest level shows the store level, followed by the zip code level, city, county, and finally by state. The goal for hierarchical forecasts is to make sure that all forecasts across each level must reconcile. For example, given the earlier figure, this means that the forecasts for Clive and Des Moines must add up to the forecast in Polk. Similarly, the forecasts in Polk, Linn and Scott should add up to the forecast in Iowa.

There are a couple of common techniques that you can use to generate reconciled forecasts for each level. One technique is known as the Bottom-Up approach. In this approach, the forecasts are generated at the bottom level of the hierarchy first before summing up the other levels. Taking the earlier example, the forecasts for each store are used to build the forecasting models for the other levels (first by zip, followed by city, etc.).

Create a time series model

First, create a time series model, using the Iowa liquor sales data.

The following GoogleSQL query creates a model that forecasts daily total number of bottles sold in 2015 in Polk, Linn and Scott counties. The CREATE MODEL statement creates and trains a model named bqml_tutorial.liquor_forecast.

#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.liquor_forecast`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'date',
    TIME_SERIES_DATA_COL = 'total_bottles_sold',
    TIME_SERIES_ID_COL = ['store_number', 'zip_code', 'city', 'county'],
    HOLIDAY_REGION = 'US')
AS
SELECT
  store_number,
  zip_code,
  city,
  county,
  date,
  SUM(bottles_sold) AS total_bottles_sold
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date BETWEEN DATE('2015-01-01') AND DATE('2015-12-31')
  AND county IN ('POLK', 'LINN', 'SCOTT')
GROUP BY store_number, date, city, zip_code, county;

The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating an ARIMA-based time series model. By default, auto_arima=TRUE, so the auto.ARIMA algorithm automatically tunes the hyperparameters in ARIMA_PLUS models. The algorithm fits dozens of candidate models and chooses the best one with the lowest Akaike information criterion (AIC). Setting the holiday_region option to US allows a more accurate modeling on those United States holidays time points if there are United States holiday patterns in the time series.

Run the CREATE MODEL query to create and train your model:

  1. In the Google Cloud console, click the Run a query in BigQuery button.

  2. Enter the following GoogleSQL query in the query editor.

#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.liquor_forecast`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'date',
    TIME_SERIES_DATA_COL = 'total_bottles_sold',
    TIME_SERIES_ID_COL = ['store_number', 'zip_code', 'city', 'county'],
    HOLIDAY_REGION = 'US')
AS
SELECT
  store_number,
  zip_code,
  city,
  county,
  date,
  SUM(bottles_sold) AS total_bottles_sold
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date BETWEEN DATE('2015-01-01') AND DATE('2015-12-31')
  AND county IN ('POLK', 'LINN', 'SCOTT')
GROUP BY store_number, date, city, zip_code, county;
  1. Click Run.

    The query takes approximately 37 seconds to complete, after which your model (liquor_forecast) appears in the Explorer pane. Because the query uses a CREATE MODEL statement to create a model, there are no query results.

Inspect the results of the time series model

After creating your model, you can use the ML.FORECAST function to see the forecast results.

You run the ML.FORECAST function against the model liquor_forecast by specifying it in the FROM clause.

By default, this query returns the forecasts for all unique timeseries in the data, which are identified by the columns store_number, zip_code, city and county.

To run the ML.FORECAST query, use the following steps:

  1. In the Google Cloud console, click the Run a query in BigQuery button.

  2. Enter the following GoogleSQL query in the query editor.

#standardSQL
SELECT *
FROM
  ML.FORECAST(
    MODEL `bqml_tutorial.liquor_forecast`,
    STRUCT(20 AS horizon, 0.8 AS confidence_level))
ORDER BY store_number, county, city, zip_code, forecast_timestamp
  1. Click Run.

    The query should take around 5 seconds to execute. After the query runs, the output shows the following results.

    Time Series Example.

    Notice how the forecast for the first time series is displayed (store_number=2190, zip_code=50314, city=DES MOINES, county=POLK). If you look further at the rest of the rows, you will notice the forecasts for the other groups.

    In general, forecasts are generated for each unique time series. In order to generate forecasts for aggregated levels such as forecasts for a specific county, you need to generate a hierarchical forecast.

Create time series model with hierarchical forecasts

Create a hierarchical time series forecast, using the Iowa liquor sales data.

The following GoogleSQL query creates a model that generates hierarchical forecasts for daily total number of bottles sold in 2015 in Polk, Linn and Scott counties. The CREATE MODEL statement creates and trains a model named bqml_tutorial.liquor_forecast_hierarchical.

#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.liquor_forecast_hierarchical`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'date',
    TIME_SERIES_DATA_COL = 'total_bottles_sold',
    TIME_SERIES_ID_COL = ['store_number', 'zip_code', 'city', 'county'],
    HIERARCHICAL_TIME_SERIES_COLS = ['zip_code', 'store_number'],
    HOLIDAY_REGION = 'US')
AS
SELECT
  store_number,
  zip_code,
  city,
  county,
  date,
  SUM(bottles_sold) AS total_bottles_sold
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date BETWEEN DATE('2015-01-01') AND DATE('2015-12-31')
  AND county IN ('POLK', 'LINN', 'SCOTT')
GROUP BY store_number, date, city, zip_code, county;

The HIERARCHICAL_TIME_SERIES_COLS parameter indicates that you are creating a hierarchical forecast based on a set of columns. Each of these columns are rolled up and aggregated. For example, from the earlier query, this means that store_number is rolled up to show forecasts for each county, city and zip_code. Separately, both zip_code and store_number are also rolled up to show forecasts for each county and city. The column order is important because it defines the structure of the hierarchy.

Run the CREATE MODEL query to create and train your model:

  1. In the Google Cloud console, click the Run a query in BigQuery button.

  2. Enter the following GoogleSQL query in the query editor.

#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.liquor_forecast_hierarchical`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'date',
    TIME_SERIES_DATA_COL = 'total_bottles_sold',
    TIME_SERIES_ID_COL = ['store_number', 'zip_code', 'city', 'county'],
    HIERARCHICAL_TIME_SERIES_COLS = ['zip_code', 'store_number'],
    HOLIDAY_REGION = 'US')
AS
SELECT
  store_number,
  zip_code,
  city,
  county,
  date,
  SUM(bottles_sold) AS total_bottles_sold
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date BETWEEN DATE('2015-01-01') AND DATE('2015-12-31')
  AND county IN ('POLK', 'LINN', 'SCOTT')
GROUP BY store_number, date, city, zip_code, county;
  1. Click Run.

    The query takes approximately 45 seconds to complete, after which your model (bqml_tutorial.liquor_forecast_hierarchical) appears in the Explorer pane. Because the query uses a CREATE MODEL statement to create a model, there are no query results.

Inspect the results of the hierarchical time series model

  1. In the Google Cloud console, click the Run a query in BigQuery button.

  2. Enter the following GoogleSQL query in the query editor.

    #standardSQL
    SELECT
    *
    FROM
    ML.FORECAST(
      MODEL `bqml_tutorial.liquor_forecast_hierarchical`,
      STRUCT(30 AS horizon, 0.8 AS confidence_level))
    WHERE city = 'LECLAIRE'
    ORDER BY county, city, zip_code, store_number, forecast_timestamp
  3. Click Run.

    The query should take around 5 seconds to execute. After the query runs, the output shows the following results.

    Hierarchical Time Series Example.

    Notice how the aggregated forecast is displayed for the city of LeClaire (store_number=NULL, zip_code=NULL, city=LECLAIRE, county=SCOTT). If you look further at the rest of the rows, you will notice the forecasts for the other sub groups. For example, the following image shows the forecasts aggregated for the zip code 52753 (store_number=NULL, zip_code=52753, city=LECLAIRE, county=SCOTT):

    Hierarchical Time Series Example.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.

  4. In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Delete your project

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next