The CREATE MODEL statement for importing XGBoost models

This document describes the CREATE MODEL statement for importing XGBoost models into BigQuery.

For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[INPUT(field_name field_type, …)
 OUTPUT(field_name field_type, …)]
OPTIONS(MODEL_TYPE = 'XGBOOST', MODEL_PATH = string_value
  [, KMS_KEY_NAME = string_value ]
);

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model doesn't exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

model_name

The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.

INPUT OUTPUT clause

The INPUT OUTPUT clause lets you specify the model input output schema information when you create the XGBoost model.

INPUT OUTPUT is optional only if feature_names and feature_types are both specified in the model file. For more information about how to store feature_names and feature_types in the XGBoost model file, see Introduction to Model IO.

field_name

Use field_name to define the name of an input feature or a model output.

If the feature_names field is populated in the XGBoost model file, the input field names must be identical to the names in the feature_names field. For more information, see the XGBoost model JSON Schema.

field_type

Use field_type to specify the data type of an input feature or a model output. Input data types must be one of the supported numeric types. The output data type must be FLOAT64.

Example

INPUT(f1 INT64, f2 FLOAT64, f3 FLOAT64)
OUTPUT(predicted_label FLOAT64)

MODEL_TYPE

Syntax

MODEL_TYPE = 'XGBOOST'

Description

Specifies the model type. This option is required.

MODEL_PATH

Syntax

MODEL_PATH = string_value

Description

Specifies the Cloud Storage URI of the XGBoost model to import. This option is required.

Arguments

A STRING value specifying the URI of a Cloud Storage bucket that contains the model to import.

BigQuery ML imports the model from Cloud Storage by using the credentials of the user who runs the CREATE MODEL statement.

Example

MODEL_PATH = 'gs://bucket/path/to/xgboost_model/*'

KMS_KEY_NAME

Syntax

KMS_KEY_NAME = string_value

Description

The Cloud Key Management Service customer-managed encryption key (CMEK) to use to encrypt the model.

Arguments

A STRING value containing the fully-qualified name of the CMEK. For example,

'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key'

Limitations

Imported XGBoost models have the following limitations:

  • The XGBoost model must already exist before it can be imported into BigQuery.
  • Models must be stored in Cloud Storage.
  • XGBoost models must be in .bst or .json format.
  • You can only use XGBoost models with the ML.PREDICT and ML.FEATURE_IMPORTANCE functions.
  • Models are limited to 250 MB in size.
  • The memory limit to run the XGBoost model is 1 GB. You can reduce the model size by using fewer trees or shallower tree depth, or by using the XGBoost library's default save_model method to save the models.
  • BigQuery XGBoost models only support numeric types as input data types and FLOAT64 as the output data type.
  • Categorical features that use XGBoost built-in categorical data support are treated as integer inputs.
  • BigQuery XGBoost models only support a single scalar or array output. Multiple outputs aren't supported.

Examples

The following examples show how to create different types of imported XGBoost models.

Import a model and specify input and output columns

The following example imports a XGBoost model into BigQuery as a BigQuery model. The example assumes the following:

  • There is an existing XGBoost model located at gs://bucket-name/xgboost-model/*.
  • The model file is in .bst format or in .json format.
  • The model file doesn't contain information about input feature_names and feature_types.
CREATE OR REPLACE
  MODEL
    `project_id.mydataset.mymodel`
      INPUT(f1 float64, f2 float64, f3 float64, f4 float64)
      OUTPUT(predicted_label float64)
  OPTIONS (
    MODEL_TYPE = 'XGBOOST',
    MODEL_PATH = 'gs://bucket-name/xgboost-model/*')

Import a model that already contains input and output columns

The following example imports a XGBoost model into BigQuery as a BigQuery model. The example assumes the following:

  • There is an existing XGBoost model located at gs://bucket-name/xgboost-model/*.
  • The model file is in .bst format or in .json format.
  • The model file contains information about input feature_names and feature_types.
CREATE OR REPLACE
  MODEL
    `project_id.mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'XGBOOST',
    MODEL_PATH = 'gs://bucket-name/xgboost-model/*')