The ML.DESCRIBE_DATA function

This document describes the ML.DESCRIBE_DATA function, which you can use to generate descriptive statistics for the columns in a table or subquery. For example, you might want to know statistics for a table of training or serving data that you plan to use with a machine learning (ML) model. You can use the data output by this function for such purposes as feature preprocessing or model monitoring.

Syntax

ML.DESCRIBE_DATA(
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [num_quantiles AS num_quantiles]
    [, num_array_length_quantiles AS num_array_length_quantiles]
    [, top_k AS top_k])
)

Arguments

ML.DESCRIBE_DATA takes the following arguments:

  • project_id: your project ID.
  • dataset: the BigQuery dataset that contains the table.
  • table: the name of the input table that contains the training or serving data to calculate statistics for.
  • query_statement: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the query_statement clause, see GoogleSQL query syntax.
  • num_quantiles: an INT64 value that specifies the number of quantiles to return for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. This affects the number of results shown in the quantiles output column. These quantiles describe the distribution of the data in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. The num_quantiles value must be in the range [1, 100,000]. The default value is 2.
  • num_array_length_quantiles: an INT64 value that specifies the number of quantiles to return for ARRAY columns. This affects the number of results shown in the array_length_quantiles output column. These quantiles describe the distribution of the length of the arrays in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. The num_array_length_quantiles value must be in the range [1, 100,000]. The default value is 10.
  • top_k: an INT64 value that specifies the number of top values to return for categorical and ARRAY<categorical> columns. This affects the number of results shown in the top_values output column. The top values are the values that are shown most frequently in the column. The top_k value must be in the range [1, 10,000]. The default value is 1.

Details

ML.DESCRIBE_DATA handles input columns as follows:

  • ARRAY columns are unnested before statistics are computed on them.
  • ARRAY<STRUCT<INT64, numerical>>. The INT64 value is the index, and the numerical value is the value. For statistics computation, BigQuery ML treats columns of this type as ARRAY<numerical> based on the value. The value of the dimension column in the output is MAX(index) + 1.
  • STRUCT fields are expanded, and then categorical columns are cast to STRING and numerical columns are cast to FLOAT64.
  • Columns of the following data types are cast to STRING and return the same statistics as STRING columns:
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP Columns of the following data types are cast to FLOAT64 and return the same statistics as FLOAT64 columns:
    • INT64
    • NUMERIC
    • BIGNUMERIC

Output

ML.DESCRIBE_DATA returns one row for each column in the input data. ML.DESCRIBE_DATA output contains the following columns:

  • name: a STRING column that contains the name of the input column.
  • num_rows: an INT64 column that contains the total number of rows for the input column.
  • num_nulls: an INT64 column that returns the number of NULL values found in the column.
  • num_zeros: an INT64 column that contains one of the following:
    • For numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> input columns, returns the number of 0 values found in the column.
    • For categorical or ARRAY<categorical> input columns, returns NULL.
  • min: a STRING column that contains the MIN value for the column.
  • max: a STRING column that contains the MAX value for the column.
  • mean: a FLOAT64 column that contains one of the following:
    • For numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> input columns, returns the mean value calculated for the column.
    • For categorical or ARRAY<categorical> input columns, returns NULL.
  • stdev: a FLOAT64 column that contains one of the following:
    • For numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> input columns, returns the standard deviation value calculated for the column.
    • For categorical or ARRAY<categorical> input columns, returns NULL.
  • median: a FLOAT64 column that contains one of the following:
    • For numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> input columns, returns the median value calculated for the column.
    • For categorical or ARRAY<categorical> input columns, returns NULL.
  • quantiles: an ARRAY<FLOAT64> column that contains information about the quantiles in an input column, as computed by the APPROX_QUANTILES function. The quantiles column contains one of the following values:
    • For numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> input columns, returns the quantiles computed for the column.
    • For categorical or ARRAY<categorical> input columns, returns NULL.
  • unique: an INT64 column that contains information about the number of unique values in an input column, as computed by the APPROX_COUNT_DISTINCT function. The unique column contains one of the following values:
    • For numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> input columns, returns NULL.
    • For categorical or ARRAY<categorical> input columns, returns the number of unique values in the input column.
  • avg_string_length: a FLOAT64 column that contains one of the following:
    • For numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> input columns, returns NULL.
    • For categorical or ARRAY<categorical> input columns, returns the average length of the values in the column.
  • num_values: an INT64 column that contains the number of array elements for ARRAY columns, and the number of values in the column for other types of columns.
  • top_values: a ARRAY<STRUCT<STRING, INT64>> column that contains information about the top values and number of occurrences in an input column, as computed by the APPROX_TOP_COUNT function. The top_values column contains the following fields:
    • top_values.value: a STRING field that contains one of the following values:
      • For numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> input columns, returns NULL.
      • For categorical or ARRAY<categorical> input columns, returns one of the top values in the input column.
    • top_values.count: an INT64 field that contains one of the following values:
      • For numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> input columns, returns NULL.
      • For categorical or ARRAY<categorical> input columns, returns the number of times the related top value appears.
  • min_array_length: an INT64 column that contains one of the following values:
    • For ARRAY input columns, returns the minimum length of an array in the column.
    • For other types of input columns, returns NULL.
  • max_array_length: an INT64 column that contains one of the following values:
    • For ARRAY input columns, returns the maximum length of an array in the column.
    • For other types of input columns, returns NULL.
  • avg_array_length: a FLOAT64 column that contains one of the following values:
    • For ARRAY input columns, returns the average length of an array in the column.
    • For other types of input columns, returns NULL.
  • total_array_length: an INT64 column that contains one of the following values:
    • For ARRAY input columns, returns the sum of the size of the arrays in the column.
    • For other types of input columns, returns NULL.
  • array_length_quantiles: an ARRAY<INT64> column that contains the information about the quantiles for the array length in an input column, as computed by the APPROX_QUANTILES function. The array_length_quantiles column contains one of the following values:
    • For ARRAY input columns, returns the quantiles for the array length computed for the column.
    • For other types of input columns, returns 0.
  • dimension: an INT64 column that contains one of the following:
    • For ARRAY<STRUCT<INT64, numerical>> input columns, returns the dimension computed for the column, which is MAX(index) + 1 for sparse input.
    • For other types of input columns, returns NULL.

Example

The following example returns statistics for a table with five quantiles calculated for numeric columns and three top values returned for non-numeric columns:

SELECT *
FROM ML.DESCRIBE_DATA(
  TABLE `myproject.mydataset.mytable`,
  STRUCT(5 AS num_quantiles, 3 AS top_k)
);

Limitations

Input data for the ML.DESCRIBE_DATA function can only contain columns of the following data types:

  • Numeric types
  • STRING
  • BOOL
  • BYTE
  • DATE
  • DATETIME
  • TIME
  • TIMESTAMP
  • ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
  • STRUCT columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP
  • ARRAY columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP