The ML.MULTI_HOT_ENCODER function
This document describes the ML.MULTI_HOT_ENCODER function, which lets you
encode a string array expression by using a multi-hot encoding scheme.
The encoding vocabulary is sorted alphabetically. NULL values and categories
that aren't in the vocabulary are encoded with an index value of 0.
When you use the function in the TRANSFORM clause, the vocabulary calculated
during training and the top_k and frequency threshold values that you
specified are automatically used in prediction.
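The TRANSFORM behavior described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the dataset, table, model, and column names (`mydataset`, `training_data`, `new_data`, `tag_model`, `tags`, `label`) are hypothetical, and it assumes the chosen model type accepts the encoded column as a feature.

```sql
-- Hypothetical dataset, table, model, and column names; replace with your own.
CREATE OR REPLACE MODEL `mydataset.tag_model`
  TRANSFORM(
    -- The vocabulary is computed over the training data.
    -- top_k = 100 and frequency_threshold = 2 are arbitrary illustrative values.
    ML.MULTI_HOT_ENCODER(tags, 100, 2) OVER () AS tags_encoded,
    label
  )
  OPTIONS (
    model_type = 'LOGISTIC_REG',
    input_label_cols = ['label']
  )
AS
SELECT tags, label
FROM `mydataset.training_data`;

-- At prediction time, pass the raw ARRAY<STRING> column.
-- The transform defined at training time is applied to it.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.tag_model`,
  (SELECT tags FROM `mydataset.new_data`)
);
```

Because the encoding lives in the TRANSFORM clause, the prediction query does not call ML.MULTI_HOT_ENCODER itself.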
Syntax
ML.MULTI_HOT_ENCODER(array_expression [, top_k] [, frequency_threshold]) OVER()
Arguments
ML.MULTI_HOT_ENCODER takes the following arguments:
- array_expression: the ARRAY<STRING> expression to encode.
- top_k: an optional INT64 value that specifies the number of categories included in the encoding vocabulary. The function selects the top_k most frequent categories in the data and uses those; all other categories are encoded to 0. This value must be less than 1,000,000 to avoid problems due to high dimensionality. The default value is 32,000.
- frequency_threshold: an optional INT64 value that limits the categories included in the encoding vocabulary based on category frequency. The function uses categories whose frequency is greater than or equal to frequency_threshold; categories below this threshold are encoded to 0. The default value is 5.
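To see how the optional arguments interact, the following sketch calls the function three ways on the same two sample rows. The CTE name `sample_rows` and the column aliases are illustrative. Following the argument descriptions above, no category in this tiny sample reaches the default frequency threshold of 5, so the first two calls would be expected to encode every element to index 0.

```sql
-- Two small sample rows; sample_rows is an illustrative name.
WITH sample_rows AS (
  SELECT ['a', 'b', 'b', 'c', NULL] AS f
  UNION ALL
  SELECT ['c', 'c', 'd', 'd', NULL] AS f
)
SELECT
  f[OFFSET(0)] AS f0,
  -- Both optional arguments omitted: top_k = 32,000, frequency_threshold = 5.
  ML.MULTI_HOT_ENCODER(f) OVER () AS default_args,
  -- Only top_k given: frequency_threshold stays at its default of 5.
  ML.MULTI_HOT_ENCODER(f, 2) OVER () AS top_2,
  -- Both given: at most 3 categories, each occurring at least once.
  ML.MULTI_HOT_ENCODER(f, 3, 1) OVER () AS top_3_min_1
FROM sample_rows
ORDER BY f0;
```

On small datasets, lowering frequency_threshold explicitly is usually necessary to get a non-empty vocabulary.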
Output
ML.MULTI_HOT_ENCODER returns an array of struct values in the form ARRAY<STRUCT<INT64, FLOAT64>>. The first element in the struct provides the
index of the encoded string expression, and the second element provides the
value of the encoded string expression.
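Because the result is an array of structs, it can be flattened into one row per encoded entry with UNNEST. The sketch below assumes the struct fields are named `index` and `value`, as the output column headings in this document suggest; the CTE name `encoded` and the alias `o` are illustrative.

```sql
WITH encoded AS (
  SELECT
    f[OFFSET(0)] AS f0,
    ML.MULTI_HOT_ENCODER(f, 3, 1) OVER () AS output
  FROM (
    SELECT ['a', 'b', 'b', 'c', NULL] AS f
    UNION ALL
    SELECT ['c', 'c', 'd', 'd', NULL] AS f
  )
)
-- One row per (input row, encoded entry) pair.
SELECT f0, o.index, o.value
FROM encoded, UNNEST(output) AS o
ORDER BY f0, o.index;
```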
Example
The following example performs multi-hot encoding on a set of string array
expressions. It limits the encoding vocabulary to the three categories that
occur the most frequently in the data and that also occur one or more times.
```sql
SELECT f[OFFSET(0)] AS f0, ML.MULTI_HOT_ENCODER(f, 3, 1) OVER () AS output
FROM
  (
    SELECT ['a', 'b', 'b', 'c', NULL] AS f
    UNION ALL
    SELECT ['c', 'c', 'd', 'd', NULL] AS f
  )
ORDER BY f[OFFSET(0)];
```

The output looks similar to the following:

```
+------+-----------------------------+
| f0   | output.index | output.value |
+------+--------------+--------------+
| a    | 1            | 1.0          |
|      | 2            | 1.0          |
|      | 3            | 1.0          |
|      | 0            | 1.0          |
| c    | 3            | 1.0          |
|      | 0            | 1.0          |
+------+-----------------------------+
```

What's next
- For information about feature preprocessing, see [Feature preprocessing overview](/bigquery/docs/preprocess-overview).
- For information about the supported SQL statements and functions for each model type, see [End-to-end user journey for each model](/bigquery/docs/e2e-journey).

Last updated 2025-08-25 UTC.