The ML.ONE_HOT_ENCODER function
This document describes the ML.ONE_HOT_ENCODER function, which lets you
encode a string expression using a one-hot or dummy encoding scheme.
The encoding vocabulary is sorted alphabetically. NULL values and categories
that aren't in the vocabulary are encoded with an index value of 0. If you
use dummy encoding, the dropped category is encoded with a value of 0.
When you use ML.ONE_HOT_ENCODER in the TRANSFORM clause, the vocabulary and
dropped category values calculated during training, along with the top k and
frequency threshold values that you specified, are automatically used in
prediction.
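As a rough sketch of this behavior, the following statements train a model with the encoder in the TRANSFORM clause and then run prediction on raw strings. The dataset, table, model, and column names (mydataset, training_data, new_data, sample_model, category, label) are hypothetical placeholders, and the model type and argument values are chosen only for illustration.

```sql
-- Hypothetical names throughout: mydataset, sample_model, training_data,
-- new_data, category, label.
CREATE OR REPLACE MODEL `mydataset.sample_model`
  TRANSFORM(
    -- Dummy-encode the raw string column; the vocabulary is computed
    -- from the training data.
    ML.ONE_HOT_ENCODER(category, 'most_frequent', 10, 5) OVER () AS category_encoded,
    label
  )
  OPTIONS (
    model_type = 'LOGISTIC_REG',
    input_label_cols = ['label']
  )
AS
SELECT category, label
FROM `mydataset.training_data`;

-- At prediction time, pass the raw column; the transform saved with the
-- model is applied to it.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.sample_model`,
  (SELECT category FROM `mydataset.new_data`)
);
```

In this sketch, the prediction query supplies only the raw category column and does not call ML.ONE_HOT_ENCODER again.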
Syntax

    ML.ONE_HOT_ENCODER(string_expression [, drop] [, top_k] [, frequency_threshold]) OVER()

Arguments
ML.ONE_HOT_ENCODER takes the following arguments:
- string_expression: the STRING expression to encode.
- drop: a STRING value that specifies whether the function drops a category.
  Valid values are as follows:
  - none: Retain all categories. This is the default value.
  - most_frequent: Drop the most frequent category found in the string
    expression. Selecting this value causes the function to use dummy encoding.
- top_k: an INT64 value that specifies the number of categories included in
  the encoding vocabulary. The function selects the top_k most frequent
  categories in the data and uses those; all other categories are encoded
  to 0. This value must be less than 1,000,000 to avoid problems due to high
  dimensionality. The default value is 32,000.
- frequency_threshold: an INT64 value that limits the categories included in
  the encoding vocabulary based on category frequency. The function uses
  categories whose frequency is greater than or equal to frequency_threshold;
  categories below this threshold are encoded to 0. The default value is 5.
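To illustrate how the optional arguments are passed positionally, here is a minimal sketch that runs on inline data. The argument values are arbitrary choices for illustration, not recommendations.

```sql
-- Inline sample data; no tables are required.
SELECT
  f,
  -- drop = 'none' (one-hot encoding), top_k = 10, frequency_threshold = 2
  ML.ONE_HOT_ENCODER(f, 'none', 10, 2) OVER () AS output
FROM UNNEST(['a', 'b', 'b', 'c', 'c', 'c', 'd', 'd']) AS f
ORDER BY f;
```

Because 'a' occurs only once, which is below the threshold of 2, the rules described above suggest that it falls outside the vocabulary and is encoded with an index of 0, like any other out-of-vocabulary value.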
Output
ML.ONE_HOT_ENCODER returns an array of struct values, in the form
ARRAY<STRUCT<INT64, FLOAT64>>. The first element in the struct provides the
index of the encoded string expression, and the second element provides the
value of the encoded string expression.
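If you want one row per encoded element, one option is to flatten the array with UNNEST. This is only a sketch: it assumes the struct fields are named index and value, as the column headers in the example output suggest, so adjust the field names if your results differ.

```sql
SELECT
  t.f,
  o.index,  -- assumed field name
  o.value   -- assumed field name
FROM (
  -- Explicit arguments: no dropped category, top_k = 10,
  -- frequency_threshold = 0 so that the small sample is not filtered out.
  SELECT f, ML.ONE_HOT_ENCODER(f, 'none', 10, 0) OVER () AS output
  FROM UNNEST(['a', 'b', 'b', 'c']) AS f
) AS t,
UNNEST(t.output) AS o
ORDER BY t.f;
```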
Example
The following example performs dummy encoding on a set of string expressions.
It limits the encoding vocabulary to the ten categories that occur the most
frequently in the data and that also occur zero or more times.
```sql
SELECT f, ML.ONE_HOT_ENCODER(f, 'most_frequent', 10, 0) OVER () AS output
FROM UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd']) AS f
ORDER BY f;
```

The output looks similar to the following:

```
+------+-----------------------------+
| f    | output.index | output.value |
+------+--------------+--------------+
| NULL | 0            | 1.0          |
| a    | 1            | 1.0          |
| b    | 2            | 1.0          |
| b    | 2            | 1.0          |
| c    | 3            | 0.0          |
| c    | 3            | 0.0          |
| c    | 3            | 0.0          |
| d    | 4            | 1.0          |
| d    | 4            | 1.0          |
+------+-----------------------------+
```

What's next
- For information about feature preprocessing, see [Feature preprocessing overview](/bigquery/docs/preprocess-overview).
- For information about the supported SQL statements and functions for each model type, see [End-to-end user journey for each model](/bigquery/docs/e2e-journey).

Last updated 2025-08-25 UTC.