The ML.LABEL_ENCODER function
This document describes the ML.LABEL_ENCODER function, which you can use to
encode a string expression to an INT64 value in [0, <number of categories>].
The encoding vocabulary is sorted alphabetically. NULL values and categories
that aren't in the vocabulary are encoded to 0.
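The mapping described above can be emulated in a few lines of Python. This is only an illustrative sketch, not BigQuery's implementation: the helper names `build_vocabulary` and `encode` are hypothetical, `None` stands in for `NULL`, Python's default string ordering is assumed to match "alphabetical", and the top k and frequency threshold filters are ignored here.

```python
def build_vocabulary(values):
    """Assign codes 1..N to the distinct non-NULL categories in sorted order."""
    categories = sorted({v for v in values if v is not None})
    return {category: code for code, category in enumerate(categories, start=1)}


def encode(value, vocabulary):
    """Return the category's code, or 0 for NULL (None) and unknown categories."""
    return vocabulary.get(value, 0)


vocabulary = build_vocabulary(["pear", None, "apple", "fig", "apple"])
codes = [encode(v, vocabulary) for v in ["apple", "fig", "pear", None, "kiwi"]]

print(vocabulary)  # {'apple': 1, 'fig': 2, 'pear': 3}
print(codes)       # [1, 2, 3, 0, 0]
```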
When used in the TRANSFORM clause, the vocabulary values calculated during
training, along with the top k and frequency threshold values that you
specified, are automatically used in prediction.
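One way to picture this behavior is as a fit-once, reuse-later encoder. The Python sketch below is an analogy under stated assumptions, not BigQuery's implementation: `FittedLabelEncoder` and `fit` are hypothetical names, `None` stands in for `NULL`, and the top k and frequency threshold filters are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FittedLabelEncoder:
    """Holds the vocabulary computed once from the training data."""
    vocabulary: dict  # category -> code, fixed at training time

    def transform(self, values):
        # The stored vocabulary is reused as-is; nothing is recomputed here.
        return [self.vocabulary.get(v, 0) for v in values]


def fit(training_values):
    """Build the vocabulary from training data only (hypothetical helper)."""
    categories = sorted({v for v in training_values if v is not None})
    return FittedLabelEncoder({c: i for i, c in enumerate(categories, start=1)})


encoder = fit(["red", "green", "green", "blue"])
training_codes = encoder.transform(["red", "green", "blue"])
# Categories unseen in training ("purple") and NULLs map to 0 at prediction.
prediction_codes = encoder.transform(["green", "purple", None, "red"])

print(training_codes)    # [3, 2, 1]
print(prediction_codes)  # [2, 0, 0, 3]
```

The point of the analogy is that prediction-time inputs never change the codes: a category keeps the code it received during training, whatever the prediction data looks like.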
Syntax
ML.LABEL_ENCODER(string_expression [, top_k] [, frequency_threshold]) OVER()
ML.LABEL_ENCODER takes the following arguments:
- string_expression: the STRING expression to encode.
- top_k: an optional INT64 value that specifies the number of categories
  included in the encoding vocabulary. The function selects the top_k
  most frequent categories in the data and uses those; all other categories
  are encoded to 0. This value must be less than 1,000,000
  to avoid problems due to high dimensionality. The default value is 32,000.
- frequency_threshold: an optional INT64 value that limits the categories
  included in the encoding vocabulary based on category frequency. The
  function uses categories whose frequency is greater than or equal to
  frequency_threshold; categories below this threshold are encoded to 0.
  The default value is 5.
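To make the interaction of the two limits concrete, the following Python sketch emulates the vocabulary selection on the nine input values used in this document's Example section. It is a hedged illustration rather than BigQuery's implementation: `select_vocabulary` is a hypothetical helper, `None` stands in for `NULL`, and breaking frequency ties alphabetically is an assumption that happens to agree with the documented example output.

```python
from collections import Counter


def select_vocabulary(values, top_k=32000, frequency_threshold=5):
    """Pick the encoding vocabulary and assign codes 1..N alphabetically."""
    counts = Counter(v for v in values if v is not None)
    # Keep only categories that occur at least frequency_threshold times.
    frequent = [c for c, n in counts.items() if n >= frequency_threshold]
    # Rank by descending frequency; ties broken alphabetically (assumption).
    frequent.sort(key=lambda c: (-counts[c], c))
    # Keep the top_k categories, then number them in alphabetical order.
    kept = sorted(frequent[:top_k])
    return {c: code for code, c in enumerate(kept, start=1)}


data = [None, "a", "b", "b", "c", "c", "c", "d", "d"]
vocabulary = select_vocabulary(data, top_k=2, frequency_threshold=2)
codes = [vocabulary.get(v, 0) for v in data]

print(vocabulary)  # {'b': 1, 'c': 2}
print(codes)       # [0, 0, 1, 1, 2, 2, 2, 0, 0]
```

Here `a` is dropped by the frequency threshold (it occurs once), while `d` meets the threshold but falls outside the top two categories, so both are encoded to 0.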
Output
ML.LABEL_ENCODER returns an INT64 value that represents the encoded
string expression.
Example
The following example performs label encoding on a set of string expressions.
It limits the encoding vocabulary to the two categories that occur the most
frequently in the data and that also occur two or more times.

```sql
SELECT f, ML.LABEL_ENCODER(f, 2, 2) OVER () AS output
FROM UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd']) AS f
ORDER BY f;
```

The output looks similar to the following:

```
+------+--------+
| f    | output |
+------+--------+
| NULL | 0      |
| a    | 0      |
| b    | 1      |
| b    | 1      |
| c    | 2      |
| c    | 2      |
| c    | 2      |
| d    | 0      |
| d    | 0      |
+------+--------+
```

What's next
- For information about feature preprocessing, see [Feature preprocessing overview](/bigquery/docs/preprocess-overview).
- For information about the supported SQL statements and functions for each model type, see [End-to-end user journey for each model](/bigquery/docs/e2e-journey).

Last updated 2025-08-25 UTC.