The ML.TF_IDF function
The term frequency-inverse document frequency (TF-IDF) reflects how important a word
is to a document in a collection or corpus. Use the ML.TF_IDF function to
compute the TF-IDF of terms in a document, given the precomputed inverse document
frequency, for use in machine learning model creation. You can use ML.TF_IDF
within the TRANSFORM clause.
This function uses a TF-IDF algorithm to compute the relevance of terms in a set
of tokenized documents. TF-IDF multiplies two metrics: how many times a term
appears in a document (term frequency), and the inverse document frequency of
the term across a collection of documents (inverse document frequency).
TF-IDF:
term frequency * inverse document frequency
Term frequency:
(count of term in document) / (document size)
Inverse document frequency:
log(1 + num_documents / (1 + token_document_count))
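As a rough illustration of the three formulas, the following Python sketch computes the score of a single term. The function and parameter names are hypothetical and are not part of BigQuery. The use of the natural logarithm is an assumption, since the formula only says log.

```python
import math


def term_frequency(term_count: int, document_size: int) -> float:
    """(count of term in document) / (document size)."""
    return term_count / document_size


def inverse_document_frequency(num_documents: int, token_document_count: int) -> float:
    """log(1 + num_documents / (1 + token_document_count)).

    Assumption: log is the natural logarithm.
    """
    return math.log(1 + num_documents / (1 + token_document_count))


def tf_idf(term_count: int, document_size: int,
           num_documents: int, token_document_count: int) -> float:
    """term frequency * inverse document frequency."""
    tf = term_frequency(term_count, document_size)
    idf = inverse_document_frequency(num_documents, token_document_count)
    return tf * idf


# A term that occurs 3 times in a 6-token document and appears in
# all 4 documents of a 4-document collection.
score = tf_idf(term_count=3, document_size=6, num_documents=4, token_document_count=4)
print(score)
```

With these inputs the sketch gives roughly 0.2939, which is consistent with the value reported for index 3 in the first row of the example output on this page. That agreement suggests, but does not confirm, that the function uses the natural logarithm and counts NULL tokens toward the document size.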
Terms are added to a dictionary of terms if they satisfy the criteria for
top_k and frequency_threshold; otherwise, they are treated as
the unknown term. The unknown term is always the first term in the dictionary
and is represented as 0. The rest of the dictionary is ordered alphabetically.
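Syntax
    ML.TF_IDF(
      tokenized_document
      [, top_k]
      [, frequency_threshold]
    )
    OVER()
Arguments
ML.TF_IDF takes the following arguments:
tokenized_document: ARRAY<STRING> value that represents a document that
has been tokenized. A tokenized document is a collection of terms (tokens),
which are used for text analysis.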
top_k: Optional argument. Takes an INT64 value,
which represents the size of the dictionary, excluding the unknown term. The
top_k terms that appear in the most documents are added to the dictionary
until this threshold is met. For example, if this value is 20, the top 20
unique terms that appear in the most documents are added and then no
additional terms are added.
frequency_threshold: Optional argument. Takes an INT64 value that
represents the minimum number of documents a term must appear in to be
included in the dictionary. For example, if this value is 3, a term must
appear in at least three documents to be added to the
dictionary.
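The way top_k and frequency_threshold shape the dictionary can be sketched in Python as follows. This is an illustration under stated assumptions, not BigQuery's implementation: the build_dictionary helper is hypothetical, omitted arguments are treated as "no limit", NULL tokens are assumed never to enter the dictionary, and ties between terms with the same document count are assumed to be broken alphabetically.

```python
from collections import Counter


def build_dictionary(documents, top_k=None, frequency_threshold=None):
    """Map each kept term to its dictionary index; index 0 is the unknown term."""
    # Count how many documents each term appears in (not total occurrences).
    # Assumption: NULL (None) tokens never enter the dictionary.
    document_counts = Counter(
        term for doc in documents for term in set(doc) if term is not None
    )

    # frequency_threshold: minimum number of documents a term must appear in.
    candidates = [
        (term, count)
        for term, count in document_counts.items()
        if frequency_threshold is None or count >= frequency_threshold
    ]

    # top_k: keep the terms that appear in the most documents.
    # Assumption: ties are broken alphabetically.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    if top_k is not None:
        candidates = candidates[:top_k]

    # Index 0 is reserved for the unknown term; the rest is ordered alphabetically
    # (here: Python's default string ordering).
    kept_terms = sorted(term for term, _ in candidates)
    return {term: index for index, term in enumerate(kept_terms, start=1)}


documents = [
    ["I", "like", "pie", "pie", "pie", None],
    ["yum", "yum", "pie", None],
    ["I", "yum", "pie", None],
    ["you", "like", "pie", None],
]
dictionary = build_dictionary(documents, top_k=3, frequency_threshold=1)
print(dictionary)
```

For these four documents the sketch keeps I, like, and pie at indexes 1, 2, and 3, so yum and you fall back to the unknown term at index 0. Raising frequency_threshold to 3 would leave only pie in the dictionary, because it is the only term that appears in at least three documents.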
Output
For each input row, ML.TF_IDF returns a value of the following type:
ARRAY<STRUCT<index INT64, value FLOAT64>>
Definitions:
index: The index of the term that was added to the dictionary. Unknown terms
have an index of 0.
value: The TF-IDF computation for the term.
Quotas
See Cloud AI service functions quotas and limits (/bigquery/quotas#cloud_ai_service_functions).
Example
The following example creates a table ExampleTable and applies the ML.TF_IDF
function:

```sql
WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f
    UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
    UNION ALL
    SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f
    UNION ALL
    SELECT 4 AS id, ['you', 'like', 'pie', NULL] AS f
  )
SELECT id, ML.TF_IDF(f, 3, 1) OVER () AS results
FROM ExampleTable
ORDER BY id;
```

The output is similar to the following:

```
+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | results                                                                                                                                                                         |
+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1  | [{"index":"0","value":"0.12679902142647365"},{"index":"1","value":"0.1412163100645339"},{"index":"2","value":"0.1412163100645339"},{"index":"3","value":"0.29389333245105953"}] |
| 2  | [{"index":"0","value":"0.5705955964191315"},{"index":"3","value":"0.14694666622552977"}]                                                                                        |
| 3  | [{"index":"0","value":"0.380397064279421"},{"index":"1","value":"0.21182446509680086"},{"index":"3","value":"0.14694666622552977"}]                                             |
| 4  | [{"index":"0","value":"0.380397064279421"},{"index":"2","value":"0.21182446509680086"},{"index":"3","value":"0.14694666622552977"}]                                             |
+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

What's next
Learn more about TF-IDF outside of machine learning (/bigquery/docs/reference/standard-sql/text-analysis-functions#tf_idf).