The ML.TF_IDF function

The term frequency-inverse document frequency (TF-IDF) reflects how important a word is to a document in a collection or corpus. Use the ML.TF_IDF function to compute TF-IDF of terms in a document, given the precomputed inverse-document frequency for use in machine learning model creation.

This function uses a TF-IDF algorithm to compute the relevance of terms in a set of tokenized documents. TF-IDF multiplies two metrics: how many times a term appears in a document (term frequency), and the inverse document frequency of the term across a collection of documents (inverse document frequency).

TF-IDF:

term frequency * inverse document frequency

Term frequency:

(count of term in document) / (document size)

Inverse document frequency:

log(1 + num_documents / (1 + token_document_count))

Terms are added to a dictionary of terms if they satisfy the criteria for top_k and frequency_threshold, otherwise they are considered the unknown term. The unknown term is always the first term in the dictionary and represented as 0. The rest of the dictionary is ordered alphabetically.

You can use this function with models that support manual feature preprocessing. For more information, see the following documents:

Syntax

ML.TF_IDF(
  tokenized_document
  [, top_k]
  [, frequency_threshold]
)
OVER()

Arguments

ML.TF_IDF takes the following arguments:

tokenized_document: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
top_k: Optional argument. Takes an INT64 value, which represents the size of the dictionary, excluding the unknown term. The top_k terms that appear in the most documents are added to the dictionary until this threshold is met. For example, if this value is 20, the top 20 unique terms that appear in the most documents are added and then no additional terms are added.
frequency_threshold: Optional argument. Take an INT64 value that represents the minimum number of documents a term must appear in to be included in the dictionary. For example, if this value is 3, a term must appear in at least three documents to be added to the dictionary.

Output

ML.TF_IDF returns the input table plus the following two columns:

ARRAY<STRUCT<index INT64, value FLOAT64>>

Definitions:

index: The index of the term that was added to the dictionary. Unknown terms have an index of 0.
value: The TF-IDF computation for the term.

Quotas

See Cloud AI service functions quotas and limits.

Example

The following example creates a table ExampleTable and applies the ML.TF_IDF function:

WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f
    UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
    UNION ALL
    SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f
    UNION ALL
    SELECT 4 AS id, ['you', 'like', 'pie', NULL] AS f
  )
SELECT id, ML.TF_IDF(f, 3, 1) OVER () AS results
FROM ExampleTable
ORDER BY id;

The output is similar to the following:

+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id |                                                                                     results                                                                                     |
+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|  1 | [{"index":"0","value":"0.12679902142647365"},{"index":"1","value":"0.1412163100645339"},{"index":"2","value":"0.1412163100645339"},{"index":"3","value":"0.29389333245105953"}] |
|  2 |                                                                                        [{"index":"0","value":"0.5705955964191315"},{"index":"3","value":"0.14694666622552977"}] |
|  3 |                                             [{"index":"0","value":"0.380397064279421"},{"index":"1","value":"0.21182446509680086"},{"index":"3","value":"0.14694666622552977"}] |
|  4 |                                             [{"index":"0","value":"0.380397064279421"},{"index":"2","value":"0.21182446509680086"},{"index":"3","value":"0.14694666622552977"}] |
+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

What's next

Learn more about TF-IDF outside of machine learning.