Text analysis

This document provides an overview of text analysis, also known as text mining, in GoogleSQL for BigQuery.

GoogleSQL supports text analysis, which is a technique that you can use to identify terms (tokens) in unstructured text, and then use those terms for actionable insight, such as indexing and searching, or as inputs for vectorizations to be used in ML training pipelines. You can use a text analyzer to analyze information in a specific way and text analysis options to apply your own analyzation customizations.

Text analysis is supported in the following GoogleSQL functions and statements:

Text analyzers

GoogleSQL for BigQuery supports several types of text analyzers, which you can use to extract data from unstructured text. You can pass an analyzer into some functions and statements with the analyzer argument. Each text analyzer has a unique way of extracting information. Your choices are:

  • NO_OP_ANALYZER: Extracts the input as a single term (token).
  • LOG_ANALYZER: Breaks the input into terms when delimiters are encountered.
  • PATTERN_ANALYZER: Breaks the input into terms that match a regular expression.

NO_OP_ANALYZER analyzer

The NO_OP_ANALYZER analyzer is a no-operation analyzer, which extracts the input text as a single term (token). No formatting is applied to the resulting term.

This analyzer doesn't support any text analyzer options or token filters.

Example

The following query uses NO_OP_ANALYZER as the text analyzer:

SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer=>'NO_OP_ANALYZER'
) AS results

/*-----------------------------------------------*
 | results                                       |
 +-----------------------------------------------+
 | 'I like pie, you like-pie, they like 2 PIEs.' |
 *-----------------------------------------------*/

LOG_ANALYZER analyzer

The LOG_ANALYZER analyzer extracts the input text as terms (tokens) when a delimiter is encountered, discards the delimiters, and then changes any uppercase letters to lowercase letters in the results.

Details:

  • An uppercase letter in a term is made lowercase, but ASCII values greater than 127 are kept as is.
  • Text is split into individual terms when one the following delimiters, such as a space, period, or other non-letter character, is encountered:

    [ ] < > ( ) { } | ! ; , ' " * & ? + / : = @ . - $ % \ _ \n \r \s \t %21 %26
    %2526 %3B %3b %7C %7c %20 %2B %2b %3D %3d %2520 %5D %5d %5B %5b %3A %3a %0A
    %0a %2C %2c %28 %29
    

    If you don't want to use these default delimiters, you can specify the specific delimiters you want to use as text analyzer options. For more information, see delimiters option.

This analyzer supports token filters. For more information, see token_filters option. If the token_filters option is not specified, ASCII lowercase normalization is used by default.

Example

The following query uses LOG_ANALYZER as the text analyzer:

SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer=>'LOG_ANALYZER'
) AS results

/*---------------------------------------------------------------------------*
 | results                                                                   |
 +---------------------------------------------------------------------------+
 | [ 'i', 'like', 'pie', 'you', 'like', 'pie', 'they', 'like', '2', 'pies' ] |
 *---------------------------------------------------------------------------*/

Because LOG_ANALYZER is the default text analyzer, you don't need to specify it in the query. For example, the following query produces the same results as the preceding query:

SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.'
) AS results

PATTERN_ANALYZER analyzer

The PATTERN_ANALYZER analyzer extracts terms (tokens) from unstructured text, using a re2 regular expression.

This analyzer finds the first term from the left side of the input text that matches the regular expression and adds this term to the output. Then, it removes the prefix in the input text up to the newly found term. This process is repeated until the input text is empty.

By default, the regular expression \b\w{2,}\b is used. This regular expression matches non-Unicode words that have at least two characters. If you would like to use another regular expression, see patterns option.

This analyzer supports token filters. For more information, see token_filters option. If the token_filters option is not specified, ASCII lowercase normalization is used by default.

Example

The following query uses PATTERN_ANALYZER as the text analyzer. Because the default regular expression is used, only words that have two or more characters are included as terms. Also, the results are lowercase. Notice that i and 2 don't appear in the results.

SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer=>'PATTERN_ANALYZER'
) AS results

/*----------------------------------------------------------------*
 | results                                                        |
 +----------------------------------------------------------------+
 | ['like', 'pie', 'you', 'like', 'pie', 'they', 'like', 'pies' ] |
 *----------------------------------------------------------------*/

Text analyzer options

Text analyzers support custom options that determine how input text is analyzed. You can pass analyzer options into some functions and statements with the analyzer_options argument. This argument takes a JSON-formatted STRING value.

Your choices are:

  • delimiters: Breaks the input into terms when these delimiters are encountered.
  • patterns: Breaks the input into terms that match a regular expression.
  • token_filters: After the input text has been tokenized into terms, apply filters on the terms.

delimiters analyzer option

'{
  "delimiters": array_of_delimiters
}'

Description

If you are using the LOG_ANALYZER text analyzer and you don't want to use the default delimiters, you can specify the specific delimiters that you want to use to filter the input text.

Definitions

  • delimiters: A JSON array containing strings that represent the delimiters to use to tokenize the input text.

Details

When there are two delimiters with the same prefix, for example: % and %2, the longer delimiter has higher precedence and is analyzed first.

You can add any ASCII string as a delimiter. The length of a delimiter must be less than or equal to 16 characters. Some common delimiters that you might want to include are:

[ ] < > ( ) { } | ! ; , ' " * & ? + / : = @ . - $ % \ _ \n \r \s \t %21 %26
%2526 %3B %3b %7C %7c %20 %2B %2b %3D %3d %2520 %5D %5d %5B %5b %3A %3a %0A
%0a %2C %2c %28 %29

Example

SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'{"delimiters": [",", ".", "-"]}'
) AS results

/*-------------------------------------------------------*
 | results                                               |
 +-------------------------------------------------------+
 | ['i like pie', 'you like', 'pie', 'they like 2 pies]' |
 *-------------------------------------------------------*/

patterns analyzer option

'{
  "patterns": array_of_regex_patterns
}'

Description

If you are using the PATTERN_ANALYZER text analyzer and you don't want to use the default regular expression, you can specify the regular expression that you want to use to filter the input text.

Definitions

  • patterns: A JSON array which contains one string that represents the regular expression.

Details

If this analyzer option is not provided for the PATTERN_ANALYZER text analyzer, the regular expression \b\w{2,}\b is used by default to match non-Unicode words that have at least two characters.

Example

SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer=>'PATTERN_ANALYZER',
  analyzer_options=>'{"patterns": ["[a-zA-Z]*"]}'
) AS results

/*----------------------------------------------------------------*
 | results                                                        |
 +----------------------------------------------------------------+
 | ['like', 'pie', 'you', 'like', 'pie', 'they', 'like', 'pies' ] |
 *----------------------------------------------------------------*/

token_filters analyzer option

'{
  "token_filters": array_of_token_filters
}'

Description

If you are using the LOG_ANALYZER or PATTERN_ANALYZER text analyzer, you can sequentially apply one or more token filters to the input text after the input text has been tokenized.

Definitions

  • array_of_token_filters: A JSON array containing objects that represent token filters.

Details

For more information about the specific token filters you can add, see Token filters.

Example

For example, this query contains both patterns and token_filters options:

SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer=>'PATTERN_ANALYZER',
  analyzer_options=>'''
  {
    "patterns": ["[a-zA-Z]*"],
    "token_filters": [
      {
        "normalizer": {
          "mode": "LOWER"
        }
      },
      {
        "stop_words": ["they", "pie"]
      }
    ]
  }
  '''
) AS results

/*----------------------------------------------*
 | results                                      |
 +----------------------------------------------+
 | ['i', 'like', 'you', 'like', 'like, 'PIEs' ] |
 *----------------------------------------------*/

Token filters

'{
  "token_filters": [
    {
      "normalizer": {
        "mode": json_string,
        "icu_normalize_mode": json_string,
        "icu_case_folding": json_boolean
      }
    },
    {
      "stop_words": json_string_array
    }
  ]
}'

Token filters can modify or delete terms (tokens) that are extracted from input text. If no token filters are specified for a text analyzer that supports token filters, the ASCII lowercase normalization token filter is applied by default. If multiple token filters are added, they are applied in the order in which they are specified. The same token filter can be included multiple times in the token_filters array. See the examples in this section for details.

Definitions

Each token filter has a unique JSON syntax that contains some of these JSON key-value pairs, depending upon the type of token filter you want to use:

  • token_filters: JSON array of objects that contain token filters. The same type of token filter can be included multiple times in this array.

    • stop_words: JSON array of strings that represent the words to remove from the list of terms.

    • normalizer: JSON object that contains the normalization settings for a token filter. The settings include:

      • mode: JSON string that represents the normalization mode. Your choices are:

        • NONE: Don't apply normalization mode to terms.

        • LOWER: ASCII lowercase terms. If no token filters are specified for a text analyzer that supports token filters, this is used by default.

        • ICU_NORMALIZE: ICU normalize terms.

      • icu_normalize_mode: JSON string that represents the ICU normalization mode. Your choices are:

        You can use this if mode is ICU_NORMALIZE. If mode is ICU_NORMALIZE and this key-value pair is not set, icu_normalize_mode is NFKC by default.

      • icu_case_folding: JSON boolean that determines whether to apply ICU case folding to terms. true to apply ICU case folding to terms. Otherwise false.

        You can use this if mode is ICU_NORMALIZE. If mode is ICU_NORMALIZE and this value is not used, icu_case_folding is true by default.

Details

Token filters can be used with all but the NO_OP_ANALYZER text analyzer in the same query. Token filters are applied after the text analyzer breaks input text into terms.

If token_filters is not specified for an analyzer that supports token filters, ASCII lowercase normalization is applied by default.

You can add multiple token filters to the token filters array (token_filters). If multiple token filters are added, they are applied to the terms in the order in which they are specified. For more information, see the examples in this section.

You can add the same token filter multiple times to the token filters array. For more information, see the examples in this section.

Here are some of the filters that you can apply to terms, using the token filter JSON syntax:

Examples

In the following example, the terms are NFKC normalized, and then because ICU case folding is true, the terms are converted to lowercase. Finally, the lowercase words pies and 2 are removed from the query.

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC",
          "icu_case_folding": true
        }
      },
      {
        "stop_words": ["pies", "2"]
      }
    ]
  }
  '''
) AS results

/*------------------------------------------*
 | results                                  |
 +------------------------------------------+
 | ['i', 'like', '❶', 'you', 'like', 'ño' ] |
 *------------------------------------------*/

The following query is similar to the preceding one, but the order of token filters is re-ordered, and this affects the outcome of the query. In the results, 2 and PIEs is retained because is normalized to 2 and PIEs is normalized to pies after the stop words token filter is applied:

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "stop_words": ["pies", "2"]
      },
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC",
          "icu_case_folding": true
        }
      }
    ]
  }
  '''
) AS results

/*-------------------------------------------------------*
 | results                                               |
 +-------------------------------------------------------+
 | ['i', 'like', '❶', '2', 'you', 'like', 'ño', 'pies' ] |
 *-------------------------------------------------------*/

You can use the same token filter as many times as you'd like in a query. In the following query, stop_words is used twice:

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "stop_words": ["like", "you"]
      },
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC",
          "icu_case_folding": true
        }
      },
      {
        "stop_words": ["ño"]
      }
    ]
  }
  '''
) AS results

/*----------------------------------*
 | results                          |
 +----------------------------------+
 | ['i', '❶', '2', 'pies', 'pies' ] |
 *----------------------------------*/

No normalization

'{
  "token_filters": [
    "normalizer": {
      "mode": "NONE"
    }
  ]
}'

Description

Normalization isn't applied to terms.

Example

In the following query, normalization is not applied to the results:

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "NONE"
        }
      }
    ]
  }
  '''
) AS results

/*----------------------------------------------------------------*
 | results                                                        |
 +----------------------------------------------------------------+
 | ['I', 'like', '❶', '②', 'pies', 'you', 'like', 'Ño', 'PIEs' ] |
 *----------------------------------------------------------------*/

Convert to lowercase (ASCII)

'{
  "token_filters": [
    "normalizer": {
      "mode": "LOWER"
    }
  ]
}'

Description

Performs ASCII lowercasing on the resulting terms.

Example

In the following query, ASCII lowercasing is applied to the results:

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "LOWER"
        }
      }
    ]
  }
  '''
) AS results

/*----------------------------------------------------------------*
 | results                                                        |
 +----------------------------------------------------------------+
 | ['i', 'like', '❶', '②', 'pies', 'you', 'like', 'Ño', 'pies' ] |
 *----------------------------------------------------------------*/

Convert to lowercase (ICU case folding)

'{
  "token_filters": [
    "normalizer": {
      "mode": "ICU_NORMALIZE",
      "icu_case_folding": true
    }
  ]
}'

Description

Performs ICU case folding, which converts the resulting terms to lowercase.

Example

In the following query, ICU case folding is applied to the results:

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_case_folding": true
        }
      }
    ]
  }
  '''
) AS results

/*--------------------------------------------------------------*
 | results                                                      |
 +--------------------------------------------------------------+
 | ['i', 'like', '❶', '2' 'pies', 'you', 'like', 'ño', 'pies' ] |
 *--------------------------------------------------------------*/

Preserve uppercase

'{
  "token_filters": [
    "normalizer": {
      "mode": "ICU_NORMALIZE",
      "icu_case_folding": false
    }
  ]
}'

Description

Don't convert uppercase characters to lowercase characters in the resulting terms.

Example

In the following query, ICU case folding is not applied to the results:

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_case_folding": false
        }
      }
    ]
  }
  '''
) AS results

/*---------------------------------------------------------------*
 | results                                                       |
 +---------------------------------------------------------------+
 | ['I', 'like', '❶', '2' 'pies', 'you', 'like',  'Ño', 'PIEs' ] |
 *---------------------------------------------------------------*/

ICU normalize with NFC

'{
  "token_filters": [
    "normalizer": {
      "mode": "ICU_NORMALIZE",
      "icu_normalize_mode": "NFC"
    }
  ]
}'

Description

Normalizes text with ICU NFC normalization, which decomposes and recomposes characters by canonical equivalence.

Example

In the following query, NFC normalization is applied to the results:

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFC"
        }
      }
    ]
  }
  '''
) AS results

/*---------------------------------------------------------------*
 | results                                                       |
 +---------------------------------------------------------------+
 | ['i', 'like', '❶', '②' 'pies', 'you', 'like',  'ño', 'pies' ] |
 *---------------------------------------------------------------*/

ICU normalize with NFKC

'{
  "token_filters": [
    "normalizer": {
      "mode": "ICU_NORMALIZE",
      "icu_normalize_mode": "NFKC"
    }
  ]
}'

Description

Normalizes text with ICU NFKC normalization, which decomposes characters by compatibility, and then recomposes the characters by canonical equivalence.

Example

In the following query, NFKC normalization is applied to the results:

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC"
        }
      }
    ]
  }'''
) AS results

/*---------------------------------------------------------------*
 | results                                                       |
 +---------------------------------------------------------------+
 | ['i', 'like', '❶', '2' 'pies', 'you', 'like',  'ño', 'pies' ] |
 *---------------------------------------------------------------*/

ICU normalize with NFD

'{
  "token_filters": [
    "normalizer": {
      "mode": "ICU_NORMALIZE",
      "icu_normalize_mode": "NFD"
    }
  ]
}'

Description

Normalizes text with ICU NFD normalization, which decomposes characters by canonical equivalence, and then arranges multiple combining characters in a specific order.

Example

In the following query, although the input and output for ñ look the same, the bytes are different (input is \u00f1, output is \u006e \u0303).

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFD"
        }
      }
    ]
  }
  '''
) AS results

/*---------------------------------------------------------------*
 | results                                                       |
 +---------------------------------------------------------------+
 | ['i', 'like', '❶', '2' 'pies', 'you', 'like',  'ño', 'pies' ] |
 *---------------------------------------------------------------*/

ICU normalize with NFKD

'{
  "token_filters": [
    "normalizer": {
      "mode": "ICU_NORMALIZE",
      "icu_normalize_mode": "NFKD"
    }
  ]
}'

Description

Normalizes text with ICU NFKD normalization, which decomposes characters by compatibility, and then arranges multiple combining characters in a specific order.

Example

In the following query, although the input and output for ñ look the same, the bytes are different (input is \u00f1, output is \u006e \u0303).

SELECT TEXT_ANALYZE(
  'I like   pies, you like Ño PIEs',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {"normalizer": {
        "mode": "ICU_NORMALIZE",
        "icu_normalize_mode": "NFKD"
        }
      }
    ]
  }'''
) AS results

/*---------------------------------------------------------------*
 | results                                                       |
 +---------------------------------------------------------------+
 | ['i', 'like', '❶', '2' 'pies', 'you', 'like',  'ño', 'pies' ] |
 *---------------------------------------------------------------*/

Remove words

'{
  "token_filters": [
    "stop_words": array_of_stop_words
  ]
}'

Description

Exclude a list of terms (tokens) from the results.

Definitions

  • array_of_stop_words: A JSON array containing strings that represent terms. These terms shouldn't be included in the results. The array must have at least one element. An empty string is a valid array element.

Example

In the following query, the words they and pie are excluded from the results:

SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer=>'LOG_ANALYZER',
  analyzer_options=>'''
  {
    "token_filters": [
      {
        "stop_words": ["they", "pie"]
      }
    ]
  }
  '''
) AS results

/*---------------------------------------------------*
 | results                                           |
 +---------------------------------------------------+
 | ['I', 'like', 'you', 'like', 'like, '2', 'PIEs' ] |
 *---------------------------------------------------*/