Combine TOKENLISTs

This page describes how to concatenate TOKENLISTs in either a search index when you set up your schema or in a search query when performing a full-text search in Spanner.

Combine TOKENLISTs in a search index

Sometimes, you need your application to search across individual fields. At other times, the application needs to search across all fields. For example, in a table with two string columns, you might want your application to search across both columns without differentiating which column the matches come from.

In Spanner, there are two ways to achieve this:

  1. Tokenize words separately and concatenate the resulting TOKENLIST values (recommended).
  2. Concatenate strings and tokenize the result.

With the second approach, there are two problems:

  1. If you want to index Title or Studio individually, in addition to indexing them in a combined TOKENLIST, the same text is tokenized twice. This causes transactions to use more resources.
  2. A phrase search spans both fields. For example, if @p is set to "Blue Note", it matches a row that contains both Title="Big Blue Note" and Studio="Blue Note Studios".

The first approach solves these problems because a phrase only matches one field and each string field is only tokenized once if both the individual and combined TOKENLIST are indexed. Even though each string field is only tokenized once, the resulting TOKENLISTs are stored separately in the index.

Tokenize words separately and concatenate TOKENLIST values

The following example tokenizes each word and concatenates the TOKENLIST values:

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Title STRING(MAX),
  Studio STRING(MAX),
  Title_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Title)) HIDDEN,
  Studio_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Studio)) HIDDEN,
  Combined_Tokens TOKENLIST AS (TOKENLIST_CONCAT([Title_Tokens, Studio_Tokens])) HIDDEN,
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(Combined_Tokens);

SELECT AlbumId FROM Albums WHERE SEARCH(Combined_Tokens, @p);

TOKENLIST concatenation can also be implemented entirely on the query side. For more information, see Query-side TOKENLIST concatenation.

TOKENLIST_CONCAT is supported for both full-text and substring searches. Spanner doesn't let you mix tokenization types, such as TOKENIZE_FULLTEXT and TOKENIZE_SUBSTRING in the same TOKENLIST_CONCAT call.

The definition of text TOKENLIST columns can be changed in non-stored columns to add additional columns. This is useful when you want to add an additional column to TOKENLIST_CONCAT. Changing the generated column expression doesn't backfill existing rows in the index.

Concatenate strings and tokenize the result

The following example concatenates strings and tokenizes the result:

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Title STRING(MAX),
  Studio STRING(MAX),
  Combined_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Title || " " || Studio)) HIDDEN,
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(Combined_Tokens);

SELECT AlbumId FROM Albums WHERE SEARCH(Combined_Tokens, @p);

Query-side TOKENLIST concatenation

The tradeoff with indexing the concatenated TOKENLIST is that it increases storage and write cost. Each token is now stored on the disk twice: once in a posting list of its original TOKENLIST, and once in a posting list of the combined TOKENLIST. Query-side concatenation of TOKENLIST columns avoids this cost but the query uses more compute resources.

To concatenate multiple TOKENLISTs, use the TOKENLIST_CONCAT function in the SEARCH query. For this section, we're using the following sample schema:

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Title STRING(MAX),
  Studio STRING(MAX),
  Title_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Title)) HIDDEN,
  Studio_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Studio)) HIDDEN,
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(Title_Tokens, Studio_Tokens);

The following query searches for rows that have the tokens "blue" and "note" anywhere in the Title and Studio columns. This includes rows with both "blue" and "note" in the Title column, "blue" and "note" in the Studio column, and "blue" in the Title column and "note" in the Studio column, or the opposite.

SELECT AlbumId
FROM Albums
WHERE SEARCH(TOKENLIST_CONCAT([AlbumTitle_Tokens, Studio_Tokens]), 'blue note')

Write-side and query-side TOKENLIST concatenation produce identical results. The choice between the two is a trade-off between disk cost and query cost.

Alternatively, an application could search multiple TOKENLIST columns and use OR along with the SEARCH function:

SEARCH(AlbumTitle_Tokens, 'Blue Note') OR SEARCH(Studio_Tokens, 'Blue Note')

This, however, has different semantics. It doesn't match albums where AlbumTitle_Tokens has "blue", but not "note" and Studio_Tokens has "note", but not "blue".

What's next