This page describes how to concatenate TOKENLIST
s in either a
search index when you set up your schema
or in a search query when performing a full-text search in Spanner.
Combine TOKENLISTs in a search index
Sometimes, you need your application to search across individual fields. At other times, the application needs to search across all fields. For example, in a table with two string columns, you might want your application to search across both columns without differentiating which column the matches come from.
In Spanner, there are two ways to achieve this:
- Tokenize words separately and concatenate the resulting
TOKENLIST
values (recommended). - Concatenate strings and tokenize the result.
With the second approach, there are two problems:
- If you want to index
Title
orStudio
individually, in addition to indexing them in a combinedTOKENLIST
, the same text is tokenized twice. This causes transactions to use more resources. - A phrase search spans both fields. For example, if
@p
is set to"Blue Note"
, it matches a row that contains bothTitle
="Big Blue Note" andStudio
="Blue Note Studios".
The first approach solves these problems because a phrase only matches one field
and each string field is only tokenized once if both the individual and combined
TOKENLIST
are indexed. Even though each string field is only tokenized once,
the resulting TOKENLISTs are stored separately in the index.
Tokenize words separately and concatenate TOKENLIST
values
The following example tokenizes each word and concatenates the TOKENLIST
values:
CREATE TABLE Albums (
AlbumId STRING(MAX) NOT NULL,
Title STRING(MAX),
Studio STRING(MAX),
Title_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Title)) HIDDEN,
Studio_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Studio)) HIDDEN,
Combined_Tokens TOKENLIST AS (TOKENLIST_CONCAT([Title_Tokens, Studio_Tokens])) HIDDEN,
) PRIMARY KEY(AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(Combined_Tokens);
SELECT AlbumId FROM Albums WHERE SEARCH(Combined_Tokens, @p);
TOKENLIST
concatenation can also be implemented entirely on the query side.
For more information, see Query-side TOKENLIST
concatenation.
TOKENLIST_CONCAT
is supported for both full-text and
substring searches.
Spanner doesn't let you mix tokenization types, such as
TOKENIZE_FULLTEXT
and TOKENIZE_SUBSTRING
in the same TOKENLIST_CONCAT
call.
The definition of text TOKENLIST
columns can be changed in non-stored columns
to add additional columns. This is useful when you want to add an additional
column to TOKENLIST_CONCAT
. Changing the generated column expression doesn't
backfill existing rows in the index.
Concatenate strings and tokenize the result
The following example concatenates strings and tokenizes the result:
CREATE TABLE Albums (
AlbumId STRING(MAX) NOT NULL,
Title STRING(MAX),
Studio STRING(MAX),
Combined_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Title || " " || Studio)) HIDDEN,
) PRIMARY KEY(AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(Combined_Tokens);
SELECT AlbumId FROM Albums WHERE SEARCH(Combined_Tokens, @p);
Query-side TOKENLIST
concatenation
The tradeoff with indexing the concatenated TOKENLIST
is that it increases
storage and write cost. Each token is now stored on the disk twice:
once in a posting list of its original TOKENLIST
, and once in a posting list
of the combined TOKENLIST
. Query-side concatenation of TOKENLIST
columns
avoids this cost but the query uses more compute resources.
To concatenate multiple TOKENLIST
s, use the
TOKENLIST_CONCAT
function in the
SEARCH
query. For this section, we're using the following sample schema:
CREATE TABLE Albums (
AlbumId STRING(MAX) NOT NULL,
Title STRING(MAX),
Studio STRING(MAX),
Title_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Title)) HIDDEN,
Studio_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Studio)) HIDDEN,
) PRIMARY KEY(AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(Title_Tokens, Studio_Tokens);
The following query searches for rows that have the tokens "blue"
and "note" anywhere in the Title
and Studio
columns. This includes
rows with both "blue" and "note" in the Title
column, "blue" and "note" in the
Studio
column, and "blue" in the Title
column and "note" in the Studio
column, or the opposite.
SELECT AlbumId
FROM Albums
WHERE SEARCH(TOKENLIST_CONCAT([AlbumTitle_Tokens, Studio_Tokens]), 'blue note')
Write-side and query-side TOKENLIST
concatenation produce identical results.
The choice between the two is a trade-off between disk cost and query cost.
Alternatively, an application could search multiple TOKENLIST
columns and use
OR
along with the SEARCH
function:
SEARCH(AlbumTitle_Tokens, 'Blue Note') OR SEARCH(Studio_Tokens, 'Blue Note')
This, however, has different semantics. It doesn't match albums where
AlbumTitle_Tokens
has "blue", but not "note" and Studio_Tokens
has
"note", but not "blue".
What's next
- Learn about full-text search queries.
- Learn about search indexes.