This page describes how to add search indexes. The full-text search is run against entries in the search index.
How to use search indexes
You can create a search index on any columns that you want to make
available for full-text searches. To create a search index, use the
CREATE SEARCH INDEX
DDL statement. To update an index
use the ALTER SEARCH INDEX
DDL statement. Spanner
automatically builds and maintains the search index, including adding and
updating data in the search index as soon as it changes in the database.
Search indexes partitions
A search index can be partitioned or unpartitioned, depending upon the type of queries that you want to accelerate.
An example for when a partitioned index is the best choice is when the application queries an email mailbox. Each query is restricted to a specific mailbox.
An example for when an unpartitioned query is the best choice is when there's a query across all product categories in a product catalog.
Search index use cases
Besides full-text search, Spanner search indexes support the following:
- Substring searches, which is a type of query that looks for a shorter string (the substring) within a larger body of text.
- Combining conditions on any subset of indexed data into a single index scan.
While search indexes support indexing of non-textual data, such as numbers and exact match strings, the most common use case for a search index is to index text in a document.
Search index example
To show the capabilities of search indexes, suppose that there's a table that stores information about music albums:
CREATE TABLE Albums (
AlbumId STRING(MAX) NOT NULL,
AlbumTitle STRING(MAX)
) PRIMARY KEY(AlbumId);
Spanner has several tokenize functions that create
tokens. To modify the previous table to let users run a full-text search to find
album titles, use the TOKENIZE_FULLTEXT
function to create tokens from
album titles. Then create a column that uses the TOKENLIST
data type
to hold the tokenization output from TOKENIZE_FULLTEXT
. For this example,
we create the AlbumTitle_Tokens
column.
ALTER TABLE Albums
ADD COLUMN AlbumTitle_Tokens TOKENLIST
AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN;
The following example uses the CREATE SEARCH INDEX
DDL to create a search index
(AlbumsIndex
) on the AlbumTitle
tokens (AlbumTitle_Tokens
):
CREATE SEARCH INDEX AlbumsIndex
ON Albums(AlbumTitle_Tokens);
After adding the search index, use SQL queries to find albums matching the search criteria. For example:
SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, "fifth symphony")
Data consistency
When an index is created, Spanner uses automated processes to backfill the data to ensure consistency. When writes are committed, the indexes are updated in the same transaction. Spanner automatically performs data consistency checks.
Search index schema definitions
Search indexes are defined on one or more TOKENLIST
columns of a table. Search
indexes have the following components:
- Base table: the Spanner table that needs indexing.
TOKENLIST
column: a collection of columns that define the tokens that need indexing. The order of these columns is unimportant.
For example, in the following statement, the base table is Albums. TOKENLIST
columns are created on AlbumTitle
(AlbumTitle_Tokens
) and Rating
(Rating_Tokens
).
CREATE TABLE Albums (
AlbumId STRING(MAX) NOT NULL,
SingerId INT64 NOT NULL,
ReleaseTimestamp INT64 NOT NULL,
AlbumTitle STRING(MAX),
Rating FLOAT64,
AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN,
Rating_Tokens TOKENLIST AS (TOKENIZE_NUMBER(Rating)) HIDDEN
) PRIMARY KEY(AlbumId);
Use the following CREATE SEARCH INDEX
statement to create a search index
using the tokens for AlbumTitle
and Rating
:
CREATE SEARCH INDEX AlbumsIndex
ON Albums(AlbumTitle_Tokens, Rating_Tokens)
PARTITION BY SingerId
ORDER BY ReleaseTimestamp DESC
Search indexes have the following options:
- Partitions: an optional group of columns that divide the search index. Querying a partitioned index is often significantly more efficient than querying an unpartitioned index. For more information, see Partition search indexes.
- Sort order column: an optional
INT64
column that establishes the order of retrieval from the search index. For more information, see Search index sort order. - Interleaving: like secondary indexes, you can interleave search indexes. Interleaved search indexes use fewer resources to write and join with the base table. For more information, see Interleaved search indexes.
- Options clause: a list of key value pairs that overrides the default settings of the search index.
For more information, see the CREATE SEARCH INDEX
reference.
Internal layout of search indexes
An important element of the internal representation of search indexes is a
docid, which serves as a storage-efficient representation of the primary key
of the base table which can be arbitrarily long. It's also what creates the
order for the internal data layout according to the user-provided ORDER BY
columns of the
CREATE SEARCH INDEX
clause. It's represented as one or two 64-bit integers.
Search indexes are implemented internally as a two-level mapping:
- Tokens to docids
- Docids to base table primary keys
This scheme results in significant storage savings as Spanner
doesn't need to store the full base table primary key for each
<token, document>
pair.
There are two types of physical indexes that implement the two levels of mapping:
- A secondary index that maps partition keys and a docid to the base table
primary key. In the example in the previous section, this maps
{SingerId, ReleaseTimestamp, uid}
to{SingerId, AlbumId}
. The secondary index also stores all columns specified in theSTORING
clause ofCREATE SEARCH INDEX
. - Token indexes that map tokens to docids, similar to inverted indexes in
information retrieval literature. Spanner maintains a
separate token index for each
TOKENLIST
of the search index. Logically, token indexes maintain lists of docids for each token within each partition (known in information retrieval as postings lists). The lists are ordered by tokens for fast retrieval, and within lists, docid is used for ordering. Individual token indexes are an implementation detail not exposed through Spanner APIs.
Spanner supports the following four options for docid.
Search index | Docid | Behavior |
---|---|---|
ORDER BY clause is omitted for the search index |
{uid} |
Spanner adds a hidden unique value (UID) to identify each row. |
ORDER BY column |
{column, uid} |
Spanner adds the UID column as a tiebreaker between rows with the same column values within a partition. |
ORDER BY column ... OPTIONS (disable_automatic_uid_column=true) |
{column} |
The UID column isn't added. The column values must be unique within a partition. |
ORDER BY column1, column2 ... OPTIONS (disable_automatic_uid_column=true) |
{column1, column2} |
The UID column isn't added. The combination of column1 , column2 values must be unique within a partition. |
Usage notes:
- The internal UID column isn't exposed through the Spanner API.
- In indexes where the UID isn't added, transactions that add a row with an already existing (partition,sort order) fails.
For example, consider the following data:
AlbumId | SingerId | ReleaseTimestamp | SongTitle |
---|---|---|---|
a1 | 1 | 997 | Beautiful days |
a2 | 1 | 743 | Beautiful eyes |
Assuming the presort column is in ascending order, the content of the token
index partitioned by SingerId
partitions the content of the token index in the
following way:
SingerId | _token | ReleaseTimestamp | uid |
---|---|---|---|
1 | beautiful | 743 | uid1 |
1 | beautiful | 997 | uid2 |
1 | days | 743 | uid1 |
1 | eyes | 997 | uid2 |
Search index sharding
When Spanner splits a table it distributes search index data so that all tokens in a particular base table row are in the same split. In other words, the search index is document-sharded. This sharding strategy has significant performance implications:
- The number of servers that each transaction communicates with remains
constant, regardless of the number of tokens or the number of indexed
TOKENLIST
columns. - Search queries involving multiple conditional expressions are executed independently on each split, avoiding the performance overhead associated with a distributed join.
Search indexes have two distribution modes:
- Uniform sharding (default). In uniform sharding, indexed data for each base table row is assigned randomly to an index split of a partition.
- Sort-order sharding. In sort-order sharding, data for each base table row
is assigned to an index split of a partition based on the
ORDER BY
columns. For example, in the case of a descending sort order, all the rows with the highest sort order values appear on the first index split of a partition, and the next-highest group of sort order values on the next split.
These sharding modes come with a tradeoff between hotspotting risks and the query cost:
- Sort order sharded search indexes are prone to hotspotting when the index is sorted by a timestamp. For more information, see Choose a primary key to prevent hotspots. On the other hand when write load increases across a range of documents, uniform sharding ensures that the increase is distributed uniformly across shards.
- Standard load-based splitting creates additional splits which provide adequate protection against hotspotting. The downside of uniform sharding is that it can use more resources for some types of queries.
The sharding mode of a search index is configured using the OPTIONS
clause:
CREATE SEARCH INDEX AlbumsIndex
ON Albums(AlbumTitle_Tokens, Rating_Tokens)
PARTITION BY SingerId
ORDER BY ReleaseTimestamp DESC
OPTIONS (sort_order_sharding = true);
When sort_order_sharding=false
is set or left unspecified, the search index is
created using uniform sharding.
Interleaved search indexes
Like secondary indexes, you can interleave search indexes in a parent table of the base table. The primary reason to use interleaved search indexes is to colocate base table data with index data for small partitions. This opportunistic colocation has the following advantages:
- Writes don't need to do a two-phase commit.
- Back-joins of the search index with the base table aren't distributed.
Interleaved search indexes have the following restrictions:
- Only sort-order sharded indexes can be interleaved.
- Search indexes can only be interleaved in top-level tables (not in child tables).
- Like interleaved tables and secondary indexes, make the key of the
parent table a prefix of the
PARTITION BY
columns in the interleaved search index.
Define an interleaved search index
The following example demonstrates how to define an interleaved search index:
CREATE TABLE Singers (
SingerId INT64 NOT NULL
) PRIMARY KEY(SingerId);
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId STRING(MAX) NOT NULL,
AlbumTitle STRING(MAX),
AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN
) PRIMARY KEY(SingerId, AlbumId),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
CREATE SEARCH INDEX AlbumsIndex
ON Albums(AlbumTitle_Tokens)
PARTITION BY SingerId,
INTERLEAVE IN Singers
OPTIONS (sort_order_sharding = true);
Search index sort order
Requirements for the search index sort order definition are different from secondary indexes.
For example, consider the following table:
CREATE TABLE Albums (
AlbumId STRING(MAX) NOT NULL,
ReleaseTimestamp INT64 NOT NULL,
AlbumName STRING(MAX),
AlbumName_Token TOKENLIST AS (TOKEN(AlbumName)) HIDDEN
) PRIMARY KEY(AlbumId);
The application might define a secondary index to look up information using the
AlbumName
sorted by ReleaseTimestamp
:
CREATE INDEX AlbumsSecondaryIndex ON Albums(AlbumName, ReleaseTimestamp DESC);
The equivalent search index looks as following (this uses exact-match tokenization, since secondary indexes don't support full-text searches):
CREATE SEARCH INDEX AlbumsSearchIndex
ON Albums(AlbumName_Token)
ORDER BY ReleaseTimestamp DESC;
Search index sort order must conform to the following requirements:
- Only use
INT64
columns for the sort order of a search index. Columns that have arbitrary sizes use too many resources in the search index because Spanner needs to store a docid next to every token. Specifically, the sort order column can't use theTIMESTAMP
type becauseTIMESTAMP
uses nanosecond precision which doesn't fit in a 64-bit integer. Sort order columns must not be
NULL
. There are two ways to meet this requirement:- Declare the sort order column as
NOT NULL
. - Configure the index to exclude NULL values.
- Declare the sort order column as
A timestamp is often used to determine the sort order. A common practice is to use microseconds since the Unix epoch for such timestamps.
Applications usually retrieve the newest data first using a search index that's sorted in descending order.
NULL-filtered search indexes
Search indexes can use the WHERE column IS NOT NULL
syntax to
exclude base table rows. NULL filtering can apply to partitioning keys, sort
order columns, and stored columns. NULL filtering on stored array columns isn't
allowed.
Example
CREATE SEARCH INDEX AlbumsIndex
ON Albums(AlbumTitle_Tokens)
STORING Genre
WHERE Genre IS NOT NULL
The query must specify the NULL filtering condition (Genre IS NOT NULL
for
this example) in the WHERE
clause. Otherwise, the Query Optimizer isn't able
to use the search index. For more information, see
SQL query requirements.
Use NULL filtering on a generated column to exclude rows based on any arbitrary criteria. For more information, see Create a partial index using a generated column.
What's next
- Learn about tokenization and Spanner tokenizers.
- Learn about numeric indexes.
- Learn about index partitioning.