In addition to indexing text, the search index provides an efficient way to index numbers. It's primarily used to augment full-text search queries with conditions on numeric fields. This page describes how to add numeric search indexes to tables and how to tokenize arrays.
Overview
Spanner supports indexing numbers for the following query operations:
- Equality (`comparison_type=>"equality"`)
- Inequality (`comparison_type=>"all"`)
In both cases, the original number (either integer or floating point) undergoes a process of tokenization, which is conceptually similar to full-text tokenization. It produces a set of tokens that the query can then use to locate documents matching the number condition.
Equality indexing produces only one token, which represents the number. This
mode is recommended if queries only have conditions of the form `field = @p`
in the `WHERE` clause.
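As a sketch, a table with an integer column could index it for equality lookups like this (the table, column, and index names here are illustrative, not from this page):

```sql
CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Rating INT64,
  -- Equality tokenization: one token per number, so only
  -- conditions of the form Rating = @p benefit.
  Rating_Tokens TOKENLIST
    AS (TOKENIZE_NUMBER(Rating, comparison_type=>"equality")) HIDDEN
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsRatingIndex ON Albums(Rating_Tokens);
```

A query such as `SELECT AlbumId FROM Albums WHERE Rating = @p` can then be served from the search index.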
Inequality and equality indexing can accelerate a wider range of conditions in
the `WHERE` clause of the query. In addition to equality conditions, this
includes `field < @p`, `field <= @p`, `field > @p`, `field >= @p`,
`field BETWEEN @p1 AND @p2`, and `field <> @p`. To implement this type of
indexing, Spanner produces tokens in the underlying search index.
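A sketch of the same idea with range support, again using illustrative names: specifying `comparison_type=>"all"` makes the tokenizer emit the additional tokens that inequality predicates need.

```sql
CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Rating INT64,
  -- "all" produces extra tokens so that range predicates
  -- (<, <=, >, >=, BETWEEN, <>) can use the search index.
  Rating_Tokens TOKENLIST
    AS (TOKENIZE_NUMBER(Rating, comparison_type=>"all")) HIDDEN
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsRatingIndex ON Albums(Rating_Tokens);
```

With this schema, a condition such as `Rating BETWEEN @min AND @max` can be accelerated as well as plain equality.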
Spanner can produce many tokens for each indexed number,
depending on tuning parameters. The number of tokens depends on the parameters
that are set for `TOKENIZE_NUMBER`, such as `algorithm`, `min`, `max`, and
`granularity`. It's therefore important to evaluate the tuning parameters
carefully to ensure an appropriate balance between disk storage and lookup time.
Algorithms for indexing number ranges
The `TOKENIZE_NUMBER` `algorithm` argument has three options
for indexing number ranges: `logtree`, `prefixtree`, and `floatingpoint`. Each
algorithm is better at indexing certain distributions of data and query ranges:

- `logtree` is best for indexing uniformly distributed columns containing integers. It's the default algorithm.
- `prefixtree` is best when indexing exponentially distributed data and when the query predicate is of the form `@param > number` or `@param >= number` (ranges without an upper bound). Compared to `logtree`, this algorithm generates fewer index tokens for small numbers. For queries whose `WHERE` clause contains the predicate previously described, `prefixtree` generates fewer query tokens, which can improve performance.
- `floatingpoint` is best for indexing `FLOAT64` values where the indexed data and queries often contain fractions. Because the other algorithms index only whole numbers, queries covering fractional ranges are likely to retrieve more non-matching values. These non-matching values need filtering after retrieval, which negatively affects performance. Compared to `logtree` and `prefixtree`, this algorithm generates a larger number of index tokens and might generate more query tokens.
The `base` parameter controls the width of each tree bucket in the `logtree` and
`prefixtree` algorithms. Both algorithms generate tokens representing nodes in a
`base`-ary tree, where the width of a node is `base^distance_from_leaf`.
The algorithms differ in that `prefixtree`
omits some of the tree nodes in favor of greater-than tokens that accelerate
greater-than queries. The larger the selected base, the fewer index tokens are
generated. However, larger `base` values increase the maximum number of query
tokens required.
Numbers that fall outside the `[min, max]` range are all indexed into two
buckets: one for all numbers less than `min`, and the other for all numbers
greater than `max`. This can cause significant over-retrieval (retrieval of too
many numbers) when the range requested by the query also includes numbers
outside that range. For this reason, set `min` and `max` to the narrowest
possible values that encompass all input numbers. Like all tokenization
configurations, changing the `min` and `max` values requires a rebuild of the
numeric index, so leave room to grow if the final domain of a column isn't
known. Over-retrieval is not a correctness problem, because all potential
matches are checked against non-bucketized numbers at the end of the search
process; it's only a potential efficiency issue.
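For example, if a column currently holds values between 0 and 5 but its domain might grow, you could bound the tokenizer with modest headroom. The exact bounds here are illustrative:

```sql
CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Rating INT64,
  Rating_Tokens TOKENLIST AS (
    -- Values outside [min, max] share two overflow buckets, so keep the
    -- range as narrow as the column's expected domain allows, with some
    -- headroom to avoid an index rebuild if the domain grows.
    TOKENIZE_NUMBER(Rating,
                    comparison_type=>"all",
                    min=>0,
                    max=>10)
  ) HIDDEN
) PRIMARY KEY(AlbumId);
```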
The `granularity` argument controls the rate of downsampling that's applied to
numbers before they are indexed in the tree-based algorithms. Before each number
is tokenized, it's sorted into buckets with a width equal to `granularity`. All
the numbers in the same `granularity` bucket get the same tokens. This means
that over-retrieval might occur if the granularity value is set to
anything other than 1. It also means that if numeric values change by a small
amount, most of their tokens don't need to be reindexed. Using a granularity
higher than 1 also reduces the number of tokens that the algorithm needs to
generate, but the effect is less significant than the effect of increasing the
`base`. Therefore, we recommend setting `granularity` to 1.
Array tokenization
In addition to scalar values, `TOKENIZE_NUMBER` supports tokenization of an
array of numbers.

When `TOKENIZE_NUMBER` is used with an `ARRAY` column, you must
specify `comparison_type=>"equality"`. Range queries aren't supported with an
array of numbers.
For example, consider the following schema:
```sql
CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Ratings ARRAY<INT64>,
  Ratings_Tokens TOKENLIST
    AS (TOKENIZE_NUMBER(Ratings, comparison_type=>"equality")) HIDDEN
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(Ratings_Tokens);
```
The following query finds all albums that have a rating of 1 or 2:
```sql
SELECT AlbumId
FROM Albums
WHERE ARRAY_CONTAINS_ANY(Ratings, [1, 2])
```
The following query finds all albums that were rated as 1 and as 5:
```sql
SELECT AlbumId
FROM Albums
WHERE ARRAY_CONTAINS_ALL(Ratings, [1, 5])
```
What's next
- Learn about tokenization and Spanner tokenizers.
- Learn about search indexes.
- Learn about index partitioning.