Numeric indexes

In addition to indexing text, the search index provides an efficient way to index numbers. It's primarily used to augment full-text search queries with conditions on numeric fields. This page describes how to add numeric search indexes to tables and how to tokenize arrays.

Overview

Spanner supports indexing numbers for the following query operations:

  • Equality (comparison_type=>"equality")
  • Inequality and equality (comparison_type=>"all")

In both cases, the original number (either integer or floating point) undergoes a process of tokenization, which is conceptually similar to full-text tokenization. It produces a set of tokens that the query can then use to locate documents matching the number condition.

Equality indexing produces only one token, which represents the number. This mode is recommended if queries only have conditions of the form field = @p in the WHERE clause.
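
For example, a schema along the following lines indexes a scalar INT64 column for equality lookups. This is a sketch only; the Products table and its columns are hypothetical, but the pattern mirrors the Albums example later on this page:

CREATE TABLE Products (
  ProductId STRING(MAX) NOT NULL,
  Price INT64,
  Price_Tokens TOKENLIST
    AS (TOKENIZE_NUMBER(Price, comparison_type=>"equality")) HIDDEN
) PRIMARY KEY(ProductId);

CREATE SEARCH INDEX ProductsIndex ON Products(Price_Tokens);

A query can then use an ordinary equality condition:

SELECT ProductId
FROM Products
WHERE Price = @p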

Inequality and equality indexing can accelerate a wider range of conditions in the WHERE clause of the query: field < @p, field <= @p, field > @p, field >= @p, field BETWEEN @p1 AND @p2, and field <> @p, in addition to equality conditions. To implement this type of indexing, Spanner produces tokens in the underlying search index, and it can produce many tokens for each indexed number depending on the parameters set for TOKENIZE_NUMBER, such as algorithm, min, max, and granularity. It's therefore important to evaluate the tuning parameters carefully to ensure an appropriate balance between disk storage and lookup time.
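
Building on the same hypothetical Products table, changing comparison_type to "all" makes the token column usable for range predicates as well:

Price_Tokens TOKENLIST
  AS (TOKENIZE_NUMBER(Price, comparison_type=>"all")) HIDDEN

SELECT ProductId
FROM Products
WHERE Price BETWEEN @p1 AND @p2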

Algorithms for index number ranges

The TOKENIZE_NUMBER algorithm argument has three options to index number ranges: logtree, prefixtree, and floatingpoint (a schema sketch showing how to select an algorithm follows this list). Each algorithm is better at indexing certain distributions of data and query ranges:

  • logtree is best for indexing uniformly distributed data. It's the default algorithm.
  • prefixtree is best for indexing exponentially distributed data and when the query predicate is of the form "@param > number" or "@param >= number" (ranges without an upper bound). Compared to logtree, this algorithm generates fewer index tokens for small numbers. For queries whose WHERE clause contains a predicate of this form, prefixtree also generates fewer query tokens, which can improve performance.
  • floatingpoint is best for indexing FLOAT64 values where the indexed data and queries often contain fractions. When tokenizing FLOAT64 using logtree or prefixtree, TOKENIZE_NUMBER might lose precision as the count of granularity buckets in the min to max range approaches the maximum resolution of double-precision floating point numbers, approximately 2^52 buckets. This can make queries less efficient, but it doesn't cause incorrect behavior. This loss of precision doesn't happen with the floatingpoint algorithm if the precision argument is set high enough. However, the floatingpoint algorithm generates more index tokens when precision is set to a larger value.
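
As a sketch of how the algorithm argument is supplied (the Events table and its columns are hypothetical, and logtree is already the default, so it only needs to be named when choosing a different algorithm):

CREATE TABLE Events (
  EventId STRING(MAX) NOT NULL,
  DurationMillis INT64,
  DurationMillis_Tokens TOKENLIST
    AS (TOKENIZE_NUMBER(DurationMillis, comparison_type=>"all",
                        algorithm=>"prefixtree")) HIDDEN
) PRIMARY KEY(EventId);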

The base parameter controls the width of each tree bucket in the logtree and prefixtree algorithms. Both algorithms generate tokens representing nodes in a base-ary tree where the width of a node is base^(distance_from_leaf). The algorithms differ in that prefixtree omits some of the tree nodes in favor of greater-than tokens that accelerate greater-than queries. The larger the base selected, the fewer index tokens are generated. However, larger base values increase the maximum number of query tokens required.
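
To make the trade-off concrete with the formula above: with base 2, a tree node three levels above the leaves covers 2^3 = 8 granularity buckets, while with base 10 a node at the same height covers 10^3 = 1,000 buckets, so far fewer nodes (and therefore fewer index tokens) are needed to span the same range, at the cost of more query tokens per range lookup.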

Numbers that fall outside the [min, max] range are all indexed into two buckets: one for all numbers less than min, and the other for all numbers greater than max. This can cause significant over-retrieval (retrieval of too many candidate rows) when the range requested by the query also includes numbers outside of [min, max]. For this reason, set min and max to the narrowest possible values that encompass all input numbers. Like all tokenization configurations, changing the min and max values requires a rebuild of the numeric index, so leave room to grow if the final domain of a column isn't known. Over-retrieval is not a correctness problem, because all potential matches are checked against the non-bucketized numbers at the end of the search process; it's only a potential efficiency issue.

The granularity argument controls the rate of downsampling that's applied to numbers before they are indexed in the tree-based algorithms. Before each number is tokenized, it's sorted into buckets with a width equal to granularity. All the numbers in the same granularity bucket get the same tokens. This means that over-retrieval might occur if the granularity value is set to anything other than 1 for integral numbers; over-retrieval is always possible for FLOAT64 numbers. It also means that if numeric values change by a small amount, most of their tokens don't need to be reindexed. Using a granularity higher than 1 also reduces the number of tokens that the algorithm needs to generate, but the effect is less significant than the effect of increasing the base. Therefore, we recommend setting granularity to 1.
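
Putting these tuning parameters together, a schema sketch might look like the following. The Sensors table is hypothetical, the min and max values are illustrative only, and the named-argument syntax for min, max, and granularity is assumed to follow the same pattern as comparison_type:

CREATE TABLE Sensors (
  SensorId STRING(MAX) NOT NULL,
  Reading INT64,
  Reading_Tokens TOKENLIST
    AS (TOKENIZE_NUMBER(Reading, comparison_type=>"all",
                        min=>0, max=>1000000, granularity=>1)) HIDDEN
) PRIMARY KEY(SensorId);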

Array tokenization

In addition to scalar values, TOKENIZE_NUMBER supports tokenization of an array of numbers.

When TOKENIZE_NUMBER is used with an ARRAY column, you must specify comparison_type=>"equality". Range queries aren't supported with arrays of numbers.

For example, consider the following schema:

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Ratings ARRAY<INT64>,
  Ratings_Tokens TOKENLIST
    AS (TOKENIZE_NUMBER(Ratings, comparison_type=>"equality")) HIDDEN
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(Ratings_Tokens);

The following query finds all albums that have a rating of 1 or 2:

SELECT AlbumId
FROM Albums
WHERE ARRAY_INCLUDES_ANY(Ratings, [1, 2])

The following query finds all albums that were rated as 1 and as 5:

SELECT AlbumId
FROM Albums
WHERE ARRAY_INCLUDES_ALL(Ratings, [1, 5])
