Find approximate nearest neighbors (ANN) and query vector embeddings

This page describes how to find approximate nearest neighbors (ANN) and query vector embeddings using the ANN distance functions.

When a dataset is small, you can use K-nearest neighbors (KNN) to find the exact k-nearest vectors. However, as your dataset grows, the latency and cost of a KNN search also increase. You can use ANN to find the approximate k-nearest neighbors with significantly reduced latency and cost.

In an ANN search, the k returned vectors might not be the true top k nearest neighbors, because the search calculates approximate distances and might not examine every vector in the dataset. Occasionally, a few vectors that aren't among the top k nearest neighbors are returned. This is known as recall loss. How much recall loss is acceptable depends on the use case, but in most cases, trading a small amount of recall for improved database performance is an acceptable tradeoff.
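As a rough illustration of recall loss, the following Python sketch compares a brute-force exact KNN result against a hypothetical ANN result and computes the fraction of true neighbors that were returned. It is not Spanner code: the helper names (`exact_knn`, `recall_at_k`), the sample vectors, and the `approx` result are all made up for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, vectors, k):
    """Brute-force KNN: rank every vector by distance and keep the k closest ids."""
    ranked = sorted(vectors, key=lambda vid: euclidean(query, vectors[vid]))
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true k nearest neighbors that the approximate result contains."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Hypothetical sample data, keyed by a made-up document id.
vectors = {
    "a": [1.0, 2.0, 3.0],
    "b": [1.1, 2.0, 3.0],
    "c": [0.0, 0.0, 0.0],
    "d": [1.0, 2.5, 3.0],
    "e": [9.0, 9.0, 9.0],
}
query = [1.0, 2.0, 3.0]

exact = exact_knn(query, vectors, k=3)
# Hypothetical ANN result that returned "c" instead of the true neighbor "d".
approx = ["a", "b", "c"]
recall = recall_at_k(approx, exact)
print(exact, round(recall, 3))
```

In this made-up case, two of the three true neighbors are returned, so recall is about 0.67. Measuring recall this way against a sample of real queries is one way to judge whether a given level of recall loss is acceptable for your use case.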

For more details about the approximate distance functions supported in Spanner, see the GoogleSQL reference pages for those functions.

Query vector embeddings

Spanner accelerates approximate nearest neighbor (ANN) vector searches by using a vector index. To query vector embeddings, you must first create a vector index. You can then use any one of the three approximate distance functions to find the approximate nearest neighbors.

Restrictions when using the approximate distance functions include the following:

  • The approximate distance function must calculate the distance between an embedding column and a constant expression (for example, a parameter or a literal).
  • The approximate distance function output must be used in an ORDER BY clause as the sole sort key, and a LIMIT must be specified after the ORDER BY.
  • The query must explicitly filter out rows that aren't indexed. In most cases, this means that the query must include a WHERE <column_name> IS NOT NULL clause that matches the vector index definition, unless the column is already marked as NOT NULL in the table definition.

For a detailed list of limitations, see the approximate distance function reference page.

Examples

To search for the approximate 100 nearest vectors to [1.0, 2.0, 3.0]:

SELECT DocId
FROM Documents
WHERE WordCount > 1000
ORDER BY APPROX_EUCLIDEAN_DISTANCE(
  ARRAY<FLOAT32>[1.0, 2.0, 3.0], DocEmbedding,
  options => JSON '{"num_leaves_to_search": 10}')
LIMIT 100

If the embedding column is nullable, filter out the NULL values explicitly:

SELECT DocId
FROM Documents
WHERE NullableDocEmbedding IS NOT NULL AND WordCount > 1000
ORDER BY APPROX_EUCLIDEAN_DISTANCE(
  ARRAY<FLOAT32>[1.0, 2.0, 3.0], NullableDocEmbedding,
  options => JSON '{"num_leaves_to_search": 10}')
LIMIT 100
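Both queries pass a num_leaves_to_search option. The toy Python sketch below shows why an option like this can trade recall for speed, assuming (as the name suggests) that it controls how many partitions, or leaves, of the index are scanned. This is only a conceptual model with made-up data and a hypothetical `toy_ann_search` helper. It does not reflect how Spanner implements its vector index.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up vectors, keyed by a hypothetical id.
vectors = {
    "a": [0.0, 0.0], "b": [0.0, 1.0],
    "c": [4.5, 5.0], "d": [6.0, 5.0],
    "e": [10.0, 10.0], "f": [10.0, 11.0],
}

# Toy index: each leaf stores a centroid and the ids assigned to it.
leaves = [
    {"centroid": [0.0, 0.5], "ids": ["a", "b"]},
    {"centroid": [5.25, 5.0], "ids": ["c", "d"]},
    {"centroid": [10.0, 10.5], "ids": ["e", "f"]},
]

def toy_ann_search(query, k, num_leaves_to_search):
    """Scan only the leaves whose centroids are closest to the query."""
    probed = sorted(leaves, key=lambda leaf: dist(query, leaf["centroid"]))
    probed = probed[:num_leaves_to_search]
    candidates = [vid for leaf in probed for vid in leaf["ids"]]
    return sorted(candidates, key=lambda vid: dist(query, vectors[vid]))[:k]

query = [2.5, 3.0]
one_leaf = toy_ann_search(query, k=2, num_leaves_to_search=1)
two_leaves = toy_ann_search(query, k=2, num_leaves_to_search=2)
print(one_leaf, two_leaves)
```

In this toy model, scanning one leaf misses the true second-nearest vector "b" because it lives in a neighboring leaf, while scanning two leaves finds it at the cost of examining twice as many candidates.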

What's next