This page describes how to create a vector index and query vector embeddings by using the approximate cosine distance, approximate Euclidean distance, and approximate dot product vector functions. You can use these functions to find approximate nearest neighbors (ANN) in Spanner. When a dataset is small, you can use K-nearest neighbors (KNN) to find the exact k-nearest vectors. However, as your dataset grows, the latency and cost of a KNN search also increase. You can use ANN to find the approximate k-nearest neighbors with significantly reduced latency and cost.

## Approximate k-nearest neighbors

In an ANN search, the k returned vectors aren't guaranteed to be the true top
k nearest neighbors. Occasionally, a few vectors that aren't among the top
k nearest neighbors are returned. This is known as *recall loss*. How much
recall loss is acceptable depends on your use case, but in most cases, losing a
bit of recall in return for improved database performance is an acceptable
tradeoff.

For more details about the Spanner approximate distance functions, see:

- `APPROX_COSINE_DISTANCE` in GoogleSQL
- `APPROX_EUCLIDEAN_DISTANCE` in GoogleSQL
- `APPROX_DOT_PRODUCT` in GoogleSQL

## Vector index

Spanner accelerates ANN vector searches by using a specialized vector index. This index leverages Google Research's Scalable Nearest Neighbor (ScaNN), a highly efficient nearest neighbor algorithm.

The vector index uses a tree-based structure to partition data and facilitate faster searches. Spanner offers both two-level and three-level tree configurations:

- Two-level tree configuration: Leaf nodes (`num_leaves`) contain groups of
  closely related vectors along with their corresponding centroid. The root
  level consists of the centroids from all leaf nodes.
- Three-level tree configuration: Similar in concept to a two-level tree, while
  introducing an additional branch layer (`num_branches`), in which leaf node
  (`num_leaves`) centroids are further partitioned to form the root level.

In addition, you must build your vector index with a specific distance metric.
You can choose the distance metric most appropriate for your use case by setting
the `distance_type` to one of `COSINE`, `DOT_PRODUCT`, or `EUCLIDEAN`.
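As a reminder of what each metric computes, here is a plain-Python sketch of the three distance measures (for illustration only; this is not how Spanner implements them):

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 2 means opposite."""
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return 1.0 - dot_product(a, b) / norm
```

A good rule of thumb: use `COSINE` when only the direction of the embeddings matters, and `DOT_PRODUCT` or `EUCLIDEAN` when the embedding model also encodes meaning in vector magnitude.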

For more information, see `VECTOR INDEX` statements.

### Limitations

Spanner vector indexes have the following limitations:

- `ALTER VECTOR INDEX` is not supported.

### Create vector index

To best optimize the vector index for good recall and performance, we recommend that you create your vector index after most of the rows with embeddings are written to your database. You might also need to periodically rebuild the vector index after you insert new data. For more information, see Rebuild the vector index.

To create a vector index with a two-level tree and 1000 leaf nodes on a
`Documents` table with an embedding column `DocEmbedding`, using the cosine
distance:

```
CREATE VECTOR INDEX DocEmbeddingIndex
ON Documents(DocEmbedding)
OPTIONS (distance_type = 'COSINE', tree_depth = 2, num_leaves = 1000);
```

To create a vector index with a three-level tree and 1,000,000 leaf nodes:

```
CREATE VECTOR INDEX DocEmbeddingIndex
ON Documents(DocEmbedding)
OPTIONS (distance_type = 'COSINE', tree_depth = 3, num_branches = 1000, num_leaves = 1000000);
```

If your embedding column is nullable, you must declare it with a
`WHERE column_name IS NOT NULL` clause:

```
CREATE VECTOR INDEX DocEmbeddingIndex
ON Documents(NullableDocEmbedding)
WHERE NullableDocEmbedding IS NOT NULL
OPTIONS (distance_type = 'COSINE', tree_depth = 2, num_leaves = 1000);
```

### Query vector embeddings

To query a vector index, use one of the three approximate distance functions:

- `APPROX_COSINE_DISTANCE`
- `APPROX_EUCLIDEAN_DISTANCE`
- `APPROX_DOT_PRODUCT`

Restrictions when using the approximate distance functions include the following:

- You must provide a query hint to use the vector index.
- You must use a constant expression as one argument of the distance function (for example, a parameter or a literal).
- The query or subquery in which the approximate distance function is used must
  take a specific form: the distance function must be the only `ORDER BY` key,
  and a limit must be specified.

For a detailed list of limitations, see the approximate distance function reference page.

#### Example

To search for the nearest 100 vectors to `[1.0, 2.0, 3.0]`:

```
SELECT DocId
FROM Documents@{FORCE_INDEX=DocEmbeddingIndex}
ORDER BY APPROX_EUCLIDEAN_DISTANCE(
ARRAY<FLOAT32>[1.0, 2.0, 3.0], DocEmbedding,
options => JSON '{"num_leaves_to_search": 10}')
LIMIT 100
```

If the embedding column is nullable:

```
SELECT DocId
FROM Documents@{FORCE_INDEX=DocEmbeddingIndex}
WHERE NullableDocEmbedding IS NOT NULL
ORDER BY APPROX_EUCLIDEAN_DISTANCE(
ARRAY<FLOAT32>[1.0, 2.0, 3.0], NullableDocEmbedding,
options => JSON '{"num_leaves_to_search": 10}')
LIMIT 100
```

## Best practices

Follow these best practices to optimize your vector indexes and improve query results.

### Tune the vector search options

The optimal vector search values depend on the use case, the vector dataset, and the query vectors. You might need to tune iteratively to find the best values for your specific workload.

Here are some helpful guidelines to follow when picking appropriate values:

- `tree_depth` (tree level): If the table being indexed has fewer than 10
  million rows, use a `tree_depth` of `2`. Otherwise, a `tree_depth` of `3`
  supports tables of up to about 10 billion rows.
- `num_leaves`: Use the square root of the number of rows in the dataset. A
  larger value can increase vector index build time. Avoid setting `num_leaves`
  larger than `table_row_count/1000`, as this results in overly small leaves
  and poor performance.
- `num_leaves_to_search`: This option specifies how many leaf nodes of the
  index are searched. Increasing `num_leaves_to_search` improves recall but
  also increases latency and cost. We recommend using a number that is 1% of
  the total number of leaves defined in the `CREATE VECTOR INDEX` statement as
  the value for `num_leaves_to_search`. If you're using a filter clause,
  increase this value to widen the search.
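The rules of thumb above can be sketched as a small sizing helper. The function name is illustrative, and the formulas are only the starting-point heuristics from this section, not a Spanner API:

```python
import math

def suggest_index_options(table_row_count: int) -> dict:
    """Suggest starting vector index options from the rules of thumb above."""
    # Two-level trees for tables under 10 million rows; three-level trees
    # scale to roughly 10 billion rows.
    tree_depth = 2 if table_row_count < 10_000_000 else 3
    # num_leaves: square root of the row count, capped at table_row_count/1000
    # to avoid overly small leaves.
    num_leaves = min(int(math.sqrt(table_row_count)), table_row_count // 1000)
    # num_leaves_to_search: about 1% of the total leaves.
    num_leaves_to_search = max(1, num_leaves // 100)
    return {
        "tree_depth": tree_depth,
        "num_leaves": num_leaves,
        "num_leaves_to_search": num_leaves_to_search,
    }
```

For example, a 1-million-row table would start from a two-level tree with `num_leaves = 1000` and `num_leaves_to_search = 10`; treat these as a baseline for iterative tuning rather than final values.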

If acceptable recall is achieved, but the cost of querying is too high,
resulting in low maximum QPS, try increasing `num_leaves` by following these
steps:

- Set `num_leaves` to some multiple k of its original value (for example,
  `2 * sqrt(table_row_count)`).
- Set `num_leaves_to_search` to the same multiple k of its original value.
- Experiment with reducing `num_leaves_to_search` to improve cost and QPS
  while maintaining recall.

### Improve recall

There are several possible causes of worsening recall, including the following:

- `num_leaves_to_search` is too small: The nearest neighbors of some query
  vectors are harder to find than others, so increasing `num_leaves_to_search`
  to search more leaves can help improve recall. Recent queries might have
  shifted to contain more of these challenging vectors.
- Vector index needs rebuilding: The tree structure of the vector index is
  optimized for the dataset at the time of creation, and is static thereafter.
  Therefore, if significantly different vectors are added after creating the
  initial vector index, then the tree structure might be suboptimal, leading
  to poorer recall.
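To detect either problem, it helps to measure recall directly. One way (a sketch, not a Spanner feature) is to compare the IDs returned by an ANN query against the exact results of a KNN query over a sample of query vectors:

```python
def recall_at_k(ann_ids: list, exact_ids: list) -> float:
    """Fraction of the true k nearest neighbors that the ANN search returned.

    ann_ids: IDs returned by the approximate (vector index) query.
    exact_ids: IDs returned by an exact KNN query for the same query vector.
    """
    if not exact_ids:
        raise ValueError("exact_ids must be non-empty")
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)
```

For example, if an ANN query returns 95 of the true top-100 neighbors, recall is 0.95. Averaging this over a representative sample of recent query vectors shows whether tuning `num_leaves_to_search` or rebuilding the index is warranted.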

### Rebuild the vector index

To rebuild your vector index without downtime:

- Create a new vector index on the same embedding column as the current vector
  index, updating parameters (for example, `OPTIONS`) as appropriate.
- After the index creation completes, change the `FORCE_INDEX` hint to point at
  the new index to update the vector search query. This ensures that the query
  uses the new vector index. (You might also need to retune
  `num_leaves_to_search` in your new query.)
- Drop the outdated vector index.
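As a sketch, a rebuild might look like the following (the index name and option values are illustrative; see the `VECTOR INDEX` statements reference for exact DDL syntax):

```
CREATE VECTOR INDEX DocEmbeddingIndexV2
ON Documents(DocEmbedding)
OPTIONS (distance_type = 'COSINE', tree_depth = 2, num_leaves = 2000);

-- After creation completes, update queries to use
-- @{FORCE_INDEX=DocEmbeddingIndexV2}, then drop the old index:
DROP VECTOR INDEX DocEmbeddingIndex;
```

Because the new index is built before any query changes, and the old index is dropped only after all queries point at the new one, searches continue to work throughout the rebuild.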

## What's next

- Learn more about the GoogleSQL `APPROX_COSINE_DISTANCE()`,
  `APPROX_EUCLIDEAN_DISTANCE()`, and `APPROX_DOT_PRODUCT()` functions.
- Learn more about the GoogleSQL `VECTOR INDEX` statements.