This document describes how to perform hybrid searches in AlloyDB for PostgreSQL, which combines text search, keyword matching, vector search, and semantic similarity, using the vector
extension, which is a standard pgvector
PostgreSQL extension customized for AlloyDB. Hybrid search lets you retrieve highly relevant results by using both exact keyword matches and semantically similar content.
Run a similarity search with text and vector input
To perform a hybrid search in AlloyDB for PostgreSQL, you'll need to create both a vector index and a text search index on your table. Then, you'll combine the results from both searches and re-rank them to present the most relevant information.
Create a GIN index
A Generalized Inverted Index (GIN) index is a specialized index type optimized for searching within composite values, such as arrays, JSONB, and full-text search data.
To create a GIN index on your text data to perform a full text search, run the following:
CREATE INDEX INDEX_NAME ON TABLE USING GIN (to_tsvector('english', COLUMN_NAME));
Replace the following:
INDEX_NAME
: the name of the index you want to create —for example,my-gin-index
.TABLE
: the table to add the index to.COLUMN_NAME
: the column that stores the text data you want to search.
Create a ScaNN index
To apply a two-level tree index using the ScaNN algorithm to a column containing stored vector embeddings, run the following DDL query:
CREATE INDEX INDEX_NAME ON TABLE
USING scann (EMBEDDING_COLUMN DISTANCE_FUNCTION)
WITH (num_leaves=NUM_LEAVES_VALUE);
Replace the following:
INDEX_NAME
: the name of the index you want to create—for example,my-scann-index
. The index names are shared across your database. Ensure that each index name is unique to each table in your database.TABLE
: the table to add the index to.EMBEDDING_COLUMN
: a column that storesvector
data.DISTANCE_FUNCTION
: the distance function to use with this index. Choose one of the following:L2 distance:
l2
Dot product:
dot_product
Cosine distance:
cosine
NUM_LEAVES_VALUE
: the number of partitions to apply to this index. Set to any value between 1 to 1048576. For more information about how to decide this value, see Tune aScaNN
index.
To learn more about different ScaNN index configurations, see Create a ScaNN
index. You can also create a HNSW index.
Perform a hybrid search using reciprocal rank fusion
Hybrid search involves performing separate vector and text searches, then combining and re-ranking results using Reciprocal Rank Fusion (RRF). RRF is a rank-based algorithm that combines multiple ranked lists of search results into a single ranked list by assigning a score to each document. This score is based on RRF's reciprocal rank across all contributing lists, with higher-ranked documents receiving a greater contribution. Use the following SQL Query to combine full text search and hybrid search, and re-rank the results:
WITH vector_search AS (
SELECT id,
RANK () OVER (ORDER BY embedding <=> google_ml.embedding('MODEL_ID', 'TEXT')) AS rank
FROM TABLE
ORDER BY embedding <=> google_ml.embedding('MODEL_ID', 'TEXT') LIMIT 10
),
text_search AS (
SELECT id,
RANK () OVER (ORDER BY ts_rank(to_tsvector('english', COLUMN_NAME), to_tsquery(KEYWORD)) desc)
FROM TABLE
WHERE to_tsvector('english', COLUMN_NAME) @@ to_tsquery(KEYWORD)
ORDER BY ts_rank(to_tsvector('english', COLUMN_NAME), to_tsquery(KEYWORD)) desc
LIMIT 10
)
SELECT
COALESCE(vector_search.id, text_search.id) AS id,
COALESCE(1.0 / (60 + vector_search.rank), 0.0) + COALESCE(1.0 / (60 + text_search.rank), 0.0) AS rrf_score
FROM vector_search FULL OUTER JOIN text_search ON vector_search.id = text_search.id
ORDER BY rrf_score DESC
LIMIT 5;
Replace the following:
MODEL_ID
: the ID of the model to query.If you are using the Vertex AI Model Garden, then specify
text-embedding-005
as the model ID. These are the cloud-based models that AlloyDB can use for text embeddings. For more information, see Text embeddings.TABLE
: the table containing your data.TEXT
: the text to translate into a vector embedding.KEYWORD
: the keyword you want to search for.COLUMN_NAME
: a column that stores contains the text data you want to search.
Explanation of the Hybrid Search Query and related Common Table Expression (CTE):
vector_search
CTE: Performs a standard vector similarity search, ordering results by cosine distance and assigning a rank. It retrieves the top 10 most semantically similar products.text_search
CTE: Executes a text search usingto_tsvector
andto_tsquery
, calculating relevance withts_rank
and retrieving the top 10 most relevant text matches.Final SELECT Statement
CTE: Joins vector and text search results using aFULL OUTER JOIN
, selects the product ID, calculates the RRF score, orders by score, and retrieves the top 5 results.