使用向量搜尋搜尋嵌入
本教學課程將說明如何使用 VECTOR_SEARCH
函式 (以及選用的向量索引),針對儲存在 BigQuery 資料表中的嵌入資料執行相似度搜尋。
當您將 VECTOR_SEARCH
與向量索引搭配使用時,VECTOR_SEARCH
會使用近似最鄰近項目方法提升向量搜尋成效,但這會導致回溯率降低,因此會傳回較為近似的結果。在沒有向量索引的情況下,VECTOR_SEARCH
會使用暴力搜尋來評估每筆記錄的距離。
所需權限
如要執行本教學課程,您必須具備下列身分和存取管理 (IAM) 權限:
- 如要建立資料集,您必須具備
bigquery.datasets.create
權限。 如要建立資料表,您必須具備下列權限:
bigquery.tables.create
bigquery.tables.updateData
bigquery.jobs.create
如要建立向量索引,您需要在建立索引的資料表上具備
bigquery.tables.createIndex
權限。如要捨棄向量索引,您必須在捨棄索引的資料表上擁有
bigquery.tables.deleteIndex
權限。
以下每個預先定義的 IAM 角色都包含使用向量索引所需的權限:
- BigQuery 資料擁有者 (
roles/bigquery.dataOwner
) - BigQuery 資料編輯器 (
roles/bigquery.dataEditor
)
費用
VECTOR_SEARCH
函式會使用 BigQuery 運算定價。系統會依據以量計價或版本定價向您收取相似搜尋費用。
- 以量計價:系統會根據在基礎資料表、索引和搜尋查詢中掃描的位元組數向您收費。
版本定價:系統會根據您在預留版本中完成工作所需的時間長度向您收費。較大且複雜的相似度計算作業會產生較高的費用。
詳情請參閱 BigQuery 定價。
事前準備
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the BigQuery API.
建立資料集
建立 BigQuery 資料集
前往 Google Cloud 控制台的「BigQuery」頁面。
在「Explorer」窗格中,按一下專案名稱。
依序點選
「View actions」(查看動作) >「Create dataset」(建立資料集)。在「Create dataset」頁面上執行下列操作:
在「Dataset ID」(資料集 ID) 中輸入
vector_search
。在「位置類型」中選取「多區域」,然後選取「美國 (多個美國區域)」。
公開資料集儲存在
US
多地區中。為簡單起見,請將資料集儲存在相同的位置。保留其餘預設設定,然後點選「Create dataset」(建立資料集)。
建立測試資料表
根據 Google 專利公開資料集的子集,建立包含專利嵌入資料的
patents
資料表:CREATE TABLE vector_search.patents AS SELECT * FROM `patents-public-data.google_patents_research.publications` WHERE ARRAY_LENGTH(embedding_v1) > 0 AND publication_number NOT IN ('KR-20180122872-A') LIMIT 1000000;
建立包含專利嵌入資料的
patents2
資料表,找出下列項目的最近鄰:CREATE TABLE vector_search.patents2 AS SELECT * FROM `patents-public-data.google_patents_research.publications` WHERE publication_number = 'KR-20180122872-A';
建立向量索引
在
patents
資料表的embeddings_v1
欄上建立my_index
向量索引:CREATE OR REPLACE VECTOR INDEX my_index ON vector_search.patents(embedding_v1) STORING(publication_number, title) OPTIONS(distance_type='COSINE', index_type='IVF');
等待向量索引建立完成,然後執行下列查詢,並確認
coverage_percentage
值為100
:SELECT * FROM vector_search.INFORMATION_SCHEMA.VECTOR_INDEXES;
使用含有索引的 VECTOR_SEARCH
函式
建立並填入向量索引後,請使用 VECTOR_SEARCH
函式,在 patents2
資料表的 embedding_v1
欄中,找出嵌入項目的最鄰近項目。這個查詢會在搜尋中使用向量索引,因此 VECTOR_SEARCH
會使用近似最鄰近搜尋方法,找出嵌入項目的最近鄰。
使用含有索引的 VECTOR_SEARCH
函式:
SELECT query.publication_number AS query_publication_number, query.title AS query_title, base.publication_number AS base_publication_number, base.title AS base_title, distance FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"fraction_lists_to_search": 0.005}');
結果如下所示:
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | query_publication_number | query_title | base_publication_number | base_title | distance | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-106599080-B | A kind of rapid generation for keeping away big vast transfer figure based on GIS | 0.14471956347590609 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-114118544-A | Urban waterlogging detection method and device | 0.17472108931171348 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-20200048143-A | Method and system for mornitoring dry stream using unmanned aerial vehicle | 0.17561990745619782 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-101721695-B1 | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same | 0.17696129365559843 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-109000731-B | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642917 | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
使用 VECTOR_SEARCH
函式進行暴力破解
使用 VECTOR_SEARCH
函式,在 patents2
資料表的 embedding_v1
欄中,找出最接近的嵌入項目。這個查詢不會在搜尋中使用向量索引,因此 VECTOR_SEARCH
會找出嵌入項目的精確最近鄰。
SELECT query.publication_number AS query_publication_number, query.title AS query_title, base.publication_number AS base_publication_number, base.title AS base_title, distance FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"use_brute_force":true}');
結果如下所示:
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | query_publication_number | query_title | base_publication_number | base_title | distance | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-106599080-B | A kind of rapid generation for keeping away big vast transfer figure based on GIS | 0.1447195634759062 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-114118544-A | Urban waterlogging detection method and device | 0.1747210893117136 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-20200048143-A | Method and system for mornitoring dry stream using unmanned aerial vehicle | 0.17561990745619782 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-101721695-B1 | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same | 0.17696129365559843 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-109000731-B | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642928 | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
評估喚回率
使用索引執行向量搜尋時,系統會傳回近似結果,但會降低回憶率。您可以比較向量搜尋與索引傳回的結果,以及向量搜尋與暴力搜尋傳回的結果,藉此計算召回率。在這個資料集中,publication_number
值會用來唯一識別專利,因此會用於比較。
WITH approx_results AS ( SELECT query.publication_number AS query_publication_number, base.publication_number AS base_publication_number FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"fraction_lists_to_search": 0.005}') ), exact_results AS ( SELECT query.publication_number AS query_publication_number, base.publication_number AS base_publication_number FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"use_brute_force":true}') ) SELECT a.query_publication_number, SUM(CASE WHEN a.base_publication_number = e.base_publication_number THEN 1 ELSE 0 END) / 5 AS recall FROM exact_results e LEFT JOIN approx_results a ON e.query_publication_number = a.query_publication_number GROUP BY a.query_publication_number
如果回憶率低於您預期的值,您可以提高 fraction_lists_to_search
值,但缺點是可能會增加延遲時間和資源用量。如要調整向量搜尋,您可以嘗試使用不同引數值多次執行 VECTOR_SEARCH
,將結果儲存至表格,然後比較結果。
清除所用資源
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.