Search embeddings with vector search

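This tutorial shows how to use the VECTOR_SEARCH function, optionally with a vector index, to run a similarity search over embeddings stored in a BigQuery table.

When you use VECTOR_SEARCH with a vector index, VECTOR_SEARCH uses the approximate nearest neighbor method to improve search performance. The trade-off is lower recall, so the results are more approximate. Without a vector index, VECTOR_SEARCH uses brute force search, measuring the distance to every record.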
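
To make the brute force case concrete, the following is a minimal Python sketch of an exact search: it scores every base row against the query by cosine distance and keeps the k closest. It is an illustration only, not how BigQuery implements VECTOR_SEARCH; the names cosine_distance and brute_force_top_k and the toy rows are hypothetical, and the sketch assumes nonzero vectors.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (assumes nonzero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_top_k(query, base, k):
    """Score every base row against the query and keep the k closest."""
    scored = [(cosine_distance(query, vec), row_id) for row_id, vec in base.items()]
    scored.sort()  # smallest distance first
    return [(row_id, dist) for dist, row_id in scored[:k]]

# Hypothetical toy data: row ID -> embedding.
base = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.0, 1.0],
    "doc-c": [1.0, 1.0],
}
top = brute_force_top_k([1.0, 0.1], base, k=2)
print(top)  # doc-a first, then doc-c
```

Because every row is scored, the result is exact, and the work grows linearly with the number of rows. That cost is what a vector index is meant to reduce.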

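Required permissions

To run this tutorial, you need the following Identity and Access Management (IAM) permissions:

  • To create a dataset, you need the bigquery.datasets.create permission.
  • To create a table, you need the following permissions:

    • bigquery.tables.create
    • bigquery.tables.updateData
    • bigquery.jobs.create
  • To create a vector index, you need the bigquery.tables.createIndex permission on the table where you're creating the index.

  • To drop a vector index, you need the bigquery.tables.deleteIndex permission on the table where you're dropping the index.

Each of the following predefined IAM roles includes the permissions that you need to work with vector indexes:

  • BigQuery Data Owner (roles/bigquery.dataOwner)
  • BigQuery Data Editor (roles/bigquery.dataEditor)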

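Costs

The VECTOR_SEARCH function uses BigQuery compute pricing. You are charged for similarity search under either on-demand or editions pricing.

  • On-demand pricing: You are charged for the number of bytes scanned in the base table, the index, and the search query.
  • Editions pricing: You are charged for the slots required to complete the job within your reservation edition. Larger, more complex similarity calculations incur higher charges.

For more information, see BigQuery pricing.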

Before you begin

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  2. Make sure that billing is enabled for your Google Cloud project.

  3. Enable the BigQuery API.

    Enable the API

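Create a dataset

Create a BigQuery dataset to store the objects used in this tutorial:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter vector_search.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

      The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.

    • Leave the remaining default settings as they are, and click Create dataset.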

Create test tables

  1. Create the patents table, which contains patent embeddings, from a subset of the Google Patents public dataset:

    CREATE TABLE vector_search.patents AS
    SELECT * FROM `patents-public-data.google_patents_research.publications`
    WHERE ARRAY_LENGTH(embedding_v1) > 0
     AND publication_number NOT IN ('KR-20180122872-A')
    LIMIT 1000000;
  2. Create the patents2 table, which contains the single patent embedding whose nearest neighbors you want to find:

    CREATE TABLE vector_search.patents2 AS
    SELECT * FROM `patents-public-data.google_patents_research.publications`
    WHERE publication_number = 'KR-20180122872-A';

Create a vector index

  1. Create the my_index vector index on the embedding_v1 column of the patents table:

    CREATE OR REPLACE VECTOR INDEX my_index ON vector_search.patents(embedding_v1)
    STORING(publication_number, title)
    OPTIONS(distance_type='COSINE', index_type='IVF');
  2. Wait for the vector index to be created, then run the following query and confirm that the coverage_percentage value is 100:

    SELECT * FROM vector_search.INFORMATION_SCHEMA.VECTOR_INDEXES;
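
As background for the index_type='IVF' option, the following toy Python sketch shows the general idea of an inverted file (IVF) index: base vectors are grouped into lists around centroids, and a query scans only the lists whose centroids are closest to it. Everything here is an assumption made for illustration: the function names, the fixed centroids (a real index would typically learn them by clustering), and the way the fraction maps to a number of lists. It does not describe BigQuery's internal implementation.

```python
import math

def unit(deg):
    """2-D unit vector at the given angle in degrees."""
    return [math.cos(math.radians(deg)), math.sin(math.radians(deg))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def build_lists(base, centroids):
    """Assign each base row to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for row_id, vec in base.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: cosine_distance(vec, centroids[i]))
        lists[nearest].append(row_id)
    return lists

def ivf_search(query, base, centroids, lists, k, fraction_lists_to_search):
    """Scan only the closest lists, then rank the candidates found in them."""
    # Assumption: the fraction is rounded up to a whole number of lists.
    n_probe = max(1, math.ceil(fraction_lists_to_search * len(centroids)))
    order = sorted(range(len(centroids)),
                   key=lambda i: cosine_distance(query, centroids[i]))
    candidates = [row_id for i in order[:n_probe] for row_id in lists[i]]
    candidates.sort(key=lambda row_id: cosine_distance(query, base[row_id]))
    return candidates[:k]

# Hypothetical toy data: four rows spread between two centroids.
centroids = [unit(0), unit(90)]
base = {"a": unit(10), "b": unit(40), "c": unit(50), "d": unit(80)}
lists = build_lists(base, centroids)   # [["a", "b"], ["c", "d"]]
query = unit(44)

print(ivf_search(query, base, centroids, lists, 2, 0.5))  # ['b', 'a']
print(ivf_search(query, base, centroids, lists, 2, 1.0))  # ['b', 'c']
```

With half of the lists scanned, the search misses row c, which sits just across the list boundary even though it is the second-closest row overall. Scanning all lists recovers it at the cost of more work, which is the trade-off between speed and recall that the index introduces.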

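Use the VECTOR_SEARCH function with an index

After the vector index is created and populated, use the VECTOR_SEARCH function to find the nearest neighbors of the embedding in the embedding_v1 column of the patents2 table. This query uses the vector index, so VECTOR_SEARCH uses an approximate nearest neighbor method to find them.

Run VECTOR_SEARCH with the index: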

SELECT query.publication_number AS query_publication_number,
  query.title AS query_title,
  base.publication_number AS base_publication_number,
  base.title AS base_title,
  distance
FROM
  VECTOR_SEARCH(
    TABLE vector_search.patents,
    'embedding_v1',
    TABLE vector_search.patents2,
    top_k => 5,
    distance_type => 'COSINE',
    options => '{"fraction_lists_to_search": 0.005}');

The results look similar to the following:

+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| query_publication_number |                         query_title                         | base_publication_number |                                                        base_title                                                        |      distance       |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-106599080-B          | A kind of rapid generation for keeping away big vast transfer figure based on GIS                                        | 0.14471956347590609 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-114118544-A          | Urban waterlogging detection method and device                                                                           | 0.17472108931171348 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-20200048143-A        | Method and system for mornitoring dry stream using unmanned aerial vehicle                                               | 0.17561990745619782 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-101721695-B1         | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same         | 0.17696129365559843 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-109000731-B          | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642917 |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+

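Use the VECTOR_SEARCH function with brute force

Use the VECTOR_SEARCH function to find the nearest neighbors of the embedding in the embedding_v1 column of the patents2 table. This query doesn't use the vector index, so VECTOR_SEARCH finds the exact nearest neighbors.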

SELECT query.publication_number AS query_publication_number,
  query.title AS query_title,
  base.publication_number AS base_publication_number,
  base.title AS base_title,
  distance
FROM
  VECTOR_SEARCH(
    TABLE vector_search.patents,
    'embedding_v1',
    TABLE vector_search.patents2,
    top_k => 5,
    distance_type => 'COSINE',
    options => '{"use_brute_force":true}');

The results look similar to the following:

+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| query_publication_number |                         query_title                         | base_publication_number |                                                        base_title                                                        |      distance       |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-106599080-B          | A kind of rapid generation for keeping away big vast transfer figure based on GIS                                        |  0.1447195634759062 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-114118544-A          | Urban waterlogging detection method and device                                                                           |  0.1747210893117136 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-20200048143-A        | Method and system for mornitoring dry stream using unmanned aerial vehicle                                               | 0.17561990745619782 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-101721695-B1         | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same         | 0.17696129365559843 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-109000731-B          | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642928 |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+

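Evaluate recall

A vector search that uses an index returns approximate results, at the cost of lower recall. To compute recall, compare the results of the indexed vector search with the results of the brute-force vector search. In this dataset, the publication_number value uniquely identifies a patent, so it is used for the comparison.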

WITH approx_results AS (
  SELECT query.publication_number AS query_publication_number,
    base.publication_number AS base_publication_number
  FROM
    VECTOR_SEARCH(
      TABLE vector_search.patents,
      'embedding_v1',
      TABLE vector_search.patents2,
      top_k => 5,
      distance_type => 'COSINE',
      options => '{"fraction_lists_to_search": 0.005}')
),
  exact_results AS (
  SELECT query.publication_number AS query_publication_number,
    base.publication_number AS base_publication_number
  FROM
    VECTOR_SEARCH(
      TABLE vector_search.patents,
      'embedding_v1',
      TABLE vector_search.patents2,
      top_k => 5,
      distance_type => 'COSINE',
      options => '{"use_brute_force":true}')
)

SELECT
  e.query_publication_number,
  -- Count exact neighbors that the approximate search also returned; 5 matches top_k.
  SUM(CASE WHEN a.base_publication_number = e.base_publication_number THEN 1 ELSE 0 END) / 5 AS recall
FROM exact_results e LEFT JOIN approx_results a
  ON e.query_publication_number = a.query_publication_number
GROUP BY e.query_publication_number;

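If recall is lower than you'd like, you can increase the fraction_lists_to_search value, at the cost of potentially higher latency and resource usage. To tune your vector search, you can run VECTOR_SEARCH several times with different argument values, save the results to tables, and then compare them.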
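
The recall calculation itself is simple enough to sketch outside SQL. The following minimal Python example computes recall as the share of exact neighbors that also appear in the approximate results. The function name recall_at_k is hypothetical. The exact list reuses the publication numbers from the result tables above, and the second approximate list contains a made-up placeholder ID to show what a miss looks like.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of exact neighbors that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)

# Publication numbers returned by the brute-force query in this tutorial.
exact = ["CN-106599080-B", "CN-114118544-A", "KR-20200048143-A",
         "KR-101721695-B1", "CN-109000731-B"]

# The indexed query returned the same five patents, so recall is 1.0.
print(recall_at_k(exact, exact))

# Hypothetical run in which one true neighbor is replaced by another patent.
approx_with_miss = exact[:4] + ["XX-0000000-A"]  # placeholder ID, not real data
print(recall_at_k(approx_with_miss, exact))  # 0.8
```

When you compare several runs with different fraction_lists_to_search values, a per-run number like this makes it easier to see where additional lists stop improving the results.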

Clean up

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.