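Filter custom search for structured or unstructured data

If you have a search app that uses structured data, or unstructured data with metadata, you can use that metadata to filter your search queries. This page explains how to use metadata fields to restrict a search to a particular set of documents.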

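Before you begin

Make sure that you have created an app and ingested structured data or unstructured data with metadata. For more information, see "Create a search app".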

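Example metadata

Review the following example metadata for four PDF files (document_1.pdf, document_2.pdf, document_3.pdf, and document_4.pdf). This metadata is stored in a JSON file in a Cloud Storage bucket alongside the PDF files. You can refer to this example as you read this page.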

{"id": "1", "structData": {"title": "Policy on accepting corrected claims", "category": ["persona_A"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_1.pdf"}}
{"id": "2", "structData": {"title": "Claims documentation and reporting guidelines for commercial members", "category": ["persona_A", "persona_B"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_2.pdf"}}
{"id": "3", "structData": {"title": "Claims guidelines for bundled services and supplies for commercial members", "category": ["persona_B", "persona_C"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_3.pdf"}}
{"id": "4", "structData": {"title": "Advantage claims submission guidelines", "category": ["persona_A", "persona_C"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_4.pdf"}}

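Filter expression syntax

Make sure that you understand the filter expression syntax that you use to define a search filter. The syntax can be summarized by the following extended Backus-Naur form (EBNF):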

  # A single expression or multiple expressions that are joined by "AND" or "OR".
  filter = expression, { ( " AND " | " OR " ), expression };
  # Expressions can be prefixed with "-" or "NOT" to express a negation.
  expression = [ "-" | "NOT " ],
    # A parenthetical expression.
    ( "(", expression, ")"
    # A simple expression applying to a text field.
    # Function "ANY" returns true if the field exactly matches any of the literals.
    | text_field, ":", "ANY", "(", literal, { ",", literal }, ")"
    # A simple expression applying to a numerical field. Function "IN" returns true
    # if a field value is within the range. By default, lower_bound is inclusive and
    # upper_bound is exclusive.
    | numerical_field, ":", "IN", "(", lower_bound, ",", upper_bound, ")"
    # A simple expression that applies to a numerical field and compares with a double value.
    | numerical_field, comparison, double
    # An expression that applies to a geolocation field with text/street/postal address.
    |  geolocation_field, ":", "GEO_DISTANCE(", literal, ",", distance_in_meters, ")"
    # An expression that applies to a geolocation field with latitude and longitude.
    | geolocation_field, ":", "GEO_DISTANCE(", latitude_double, ",", longitude_double, ",", distance_in_meters, ")"
    # Datetime field
    | datetime_field, comparison, literal_iso_8601_datetime_format);
  # A lower_bound is either a double or "*", which represents negative infinity.
  # Explicitly specify inclusive bound with the character 'i' or exclusive bound
  # with the character 'e'.
  lower_bound = ( double, [ "e" | "i" ] ) | "*";
  # An upper_bound is either a double or "*", which represents infinity.
  # Explicitly specify inclusive bound with the character 'i' or exclusive bound
  # with the character 'e'.
  upper_bound = ( double, [ "e" | "i" ] ) | "*";
  # Supported comparison operators.
  comparison = "<=" | "<" | ">=" | ">" | "=";
  # A literal is any double quoted string. You must escape backslash (\) and
  # quote (") characters.
  literal = double quoted string;
  text_field = text field - for example, category;
  numerical_field = numerical field - for example, score;
  geolocation_field = field of geolocation data type - for example home_address, location;
  datetime_field = field of datetime data type - for example creation_date, expires_on;
  literal_iso_8601_datetime_format = either a double quoted string representing ISO 8601 datetime or a numerical field representing microseconds from unix epoch.
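As a rough illustration of the grammar, the sketch below builds two of the simple expression forms as strings: `ANY` on a text field and `IN` on a numerical field. The function names (`quote_literal`, `any_of`, `bound`, `in_range`) are hypothetical helpers written for this page, not part of any client library. They only follow the rules stated in the grammar comments: literals are double quoted with backslash and quote characters escaped, and a bound is either `*` or a double with an optional `i` or `e` suffix.

```python
def quote_literal(text):
    """Double-quote a literal, escaping backslash and quote characters."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def any_of(field, values):
    """Build `field: ANY("a", "b")` for a text field."""
    return f"{field}: ANY({', '.join(quote_literal(v) for v in values)})"


def bound(value, inclusive=None):
    """Format a bound: None becomes "*", otherwise a double with an optional i/e suffix."""
    if value is None:
        return "*"
    suffix = "" if inclusive is None else ("i" if inclusive else "e")
    return f"{float(value)}{suffix}"


def in_range(field, lower=None, upper=None, lower_inclusive=None, upper_inclusive=None):
    """Build `field: IN(lower, upper)` for a numerical field."""
    return f"{field}: IN({bound(lower, lower_inclusive)}, {bound(upper, upper_inclusive)})"


print(any_of("category", ["persona_A", "persona_B"]))
# category: ANY("persona_A", "persona_B")
print(in_range("score", upper=100, upper_inclusive=False))
# score: IN(*, 100.0e)
```

Leaving the suffix off relies on the defaults described in the grammar: the lower bound is inclusive and the upper bound is exclusive.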

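Search using the metadata filter

To search using a metadata filter, follow these steps:

  1. Decide which metadata field to use to filter your search queries. For example, with the metadata in the "Example metadata" section, you could use the category field as a search filter. Users could then filter by persona_A, persona_B, or persona_C to restrict the results to documents that are relevant to the persona they need.

  2. Make the metadata field indexable:

    1. In the Google Cloud console, go to the AI Applications page, and then click Apps in the navigation menu.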

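    2. Click your search app.

    3. In the navigation menu, click Data.

    4. Click the Schema tab. This tab shows the current field settings.

    5. Click Edit.

    6. Find the field that you want to make indexable, and then select its Indexable checkbox.

    7. Click Save. For more information, see "Configure field settings".

  3. Find your data store ID. If you already have your data store ID, skip to the next step.

    1. In the Google Cloud console, go to the AI Applications page, and then click Data Stores in the navigation menu.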

    2. Click the name of your data store.

    3. On the Data page for your data store, get the data store ID.

  4. Get search results.

    curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/servingConfigs/default_search:search" \
    -d '{
    "query": "QUERY",
    "filter": "FILTER"
    }'
    

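    Replace the following:

    • PROJECT_ID: the ID of your project.
    • DATA_STORE_ID: the ID of your data store.
    • QUERY: the query text to search.
    • FILTER: Optional. A text field that lets you filter on a specified set of fields by using the filter expression syntax. The default value is an empty string, which means that no filter is applied.

    For example, suppose that you have imported the four PDF files with the metadata shown in the "Example metadata" section. You want to search for documents that contain the word "claims", and you want to query only documents whose category value is persona_A. To do this, include the following statements in your call: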
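For readers who prefer a script to curl, the following is a minimal sketch of the same request built with Python's standard library. It only constructs the request and prints it; nothing is sent. The function name `build_search_request` and the `ACCESS_TOKEN` placeholder are assumptions made for this example, while the URL, headers, and body fields mirror the curl command above.

```python
import json
import urllib.request


def build_search_request(project_id, data_store_id, query, filter_expr="",
                         access_token="ACCESS_TOKEN"):
    """Build (but do not send) the search request shown in the curl command."""
    url = (
        "https://discoveryengine.googleapis.com/v1beta/"
        f"projects/{project_id}/locations/global/collections/default_collection/"
        f"dataStores/{data_store_id}/servingConfigs/default_search:search"
    )
    body = {"query": query}
    if filter_expr:
        # The filter is optional; an empty string means no filter is applied.
        body["filter"] = filter_expr
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_search_request("PROJECT_ID", "DATA_STORE_ID", "QUERY", "FILTER")
print(request.full_url)
print(request.data.decode("utf-8"))
# To send it, pass a real token (for example, the output of
# `gcloud auth print-access-token`) and call urllib.request.urlopen(request).
```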

    "query": "claims",
    "filter": "category: ANY(\"persona_A\")"
    

    For more information, see the REST tab in "Get search results for an app with structured or unstructured data".
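The backslashes in the example filter are JSON string escapes, not part of the filter expression itself. One way to avoid writing them by hand, sketched here with Python's standard `json` module, is to write the filter as a plain string and let the serializer escape the inner double quotes.

```python
import json

# The filter expression as it is written in the filter syntax, with plain double quotes.
filter_expr = 'category: ANY("persona_A")'

# Serializing the request body escapes the inner quotes for JSON.
body = json.dumps({"query": "claims", "filter": filter_expr}, indent=2)
print(body)
# {
#   "query": "claims",
#   "filter": "category: ANY(\"persona_A\")"
# }
```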

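    If you run the search in the preceding procedure, you receive a response similar to the following. Notice that the response contains the three documents whose category value includes persona_A.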

    {
    "results": [
    {
      "id": "2",
      "document": {
        "name": "projects/abcdefg/locations/global/collections/default_collection/dataStores/search_store_id/branches/0/documents/2",
        "id": "2",
        "structData": {
          "title": "Claims documentation and reporting guidelines for commercial members",
          "category": [
            "persona_A",
            "persona_B"
          ]
        },
        "derivedStructData": {
          "link": "gs://bucketname_87654321/data/document_2.pdf",
          "extractive_answers": [
            {
              "pageNumber": "1",
              "content": "lorem ipsum"
            }
          ]
        }
      }
    },
    {
      "id": "1",
      "document": {
        "name": "projects/abcdefg/locations/global/collections/default_collection/dataStores/search_store_id/branches/0/documents/1",
        "id": "1",
        "structData": {
          "title": "Policy on accepting corrected claims",
          "category": [
            "persona_A"
          ]
        },
        "derivedStructData": {
          "extractive_answers": [
            {
              "pageNumber": "2",
              "content": "lorem ipsum"
            }
          ],
          "link": "gs://bucketname_87654321/data/document_1.pdf"
        }
      }
    },
    {
      "id": "4",
      "document": {
        "name": "projects/abcdefg/locations/global/collections/default_collection/dataStores/search_store_id/branches/0/documents/4",
        "id": "4",
        "structData": {
          "title": "Advantage claims submission guidelines",
          "category": [
            "persona_A",
            "persona_C"
          ]
        },
        "derivedStructData": {
          "extractive_answers": [
            {
              "pageNumber": "47",
              "content": "lorem ipsum"
            }
          ],
          "link": "gs://bucketname_87654321/data/document_4.pdf"
        }
      }
    }
    ],
    "totalSize": 330,
    "attributionToken": "UvBRCgsI26PxpQYQs7vQZRIkNjRiYWY1MTItMDAwMC0yZWIwLTg3MTAtMTQyMjNiYzYzMWEyIgdHRU5FUklDKhSOvp0VpovvF8XL8xfC8J4V1LKdFQ",
    "guidedSearchResult": {},
    "summary": {}
    }
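A response like this is usually consumed by code rather than read by eye. The sketch below parses an abridged copy of the example response and pulls out each result's ID, title, categories, and Cloud Storage link. The helper name `summarize_results` is illustrative; the field names are taken from the response above, and the sketch assumes every result has a `document` object, as in this example.

```python
import json

# Abridged copy of the example response, keeping only the fields used below.
RESPONSE_JSON = """
{
  "results": [
    {"id": "2", "document": {
      "structData": {"title": "Claims documentation and reporting guidelines for commercial members",
                     "category": ["persona_A", "persona_B"]},
      "derivedStructData": {"link": "gs://bucketname_87654321/data/document_2.pdf"}}},
    {"id": "1", "document": {
      "structData": {"title": "Policy on accepting corrected claims",
                     "category": ["persona_A"]},
      "derivedStructData": {"link": "gs://bucketname_87654321/data/document_1.pdf"}}},
    {"id": "4", "document": {
      "structData": {"title": "Advantage claims submission guidelines",
                     "category": ["persona_A", "persona_C"]},
      "derivedStructData": {"link": "gs://bucketname_87654321/data/document_4.pdf"}}}
  ],
  "totalSize": 330
}
"""


def summarize_results(response):
    """Return (id, title, categories, link) tuples, in the order the results were returned."""
    rows = []
    for result in response.get("results", []):
        document = result["document"]
        struct = document.get("structData", {})
        derived = document.get("derivedStructData", {})
        rows.append((result["id"], struct.get("title"),
                     struct.get("category", []), derived.get("link")))
    return rows


for doc_id, title, categories, link in summarize_results(json.loads(RESPONSE_JSON)):
    print(doc_id, title, categories, link)
```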
    

Filter expression examples

The following table lists examples of filter expressions.

Filter | Only returns results for documents where:
category: ANY("persona_A") | The text field category is persona_A
score: IN(*, 100.0e) | The numerical field score is greater than negative infinity and less than 100.0
non-smoking = "true" | The boolean non-smoking is true
pet-friendly = "false" | The boolean pet-friendly is false
manufactured_date = "2023" | The manufactured_date is any time in 2023
manufactured_date >= "2024-04-16" | The manufactured_date is on or after April 16, 2024
manufactured_date < "2024-04-16T12:00:00-07:00" | The manufactured_date is before noon Pacific Daylight Time on April 16, 2024
office.location:GEO_DISTANCE("1600 Amphitheatre Pkwy, Mountain View, CA, 94043", 500) | The geolocation field office.location is within 500 m of 1600 Amphitheatre Pkwy
NOT office.location:GEO_DISTANCE("Palo Alto, CA", 1000) | The geolocation field office.location is not within 1 km of Palo Alto, CA
office.location:GEO_DISTANCE(34.1829, -121.293, 500) | The geolocation field office.location is within a 500 m radius of latitude 34.1829 and longitude -121.293
category: ANY("persona_A") AND score: IN(*, 100.0e) | The category is persona_A and the score is less than 100
office.location:GEO_DISTANCE("Mountain View, CA", 500) OR office.location:GEO_DISTANCE("Palo Alto, CA", 500) | The office.location is within 500 m of either Mountain View or Palo Alto
(price<175 AND pet-friendly = "true") OR (price<125 AND pet-friendly = "false") | The price is less than 175 and pets are allowed, or the price is less than 125 and pets are not allowed
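The compound rows in the table can be assembled from simple expressions with a few string helpers. This is a sketch under the assumption that filters are built as plain strings; the function names (`all_of`, `either`, `group`, `negate`, `geo_distance`) are hypothetical and exist only for this example.

```python
def quote_literal(text):
    """Double-quote a literal, escaping backslash and quote characters."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def all_of(*expressions):
    """Join expressions with AND."""
    return " AND ".join(expressions)


def either(*expressions):
    """Join expressions with OR."""
    return " OR ".join(expressions)


def group(expression):
    """Wrap an expression in parentheses."""
    return f"({expression})"


def negate(expression):
    """Prefix an expression with NOT."""
    return f"NOT {expression}"


def geo_distance(field, place, meters):
    """Build a GEO_DISTANCE expression from a text address and a distance in meters."""
    return f"{field}:GEO_DISTANCE({quote_literal(place)}, {meters})"


print(all_of('category: ANY("persona_A")', "score: IN(*, 100.0e)"))
# category: ANY("persona_A") AND score: IN(*, 100.0e)
print(negate(geo_distance("office.location", "Palo Alto, CA", 1000)))
# NOT office.location:GEO_DISTANCE("Palo Alto, CA", 1000)
print(either(
    group(all_of("price<175", 'pet-friendly = "true"')),
    group(all_of("price<125", 'pet-friendly = "false"')),
))
# (price<175 AND pet-friendly = "true") OR (price<125 AND pet-friendly = "false")
```

Grouping each AND clause explicitly, as the last table row does, keeps the intended precedence visible instead of relying on how AND and OR are evaluated together.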

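What's next

  • To understand the effect of filters on search quality, evaluate your search quality. For more information, see "Evaluate search quality".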