使用 SQL 和 Python 分析多模态数据

本教程将向您展示如何使用 SQL 查询和 Python 用户定义的函数 (UDF) 分析多模态数据

本教程使用的是公开的 Cymbal 宠物商店数据集中的产品目录。

目标

  • 使用 ObjectRef 值将图片数据与结构化数据一起存储在 BigQuery 标准表中。
  • 使用 AI.GENERATE_TABLE 函数根据标准表中的图片数据生成文本。
  • 使用 Python UDF 转换现有图片以创建新图片。
  • 使用 Python UDF 将 PDF 分块,以便进行进一步分析。
  • 使用 Gemini 模型和 ML.GENERATE_TEXT 函数分析分块的 PDF 数据。
  • 使用 ML.GENERATE_EMBEDDING 函数根据标准表中的图片数据生成嵌入。
  • 使用 ObjectRef 值数组处理有序多模式数据。

费用

在本文档中,您将使用 Google Cloud的以下收费组件:

  • BigQuery: you incur costs for the data that you process in BigQuery.
  • BigQuery Python UDFs: you incur costs for using Python UDFs.
  • Cloud Storage: you incur costs for the objects stored in Cloud Storage.
  • Vertex AI: you incur costs for calls to Vertex AI models.

您可使用价格计算器根据您的预计使用情况来估算费用。 Google Cloud 新用户可能有资格申请免费试用

如需了解详情,请参阅以下价格页面:

准备工作

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  2. Make sure that billing is enabled for your Google Cloud project.

  3. Enable the BigQuery, BigQuery Connection, Cloud Storage, and Vertex AI APIs.

    Enable the APIs

所需的角色

如需获得完成本教程所需的权限,请让您的管理员为您授予以下 IAM 角色:

如需详细了解如何授予角色,请参阅管理对项目、文件夹和组织的访问权限

您也可以通过自定义角色或其他预定义角色来获取所需的权限。

设置

在本部分中,您将创建本教程中使用的数据集、连接、表和模型。

创建数据集

创建一个 BigQuery 数据集来包含您在本教程中创建的对象:

  1. 在 Google Cloud 控制台中,前往 BigQuery 页面。

    转到 BigQuery

  2. 探索器窗格中,选择您的项目。

  3. 展开 操作选项,然后点击创建数据集。系统随即会打开创建数据集窗格。

  4. 对于数据集 ID,请输入 cymbal_pets

  5. 点击创建数据集

创建存储桶

创建一个 Cloud Storage 存储桶来存储转换后的对象:

  1. 转至存储桶页面。

    进入“存储桶”

  2. 点击 创建

  3. 创建存储桶页面的开始使用部分中,输入符合存储桶名称要求的全局唯一名称。

  4. 点击创建

创建连接

创建 Cloud 资源连接并获取连接的服务账号。 BigQuery 使用该连接来访问 Cloud Storage 中的对象:

  1. 转到 BigQuery 页面。

    转到 BigQuery

  2. 浏览器窗格中,点击 添加数据

    系统随即会打开添加数据对话框。

  3. 过滤条件窗格中的数据源类型部分,选择商务应用

    或者,在搜索数据源字段中,输入 Vertex AI

  4. 精选数据源部分中,点击 Vertex AI

  5. 点击 Vertex AI 模型:BigQuery 联合解决方案卡片。

  6. 连接类型列表中,选择 Vertex AI 远程模型、远程函数和 BigLake(Cloud 资源)

  7. 连接 ID 字段中,输入 cymbal_conn

  8. 点击创建连接

  9. 点击转到连接

  10. 连接信息窗格中,复制服务账号 ID 以在后续步骤中使用。

向连接的服务账号授予权限

向连接的服务账号授予适当的角色以访问其他服务。您必须在您在准备工作部分创建或选择的项目中授予这些角色。在其他项目中授予这些角色会导致错误 bqcx-1234567890-xxxx@gcp-sa-bigquery-condel.iam.gserviceaccount.com does not have the permission to access resource

授予对 Cloud Storage 存储桶的权限

向服务账号授予使用您创建的存储桶中对象的权限:

  1. 转至存储桶页面。

    进入“存储桶”

  2. 点击您创建的存储桶的名称。

  3. 点击权限

  4. 点击 授予访问权限。 系统随即会打开授予访问权限对话框。

  5. 新的主账号字段中,输入您之前复制的服务账号 ID。

  6. 选择角色字段中,选择 Cloud Storage,然后选择 Storage Object User

  7. 点击保存

授予使用 Vertex AI 模型的权限

向服务账号授予使用 Vertex AI 模型的权限:

  1. 前往 IAM 和管理页面。

    转到“IAM 和管理”

  2. 点击 授予访问权限。 系统随即会打开授予访问权限对话框。

  3. 新的主账号字段中,输入您之前复制的服务账号 ID。

  4. 选择角色字段中,选择 Vertex AI,然后选择 Vertex AI User

  5. 点击保存

创建示例数据表

创建表来存储 Cymbal 宠物产品信息。

创建 products

创建一个包含 Cymbal 宠物商品信息的标准表:

  1. 在 Google Cloud 控制台中,前往 BigQuery 页面。

    转到 BigQuery

  2. 在查询编辑器中,运行以下查询以创建 products 表:

    LOAD DATA OVERWRITE cymbal_pets.products
    FROM
      FILES(
        format = 'avro',
        uris = [
          'gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/tables/products/products_*.avro']);

创建 product_images

创建一个包含 Cymbal 宠物商品图片的对象表:

  • BigQuery 页面的查询编辑器中,运行以下查询以创建 product_images 表:

    CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images
      WITH CONNECTION `us.cymbal_conn`
      OPTIONS (
        object_metadata = 'SIMPLE',
        uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png'],
        max_staleness = INTERVAL 30 MINUTE,
        metadata_cache_mode = AUTOMATIC);

创建 product_manuals

创建一个包含 Cymbal 宠物产品手册的对象表:

  • BigQuery 页面的查询编辑器中,运行以下查询以创建 product_manuals 表:

    CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_manuals
      WITH CONNECTION `us.cymbal_conn`
      OPTIONS (
        object_metadata = 'SIMPLE',
        uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/documents/*.pdf']);

创建文本生成模型

创建表示 Vertex AI Gemini 模型的 BigQuery ML 远程模型

  • BigQuery 页面的查询编辑器中,运行以下查询以创建远程模型:

    CREATE OR REPLACE MODEL `cymbal_pets.gemini`
      REMOTE WITH CONNECTION `us.cymbal_conn`
      OPTIONS (ENDPOINT = 'gemini-2.0-flash');

创建嵌入生成模型

创建表示 Vertex AI 多模态嵌入模型的 BigQuery ML 远程模型:

  • BigQuery 页面的查询编辑器中,运行以下查询以创建远程模型:

    CREATE OR REPLACE MODEL `cymbal_pets.embedding_model`
      REMOTE WITH CONNECTION `us.cymbal_conn`
      OPTIONS (ENDPOINT = 'multimodalembedding@001');

创建包含多模态数据的 products_mm

创建一个 products_mm 表,其中包含一个 image 列,该列使用 product_images 对象表中的商品图片进行填充。创建的 image 列是使用 ObjectRef 格式的 STRUCT 列。

  1. BigQuery 页面的查询编辑器中,运行以下查询以创建 products_mm 表并填充 image 列:

    CREATE OR REPLACE TABLE cymbal_pets.products_mm
    AS
    SELECT products.* EXCEPT (uri), ot.ref AS image FROM cymbal_pets.products
    INNER JOIN cymbal_pets.product_images ot
    ON ot.uri = products.uri;
  2. BigQuery 页面的查询编辑器中,运行以下查询以查看 image 列数据:

    SELECT product_name, image
    FROM cymbal_pets.products_mm`

    结果类似于以下内容:

    +--------------------------------+--------------------------------------+-----------------------------------------------+------------------------------------------------+
    | product_name                   | image.uri                            | image.version | image.authorizer              | image.details                                  |
    +--------------------------------+--------------------------------------+-----------------------------------------------+------------------------------------------------+
    |  AquaClear Aquarium Background | gs://cloud-samples-data/bigquery/    | 1234567891011 | myproject.region.myconnection | {"gcs_metadata":{"content_type":"image/png",   |
    |                                | tutorials/cymbal-pets/images/        |               |                               | "md5_hash":"494f63b9b137975ff3e7a11b060edb1d", |
    |                                | aquaclear-aquarium-background.png    |               |                               | "size":1282805,"updated":1742492680017000}}    |
    +--------------------------------+--------------------------------------+-----------------------------------------------+------------------------------------------------+
    |  AquaClear Aquarium            | gs://cloud-samples-data/bigquery/    | 2345678910112 | myproject.region.myconnection | {"gcs_metadata":{"content_type":"image/png",   |
    |  Gravel Vacuum                 | tutorials/cymbal-pets/images/        |               |                               | "md5_hash":"b7bfc2e2641a77a402a1937bcf0003fd", |
    |                                | aquaclear-aquarium-gravel-vacuum.png |               |                               | "size":820254,"updated":1742492682411000}}     |
    +--------------------------------+--------------------------------------+-----------------------------------------------+------------------------------------------------+
    | ...                            | ...                                  | ...           |                               | ...                                            |
    +--------------------------------+--------------------------------------+-----------------------------------------------+------------------------------------------------+
    

使用 Gemini 模型生成商品信息

使用 Gemini 模型为宠物店商品生成以下数据:

  • products_mm 表添加 image_description 列。
  • 填充 products_mm 表的 animal_typesearch_keywordssubcategory 列。
  • 运行一个查询,以返回每个商品品牌的说明以及该品牌的商品数量。品牌说明是通过分析相应品牌的所有商品(包括商品图片)的产品信息生成的。
  1. BigQuery 页面的查询编辑器中,运行以下查询以创建并填充 image_description 列:

    CREATE OR REPLACE TABLE cymbal_pets.products_mm
    AS
    SELECT
      product_id,
      product_name,
      brand,
      category,
      subcategory,
      animal_type,
      search_keywords,
      price,
      description,
      inventory_level,
      supplier_id,
      average_rating,
      image,
      image_description
    FROM
      AI.GENERATE_TABLE(
        MODEL `cymbal_pets.gemini`,
        (
          SELECT
            ('Can you describe the following image?', OBJ.GET_ACCESS_URL(image, 'r')) AS prompt,
            *
          FROM
            cymbal_pets.products_mm
        ),
        STRUCT('image_description STRING' AS output_schema));
  2. BigQuery 页面的查询编辑器中,运行以下查询,以使用生成的数据更新 animal_typesearch_keywordssubcategory 列:

    UPDATE cymbal_pets.products_mm p
    SET
      p.animal_type = s.animal_type,
      p.search_keywords = s.search_keywords,
      p.subcategory = s.subcategory
    FROM
      (
        SELECT
          animal_type,
          search_keywords,
          subcategory,
          uri
        FROM
          AI.GENERATE_TABLE(
            MODEL `cymbal_pets.gemini`,
            (
              SELECT
                (
                  'For the image of a pet product, concisely generate the following metadata.'
                    '1) animal_type and 2) 5 SEO search keywords, and 3) product subcategory',
                  OBJ.GET_ACCESS_URL(image, 'r'),
                  description) AS prompt,
                image.uri AS uri,
              FROM cymbal_pets.products_mm
            ),
            STRUCT(
              'animal_type STRING, search_keywords ARRAY<STRING>, subcategory STRING' AS output_schema,
              100 AS max_output_tokens))
      ) s
    WHERE p.image.uri = s.uri;
  3. BigQuery 页面的查询编辑器中,运行以下查询以查看生成的数据:

    SELECT
      product_name,
      image_description,
      animal_type,
      search_keywords,
      subcategory,
    FROM cymbal_pets.products_mm;

    结果类似于以下内容:

    +--------------------------------+-------------------------------------+-------------+------------------------+------------------+
    | product_name                   | image.description                   | animal_type | search_keywords        | subcategory      |
    +--------------------------------+-------------------------------------+-------------+------------------------+------------------+
    |  AquaClear Aquarium Background | The image shows a colorful coral    | fish        | aquarium background    | aquarium decor   |
    |                                | reef backdrop. The background is a  |             | fish tank backdrop     |                  |
    |                                | blue ocean with a bright light...   |             | coral reef decor       |                  |
    |                                |                                     |             | underwater scenery     |                  |
    |                                |                                     |             | aquarium decoration    |                  |
    +--------------------------------+-------------------------------------+-------------+------------------------+------------------+
    |  AquaClear Aquarium            | The image shows a long, clear       | fish        | aquarium gravel vacuum | aquarium         |
    |  Gravel Vacuum                 | plastic tube with a green hose      |             | aquarium cleaning      | cleaning         |
    |                                | attached to one end. The tube...    |             | aquarium maintenance   |                  |
    |                                |                                     |             | fish tank cleaning     |                  |
    |                                |                                     |             | gravel siphon          |                  |
    +--------------------------------+-------------------------------------+-------------+------------------------+------------------+
    | ...                            | ...                                 | ...         |  ...                   | ...              |
    +--------------------------------+-------------------------------------+-------------+------------------------+------------------+
    
  4. BigQuery 页面的查询编辑器中,运行以下查询以生成每个商品品牌的说明,以及该品牌的商品数量统计信息:

    SELECT
      brand,
      brand_description,
      cnt
    FROM
      AI.GENERATE_TABLE(
        MODEL `cymbal_pets.gemini`,
        (
          SELECT
            brand,
            COUNT(*) AS cnt,
            (
              'Use the images and text to give one concise brand description for a website brand page.'
                'Return the description only.',
              ARRAY_AGG(OBJ.GET_ACCESS_URL(image, 'r')),
              ARRAY_AGG(description),
              ARRAY_AGG(category),
              ARRAY_AGG(subcategory)) AS prompt
          FROM cymbal_pets.products_mm
          GROUP BY brand
        ),
        STRUCT('brand_description STRING' AS output_schema))
    ORDER BY cnt DESC;

    结果类似于以下内容:

    +--------------+-------------------------------------+-----+
    | brand        | brand.description                   | cnt |
    +--------------+-------------------------------------+-----+
    |  AquaClear   | AquaClear is a brand of aquarium    | 33  |
    |              | and pond care products that offer   |     |
    |              | a wide range of solutions for...    |     |
    +--------------+-------------------------------------+-----+
    |  Ocean       | Ocean Bites is a brand of cat food  | 28  |
    |  Bites       | that offers a variety of recipes    |     |
    |              | and formulas to meet the specific.. |     |
    +--------------+-------------------------------------+-----+
    |  ...         | ...                                 |...  |
    +--------------+-------------------------------------+-----+
    

创建 Python UDF 以转换商品图片

创建一个 Python UDF 以将产品图片转换为灰度图片。

Python UDF 使用开源库,还使用并行执行来同时转换多个图片。

  1. BigQuery 页面的查询编辑器中,运行以下查询以创建 to_grayscale UDF:

    CREATE OR REPLACE FUNCTION cymbal_pets.to_grayscale(src_json STRING, dst_json STRING)
    RETURNS STRING
    LANGUAGE python
    WITH CONNECTION `us.cymbal_conn`
    OPTIONS (entry_point='to_grayscale', runtime_version='python-3.11', packages=['numpy', 'opencv-python'])
    AS """
    
    import cv2 as cv
    import numpy as np
    from urllib.request import urlopen, Request
    import json
    
    # Transform the image to grayscale.
    def to_grayscale(src_ref, dst_ref):
      src_json = json.loads(src_ref)
      srcUrl = src_json["access_urls"]["read_url"]
    
      dst_json = json.loads(dst_ref)
      dstUrl = dst_json["access_urls"]["write_url"]
    
      req = urlopen(srcUrl)
      arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
      img = cv.imdecode(arr, -1) # 'Load it as it is'
    
      # Convert the image to grayscale
      gray_image = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
    
      # Send POST request to the URL
      _, img_encoded = cv.imencode('.png', gray_image)
    
      req = Request(url=dstUrl, data=img_encoded.tobytes(), method='PUT', headers = {
          "Content-Type": "image/png",
      })
      with urlopen(req) as f:
          pass
      return dst_ref
    """;

美化商品图片

创建 products_grayscale 表,其中包含一个 ObjectRef 列,该列包含灰度图片的目标路径和授权者。目标路径派生自原始图片路径。

创建表格后,运行 to_grayscale 函数以创建灰度图片,将其写入 Cloud Storage 存储桶,然后返回包含灰度图片的访问网址和元数据的 ObjectRefRuntime 值。

  1. BigQuery 页面的查询编辑器中,运行以下查询以创建 products_grayscale 表:

    CREATE OR REPLACE TABLE cymbal_pets.products_grayscale
    AS
    SELECT
      product_id,
      product_name,
      image,
      OBJ.MAKE_REF(
        CONCAT('gs://BUCKET/cymbal-pets-images/grayscale/', REGEXP_EXTRACT(image.uri, r'([^/]+)$')),
        'us.cymbal_conn') AS gray_image
    FROM cymbal_pets.products_mm;

    BUCKET 替换为您创建的存储桶的名称。

  2. BigQuery 页面的查询编辑器中,运行以下查询以创建灰度图片,将其写入 Cloud Storage 存储桶,然后返回包含灰度图片的访问网址和元数据的 ObjectRefRuntime 值:

    SELECT cymbal_pets.to_grayscale(
      TO_JSON_STRING(OBJ.GET_ACCESS_URL(image, 'r')),
      TO_JSON_STRING(OBJ.GET_ACCESS_URL(gray_image, 'rw')))
    FROM cymbal_pets.products_grayscale;

    结果类似于以下内容:

    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | f0                                                                                                                                                                    |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | {"access_urls":{"expiry_time":"2025-04-26T03:00:48Z",                                                                                                                 |
    | "read_url":"https://storage.googleapis.com/mybucket/cymbal-pets-images%2Fgrayscale%2Focean-bites-salmon-%26-tuna-cat-food.png?additional_read URL_information",       |
    | "write_url":"https://storage.googleapis.com/myproject/cymbal-pets-images%2Fgrayscale%2Focean-bites-salmon-%26-tuna-cat-food.png?additional_write URL_information"},   |
    | "objectref":{"authorizer":"myproject.region.myconnection","uri":"gs://myproject/cymbal-pets-images/grayscale/ocean-bites-salmon-&-tuna-cat-food.png"}}                |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | {"access_urls":{"expiry_time":"2025-04-26T03:00:48Z",                                                                                                                 |
    | "read_url":"https://storage.googleapis.com/mybucket/cymbal-pets-images%2Fgrayscale%2Ffluffy-buns-guinea-pig-tunnel.png?additional _read URL_information",             |
    | "write_url":"https://storage.googleapis.com/myproject/cymbal-pets-images%2Fgrayscale%2Focean-bites-salmon-%26-tuna-cat-food.png?additional_write_URL_information"},   |
    | "objectref":{"authorizer":"myproject.region.myconnection","uri":"gs://myproject/cymbal-pets-images%2Fgrayscale%2Ffluffy-buns-guinea-pig-tunnel.png"}}                 |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |  ...                                                                                                                                                                  |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    

创建 Python UDF 以分块处理 PDF 数据

创建一个 Python UDF,将包含 Cymbal 宠物产品手册的 PDF 对象分块为多个部分。

PDF 文件通常非常大,可能无法在对生成式 AI 模型的单次调用中处理完毕。通过对 PDF 进行分块处理,您可以将 PDF 数据存储为可用于模型的格式,以便更轻松地进行分析。

  1. BigQuery 页面的查询编辑器中,运行以下查询以创建 chunk_pdf UDF:

    -- This function chunks the product manual PDF into multiple parts.
    -- The function accepts an ObjectRefRuntime value for the PDF file and the chunk size.
    -- It then parses the PDF, chunks the contents, and returns an array of chunked text.
    CREATE OR REPLACE FUNCTION cymbal_pets.chunk_pdf(src_json STRING, chunk_size INT64, overlap_size INT64)
    RETURNS ARRAY<STRING>
    LANGUAGE python
    WITH CONNECTION `us.cymbal_conn`
    OPTIONS (entry_point='chunk_pdf', runtime_version='python-3.11', packages=['pypdf'])
    AS """
    import io
    import json
    
    from pypdf import PdfReader  # type: ignore
    from urllib.request import urlopen, Request
    
    def chunk_pdf(src_ref: str, chunk_size: int, overlap_size: int) -> str:
      src_json = json.loads(src_ref)
      srcUrl = src_json["access_urls"]["read_url"]
    
      req = urlopen(srcUrl)
      pdf_file = io.BytesIO(bytearray(req.read()))
      reader = PdfReader(pdf_file, strict=False)
    
      # extract and chunk text simultaneously
      all_text_chunks = []
      curr_chunk = ""
      for page in reader.pages:
          page_text = page.extract_text()
          if page_text:
              curr_chunk += page_text
              # split the accumulated text into chunks of a specific size with overlaop
              # this loop implements a sliding window approach to create chunks
              while len(curr_chunk) >= chunk_size:
                  split_idx = curr_chunk.rfind(" ", 0, chunk_size)
                  if split_idx == -1:
                      split_idx = chunk_size
                  actual_chunk = curr_chunk[:split_idx]
                  all_text_chunks.append(actual_chunk)
                  overlap = curr_chunk[split_idx + 1 : split_idx + 1 + overlap_size]
                  curr_chunk = overlap + curr_chunk[split_idx + 1 + overlap_size :]
      if curr_chunk:
          all_text_chunks.append(curr_chunk)
    
      return all_text_chunks
    """;

分析 PDF 数据

运行 chunk_pdf 函数以将 product_manuals 表中的 PDF 数据分块,然后创建一个 product_manual_chunk_strings 表,其中每行包含一个 PDF 分块。对 product_manual_chunk_strings 数据使用 Gemini 模型,以汇总产品手册中找到的法律信息。

  1. BigQuery 页面的查询编辑器中,运行以下查询以创建 product_manual_chunk_strings 表:

    CREATE OR REPLACE TABLE cymbal_pets.product_manual_chunk_strings
    AS
    SELECT chunked
    FROM cymbal_pets.product_manuals,
    UNNEST (cymbal_pets.chunk_pdf(
      TO_JSON_STRING(
        OBJ.GET_ACCESS_URL(OBJ.MAKE_REF(uri, 'us.cymbal_conn'), 'r')),
        1000,
        100
    )) as chunked;
  2. BigQuery 页面的查询编辑器中,运行以下查询,以使用 Gemini 模型分析 PDF 数据:

    SELECT
      ml_generate_text_llm_result
    FROM
      ML.GENERATE_TEXT(
        MODEL `cymbal_pets.gemini`,
        (
          SELECT
            (
              'Can you summarize the product manual as bullet points? Highlight the legal clauses',
              chunked) AS prompt,
          FROM cymbal_pets.product_manual_chunk_strings
        ),
        STRUCT(
          TRUE AS FLATTEN_JSON_OUTPUT));

    结果类似于以下内容:

    +-------------------------------------------------------------------------------------------------------------------------------------------+
    | ml_generate_text_llm_result                                                                                                               |
    +-------------------------------------------------------------------------------------------------------------------------------------------+
    | ## CritterCuisine Pro 5000 Automatic Pet Feeder Manual Summary:                                                                           |
    |                                                                                                                                           |
    | **Safety:**                                                                                                                               |
    |                                                                                                                                           |
    | * **Stability:** Place feeder on a level, stable surface to prevent tipping.                                                              |
    | * **Power Supply:** Only use the included AC adapter. Using an incompatible adapter can damage the unit and void the warranty.            |
    | * **Cord Safety:** Keep the power cord out of reach of pets to prevent chewing or entanglement.                                           |
    | * **Children:** Supervise children around the feeder. This is not a toy.                                                                  |
    | * **Pet Health:** Consult your veterinarian before using an automatic feeder if your pet has special dietary needs, health conditions, or |
    +-------------------------------------------------------------------------------------------------------------------------------------------+
    | ## Product Manual Summary:                                                                                                                |
    |                                                                                                                                           |
    | **6.3 Manual Feeding:**                                                                                                                   |
    |                                                                                                                                           |
    | * Press MANUAL button to dispense a single portion (Meal 1 size). **(Meal Enabled)**                                                      |
    |                                                                                                                                           |
    | **6.4 Recording a Voice Message:**                                                                                                        |
    |                                                                                                                                           |
    | * Press and hold VOICE button.                                                                                                            |
    | * Speak clearly into the microphone (up to 10 seconds).                                                                                   |
    | * Release VOICE button to finish recording.                                                                                               |
    | * Briefly press VOICE button to play back the recording.                                                                                  |
    | * To disable the voice message, record a blank message (hold VOICE button for 10 seconds without speaking). **(Meal Enabled)**            |
    |                                                                                                                                           |
    | **6.5 Low Food Level Indicator:**                                                                                                         |
    +-------------------------------------------------------------------------------------------------------------------------------------------+
    | ...                                                                                                                                       |
    +-------------------------------------------------------------------------------------------------------------------------------------------+
    

从图片数据生成嵌入,然后使用向量搜索功能通过嵌入返回类似图片。

在生产环境中,我们建议您先创建向量索引,然后再运行向量搜索。矢量索引可让您更快地执行矢量搜索,同时需要降低召回率并返回更多近似结果。

  1. BigQuery 页面的查询编辑器中,运行以下查询以创建 products_embeddings 表:

    CREATE OR REPLACE TABLE cymbal_pets.products_embedding
    AS
    SELECT product_id, ml_generate_embedding_result as embedding, content as image
    FROM ML.GENERATE_EMBEDDING(
    MODEL `cymbal_pets.embedding_model`,
      (
        SELECT OBJ.GET_ACCESS_URL(image, 'r') as content, image, product_id
        FROM cymbal_pets.products_mm
      ),
      STRUCT ()
    );
  2. BigQuery 页面的查询编辑器中,运行以下查询以执行向量搜索,以返回与给定输入图片相似的商品图片:

    SELECT *
    FROM
    VECTOR_SEARCH(
      TABLE cymbal_pets.products_embedding,
      'embedding',
      (SELECT ml_generate_embedding_result as embedding FROM ML.GENERATE_EMBEDDING(
        MODEL `cymbal_pets.embedding_model`,
        (SELECT OBJ.FETCH_METADATA(OBJ.MAKE_REF('gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/cozy-naps-cat-scratching-post-with-condo.png', 'us.cymbal_conn')) as content)
      ))
    );

    结果类似于以下内容:

    +-----------------+-----------------+----------------+----------------------------------------------+--------------------+-------------------------------+------------------------------------------------+----------------+
    | query.embedding | base.product_id | base.embedding | base.image.uri                               | base.image.version | base.image.authorizer         | base.image.details                             | distance       |
    +-----------------+-----------------+----------------+----------------------------------------------+--------------------+-------------------------------+------------------------------------------------+----------------+
    | -0.0112330541   | 181             | -0.0112330541  | gs://cloud-samples-data/bigquery/            | 12345678910        | myproject.region.myconnection | {"gcs_metadata":{"content_type":               | 0.0            |
    | 0.0142525584    |                 |  0.0142525584  | tutorials/cymbal-pets/images/                |                    |                               | "image/png","md5_hash":"21234567hst16555w60j", |                |
    | 0.0135886827    |                 |  0.0135886827  | cozy-naps-cat-scratching-post-with-condo.png |                    |                               | "size":828318,"updated":1742492688982000}}     |                |
    | 0.0149955815    |                 |  0.0149955815  |                                              |                    |                               |                                                |                |
    | ...             |                 |  ...           |                                              |                    |                               |                                                |                |
    |                 |                 |                |                                              |                    |                               |                                                |                |
    |                 |                 |                |                                              |                    |                               |                                                |                |
    +-----------------+-----------------+----------------+----------------------------------------------+--------------------+-------------------------------+------------------------------------------------+----------------+
    | -0.0112330541   | 187             | -0.0190353896  | gs://cloud-samples-data/bigquery/            | 23456789101        | myproject.region.myconnection | {"gcs_metadata":{"content_type":               | 0.4216330832.. |
    | 0.0142525584    |                 |  0.0116206668  | tutorials/cymbal-pets/images/                |                    |                               | "image/png","md5_hash":"7328728fhakd9937djo4", |                |
    | 0.0135886827    |                 |  0.0136198215  | cozy-naps-cat-scratching-post-with-bed.png   |                    |                               | "size":860113,"updated":1742492688774000}}     |                |
    | 0.0149955815    |                 |  0.0173457414  |                                              |                    |                               |                                                |                |
    | ...             |                 |  ...           |                                              |                    |                               |                                                |                |
    |                 |                 |                |                                              |                    |                               |                                                |                |
    |                 |                 |                |                                              |                    |                               |                                                |                |
    +---------C--------+-----------------+----------------+----------------------------------------------+--------------------+-------------------------------+------------------------------------------------+----------------+
    | ...             | ...             | ...            | ...                                          | ...                | ...                           | ...                                            | ...            |
    +-----------------+-----------------+----------------+----------------------------------------------+--------------------+-------------------------------+------------------------------------------------+----------------+
    

使用 ObjectRef 值数组处理有序多模式数据

本指南介绍如何完成以下任务:

  1. 重新创建 product_manuals 表,使其同时包含 Crittercuisine 5000 产品手册的 PDF 文件,以及该手册的每个页面的 PDF 文件。
  2. 创建一个表,将手册映射到其分块。表示完整手册的 ObjectRef 值存储在 STRUCT<uri STRING, version STRING, authorizer STRING, details JSON>> 列中。表示手册页的 ObjectRef 值存储在 ARRAY<STRUCT<uri STRING, version STRING, authorizer STRING, details JSON>> 列中。
  3. 一起分析一个 ObjectRef 值数组,以返回单个生成的值。
  4. 分别分析 ObjectRef 值数组,并为每个数组值返回生成的值。

在分析任务中,您需要将 ObjectRef 值数组转换为 ObjectRefRuntime 值的有序列表,然后将该列表传递给 Gemini 模型,并在问题中指定 ObjectRefRuntime 值。ObjectRefRuntime 值提供签名网址,模型会使用这些网址访问 Cloud Storage 中的对象信息。

如需使用 ObjectRef 值数组处理有序多模态数据,请按以下步骤操作:

  1. 转到 BigQuery 页面。

    转到 BigQuery

  2. 在查询编辑器中,运行以下查询以重新创建 product_manuals 表:

    CREATE OR REPLACE EXTERNAL TABLE `cymbal_pets.product_manuals`
      WITH CONNECTION `us.cymbal_conn`
      OPTIONS (
        object_metadata = 'SIMPLE',
        uris = [
            'gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/documents/*.pdf',
            'gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/document_chunks/*.pdf']);
  3. 在查询编辑器中,运行以下查询以将 PDF 数据写入 map_manual_to_chunks 表:

    -- Extract the file and chunks into a single table.
    -- Store the chunks in the chunks column as array of ObjectRefs (ordered by page number)
    CREATE OR REPLACE TABLE cymbal_pets.map_manual_to_chunks
    AS
    SELECT ARRAY_AGG(m1.ref)[0] manual, ARRAY_AGG(m2.ref ORDER BY m2.ref.uri) chunks
    FROM cymbal_pets.product_manuals m1
    JOIN cymbal_pets.product_manuals m2
      ON
        REGEXP_EXTRACT(m1.uri, r'.*/([^.]*).[^/]+')
        = REGEXP_EXTRACT(m2.uri, r'.*/([^.]*)_page[0-9]+.[^/]+')
    GROUP BY m1.uri;
  4. 在查询编辑器中,运行以下查询以查看 map_manual_to_chunks 表中的 PDF 数据:

    SELECT *
    FROM cymbal_pets.map_manual_to_chunks;

    结果类似于以下内容:

    +-------------------------------------+--------------------------------+-----------------------------------+------------------------------------------------------+-------------------------------------------+---------------------------------+------------------------------------+-------------------------------------------------------+
    | manual.uri                          | manual.version                 | manual.authorizer                 | manual.details                                       | chunks.uri                                | chunks.version                  | chunks.authorizer                  | chunks.details                                        |
    +-------------------------------------+--------------------------------+-----------------------------------+------------------------------------------------------+-------------------------------------------+---------------------------------+------------------------------------+-------------------------------------------------------+
    | gs://cloud-samples-data/bigquery/   | 1742492785900455               | myproject.region.myconnection     | {"gcs_metadata":{"content_type":"application/pef",   | gs://cloud-samples-data/bigquery/         | 1745875761227129                | myproject.region.myconnection      | {"gcs_metadata":{"content_type":"application/pdf",    |
    | tutorials/cymbal-pets/documents/    |                                |                                   | "md5_hash":"c9032b037693d15a33210d638c763d0e",       | tutorials/cymbal-pets/documents/          |                                 |                                    | "md5_hash":"5a1116cce4978ec1b094d8e8b49a1d7c",        |
    | crittercuisine_5000_user_manual.pdf |                                |                                   | "size":566105,"updated":1742492785941000}}           | crittercuisine_5000_user_manual_page1.pdf |                                 |                                    | "size":504583,"updated":1745875761266000}}            |
    |                                     |                                |                                   |                                                      +-------------------------------------------+---------------------------------+------------------------------------+-------------------------------------------------------+
    |                                     |                                |                                   |                                                      | crittercuisine_5000_user_manual_page1.pdf | 1745875760613874                | myproject.region.myconnection      | {"gcs_metadata":{"content_type":"application/pdf",    |
    |                                     |                                |                                   |                                                      | tutorials/cymbal-pets/documents/          |                                 |                                    | "md5_hash":"94d03ec65d28b173bc87eac7e587b325",        |
    |                                     |                                |                                   |                                                      | crittercuisine_5000_user_manual_page2.pdf |                                 |                                    | "size":94622,"updated":1745875760649000}}             |
    |                                     |                                |                                   |                                                      +-------------------------------------------+---------------------------------+------------------------------------+-------------------------------------------------------+
    |                                     |                                |                                   |                                                      | ...                                       | ...                             |  ...                               | ...                                                   |
    +-------------------------------------+--------------------------------+-----------------------------------+------------------------------------------------------+-------------------------------------------+---------------------------------+------------------------------------+-------------------------------------------------------+
    
  5. 在查询编辑器中,运行以下查询,根据对 ObjectRef 值数组的分析,从 Gemini 模型生成单个响应:

    WITH
      manuals AS (
        SELECT
          OBJ.GET_ACCESS_URL(manual, 'r') AS manual,
          ARRAY(
            SELECT OBJ.GET_ACCESS_URL(chunk, 'r') AS chunk
            FROM UNNEST(m1.chunks) AS chunk WITH OFFSET AS idx
            ORDER BY idx
          ) AS chunks
        FROM cymbal_pets.map_manual_to_chunks AS m1
      )
    SELECT ml_generate_text_llm_result AS Response
    FROM
      ML.GENERATE_TEXT(
        MODEL `cymbal_pets.gemini`,
        (
          SELECT
            (
              'Can you provide a page by page summary for the first 3 pages of the attached manual? Only write one line for each page. The pages are provided in serial order',
              manuals.chunks) AS prompt,
          FROM manuals
        ),
        STRUCT(TRUE AS FLATTEN_JSON_OUTPUT));

    结果类似于以下内容:

    +-------------------------------------------+
    | Response                                  |
    +-------------------------------------------+
    | Page 1: This manual is for the            |
    | CritterCuisine Pro 5000 automatic         |
    | pet feeder.                               |
    | Page 2: The manual covers safety          |
    | precautions, what's included,             |
    | and product overview.                     |
    | Page 3: The manual covers assembly,       |
    | initial setup, and programming the clock. |
    +-------------------------------------------+
    
  6. 在查询编辑器中,运行以下查询,根据对 ObjectRef 值数组的分析,从 Gemini 模型生成多个回答:

    WITH
      input_chunked_objrefs AS (
        SELECT row_id, offset, chunk_ref
        FROM
          (
            SELECT ROW_NUMBER() OVER () AS row_id, * FROM `cymbal_pets.map_manual_to_chunks`
          ) AS indexed_table
        LEFT JOIN
          UNNEST(indexed_table.chunks) AS chunk_ref
          WITH OFFSET
      ),
      get_access_urls AS (
        SELECT row_id, offset, chunk_ref, OBJ.GET_ACCESS_URL(chunk_ref, 'r') AS ObjectRefRuntime
        FROM input_chunked_objrefs
      ),
      valid_get_access_urls AS (
        SELECT *
        FROM get_access_urls
        WHERE ObjectRefRuntime['runtime_errors'] IS NULL
      ),
      ordered_output_objrefruntime_array AS (
        SELECT ARRAY_AGG(ObjectRefRuntime ORDER BY offset) AS ObjectRefRuntimeArray
        FROM valid_get_access_urls
        GROUP BY row_id
      )
    SELECT
      page1_summary,
      page2_summary,
      page3_summary
    FROM
      AI.GENERATE_TABLE(
        MODEL `cymbal_pets.gemini`,
        (
          SELECT
            (
              'Can you provide a page by page summary for the first 3 pages of the attached manual? Only write one line for each page. The pages are provided in serial order',
              ObjectRefRuntimeArray) AS prompt,
          FROM ordered_output_objrefruntime_array
        ),
        STRUCT(
          'page1_summary STRING, page2_summary STRING, page3_summary STRING' AS output_schema));

    结果类似于以下内容:

    +-----------------------------------------------+-------------------------------------------+----------------------------------------------------+
    | page1_summary                                 | page2_summary                             | page3_summary                                      |
    +-----------------------------------------------+-------------------------------------------+----------------------------------------------------+
    | This manual provides an overview of the       | This section explains how to program      | This page covers connecting the feeder to Wi-Fi    |
    | CritterCuisine Pro 5000 automatic pet feeder, | the feeder's clock, set feeding           | using the CritterCuisine Connect app,  remote      |
    | including its features, safety precautions,   | schedules, copy and delete meal settings, | feeding, managing feeding schedules, viewing       |
    | assembly instructions, and initial setup.     | manually feed your pet, record            | feeding logs, receiving low food alerts,           |
    |                                               | a voice message, and understand           | updating firmware, creating multiple pet profiles, |
    |                                               | the low food level indicator.             | sharing access with other users, and cleaning      |
    |                                               |                                           | and maintaining the feeder.                        |
    +-----------------------------------------------+-------------------------------------------+----------------------------------------------------+
    

清理

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.