使用 SQL 和 Python 分析多模态数据
本教程将向您展示如何使用 SQL 查询和 Python 用户定义的函数 (UDF) 分析多模态数据。
本教程使用的是公开的 Cymbal 宠物商店数据集中的产品目录。
目标
- 使用
ObjectRef
值将图片数据与结构化数据一起存储在 BigQuery 标准表中。 - 使用
AI.GENERATE_TABLE
函数根据标准表中的图片数据生成文本。 - 使用 Python UDF 转换现有图片以创建新图片。
- 使用 Python UDF 将 PDF 分块,以便进行进一步分析。
- 使用 Gemini 模型和
ML.GENERATE_TEXT
函数分析分块的 PDF 数据。 - 使用
ML.GENERATE_EMBEDDING
函数根据标准表中的图片数据生成嵌入。 - 使用
ObjectRef
值数组处理有序多模式数据。
费用
在本文档中,您将使用 Google Cloud的以下收费组件:
- BigQuery: you incur costs for the data that you process in BigQuery.
- BigQuery Python UDFs: you incur costs for using Python UDFs.
- Cloud Storage: you incur costs for the objects stored in Cloud Storage.
- Vertex AI: you incur costs for calls to Vertex AI models.
您可使用价格计算器根据您的预计使用情况来估算费用。
如需了解详情,请参阅以下价格页面:
准备工作
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the BigQuery, BigQuery Connection, Cloud Storage, and Vertex AI APIs.
所需的角色
如需获得完成本教程所需的权限,请让您的管理员为您授予以下 IAM 角色:
-
创建连接:
BigQuery Connection Admin (
roles/bigquery.connectionAdmin
) -
向关联的服务账号授予权限:
Project IAM Admin (
roles/resourcemanager.projectIamAdmin
) -
创建 Cloud Storage 存储桶:
Storage Admin (
roles/storage.admin
) -
创建数据集、模型、UDF 和表,以及运行 BigQuery 作业:
BigQuery Admin (
roles/bigquery.admin
) -
创建可让您读取和修改 Cloud Storage 对象的网址:
BigQuery ObjectRef Admin (
roles/bigquery.objectRefAdmin
)
如需详细了解如何授予角色,请参阅管理对项目、文件夹和组织的访问权限。
设置
在本部分中,您将创建本教程中使用的数据集、连接、表和模型。
创建数据集
创建一个 BigQuery 数据集来包含您在本教程中创建的对象:
在 Google Cloud 控制台中,前往 BigQuery 页面。
在探索器窗格中,选择您的项目。
展开
操作选项,然后点击创建数据集。系统随即会打开创建数据集窗格。对于数据集 ID,请输入
cymbal_pets
。点击创建数据集。
创建存储桶
创建一个 Cloud Storage 存储桶来存储转换后的对象:
创建连接
创建 Cloud 资源连接并获取连接的服务账号。 BigQuery 使用该连接来访问 Cloud Storage 中的对象:
转到 BigQuery 页面。
在浏览器窗格中,点击
添加数据。系统随即会打开添加数据对话框。
在过滤条件窗格中的数据源类型部分,选择商务应用。
或者,在搜索数据源字段中,输入
Vertex AI
。在精选数据源部分中,点击 Vertex AI。
点击 Vertex AI 模型:BigQuery 联合解决方案卡片。
在连接类型列表中,选择 Vertex AI 远程模型、远程函数和 BigLake(Cloud 资源)。
在连接 ID 字段中,输入
cymbal_conn
。点击创建连接。
点击转到连接。
在连接信息窗格中,复制服务账号 ID 以在后续步骤中使用。
向连接的服务账号授予权限
向连接的服务账号授予适当的角色以访问其他服务。您必须在您在准备工作部分创建或选择的项目中授予这些角色。在其他项目中授予这些角色会导致错误 bqcx-1234567890-xxxx@gcp-sa-bigquery-condel.iam.gserviceaccount.com
does not have the permission to access resource
。
授予对 Cloud Storage 存储桶的权限
向服务账号授予使用您创建的存储桶中对象的权限:
转至存储桶页面。
点击您创建的存储桶的名称。
点击权限。
点击
授予访问权限。 系统随即会打开授予访问权限对话框。在新的主账号字段中,输入您之前复制的服务账号 ID。
在选择角色字段中,选择 Cloud Storage,然后选择 Storage Object User。
点击保存。
授予使用 Vertex AI 模型的权限
向服务账号授予使用 Vertex AI 模型的权限:
前往 IAM 和管理页面。
点击
授予访问权限。 系统随即会打开授予访问权限对话框。在新的主账号字段中,输入您之前复制的服务账号 ID。
在选择角色字段中,选择 Vertex AI,然后选择 Vertex AI User。
点击保存。
创建示例数据表
创建表来存储 Cymbal 宠物产品信息。
创建 products
表
创建一个包含 Cymbal 宠物商品信息的标准表:
在 Google Cloud 控制台中,前往 BigQuery 页面。
在查询编辑器中,运行以下查询以创建
products
表:LOAD DATA OVERWRITE cymbal_pets.products FROM FILES( format = 'avro', uris = [ 'gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/tables/products/products_*.avro']);
创建 product_images
表
创建一个包含 Cymbal 宠物商品图片的对象表:
在 BigQuery 页面的查询编辑器中,运行以下查询以创建
product_images
表:CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images WITH CONNECTION `us.cymbal_conn` OPTIONS ( object_metadata = 'SIMPLE', uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png'], max_staleness = INTERVAL 30 MINUTE, metadata_cache_mode = AUTOMATIC);
创建 product_manuals
表
创建一个包含 Cymbal 宠物产品手册的对象表:
在 BigQuery 页面的查询编辑器中,运行以下查询以创建
product_manuals
表:CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_manuals WITH CONNECTION `us.cymbal_conn` OPTIONS ( object_metadata = 'SIMPLE', uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/documents/*.pdf']);
创建文本生成模型
创建表示 Vertex AI Gemini 模型的 BigQuery ML 远程模型:
在 BigQuery 页面的查询编辑器中,运行以下查询以创建远程模型:
CREATE OR REPLACE MODEL `cymbal_pets.gemini` REMOTE WITH CONNECTION `us.cymbal_conn` OPTIONS (ENDPOINT = 'gemini-2.0-flash');
创建嵌入生成模型
创建表示 Vertex AI 多模态嵌入模型的 BigQuery ML 远程模型:
在 BigQuery 页面的查询编辑器中,运行以下查询以创建远程模型:
CREATE OR REPLACE MODEL `cymbal_pets.embedding_model` REMOTE WITH CONNECTION `us.cymbal_conn` OPTIONS (ENDPOINT = 'multimodalembedding@001');
创建包含多模态数据的 products_mm
表
创建一个 products_mm
表,其中包含一个 image
列,该列使用 product_images
对象表中的商品图片进行填充。创建的 image
列是使用 ObjectRef
格式的 STRUCT
列。
在 BigQuery 页面的查询编辑器中,运行以下查询以创建
products_mm
表并填充image
列:CREATE OR REPLACE TABLE cymbal_pets.products_mm AS SELECT products.* EXCEPT (uri), ot.ref AS image FROM cymbal_pets.products INNER JOIN cymbal_pets.product_images ot ON ot.uri = products.uri;
在 BigQuery 页面的查询编辑器中,运行以下查询以查看
image
列数据:SELECT product_name, image FROM cymbal_pets.products_mm`
结果类似于以下内容:
+--------------------------------+--------------------------------------+-----------------------------------------------+------------------------------------------------+ | product_name | image.uri | image.version | image.authorizer | image.details | +--------------------------------+--------------------------------------+-----------------------------------------------+------------------------------------------------+ | AquaClear Aquarium Background | gs://cloud-samples-data/bigquery/ | 1234567891011 | myproject.region.myconnection | {"gcs_metadata":{"content_type":"image/png", | | | tutorials/cymbal-pets/images/ | | | "md5_hash":"494f63b9b137975ff3e7a11b060edb1d", | | | aquaclear-aquarium-background.png | | | "size":1282805,"updated":1742492680017000}} | +--------------------------------+--------------------------------------+-----------------------------------------------+------------------------------------------------+ | AquaClear Aquarium | gs://cloud-samples-data/bigquery/ | 2345678910112 | myproject.region.myconnection | {"gcs_metadata":{"content_type":"image/png", | | Gravel Vacuum | tutorials/cymbal-pets/images/ | | | "md5_hash":"b7bfc2e2641a77a402a1937bcf0003fd", | | | aquaclear-aquarium-gravel-vacuum.png | | | "size":820254,"updated":1742492682411000}} | +--------------------------------+--------------------------------------+-----------------------------------------------+------------------------------------------------+ | ... | ... | ... | | ... | +--------------------------------+--------------------------------------+-----------------------------------------------+------------------------------------------------+
使用 Gemini 模型生成商品信息
使用 Gemini 模型为宠物店商品生成以下数据:
- 向
products_mm
表添加image_description
列。 - 填充
products_mm
表的animal_type
、search_keywords
和subcategory
列。 - 运行一个查询,以返回每个商品品牌的说明以及该品牌的商品数量。品牌说明是通过分析相应品牌的所有商品(包括商品图片)的产品信息生成的。
在 BigQuery 页面的查询编辑器中,运行以下查询以创建并填充
image_description
列:CREATE OR REPLACE TABLE cymbal_pets.products_mm AS SELECT product_id, product_name, brand, category, subcategory, animal_type, search_keywords, price, description, inventory_level, supplier_id, average_rating, image, image_description FROM AI.GENERATE_TABLE( MODEL `cymbal_pets.gemini`, ( SELECT ('Can you describe the following image?', OBJ.GET_ACCESS_URL(image, 'r')) AS prompt, * FROM cymbal_pets.products_mm ), STRUCT('image_description STRING' AS output_schema));
在 BigQuery 页面的查询编辑器中,运行以下查询,以使用生成的数据更新
animal_type
、search_keywords
和subcategory
列:UPDATE cymbal_pets.products_mm p SET p.animal_type = s.animal_type, p.search_keywords = s.search_keywords, p.subcategory = s.subcategory FROM ( SELECT animal_type, search_keywords, subcategory, uri FROM AI.GENERATE_TABLE( MODEL `cymbal_pets.gemini`, ( SELECT ( 'For the image of a pet product, concisely generate the following metadata.' '1) animal_type and 2) 5 SEO search keywords, and 3) product subcategory', OBJ.GET_ACCESS_URL(image, 'r'), description) AS prompt, image.uri AS uri, FROM cymbal_pets.products_mm ), STRUCT( 'animal_type STRING, search_keywords ARRAY<STRING>, subcategory STRING' AS output_schema, 100 AS max_output_tokens)) ) s WHERE p.image.uri = s.uri;
在 BigQuery 页面的查询编辑器中,运行以下查询以查看生成的数据:
SELECT product_name, image_description, animal_type, search_keywords, subcategory, FROM cymbal_pets.products_mm;
结果类似于以下内容:
+--------------------------------+-------------------------------------+-------------+------------------------+------------------+ | product_name | image.description | animal_type | search_keywords | subcategory | +--------------------------------+-------------------------------------+-------------+------------------------+------------------+ | AquaClear Aquarium Background | The image shows a colorful coral | fish | aquarium background | aquarium decor | | | reef backdrop. The background is a | | fish tank backdrop | | | | blue ocean with a bright light... | | coral reef decor | | | | | | underwater scenery | | | | | | aquarium decoration | | +--------------------------------+-------------------------------------+-------------+------------------------+------------------+ | AquaClear Aquarium | The image shows a long, clear | fish | aquarium gravel vacuum | aquarium | | Gravel Vacuum | plastic tube with a green hose | | aquarium cleaning | cleaning | | | attached to one end. The tube... | | aquarium maintenance | | | | | | fish tank cleaning | | | | | | gravel siphon | | +--------------------------------+-------------------------------------+-------------+------------------------+------------------+ | ... | ... | ... | ... | ... | +--------------------------------+-------------------------------------+-------------+------------------------+------------------+
在 BigQuery 页面的查询编辑器中,运行以下查询以生成每个商品品牌的说明,以及该品牌的商品数量统计信息:
SELECT brand, brand_description, cnt FROM AI.GENERATE_TABLE( MODEL `cymbal_pets.gemini`, ( SELECT brand, COUNT(*) AS cnt, ( 'Use the images and text to give one concise brand description for a website brand page.' 'Return the description only.', ARRAY_AGG(OBJ.GET_ACCESS_URL(image, 'r')), ARRAY_AGG(description), ARRAY_AGG(category), ARRAY_AGG(subcategory)) AS prompt FROM cymbal_pets.products_mm GROUP BY brand ), STRUCT('brand_description STRING' AS output_schema)) ORDER BY cnt DESC;
结果类似于以下内容:
+--------------+-------------------------------------+-----+ | brand | brand.description | cnt | +--------------+-------------------------------------+-----+ | AquaClear | AquaClear is a brand of aquarium | 33 | | | and pond care products that offer | | | | a wide range of solutions for... | | +--------------+-------------------------------------+-----+ | Ocean | Ocean Bites is a brand of cat food | 28 | | Bites | that offers a variety of recipes | | | | and formulas to meet the specific.. | | +--------------+-------------------------------------+-----+ | ... | ... |... | +--------------+-------------------------------------+-----+
创建 Python UDF 以转换商品图片
创建一个 Python UDF 以将产品图片转换为灰度图片。
Python UDF 使用开源库,还使用并行执行来同时转换多个图片。
在 BigQuery 页面的查询编辑器中,运行以下查询以创建
to_grayscale
UDF:CREATE OR REPLACE FUNCTION cymbal_pets.to_grayscale(src_json STRING, dst_json STRING) RETURNS STRING LANGUAGE python WITH CONNECTION `us.cymbal_conn` OPTIONS (entry_point='to_grayscale', runtime_version='python-3.11', packages=['numpy', 'opencv-python']) AS """ import cv2 as cv import numpy as np from urllib.request import urlopen, Request import json # Transform the image to grayscale. def to_grayscale(src_ref, dst_ref): src_json = json.loads(src_ref) srcUrl = src_json["access_urls"]["read_url"] dst_json = json.loads(dst_ref) dstUrl = dst_json["access_urls"]["write_url"] req = urlopen(srcUrl) arr = np.asarray(bytearray(req.read()), dtype=np.uint8) img = cv.imdecode(arr, -1) # 'Load it as it is' # Convert the image to grayscale gray_image = cv.cvtColor(img, cv.COLOR_BGR2GRAY) # Send POST request to the URL _, img_encoded = cv.imencode('.png', gray_image) req = Request(url=dstUrl, data=img_encoded.tobytes(), method='PUT', headers = { "Content-Type": "image/png", }) with urlopen(req) as f: pass return dst_ref """;
美化商品图片
创建 products_grayscale
表,其中包含一个 ObjectRef
列,该列包含灰度图片的目标路径和授权者。目标路径派生自原始图片路径。
创建表格后,运行 to_grayscale
函数以创建灰度图片,将其写入 Cloud Storage 存储桶,然后返回包含灰度图片的访问网址和元数据的 ObjectRefRuntime
值。
在 BigQuery 页面的查询编辑器中,运行以下查询以创建
products_grayscale
表:CREATE OR REPLACE TABLE cymbal_pets.products_grayscale AS SELECT product_id, product_name, image, OBJ.MAKE_REF( CONCAT('gs://BUCKET/cymbal-pets-images/grayscale/', REGEXP_EXTRACT(image.uri, r'([^/]+)$')), 'us.cymbal_conn') AS gray_image FROM cymbal_pets.products_mm;
将
BUCKET
替换为您创建的存储桶的名称。在 BigQuery 页面的查询编辑器中,运行以下查询以创建灰度图片,将其写入 Cloud Storage 存储桶,然后返回包含灰度图片的访问网址和元数据的
ObjectRefRuntime
值:SELECT cymbal_pets.to_grayscale( TO_JSON_STRING(OBJ.GET_ACCESS_URL(image, 'r')), TO_JSON_STRING(OBJ.GET_ACCESS_URL(gray_image, 'rw'))) FROM cymbal_pets.products_grayscale;
结果类似于以下内容:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | f0 | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | {"access_urls":{"expiry_time":"2025-04-26T03:00:48Z", | | "read_url":"https://storage.googleapis.com/mybucket/cymbal-pets-images%2Fgrayscale%2Focean-bites-salmon-%26-tuna-cat-food.png?additional_read URL_information", | | "write_url":"https://storage.googleapis.com/myproject/cymbal-pets-images%2Fgrayscale%2Focean-bites-salmon-%26-tuna-cat-food.png?additional_write URL_information"}, | | "objectref":{"authorizer":"myproject.region.myconnection","uri":"gs://myproject/cymbal-pets-images/grayscale/ocean-bites-salmon-&-tuna-cat-food.png"}} | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | {"access_urls":{"expiry_time":"2025-04-26T03:00:48Z", | | "read_url":"https://storage.googleapis.com/mybucket/cymbal-pets-images%2Fgrayscale%2Ffluffy-buns-guinea-pig-tunnel.png?additional _read URL_information", | | "write_url":"https://storage.googleapis.com/myproject/cymbal-pets-images%2Fgrayscale%2Focean-bites-salmon-%26-tuna-cat-food.png?additional_write_URL_information"}, | | "objectref":{"authorizer":"myproject.region.myconnection","uri":"gs://myproject/cymbal-pets-images%2Fgrayscale%2Ffluffy-buns-guinea-pig-tunnel.png"}} | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ... | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
创建 Python UDF 以分块处理 PDF 数据
创建一个 Python UDF,将包含 Cymbal 宠物产品手册的 PDF 对象分块为多个部分。
PDF 文件通常非常大,可能无法在对生成式 AI 模型的单次调用中处理完毕。通过对 PDF 进行分块处理,您可以将 PDF 数据存储为可用于模型的格式,以便更轻松地进行分析。
在 BigQuery 页面的查询编辑器中,运行以下查询以创建
chunk_pdf
UDF:-- This function chunks the product manual PDF into multiple parts. -- The function accepts an ObjectRefRuntime value for the PDF file and the chunk size. -- It then parses the PDF, chunks the contents, and returns an array of chunked text. CREATE OR REPLACE FUNCTION cymbal_pets.chunk_pdf(src_json STRING, chunk_size INT64, overlap_size INT64) RETURNS ARRAY<STRING> LANGUAGE python WITH CONNECTION `us.cymbal_conn` OPTIONS (entry_point='chunk_pdf', runtime_version='python-3.11', packages=['pypdf']) AS """ import io import json from pypdf import PdfReader # type: ignore from urllib.request import urlopen, Request def chunk_pdf(src_ref: str, chunk_size: int, overlap_size: int) -> str: src_json = json.loads(src_ref) srcUrl = src_json["access_urls"]["read_url"] req = urlopen(srcUrl) pdf_file = io.BytesIO(bytearray(req.read())) reader = PdfReader(pdf_file, strict=False) # extract and chunk text simultaneously all_text_chunks = [] curr_chunk = "" for page in reader.pages: page_text = page.extract_text() if page_text: curr_chunk += page_text # split the accumulated text into chunks of a specific size with overlaop # this loop implements a sliding window approach to create chunks while len(curr_chunk) >= chunk_size: split_idx = curr_chunk.rfind(" ", 0, chunk_size) if split_idx == -1: split_idx = chunk_size actual_chunk = curr_chunk[:split_idx] all_text_chunks.append(actual_chunk) overlap = curr_chunk[split_idx + 1 : split_idx + 1 + overlap_size] curr_chunk = overlap + curr_chunk[split_idx + 1 + overlap_size :] if curr_chunk: all_text_chunks.append(curr_chunk) return all_text_chunks """;
分析 PDF 数据
运行 chunk_pdf
函数以将 product_manuals
表中的 PDF 数据分块,然后创建一个 product_manual_chunk_strings
表,其中每行包含一个 PDF 分块。对 product_manual_chunk_strings
数据使用 Gemini 模型,以汇总产品手册中找到的法律信息。
在 BigQuery 页面的查询编辑器中,运行以下查询以创建
product_manual_chunk_strings
表:CREATE OR REPLACE TABLE cymbal_pets.product_manual_chunk_strings AS SELECT chunked FROM cymbal_pets.product_manuals, UNNEST (cymbal_pets.chunk_pdf( TO_JSON_STRING( OBJ.GET_ACCESS_URL(OBJ.MAKE_REF(uri, 'us.cymbal_conn'), 'r')), 1000, 100 )) as chunked;
在 BigQuery 页面的查询编辑器中,运行以下查询,以使用 Gemini 模型分析 PDF 数据:
SELECT ml_generate_text_llm_result FROM ML.GENERATE_TEXT( MODEL `cymbal_pets.gemini`, ( SELECT ( 'Can you summarize the product manual as bullet points? Highlight the legal clauses', chunked) AS prompt, FROM cymbal_pets.product_manual_chunk_strings ), STRUCT( TRUE AS FLATTEN_JSON_OUTPUT));
结果类似于以下内容:
+-------------------------------------------------------------------------------------------------------------------------------------------+ | ml_generate_text_llm_result | +-------------------------------------------------------------------------------------------------------------------------------------------+ | ## CritterCuisine Pro 5000 Automatic Pet Feeder Manual Summary: | | | | **Safety:** | | | | * **Stability:** Place feeder on a level, stable surface to prevent tipping. | | * **Power Supply:** Only use the included AC adapter. Using an incompatible adapter can damage the unit and void the warranty. | | * **Cord Safety:** Keep the power cord out of reach of pets to prevent chewing or entanglement. | | * **Children:** Supervise children around the feeder. This is not a toy. | | * **Pet Health:** Consult your veterinarian before using an automatic feeder if your pet has special dietary needs, health conditions, or | +-------------------------------------------------------------------------------------------------------------------------------------------+ | ## Product Manual Summary: | | | | **6.3 Manual Feeding:** | | | | * Press MANUAL button to dispense a single portion (Meal 1 size). **(Meal Enabled)** | | | | **6.4 Recording a Voice Message:** | | | | * Press and hold VOICE button. | | * Speak clearly into the microphone (up to 10 seconds). | | * Release VOICE button to finish recording. | | * Briefly press VOICE button to play back the recording. | | * To disable the voice message, record a blank message (hold VOICE button for 10 seconds without speaking). **(Meal Enabled)** | | | | **6.5 Low Food Level Indicator:** | +-------------------------------------------------------------------------------------------------------------------------------------------+ | ... | +-------------------------------------------------------------------------------------------------------------------------------------------+
生成嵌入并执行向量搜索
从图片数据生成嵌入,然后使用向量搜索功能通过嵌入返回类似图片。
在生产环境中,我们建议您先创建向量索引,然后再运行向量搜索。矢量索引可让您更快地执行矢量搜索,同时需要降低召回率并返回更多近似结果。
在 BigQuery 页面的查询编辑器中,运行以下查询以创建
products_embeddings
表:CREATE OR REPLACE TABLE cymbal_pets.products_embedding AS SELECT product_id, ml_generate_embedding_result as embedding, content as image FROM ML.GENERATE_EMBEDDING( MODEL `cymbal_pets.embedding_model`, ( SELECT OBJ.GET_ACCESS_URL(image, 'r') as content, image, product_id FROM cymbal_pets.products_mm ), STRUCT () );
在 BigQuery 页面的查询编辑器中,运行以下查询以执行向量搜索,以返回与给定输入图片相似的商品图片:
SELECT * FROM VECTOR_SEARCH( TABLE cymbal_pets.products_embedding, 'embedding', (SELECT ml_generate_embedding_result as embedding FROM ML.GENERATE_EMBEDDING( MODEL `cymbal_pets.embedding_model`, (SELECT OBJ.FETCH_METADATA(OBJ.MAKE_REF('gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/cozy-naps-cat-scratching-post-with-condo.png', 'us.cymbal_conn')) as content) )) );
结果类似于以下内容:
+-----------------+-----------------+----------------+----------------------------------------------+--------------------+-------------------------------+------------------------------------------------+----------------+ | query.embedding | base.product_id | base.embedding | base.image.uri | base.image.version | base.image.authorizer | base.image.details | distance | +-----------------+-----------------+----------------+----------------------------------------------+--------------------+-------------------------------+------------------------------------------------+----------------+ | -0.0112330541 | 181 | -0.0112330541 | gs://cloud-samples-data/bigquery/ | 12345678910 | myproject.region.myconnection | {"gcs_metadata":{"content_type": | 0.0 | | 0.0142525584 | | 0.0142525584 | tutorials/cymbal-pets/images/ | | | "image/png","md5_hash":"21234567hst16555w60j", | | | 0.0135886827 | | 0.0135886827 | cozy-naps-cat-scratching-post-with-condo.png | | | "size":828318,"updated":1742492688982000}} | | | 0.0149955815 | | 0.0149955815 | | | | | | | ... | | ... | | | | | | | | | | | | | | | | | | | | | | | | +-----------------+-----------------+----------------+----------------------------------------------+--------------------+-------------------------------+------------------------------------------------+----------------+ | -0.0112330541 | 187 | -0.0190353896 | gs://cloud-samples-data/bigquery/ | 23456789101 | myproject.region.myconnection | {"gcs_metadata":{"content_type": | 0.4216330832.. | | 0.0142525584 | | 0.0116206668 | tutorials/cymbal-pets/images/ | | | "image/png","md5_hash":"7328728fhakd9937djo4", | | | 0.0135886827 | | 0.0136198215 | cozy-naps-cat-scratching-post-with-bed.png | | | "size":860113,"updated":1742492688774000}} | | | 0.0149955815 | | 0.0173457414 | | | | | | | ... | | ... | | | | | | | | | | | | | | | | | | | | | | | | +---------C--------+-----------------+----------------+----------------------------------------------+--------------------+-------------------------------+------------------------------------------------+----------------+ | ... | ... | ... | ... | ... | ... | ... | ... | +-----------------+-----------------+----------------+----------------------------------------------+--------------------+-------------------------------+------------------------------------------------+----------------+
使用 ObjectRef
值数组处理有序多模式数据
本指南介绍如何完成以下任务:
- 重新创建
product_manuals
表,使其同时包含Crittercuisine 5000
产品手册的 PDF 文件,以及该手册的每个页面的 PDF 文件。 - 创建一个表,将手册映射到其分块。表示完整手册的
ObjectRef
值存储在STRUCT<uri STRING, version STRING, authorizer STRING, details JSON>>
列中。表示手册页的ObjectRef
值存储在ARRAY<STRUCT<uri STRING, version STRING, authorizer STRING, details JSON>>
列中。 - 一起分析一个
ObjectRef
值数组,以返回单个生成的值。 - 分别分析
ObjectRef
值数组,并为每个数组值返回生成的值。
在分析任务中,您需要将 ObjectRef
值数组转换为 ObjectRefRuntime
值的有序列表,然后将该列表传递给 Gemini 模型,并在问题中指定 ObjectRefRuntime
值。ObjectRefRuntime
值提供签名网址,模型会使用这些网址访问 Cloud Storage 中的对象信息。
如需使用 ObjectRef
值数组处理有序多模态数据,请按以下步骤操作:
转到 BigQuery 页面。
在查询编辑器中,运行以下查询以重新创建
product_manuals
表:CREATE OR REPLACE EXTERNAL TABLE `cymbal_pets.product_manuals` WITH CONNECTION `us.cymbal_conn` OPTIONS ( object_metadata = 'SIMPLE', uris = [ 'gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/documents/*.pdf', 'gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/document_chunks/*.pdf']);
在查询编辑器中,运行以下查询以将 PDF 数据写入
map_manual_to_chunks
表:-- Extract the file and chunks into a single table. -- Store the chunks in the chunks column as array of ObjectRefs (ordered by page number) CREATE OR REPLACE TABLE cymbal_pets.map_manual_to_chunks AS SELECT ARRAY_AGG(m1.ref)[0] manual, ARRAY_AGG(m2.ref ORDER BY m2.ref.uri) chunks FROM cymbal_pets.product_manuals m1 JOIN cymbal_pets.product_manuals m2 ON REGEXP_EXTRACT(m1.uri, r'.*/([^.]*).[^/]+') = REGEXP_EXTRACT(m2.uri, r'.*/([^.]*)_page[0-9]+.[^/]+') GROUP BY m1.uri;
在查询编辑器中,运行以下查询以查看
map_manual_to_chunks
表中的 PDF 数据:SELECT * FROM cymbal_pets.map_manual_to_chunks;
结果类似于以下内容:
+-------------------------------------+--------------------------------+-----------------------------------+------------------------------------------------------+-------------------------------------------+---------------------------------+------------------------------------+-------------------------------------------------------+ | manual.uri | manual.version | manual.authorizer | manual.details | chunks.uri | chunks.version | chunks.authorizer | chunks.details | +-------------------------------------+--------------------------------+-----------------------------------+------------------------------------------------------+-------------------------------------------+---------------------------------+------------------------------------+-------------------------------------------------------+ | gs://cloud-samples-data/bigquery/ | 1742492785900455 | myproject.region.myconnection | {"gcs_metadata":{"content_type":"application/pef", | gs://cloud-samples-data/bigquery/ | 1745875761227129 | myproject.region.myconnection | {"gcs_metadata":{"content_type":"application/pdf", | | tutorials/cymbal-pets/documents/ | | | "md5_hash":"c9032b037693d15a33210d638c763d0e", | tutorials/cymbal-pets/documents/ | | | "md5_hash":"5a1116cce4978ec1b094d8e8b49a1d7c", | | crittercuisine_5000_user_manual.pdf | | | "size":566105,"updated":1742492785941000}} | crittercuisine_5000_user_manual_page1.pdf | | | "size":504583,"updated":1745875761266000}} | | | | | +-------------------------------------------+---------------------------------+------------------------------------+-------------------------------------------------------+ | | | | | crittercuisine_5000_user_manual_page1.pdf | 1745875760613874 | myproject.region.myconnection | {"gcs_metadata":{"content_type":"application/pdf", | | | | | | tutorials/cymbal-pets/documents/ | | | "md5_hash":"94d03ec65d28b173bc87eac7e587b325", | | | | | | crittercuisine_5000_user_manual_page2.pdf | | | "size":94622,"updated":1745875760649000}} | | | | | +-------------------------------------------+---------------------------------+------------------------------------+-------------------------------------------------------+ | | | | | ... | ... | ... | ... | +-------------------------------------+--------------------------------+-----------------------------------+------------------------------------------------------+-------------------------------------------+---------------------------------+------------------------------------+-------------------------------------------------------+
在查询编辑器中,运行以下查询,根据对
ObjectRef
值数组的分析,从 Gemini 模型生成单个响应:WITH manuals AS ( SELECT OBJ.GET_ACCESS_URL(manual, 'r') AS manual, ARRAY( SELECT OBJ.GET_ACCESS_URL(chunk, 'r') AS chunk FROM UNNEST(m1.chunks) AS chunk WITH OFFSET AS idx ORDER BY idx ) AS chunks FROM cymbal_pets.map_manual_to_chunks AS m1 ) SELECT ml_generate_text_llm_result AS Response FROM ML.GENERATE_TEXT( MODEL `cymbal_pets.gemini`, ( SELECT ( 'Can you provide a page by page summary for the first 3 pages of the attached manual? Only write one line for each page. The pages are provided in serial order', manuals.chunks) AS prompt, FROM manuals ), STRUCT(TRUE AS FLATTEN_JSON_OUTPUT));
结果类似于以下内容:
+-------------------------------------------+ | Response | +-------------------------------------------+ | Page 1: This manual is for the | | CritterCuisine Pro 5000 automatic | | pet feeder. | | Page 2: The manual covers safety | | precautions, what's included, | | and product overview. | | Page 3: The manual covers assembly, | | initial setup, and programming the clock. | +-------------------------------------------+
在查询编辑器中,运行以下查询,根据对
ObjectRef
值数组的分析,从 Gemini 模型生成多个回答:WITH input_chunked_objrefs AS ( SELECT row_id, offset, chunk_ref FROM ( SELECT ROW_NUMBER() OVER () AS row_id, * FROM `cymbal_pets.map_manual_to_chunks` ) AS indexed_table LEFT JOIN UNNEST(indexed_table.chunks) AS chunk_ref WITH OFFSET ), get_access_urls AS ( SELECT row_id, offset, chunk_ref, OBJ.GET_ACCESS_URL(chunk_ref, 'r') AS ObjectRefRuntime FROM input_chunked_objrefs ), valid_get_access_urls AS ( SELECT * FROM get_access_urls WHERE ObjectRefRuntime['runtime_errors'] IS NULL ), ordered_output_objrefruntime_array AS ( SELECT ARRAY_AGG(ObjectRefRuntime ORDER BY offset) AS ObjectRefRuntimeArray FROM valid_get_access_urls GROUP BY row_id ) SELECT page1_summary, page2_summary, page3_summary FROM AI.GENERATE_TABLE( MODEL `cymbal_pets.gemini`, ( SELECT ( 'Can you provide a page by page summary for the first 3 pages of the attached manual? Only write one line for each page. The pages are provided in serial order', ObjectRefRuntimeArray) AS prompt, FROM ordered_output_objrefruntime_array ), STRUCT( 'page1_summary STRING, page2_summary STRING, page3_summary STRING' AS output_schema));
结果类似于以下内容:
+-----------------------------------------------+-------------------------------------------+----------------------------------------------------+ | page1_summary | page2_summary | page3_summary | +-----------------------------------------------+-------------------------------------------+----------------------------------------------------+ | This manual provides an overview of the | This section explains how to program | This page covers connecting the feeder to Wi-Fi | | CritterCuisine Pro 5000 automatic pet feeder, | the feeder's clock, set feeding | using the CritterCuisine Connect app, remote | | including its features, safety precautions, | schedules, copy and delete meal settings, | feeding, managing feeding schedules, viewing | | assembly instructions, and initial setup. | manually feed your pet, record | feeding logs, receiving low food alerts, | | | a voice message, and understand | updating firmware, creating multiple pet profiles, | | | the low food level indicator. | sharing access with other users, and cleaning | | | | and maintaining the feeder. | +-----------------------------------------------+-------------------------------------------+----------------------------------------------------+
清理
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.