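Analyze multimodal data in Python with BigQuery DataFrames

This tutorial shows you how to analyze multimodal data in a Python notebook by using BigQuery DataFrames classes and methods.

This tutorial uses the product catalog from the public Cymbal pet store dataset.

To upload a notebook that is already populated with the tasks covered in this tutorial, see BigFrames Multimodal DataFrame.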

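Objectives

Create multimodal DataFrames.
Combine structured and unstructured data in a DataFrame.
Transform images.
Generate text and embeddings based on image data.
Chunk PDFs for further analysis.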

Costs

In this document, you use the following billable components of Google Cloud:

BigQuery: you incur costs for the data that you process in BigQuery.
BigQuery Python UDFs: you incur costs for using BigQuery DataFrames image transformation and chunk PDF methods.
Cloud Storage: you incur costs for the objects stored in Cloud Storage.
Vertex AI: you incur costs for calls to Vertex AI models.

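To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

For more information, see the pricing pages for the components listed above.

Before you begin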

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Verify that billing is enabled for your Google Cloud project.
Enable the BigQuery, BigQuery Connection, Cloud Storage, and Vertex AI APIs.

Required roles

To get the permissions that you need to complete this tutorial, ask your administrator to grant you the following IAM roles:

Create a connection: BigQuery Connection Admin (roles/bigquery.connectionAdmin)
Grant permissions to the connection's service account: Project IAM Admin (roles/resourcemanager.projectIamAdmin)
Create a Cloud Storage bucket: Storage Admin (roles/storage.admin)
Run BigQuery jobs: BigQuery User (roles/bigquery.user)
Create and call Python UDFs: BigQuery Data Editor (roles/bigquery.dataEditor)
Create URLs that let you read and modify Cloud Storage objects: BigQuery ObjectRef Admin (roles/bigquery.objectRefAdmin)
Use notebooks:
- BigQuery Read Session User (roles/bigquery.readSessionUser)
- Notebook Runtime User (roles/aiplatform.notebookRuntimeUser)
- Code Creator (roles/dataform.codeCreator)

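For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.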

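Set up

In this section, you create the Cloud Storage bucket, connection, and notebook used in this tutorial.

Create a bucket

Create a Cloud Storage bucket for storing transformed objects:

In the Google Cloud console, go to the Buckets page.
Click Create.
On the Create a bucket page, in the Get started section, enter a globally unique name that meets the bucket name requirements.
Click Create.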
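
Before you type a name into the console, it can help to check it against the bucket name requirements. The following is a minimal sketch that covers only a subset of those rules (length, allowed characters, start and end characters, and the reserved "goog" prefix). The function name `looks_like_valid_bucket_name` and the sample names are hypothetical, the 63-character limit is deliberately conservative for names that contain dots, and the check says nothing about global uniqueness.

```python
import re

# Lowercase letters, digits, dashes, underscores, and dots only;
# the name must start and end with a letter or digit.
_ALLOWED = re.compile(r"^[a-z0-9][a-z0-9._-]*[a-z0-9]$")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes a few basic bucket-name checks.

    This is a partial, illustrative check, not the official validation.
    """
    if not 3 <= len(name) <= 63:
        return False
    if not _ALLOWED.match(name):
        return False
    # Names can't begin with "goog" or contain "google".
    if name.startswith("goog") or "google" in name:
        return False
    return True


for candidate in ["my-cymbal-pets-bucket", "My_Bucket", "goog-bucket"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```

A name that passes this check can still be rejected by Cloud Storage, for example because another project already uses it.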

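Create a connection

Create a Cloud resource connection and get the connection's service account. BigQuery uses the connection to access objects in Cloud Storage.

Go to the BigQuery page.
In the Explorer pane, click Add data.

The Add data dialog opens.
In the Filter By pane, in the Data Source Type section, select Business Applications.

Alternatively, in the Search for data sources field, you can enter Vertex AI.
In the Featured data sources section, click Vertex AI.
Click the Vertex AI Models: BigQuery Federation solution card.
In the Connection type list, select Vertex AI remote models, remote functions and BigLake (Cloud Resource).
In the Connection ID field, enter bigframes-default-connection.
Click Create connection.
Click Go to connection.
In the Connection info pane, copy the service account ID for use in a later step.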
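
If you prefer the command line to the console steps above, the bq tool can create and inspect a Cloud resource connection. The sketch below only builds the command strings so that you can review them before running them yourself. The helper names, the project ID `my-project`, and the `US` location are hypothetical placeholders, not values from this tutorial.

```python
import shlex

DEFAULT_CONNECTION_ID = "bigframes-default-connection"


def build_create_connection_command(
    project_id: str, location: str, connection_id: str = DEFAULT_CONNECTION_ID
) -> str:
    """Build a `bq mk --connection` command for a Cloud resource connection."""
    return shlex.join(
        [
            "bq",
            "mk",
            "--connection",
            f"--location={location}",
            f"--project_id={project_id}",
            "--connection_type=CLOUD_RESOURCE",
            connection_id,
        ]
    )


def build_show_connection_command(
    project_id: str, location: str, connection_id: str = DEFAULT_CONNECTION_ID
) -> str:
    """Build a `bq show --connection` command; its output includes the service account ID."""
    return shlex.join(
        ["bq", "show", "--connection", f"{project_id}.{location}.{connection_id}"]
    )


# Hypothetical project and location, for illustration only.
print(build_create_connection_command("my-project", "US"))
print(build_show_connection_command("my-project", "US"))
```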

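Grant permissions to the connection's service account

Grant the connection's service account the roles that it needs to access Cloud Storage and Vertex AI. You must grant these roles in the same project that you created or selected in the Before you begin section.

To grant the roles, follow these steps:

Go to the IAM & Admin page.
Click Grant access.
In the New principals field, enter the service account ID that you copied earlier.
In the Select a role field, select Cloud Storage, and then select Storage Object User.
Click Add another role.
In the Select a role field, select Vertex AI, and then select Vertex AI User.
Click Save.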
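
The same two role grants can also be expressed as gcloud commands. As a rough sketch, the snippet below builds one `gcloud projects add-iam-policy-binding` command per role, using the role IDs that correspond to Storage Object User and Vertex AI User. The helper name, the project ID, and the service account address are hypothetical placeholders; substitute the service account ID that you copied from the connection.

```python
import shlex

# Role IDs for Storage Object User and Vertex AI User.
CONNECTION_ROLES = ("roles/storage.objectUser", "roles/aiplatform.user")


def build_grant_commands(project_id: str, service_account_id: str) -> list[str]:
    """Build one IAM policy-binding command per role for the service account."""
    member = f"serviceAccount:{service_account_id}"
    return [
        shlex.join(
            [
                "gcloud",
                "projects",
                "add-iam-policy-binding",
                project_id,
                f"--member={member}",
                f"--role={role}",
            ]
        )
        for role in CONNECTION_ROLES
    ]


# Hypothetical values, for illustration only.
for command in build_grant_commands(
    "my-project", "connection-sa@example.iam.gserviceaccount.com"
):
    print(command)
```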

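Create a notebook

Create a notebook where you can run Python code:

Go to the BigQuery page.
In the tab bar of the editor pane, click the drop-down arrow next to SQL query, and then click Notebook.
In the Start with a template pane, click Close.
Click Connect > Connect to a runtime.
If you have an existing runtime, accept the default settings and click Connect. If you don't have an existing runtime, select Create new Runtime, and then click Connect.

It might take several minutes for the runtime to get set up.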

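Create a multimodal DataFrame

Create a multimodal DataFrame that integrates structured and unstructured data by using the from_glob_path method of the Session class:

In the notebook, create a code cell and copy the following code into it: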

import bigframes

# Option that controls the display width of image and video previews
bigframes.options.display.blob_display_width = 300

import bigframes.pandas as bpd

# Create blob columns from wildcard path.
df_image = bpd.from_glob_path(
    "gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*", name="image"
)
# Alternatively, create the blob column from a string URI column:
# df = bpd.DataFrame({"uri": ["gs://<my_bucket>/<my_file_0>", "gs://<my_bucket>/<my_file_1>"]})
# df["blob_col"] = df["uri"].str.to_blob()

# From an existing object table
# df = bpd.read_gbq_object_table("<my_object_table>", name="blob_col")

# Keep only the first 5 images, then preview the content of the multimodal DataFrame
df_image = df_image.head(5)
df_image

Click Run.

The final call to df_image returns the images that were added to the DataFrame. Alternatively, you can call the .display method.
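
To get a feel for what a wildcard path such as the one passed to from_glob_path selects, the following sketch applies the same pattern to a few object URIs with Python's fnmatch module. The object names are hypothetical, and fnmatch is used here only as an approximation; the matching that BigQuery performs on Cloud Storage paths might differ in its details.

```python
from fnmatch import fnmatchcase

PATTERN = "gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*"

# Hypothetical object URIs, for illustration only.
candidates = [
    "gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/example_a.png",
    "gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/example_b.png",
    "gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/documents/example.pdf",
]

# Keep only the URIs that fall under the images/ prefix.
matched = [uri for uri in candidates if fnmatchcase(uri, PATTERN)]

for uri in matched:
    print(uri)
```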

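Combine structured and unstructured data in the DataFrame

Combine text and image data in the multimodal DataFrame:

In the notebook, create a code cell and copy the following code into it: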

# Combine unstructured data with structured data
df_image["author"] = ["alice", "bob", "bob", "alice", "bob"]  # type: ignore
df_image["content_type"] = df_image["image"].blob.content_type()
df_image["size"] = df_image["image"].blob.size()
df_image["updated"] = df_image["image"].blob.updated()
df_image

Click Run.

The code returns the DataFrame data.

In the notebook, create a code cell and copy the following code into it:

# Filter the images and display them. You can also display audio and video types.
# Use the width and height parameters to constrain the display size.
df_image[df_image["author"] == "alice"]["image"].blob.display()

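Click Run.

The code returns the images from the DataFrame whose author column value is alice.

Perform image transformations

Transform image data by using the following methods of the Series.BlobAccessor class: image_blur, image_resize, and image_normalize.

The transformed images are written to Cloud Storage.

Transform images:

In the notebook, create a code cell and copy the following code into it: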

df_image["blurred"] = df_image["image"].blob.image_blur(
    (20, 20), dst=f"{dst_bucket}/image_blur_transformed/", engine="opencv"
)
df_image["resized"] = df_image["image"].blob.image_resize(
    (300, 200), dst=f"{dst_bucket}/image_resize_transformed/", engine="opencv"
)
df_image["normalized"] = df_image["image"].blob.image_normalize(
    alpha=50.0,
    beta=150.0,
    norm_type="minmax",
    dst=f"{dst_bucket}/image_normalize_transformed/",
    engine="opencv",
)

# You can also chain functions together
df_image["blur_resized"] = df_image["blurred"].blob.image_resize(
    (300, 200), dst=f"{dst_bucket}/image_blur_resize_transformed/", engine="opencv"
)
df_image

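Make sure that every reference to {dst_bucket} resolves to the bucket that you created, in the format gs://mybucket. You can either replace each reference with the bucket path or define a dst_bucket variable with that value before this code.
Click Run.

The code returns the original images as well as all of their transformations.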
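
The alpha and beta arguments of image_normalize are easier to interpret with a small numeric example. Assuming that norm_type="minmax" corresponds to ordinary min-max scaling, as in OpenCV, the smallest input value maps to alpha, the largest maps to beta, and everything in between is scaled linearly. The pure-Python sketch below illustrates that arithmetic on a handful of made-up pixel values. It is not the library's implementation, and the handling of a constant input is this sketch's own choice.

```python
def minmax_normalize(pixels, alpha=50.0, beta=150.0):
    """Linearly rescale values so that min -> alpha and max -> beta."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A constant input has no range to stretch; map everything to alpha.
        return [float(alpha) for _ in pixels]
    return [alpha + (p - lo) * (beta - alpha) / (hi - lo) for p in pixels]


# Made-up 8-bit pixel intensities, for illustration only.
sample = [0, 64, 128, 255]
normalized = minmax_normalize(sample)

# The darkest pixel lands on alpha and the brightest on beta.
print(normalized[0], normalized[-1])
```

With alpha=50.0 and beta=150.0, as in the tutorial code, the output intensities are compressed into the middle of the 8-bit range, which lowers contrast.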

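Generate text

Generate text from multimodal data by using the predict method of the GeminiTextGenerator class:

In the notebook, create a code cell and copy the following code into it: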

from bigframes.ml import llm

gemini = llm.GeminiTextGenerator(model_name="gemini-2.0-flash-001")

# Use the first two images as an example
df_image = df_image.head(2)

# Ask the same question about each image
answer = gemini.predict(df_image, prompt=["what item is it?", df_image["image"]])
answer[["ml_generate_text_llm_result", "image"]]

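Click Run.

The code returns the first two images in df_image, along with the text generated in response to the question what item is it? for both images.

In the notebook, create a code cell and copy the following code into it: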

# Ask different questions
df_image["question"] = [  # type: ignore
    "what item is it?",
    "what color is the picture?",
]
answer_alt = gemini.predict(
    df_image, prompt=[df_image["question"], df_image["image"]]
)
answer_alt[["ml_generate_text_llm_result", "image"]]

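Click Run.

The code returns the first two images in df_image, with text generated in response to the question what item is it? for the first image and text generated in response to the question what color is the picture? for the second image.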
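
The two predict calls differ only in whether the question is a single string or a column. Conceptually, a plain string is reused for every row, while a column contributes that row's own value. The sketch below mimics this row-wise pairing with ordinary Python lists. It is an illustration of the idea rather than how BigQuery DataFrames assembles prompts internally, and `build_row_prompts` and the image URIs are hypothetical.

```python
def build_row_prompts(parts, n_rows):
    """Assemble one prompt per row from literal strings and list 'columns'."""
    prompts = []
    for i in range(n_rows):
        row = []
        for part in parts:
            # A plain string is repeated for every row;
            # a list stands in for a column and supplies the row's own value.
            row.append(part if isinstance(part, str) else part[i])
        prompts.append(row)
    return prompts


# Hypothetical image URIs, for illustration only.
images = ["gs://example-bucket/a.png", "gs://example-bucket/b.png"]

# Same question for every image.
same = build_row_prompts(["what item is it?", images], n_rows=2)

# A different question for each image.
questions = ["what item is it?", "what color is the picture?"]
different = build_row_prompts([questions, images], n_rows=2)

print(same)
print(different)
```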

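Generate embeddings

Generate embeddings for multimodal data by using the predict method of the MultimodalEmbeddingGenerator class:

In the notebook, create a code cell and copy the following code into it: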

# Generate embeddings on images
embed_model = llm.MultimodalEmbeddingGenerator()
embeddings = embed_model.predict(df_image["image"])
embeddings

Click Run.

The code returns the embeddings generated by a call to an embedding model.
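
A common next step with embeddings is to compare them, for example to find visually similar products. The following sketch computes cosine similarity between vectors in plain Python. The three-element vectors are made up for readability; real embeddings from the model have many more dimensions, and the tutorial itself doesn't prescribe a similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up embedding vectors, for illustration only.
image_a = [0.2, 0.1, 0.7]
image_b = [0.25, 0.05, 0.65]
image_c = [0.9, 0.3, 0.0]

# image_a is closer in direction to image_b than to image_c.
print(cosine_similarity(image_a, image_b) > cosine_similarity(image_a, image_c))
```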

Chunk PDFs

Chunk PDF objects by using the pdf_chunk method of the Series.BlobAccessor class:

In the notebook, create a code cell and copy the following code into it:

# PDF chunking
df_pdf = bpd.from_glob_path(
    "gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/documents/*", name="pdf"
)
df_pdf["chunked"] = df_pdf["pdf"].blob.pdf_chunk(engine="pypdf")
chunked = df_pdf["chunked"].explode()
chunked

Click Run.

The code returns the chunked PDF data.
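
To make the chunk-then-explode pattern concrete, here is a minimal sketch that splits text into fixed-size chunks with some overlap and then flattens the per-document lists into one row per chunk, which is what explode does to the chunked column. The `chunk_text` helper, its `chunk_size` and `overlap_size` parameters, and the sample documents are this example's own assumptions; the tutorial doesn't state how pdf_chunk sizes its chunks.

```python
def chunk_text(text, chunk_size=20, overlap_size=5):
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share overlap_size characters.
    """
    if chunk_size <= 0 or not 0 <= overlap_size < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap_size < chunk_size")
    step = chunk_size - overlap_size
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


# Hypothetical extracted text per document, for illustration only.
docs = {"doc_a.pdf": "a" * 45, "doc_b.pdf": "b" * 10}

# "Explode": one (document, chunk) row per chunk instead of one list per document.
exploded = [(name, chunk) for name, text in docs.items() for chunk in chunk_text(text)]

for name, chunk in exploded:
    print(name, len(chunk))
```

Overlap keeps a sentence that straddles a chunk boundary visible in both neighboring chunks, at the cost of some duplicated text.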

Clean up

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.