Run PySpark code in BigQuery Studio notebooks

This document describes how to run PySpark code in a BigQuery Python notebook.

Before you begin

If you haven't already done so, create a Google Cloud project and a Cloud Storage bucket.

Set up your project
1. If you don't have a Cloud Storage bucket that you can use, create one in your project.
2. Set up your notebook
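  - Notebook credentials: By default, your notebook session uses your user credentials. Alternatively, it can use session service account credentials.
    - User credentials: Your user account must have the following Identity and Access Management roles:
      - Dataproc Editor (roles/dataproc.editor)
      - BigQuery Studio User (roles/bigquery.studioUser)
      - Service Account User (roles/iam.serviceAccountUser) on the session service account. This role contains the iam.serviceAccounts.actAs permission, which is required to impersonate the service account.
    - Service account credentials: If you want to specify service account credentials for the notebook session instead of user credentials, the session service account must have the following role:
      - Dataproc Worker (roles/dataproc.worker)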
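  - Notebook runtime: Unless you select a different runtime, the notebook uses the default Vertex AI runtime. To define your own runtime, create it on the Runtimes page in the Google Cloud console. Note: if you use the NumPy library, use NumPy version 1.26 in the notebook runtime, which is supported by Spark 3.5.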
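
The NumPy requirement above can be checked from a notebook cell before you run any PySpark code. The following is a minimal sketch, not part of the official setup steps: the helper names (`parse_major_minor`, `numpy_matches`, `describe_numpy_version`) are hypothetical, and it assumes that comparing only the major and minor version numbers against 1.26 is sufficient.

```python
from importlib.metadata import PackageNotFoundError, version

# Major and minor version named in the runtime note above (NumPy 1.26).
REQUIRED_NUMPY = (1, 26)


def parse_major_minor(version_string):
    """Return (major, minor) from a version string such as '1.26.4'."""
    parts = version_string.split(".")
    return int(parts[0]), int(parts[1])


def numpy_matches(installed, required=REQUIRED_NUMPY):
    """Return True if the installed version has the required major.minor."""
    return parse_major_minor(installed) == required


def describe_numpy_version():
    """Build a short status message about the NumPy version in this runtime."""
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        return "NumPy is not installed in this runtime."
    if numpy_matches(installed):
        return f"NumPy {installed} matches the required 1.26 release line."
    return f"NumPy {installed} found; this guide calls for NumPy 1.26."


if __name__ == "__main__":
    print(describe_numpy_version())
```

If the message reports a different version, one option is to install a 1.26 release in the runtime (for example, with `pip install "numpy==1.26.*"`) and restart the notebook session before importing PySpark.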