Use BigLake metastore with tables in BigQuery
=============================================

This document explains how to use BigLake metastore with BigQuery tables and Spark.

With BigLake metastore, you can create and use [standard (built-in) tables](/bigquery/docs/tables), [BigLake tables for Apache Iceberg in BigQuery](/bigquery/docs/iceberg-tables), and [Apache Iceberg external tables](/bigquery/docs/iceberg-external-tables) from BigQuery.

Before you begin
----------------

1. Enable billing for your Google Cloud project. Learn how to [check if billing is enabled on a project](/billing/docs/how-to/verify-billing-enabled).
2. Enable the BigQuery and Dataproc APIs: [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,dataproc.googleapis.com). A command-line alternative is sketched after this list.
3. Optional: Understand how [BigLake metastore works](/bigquery/docs/about-blms) and why you should use it.
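If you work from the command line instead of the console, the same two APIs can be enabled with `gcloud`. This is a minimal sketch; `PROJECT_ID` is a placeholder for your own project ID.

```bash
# Enable the BigQuery and Dataproc APIs for the project.
# Replace PROJECT_ID with your Google Cloud project ID.
gcloud services enable bigquery.googleapis.com dataproc.googleapis.com \
    --project=PROJECT_ID
```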
### Required roles

To get the permissions that you need to use Spark and Dataproc with BigLake metastore as a metadata store, ask your administrator to grant you the following IAM roles:

- Create BigLake metastore tables in Spark:
  - [Dataproc Worker](/iam/docs/roles-permissions/dataproc#dataproc.worker) (`roles/dataproc.worker`) on the Dataproc Serverless service account in the project
  - [BigQuery Data Editor](/iam/docs/roles-permissions/bigquery#bigquery.dataEditor) (`roles/bigquery.dataEditor`) on the Dataproc Serverless service account in the project
  - [Storage Object Admin](/iam/docs/roles-permissions/storage#storage.objectAdmin) (`roles/storage.objectAdmin`) on the Dataproc Serverless service account in the project
- Query BigLake metastore tables in BigQuery:
  - [BigQuery Data Viewer](/iam/docs/roles-permissions/bigquery#bigquery.dataViewer) (`roles/bigquery.dataViewer`) on the project
  - [BigQuery User](/iam/docs/roles-permissions/bigquery#bigquery.user) (`roles/bigquery.user`) on the project
  - [Storage Object Viewer](/iam/docs/roles-permissions/storage#storage.objectViewer) (`roles/storage.objectViewer`) on the project

For more information about granting roles, see [Manage access to projects, folders, and organizations](/iam/docs/granting-changing-revoking-access).

You might also be able to get the required permissions through [custom roles](/iam/docs/creating-custom-roles) or other [predefined roles](/iam/docs/roles-overview#predefined). A sample `gcloud` command for granting one of these roles follows this section.
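As one possible way to set these up, an administrator can bind the Spark-side roles to the service account that runs your Dataproc Serverless batches. The sketch below assumes the default case, where batches run as the Compute Engine default service account (`PROJECT_NUMBER-compute@developer.gserviceaccount.com`); if you submit batches with a custom service account, substitute that address instead.

```bash
# Grant one of the roles listed above to the Dataproc Serverless service account.
# Repeat the command for each required role (dataproc.worker, bigquery.dataEditor,
# storage.objectAdmin). PROJECT_ID and PROJECT_NUMBER are placeholders.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/dataproc.worker"
```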
Connect to a table
------------------

1. Create a [dataset](/bigquery/docs/datasets) in the Google Cloud console.

    ```googlesql
    CREATE SCHEMA `PROJECT_ID`.DATASET_NAME;
    ```

    Replace the following:

    - PROJECT_ID: the ID of the Google Cloud project in which to create the dataset.
    - DATASET_NAME: a name for your dataset.

2. Create a [Cloud Resource Connection](/bigquery/docs/create-cloud-resource-connection). A command-line sketch for this step appears after this procedure.

3. Create a standard BigQuery table.

    ```googlesql
    CREATE TABLE `PROJECT_ID`.DATASET_NAME.TABLE_NAME (name STRING, id INT64);
    ```

    Replace the following:

    - TABLE_NAME: a name for your table.

4. Insert data into the standard BigQuery table.

    ```googlesql
    INSERT INTO `PROJECT_ID`.DATASET_NAME.TABLE_NAME VALUES ('test_name1', 123), ('test_name2', 456), ('test_name3', 789);
    ```

5. Create a [BigLake table for Apache Iceberg in BigQuery](/bigquery/docs/iceberg-tables#iceberg-table-workflows).

    For example, to create a table, run the following `CREATE` statement.

    ```googlesql
    CREATE TABLE `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME (
      name STRING,
      id INT64
    )
    WITH CONNECTION `CONNECTION_NAME`
    OPTIONS (
      file_format = 'PARQUET',
      table_format = 'ICEBERG',
      storage_uri = 'STORAGE_URI');
    ```

    Replace the following:

    - ICEBERG_TABLE_NAME: a name for your BigLake table for Apache Iceberg in BigQuery. For example, `iceberg_managed_table`.
    - CONNECTION_NAME: the name of the connection that you created in the previous step. For example, `myproject.us.myconnection`.
    - STORAGE_URI: a fully qualified Cloud Storage URI. For example, `gs://mybucket/table`.

6. Insert data into the BigLake table for Apache Iceberg in BigQuery.

    ```googlesql
    INSERT INTO `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME VALUES ('test_name1', 123), ('test_name2', 456), ('test_name3', 789);
    ```

7. Create an [Apache Iceberg external table](/bigquery/docs/iceberg-external-tables).

    For example, to create an Iceberg external table, run the following `CREATE` statement.

    ```googlesql
    CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID`.DATASET_NAME.READONLY_ICEBERG_TABLE_NAME
    WITH CONNECTION `CONNECTION_NAME`
    OPTIONS (
      format = 'ICEBERG',
      uris = ['BUCKET_PATH'],
      require_partition_filter = FALSE);
    ```

    Replace the following:

    - READONLY_ICEBERG_TABLE_NAME: a name for your read-only table.
    - BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format `['gs://bucket_name/[folder_name/]file_name']`.
8. From PySpark, query the standard table, the BigLake table for Apache Iceberg in BigQuery, and the Apache Iceberg external table.

    ```python
    from pyspark.sql import SparkSession

    # Create a Spark session with an Iceberg catalog backed by BigLake metastore
    spark = SparkSession.builder \
        .appName("BigLake Metastore Iceberg") \
        .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
        .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
        .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
        .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
        .getOrCreate()
    spark.conf.set("viewsEnabled", "true")

    # Use the catalog and the dataset namespace
    spark.sql("USE `CATALOG_NAME`;")
    spark.sql("USE NAMESPACE DATASET_NAME;")

    # Configure Spark for temporary results
    spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
    spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

    # List the tables in the dataset
    df = spark.sql("SHOW TABLES;")
    df.show()

    # Query the tables
    sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
    df = spark.read.format("bigquery").load(sql)
    df.show()

    sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
    df = spark.read.format("bigquery").load(sql)
    df.show()

    sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
    df = spark.read.format("bigquery").load(sql)
    df.show()
    ```
    Replace the following:

    - WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder that's connected to your BigLake Iceberg table in BigQuery and your Iceberg external table.
    - CATALOG_NAME: the name of the catalog that you're using.
    - MATERIALIZATION_NAMESPACE: the namespace for storing temporary results.
    - LOCATION: the location of your BigQuery dataset, for example `US`.
9. Run the PySpark script by using Serverless Spark.

    ```bash
    gcloud dataproc batches submit pyspark SCRIPT_PATH \
        --version=2.2 \
        --project=PROJECT_ID \
        --region=REGION \
        --deps-bucket=YOUR_BUCKET
    ```

    Replace the following:

    - SCRIPT_PATH: the path to the script that the batch job uses.
    - PROJECT_ID: the ID of the Google Cloud project to run the batch job in.
    - REGION: the region where your workload runs.
    - YOUR_BUCKET: the location of the Cloud Storage bucket to upload workload dependencies to. The `gs://` URI prefix of the bucket is not required. You can specify the bucket path or the bucket name, for example, `mybucketname1`.
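For step 2 of this procedure, the linked page describes the authoritative setup. As a minimal command-line sketch, a Cloud resource connection can also be created with the `bq` CLI; `PROJECT_ID`, `LOCATION`, and the connection ID `myconnection` are placeholders here.

```bash
# Create a Cloud resource connection for the Iceberg tables to use.
# After creating it, grant the connection's service account access to your
# Cloud Storage bucket, as described in the Cloud Resource Connection guide.
bq mk --connection \
    --project_id=PROJECT_ID \
    --location=LOCATION \
    --connection_type=CLOUD_RESOURCE \
    myconnection
```

The connection is then referenced in the `WITH CONNECTION` clauses of steps 5 and 7 as `PROJECT_ID.LOCATION.CONNECTION_ID`, for example `myproject.us.myconnection`.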
What's next
-----------

- Set up [optional BigLake metastore features](/bigquery/docs/blms-features).