[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-08-28。"],[],[],null,["# Use BigLake metastore with tables in BigQuery\n=============================================\n\nThis document explains how to use BigLake metastore with BigQuery\ntables and Spark.\n\nWith BigLake metastore, you can create and use\n[standard (built-in) tables](/bigquery/docs/tables),\n[BigLake tables for Apache Iceberg in BigQuery](/bigquery/docs/iceberg-tables),\nand [Apache Iceberg external tables](/bigquery/docs/iceberg-external-tables)\nfrom BigQuery.\n\nBefore you begin\n----------------\n\n1. Enable billing for your Google Cloud project. Learn how to [check if billing is enabled on a project](/billing/docs/how-to/verify-billing-enabled).\n2. Enable the BigQuery, and Dataproc APIs.\n\n [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,dataproc.googleapis.com)\n3. Optional: Understand how [BigLake metastore works](/bigquery/docs/about-blms) and why you should\n use it.\n\n### Required roles\n\n\nTo get the permissions that\nyou need to use Spark and Dataproc with BigLake metastore as a metadata store,\n\nask your administrator to grant you the\nfollowing IAM roles:\n\n- Create BigLake metastore tables in Spark:\n - [Dataproc Worker](/iam/docs/roles-permissions/dataproc#dataproc.worker) (`roles/dataproc.worker`) on the Dataproc Serverless service account in the project\n - [BigQuery Data Editor](/iam/docs/roles-permissions/bigquery#bigquery.dataEditor) (`roles/bigquery.dataEditor`) on the Dataproc Serverless service account in the project\n - [Storage Object Admin](/iam/docs/roles-permissions/storage#storage.objectAdmin) (`roles/storage.objectAdmin`) on the Dataproc Serverless service account in the project\n- Query BigLake metastore tables in BigQuery:\n - [BigQuery Data Viewer](/iam/docs/roles-permissions/bigquery#bigquery.dataViewer) (`roles/bigquery.dataViewer`) on the project\n - [BigQuery User](/iam/docs/roles-permissions/bigquery#bigquery.user) (`roles/bigquery.user`) on the project\n - [Storage Object Viewer](/iam/docs/roles-permissions/storage#storage.objectViewer) (`roles/storage.objectViewer`) on the project\n\n\nFor more information about granting roles, see [Manage access to projects, folders, and organizations](/iam/docs/granting-changing-revoking-access).\n\n\nYou might also be able to get\nthe required permissions through [custom\nroles](/iam/docs/creating-custom-roles) or other [predefined\nroles](/iam/docs/roles-overview#predefined).\n\nConnect to a table\n------------------\n\n1. Create a [dataset](/bigquery/docs/datasets) in the Google Cloud console.\n\n ```googlesql\n CREATE SCHEMA `\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e`.DATASET_NAME;\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: the ID of the Google Cloud project to create the dataset.\n - \u003cvar translate=\"no\"\u003eDATASET_NAME\u003c/var\u003e: a name for your dataset.\n2. Create a [Cloud Resource Connection](/bigquery/docs/create-cloud-resource-connection).\n\n3. 
### Required roles

To get the permissions that you need to use Spark and Dataproc with
BigLake metastore as a metadata store, ask your administrator to grant you the
following IAM roles:

- Create BigLake metastore tables in Spark:
  - [Dataproc Worker](/iam/docs/roles-permissions/dataproc#dataproc.worker) (`roles/dataproc.worker`) on the Dataproc Serverless service account in the project
  - [BigQuery Data Editor](/iam/docs/roles-permissions/bigquery#bigquery.dataEditor) (`roles/bigquery.dataEditor`) on the Dataproc Serverless service account in the project
  - [Storage Object Admin](/iam/docs/roles-permissions/storage#storage.objectAdmin) (`roles/storage.objectAdmin`) on the Dataproc Serverless service account in the project
- Query BigLake metastore tables in BigQuery:
  - [BigQuery Data Viewer](/iam/docs/roles-permissions/bigquery#bigquery.dataViewer) (`roles/bigquery.dataViewer`) on the project
  - [BigQuery User](/iam/docs/roles-permissions/bigquery#bigquery.user) (`roles/bigquery.user`) on the project
  - [Storage Object Viewer](/iam/docs/roles-permissions/storage#storage.objectViewer) (`roles/storage.objectViewer`) on the project

For more information about granting roles, see [Manage access to projects, folders, and organizations](/iam/docs/granting-changing-revoking-access).

You might also be able to get the required permissions through
[custom roles](/iam/docs/creating-custom-roles) or other
[predefined roles](/iam/docs/roles-overview#predefined).
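Granting the Spark-side roles can also be scripted. The following is a sketch rather than the documented procedure: `SERVICE_ACCOUNT_EMAIL` is a hypothetical placeholder for the Dataproc Serverless service account, and you would repeat the binding for each role listed above.

```bash
# Grant one of the required roles to the Dataproc Serverless service account.
# SERVICE_ACCOUNT_EMAIL is a placeholder; substitute the account your batch jobs run as.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/dataproc.worker"
```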
Connect to a table
------------------

1. Create a [dataset](/bigquery/docs/datasets) in the Google Cloud console.

   ```googlesql
   CREATE SCHEMA `PROJECT_ID`.DATASET_NAME;
   ```

   Replace the following:
   - `PROJECT_ID`: the ID of the Google Cloud project to create the dataset in.
   - `DATASET_NAME`: a name for your dataset.
2. Create a [Cloud Resource Connection](/bigquery/docs/create-cloud-resource-connection).
3. Create a standard BigQuery table.

   ```googlesql
   CREATE TABLE `PROJECT_ID`.DATASET_NAME.TABLE_NAME (name STRING, id INT64);
   ```

   Replace the following:
   - `TABLE_NAME`: a name for your table.
4. Insert data into the standard BigQuery table.

   ```googlesql
   INSERT INTO `PROJECT_ID`.DATASET_NAME.TABLE_NAME VALUES ('test_name1', 123), ('test_name2', 456), ('test_name3', 789);
   ```
5. Create a [BigLake table for Apache Iceberg in BigQuery](/bigquery/docs/iceberg-tables#iceberg-table-workflows).

   For example, to create a table, run the following `CREATE` statement.

   ```googlesql
   CREATE TABLE `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME (
     name STRING,
     id INT64
   )
   WITH CONNECTION `CONNECTION_NAME`
   OPTIONS (
     file_format = 'PARQUET',
     table_format = 'ICEBERG',
     storage_uri = 'STORAGE_URI');
   ```

   Replace the following:
   - `ICEBERG_TABLE_NAME`: a name for your BigLake table for Apache Iceberg in BigQuery. For example, `iceberg_managed_table`.
   - `CONNECTION_NAME`: the name of your connection. You created this in the previous step. For example, `myproject.us.myconnection`.
   - `STORAGE_URI`: a fully qualified Cloud Storage URI. For example, `gs://mybucket/table`.
6. Insert data into the BigLake table for Apache Iceberg in BigQuery.

   ```googlesql
   INSERT INTO `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME VALUES ('test_name1', 123), ('test_name2', 456), ('test_name3', 789);
   ```
7. Create an [Apache Iceberg external table](/bigquery/docs/iceberg-external-tables).

   For example, to create an Iceberg external table, run the following `CREATE` statement.

   ```googlesql
   CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID`.DATASET_NAME.READONLY_ICEBERG_TABLE_NAME
   WITH CONNECTION `CONNECTION_NAME`
   OPTIONS (
     format = 'ICEBERG',
     uris = ['BUCKET_PATH'],
     require_partition_filter = FALSE);
   ```

   Replace the following:
   - `READONLY_ICEBERG_TABLE_NAME`: a name for your read-only table.
   - `BUCKET_PATH`: the path to the Cloud Storage bucket that contains the data for the external table, in the format `['gs://bucket_name/[folder_name/]file_name']`.
8. From PySpark, query the standard table, the BigLake table for Apache Iceberg in BigQuery, and the Apache Iceberg external table.

   ```python
   from pyspark.sql import SparkSession

   # Create a Spark session
   spark = SparkSession.builder \
       .appName("BigLake Metastore Iceberg") \
       .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
       .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
       .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
       .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
       .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
       .getOrCreate()
   spark.conf.set("viewsEnabled", "true")

   # Use the configured catalog and dataset namespace
   spark.sql("USE `CATALOG_NAME`;")
   spark.sql("USE NAMESPACE DATASET_NAME;")

   # Configure Spark for temporary results
   spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
   spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

   # List the tables in the dataset
   df = spark.sql("SHOW TABLES;")
   df.show()

   # Query the tables
   sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
   df = spark.read.format("bigquery").load(sql)
   df.show()

   sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
   df = spark.read.format("bigquery").load(sql)
   df.show()

   sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
   df = spark.read.format("bigquery").load(sql)
   df.show()
   ```

   Replace the following:
   - `WAREHOUSE_DIRECTORY`: the URI of the Cloud Storage folder that's connected to your BigLake Iceberg table in BigQuery and your Iceberg external table.
   - `CATALOG_NAME`: the name of the catalog that you're using.
   - `MATERIALIZATION_NAMESPACE`: the namespace for storing temporary results.
9. Run the PySpark script using Serverless Spark.

   ```bash
   gcloud dataproc batches submit pyspark SCRIPT_PATH \
     --version=2.2 \
     --project=PROJECT_ID \
     --region=REGION \
     --deps-bucket=YOUR_BUCKET
   ```

   Replace the following:
   - `SCRIPT_PATH`: the path to the script that the batch job uses.
   - `PROJECT_ID`: the ID of the Google Cloud project to run the batch job in.
   - `REGION`: the region where your workload runs.
   - `YOUR_BUCKET`: the location of the Cloud Storage bucket to upload workload dependencies. The `gs://` URI prefix of the bucket is not required. You can specify the bucket path or bucket name, for example, `mybucketname1`.
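After the batch job completes, you can optionally spot-check the same data directly in BigQuery. A minimal check, using the placeholders defined above, might look like the following query run in the BigQuery query editor.

```googlesql
-- Confirm the rows written earlier are visible from BigQuery.
SELECT name, id
FROM `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME
ORDER BY id;
```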
What's next
-----------

- Set up [optional BigLake metastore features](/bigquery/docs/blms-features).