Manage open source metadata with BigLake Metastore
BigLake Metastore is a unified physical metadata service for data analytics products on Google Cloud. BigLake Metastore provides a single source of truth for metadata and lets you manage and access data from multiple sources. BigLake Metastore is accessible from BigQuery and various open data processing engines on Dataproc, making it a useful tool for data analysts and engineers.
For management of business metadata, see Dataplex.
How BigLake Metastore works
BigLake Metastore is a serverless service, so you don't need to provision resources before you use it. You can use it as a serverless alternative to Hive Metastore in Dataproc clusters. BigLake Metastore functions in the same way as Hive Metastore through its Hive-compatible APIs, and you can immediately query open-format tables in BigQuery without further steps. BigLake Metastore supports only Apache Iceberg tables.
BigLake Metastore provides APIs, client libraries, and data engine integration (such as Apache Spark) to manage catalogs, databases, and tables.
Limitations
BigLake Metastore is subject to the following limitations:
- BigLake Metastore does not support Apache Hive tables.
- Identity and Access Management (IAM) roles and permissions can only be granted to projects. Giving IAM permissions to resources is not supported.
- Cloud Monitoring is not supported.
- BigLake Metastore catalogs and databases have the following naming limitations:
  - Names can be up to 1,024 characters in length.
  - Names can contain only UTF-8 letters (uppercase, lowercase), numbers, and underscores.
  - Names must be unique for each project and region combination.
- BigLake Metastore tables follow the same naming conventions as BigQuery tables. For more information, see Table naming.
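The length and character rules for catalog and database names listed above can be checked locally before a create call. The following is a minimal Python sketch, not part of any BigLake client library: the function name `is_valid_metastore_name` is hypothetical, and it assumes that "letters" means ASCII letters and that an empty name is rejected. Uniqueness within a project and region can't be checked locally and is left out.

```python
import re

# Maximum name length stated in the limitations list.
MAX_NAME_LENGTH = 1024

# Assumption: only ASCII letters, digits, and underscores are accepted.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_metastore_name(name: str) -> bool:
    """Return True if name satisfies the length and character rules."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    # fullmatch requires at least one character and rejects any other symbol.
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("my_catalog", "my-catalog", "a" * 1025):
        print(candidate[:20], is_valid_metastore_name(candidate))
```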
Before you begin
You need to enable billing and the BigLake API before using BigLake Metastore.
- Ask your administrator to grant you the Service Usage Admin (roles/serviceusage.serviceUsageAdmin) IAM role on your project. For more information about granting roles, see Manage access.
- Enable billing for your Google Cloud project. Learn how to check if billing is enabled on a project.
- Enable the BigLake API.
Required roles
- To have full control over BigLake Metastore resources, you need the BigLake Admin role (roles/biglake.admin). If you are using a BigQuery Spark connector service account, a Dataproc Serverless service account, or a Dataproc VM service account, grant the BigLake Admin role to the account.
- To have read-only access to BigLake Metastore resources, you need the BigLake Viewer role (roles/biglake.viewer). For example, when querying a BigLake Metastore table in BigQuery, the user or the BigQuery connection service account must have the BigLake Viewer role.
- To create BigQuery tables with connections, you need the BigQuery Connection User role (roles/bigquery.connectionUser). For more information about sharing connections, see Share connections with users.
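Because BigLake Metastore roles can be granted only at the project level, a grant of BigLake Admin or BigLake Viewer is a binding on the project itself. The following Python sketch builds a `gcloud projects add-iam-policy-binding` command line for one of these roles. The helper name `grant_role_command`, the project ID, and the service account address are placeholders chosen for illustration, and the command is only printed, not executed.

```python
import shlex

# Roles named in this section.
BIGLAKE_ADMIN = "roles/biglake.admin"
BIGLAKE_VIEWER = "roles/biglake.viewer"


def grant_role_command(project_id: str, member: str, role: str) -> list[str]:
    """Build the argument list for a project-level IAM role binding."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member={member}",
        f"--role={role}",
    ]


if __name__ == "__main__":
    command = grant_role_command(
        "my-project",  # hypothetical project ID
        "serviceAccount:spark-sa@my-project.iam.gserviceaccount.com",  # hypothetical
        BIGLAKE_ADMIN,
    )
    # Print a shell-safe rendering so it can be reviewed before running.
    print(shlex.join(command))
```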
Depending on the use case, the identity that calls BigLake Metastore can be a user or a service account:
- User: when directly calling the BigLake REST API, or when querying a BigQuery Iceberg table without a connection from BigQuery. BigQuery uses the user's credentials in this circumstance.
- BigQuery Cloud Resource Connection: when querying a BigQuery Iceberg table with a connection from BigQuery. BigQuery uses the connection service account credential to access BigLake Metastore.
- BigQuery Spark Connector: when using Spark with BigLake Metastore in a BigQuery Spark stored procedure. Spark uses the service account credential of the Spark Connector to access BigLake Metastore and create BigQuery tables.
- Dataproc Serverless service account: when using Spark with BigLake in Dataproc Serverless. Spark uses the service account credential.
- Dataproc VM service account: when using Dataproc (not Dataproc Serverless). Apache Spark uses the VM service account credential.
Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, see Viewing the grantable roles on resources.
Required permissions
Access to BigLake Metastore resources requires the following permissions:
- biglake.tables.get at the project level, for all read-only access. Querying a BigQuery Iceberg table is read-only.
- biglake.{catalogs|databases|tables}.* at the project level, for all read and write permissions. Typically, Apache Spark needs the ability to read and write data, including the ability to create, manage, and view catalogs, databases, and tables.
- bigquery.connections.delegate at the BigQuery Cloud Resource Connection level or higher, for creating a BigQuery Iceberg table using a connection.
Connect to BigLake Metastore
The following sections explain how to connect to BigLake Metastore. These sections install and use the BigLake Apache Iceberg catalog plugin, indicated by the JAR files in the following methods. The catalog plugin connects to BigLake Metastore from open source engines like Apache Spark.
Connect with a Dataproc VM
To connect to BigLake Metastore with a Dataproc VM, do the following:
- Use SSH to connect to Dataproc.
- In the Spark SQL CLI, use the following statement to install and configure the Apache Iceberg custom catalog to work with BigLake Metastore:
spark-sql \
  --packages ICEBERG_SPARK_PACKAGE \
  --jars BIGLAKE_ICEBERG_CATALOG_JAR \
  --conf spark.sql.catalog.SPARK_CATALOG=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.SPARK_CATALOG.catalog-impl=org.apache.iceberg.gcp.biglake.BigLakeCatalog \
  --conf spark.sql.catalog.SPARK_CATALOG.gcp_project=PROJECT_ID \
  --conf spark.sql.catalog.SPARK_CATALOG.gcp_location=LOCATION \
  --conf spark.sql.catalog.SPARK_CATALOG.blms_catalog=BLMS_CATALOG \
  --conf spark.sql.catalog.SPARK_CATALOG.warehouse=GCS_DATA_WAREHOUSE_FOLDER \
  --conf spark.sql.catalog.SPARK_HMS_CATALOG=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.SPARK_HMS_CATALOG.type=hive \
  --conf spark.sql.catalog.SPARK_HMS_CATALOG.uri=thrift://HMS_URI:9083
Replace the following:
- ICEBERG_SPARK_PACKAGE: the version of Apache Iceberg with Spark to use. We recommend using the Spark version that matches the Spark version in your Dataproc or Dataproc Serverless instance. To view a list of available Apache Iceberg versions, see Apache Iceberg downloads. For example, the flag for Apache Spark 3.3 is:
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.13:1.2.1
- BIGLAKE_ICEBERG_CATALOG_JAR: the Cloud Storage URI of the Iceberg custom catalog plugin to install. Depending on your environment, select one of the following:
  - Iceberg 1.2.0: gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.1-with-dependencies.jar
  - Iceberg 0.14.0: gs://spark-lib/biglake/biglake-catalog-iceberg0.14.0-0.1.1-with-dependencies.jar
- SPARK_CATALOG: the catalog identifier for Spark. It is linked to a BigLake Metastore catalog.
- PROJECT_ID: the Google Cloud project ID of the BigLake Metastore catalog that the Spark catalog links with.
- LOCATION: the Google Cloud location of the BigLake Metastore catalog that the Spark catalog links with.
- BLMS_CATALOG: the BigLake Metastore catalog ID that the Spark catalog links with. The catalog does not need to exist, and it can be created in Spark.
- GCS_DATA_WAREHOUSE_FOLDER: the Cloud Storage folder where Spark creates all files. It starts with gs://.
- HMS_DB: (optional) the HMS database containing the table to copy from.
- HMS_TABLE: (optional) the HMS table to copy from.
- HMS_URI: (optional) the HMS Thrift endpoint.
Connect with a Dataproc cluster
Alternatively, you can submit a Dataproc job to a cluster. The following sample installs the appropriate Iceberg Custom Catalog.
To connect with a Dataproc cluster, submit a job with the following specifications:
CONFS="spark.sql.catalog.SPARK_CATALOG=org.apache.iceberg.spark.SparkCatalog,"
CONFS+="spark.sql.catalog.SPARK_CATALOG.catalog-impl=org.apache.iceberg.gcp.biglake.BigLakeCatalog,"
CONFS+="spark.sql.catalog.SPARK_CATALOG.gcp_project=PROJECT_ID,"
CONFS+="spark.sql.catalog.SPARK_CATALOG.gcp_location=LOCATION,"
CONFS+="spark.sql.catalog.SPARK_CATALOG.blms_catalog=BLMS_CATALOG,"
CONFS+="spark.sql.catalog.SPARK_CATALOG.warehouse=GCS_DATA_WAREHOUSE_FOLDER,"
CONFS+="spark.jars.packages=ICEBERG_SPARK_PACKAGE"

gcloud dataproc jobs submit spark-sql --cluster=DATAPROC_CLUSTER \
  --project=DATAPROC_PROJECT_ID \
  --region=DATAPROC_LOCATION \
  --jars=BIGLAKE_ICEBERG_CATALOG_JAR \
  --properties="${CONFS}" \
  --file=QUERY_FILE_PATH
Replace the following:
- DATAPROC_CLUSTER: the Dataproc cluster to submit the job to.
- DATAPROC_PROJECT_ID: the project ID of the Dataproc cluster. This ID can be different from PROJECT_ID.
- DATAPROC_LOCATION: the location of the Dataproc cluster. This location can be different from LOCATION.
- QUERY_FILE_PATH: the path to the file containing queries to run.
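The catalog properties in the CONFS value of the job specification follow a fixed pattern keyed by the Spark catalog name. The Python sketch below assembles the same comma-separated value from its parts, which may be convenient when the job is submitted from a script. Both function names are hypothetical, and the argument values stand for the placeholders described in this section.

```python
def biglake_catalog_properties(
    spark_catalog: str,
    project_id: str,
    location: str,
    blms_catalog: str,
    warehouse: str,
    iceberg_spark_package: str,
) -> dict[str, str]:
    """Return the Spark properties that bind a Spark catalog to BigLake Metastore."""
    prefix = f"spark.sql.catalog.{spark_catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.gcp.biglake.BigLakeCatalog",
        f"{prefix}.gcp_project": project_id,
        f"{prefix}.gcp_location": location,
        f"{prefix}.blms_catalog": blms_catalog,
        f"{prefix}.warehouse": warehouse,
        "spark.jars.packages": iceberg_spark_package,
    }


def to_properties_flag(properties: dict[str, str]) -> str:
    """Join properties into the comma-separated form passed to --properties."""
    # Dictionaries keep insertion order, so the output order matches the input.
    return ",".join(f"{key}={value}" for key, value in properties.items())
```

The result of `to_properties_flag` takes the place of `${CONFS}` in the `--properties` flag.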
Connect with Dataproc Serverless
Similarly, you can submit a batch workload to Dataproc Serverless. To do so, follow the batch workload instructions with the following additional flags:
--properties="${CONFS}"
--jars=BIGLAKE_ICEBERG_CATALOG_JAR
Connect with BigQuery stored procedures
You can use BigQuery stored procedures to run Dataproc Serverless jobs. The process is similar to running Dataproc Serverless jobs directly in Dataproc.
Create metastore resources
The following sections describe how to create resources in the metastore.
Create catalogs
Catalog names have constraints; for more information, see Limitations. To create a catalog, select one of the following options:
API
Use the projects.locations.catalogs.create method and specify the name of a catalog.
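As a rough illustration of what the create call addresses, the Python sketch below builds the request URL for a new catalog without sending it. The endpoint root `https://biglake.googleapis.com/v1` and the `catalogId` query parameter are assumptions that should be checked against the BigLake API reference. The function name and the sample IDs are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: service endpoint and version of the BigLake REST API.
BIGLAKE_API_ROOT = "https://biglake.googleapis.com/v1"


def create_catalog_url(project_id: str, location: str, catalog_id: str) -> str:
    """Build the URL for a projects.locations.catalogs.create request."""
    parent = f"projects/{project_id}/locations/{location}"
    # Assumption: the new catalog's ID travels as the catalogId query parameter.
    query = urlencode({"catalogId": catalog_id})
    return f"{BIGLAKE_API_ROOT}/{parent}/catalogs?{query}"


if __name__ == "__main__":
    # The request would be an authenticated POST to this URL.
    print("POST", create_catalog_url("my-project", "us", "my_catalog"))
```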
Spark SQL
CREATE NAMESPACE SPARK_CATALOG;
Terraform
This creates a BigLake catalog named 'my_catalog' in the 'US' location. For more information, see the Terraform BigLake documentation.
resource "google_biglake_catalog" "default" {
  name     = "my_catalog"
  location = "US"
}
Create databases
Database names have constraints; for more information, see Limitations. To ensure that your database resource is compatible with data engines, we recommend creating databases using data engines instead of manually crafting the resource body. To create a database, select one of the following options:
API
Use the projects.locations.catalogs.databases.create method and specify the name of a database.
Spark SQL
CREATE NAMESPACE SPARK_CATALOG.BLMS_DB;
Replace the following:
- BLMS_DB: the BigLake Metastore database ID to create
Terraform
This creates a BigLake database named 'my_database' of type 'HIVE' in the catalog specified by the 'google_biglake_catalog.default.id' variable. For more information, see the Terraform BigLake documentation.
resource "google_biglake_database" "default" {
  name    = "my_database"
  catalog = google_biglake_catalog.default.id
  type    = "HIVE"
  hive_options {
    location_uri = "gs://${google_storage_bucket.default.name}/${google_storage_bucket_object.metadata_directory.name}"
    parameters = {
      "owner" = "Alex"
    }
  }
}
Create tables
Table names have constraints. For more information, see Table naming. To create a table, select one of the following options:
API
Use the projects.locations.catalogs.databases.tables.create method and specify the name of a table.
Spark SQL
CREATE TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE (id bigint, data string) USING iceberg;
Replace the following:
- BLMS_TABLE: the BigLake Metastore table ID to create
Terraform
This creates a BigLake Metastore table named 'my-table' of type 'HIVE' in the database specified by the 'google_biglake_database.default.id' variable. For more information, see the Terraform provider documentation for BigLake Table.
resource "google_biglake_table" "default" {
  name     = "my-table"
  database = google_biglake_database.default.id
  type     = "HIVE"
  hive_options {
    table_type = "MANAGED_TABLE"
    storage_descriptor {
      location_uri  = "gs://${google_storage_bucket.default.name}/${google_storage_bucket_object.data_directory.name}"
      input_format  = "org.apache.hadoop.mapred.SequenceFileInputFormat"
      output_format = "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat"
    }
    parameters = {
      "spark.sql.create.version"          = "3.1.3"
      "spark.sql.sources.schema.numParts" = "1"
      "transient_lastDdlTime"             = "1680894197"
      "spark.sql.partitionProvider"       = "catalog"
      "owner"                             = "Alex"
      "spark.sql.sources.schema.part.0" = jsonencode({
        "type" : "struct",
        "fields" : [
          { "name" : "id", "type" : "integer", "nullable" : true, "metadata" : {} },
          { "name" : "name", "type" : "string", "nullable" : true, "metadata" : {} },
          { "name" : "age", "type" : "integer", "nullable" : true, "metadata" : {} }
        ]
      })
      "spark.sql.sources.provider" = "iceberg"
      "provider"                   = "iceberg"
    }
  }
}
E2E Terraform Example
This GitHub example provides a runnable end-to-end example that creates a BigLake Metastore catalog, database, and table. For more information on how to use this example, refer to Basic Terraform Commands.
Copy an Iceberg table from Hive Metastore to BigLake Metastore
To create an Iceberg table and copy a Hive Metastore table over to BigLake Metastore, use the following Spark SQL statement:
CREATE TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE (id bigint, data string) USING iceberg TBLPROPERTIES(hms_table='HMS_DB.HMS_TABLE');
Link BigLake tables to BigLake Metastore tables
BigLake Metastore is the recommended metastore for querying BigLake Iceberg tables. When creating an Iceberg table in Spark, you can optionally create a linked BigLake Iceberg table at the same time.
Automatically link tables
To create an Iceberg table in Spark and automatically create a BigLake Iceberg table at the same time, use the following Spark SQL statement:
CREATE TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE (id bigint, data string) USING iceberg TBLPROPERTIES(bq_table='BQ_TABLE_PATH', bq_connection='BQ_RESOURCE_CONNECTION');
Replace the following:
- BQ_TABLE_PATH: the path of the BigLake Iceberg table to create. Follow the BigQuery table path syntax. It uses the same project as the BigLake Metastore catalog if the project is unspecified.
- BQ_RESOURCE_CONNECTION (optional): the format is project.location.connection-id. If specified, BigQuery queries use the Cloud Resource connection credentials to access BigLake Metastore. If not specified, BigQuery creates a regular external table instead of a BigLake table.
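Since the connection property is optional and changes what BigQuery creates, it can help to generate the statement in one place. The Python sketch below builds the Spark SQL text with or without `bq_connection`. The function name `linked_table_ddl` is hypothetical, and the values are inserted without any escaping, so it is suitable only for trusted input.

```python
from typing import Optional


def linked_table_ddl(
    table: str,
    columns: str,
    bq_table_path: str,
    bq_connection: Optional[str] = None,
) -> str:
    """Build a Spark SQL CREATE TABLE statement with BigQuery link properties."""
    properties = [f"bq_table='{bq_table_path}'"]
    if bq_connection is not None:
        # With a connection, BigQuery creates a BigLake table; without one,
        # it creates a regular external table.
        properties.append(f"bq_connection='{bq_connection}'")
    joined = ", ".join(properties)
    return f"CREATE TABLE {table} ({columns}) USING iceberg TBLPROPERTIES({joined});"


if __name__ == "__main__":
    print(linked_table_ddl(
        "SPARK_CATALOG.BLMS_DB.BLMS_TABLE",
        "id bigint, data string",
        "BQ_TABLE_PATH",
        "BQ_RESOURCE_CONNECTION",
    ))
```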
Manually link tables
To manually create BigLake Iceberg table links with specified BigLake Metastore table URIs (blms://…), use the following BigQuery SQL statement:
CREATE EXTERNAL TABLE `BQ_TABLE_PATH`
WITH CONNECTION `BQ_RESOURCE_CONNECTION`
OPTIONS (
  format = 'ICEBERG',
  uris = ['blms://projects/PROJECT_ID/locations/LOCATION/catalogs/BLMS_CATALOG/databases/BLMS_DB/tables/BLMS_TABLE']
)
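The blms:// URI in the statement is assembled from the same five identifiers used elsewhere on this page. A small Python helper such as the following sketch keeps the segments in the right order. The function name `blms_table_uri` is hypothetical, and the layout is copied from the example URI in the statement.

```python
def blms_table_uri(
    project_id: str,
    location: str,
    catalog: str,
    database: str,
    table: str,
) -> str:
    """Build the blms:// URI that identifies a BigLake Metastore table."""
    return (
        f"blms://projects/{project_id}/locations/{location}"
        f"/catalogs/{catalog}/databases/{database}/tables/{table}"
    )


if __name__ == "__main__":
    print(blms_table_uri("PROJECT_ID", "LOCATION", "BLMS_CATALOG", "BLMS_DB", "BLMS_TABLE"))
```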
View metastore resources
The following sections describe how to view resources in BigLake Metastore.
View catalogs
To see all catalogs in a project and location, use the projects.locations.catalogs.list method and specify the parent project and location.
To see information about a catalog, use the projects.locations.catalogs.get method and specify the name of a catalog.
View databases
To view a database, do the following:
API
To see all databases in a catalog, use the projects.locations.catalogs.databases.list method and specify the name of a catalog.
To see information about a database, use the projects.locations.catalogs.databases.get method and specify the name of a database.
Spark SQL
To see all databases in a catalog, use the following statement:
SHOW { DATABASES | NAMESPACES } IN SPARK_CATALOG;
To see information about a defined database, use the following statement:
DESCRIBE { DATABASE | NAMESPACE } [EXTENDED] SPARK_CATALOG.BLMS_DB;
View tables
To view all tables in a database or view a defined table, do the following:
API
To see all tables in a database, use the projects.locations.catalogs.databases.tables.list method and specify the name of a database.
To see information about a table, use the projects.locations.catalogs.databases.tables.get method and specify the name of a table.
Spark SQL
To see all tables in a database, use the following statement:
SHOW TABLES IN SPARK_CATALOG.BLMS_DB;
To see information about a defined table, use the following statement:
DESCRIBE TABLE [EXTENDED] SPARK_CATALOG.BLMS_DB.BLMS_TABLE;
Modify metastore resources
The following sections describe how to modify resources in the metastore.
Update tables
To avoid conflicts when multiple jobs try to update the same table at the same time, BigLake Metastore uses optimistic locking. To use optimistic locking, you first get the current version of the table (called an etag) by using the GetTable method. Then you make changes to the table and call the UpdateTable method, passing in the previously fetched etag. If another job updates the table after you fetch the etag, the UpdateTable method fails. This check prevents one job from overwriting another job's concurrent changes.
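The etag handshake can be illustrated without calling the service. The Python sketch below uses an in-memory stand-in for a single table, not the BigLake API or its client library. Every class and function name is hypothetical, and the retry loop is one possible client-side reaction to a failed update, not behavior that the service prescribes.

```python
import uuid
from dataclasses import dataclass, field


class EtagMismatchError(Exception):
    """Raised when an update carries an etag that is no longer current."""


@dataclass
class InMemoryTableStore:
    """Stand-in for a metastore holding one table; not the BigLake API."""

    properties: dict = field(default_factory=dict)
    etag: str = field(default_factory=lambda: uuid.uuid4().hex)

    def get_table(self) -> tuple[dict, str]:
        # Return a copy so callers cannot mutate stored state directly.
        return dict(self.properties), self.etag

    def update_table(self, new_properties: dict, etag: str) -> str:
        if etag != self.etag:
            raise EtagMismatchError("table changed since it was read")
        self.properties = dict(new_properties)
        self.etag = uuid.uuid4().hex  # each successful update gets a new etag
        return self.etag


def update_with_retry(store: InMemoryTableStore, change, max_attempts: int = 3) -> str:
    """Read-modify-write loop that refetches the etag after a conflict."""
    for _ in range(max_attempts):
        properties, etag = store.get_table()
        change(properties)
        try:
            return store.update_table(properties, etag)
        except EtagMismatchError:
            continue  # another writer won; reread and reapply the change
    raise EtagMismatchError(f"gave up after {max_attempts} attempts")
```

A second writer that reuses an etag fetched before the first writer's update receives `EtagMismatchError` and has to read the table again before retrying.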
To update a table, select one of the following options:
API
Use the projects.locations.catalogs.databases.tables.patch method and specify the name of a table.
Spark SQL
For table update options in SQL, see ALTER TABLE.
Rename tables
To rename a table, select one of the following options:
API
Use the projects.locations.catalogs.databases.tables.rename method and specify the name of a table and a newName value.
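The following Python sketch shows how a rename request might be put together from a table's resource name. The endpoint root, the `:rename` URL suffix, and the use of a full resource name for `newName` are assumptions to verify against the BigLake API reference. The function name is hypothetical, and nothing is sent over the network.

```python
# Assumption: service endpoint and version of the BigLake REST API.
BIGLAKE_API_ROOT = "https://biglake.googleapis.com/v1"


def rename_table_request(table_name: str, new_table_id: str) -> tuple[str, dict]:
    """Return the URL and JSON body for a tables.rename request.

    table_name is a resource name that ends in /tables/TABLE_ID.
    """
    # Keep everything before the table ID so the table stays in its database.
    parent, separator, _ = table_name.rpartition("/tables/")
    if not separator:
        raise ValueError("table_name must contain a /tables/ segment")
    url = f"{BIGLAKE_API_ROOT}/{table_name}:rename"
    body = {"newName": f"{parent}/tables/{new_table_id}"}
    return url, body


if __name__ == "__main__":
    name = "projects/my-project/locations/us/catalogs/c/databases/d/tables/old"
    print(rename_table_request(name, "new"))
```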
Spark SQL
ALTER TABLE BLMS_TABLE RENAME TO NEW_BLMS_TABLE;
Replace the following:
- NEW_BLMS_TABLE: the new name for BLMS_TABLE. Must be in the same database as BLMS_TABLE.
Delete metastore resources
The following sections describe how to delete resources in BigLake Metastore.
Delete catalogs
To delete a catalog, select one of the following options:
API
Use the projects.locations.catalogs.delete method and specify the name of a catalog. This method does not delete the associated files on Google Cloud.
Spark SQL
DROP NAMESPACE SPARK_CATALOG;
Delete databases
To delete a database, select one of the following options:
API
Use the projects.locations.catalogs.databases.delete method and specify the name of a database. This method does not delete the associated files on Google Cloud.
Spark SQL
DROP NAMESPACE SPARK_CATALOG.BLMS_DB;
Delete tables
To delete a table, select one of the following options:
API
Use the projects.locations.catalogs.databases.tables.delete method and specify the name of a table. This method does not delete the associated files on Google Cloud.
Spark SQL
To only drop the table, use the following statement:
DROP TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE;
To drop the table and delete the associated files on Google Cloud, use the following statement:
DROP TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE PURGE;