Manage open source metadata with BigLake Metastore

BigLake Metastore is a unified physical metadata service for data analytics products on Google Cloud. BigLake Metastore provides a single source of truth for metadata and lets you manage and access data from multiple sources. BigLake Metastore is accessible from BigQuery and various open data processing engines on Dataproc, making it a useful tool for data analysts and engineers.

For management of business metadata, see Dataplex.

How BigLake Metastore works

BigLake Metastore is a serverless service that does not require you to provision resources before you use it. You can use it as a serverless alternative to Hive Metastore in Dataproc clusters. BigLake Metastore works the same way as Hive Metastore through its Hive-compatible APIs, and you can immediately query open-format tables in BigQuery without any further steps. BigLake Metastore supports only Apache Iceberg tables.

BigLake Metastore provides APIs, client libraries, and data engine integration (such as Apache Spark) to manage catalogs, databases, and tables.

Limitations

BigLake Metastore is subject to the following limitations:

  • BigLake Metastore does not support Apache Hive tables.
  • Identity and Access Management (IAM) roles and permissions can only be granted to projects. Giving IAM permissions to resources is not supported.
  • Cloud Monitoring is not supported.
  • BigLake Metastore catalogs and databases have the following naming limitations:
    • Names can be up to 1,024 characters in length.
    • Names can contain only UTF-8 letters (uppercase, lowercase), numbers, and underscores.
    • Names must be unique for each project and region combination.
  • BigLake Metastore tables follow the same naming conventions as BigQuery tables. For more information, see Table naming.
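
The length and character limits for catalog and database names can be checked locally before a create request is sent. The following is a minimal sketch rather than part of any BigLake tooling: `is_valid_blms_name` is a hypothetical helper, and it assumes that "letters" means ASCII letters only, which is stricter than the UTF-8 wording above. It does not check uniqueness within a project and region.

```shell
# Hypothetical helper: returns 0 when a catalog or database ID satisfies the
# length and character limits listed above, and 1 otherwise.
# Assumption: only ASCII letters are accepted.
is_valid_blms_name() {
  local name="$1"
  # Length must be between 1 and 1,024 characters.
  [ "${#name}" -ge 1 ] || return 1
  [ "${#name}" -le 1024 ] || return 1
  # Reject any character other than a letter, digit, or underscore.
  case "$name" in
    *[!A-Za-z0-9_]*) return 1 ;;
  esac
  return 0
}

is_valid_blms_name "sales_catalog_2024" && echo "valid"
is_valid_blms_name "sales-catalog" || echo "invalid: contains a hyphen"
```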

Before you begin

You need to enable billing and the BigLake API before using BigLake Metastore.

  1. Ask your administrator to grant you the Service Usage Admin (roles/serviceusage.serviceUsageAdmin) IAM role on your project. For more information about granting roles, see Manage access.
  2. Enable billing for your Google Cloud project. Learn how to check if billing is enabled on a project.
  3. Enable the BigLake API.
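
The API can also be enabled from the command line. The sketch below wraps `gcloud services enable` in a hypothetical function, `enable_biglake_api`. It assumes that the gcloud CLI is installed and authenticated, and that the service name of the BigLake API is `biglake.googleapis.com`.

```shell
# Hypothetical wrapper: enables the BigLake API on the given project.
# Assumes an installed, authenticated gcloud CLI.
enable_biglake_api() {
  local project_id="$1"
  gcloud services enable biglake.googleapis.com --project="$project_id"
}
```

For example, `enable_biglake_api my-project` enables the API on a project whose ID is `my-project` (a placeholder value).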


Required roles

  • To have full control over BigLake Metastore resources, you need the BigLake Admin role (roles/biglake.admin). If you are using a BigQuery Spark connector service account, a Dataproc Serverless service account, or a Dataproc VM service account, grant the BigLake Admin role to the account.
  • To have read-only access to BigLake Metastore resources, you need the BigLake Viewer role (roles/biglake.viewer). For example, when querying a BigLake Metastore table in BigQuery, the user or the BigQuery connection service account must have the BigLake Viewer role.
  • To create BigQuery tables with connections, you need the BigQuery Connection User role (roles/bigquery.connectionUser). For more information about sharing connections, see Share connections with users.

Depending on the use case, the identity that calls BigLake Metastore can be a user or a service account:

  • User: when directly calling the BigLake REST API, or when querying a BigQuery Iceberg table without a connection from BigQuery. BigQuery uses the user's credentials in this case.
  • BigQuery Cloud Resource Connection: when querying a BigQuery Iceberg table with a connection from BigQuery. BigQuery uses the connection service account credential to access BigLake Metastore.
  • BigQuery Spark Connector: when using Spark with BigLake Metastore in a BigQuery Spark stored procedure. Spark uses the service account credential of the Spark Connector to access BigLake Metastore and create BigQuery tables.
  • Dataproc Serverless service account: when using Spark with BigLake Metastore in Dataproc Serverless. Spark uses the service account credential.
  • Dataproc VM service account: when using Dataproc (not Dataproc Serverless). Apache Spark uses the VM service account credential.

Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, see Viewing the grantable roles on resources.
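
Because BigLake Metastore roles can be granted only at the project level, one way to grant them is with `gcloud projects add-iam-policy-binding`. The following sketch wraps that command in a hypothetical function, `grant_biglake_role`. It assumes an installed, authenticated gcloud CLI and a caller who is allowed to change the project's IAM policy.

```shell
# Hypothetical wrapper: grants one role on a project to one principal.
grant_biglake_role() {
  local project_id="$1"
  local member="$2"  # for example user:EMAIL or serviceAccount:EMAIL
  local role="$3"    # for example roles/biglake.admin or roles/biglake.viewer
  gcloud projects add-iam-policy-binding "$project_id" \
    --member="$member" \
    --role="$role"
}
```

For example, `grant_biglake_role PROJECT_ID serviceAccount:SERVICE_ACCOUNT_EMAIL roles/biglake.admin` would grant the BigLake Admin role to a Dataproc VM service account; both uppercase values are placeholders.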

The following permissions are required to access BigLake Metastore resources:

  • biglake.tables.get at the project level, for all read-only accesses. Querying a BigQuery Iceberg table is read-only.
  • biglake.{catalogs|databases|tables}.* at the project level, for all read and write permissions. Typically, Apache Spark needs the ability to read and write data, including the ability to create, manage, and view catalogs, databases, and tables.
  • bigquery.connections.delegate at the BigQuery Cloud Resource Connection level or higher, for creating a BigQuery Iceberg table using a connection.

Connect to BigLake Metastore

The following sections explain how to connect to BigLake Metastore. These sections install and use the BigLake Apache Iceberg catalog plugin, which is provided by the JAR files referenced in the following methods. The catalog plugin connects to BigLake Metastore from open source engines like Apache Spark.

Connect with a Dataproc VM

To connect to BigLake Metastore with a Dataproc VM, do the following:

  1. Use SSH to connect to Dataproc.
  2. In the Spark SQL CLI, use the following statement to install and configure the Apache Iceberg custom catalog to work with BigLake Metastore:

    spark-sql \
      --packages ICEBERG_SPARK_PACKAGE \
      --jars BIGLAKE_ICEBERG_CATALOG_JAR \
      --conf spark.sql.catalog.SPARK_CATALOG=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.SPARK_CATALOG.catalog-impl=org.apache.iceberg.gcp.biglake.BigLakeCatalog \
      --conf spark.sql.catalog.SPARK_CATALOG.gcp_project=PROJECT_ID \
      --conf spark.sql.catalog.SPARK_CATALOG.gcp_location=LOCATION \
      --conf spark.sql.catalog.SPARK_CATALOG.blms_catalog=BLMS_CATALOG \
      --conf spark.sql.catalog.SPARK_CATALOG.warehouse=GCS_DATA_WAREHOUSE_FOLDER \
      --conf spark.sql.catalog.SPARK_HMS_CATALOG=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.SPARK_HMS_CATALOG.type=hive \
      --conf spark.sql.catalog.SPARK_HMS_CATALOG.uri=thrift://HMS_URI:9083
      

Replace the following:

  • ICEBERG_SPARK_PACKAGE: the version of Apache Iceberg with Spark to use. We recommend using the Iceberg package that matches the Spark version in your Dataproc or Dataproc Serverless instance. To view a list of available Apache Iceberg versions, see Apache Iceberg downloads. For example, the flag for Apache Spark 3.3 is:
    --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.13:1.2.1
  • BIGLAKE_ICEBERG_CATALOG_JAR: the Cloud Storage URI of the Iceberg custom catalog plugin to install. Depending on your environment, select one of the following:
    • Iceberg 1.2.0: gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.1-with-dependencies.jar
    • Iceberg 0.14.0: gs://spark-lib/biglake/biglake-catalog-iceberg0.14.0-0.1.1-with-dependencies.jar
  • SPARK_CATALOG: the catalog identifier for Spark. It is linked to a BigLake Metastore catalog.
  • PROJECT_ID: the Google Cloud project ID of the BigLake Metastore catalog that the Spark catalog links with.
  • LOCATION: the Google Cloud location of the BigLake Metastore catalog that the Spark catalog links with.
  • BLMS_CATALOG: the BigLake Metastore catalog ID that the Spark catalog links with. The catalog does not need to exist, and it can be created in Spark.
  • GCS_DATA_WAREHOUSE_FOLDER: the Cloud Storage folder where Spark creates all files. It starts with gs://.
  • HMS_DB: (optional) the HMS database containing the table to copy from.
  • HMS_TABLE: (optional) the HMS table to copy from.
  • HMS_URI: (optional) the HMS Thrift endpoint.

Connect with a Dataproc cluster

Alternatively, you can submit a Dataproc job to a cluster. The following sample installs the appropriate Iceberg Custom Catalog.

To connect with a Dataproc cluster, submit a job with the following specifications:

CONFS="spark.sql.catalog.SPARK_CATALOG=org.apache.iceberg.spark.SparkCatalog,"
CONFS+="spark.sql.catalog.SPARK_CATALOG.catalog-impl=org.apache.iceberg.gcp.biglake.BigLakeCatalog,"
CONFS+="spark.sql.catalog.SPARK_CATALOG.gcp_project=PROJECT_ID,"
CONFS+="spark.sql.catalog.SPARK_CATALOG.gcp_location=LOCATION,"
CONFS+="spark.sql.catalog.SPARK_CATALOG.blms_catalog=BLMS_CATALOG,"
CONFS+="spark.sql.catalog.SPARK_CATALOG.warehouse=GCS_DATA_WAREHOUSE_FOLDER,"
CONFS+="spark.jars.packages=ICEBERG_SPARK_PACKAGE"

gcloud dataproc jobs submit spark-sql --cluster=DATAPROC_CLUSTER \
  --project=DATAPROC_PROJECT_ID \
  --region=DATAPROC_LOCATION \
  --jars=BIGLAKE_ICEBERG_CATALOG_JAR \
  --properties="${CONFS}" \
  --file=QUERY_FILE_PATH

Replace the following:

  • DATAPROC_CLUSTER: the Dataproc cluster to submit the job to.
  • DATAPROC_PROJECT_ID: the project ID of the Dataproc cluster. This ID can be different from PROJECT_ID.
  • DATAPROC_LOCATION: the location of the Dataproc cluster. This location can be different from LOCATION.
  • QUERY_FILE_PATH: the path to the file containing queries to run.
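
The CONFS value used above is a single comma-separated string. As a sketch, the hypothetical function `build_biglake_confs` assembles the same string from arguments, which can help avoid a missing or trailing comma when the settings are edited by hand. The function name and the argument order are choices made for this example only.

```shell
# Hypothetical helper: prints the comma-separated Spark properties shown above.
# Arguments: SPARK_CATALOG PROJECT_ID LOCATION BLMS_CATALOG
#            GCS_DATA_WAREHOUSE_FOLDER ICEBERG_SPARK_PACKAGE
build_biglake_confs() {
  local prefix="spark.sql.catalog.$1"
  # printf reuses the format for every argument, so each property is
  # followed by exactly one comma.
  printf '%s,' \
    "${prefix}=org.apache.iceberg.spark.SparkCatalog" \
    "${prefix}.catalog-impl=org.apache.iceberg.gcp.biglake.BigLakeCatalog" \
    "${prefix}.gcp_project=$2" \
    "${prefix}.gcp_location=$3" \
    "${prefix}.blms_catalog=$4" \
    "${prefix}.warehouse=$5"
  # The last property has no trailing comma.
  printf '%s\n' "spark.jars.packages=$6"
}
```

The result can be captured with `CONFS=$(build_biglake_confs ...)` and passed to `--properties="${CONFS}"` as in the job submission above.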

Connect with Dataproc Serverless

Similarly, you can submit a batch workload to Dataproc Serverless. To do so, follow the batch workload instructions with the following additional flags:

  • --properties="${CONFS}"
  • --jars=BIGLAKE_ICEBERG_CATALOG_JAR

Connect with BigQuery stored procedures

You can use BigQuery stored procedures to run Dataproc Serverless jobs. The process is similar to running Dataproc Serverless jobs directly in Dataproc.

Create metastore resources

The following sections describe how to create resources in the metastore.

Create catalogs

Catalog names have constraints; for more information, see Limitations. To create a catalog, select one of the following options:

API

Use the projects.locations.catalogs.create method and specify the name of a catalog.
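
A minimal sketch of such a call with curl follows. The function name `create_blms_catalog` is hypothetical. The endpoint path and the `catalogId` query parameter are assumptions inferred from the method name, so check them against the BigLake API reference before relying on them. The sketch also assumes that `gcloud auth print-access-token` returns a token for an identity with the BigLake Admin role.

```shell
# Hypothetical wrapper around projects.locations.catalogs.create.
# Assumption: the catalog ID is passed as the catalogId query parameter and
# the request body is an empty catalog resource.
create_blms_catalog() {
  local project_id="$1"
  local location="$2"
  local catalog_id="$3"
  local token
  token=$(gcloud auth print-access-token)
  curl -sS -X POST \
    -H "Authorization: Bearer ${token}" \
    -H "Content-Type: application/json" \
    -d '{}' \
    "https://biglake.googleapis.com/v1/projects/${project_id}/locations/${location}/catalogs?catalogId=${catalog_id}"
}
```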

Spark SQL

CREATE NAMESPACE SPARK_CATALOG;

Terraform

This creates a BigLake catalog named 'my_catalog' in the 'US' location. For more information, see the Terraform BigLake documentation.

resource "google_biglake_catalog" "default" {
  name     = "my_catalog"
  location = "US"
}

Create databases

Database names have constraints; for more information, see Limitations. To ensure that your database resource is compatible with data engines, we recommend creating databases using data engines instead of manually crafting the resource body. To create a database, select one of the following options:

API

Use the projects.locations.catalogs.databases.create method and specify the name of a database.

Spark SQL

CREATE NAMESPACE SPARK_CATALOG.BLMS_DB;

Replace the following:

  • BLMS_DB: the BigLake Metastore database ID to create

Terraform

This creates a BigLake database named 'my_database' of type 'HIVE' in the catalog specified by the 'google_biglake_catalog.default.id' variable. For more information, see the Terraform BigLake documentation.

resource "google_biglake_database" "default" {
  name    = "my_database"
  catalog = google_biglake_catalog.default.id
  type    = "HIVE"
  hive_options {
    location_uri = "gs://${google_storage_bucket.default.name}/${google_storage_bucket_object.metadata_directory.name}"
    parameters = {
      "owner" = "Alex"
    }
  }
}

Create tables

Table names have constraints. For more information, see Table naming. To create a table, select one of the following options:

API

Use the projects.locations.catalogs.databases.tables.create method and specify the name of a table.

Spark SQL

CREATE TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE
  (id bigint, data string) USING iceberg;

Replace the following:

  • BLMS_TABLE: the BigLake Metastore table ID to create

Terraform

This creates a BigLake Metastore table with the name "my_table" and type "HIVE" in the database specified by the "google_biglake_database.default.id" variable. Refer to the Terraform Provider Documentation for more information: BigLake Table.

resource "google_biglake_table" "default" {
  name     = "my_table"
  database = google_biglake_database.default.id
  type     = "HIVE"
  hive_options {
    table_type = "MANAGED_TABLE"
    storage_descriptor {
      location_uri  = "gs://${google_storage_bucket.default.name}/${google_storage_bucket_object.data_directory.name}"
      input_format  = "org.apache.hadoop.mapred.SequenceFileInputFormat"
      output_format = "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat"
    }
    parameters = {
      "spark.sql.create.version"          = "3.1.3"
      "spark.sql.sources.schema.numParts" = "1"
      "transient_lastDdlTime"             = "1680894197"
      "spark.sql.partitionProvider"       = "catalog"
      "owner"                             = "Alex"
      "spark.sql.sources.schema.part.0" = jsonencode({
        "type" : "struct",
        "fields" : [
          {
            "name" : "id",
            "type" : "integer",
            "nullable" : true,
            "metadata" : {}
          },
          {
            "name" : "name",
            "type" : "string",
            "nullable" : true,
            "metadata" : {}
          },
          {
            "name" : "age",
            "type" : "integer",
            "nullable" : true,
            "metadata" : {}
          }
        ]
      })
      "spark.sql.sources.provider" = "iceberg"
      "provider"                   = "iceberg"
    }
  }
}

End-to-end Terraform example

This GitHub example provides a runnable end-to-end example that creates a BigLake Metastore catalog, database, and table. For more information on how to use this example, refer to Basic Terraform Commands.

Copy an Iceberg table from Hive Metastore to BigLake Metastore

To create an Iceberg table and copy a Hive Metastore table over to BigLake Metastore, use the following Spark SQL statement:

CREATE TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE
  (id bigint, data string) USING iceberg
  TBLPROPERTIES(hms_table='HMS_DB.HMS_TABLE');

BigLake Metastore is the recommended metastore for querying BigLake Iceberg tables. When creating an Iceberg table in Spark, you can optionally create a linked BigLake Iceberg table at the same time.

To create an Iceberg table in Spark and automatically create a BigLake Iceberg table at the same time, use the following Spark SQL statement:

  CREATE TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE
    (id bigint, data string) USING iceberg
    TBLPROPERTIES(bq_table='BQ_TABLE_PATH',
    bq_connection='BQ_RESOURCE_CONNECTION');

Replace the following:

  • BQ_TABLE_PATH: the path of the BigLake Iceberg table to create. Follow the BigQuery table path syntax. It uses the same project as the BigLake Metastore catalog if the project is unspecified.
  • BQ_RESOURCE_CONNECTION (optional): the format is project.location.connection-id. If specified, BigQuery queries use the Cloud Resource connection credentials to access BigLake Metastore. If not specified, BigQuery creates a regular external table instead of a BigLake table.

To manually create BigLake Iceberg tables that link to specified BigLake Metastore table URIs (blms://…), use the following BigQuery SQL statement:

CREATE EXTERNAL TABLE `BQ_TABLE_PATH`
  WITH CONNECTION `BQ_RESOURCE_CONNECTION`
  OPTIONS (
          format = 'ICEBERG',
          uris = ['blms://projects/PROJECT_ID/locations/LOCATION/catalogs/BLMS_CATALOG/databases/BLMS_DB/tables/BLMS_TABLE']
          )
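
The blms:// URI in the statement above is built from five identifiers in a fixed order. As a small sketch, the hypothetical helper `blms_table_uri` assembles it from arguments so that the segments cannot be transposed by hand. The sample values in the call are placeholders.

```shell
# Hypothetical helper: prints the BigLake Metastore table URI for the given
# project, location, catalog, database, and table (in that order).
blms_table_uri() {
  # Require exactly five identifiers.
  [ "$#" -eq 5 ] || return 1
  printf 'blms://projects/%s/locations/%s/catalogs/%s/databases/%s/tables/%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

blms_table_uri my-project us my_catalog my_db my_table
# prints blms://projects/my-project/locations/us/catalogs/my_catalog/databases/my_db/tables/my_table
```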

View metastore resources

The following sections describe how to view resources in BigLake Metastore.

View catalogs

To see all catalogs in a project and location, use the projects.locations.catalogs.list method and specify the parent project and location.

To see information about a catalog, use the projects.locations.catalogs.get method and specify the name of a catalog.
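
Both methods are read-only requests against a resource path. The sketch below shows one way to issue them with curl. `biglake_get` is a hypothetical helper, and the base URL and path layout are assumptions inferred from the method names, so confirm them in the BigLake API reference. It assumes that `gcloud auth print-access-token` returns a token for an identity with at least the BigLake Viewer role.

```shell
# Hypothetical helper: sends an authenticated GET request for a BigLake
# resource path and prints the JSON response.
biglake_get() {
  local resource_path="$1"
  local token
  token=$(gcloud auth print-access-token)
  curl -sS \
    -H "Authorization: Bearer ${token}" \
    "https://biglake.googleapis.com/v1/${resource_path}"
}
```

Under these assumptions, `biglake_get "projects/PROJECT_ID/locations/LOCATION/catalogs"` corresponds to the list method, and appending `/BLMS_CATALOG` to that path corresponds to the get method.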

View databases

To view a database, do the following:

API

To see all databases in a catalog, use the projects.locations.catalogs.databases.list method and specify the name of the parent catalog.

To see information about a database, use the projects.locations.catalogs.databases.get method and specify the name of a database.

Spark SQL

To see all databases in a catalog, use the following statement:

SHOW { DATABASES | NAMESPACES } IN SPARK_CATALOG;

To see information about a defined database, use the following statement:

DESCRIBE { DATABASE | NAMESPACE } [EXTENDED] SPARK_CATALOG.BLMS_DB;

View tables

To view all tables in a database or view a defined table, do the following:

API

To see all tables in a database, use the projects.locations.catalogs.databases.tables.list method and specify the name of a database.

To see information about a table, use the projects.locations.catalogs.databases.tables.get method and specify the name of a table.

Spark SQL

To see all tables in a database, use the following statement:

SHOW TABLES IN SPARK_CATALOG.BLMS_DB;

To see information about a defined table, use the following statement:

DESCRIBE TABLE [EXTENDED] SPARK_CATALOG.BLMS_DB.BLMS_TABLE;

Modify metastore resources

The following sections describe how to modify resources in the metastore.

Update tables

To avoid conflicts when multiple jobs try to update the same table at the same time, BigLake Metastore uses optimistic locking. To use optimistic locking, first get the current version of the table (called an etag) by using the GetTable method. Then make your changes to the table and call the UpdateTable method, passing in the previously fetched etag. If another job updates the table after you fetch the etag, the UpdateTable method fails. This check ensures that an update based on a stale version of the table is rejected instead of overwriting the other job's changes.
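
The etag check can be illustrated with a local simulation. The sketch below does not call the BigLake API: the table is a pair of shell variables, the functions are hypothetical stand-ins for GetTable and UpdateTable, and the etag is modeled as a counter, whereas a real etag should be treated as an opaque value.

```shell
# In-memory stand-in for a single metastore table (simulation only).
TABLE_ETAG=1
TABLE_OWNER="Alex"

# Stand-in for GetTable: prints the current etag.
get_table_etag() {
  printf '%s\n' "$TABLE_ETAG"
}

# Stand-in for UpdateTable: applies the change only when the caller's etag
# still matches the stored one, then advances the etag.
update_table_owner() {
  local expected_etag="$1"
  local new_owner="$2"
  if [ "$expected_etag" != "$TABLE_ETAG" ]; then
    echo "rejected: etag $expected_etag is stale" >&2
    return 1
  fi
  TABLE_OWNER=$new_owner
  TABLE_ETAG=$((TABLE_ETAG + 1))
}

# Two jobs read the same table version.
etag_a=$(get_table_etag)
etag_b=$(get_table_etag)

# Job A updates first and succeeds; job B is rejected because its etag is stale.
update_table_owner "$etag_a" "job_a"
update_table_owner "$etag_b" "job_b" || echo "job B must fetch the table again and retry"

# Job B retries with a fresh etag and succeeds.
update_table_owner "$(get_table_etag)" "job_b"
echo "owner=$TABLE_OWNER etag=$TABLE_ETAG"
```

The last line prints `owner=job_b etag=3`. The retry step is the important part of the pattern: a rejected job re-reads the table, reapplies its change to the fresh version, and tries again.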

To update a table, select one of the following options:

API

Use the projects.locations.catalogs.databases.tables.patch method and specify the name of a table.

Spark SQL

For table update options in SQL, see ALTER TABLE.

Rename tables

To rename a table, select one of the following options:

API

Use the projects.locations.catalogs.databases.tables.rename method and specify the name of a table and a newName value.

Spark SQL

ALTER TABLE BLMS_TABLE RENAME TO NEW_BLMS_TABLE;

Replace the following:

  • NEW_BLMS_TABLE: the new name for BLMS_TABLE. Must be in the same database as BLMS_TABLE.

Delete metastore resources

The following sections describe how to delete resources in BigLake Metastore.

Delete catalogs

To delete a catalog, select one of the following options:

API

Use the projects.locations.catalogs.delete method and specify the name of a catalog. This method does not delete the associated files on Google Cloud.

Spark SQL

DROP NAMESPACE SPARK_CATALOG;

Delete databases

To delete a database, select one of the following options:

API

Use the projects.locations.catalogs.databases.delete method and specify the name of a database. This method does not delete the associated files on Google Cloud.

Spark SQL

DROP NAMESPACE SPARK_CATALOG.BLMS_DB;

Delete tables

To delete a table, select one of the following options:

API

Use the projects.locations.catalogs.databases.tables.delete method and specify the name of a table. This method does not delete the associated files on Google Cloud.

Spark SQL

To only drop the table, use the following statement:

DROP TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE;

To drop the table and delete the associated files on Google Cloud, use the following statement:

DROP TABLE SPARK_CATALOG.BLMS_DB.BLMS_TABLE PURGE;