Introduction to BigLake metastore
BigLake metastore is a unified, managed, serverless, and scalable metastore that connects lakehouse data stored in Cloud Storage or BigQuery to multiple runtimes, including open source runtimes (such as Apache Spark and Apache Flink) and BigQuery.
BigLake metastore provides a single source of truth for managing metadata from multiple engines. It supports key open source table formats, such as Apache Iceberg, through BigLake Iceberg tables and standard BigQuery tables. Additionally, BigLake metastore has support for open APIs and an Iceberg REST catalog (Preview).
Use the following table to help determine where to start your BigLake metastore journey:
| Use case | Recommendation |
|---|---|
| Open source engine needs to access data in Cloud Storage. | Explore the Iceberg REST catalog (Preview). |
| Open source engine needs interoperability with BigQuery. | Explore the BigLake metastore integration with open source engines (such as Spark) using the BigQuery custom Iceberg catalog plugin. |
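For example, a self-managed Spark session can reach the Iceberg REST catalog through the standard Apache Iceberg catalog properties. The following PySpark sketch uses only generic Iceberg REST catalog settings; the catalog name (`blms`), endpoint URI, bucket, and token are placeholders rather than confirmed values, so check the BigLake metastore documentation for the current preview endpoint and authentication options:

```python
# Sketch: attach a Spark session to an Iceberg REST catalog such as the
# BigLake metastore Iceberg REST catalog (Preview). Requires the Apache
# Iceberg Spark runtime JAR on the classpath. Values in CAPS are
# placeholders, not confirmed endpoints.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("biglake-metastore-example")
    # Generic Apache Iceberg catalog properties for a REST catalog.
    .config("spark.sql.catalog.blms", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.blms.type", "rest")
    .config("spark.sql.catalog.blms.uri", "https://REST_CATALOG_ENDPOINT")
    .config("spark.sql.catalog.blms.warehouse", "gs://YOUR_BUCKET")
    # Bearer-token auth; for example, a token from
    # `gcloud auth application-default print-access-token`.
    .config("spark.sql.catalog.blms.token", "ACCESS_TOKEN")
    .getOrCreate()
)

# List namespaces visible through the catalog to confirm connectivity.
spark.sql("SHOW NAMESPACES IN blms").show()
```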
Benefits
BigLake metastore offers several advantages for data management and analysis:
- Serverless architecture. BigLake metastore provides a serverless architecture, eliminating the need for server or cluster management. This helps reduce operational overhead, simplifies deployment, and allows for automatic scaling based on demand.
- Engine interoperability. BigLake metastore provides you with direct table access across open source engines (such as Spark and Flink) and BigQuery, allowing you to query open-format tables without additional configuration. For example, you can create a table in Spark and then query it directly in BigQuery. This helps streamline your analytics workflow and reduces the need for complex data movement or ETL processes.
- Unified user experience. BigLake metastore provides a unified workflow across BigQuery and open source engines. This unified experience means you can configure a Spark environment that's self-hosted or hosted by Dataproc through the Iceberg REST catalog (Preview), or you can configure a Spark environment in a BigQuery Studio notebook to do the same thing.
For example, in BigQuery Studio, you can create a table in Spark with a BigQuery Studio notebook.
Then, you can query the same Spark table in the Google Cloud console.
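As a hedged sketch of that workflow, assuming a Spark session already configured against BigLake metastore (the `blms` catalog name carries over from the earlier snippet; the namespace and table names are illustrative), you could create and populate an Iceberg table from Spark:

```python
# Sketch: create an Iceberg table through the metastore-backed catalog.
# The `blms` catalog, `sales` namespace, and `orders` table are
# illustrative placeholders.
spark.sql("CREATE NAMESPACE IF NOT EXISTS blms.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS blms.sales.orders (
        order_id BIGINT,
        amount   DOUBLE
    ) USING iceberg
""")
spark.sql("INSERT INTO blms.sales.orders VALUES (1, 9.99)")
```

Because the table metadata lives in BigLake metastore, the same table can then be queried from BigQuery (for example, `SELECT * FROM sales.orders` in the Google Cloud console) without copying or moving the data.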
Table formats in BigLake metastore
BigLake metastore supports several table formats. Use the following table to help select the format that best fits your use case:
| | External tables | BigLake Iceberg tables | BigLake Iceberg tables in BigQuery | Standard BigQuery tables |
|---|---|---|---|---|
| Metastore | External or self-hosted metastore | BigLake metastore | BigLake metastore | BigLake metastore |
| Storage | Cloud Storage / Amazon S3 / Azure | Cloud Storage | Cloud Storage | BigQuery |
| Management | Customer or third party | Google (highly managed experience) | Google (most managed experience) | Google (most managed experience) |
| Read / Write | Open source engines (read/write); BigQuery (read only) | Open source engines (read/write); BigQuery (read only) | Open source engines (read only with Iceberg libraries, read/write interoperability with BigQuery Storage API); BigQuery (read/write) | Open source engines (read/write interoperability with BigQuery Storage API); BigQuery (read/write) |
| Use cases | Migrations, staging tables for BigQuery loads, self-management | Open lakehouse | Open lakehouse; enterprise-grade storage for analytics, streaming, and AI | Enterprise-grade storage for analytics, streaming, and AI |
Differences with BigLake metastore (classic)
BigLake metastore is the recommended metastore on Google Cloud.
The core differences between BigLake metastore and BigLake metastore (classic) include the following:
- BigLake metastore (classic) is a standalone metastore service that is distinct from BigQuery and only supports Iceberg tables. It has a different three-part resource model. BigLake metastore (classic) tables aren't automatically discovered from BigQuery.
- Tables in BigLake metastore are accessible from multiple open source engines and BigQuery. BigLake metastore supports direct integration with Spark, which helps reduce redundancy when you store metadata and run jobs. BigLake metastore also supports the Iceberg REST catalog (Preview), which connects lakehouse data across multiple runtimes.
Limitations
The following limitations apply to tables in BigLake metastore:
- You can't create or modify BigLake metastore tables with DDL or DML statements using the BigQuery engine. You can modify BigLake metastore tables using the BigQuery API (with the bq command-line tool or client libraries), but doing so risks making changes that are incompatible with the external engine.
- BigLake metastore tables don't support renaming operations or `ALTER TABLE ... RENAME TO` Spark SQL statements.
- BigLake metastore tables are subject to the same quotas and limits as standard tables.
- Query performance for BigLake metastore tables from the BigQuery engine might be slower than querying data in a standard BigQuery table. In general, the query performance for a BigLake metastore table should be equivalent to reading the data directly from Cloud Storage.
- A dry run of a query that uses a BigLake metastore table might report a lower bound of 0 bytes of data, even if rows are returned. This result occurs because the amount of data that is processed from the table can't be determined until the actual query completes. Running the query incurs a cost for processing this data.
- You can't reference a BigLake metastore table in a wildcard table query.
- You can't use the `tabledata.list` method to retrieve data from BigLake metastore tables. Instead, you can save query results to a destination table, then use the `tabledata.list` method on that table (see the sketch after this list).
- BigLake metastore tables don't support clustering.
- BigLake metastore tables don't support flexible column names.
- BigLake metastore tables don't display table storage statistics.
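For the `tabledata.list` limitation above, the following is a minimal workaround sketch using the google-cloud-bigquery Python client library; the project, dataset, and table names are placeholders:

```python
# Sketch: copy rows from a BigLake metastore table into a regular
# destination table, then page through the copy with list_rows (which
# calls tabledata.list under the hood). All identifiers are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT")

dest = bigquery.TableReference.from_string(
    "YOUR_PROJECT.YOUR_DATASET.metastore_table_copy"
)
job_config = bigquery.QueryJobConfig(
    destination=dest,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(
    "SELECT * FROM YOUR_DATASET.YOUR_METASTORE_TABLE",
    job_config=job_config,
).result()  # wait for the query to finish writing the destination table

# tabledata.list works on the regular destination table.
for row in client.list_rows(dest, max_results=10):
    print(dict(row.items()))
```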
What's next
- Migrate Dataproc Metastore data to BigLake metastore
- Use BigLake metastore with Dataproc
- Use BigLake metastore with Dataproc Serverless