Optimal data and metadata formats for lakehouses

This document guides you through the optimal data and metadata formats as you design your data lakehouse with BigLake.

A data lakehouse is a data architecture that combines the structure of a data warehouse with the raw data flexibility of a data lake. This architecture provides flexibility and scalability for a wide range of data use cases. The Google Cloud data lakehouse solution is called BigLake, which connects Google Cloud and open source services to create a unified interface for analytics and AI. A data lakehouse that's built with BigLake consists of the following key components:

  • Storage capabilities: Cloud Storage or BigQuery, with Apache Iceberg as the recommended open table format
  • A metastore: BigLake metastore
  • A query engine: BigQuery, Apache Spark, Apache Flink, Trino, or other open source engines
  • A tool for data writing and analytics: various BigQuery and open source connections

BigLake packages all of these components in a single experience with uniform governance. For more information on BigLake architecture and innovations, see BigLake evolved.

Select a metastore

For your metastore, we recommend using BigLake metastore. BigLake metastore is a fully managed and serverless metastore for your lakehouse on Google Cloud. It provides a single source of truth for metadata from multiple sources and is accessible from BigQuery and various open data processing engines, so you don't need customized tools to copy and synchronize metadata between different repositories. BigLake metastore is supported with Dataplex Universal Catalog, which provides unified and fine-grained access controls across all supported engines. Dataplex Universal Catalog also enables end-to-end governance that includes comprehensive lineage, data quality, and discoverability capabilities.

Select a table format

With BigLake metastore as the metastore for your open lakehouse, you have the following choices for the format of your tables:

  • Choose standard BigQuery tables for data managed in BigQuery. These tables are fully managed by BigQuery and have the most advanced data analytics and management features. You can still connect these tables to BigLake metastore. Choose this option for non-Iceberg tables.
  • Choose BigLake Iceberg tables in BigQuery for a fully managed experience on BigQuery. These tables are Iceberg tables that you create from BigQuery and store in Cloud Storage. Like all tables that use BigLake metastore, they can be read by open source engines or BigQuery. However, BigQuery is the only engine that can directly write to them. Choose this option if you want your extract, transform, and load (ETL) workflow to be managed by BigQuery.
  • Choose BigLake Iceberg tables for a semi-managed experience on Google Cloud. These tables are Iceberg tables that you create from open source engines and store in Cloud Storage. Like all tables that use BigLake metastore, they can be read by open source engines or BigQuery. However, the open source engine that created the table is the only engine that can write to it. Choose this option if you want your ETL workflow to be managed by the open source engine.
  • Choose external tables for tables outside of BigLake metastore. The data and metadata of these tables are completely self-managed, so you rely fully on the capabilities of open table formats (such as Iceberg, Apache Hudi, or Delta Lake). BigQuery can only read from these tables. Choose this option for data and metadata that you want to manage on your own in a third-party catalog.
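
The guidance in the list above can be sketched as a small decision helper. This is an illustrative simplification, not an official selection tool: the function name `choose_table_format`, its parameters, and the `Writer` enum are hypothetical, and the rules only encode the "Choose this option" sentences from the list.

```python
from enum import Enum


class Writer(Enum):
    """Which engine manages the ETL workflow (hypothetical helper type)."""
    BIGQUERY = "BigQuery"
    OPEN_SOURCE_ENGINE = "open source engine"


def choose_table_format(uses_biglake_metastore: bool,
                        is_iceberg: bool,
                        writer: Writer) -> str:
    """Return the table format suggested by the list above.

    Assumption: the three inputs are enough to decide. Real projects
    usually weigh more factors (governance, cost, existing catalogs).
    """
    # Tables kept in a third-party or self-hosted catalog are external tables.
    if not uses_biglake_metastore:
        return "External tables"
    # Non-Iceberg data managed in BigQuery uses standard BigQuery tables.
    if not is_iceberg:
        return "Standard BigQuery tables"
    # Iceberg tables: the engine that manages ETL decides the variant.
    if writer is Writer.BIGQUERY:
        return "BigLake Iceberg tables in BigQuery"
    return "BigLake Iceberg tables"


if __name__ == "__main__":
    print(choose_table_format(True, True, Writer.OPEN_SOURCE_ENGINE))
```

For example, an Iceberg table in BigLake metastore whose ETL is managed by an open source engine maps to BigLake Iceberg tables, while the same table with BigQuery-managed ETL maps to BigLake Iceberg tables in BigQuery.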

Use the following table to compare your table format options:

| | External tables | BigLake Iceberg tables | BigLake Iceberg tables in BigQuery | Standard BigQuery tables |
|---|---|---|---|---|
| Metastore | External or self-hosted metastore | BigLake metastore | BigLake metastore | BigLake metastore |
| Storage | Cloud Storage / Amazon S3 / Azure | Cloud Storage | Cloud Storage | BigQuery |
| Management | Customer or third party | Google | Google (highly managed experience) | Google (most managed experience) |
| Read / Write | Open source engines (read/write); BigQuery (read only) | Open source engines (read/write); BigQuery (read only) | Open source engines (read only with Iceberg libraries, read/write interoperability with BigQuery Storage API); BigQuery (read/write) | Open source engines (read/write interoperability with BigQuery Storage API); BigQuery (read/write) |
| Use cases | Migrations, staging tables for BigQuery loads, self-management | Open lakehouse | Open lakehouse, enterprise-grade storage for analytics, streaming, and AI | Enterprise-grade storage for analytics, streaming, and AI |
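
If you want to filter the comparison programmatically, for example in a design review script, the metastore, storage, and read/write rows can be held as plain data. The following is a minimal sketch: the `TableFormat` class, the `FORMATS` list, and the `formats_with_bigquery_access` function are hypothetical names, and the values are copied from the comparison above rather than fetched from any API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableFormat:
    """One column of the comparison, reduced to a few rows."""
    name: str
    metastore: str
    storage: str
    bigquery_access: str      # "read only" or "read/write"
    open_source_access: str   # free text, copied from the comparison


FORMATS = [
    TableFormat("External tables", "External or self-hosted metastore",
                "Cloud Storage / Amazon S3 / Azure",
                "read only", "read/write"),
    TableFormat("BigLake Iceberg tables", "BigLake metastore",
                "Cloud Storage",
                "read only", "read/write"),
    TableFormat("BigLake Iceberg tables in BigQuery", "BigLake metastore",
                "Cloud Storage",
                "read/write",
                "read only with Iceberg libraries; "
                "read/write via BigQuery Storage API"),
    TableFormat("Standard BigQuery tables", "BigLake metastore",
                "BigQuery",
                "read/write", "read/write via BigQuery Storage API"),
]


def formats_with_bigquery_access(access: str) -> list[str]:
    """Return the names of formats whose BigQuery access equals `access`."""
    return [f.name for f in FORMATS if f.bigquery_access == access]


if __name__ == "__main__":
    # Formats that BigQuery can write to directly.
    print(formats_with_bigquery_access("read/write"))
```

Keeping the comparison as data makes it easy to answer questions such as "which formats can BigQuery write to?" without rereading the whole table, but the source of truth remains the comparison itself.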

What's next