BigLake overview

BigLake is a storage engine that unites Google Cloud and open source services to create a unified interface for advanced analytics and AI. It provides the foundation that you need to build an open, managed, and high-performance lakehouse with automated data management and built-in governance using Apache Iceberg.

BigLake enables interoperability across Iceberg-compatible engines such as Apache Spark and BigQuery, giving you a consistent view of your data. It also extends Cloud Storage management capabilities, including Autoclass tiering, encryption, and customer-managed encryption keys on your storage buckets. In addition, the built-in integration with Dataplex Universal Catalog lets you define governance policies centrally and enforce them consistently across engines, while enabling semantic search, data lineage, profiling, and quality checks.

BigLake also offers a fully managed Iceberg experience when integrated with BigQuery. By using BigQuery's highly scalable, real-time metadata management, you get openness and data ownership together with high-performance analytics, streaming, and AI.

Architecture

A data lakehouse that's built with BigLake consists of the following components:

  • Storage capabilities. Cloud Storage with Apache Iceberg as the recommended open table format.
  • A metastore. BigLake metastore is a unified, managed, serverless, and scalable metastore that provides a single source of truth for managing metadata across multiple engines.
  • A query engine. BigQuery, Apache Spark, Apache Flink, Trino, and other open source engines are all compatible with BigLake.
  • A tool for data writing and analytics. BigQuery, Spark, Flink, Trino, and other open source tools integrate with BigLake to provide a variety of paths for writes and analysis.

BigLake packages all of these components in a single experience with uniform governance. For more information on BigLake architecture and innovations, see BigLake evolved.

BigLake metastore

BigLake metastore is a fully managed and serverless metastore for your lakehouse on Google Cloud. It provides a single source of truth for metadata from multiple sources and is accessible from BigQuery and various open data processing engines, removing the need to copy and synchronize metadata between different repositories with customized tools.
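As a rough sketch of how an open source engine might be pointed at a shared metastore, the following Python helper assembles Spark properties for an Iceberg REST catalog. The property keys follow Apache Iceberg's Spark catalog configuration. The catalog name, endpoint URI, and warehouse path are placeholders, and the assumption that your engine reaches BigLake metastore through an Iceberg REST endpoint should be checked against the current BigLake metastore documentation. Authentication settings are deliberately omitted.

```python
def iceberg_rest_catalog_conf(catalog_name: str, rest_uri: str, warehouse: str) -> dict[str, str]:
    """Build Spark properties that register an Iceberg REST catalog.

    All argument values are supplied by the caller; nothing here is
    specific to a particular endpoint.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Iceberg's Spark catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Use the REST catalog protocol rather than Hive or Hadoop catalogs.
        f"{prefix}.type": "rest",
        # Placeholder: the REST endpoint of the metastore (assumption).
        f"{prefix}.uri": rest_uri,
        # Placeholder: Cloud Storage location that holds table data.
        f"{prefix}.warehouse": warehouse,
    }


# Hypothetical values for illustration only.
conf = iceberg_rest_catalog_conf(
    catalog_name="lakehouse",
    rest_uri="https://REST_CATALOG_ENDPOINT",
    warehouse="gs://BUCKET/warehouse",
)

# Print the properties in the form accepted by spark-submit.
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Because every engine reads the same catalog, a table registered once should not need its metadata copied into a second repository.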

BigLake metastore is supported with Dataplex Universal Catalog, which provides unified and fine-grained access controls across all supported engines and enables end-to-end governance that includes comprehensive lineage, data quality, and discoverability capabilities.

Table formats

When building a lakehouse on BigLake, you have the following choices for the format of your tables:

  • BigLake Iceberg tables in BigQuery are Iceberg tables that you create from BigQuery and store in Cloud Storage. Like all tables that use BigLake metastore, they can be read by open source engines and BigQuery. However, BigQuery is the only engine that can directly write to them. This option is best if you want your extract, transform, and load (ETL) workflow to be fully managed by BigQuery.
  • BigLake Iceberg tables are Iceberg tables that you create from open source engines and store in Cloud Storage. Like all tables that use BigLake metastore, they can be read by open source engines and BigQuery. However, the open source engine that created the table is the only engine that can write to it. This option is best if you want your ETL workflow to be managed by the open source engine.
  • Standard BigQuery tables are fully managed by BigQuery and have the most advanced data analytics and management features. You can still connect these tables to BigLake metastore. This option is best for non-Iceberg tables.
  • External tables are tables that are outside of BigLake metastore. The data and metadata of these tables are completely self-managed, so you rely entirely on the capabilities of the open table format (such as Iceberg, Apache Hudi, or Delta Lake). BigQuery can only read from these tables. Choose this option for data and metadata that you want to manage on your own in a third-party catalog.
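To make the first option more concrete, here is a minimal sketch that builds a BigQuery DDL statement for an Iceberg table stored in Cloud Storage. The helper name, table, connection, and bucket are hypothetical, and the OPTIONS keys (file_format, table_format, storage_uri) are an assumption based on BigQuery's DDL for Iceberg tables, so verify them against the current reference before use. The snippet only constructs the statement text and does not run it.

```python
def iceberg_table_ddl(table: str, columns: dict[str, str], connection: str, storage_uri: str) -> str:
    """Return a CREATE TABLE statement for an Iceberg table managed by BigQuery.

    The OPTIONS keys are assumed, not confirmed by this overview.
    """
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE `{table}` ({column_list})\n"
        f"WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        "  file_format = 'PARQUET',\n"
        "  table_format = 'ICEBERG',\n"
        f"  storage_uri = '{storage_uri}'\n"
        ")"
    )


# Hypothetical identifiers for illustration only.
ddl = iceberg_table_ddl(
    table="my_project.my_dataset.orders",
    columns={"order_id": "INT64", "placed_at": "TIMESTAMP"},
    connection="my_project.us.my_connection",
    storage_uri="gs://my-bucket/orders",
)
print(ddl)
```

You could submit the resulting statement through any BigQuery client. Tables created from an open source engine instead would go through that engine's own Iceberg catalog.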

Use the following chart to compare your table format options:

| | External tables | BigLake Iceberg tables | BigLake Iceberg tables in BigQuery | Standard BigQuery tables |
| --- | --- | --- | --- | --- |
| Metastore | External or self-hosted metastore | BigLake metastore | BigLake metastore | BigQuery |
| Storage | Cloud Storage / Amazon S3 / Azure | Cloud Storage | Cloud Storage | BigQuery |
| Storage optimization | Customer or third-party managed | Customer or third-party managed | Google managed | Google managed |
| Read / Write | Open source engines (read/write); BigQuery (read only) | Open source engines (read/write); BigQuery (read only) | Open source engines (read only with Iceberg libraries, read/write interoperability with BigQuery Storage API); BigQuery (read/write) | Open source engines (read/write interoperability with BigQuery Storage API); BigQuery (read/write) |
| Use cases | Staging tables for BigQuery loads, legacy query-only tables | Open lakehouse | Open lakehouse with high-performance, enterprise-grade storage for advanced analytics, streaming, and AI | Enterprise-grade storage for advanced analytics, streaming, and AI |
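The comparison above can also be read as a small lookup. The following sketch encodes three of its rows (metastore, storage optimization, and whether BigQuery can write) and filters the options by two requirements. The class and function names are hypothetical and exist only for illustration. The values come from the comparison itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableOption:
    name: str
    metastore: str
    storage_optimization: str
    bigquery_can_write: bool


# Values transcribed from the comparison of table format options.
TABLE_OPTIONS = [
    TableOption("External tables", "External or self-hosted metastore",
                "Customer or third-party managed", False),
    TableOption("BigLake Iceberg tables", "BigLake metastore",
                "Customer or third-party managed", False),
    TableOption("BigLake Iceberg tables in BigQuery", "BigLake metastore",
                "Google managed", True),
    TableOption("Standard BigQuery tables", "BigQuery",
                "Google managed", True),
]


def candidates(bigquery_writes: bool, google_managed_storage: bool) -> list[str]:
    """Return the table options that match both requirements exactly."""
    wanted = "Google managed" if google_managed_storage else "Customer or third-party managed"
    return [
        option.name
        for option in TABLE_OPTIONS
        if option.bigquery_can_write == bigquery_writes
        and option.storage_optimization == wanted
    ]


# Options where BigQuery writes and Google manages storage optimization.
managed = candidates(bigquery_writes=True, google_managed_storage=True)
print(managed)
```

A filter like this only narrows the list. The choice between the remaining options still depends on whether you need an open Iceberg format or the full set of BigQuery table features.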

What's next