Introduction to data governance in BigQuery

BigQuery has built-in governance capabilities that simplify how you discover, manage, monitor, govern, and use your data and AI assets.

Administrators, data stewards, data governance managers, and data custodians can use the governance capabilities in BigQuery to do the following:

  • Discover data.
  • Curate data.
  • Gather and enrich metadata.
  • Manage data quality.
  • Ensure that data is used consistently and in compliance with organizational policies.
  • Share data at scale and in a secure fashion.

At the heart of BigQuery governance capabilities is universal catalog, a centralized inventory of all data assets in your organization. Universal catalog holds business, technical, and runtime metadata for all of your data. It helps you discover relationships and semantics in the metadata by applying artificial intelligence and machine learning.

Universal catalog brings together a data catalog and a fully managed runtime metastore. The metastore in BigQuery lets you use multiple data processing engines to query a single copy of data with a single schema, without data duplication. The data processing engines that you can use include BigQuery, Apache Spark, Apache Flink, and Apache Hive. Your data can be stored in locations like BigQuery storage tables, BigQuery tables for Apache Iceberg, or BigLake external tables.

BigQuery supports an end-to-end data lifecycle, from discovery to use of data. Universal catalog powers BigQuery governance features and capabilities. Governance features are also available in Dataplex.

Data discovery

BigQuery discovers data across the organization in Google Cloud, whether the data is in BigQuery, Spanner, Cloud SQL, Pub/Sub, or Cloud Storage. BigQuery automatically extracts the metadata and stores it in universal catalog. For example, you can use BigQuery to extract metadata for structured and unstructured data from Cloud Storage, and you can automatically create query-ready BigLake tables at scale. This lets you perform analytics with an open source engine without data duplication.

You can also extract and catalog metadata from third-party data sources using custom connectors.

BigQuery offers the following data discovery capabilities:

  • Search. Search for data and AI resources across projects by using BigQuery in the Google Cloud console. BigQuery supports semantic search for data discovery, letting you search with natural language queries. A programmatic search sketch follows this list.
  • Automatic discovery of Cloud Storage data. Scan for data in Cloud Storage buckets to extract and then catalog metadata. Automatic discovery creates tables for both structured and unstructured data.
  • Metadata import. Import metadata at scale from third-party systems into universal catalog. You can build custom connectors to extract data from your data sources, and then run managed connectivity pipelines that orchestrate the metadata import workflow.
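
As an illustration of programmatic search, the following minimal sketch uses the Data Catalog Python client to find BigQuery tables across a project. The project ID, query string, and library choice are assumptions for the example; the same kind of search is available in the Google Cloud console and through the Dataplex API.

    from google.cloud import datacatalog_v1  # pip install google-cloud-datacatalog

    # Assumption: "my-project" stands in for your own project ID.
    client = datacatalog_v1.DataCatalogClient()

    scope = datacatalog_v1.SearchCatalogRequest.Scope(
        include_project_ids=["my-project"],
    )

    # Search for BigQuery tables whose name contains "orders".
    request = datacatalog_v1.SearchCatalogRequest(
        scope=scope,
        query="type=table name:orders",
    )

    for result in client.search_catalog(request=request):
        print(result.relative_resource_name)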

Curation and data stewardship

To improve the discoverability and usability of data, data stewards and administrators can use BigQuery to review, update, and analyze metadata. BigQuery data curation and stewardship capabilities help you ensure that your data is accurate, consistent, and aligned with your organization's policies.

BigQuery offers the following data curation and stewardship capabilities:

  • Business glossary (Preview). Improve context, collaboration, and search by defining your organization's terminology in a glossary. Identify data stewards for the terms, and attach terms to data asset fields.
  • Data insights. Gemini uses metadata to generate natural language questions about your table and the SQL queries to answer them. These data insights help you uncover patterns, assess data quality, and perform statistical analysis.
  • Data profiling. Identify common statistical characteristics of the columns in BigQuery tables to understand and analyze your data more effectively. An example scan configuration follows this list.
  • Data quality. Define and run data quality checks across tables in BigQuery and Cloud Storage, and apply regular and ongoing data controls in BigQuery environments.
  • Data lineage. Track how data moves through your systems: where it comes from, where it's passed to, and what transformations are applied to it. BigQuery supports data lineage at the table and column levels.
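
To sketch how a profile scan might be created programmatically, the following example uses the Dataplex Python client to define and run a data profile scan over a BigQuery table. The project, location, table, and scan ID are placeholder assumptions, and exact field names can vary by client library version, so treat this as an outline rather than a definitive recipe.

    from google.cloud import dataplex_v1  # pip install google-cloud-dataplex

    client = dataplex_v1.DataScanServiceClient()

    # Assumptions: project "my-project", region "us-central1", and an existing
    # BigQuery table my-project.analytics.orders.
    scan = dataplex_v1.DataScan(
        data=dataplex_v1.DataSource(
            resource="//bigquery.googleapis.com/projects/my-project/datasets/analytics/tables/orders"
        ),
        data_profile_spec=dataplex_v1.DataProfileSpec(),  # profile every column
    )

    operation = client.create_data_scan(
        request=dataplex_v1.CreateDataScanRequest(
            parent="projects/my-project/locations/us-central1",
            data_scan_id="orders-profile-scan",
            data_scan=scan,
        )
    )
    print(operation.result().name)  # wait for the scan resource to be created

    # Trigger a one-off run of the scan.
    client.run_data_scan(
        request=dataplex_v1.RunDataScanRequest(
            name="projects/my-project/locations/us-central1/dataScans/orders-profile-scan"
        )
    )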

Next steps for curation and data stewardship

The following learning paths, organized by experience level, outline next steps that you can take to learn more about curation and data stewardship features:

New cloud users:
  • Run a data profile scan to gain insights about your data, including the limits or averages of your data.

Experienced cloud users:

Security and access control

Data access management is the process of defining, enforcing, and monitoring the rules and policies that govern who has access to data. Access management ensures that data is accessible only to authorized users.

BigQuery offers the following security and access control capabilities:

  • Identity and Access Management (IAM). IAM lets you control who has access to your BigQuery resources, such as projects, datasets, tables, and views. You can grant IAM roles to users, groups, and service accounts; these roles define what each principal can do with your resources. An example of granting dataset access follows this list.
  • Column-level access controls and row-level access controls. Column-level and row-level access controls let you restrict access to specific columns and rows in a table, based on user attributes or data values. This fine-grained access helps protect sensitive data from unauthorized access. An example row access policy follows this list.
  • VPC Service Controls. VPC Service Controls lets you create perimeters around Google Cloud resources and control access to those resources based on your organization's policies.
  • Audit logs. Audit logs provide you with a detailed record of user activity and system events in your organization. These logs help you enforce data governance policies and identify potential security risks.
  • Data masking. Data masking lets you obscure sensitive data in a table while still permitting authorized users to access the surrounding data. Data masking can also obscure data that matches sensitive data patterns, safeguarding against accidental data disclosure.
  • Encryption. BigQuery automatically encrypts all data at rest and in transit, while letting you customize your encryption settings to meet your specific requirements.
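
As an example of managing access with IAM, the following minimal sketch uses the BigQuery Python client to grant a user read access to a dataset. The project, dataset, and email address are placeholder assumptions; in practice you can also grant roles at the project, table, or view level through IAM directly.

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()

    # Assumption: dataset "analytics" exists in project "my-project".
    dataset = client.get_dataset("my-project.analytics")

    # Append an access entry that grants dataset-level read access (READER).
    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(
            role="READER",
            entity_type="userByEmail",
            entity_id="analyst@example.com",
        )
    )
    dataset.access_entries = entries

    dataset = client.update_dataset(dataset, ["access_entries"])
    print(f"{len(dataset.access_entries)} access entries on {dataset.dataset_id}")

The same pattern works for other entity types, such as groups (groupByEmail) or domains.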

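Row-level access is configured with SQL DDL. The following sketch, which reuses the placeholder project and table from the previous example, creates a row access policy so that the named user sees only rows where region = "US". Sending the DDL through the Python client is one option; you can run the same statement in the console or with the bq tool.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Assumption: table my-project.analytics.orders has a STRING column named "region".
    ddl = """
    CREATE ROW ACCESS POLICY us_rows_only
    ON `my-project.analytics.orders`
    GRANT TO ("user:analyst@example.com")
    FILTER USING (region = "US");
    """

    client.query(ddl).result()  # wait for the DDL statement to finish

After a table has any row access policy, users who aren't named in a policy on that table see no rows.
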
Next steps for security and access control

The following learning paths, organized by experience level, outline next steps that you can take to learn more about access control features:

New cloud users:

Experienced cloud users:
  • For greater flexibility and granularity in managing your permissions, consider creating custom roles that match your needs.
  • Add row-level and column-level access controls to restrict access to specific rows and columns in your tables.
  • Establish an access perimeter around your Google Cloud resources by setting up VPC Service Controls.
  • Add column-level data masking to your tables to share information throughout your organization without revealing sensitive data.
  • Use Sensitive Data Protection to scan your data for sensitive and high-risk information, such as personally identifiable information (PII), financial data, and health information.

Shared data and insights

BigQuery lets you share data and insights at scale, within and across organizational boundaries, through a built-in data exchange platform with a robust security and privacy framework. With BigQuery sharing, you can discover, access, and consume a data library that's curated by a wide selection of data providers.

BigQuery offers the following sharing capabilities:

  • Share more than data. You can share a wide range of data and AI assets such as BigQuery datasets, tables, views, real-time streams with Pub/Sub topics, SQL stored procedures, and BigQuery ML models.
  • Access Google datasets. Augment your analytics and ML initiatives with Google datasets from Search Trends, DeepMind WeatherNext models, Google Maps Platform, Google Earth Engine, and more.
  • Integrate with data governance principles. Data owners retain control over their data and have the ability to define and configure rules or policies to restrict access and usage.
  • Live, zero-copy data sharing. Data is shared in place, with no integration, data movement, or replication needed, so analysis is always based on the latest information. Linked datasets are live pointers to the shared assets.
  • Enhance security posture. You can use access controls, including built-in VPC Service Controls support, to reduce overprovisioned access.
  • Increase visibility with provider usage metrics. Data publishers can view and monitor usage for shared assets such as the number of jobs executed, total bytes scanned, and subscribers for each organization.
  • Collaborate on sensitive data with data clean rooms. Data clean rooms provide a security-enhanced environment in which multiple parties can share, join, and analyze their data assets without moving or revealing the underlying data.
  • Built on BigQuery. You can build on the scalability and massive processing capabilities of BigQuery, enabling large-scale collaboration.
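
To sketch how sharing is set up programmatically, the following example uses the Analytics Hub Python client to create a data exchange and publish a dataset as a listing. The project, location, dataset, and display names are placeholder assumptions, and the client surface may differ between library versions, so treat this as an outline of the workflow rather than a definitive implementation.

    from google.cloud import bigquery_analyticshub_v1  # pip install google-cloud-bigquery-analyticshub

    client = bigquery_analyticshub_v1.AnalyticsHubServiceClient()

    # Assumptions: project "my-project", location "us", and an existing
    # BigQuery dataset "marketing_data" that you want to share.
    parent = "projects/my-project/locations/us"

    exchange = client.create_data_exchange(
        request=bigquery_analyticshub_v1.CreateDataExchangeRequest(
            parent=parent,
            data_exchange_id="marketing_exchange",
            data_exchange=bigquery_analyticshub_v1.DataExchange(
                display_name="Marketing exchange",
            ),
        )
    )

    listing = client.create_listing(
        request=bigquery_analyticshub_v1.CreateListingRequest(
            parent=exchange.name,
            listing_id="marketing_data",
            listing=bigquery_analyticshub_v1.Listing(
                display_name="Marketing data",
                bigquery_dataset=bigquery_analyticshub_v1.Listing.BigQueryDatasetSource(
                    dataset="projects/my-project/datasets/marketing_data",
                ),
            ),
        )
    )
    print(listing.name)

A subscriber to this listing receives a linked dataset, which is the live pointer to the shared asset described above.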

Next steps for sharing

The following learning paths, organized by experience level, outline next steps that you can take to learn more about sharing features:

New cloud users:
  • Learn how to create and manage exchanges and listings to start sharing within or outside of your organization.

Experienced cloud users:

What's next