Discover and catalog Cloud Storage data
This document explains how to use Dataplex automatic discovery, a feature in BigQuery that lets you scan data in Cloud Storage buckets to extract and then catalog metadata. As part of the discovery scan, automatic discovery creates BigLake or external tables for structured data and object tables for unstructured data. Centralizing these tables in BigQuery eases AI-powered data insights, data security, and governance.
To use automatic discovery of Cloud Storage data, you create and run a discovery scan.
Discovery scan overview
A discovery scan does the following:
- Scans the data in the Cloud Storage bucket or path.
- Groups structured and semi-structured data into tables.
- Collects metadata, such as the table name, schema, and partition definition.
- Creates and updates BigLake, external, or object tables in BigQuery using the schema and partition definition.
For unstructured data, such as images and videos, the discovery scan detects and registers groups of files that share the same media type as BigLake object tables. For example, if gs://images/group1 contains GIF images and gs://images/group2 contains JPEG images, the discovery scan detects and registers two filesets.
For structured data, such as Avro, the discovery scan registers groups of files as BigLake external tables and detects files only if they're located in folders that contain the same data format and compatible schema.
The discovery scan supports the following structured and semi-structured data formats:
- Parquet
- Avro
- ORC
- JSON (only the newline-delimited format)
- CSV (but not CSV files that have comment rows)
The discovery scan supports the following compression formats for structured and semi-structured data:
Internal compression for the following formats:

| Compression | File extension sample | Supported format |
| --- | --- | --- |
| gzip | .gz.parquet | Parquet |
| lz4 | .lz4.parquet | Parquet |
| Snappy | .snappy.parquet | Parquet, ORC, Avro |
| lzo | .lzo.parquet | Parquet, ORC |

External compression for JSON and CSV files:
- gzip
- bzip2
To see the limit of how many tables a discovery scan supports, see Quotas and limits.
The discovered tables are registered in BigQuery as BigLake external tables, BigLake object tables, or external tables. This makes their data available for analysis in BigQuery. Metadata caching for BigLake tables and object tables is also enabled. All the BigLake tables are automatically ingested into BigQuery universal catalog for search and discovery.
Before you begin
Enable the Dataplex API.
Required roles for the Dataplex service account
Before you begin, assign the IAM permissions in this section to the Dataplex service account in your project:

service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com

Replace PROJECT_NUMBER with the number of the project in which the Dataplex API is enabled.
To ensure that the Dataplex service account has the necessary permissions to create and run a discovery scan, ask your administrator to grant the Dataplex service account the following IAM roles:
- Dataplex Discovery Service Agent (roles/dataplex.discoveryServiceAgent) on the storage bucket
- Dataplex Discovery Publishing Service Agent (roles/dataplex.discoveryPublishingServiceAgent) on the user project
- Create BigLake tables: Dataplex Discovery BigLake Publishing Service Agent (roles/dataplex.discoveryBigLakePublishingServiceAgent) on the BigQuery connection
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to create and run a discovery scan. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create and run a discovery scan:
- bigquery.datasets.create on the data source project
- storage.buckets.get on the data source bucket
- storage.objects.get on the data source bucket
- storage.objects.list on the data source bucket
- bigquery.datasets.get on the data source project
- Provide a connection:
  - bigquery.connections.delegate on the BigQuery connection
  - bigquery.connections.use on the BigQuery connection
Your administrator might also be able to give the Dataplex service account these permissions with custom roles or other predefined roles.
Required roles for the BigQuery connection service account
To ensure that the BigQuery connection service account has the necessary permissions to create a discovery scan, ask your administrator to grant the BigQuery connection service account the Dataplex Discovery Service Agent (roles/dataplex.discoveryServiceAgent) IAM role on the Cloud Storage bucket.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to create a discovery scan. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create a discovery scan:
- bigquery.datasets.create on the data source project
- storage.buckets.get on the data source bucket
- storage.objects.get on the data source bucket
- storage.objects.list on the data source bucket
- bigquery.datasets.get on the data source project
- Provide a connection:
  - bigquery.connections.delegate on the BigQuery connection
  - bigquery.connections.use on the BigQuery connection
Your administrator might also be able to give the BigQuery Connection Service account these permissions with custom roles or other predefined roles.
Required roles for end users
To get the permissions that you need to create and manage data discovery scans, ask your administrator to grant you the following IAM roles on the Cloud Storage bucket:
- Full access to DataScan resources: Dataplex DataScan Administrator (roles/dataplex.dataScanAdmin) on your project
- Write access to DataScan resources: Dataplex DataScan Editor (roles/dataplex.dataScanEditor) on your project
- Read access to DataScan resources, excluding the results: Dataplex DataScan Viewer (roles/dataplex.dataScanViewer) on your project
- Read access to DataScan resources, including the results: Dataplex DataScan DataViewer (roles/dataplex.dataScanDataViewer) on your project
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to create and manage data discovery scans. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create and manage data discovery scans:
- Create a DataScan: dataplex.datascans.create on your project
- Delete a DataScan: dataplex.datascans.delete on your project or a DataScan resource
- View DataScan details, excluding results: dataplex.datascans.get on your project or a DataScan resource
- View DataScan details, including results: dataplex.datascans.getData on your project or a DataScan resource
- List DataScans: dataplex.datascans.list on your project or a DataScan resource
- Run a DataScan: dataplex.datascans.run on your project or a DataScan resource
- Update the description of a DataScan: dataplex.datascans.update on your project or a DataScan resource
- View the IAM permissions of the DataScan: dataplex.datascans.getIamPolicy on your project or a DataScan resource
- Set the IAM permissions on the DataScan: dataplex.datascans.setIamPolicy on your project or a DataScan resource
You might also be able to get these permissions with custom roles or other predefined roles.
Create a discovery scan
To discover data, you must create and run a discovery scan. You can set a schedule for the scan or run the scan on demand.
When the discovery scan runs, it creates a new dataset in BigQuery that corresponds to the Cloud Storage bucket that was scanned. The BigQuery dataset name is the same as the Cloud Storage bucket name. Invalid characters in the bucket name are replaced by an underscore. If the dataset name isn't available, a suffix is appended (for example, _discovered_001). The dataset contains the BigLake external or non-BigLake external tables that were created by the discovery scan for further analysis.
Console
In the Google Cloud console, go to the Catalog management page.
In the Cloud Storage discovery tab, click Create.
In the Create discovery scan pane, configure the details about the data to scan.
Enter a name for the scan.
In the Scan ID field, enter a unique ID that follows the resource naming conventions in Google Cloud. If you don't provide an ID, the discovery scan generates the scan ID.
Optional: Provide a description of the scan.
To specify the Cloud Storage bucket that contains the files to scan, in the Bucket field, browse to and select the bucket.
Optional: Define the data to include or exclude from the discovery scan by providing a list of glob patterns for file filtering.
- Include: if only a subset of the data should be scanned, provide a list of glob patterns that match the objects to include.
- Exclude: provide a list of glob patterns that match the objects to exclude.
For example, if you want to exclude gs://test_bucket/foo/.. from the discovery scan, enter **/foo/** as the exclude path. Quotation marks cause errors. Make sure to enter **/foo/** instead of "**/foo/**".

If you provide both include patterns and exclude patterns, the exclude patterns are applied first.
For Location type, select either Region or Multi-region (whichever is available).
To create BigLake tables from the scanned data, in the Connection ID field, provide your Google Cloud resource connection ID. For more information, see Google Cloud resource connections in BigQuery.
You can create a new connection in the same location as the BigQuery dataset, which is compatible with the Cloud Storage bucket location.
If you don't provide a resource connection ID, the discovery scan creates non-BigLake external tables.
In the Discovery frequency section, configure when you want the discovery scan to run:
Repeat: the scan runs on a predefined schedule. Provide the start time, days to run the scan, and the frequency, such as hourly.
On demand: the scan runs on demand.
Optional: To specify how the scan processes JSON and CSV files, click JSON or CSV specifications.
- To configure JSON options, select Enable JSON parsing options.
- Disable type inference: whether the discovery scan should infer data types when scanning data. If you disable type inference for JSON data, all columns are registered as their primitive types, such as string, number, or boolean.
- Encoding format: the character encoding of the data, such as UTF-8, US-ASCII, or ISO-8859-1. If you don't specify a value, UTF-8 is used as the default.
- To configure CSV options, check Enable CSV parsing options.
- Disable type inference: whether the discovery scan should infer data types when scanning data. If you disable type inference for CSV data, all columns are registered as strings.
- Header rows: the number of header rows, either 0 or 1. If you specify the value 0, the discovery scan infers headings and extracts the column names from the file. The default is 0.
- Column delimiter character: the character that is used to separate values. Provide a single character, \r (carriage return), or \n (newline). The default is a comma (,).
- Encoding format: the character encoding of the data, such as UTF-8, US-ASCII, or ISO-8859-1. If you don't specify a value, UTF-8 is used as the default.
When you're finished configuring the discovery scan, click Create (for a scheduled scan) or Run now (for an on-demand scan).
A scheduled scan is run according to the schedule that you set.
An on-demand scan is run once initially when you create it, and you can run the scan at any time. It can take several minutes for the discovery scan to run.
gcloud
```bash
gcloud alpha dataplex datascans create data-discovery \
    --location=LOCATION \
    --data-source-resource=BUCKET_PATH
```
Replace the following:
- LOCATION: the location in which you'd like to create your discovery scan
- BUCKET_PATH: the Cloud Storage path of the bucket you want to scan
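For example, a hypothetical invocation that fills in the placeholders above (the region and bucket name are illustrative only; the command otherwise mirrors the template shown):

```bash
# Hypothetical example: create a discovery scan over gs://my-landing-bucket
# from us-central1. The region and bucket name are placeholders.
gcloud alpha dataplex datascans create data-discovery \
    --location=us-central1 \
    --data-source-resource=gs://my-landing-bucket
```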
REST
To create a discovery scan, use the dataScans.create
method.
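A minimal sketch of that call with curl follows. The request shape (the dataScanId query parameter, the data.resource field, and an empty dataDiscoverySpec) is an assumption based on the DataScan resource; check the dataScans.create reference for the exact schema. PROJECT_ID, LOCATION, SCAN_ID, and BUCKET_NAME are placeholders.

```bash
# Sketch of a dataScans.create request; verify field names against the API reference.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans?dataScanId=SCAN_ID" \
  -d '{
        "data": {
          "resource": "//storage.googleapis.com/projects/PROJECT_ID/buckets/BUCKET_NAME"
        },
        "dataDiscoverySpec": {}
      }'
```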
Query published BigLake tables
After you run the discovery scan, BigLake tables are published in a new dataset in BigQuery. The tables are then available for analysis in BigQuery using SQL, or in Dataproc using Apache Spark or HiveQL.
SQL
You can view or query tables in BigQuery. For more information about how to run queries in BigQuery, see Run a query.
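For example, a quick check from the command line with the bq tool; the project, dataset, and table names are placeholders for the ones that the scan published:

```bash
# Preview rows from a published table (project, dataset, and table are placeholders).
bq query --use_legacy_sql=false \
  'SELECT * FROM `PROJECT_ID.DATASET_ID.TABLE_ID` LIMIT 10'
```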
Apache Spark
To query BigLake tables using Spark SQL on a Dataproc serverless job, follow these steps:
Create a PySpark script similar to the following sample script:
```python
from pyspark.sql import SparkSession

# Configure a Spark session that uses BigQuery as the Hive metastore.
session = (
    SparkSession.builder.appName("testing")
    .config("viewsEnabled", "true")
    .config("materializationDataset", "DATASET_ID")
    .config("spark.hive.metastore.bigquery.project.id", "PROJECT_ID")
    .config("spark.hive.metastore.client.factory.class",
            "com.google.cloud.bigquery.metastore.client.BigQueryMetastoreClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

# Browse the published datasets and tables.
session.sql("show databases").show()
session.sql("use TABLE_NAME").show()
session.sql("show tables").show()

# Query a published BigLake table through the BigQuery connector.
sql = "SELECT * FROM DATASET_ID.TABLE_ID LIMIT 10"
df = session.read.format("bigquery").option("dataset", "DATASET_ID").load(sql)
df.show()
```
Replace the following:
- DATASET_ID: the ID of a dataset for which users have create permission
- PROJECT_ID: the ID of the project with the BigLake table
- TABLE_NAME: the name of the BigLake table
- TABLE_ID: the ID of the BigLake table
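To run the script as a Dataproc Serverless batch, you can submit it with gcloud. The following is a minimal sketch, assuming the script is saved locally as discovery_query.py and that you have a staging bucket available (both names are hypothetical); your environment might also require additional batch options, such as connector jars.

```bash
# Submit the PySpark script as a Dataproc Serverless batch (names are placeholders).
gcloud dataproc batches submit pyspark discovery_query.py \
    --project=PROJECT_ID \
    --region=REGION \
    --deps-bucket=gs://MY_STAGING_BUCKET
```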
Manage published BigLake tables
Published BigLake tables are created and managed in BigQuery by the discovery scan. By default, the discovery scan handles new data discovery, schema inferences, and schema evolution every time the scheduled or on-demand scans run. To indicate that metadata is managed by the scan, the scan publishes tables with the label metadata-managed-mode set to discovery-managed.
If you want to manage the schema and other metadata, like CSV or JSON options, yourself, set the metadata-managed-mode label to user_managed. This way, the schema remains unchanged when the next discovery scan runs. This approach can be useful in scenarios where the schema that's inferred by the discovery scan is incorrect or different from what is expected for a given table. When the metadata-managed-mode label is set to user_managed, it can reduce the cost.
To update the label, edit the value of the label key metadata-managed-mode to user_managed instead of discovery-managed. In this case, the discovery scan doesn't update the schema of the table as long as the user_managed label is attached to the table.
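For example, you can set the label from the command line with the bq tool instead of editing it in the console; the project, dataset, and table names below are placeholders:

```bash
# Mark a published table as user managed so the discovery scan stops updating its schema.
bq update --set_label metadata-managed-mode:user_managed \
    PROJECT_ID:DATASET_ID.TABLE_ID
```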
Update published BigLake tables
For BigLake tables published using the discovery scan jobs with the default configuration, the schema and other metadata are automatically updated with every discovery scan job run at the scheduled frequency.
To update a published BigLake table, follow these steps:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand your project and dataset, then select the table.
In the Details tab, in the Labels section, make sure that the metadata-managed-mode label is set to user_managed. If it is set to a different value, follow these steps:
Click Edit details.

Next to the metadata-managed-mode key, in the value field, enter user_managed.
Delete published BigLake tables
To delete a published BigLake table, follow these steps:
Delete the data files for the table in the Cloud Storage bucket (see the example after these steps).
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand your project and dataset, then select the table.
In the Details pane, in the Labels section, make sure the metadata-managed-mode label isn't set to user_managed. If it's set to user_managed, follow these steps:

Click Edit details.

Next to the metadata-managed-mode key, in the value field, enter discovery-managed.
Click Run. The discovery scan runs on demand.
After the discovery scan runs, the BigLake table is deleted in BigQuery and isn't available to list or query through Spark.
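For the first step, you can delete the table's underlying objects from the command line. The following is a minimal sketch, assuming the table's files live under a single hypothetical prefix in the bucket:

```bash
# Remove the objects that back the table (bucket and prefix are placeholders).
gcloud storage rm --recursive gs://BUCKET_NAME/TABLE_PREFIX/
```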
Run a discovery scan on demand
To run a discovery scan on demand, select one of the following options.
Console
In the Google Cloud console, go to the BigQuery page.
In the navigation menu, click Governance > Catalog management.
In the Cloud Storage discovery pane, click the discovery scan you want to run.
Click Run now.
gcloud
To run a discovery scan, use the gcloud dataplex datascans run command:

```bash
gcloud dataplex datascans run DATASCAN \
    --location=LOCATION
```

Replace the following variables:
- DATASCAN: the name of the discovery scan.
- LOCATION: the Google Cloud region in which the discovery scan was created.
REST
To run a discovery scan on demand, use the
dataScans.run
method
in the Dataplex API.
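A minimal sketch of that call with curl; the project, location, and scan name are placeholders:

```bash
# Trigger an on-demand run of a discovery scan.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN:run" \
  -d '{}'
```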
List discovery scans
To list your discovery scan, select one of the following options.
Console
In the Google Cloud console, go to the BigQuery page.
In the navigation menu, click Governance > Catalog management.
In the Cloud Storage discovery pane, the discovery scans created in the project are listed.
gcloud
gcloud dataplex datascans list --location=LOCATION --project=PROJECT_ID
Replace the following:
- LOCATION: the location of your project
- PROJECT_ID: your Google Cloud project ID
REST
To retrieve the list of discovery scans in your project, use the
dataScans.list
method
in the Dataplex API.
View a discovery scan
To view a discovery scan, select one of the following options.
Console
In the Google Cloud console, go to the BigQuery page.
In the navigation menu, click Governance > Catalog management.
In the Cloud Storage discovery pane, click the discovery scan you want to view details for.
- The Scan details section shows details about the discovery scan.
- The Scan status section shows the discovery results of the latest scan job.
gcloud
```bash
gcloud dataplex datascans jobs describe JOB \
    --location=LOCATION \
    --datascan=DATASCAN \
    --view=FULL
```

Replace the following:
- JOB: the job ID of the discovery scan job.
- LOCATION: the Google Cloud region in which the discovery scan was created.
- DATASCAN: the name of the discovery scan the job belongs to.
- --view=FULL: see the discovery scan job result.
REST
To view the results of a data discovery scan, use the
dataScans.get
method
in the Dataplex API.
View historical discovery scan results
To view historical discovery scan results, select one of the following options.
Console
In the Google Cloud console, go to the BigQuery page.
In the navigation menu, click Governance > Catalog management.
In the Cloud Storage discovery pane, click the discovery scan you want to view details for.
Click the Scan history pane. The Scan history pane provides information about past jobs, including the number of records scanned in each job, each job's status, and the time the jobs were run.
To view detailed information about a job, click the job in the Job ID column.
gcloud
```bash
gcloud dataplex datascans jobs list \
    --location=LOCATION \
    --datascan=DATASCAN
```

Replace the following:
- LOCATION: the Google Cloud region in which the discovery scan was created.
- DATASCAN: the name of the discovery scan the job belongs to.
REST
To view all the jobs of a discovery scan, use the dataScans.jobs.list method in the Dataplex API.
Update a discovery scan
To change the schedule of a discovery scan, for example to change the schedule from on-demand to recurring, you need to update the discovery scan.
Console
In the Google Cloud console, go to the BigQuery page.
In the navigation menu, click Governance > Catalog management.
In the Cloud Storage discovery pane, for the discovery scan you want to update, click Actions > Edit.
Edit the values.
Click Save.
gcloud
gcloud alpha dataplex datascans update data-discovery SCAN_ID --location=LOCATION
Replace the following:
- SCAN_ID: the ID of the discovery scan you want to update
- LOCATION: the location of the discovery scan you want to update
REST
To update a discovery scan, use the
dataScans.patch
method
in the Dataplex API.
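A minimal sketch with curl that switches a scan to a recurring schedule by patching its trigger. The executionSpec.trigger.schedule.cron field path is an assumption; verify it against the DataScan resource reference before use. PROJECT_ID, LOCATION, and SCAN_ID are placeholders.

```bash
# Patch the scan's trigger to a daily 02:00 schedule (field names assumed; verify first).
curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/SCAN_ID?updateMask=executionSpec.trigger" \
  -d '{
        "executionSpec": {
          "trigger": {
            "schedule": { "cron": "0 2 * * *" }
          }
        }
      }'
```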
Delete a discovery scan
To delete a discovery scan, select one of the following options.
Console
In the Google Cloud console, go to the BigQuery page.
In the navigation menu, click Governance > Catalog management.
In the Cloud Storage discovery pane, for the discovery scan you want to delete, click Actions > Delete.
Click Delete.
gcloud
gcloud dataplex datascans delete SCAN_ID --location=LOCATION --async
Replace the following:
- SCAN_ID: the ID of the discovery scan you want to delete.
- LOCATION: the location of the discovery scan you want to delete.
REST
To delete a discovery scan, use the
dataScans.delete
method
in the Dataplex API.