This page describes how to create a Dataplex Universal Catalog data quality scan.
To learn about data quality scans, see About auto data quality.
Before you begin
- Enable the Dataplex API.
- Optional: If you want Dataplex Universal Catalog to generate recommendations for data quality rules based on the results of a data profile scan, create and run the data profile scan.
Required roles
To run a data quality scan on a BigQuery table, you need permission to read the BigQuery table and permission to create a BigQuery job in the project used to scan the table.
If the BigQuery table and the data quality scan are in different projects, then you need to give the Dataplex Universal Catalog service account of the project containing the data quality scan read permission for the corresponding BigQuery table.
If the data quality rules refer to additional tables, then the scan project's service account must have read permissions on the same tables.
To get the permissions that you need to export the scan results to a BigQuery table, ask your administrator to grant the Dataplex Universal Catalog service account the BigQuery Data Editor (roles/bigquery.dataEditor) IAM role on the results dataset and table. This grants the following permissions:
- bigquery.datasets.get
- bigquery.tables.create
- bigquery.tables.get
- bigquery.tables.getData
- bigquery.tables.update
- bigquery.tables.updateData
If the BigQuery data is organized in a Dataplex Universal Catalog lake, grant the Dataplex Universal Catalog service account the Dataplex Metadata Reader (roles/dataplex.metadataReader) and Dataplex Viewer (roles/dataplex.viewer) IAM roles. Alternatively, you need all of the following permissions:
- dataplex.lakes.list
- dataplex.lakes.get
- dataplex.zones.list
- dataplex.zones.get
- dataplex.entities.list
- dataplex.entities.get
- dataplex.operations.get
If you're scanning a BigQuery external table from Cloud Storage, grant the Dataplex Universal Catalog service account the Storage Object Viewer (roles/storage.objectViewer) role for the bucket. Alternatively, assign the Dataplex Universal Catalog service account the following permissions:
- storage.buckets.get
- storage.objects.get
If you want to publish the data quality scan results as Dataplex Universal Catalog metadata, you must be granted the BigQuery Data Editor (roles/bigquery.dataEditor) IAM role for the table, and the dataplex.entryGroups.useDataQualityScorecardAspect permission on the @bigquery entry group in the same location as the table. Alternatively, you must be granted the Dataplex Catalog Editor (roles/dataplex.catalogEditor) role for the @bigquery entry group in the same location as the table.
Alternatively, you need all of the following permissions:
- bigquery.tables.get on the table
- bigquery.tables.update on the table
- bigquery.tables.updateData on the table
- bigquery.tables.delete on the table
- dataplex.entryGroups.useDataQualityScorecardAspect on the @bigquery entry group
Or, you need all of the following permissions:
- dataplex.entries.update on the @bigquery entry group
- dataplex.entryGroups.useDataQualityScorecardAspect on the @bigquery entry group
If you need to access columns protected by BigQuery column-level access policies, then assign the Dataplex Universal Catalog service account permissions for those columns. The user creating or updating a data scan also needs permissions for the columns.
If a table has BigQuery row-level access policies enabled, then you can only scan rows visible to the Dataplex Universal Catalog service account. Note that the individual user's access privileges are not evaluated for row-level policies.
Required data scan roles
To use auto data quality, ask your administrator to grant you one of the following IAM roles:
- Full access to DataScan resources: Dataplex DataScan Administrator (roles/dataplex.dataScanAdmin)
- To create DataScan resources: Dataplex DataScan Creator (roles/dataplex.dataScanCreator) on the project
- Write access to DataScan resources: Dataplex DataScan Editor (roles/dataplex.dataScanEditor)
- Read access to DataScan resources excluding rules and results: Dataplex DataScan Viewer (roles/dataplex.dataScanViewer)
- Read access to DataScan resources, including rules and results: Dataplex DataScan DataViewer (roles/dataplex.dataScanDataViewer)
The following table lists the DataScan permissions:
Permission name | Grants permission to do the following: |
---|---|
dataplex.datascans.create | Create a DataScan |
dataplex.datascans.delete | Delete a DataScan |
dataplex.datascans.get | View operational metadata such as ID or schedule, but not results and rules |
dataplex.datascans.getData | View DataScan details including rules and results |
dataplex.datascans.list | List DataScans |
dataplex.datascans.run | Run a DataScan |
dataplex.datascans.update | Update the description of a DataScan |
dataplex.datascans.getIamPolicy | View the current IAM permissions on the scan |
dataplex.datascans.setIamPolicy | Set IAM permissions on the scan |
Define data quality rules
You can define data quality rules by using built-in rules or custom SQL checks. If you're using the Google Cloud CLI, you can define these rules in a JSON or YAML file.
The examples in the following sections show how to define a variety of data quality rules. The rules validate a sample table that contains data about customer transactions. Assume the table has the following schema:
Column name | Column type | Column description |
---|---|---|
transaction_timestamp | Timestamp | Timestamp of the transaction. The table is partitioned on this field. |
customer_id | String | A customer ID in the format of 8 digits followed by 16 letters. |
transaction_id | String | The transaction ID needs to be unique across the table. |
currency_id | String | One of the supported currencies. The currency type must match one of the available currencies in the dimension table dim_currency. |
amount | Float | Transaction amount. |
discount_pct | Float | Discount percentage. This value must be between 0 and 100. |
Define data quality rules using built-in rule types
The following example rules are based on built-in rule types. You can create rules based on built-in rule types using the Google Cloud console or the API. Dataplex Universal Catalog might recommend some of these rules.
Column name | Rule type | Suggested dimension | Rule parameters |
---|---|---|---|
transaction_id | Uniqueness check | Uniqueness | Threshold: Not Applicable |
amount | Null check | Completeness | Threshold: 100% |
customer_id | Regex (regular expression) check | Validity | Regular expression: ^[0-9]{8}[a-zA-Z]{16}$ Threshold: 100% |
currency_id | Value set check | Validity | Set of: USD,JPY,INR,GBP,CAN Threshold: 100% |
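As a rough illustration of how the regex and value set checks in the table behave, the following Python sketch applies the same regular expression and currency set to a few made-up values and compares the passing fraction with the 100% threshold. It is a local simulation only, not how Dataplex Universal Catalog runs the checks. The sample values, the `pass_ratio` helper, and the treatment of empty input are assumptions.

```python
import re

# Rule parameters copied from the table above.
CUSTOMER_ID_REGEX = re.compile(r"^[0-9]{8}[a-zA-Z]{16}$")
SUPPORTED_CURRENCIES = {"USD", "JPY", "INR", "GBP", "CAN"}
THRESHOLD = 1.0  # 100% of evaluated rows must pass


def pass_ratio(values, predicate):
    """Return the fraction of values for which predicate is true."""
    values = list(values)
    if not values:
        return 1.0  # assumption: an empty input counts as passing
    return sum(1 for v in values if predicate(v)) / len(values)


# Hypothetical sample data, not taken from any real table.
customer_ids = ["12345678abcdefghijklmnop", "1234abcd", "87654321ABCDEFGHIJKLMNOP"]
currency_ids = ["USD", "CAN", "GBP"]

regex_ratio = pass_ratio(customer_ids, lambda v: CUSTOMER_ID_REGEX.fullmatch(v) is not None)
set_ratio = pass_ratio(currency_ids, lambda v: v in SUPPORTED_CURRENCIES)

# A rule passes only if its ratio reaches the threshold.
regex_rule_passed = regex_ratio >= THRESHOLD
set_rule_passed = set_ratio >= THRESHOLD
```

With this sample data, one malformed customer ID is enough to fail the regex rule, because the threshold requires every evaluated row to match.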
Define data quality rules using custom SQL rules
To build custom SQL rules, use the following framework:
- When you create a rule that evaluates one row at a time, create an expression that generates the number of successful rows when Dataplex Universal Catalog evaluates the query SELECT COUNTIF(CUSTOM_SQL_EXPRESSION) FROM TABLE. Dataplex Universal Catalog checks the number of successful rows against the threshold.
- When you create a rule that evaluates across the rows or uses a table condition, create an expression that returns success or failure when Dataplex Universal Catalog evaluates the query SELECT IF(CUSTOM_SQL_EXPRESSION) FROM TABLE.
- When you create a rule that evaluates the invalid state of a dataset, provide a statement that returns invalid rows. If any rows are returned, the rule fails. Omit the trailing semicolon from the SQL statement.
You can refer to a data source table and all of its precondition filters by using the data reference parameter ${data()} in a rule, instead of explicitly mentioning the source table and its filters. Examples of precondition filters include row filters, sampling percents, and incremental filters. The ${data()} parameter is case-sensitive.
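The framework above can be tried out locally. The following Python sketch uses the built-in sqlite3 module as a stand-in for BigQuery, so COUNTIF is emulated with SUM(CASE ...), and the table name and rows are made up. It only illustrates how a row condition is counted against a threshold and how a SQL assertion fails when it returns rows; it is not how the service executes rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (transaction_id TEXT, discount_pct REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("t1", 10.0), ("t2", 55.5), ("t3", 120.0), ("t4", 30.0)],
)

# Row condition: count the rows for which the expression is true.
# SQLite has no COUNTIF, so SUM(CASE ...) plays that role here.
row_expression = "0 < discount_pct AND discount_pct < 100"
passed, total = conn.execute(
    f"SELECT SUM(CASE WHEN {row_expression} THEN 1 ELSE 0 END), COUNT(*) FROM transactions"
).fetchone()
threshold = 1.0  # 100% of rows must satisfy the expression
row_rule_passed = (passed / total) >= threshold

# SQL assertion: the statement returns invalid rows; any returned row means failure.
invalid_rows = conn.execute(
    "SELECT * FROM transactions WHERE discount_pct > 100"
).fetchall()
assertion_passed = len(invalid_rows) == 0
```

Here three of four rows satisfy the row condition, so a 100% threshold fails, and the assertion fails because one row is returned.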
The following example rules are based on custom SQL rules.
Rule type | Rule description | SQL expression |
---|---|---|
Row condition | Checks if the value of discount_pct is between 0 and 100. | 0 < discount_pct AND discount_pct < 100 |
Row condition | Reference check to validate that currency_id is one of the supported currencies. | currency_id in (select id from my_project_id.dim_dataset.dim_currency) |
Table condition | Aggregate SQL expression that checks if the average discount_pct is between 30% and 50%. | 30 < avg(discount_pct) AND avg(discount_pct) < 50 |
Row condition | Checks if a date is not in the future. | TIMESTAMP(transaction_timestamp) < CURRENT_TIMESTAMP() |
Table condition | A BigQuery user-defined function (UDF) to check that the average transaction amount is less than a predefined value per country. Create the (JavaScript) UDF by running the following command: CREATE OR REPLACE FUNCTION myProject.myDataset.average_by_country ( country STRING, average FLOAT64) RETURNS BOOL LANGUAGE js AS R""" if (country === "CAN" && average < 5000) { return true; } else if (country === "IND" && average < 1000) { return true; } else { return false; } """; | Example rule to check the average transaction amount for country=CAN. myProject.myDataset.average_by_country( "CAN", (SELECT avg(amount) FROM myProject.myDataset.transactions_table WHERE currency_id = 'CAN' )) |
Table condition | A BigQuery ML predict clause to identify anomalies in discount_pct. It checks if a discount should be applied based on customer, currency, and transaction. The rule checks if the prediction matches the actual value at least 99% of the time. Assumption: The ML model is created before using the rule. Create the ML model using the following command: CREATE MODEL model-project-id.dataset-id.model-name OPTIONS(model_type='logistic_reg') AS SELECT IF(discount_pct IS NULL, 0, 1) AS label, IFNULL(customer_id, "") AS customer, IFNULL(currency_id, "") AS currency, IFNULL(amount, 0.0) AS amount FROM `data-project-id.dataset-id.table-names` WHERE transaction_timestamp < '2022-01-01'; | The following rule checks if prediction accuracy is greater than 99%. SELECT accuracy > 0.99 FROM ML.EVALUATE (MODEL model-project-id.dataset-id.model-name, ( SELECT customer_id, currency_id, amount, discount_pct FROM data-project-id.dataset-id.table-names WHERE transaction_timestamp > '2022-01-01' ) ) |
Row condition | A BigQuery ML predict function to identify anomalies in discount_pct. The function checks if a discount should be applied based on customer, currency, and transaction. The rule identifies all the occurrences where the prediction didn't match. Assumption: The ML model is created before using the rule. Create the ML model using the following command: CREATE MODEL model-project-id.dataset-id.model-name OPTIONS(model_type='logistic_reg') AS SELECT IF(discount_pct IS NULL, 0, 1) AS label, IFNULL(customer_id, "") AS customer, IFNULL(currency_id, "") AS currency, IFNULL(amount, 0.0) AS amount FROM `data-project-id.dataset-id.table-names` WHERE transaction_timestamp < '2022-01-01'; | The following rule checks if the discount prediction matches the actual value for every row. IF(discount_pct > 0, 1, 0) =(SELECT predicted_label FROM ML.PREDICT( MODEL model-project-id.dataset-id.model-name, ( SELECT customer_id, currency_id, amount, discount_pct FROM data-project-id.dataset-id.table-names AS t WHERE t.transaction_timestamp = transaction_timestamp LIMIT 1 ) ) ) |
SQL assertion | Validates if the discount_pct is greater than 30% for today by checking whether any rows exist with a discount percent less than or equal to 30. | SELECT * FROM my_project_id.dim_dataset.dim_currency WHERE discount_pct <= 30 AND transaction_timestamp >= current_date() |
SQL assertion (with data reference parameter) | Returns rows whose discount_pct is greater than 30, using the data reference parameter instead of naming the table. The date filter is assumed to be configured as a row filter on the data source table. The data reference parameter ${data()} stands in for the source table with its filters applied. | SELECT * FROM ${data()} WHERE discount_pct > 30 |
Define data quality rules using the gcloud CLI
The following example YAML file uses some of the same rules as the sample rules using built-in types and the sample custom SQL rules. This YAML file also contains other specifications for the data quality scan, such as filters and sampling percent. When you use the gcloud CLI to create or update a data quality scan, you can use a YAML file like this as input to the --data-quality-spec-file argument.
rules:
- uniquenessExpectation: {}
  column: transaction_id
  dimension: UNIQUENESS
- nonNullExpectation: {}
  column: amount
  dimension: COMPLETENESS
  threshold: 1
- regexExpectation:
    regex: '^[0-9]{8}[a-zA-Z]{16}$'
  column: customer_id
  ignoreNull: true
  dimension: VALIDITY
  threshold: 1
- setExpectation:
    values:
    - 'USD'
    - 'JPY'
    - 'INR'
    - 'GBP'
    - 'CAN'
  column: currency_id
  ignoreNull: true
  dimension: VALIDITY
  threshold: 1
- rangeExpectation:
    minValue: '0'
    maxValue: '100'
  column: discount_pct
  ignoreNull: true
  dimension: VALIDITY
  threshold: 1
- rowConditionExpectation:
    sqlExpression: 0 < `discount_pct` AND `discount_pct` < 100
  column: discount_pct
  dimension: VALIDITY
  threshold: 1
- rowConditionExpectation:
    sqlExpression: currency_id in (select id from `my_project_id.dim_dataset.dim_currency`)
  column: currency_id
  dimension: VALIDITY
  threshold: 1
- tableConditionExpectation:
    sqlExpression: 30 < avg(discount_pct) AND avg(discount_pct) < 50
  dimension: VALIDITY
- rowConditionExpectation:
    sqlExpression: TIMESTAMP(transaction_timestamp) < CURRENT_TIMESTAMP()
  column: transaction_timestamp
  dimension: VALIDITY
  threshold: 1
- sqlAssertion:
    sqlStatement: SELECT * FROM `my_project_id.dim_dataset.dim_currency` WHERE discount_pct > 100
  dimension: VALIDITY
samplingPercent: 50
rowFilter: discount_pct > 100
postScanActions:
  bigqueryExport:
    resultsTable: projects/my_project_id/datasets/dim_dataset/tables/dim_currency
  notificationReport:
    recipients:
      emails:
      - '222larabrown@gmail.com'
      - 'cloudysanfrancisco@gmail.com'
    scoreThresholdTrigger:
      scoreThreshold: 50
    jobFailureTrigger: {}
    jobEndTrigger: {}
catalogPublishingEnabled: true
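Because the spec file can be JSON as well as YAML, one option is to generate it from a script. The following Python sketch builds a small subset of the rules above as a dictionary and serializes it to JSON. The field names mirror the YAML example; the sketch doesn't validate the spec against the service, and which rules to include is only an illustration.

```python
import json

# A subset of the rules from the YAML example, with the same field names.
spec = {
    "rules": [
        {"uniquenessExpectation": {}, "column": "transaction_id", "dimension": "UNIQUENESS"},
        {"nonNullExpectation": {}, "column": "amount", "dimension": "COMPLETENESS", "threshold": 1},
        {
            "rangeExpectation": {"minValue": "0", "maxValue": "100"},
            "column": "discount_pct",
            "ignoreNull": True,
            "dimension": "VALIDITY",
            "threshold": 1,
        },
    ],
    "samplingPercent": 50,
}

# Serialize to a JSON string that can be saved as the spec file.
spec_json = json.dumps(spec, indent=2)
```

Saving `spec_json` to a file (for example, a hypothetical `dq_spec.json`) gives a file that can be passed to the --data-quality-spec-file argument.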
Create a data quality scan
Console
In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.
Click Create data quality scan.
In the Define scan window, fill in the following fields:
Optional: Enter a Display name.
Enter an ID. See the resource naming conventions.
Optional: Enter a Description.
In the Table field, click Browse. Choose the table to scan, and then click Select. Only standard BigQuery tables are supported.
For tables in multi-region datasets, choose the region in which to create the data scan.
To browse the tables organized within Dataplex Universal Catalog lakes, click Browse within Dataplex Lakes.
In the Scope field, choose Incremental or Entire data.
- If you choose Incremental: In the Timestamp column field, select a column of type DATE or TIMESTAMP from your BigQuery table that increases as new records are added, and that can be used to identify new records. It can be a column that partitions the table.
To filter your data, select the Filter rows checkbox. Provide a row filter consisting of a valid SQL expression that can be used as a part of a WHERE clause in GoogleSQL syntax. For example, col1 >= 0. The filter can be a combination of multiple column conditions. For example, col1 >= 0 AND col2 < 10.
To sample your data, in the Sampling size list, select a sampling percentage. Choose a percentage value that ranges between 0.0% and 100.0% with up to 3 decimal digits. For larger datasets, choose a lower sampling percentage. For example, for a 1 PB table, if you enter a value between 0.1% and 1.0%, the data quality scan samples between 1 TB and 10 TB of data. For incremental data scans, the data quality scan applies sampling to the latest increment.
To publish the data quality scan results as Dataplex Universal Catalog metadata, select the Publish results to BigQuery and Dataplex Catalog checkbox.
You can view the latest scan results on the Data quality tab in the BigQuery and Dataplex Universal Catalog pages for the source table. To enable users to access the published scan results, see the Grant access to data quality scan results section of this document.
In the Schedule section, choose one of the following options:
Repeat: Run the data quality scan on a schedule: hourly, daily, weekly, monthly, or custom. Specify how often the scan runs and at what time. If you choose custom, use cron format to specify the schedule.
On-demand: Run the data quality scan on demand.
Click Continue.
In the Data quality rules window, define the rules to configure for this data quality scan.
Click Add rules, and then choose from the following options.
Profile based recommendations: Build rules from the recommendations based on an existing data profiling scan.
Choose columns: Select the columns to get recommended rules for.
Choose scan project: If the data profiling scan is in a different project than the project where you are creating the data quality scan, then select the project to pull profile scans from.
Choose profile results: Select one or more profile results and then click OK. This populates a list of suggested rules that you can use as a starting point.
Select the checkbox for the rules that you want to add, and then click Select. Once selected, the rules are added to your current rule list. Then, you can edit the rules.
Built-in rule types: Build rules from predefined rules. See the list of predefined rules.
Choose columns: Select the columns to select rules for.
Choose rule types: Select the rule types that you want to choose from, and then click OK. The rule types that appear depend on the columns that you selected.
Select the checkbox for the rules that you want to add, and then click Select. Once selected, the rules are added to your current rules list. Then, you can edit the rules.
SQL row check rule: Create a custom SQL rule to apply to each row.
In Dimension, choose one dimension.
In Passing threshold, choose a percentage of records that must pass the check.
In Column name, choose a column.
In the Provide a SQL expression field, enter a SQL expression that evaluates to a boolean true (pass) or false (fail). For more information, see Supported custom SQL rule types and the examples in Define data quality rules.
Click Add.
SQL aggregate check rule: Create a custom SQL table condition rule.
In Dimension, choose one dimension.
In Column name, choose a column.
In the Provide a SQL expression field, enter a SQL expression that evaluates to a boolean true (pass) or false (fail). For more information, see Supported custom SQL rule types and the examples in Define data quality rules.
Click Add.
SQL assertion rule: Create a custom SQL assertion rule to check for an invalid state of the data.
In Dimension, choose one dimension.
Optional: In Column name, choose a column.
In the Provide a SQL statement field, enter a SQL statement that returns rows that match the invalid state. If any rows are returned, this rule fails. Omit the trailing semicolon from the SQL statement. For more information, see Supported custom SQL rule types and the examples in Define data quality rules.
Click Add.
Optional: For any data quality rule, you can assign a custom rule name to use for monitoring and alerting, and a description. To do this, edit a rule and specify the following details:
- Rule name: Enter a custom rule name with up to 63 characters. The rule name can include letters (a-z, A-Z), digits (0-9), and hyphens (-) and must start with a letter and end with a number or a letter.
- Description: Enter a rule description with a maximum length of 1,024 characters.
Repeat the previous steps to add additional rules to the data quality scan. When finished, click Continue.
Optional: Export the scan results to a BigQuery standard table. In the Export scan results to BigQuery table section, do the following:
In the Select BigQuery dataset field, click Browse. Select a BigQuery dataset to store the data quality scan results.
In the BigQuery table field, specify the table to store the data quality scan results. If you're using an existing table, make sure that it is compatible with the export table schema. If the specified table doesn't exist, Dataplex Universal Catalog creates it for you.
Optional: Add labels. Labels are key-value pairs that let you group related objects together or with other Google Cloud resources.
Optional: Set up email notification reports to alert people about the status and results of a data quality scan job. In the Notification report section, click Add email ID and enter up to five email addresses. Then, select the scenarios that you want to send reports for:
- Quality score (<=): sends a report when a job succeeds with a data quality score that is lower than the specified target score. Enter a target quality score between 0 and 100.
- Job failures: sends a report when the job itself fails, regardless of the data quality results.
- Job completion (success or failure): sends a report when the job ends, regardless of the data quality results.
Click Create.
After the scan is created, you can run it at any time by clicking Run now.
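The rule name constraints in the console steps (up to 63 characters; letters, digits, and hyphens; starting with a letter and ending with a letter or digit) can be checked before you submit a scan. The following Python sketch encodes those constraints as a regular expression. The helper name `is_valid_rule_name` is hypothetical, and the service remains the authority on what it accepts.

```python
import re

# One leading letter, then up to 61 letters, digits, or hyphens, then a final
# letter or digit: 63 characters at most. A single letter is also allowed.
RULE_NAME_PATTERN = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_rule_name(name: str) -> bool:
    """Check a rule name against the documented constraints."""
    return RULE_NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_rule_name("discount-range-check")` is true, while a name that starts with a digit or ends with a hyphen is rejected.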
gcloud
To create a data quality scan, use the gcloud dataplex datascans create data-quality command.
If the source data is organized in a Dataplex Universal Catalog lake, include the --data-source-entity flag:
gcloud dataplex datascans create data-quality DATASCAN \
    --location=LOCATION \
    --data-quality-spec-file=DATA_QUALITY_SPEC_FILE \
    --data-source-entity=DATA_SOURCE_ENTITY
If the source data isn't organized in a Dataplex Universal Catalog lake, include the --data-source-resource flag:
gcloud dataplex datascans create data-quality DATASCAN \
    --location=LOCATION \
    --data-quality-spec-file=DATA_QUALITY_SPEC_FILE \
    --data-source-resource=DATA_SOURCE_RESOURCE
Replace the following variables:
- DATASCAN: The name of the data quality scan.
- LOCATION: The Google Cloud region in which to create the data quality scan.
- DATA_QUALITY_SPEC_FILE: The path to the JSON or YAML file containing the specifications for the data quality scan. The file can be a local file or a Cloud Storage path with the prefix gs://. Use this file to specify the data quality rules for the scan. You can also specify additional details in this file, such as filters, sampling percent, and post-scan actions like exporting to BigQuery or sending email notification reports. See the documentation for JSON representation and the example YAML representation.
- DATA_SOURCE_ENTITY: The Dataplex Universal Catalog entity that contains the data for the data quality scan. For example, projects/test-project/locations/test-location/lakes/test-lake/zones/test-zone/entities/test-entity.
- DATA_SOURCE_RESOURCE: The name of the resource that contains the data for the data quality scan. For example, //bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table.
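When scans are created from a script, it can help to assemble the command from its parts. The following Python sketch builds the argument list for the --data-source-resource variant using only the flags shown above, and formats the BigQuery resource name the way the example does. The helper names and the scan name, region, and spec file path are hypothetical; the sketch only builds the command and doesn't run it.

```python
import shlex


def bigquery_resource(project: str, dataset: str, table: str) -> str:
    """Build the resource name format shown for DATA_SOURCE_RESOURCE."""
    return f"//bigquery.googleapis.com/projects/{project}/datasets/{dataset}/tables/{table}"


def create_scan_command(datascan: str, location: str, spec_file: str, resource: str) -> list:
    """Assemble the argument list for a scan on a table outside a lake."""
    return [
        "gcloud", "dataplex", "datascans", "create", "data-quality", datascan,
        f"--location={location}",
        f"--data-quality-spec-file={spec_file}",
        f"--data-source-resource={resource}",
    ]


args = create_scan_command(
    "test-datascan",   # hypothetical scan name
    "us-central1",     # hypothetical region
    "dq_spec.yaml",    # hypothetical spec file path
    bigquery_resource("test-project", "test-dataset", "test-table"),
)
# Printable form of the command for review or logging.
command_line = shlex.join(args)
```

Keeping the arguments as a list avoids shell quoting issues if the list is later handed to a process runner such as subprocess.run.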
REST
To create a data quality scan, use the dataScans.create method.
If you want to build rules for the data quality scan by using rule recommendations that are based on the results of a data profiling scan, get the recommendations by calling the dataScans.jobs.generateDataQualityRules method on the data profiling scan.
Export table schema
To export the data quality scan results to an existing BigQuery table, make sure that it is compatible with the following table schema:
Column name | Column data type | Sub field name (if applicable) | Sub field data type | Mode | Example |
---|---|---|---|---|---|
data_quality_scan | struct/record | resource_name | string | nullable | //dataplex.googleapis.com/projects/test-project/locations/europe-west2/datascans/test-datascan |
data_quality_scan | struct/record | project_id | string | nullable | dataplex-back-end-dev-project |
data_quality_scan | struct/record | location | string | nullable | us-central1 |
data_quality_scan | struct/record | data_scan_id | string | nullable | test-datascan |
data_source | struct/record | resource_name | string | nullable | Entity case: //dataplex.googleapis.com/projects/dataplex-back-end-dev-project/locations/europe-west2/lakes/a0-datascan-test-lake/zones/a0-datascan-test-zone/entities/table1 Table case: //bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table |
data_source | struct/record | dataplex_entity_project_id | string | nullable | dataplex-back-end-dev-project |
data_source | struct/record | dataplex_entity_project_number | integer | nullable | 123456789 |
data_source | struct/record | dataplex_lake_id | string | nullable | (Valid only if source is entity) test-lake |
data_source | struct/record | dataplex_zone_id | string | nullable | (Valid only if source is entity) test-zone |
data_source | struct/record | dataplex_entity_id | string | nullable | (Valid only if source is entity) test-entity |
data_source | struct/record | table_project_id | string | nullable | test-project |
data_source | struct/record | table_project_number | integer | nullable | 987654321 |
data_source | struct/record | dataset_id | string | nullable | (Valid only if source is table) test-dataset |
data_source | struct/record | table_id | string | nullable | (Valid only if source is table) test-table |
data_quality_job_id | string | | | nullable | caeba234-cfde-4fca-9e5b-fe02a9812e38 |
data_quality_job_configuration | json | trigger | string | nullable | ondemand/schedule |
data_quality_job_configuration | json | incremental | boolean | nullable | true/false |
data_quality_job_configuration | json | sampling_percent | float | nullable | (0-100) 20.0 (indicates 20%) |
data_quality_job_configuration | json | row_filter | string | nullable | col1 >= 0 AND col2 < 10 |
job_labels | json | | | nullable | {"key1":value1} |
job_start_time | timestamp | | | nullable | 2023-01-01 00:00:00 UTC |
job_end_time | timestamp | | | nullable | 2023-01-01 00:00:00 UTC |
job_rows_scanned | integer | | | nullable | 7500 |
rule_name | string | | | nullable | test-rule |
rule_type | string | | | nullable | Range Check |
rule_evaluation_type | string | | | nullable | Per row |
rule_column | string | | | nullable | Rule only attached to a certain column |
rule_dimension | string | | | nullable | UNIQUENESS |
job_quality_result | struct/record | passed | boolean | nullable | true/false |
job_quality_result | struct/record | score | float | nullable | 90.8 |
job_dimension_result | json | | | nullable | {"ACCURACY":{"passed":true,"score":100},"CONSISTENCY":{"passed":false,"score":60}} |
rule_threshold_percent | float | | | nullable | (0.0-100.0) Rule-threshold-pct in API * 100 |
rule_parameters | json | | | nullable | {min: 24, max:5345} |
rule_pass | boolean | | | nullable | True |
rule_rows_evaluated | integer | | | nullable | 7400 |
rule_rows_passed | integer | | | nullable | 3 |
rule_rows_null | integer | | | nullable | 4 |
rule_failed_records_query | string | | | nullable | "SELECT * FROM `test-project.test-dataset.test-table` WHERE (NOT((`cTime` >= '15:31:38.776361' and `cTime` <= '19:23:53.754823') IS TRUE));" |
rule_assertion_row_count | integer | | | nullable | 10 |
When you configure BigQueryExport for a data quality scan job, follow these guidelines:
- For the field resultsTable, use the format: //bigquery.googleapis.com/projects/{project-id}/datasets/{dataset-id}/tables/{table-id}.
- Use a BigQuery standard table.
- If the table doesn't exist when the scan is created or updated, Dataplex Universal Catalog creates the table for you.
- By default, the table is partitioned daily on the job_start_time column.
- If you want the table to be partitioned in other configurations or if you don't want the partition, then recreate the table with the required schema and configurations and then provide the pre-created table as the results table.
- Make sure the results table is in the same location as the source table.
- If VPC-SC is configured on the project, then the results table must be in the same VPC-SC perimeter as the source table.
- If the table is modified during the scan execution stage, then the current running job exports to the previous results table and the table change takes effect from the next scan job.
- Don't modify the table schema. If you need customized columns, create a view upon the table.
- To reduce costs, set an expiration on the partition based on your use case. For more information, see how to set the partition expiration.
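The export schema carries rule-level columns such as rule_name, rule_pass, rule_rows_evaluated, and rule_rows_passed, so failing rules can be summarized once the rows are fetched. The following Python sketch works on rows already loaded as dictionaries keyed by those column names. The sample rows and the `failing_rules` helper are hypothetical, and how you fetch the rows (for example, with a BigQuery client) is out of scope here.

```python
def failing_rules(rows):
    """Return (rule_name, pass_percent) for rows where rule_pass is false."""
    summary = []
    for row in rows:
        if row["rule_pass"]:
            continue
        # Columns are nullable in the schema, so treat missing values as zero.
        evaluated = row.get("rule_rows_evaluated") or 0
        passed = row.get("rule_rows_passed") or 0
        # Avoid dividing by zero for rules that evaluated no rows.
        pct = 100.0 * passed / evaluated if evaluated else None
        summary.append((row["rule_name"], pct))
    return summary


# Hypothetical exported rows using column names from the schema.
rows = [
    {"rule_name": "amount-not-null", "rule_pass": True,
     "rule_rows_evaluated": 7400, "rule_rows_passed": 7400},
    {"rule_name": "discount-range", "rule_pass": False,
     "rule_rows_evaluated": 7400, "rule_rows_passed": 3700},
    {"rule_name": "avg-discount", "rule_pass": False,
     "rule_rows_evaluated": None, "rule_rows_passed": None},
]
result = failing_rules(rows)
```

Rules without row counts are reported with a pass percent of None rather than a computed value.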
Run a data quality scan
Console
In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.
Click the data quality scan to run.
Click Run now.
gcloud
To run a data quality scan, use the gcloud dataplex datascans run command:
gcloud dataplex datascans run DATASCAN \
    --location=LOCATION
Replace the following variables:
- LOCATION: The Google Cloud region in which the data quality scan was created.
- DATASCAN: The name of the data quality scan.
REST
To run a data quality scan, use the dataScans.run method.
View the data quality scan results
Console
In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.
Click the name of a data quality scan.
The Overview section displays information about the most recent jobs, including when the scan was run, the number of records scanned in each job, whether all the data quality checks passed, and if there were failures, the number of data quality checks that failed.
The Data quality scan configuration section displays details about the scan.
To see detailed information about a job, such as data quality scores that indicate the percentage of rules that passed, which rules failed, and the job logs, click the Jobs history tab. Then, click a job ID.
gcloud
To view the results of a data quality scan job, use the gcloud dataplex datascans jobs describe command:
gcloud dataplex datascans jobs describe JOB \
    --location=LOCATION \
    --datascan=DATASCAN \
    --view=FULL
Replace the following variables:
- JOB: The job ID of the data quality scan job.
- LOCATION: The Google Cloud region in which the data quality scan was created.
- DATASCAN: The name of the data quality scan the job belongs to.
- --view=FULL: To see the scan job result, specify FULL.
REST
To view the results of a data quality scan, use the dataScans.get method.
View published results
If the data quality scan results are published as Dataplex Universal Catalog metadata, then you can see the latest scan results on the BigQuery and Dataplex Universal Catalog pages in the Google Cloud console, on the source table's Data quality tab.
In the Google Cloud console, go to the Dataplex Universal Catalog Search page.
Search for and then select the table.
Click the Data quality tab.
The latest published results are displayed.
View historical scan results
Dataplex Universal Catalog saves the data quality scan history of the last 300 jobs or for the past year, whichever occurs first.
Console
In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.
Click the name of a data quality scan.
Click the Jobs history tab.
The Jobs history tab provides information about past jobs, such as the number of records scanned in each job, the job status, the time the job was run, and whether each rule passed or failed.
To view detailed information about a job, click any of the jobs in the Job ID column.
gcloud
To view historical data quality scan jobs, use the gcloud dataplex datascans jobs list command:
gcloud dataplex datascans jobs list \
    --location=LOCATION \
    --datascan=DATASCAN
Replace the following variables:
- LOCATION: The Google Cloud region in which the data quality scan was created.
- DATASCAN: The name of the data quality scan to view historical jobs for.
REST
To view historical data quality scan jobs, use the dataScans.jobs.list method.
Grant access to data quality scan results
To enable the users in your organization to view the scan results, do the following:
In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.
Click the data quality scan you want to share the results of.
Click the Permissions tab.
Do the following:
- To grant access to a principal, click Grant access. Grant the Dataplex DataScan DataViewer role to the associated principal.
- To remove access from a principal, select the principal that you want to remove the Dataplex DataScan DataViewer role from. Click Remove access, and then confirm when prompted.
Set alerts in Cloud Logging
To set alerts for data quality failures using the logs in Cloud Logging, follow these steps:
Console
In the Google Cloud console, go to the Cloud Logging Logs Explorer.
In the Query window, enter your query. See sample queries.
Click Run Query.
Click Create alert. This opens a side panel.
Enter your alert policy name and click Next.
Review the query.
Click the Preview Logs button to test your query. This shows logs with matching conditions.
Click Next.
Set the time between notifications and click Next.
Define who should be notified for the alert and click Save to create the alert policy.
Alternatively, you can configure and edit your alerts by navigating in the Google Cloud console to Monitoring > Alerting.
gcloud
Not supported.
REST
For more information about how to set alerts in Cloud Logging, see Create a log-based alerting policy by using the Monitoring API.
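As a minimal sketch of what such a policy might contain, the following Python code assembles a request body for a log-based alerting policy. It mirrors the console steps: a log filter, a minimum time between notifications, and the channels to notify. The helper name, display name, and notification channel are hypothetical placeholders, and the code only builds the JSON body without sending it.

```python
import json


def build_log_alert_policy(display_name, log_filter, notify_every_seconds, channels):
    """Assemble an alerting policy body with a single log-match condition."""
    return {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"{display_name} (log match)",
                "conditionMatchedLog": {"filter": log_filter},
            }
        ],
        "alertStrategy": {
            # Minimum time between notifications, as a duration string.
            "notificationRateLimit": {"period": f"{notify_every_seconds}s"},
        },
        "notificationChannels": list(channels),
    }


LOG_FILTER = (
    'resource.type="dataplex.googleapis.com/DataScan" '
    'AND jsonPayload.ruleName="custom-name" '
    'AND jsonPayload.result="FAILED"'
)

policy = build_log_alert_policy(
    "Data quality rule failed",  # placeholder name
    LOG_FILTER,
    300,
    ["projects/my-project/notificationChannels/1234567890"],  # placeholder channel
)
print(json.dumps(policy, indent=2))
```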
Sample queries for setting job level or dimension level alerts
A sample query to set alerts on overall data quality failures for a data quality scan:
resource.type="dataplex.googleapis.com/DataScan"
AND labels."dataplex.googleapis.com/data_scan_state"="SUCCEEDED"
AND resource.labels.resource_container="projects/112233445566"
AND resource.labels.datascan_id="a0-test-dec6-dq-3"
AND NOT jsonPayload.dataQuality.passed=true
A sample query to set alerts on data quality failures for a dimension (for example, uniqueness) of a given data quality scan:
resource.type="dataplex.googleapis.com/DataScan"
AND labels."dataplex.googleapis.com/data_scan_state"="SUCCEEDED"
AND resource.labels.resource_container="projects/112233445566"
AND resource.labels.datascan_id="a0-test-dec6-dq-3"
AND jsonPayload.dataQuality.dimensionPassed.UNIQUENESS=false
Sample queries to set alerts on data quality failures for a table:
Set alerts on data quality failures for a BigQuery table that isn't organized in a Dataplex Universal Catalog lake:
resource.type="dataplex.googleapis.com/DataScan"
AND jsonPayload.dataSource="//bigquery.googleapis.com/projects/test-project/datasets/testdataset/table/chicago_taxi_trips"
AND labels."dataplex.googleapis.com/data_scan_state"="SUCCEEDED"
AND resource.labels.resource_container="projects/112233445566"
AND NOT jsonPayload.dataQuality.passed=true
Set alerts on data quality failures for a BigQuery table that's organized in a Dataplex Universal Catalog lake:
resource.type="dataplex.googleapis.com/DataScan"
AND jsonPayload.dataSource="projects/test-project/datasets/testdataset/table/chicago_taxi_trips"
AND labels."dataplex.googleapis.com/data_scan_state"="SUCCEEDED"
AND resource.labels.resource_container="projects/112233445566"
AND NOT jsonPayload.dataQuality.passed=true
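The job-level and dimension-level samples in this section share the same leading clauses and differ only in the final one. If you maintain alerts for several scans, a small helper can generate the filter text. The following Python sketch is one possible way to do that. The function name is hypothetical, and the project number and scan ID are the sample values used in this section.

```python
def scan_failure_filter(project_number, datascan_id, dimension=None):
    """Build a Logs Explorer filter for failed results of one data quality scan.

    With no dimension, the filter matches any overall failure. With a
    dimension (for example "uniqueness"), it matches failures of that
    dimension only.
    """
    clauses = [
        'resource.type="dataplex.googleapis.com/DataScan"',
        'labels."dataplex.googleapis.com/data_scan_state"="SUCCEEDED"',
        f'resource.labels.resource_container="projects/{project_number}"',
        f'resource.labels.datascan_id="{datascan_id}"',
    ]
    if dimension is None:
        clauses.append("NOT jsonPayload.dataQuality.passed=true")
    else:
        clauses.append(
            f"jsonPayload.dataQuality.dimensionPassed.{dimension.upper()}=false"
        )
    return " AND ".join(clauses)


overall = scan_failure_filter("112233445566", "a0-test-dec6-dq-3")
uniqueness = scan_failure_filter("112233445566", "a0-test-dec6-dq-3", "uniqueness")
print(overall)
print(uniqueness)
```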
Sample queries to set per rule alerts
A sample query to set alerts on all failing data quality rules with the specified custom rule name for a data quality scan:
resource.type="dataplex.googleapis.com/DataScan" AND jsonPayload.ruleName="custom-name" AND jsonPayload.result="FAILED"
A sample query to set alerts on all failing data quality rules of a specific evaluation type for a data quality scan:
resource.type="dataplex.googleapis.com/DataScan" AND jsonPayload.evalutionType="PER_ROW" AND jsonPayload.result="FAILED"
A sample query to set alerts on all failing data quality rules for a column in the table used for a data quality scan:
resource.type="dataplex.googleapis.com/DataScan" AND jsonPayload.column="CInteger" AND jsonPayload.result="FAILED"
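The per-rule samples above all follow one pattern: the resource type, one or more jsonPayload conditions, and a FAILED result. As a hedged sketch, the following Python helper (a hypothetical name, not part of any Google library) composes such a filter from keyword arguments. Field names are passed through unchanged, so they must match the log payload exactly, including the spelling used in the samples.

```python
def rule_failure_filter(**conditions):
    """Build a per-rule failure filter from jsonPayload field/value pairs."""
    clauses = ['resource.type="dataplex.googleapis.com/DataScan"']
    for field, value in conditions.items():
        text = str(value)
        if '"' in text or "\\" in text:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError(f"unsupported character in value for {field!r}")
        clauses.append(f'jsonPayload.{field}="{text}"')
    clauses.append('jsonPayload.result="FAILED"')
    return " AND ".join(clauses)


by_name = rule_failure_filter(ruleName="custom-name")
by_column = rule_failure_filter(column="CInteger")
print(by_name)
print(by_column)
```

Passing several keyword arguments, such as a column together with a rule name, narrows the alert to rules that match all of the conditions.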
Troubleshoot a data quality failure
For each job with row-level rules that fail, Dataplex Universal Catalog provides a query to get the failed records. Run this query to see the records that did not match your rule.
Console
In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.
Click the name of the data quality scan whose records you want to troubleshoot.
Click the Jobs history tab.
Click the job ID of the job that identified data quality failures.
In the job results window that opens, in the Rules section, find the column Query to get failed records. Click Copy query to clipboard for the failed rule.
Run the query in BigQuery to see the records that caused the job to fail.
gcloud
Not supported.
REST
To get the job that identified data quality failures, use the
dataScans.get
method.
In the response object, the
failingRowsQuery
field shows the query.
Run the query in BigQuery to see the records that caused the job to fail.
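As an illustration of reading the response, the following Python sketch collects the failingRowsQuery value of every failed rule from an already-parsed dataScans.get response. It assumes the scan was fetched with a view that includes results (for example, view=FULL). The sample response, rule name, and SQL text below are made up for demonstration, and the helper name is hypothetical.

```python
def failing_row_queries(datascan):
    """Return (rule label, query) pairs for rules that failed and expose a query."""
    rules = datascan.get("dataQualityResult", {}).get("rules", [])
    found = []
    for result in rules:
        query = result.get("failingRowsQuery")
        if not result.get("passed", False) and query:
            rule = result.get("rule", {})
            # Prefer the custom rule name; fall back to the column name.
            label = rule.get("name") or rule.get("column", "")
            found.append((label, query))
    return found


# Made-up response fragment for illustration only.
SAMPLE_RESPONSE = {
    "dataQualityResult": {
        "passed": False,
        "rules": [
            {
                "rule": {"name": "non-null-fare", "column": "fare"},
                "passed": False,
                "failingRowsQuery": "SELECT * FROM `test-project.testdataset.chicago_taxi_trips` WHERE fare IS NULL",
            },
            {"rule": {"name": "valid-trip-id", "column": "trip_id"}, "passed": True},
        ],
    }
}

for label, query in failing_row_queries(SAMPLE_RESPONSE):
    print(f"{label}: {query}")
```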
Manage data quality scans for a specific table
The steps in this document show how to manage data quality scans across your project by using the Dataplex Universal Catalog Data profiling & quality page in the Google Cloud console.
You can also create and manage data quality scans when working with a specific table. In the Google Cloud console, on the Dataplex Universal Catalog page for the table, use the Data quality tab. Do the following:
In the Google Cloud console, go to the Dataplex Universal Catalog Search page.
Search for and then select the table.
Click the Data quality tab.
Depending on whether the table has a data quality scan whose results are published as Dataplex Universal Catalog metadata, you can work with the table's data quality scans in the following ways:
Data quality scan results are published: the latest scan results are displayed on the page.
To manage the data quality scans for this table, click Data quality scan, and then select from the following options:
Create new scan: create a new data quality scan. For more information, see the Create a data quality scan section of this document. When you create a scan from a table's details page, the table is preselected.
Run now: run the scan.
Edit scan configuration: edit settings including the display name, filters, and schedule.
To edit the data quality rules, on the Data quality tab, click the Rules tab. Click Modify rules. Update the rules and then click Save.
Manage scan permissions: control who can access the scan results. For more information, see the Grant access to data quality scan results section of this document.
View historical results: view detailed information about previous data quality scan jobs. For more information, see the View data quality scan results and View historical scan results sections of this document.
View all scans: view a list of data quality scans that apply to this table.
Data quality scan results aren't published: select from the following options:
Create data quality scan: create a new data quality scan. For more information, see the Create a data quality scan section of this document. When you create a scan from a table's details page, the table is preselected.
View existing scans: view a list of data quality scans that apply to this table.
Update a data quality scan
You can edit various settings for an existing data quality scan, such as the display name, filters, schedule, and data quality rules.
Console
In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.
Click the name of a data quality scan.
To edit settings including the display name, filters, and schedule, click Edit. Edit the values and then click Save.
To edit the data quality rules, on the scan details page, click the Current rules tab. Click Modify rules. Update the rules and then click Save.
gcloud
To update the description of a data quality scan, use the
gcloud dataplex datascans update data-quality
command:
gcloud dataplex datascans update data-quality DATASCAN \
    --location=LOCATION \
    --description=DESCRIPTION
Replace the following:
DATASCAN
: The name of the data quality scan to update.
LOCATION
: The Google Cloud region in which the data quality scan was created.
DESCRIPTION
: The new description for the data quality scan.
REST
To edit a data quality scan, use the
dataScans.patch
method.
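A minimal Python sketch of such a call is shown below. It builds (but does not send) a PATCH request that updates only the fields supplied, listing them in the updateMask query parameter. The helper name, resource IDs, field values, and token are hypothetical placeholders, and the sketch handles top-level fields only.

```python
import json
import urllib.parse
import urllib.request


def build_patch_request(project, location, datascan, changes, access_token):
    """Build a PATCH request that updates the top-level fields in `changes`."""
    name = f"projects/{project}/locations/{location}/dataScans/{datascan}"
    # The update mask names exactly the fields being changed.
    query = urllib.parse.urlencode({"updateMask": ",".join(sorted(changes))})
    return urllib.request.Request(
        f"https://dataplex.googleapis.com/v1/{name}?{query}",
        data=json.dumps(changes).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )


# Placeholder values for illustration only.
changes = {"description": "Nightly checks", "displayName": "Taxi trips DQ"}
request = build_patch_request("my-project", "us-central1", "my-scan", changes, "TOKEN")
print(request.get_method(), request.full_url)
```

Fields left out of the mask are expected to remain unchanged, which is why the mask is derived from the keys of the change set rather than written by hand.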
Delete a data quality scan
Console
In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.
Click the scan you want to delete.
Click Delete, and then confirm when prompted.
gcloud
To delete a data quality scan, use the
gcloud dataplex datascans delete
command:
gcloud dataplex datascans delete DATASCAN \
    --location=LOCATION \
    --async
Replace the following variables:
DATASCAN
: The name of the data quality scan to delete.
LOCATION
: The Google Cloud region in which the data quality scan was created.
REST
To delete a data quality scan, use the
dataScans.delete
method.
What's next?
- Learn about data profiling.
- Learn how to use data profiling.
- Follow a tutorial to manage data quality rules as code with Terraform.