Data lineage is a Dataflow feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.
Each pipeline that you run by using Dataflow has several associated data assets. The lineage of a data asset includes its origin, what happens to it, and where it moves over time. With data lineage, you can track the end-to-end movement of your data assets, from origin to eventual destination.
When you enable data lineage for your Dataflow jobs, Dataflow captures lineage events and publishes them to the Dataplex Data Lineage API.
To access lineage information through Dataplex, see Use data lineage with Google Cloud systems.
Before you begin
Set up your project:
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Dataplex, BigQuery, and Data lineage APIs.
In Dataflow, you also need to enable lineage at the job level. See Enable data lineage in Dataflow in this document.
Required roles
To get the permissions that you need to view lineage visualization graphs, ask your administrator to grant you the following IAM roles:
- Dataplex Catalog viewer (roles/dataplex.catalogViewer) on the Dataplex resource project
- Data Lineage Viewer (roles/datalineage.viewer) on the project where you use Dataflow
- Dataflow viewer (roles/dataflow.viewer) on the project where you use Dataflow
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
For more information about data lineage roles, see Predefined roles for data lineage.
Support and limitations
Data lineage in Dataflow has the following limitations:
- Data lineage is supported in the Apache Beam SDK versions 2.63.0 and later.
- You must enable data lineage on a per-job basis.
- Data capture isn't instantaneous. It can take a few minutes for Dataflow job lineage data to appear in Dataplex.
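Because of the SDK version requirement above, a launcher script can check the Apache Beam version before it adds the lineage option. The following is a minimal sketch, not an official utility: the helper name supports_lineage is hypothetical, and ignoring pre-release suffixes such as "rc1" is an assumption.

```python
import re

MIN_LINEAGE_VERSION = (2, 63, 0)  # minimum Apache Beam SDK version stated above


def supports_lineage(sdk_version: str) -> bool:
    """Return True if the version string is 2.63.0 or later.

    Hypothetical helper: only the leading numeric components are compared,
    so suffixes such as "rc1" or ".dev" are ignored (an assumption).
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", sdk_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {sdk_version!r}")
    # A missing patch component (for example "3.0") is treated as 0.
    parts = tuple(int(p) if p is not None else 0 for p in match.groups())
    return parts >= MIN_LINEAGE_VERSION


print(supports_lineage("2.62.0"))
print(supports_lineage("2.63.0"))
```

When the Python SDK is installed, the version string is available as apache_beam.__version__.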
The following sources and sinks are supported:
- Apache Kafka
- BigQuery
- Bigtable
- Cloud Storage
- JDBC (Java Database Connectivity)
- Pub/Sub
- Spanner
Dataflow templates that use these sources and sinks also automatically capture and publish lineage events.
Enable data lineage in Dataflow
You need to enable lineage at the job level. To enable data lineage, use the enable_lineage Dataflow service option as follows:
Java
--dataflowServiceOptions=enable_lineage=true
Python
--dataflow_service_options=enable_lineage=true
Go
--dataflow_service_options=enable_lineage=true
gcloud
Use the gcloud dataflow jobs run command with the additional-experiments option. If you're using Flex Templates, use the gcloud dataflow flex-template run command.
--additional-experiments=enable_lineage=true
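For a Python pipeline, the service option is one more entry in the argument list that the pipeline is launched with. The following sketch assembles such a list. The helper name with_lineage is hypothetical, and PROJECT_ID, REGION, and BUCKET are placeholders to replace with your own values.

```python
def with_lineage(pipeline_args: list[str]) -> list[str]:
    """Return a copy of the pipeline arguments with lineage enabled."""
    flag = "--dataflow_service_options=enable_lineage=true"
    if flag in pipeline_args:
        # Already enabled; avoid adding the option twice.
        return list(pipeline_args)
    return [*pipeline_args, flag]


# Placeholder values; replace them with your own project settings.
args = with_lineage([
    "--runner=DataflowRunner",
    "--project=PROJECT_ID",
    "--region=REGION",
    "--temp_location=gs://BUCKET/tmp",
])
print(args[-1])
```

The resulting list can be passed to PipelineOptions from apache_beam.options.pipeline_options when you construct the pipeline.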
Optionally, you can specify one or both of the following parameters with the service option:
- process_id: A unique identifier that Dataplex uses to group job runs. If not specified, the job name is used.
- process_name: A human-readable name for the data lineage process. If not specified, the job name prefixed with "Dataflow " is used.
Specify these options as follows:
Java
--dataflowServiceOptions=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
Python
--dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
Go
--dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
gcloud
--additional-experiments=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
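The option value contains a semicolon, which most shells treat as a command separator, so the flag usually needs quoting when you type it on a command line. The following sketch builds the value in the format shown above and prints a shell-safe form. The helper name lineage_option and the sample values "orders-etl" and "Orders ETL" are hypothetical, and the sketch assumes that the values themselves contain no semicolons.

```python
import shlex
from typing import Optional


def lineage_option(process_id: Optional[str] = None,
                   process_name: Optional[str] = None) -> str:
    """Build the value of the enable_lineage service option."""
    params = []
    if process_id:
        params.append(f"process_id={process_id}")
    if process_name:
        params.append(f"process_name={process_name}")
    # With no parameters, fall back to the plain enable_lineage=true form.
    return "enable_lineage=" + (";".join(params) if params else "true")


# Hypothetical example values.
flag = "--dataflow_service_options=" + lineage_option("orders-etl", "Orders ETL")
# shlex.quote protects the semicolon and the space for POSIX shells.
print(shlex.quote(flag))
```

When the flag is passed as a list element from Python rather than through a shell, no quoting is needed.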
View lineage in Dataplex
Data lineage provides information about the relations between your project resources and the processes that created them. You can view data lineage information in the Google Cloud console in the form of a graph or a single table. You can also retrieve data lineage information from the Data Lineage API in the form of JSON data.
For more information, see Use data lineage with Google Cloud systems.
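To illustrate the JSON option, the following sketch builds a request that searches for lineage links that end at a given asset. It makes no network call. The searchLinks endpoint path, the target and fullyQualifiedName field names, and the BigQuery name format are assumptions based on the Data Lineage API v1 REST reference, so verify them before use. The project, location, and table names are placeholders, and the helper name search_links_request is hypothetical.

```python
import json


def search_links_request(project: str, location: str, target_fqn: str):
    """Return the (url, json_body) pair for a links search by target asset."""
    # Assumed endpoint: projects.locations.searchLinks in the v1 API.
    url = (
        "https://datalineage.googleapis.com/v1/"
        f"projects/{project}/locations/{location}:searchLinks"
    )
    # Assumed body shape: a target entity identified by its fully qualified name.
    body = json.dumps({"target": {"fullyQualifiedName": target_fqn}})
    return url, body


# Placeholder identifiers; replace them with your own.
url, body = search_links_request(
    "my-project", "us-central1", "bigquery:my-project.my_dataset.my_table"
)
print(url)
print(body)
```

Sending the request is a POST with an OAuth 2.0 access token in the Authorization header. The caller needs the Data Lineage Viewer role listed under Required roles.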
Disable data lineage in Dataflow
If data lineage is enabled for a specific job and you want to disable it, cancel the existing job and run a new version of the job without the enable_lineage service option.
Billing
Using data lineage in Dataflow doesn't impact your Dataflow bill, but it might incur additional charges on your Dataplex bill. For more information, see Data lineage considerations and Dataplex pricing.
What's next
- Learn more about data lineage.
- Learn how to use data lineage.