Cloud Data Fusion provides a Dataplex Universal Catalog Sink plugin for ingesting data into any of the asset types that Dataplex Universal Catalog supports.
Before you begin
- If you don't have a Cloud Data Fusion instance, create one. This plugin is available in instances that run in Cloud Data Fusion version 6.6 or later. For more information, see Create a Cloud Data Fusion public instance.
- The BigQuery dataset or Cloud Storage bucket where data is ingested must be part of a Dataplex Universal Catalog lake.
- For data to be read from Cloud Storage entities, Dataproc Metastore must be attached to the lake.
- CSV data in Cloud Storage entities isn't supported.
- In the Dataplex Universal Catalog project, enable Private Google Access on the subnetwork (typically the default subnetwork), or set internal_ip_only to false.
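As one way to satisfy the last prerequisite, Private Google Access can be enabled on the subnetwork with the gcloud CLI. This is a sketch; the project ID, region, and subnetwork name below are illustrative placeholders:

```shell
# Enable Private Google Access on the subnetwork used by the pipeline
# (project ID, region, and subnetwork name are placeholders).
gcloud compute networks subnets update default \
    --project=my-dataplex-project \
    --region=us-central1 \
    --enable-private-ip-google-access
```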
Required roles
To get the permissions that you need to manage roles, ask your administrator to grant you the following IAM roles on the Dataproc service agent and the Cloud Data Fusion service agent (service-CUSTOMER_PROJECT_NUMBER@gcp-sa-datafusion.iam.gserviceaccount.com):
- Dataplex Developer (roles/dataplex.developer)
- Dataplex Data Reader (roles/dataplex.dataReader)
- Dataproc Metastore Metadata User (roles/metastore.metadataUser)
- Cloud Dataplex Service Agent (roles/dataplex.serviceAgent)
- Dataplex Metadata Reader (roles/dataplex.metadataReader)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
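For example, the roles above can be granted to the Cloud Data Fusion service agent with the gcloud CLI. The project ID and project number below are placeholders, and the same loop can be repeated with the Dataproc service agent as the member:

```shell
# Grant the required roles to the Cloud Data Fusion service agent
# (project ID and project number are illustrative placeholders).
PROJECT_ID=my-dataplex-project
SA=service-123456789012@gcp-sa-datafusion.iam.gserviceaccount.com

for ROLE in roles/dataplex.developer \
            roles/dataplex.dataReader \
            roles/metastore.metadataUser \
            roles/dataplex.serviceAgent \
            roles/dataplex.metadataReader; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="serviceAccount:$SA" \
      --role="$ROLE"
done
```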
Add the plugin to your pipeline
1. In the Google Cloud console, go to the Cloud Data Fusion Instances page. This page lets you manage your instances.
2. To open your instance, click View instance.
3. Go to the Studio page, expand the Sink menu, and click Dataplex.
Configure the plugin
After you add this plugin to your pipeline on the Studio page, click the Dataplex Universal Catalog sink to configure and save its properties.
For more information about configurations, see the Dataplex Sink reference.
Optional: Get started with a sample pipeline
Sample pipelines are available, including an SAP source to Dataplex Universal Catalog sink pipeline and a Dataplex Universal Catalog source to BigQuery sink pipeline.
To use a sample pipeline, open your instance in the Cloud Data Fusion UI, click Hub > Pipelines, and select one of the Dataplex Universal Catalog pipelines. A dialog opens to help you create the pipeline.
Run your pipeline
1. After deploying the pipeline, open your pipeline on the Cloud Data Fusion Studio page.
2. Click Configure > Resources.
3. Optional: Change the Executor CPU and Memory based on the overall data size and the number of transformations used in your pipeline.
4. Click Save.
5. To start the data pipeline, click Run.
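A deployed pipeline can also be started outside the UI through the CDAP REST API that Cloud Data Fusion exposes. This is a sketch; the instance name, location, and pipeline name below are assumptions for illustration:

```shell
# Start a deployed pipeline via the CDAP REST API
# (instance name, location, and pipeline name are placeholders).
CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe my-instance \
    --location=us-central1 --format='value(apiEndpoint)')
AUTH_TOKEN=$(gcloud auth print-access-token)

curl -X POST \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    "$CDAP_ENDPOINT/v3/namespaces/default/apps/my-pipeline/workflows/DataPipelineWorkflow/start"
```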
What's next
- Process data with Cloud Data Fusion using the Dataplex Universal Catalog Source plugin.