This tutorial shows you how to deploy a job that continuously replicates changed data from an Oracle database to a BigQuery dataset, using Cloud Data Fusion Replication. This feature is powered by Datastream.
Objectives
In this tutorial, you:
- Configure your Oracle database to enable supplemental logging.
- Create and run a Cloud Data Fusion replication job.
- View the results in BigQuery.
Costs
In this document, you use the following billable components of Google Cloud:
To generate a cost estimate based on your projected usage,
use the pricing calculator.
When Replication runs, you're charged for the Dataproc cluster and Cloud Storage, and you incur processing costs for Datastream and BigQuery. To optimize these costs, consider using BigQuery flat-rate pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Cloud Data Fusion, Dataproc, Datastream, BigQuery, and Cloud Storage APIs.
- Create a public Cloud Data Fusion instance in version 6.3.0 or later. If you create a private instance, set up VPC network peering.
- When you create the instance, enable Replication by clicking Add Accelerators and selecting the Replication checkbox.
- To enable it in an existing instance, see Enable Replication.
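If you prefer the gcloud CLI, the project and API setup steps above can be sketched as follows, where `PROJECT_ID` is a placeholder for your own project:

```shell
# Set the active project (PROJECT_ID is a placeholder).
gcloud config set project PROJECT_ID

# Enable the APIs that this tutorial uses.
gcloud services enable \
    datafusion.googleapis.com \
    dataproc.googleapis.com \
    datastream.googleapis.com \
    bigquery.googleapis.com \
    storage.googleapis.com
```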
Required roles
To get the permissions that you need to connect to an Oracle database, ask your administrator to grant you the following IAM roles:
- Dataproc Worker (roles/dataproc.worker) on the Dataproc service account in the project that contains the cluster
- Cloud Data Fusion Runner on the Dataproc service account in the project that contains the cluster
- Datastream Admin (roles/datastream.admin) on the Cloud Data Fusion service account and the Dataproc service account
For more information about granting roles, see Manage access.
You might also be able to get the required permissions through custom roles or other predefined roles.
Optional: Install Oracle in Compute Engine
This section shows how to set up an example database. If you already have an Oracle database installed, you can skip this section.
Download an Oracle Server Docker image. For limitations of this Oracle Express Edition 11g image, see Oracle Database Editions.
Deploy your Docker image on a new VM instance.
On the Compute Engine Disks page, change the disk size to 500 GB, and then restart the VM.
Install the HR sample schema.
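As a sketch, one way to run the deployment step above with the gcloud CLI is to create a VM that runs the container directly. The image path, instance name, and zone below are placeholders, not the official Oracle image location:

```shell
# Hypothetical sketch: the image path, instance name, and zone are placeholders.
# Assumes you have pushed the Oracle Server Docker image to your own registry.
gcloud compute instances create-with-container oracle-xe-vm \
    --zone=us-central1-a \
    --container-image=gcr.io/PROJECT_ID/oracle-xe-11g \
    --boot-disk-size=500GB
```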
Create VPC network peering or firewall rule for your Oracle server
If your Oracle Database doesn't allow ingress traffic from public IP addresses, set up VPC network peering between the Datastream VPC and the VPC where your Oracle database can be accessed. For more information, see Create a private connectivity configuration.
If your Oracle database allows ingress traffic from public IP addresses, create a firewall rule for the VM instance to allow ingress traffic from Datastream public IPs.
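For example, a firewall rule along these lines would admit Datastream traffic on the Oracle listener port. `DATASTREAM_IP_RANGE` is a placeholder for the public IP range that Datastream publishes for your region:

```shell
# Allow ingress from Datastream public IPs to the Oracle listener port (1521).
# DATASTREAM_IP_RANGE is a placeholder; look up the actual range for your region.
gcloud compute firewall-rules create allow-datastream-oracle \
    --direction=INGRESS \
    --allow=tcp:1521 \
    --source-ranges=DATASTREAM_IP_RANGE
```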
Configure your Oracle Server to enable supplemental logging
Follow the steps to configure your source Oracle database.
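As a minimal sketch, supplemental logging for all columns can typically be enabled with statements like the following, run as a user with SYSDBA privileges. Confirm the exact requirements (such as ARCHIVELOG mode) against the Datastream source-configuration guide:

```shell
# Connect as SYSDBA and enable database-wide supplemental logging.
sqlplus / as sysdba <<'EOF'
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
-- Verify the setting.
SELECT supplemental_log_data_min, supplemental_log_data_all FROM v$database;
EOF
```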
Create and run a Cloud Data Fusion replication job
Create the job
In the Cloud Data Fusion web interface, click Replication.
Click Create a replication job.
On the Create new replication job page, specify a replication job Name and click Next.
Configure the source:
Select Oracle (by Datastream) as the source.
For Connectivity Method, if your Oracle server allows ingress traffic from Datastream public IPs, choose IP allowlisting. Otherwise, choose Private connectivity (VPC peering) and, for Private Connection Name, enter the name of the VPC peering that you created in the Create VPC network peering or firewall rule for your Oracle server section.
For Host, enter the hostname of the Oracle Server to read.
For Port, enter the port to use to connect to the Oracle Server: 1521.
For System Identifier (SID), enter xe (the sample database name of the Oracle server).
In the credentials section, enter your username and password for accessing the Oracle Server.
Leave all other properties as is.
Click Next.
Configure the target:
Select the BigQuery target.
The Project ID and Service Account Key are automatically detected. Keep the default values as is.
Optional: In the Advanced section, you can configure the following:
- Name and location of the staging bucket
- Load interval
- Staging table prefix
- Behavior when tables or databases are dropped
Click Next.
If the connection is successful, a list of tables is displayed. For this tutorial, select a few tables.
Click Next.
On the Review assessment page, click View mappings next to any of the tables to get an assessment of schema issues, missing features, or connectivity issues that might occur during replication.
If any issues occur, you must resolve them before proceeding. For this tutorial, if any of the tables have issues, do the following:
- Go back to the step where you selected tables.
- Select a table or event (Inserts, Updates, or Deletes) without issues.
For more information about data type conversions from the source database to the BigQuery destination, see Replication data types.
Click Back.
Click Next.
Review the summary replication job details, and then click Deploy replication job.
Start the job
In the Cloud Data Fusion web interface, go to the replication job details page.
Click Start.
The replication job transitions from the Provisioning to the Starting to the Running state. While running, the replication job first loads an initial snapshot of the table data that you selected into BigQuery; during this phase, the table state is listed as Snapshotting. After the initial snapshot is loaded into BigQuery, any changes made to the table are replicated to BigQuery, and the table state is listed as Replicating.
Monitor the job
From the replication job details page, you can start and stop the replication job, review its configuration and logs, and monitor its activities.
From the Replication page, click the desired replication job Name.
Click Monitoring.
View the results in BigQuery
The replication job creates a replicated dataset and table in BigQuery, with names inherited from the corresponding Oracle database and table names.
In the Google Cloud console, go to the BigQuery page.
In the left panel, click your project name to expand a list of datasets.
Select the xe dataset, and then select a table to view.
For more information, see the BigQuery documentation.
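For example, assuming the HR sample schema was replicated, you could inspect a table with the bq CLI. `PROJECT_ID` and the `EMPLOYEES` table name are placeholders based on that sample schema:

```shell
# Query a replicated table; PROJECT_ID and EMPLOYEES are placeholders.
bq query --use_legacy_sql=false \
    'SELECT * FROM `PROJECT_ID.xe.EMPLOYEES` LIMIT 10'
```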
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
After you've finished the tutorial, clean up the resources you created on Google Cloud so they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.
Delete the VM instance
In the Google Cloud console, go to the VM instances page.
Select the checkbox for the instance that you want to delete.
To delete the instance, click Delete.
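Alternatively, you can delete the VM with the gcloud CLI. The instance name and zone below are placeholders; use the values from your own deployment:

```shell
# Delete the Oracle VM; the instance name and zone are placeholders.
gcloud compute instances delete oracle-xe-vm --zone=us-central1-a --quiet
```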
Delete the Cloud Data Fusion instance
Follow the instructions to delete your Cloud Data Fusion instance.
Delete the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
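The same can be done from the gcloud CLI, where `PROJECT_ID` is a placeholder for your project:

```shell
# Shut down the project and schedule its resources for deletion.
gcloud projects delete PROJECT_ID --quiet
```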
What's next
- Learn more about Datastream.
- Learn more about Replication in Cloud Data Fusion.
- Refer to the Replication API reference.
- Refer to the Oracle replication reference.
- Work through the tutorial for Replicating data from MySQL to BigQuery.
- Work through the tutorial for Replicating data from SQL Server to BigQuery.