This page describes how to load data from an Amazon Redshift instance into Google Cloud with Cloud Data Fusion. The Redshift source connector lets you sync tables from your Redshift dataset to your destination, such as BigQuery. The connector also lets you import data with a configurable SQL query.
Before you begin
- Cloud Data Fusion versions 6.9.0 and later support the Redshift source.
When you configure the Redshift source connector, you can select an existing, reusable connection, or create a new, one-time connection. For more information, see Manage connections. When you reuse a connection, note the following:
- You don't have to provide credentials.
- The existing connection provides the schema and table name information that's used to generate the import query.
Configure the plugin
Go to the Cloud Data Fusion web interface and click Studio.
Check that Data Pipeline - Batch is selected (not Realtime).
In the Source menu, click Redshift. The Redshift node appears in your pipeline. If you don't see the Redshift source on the Studio page, deploy the Redshift source connector from the Cloud Data Fusion Hub.
To configure the source, go to the Redshift node and click Properties.
Enter the following properties. For a complete list, see Properties.
- Enter a label for the Redshift node—for example, `Redshift tables`.
- Enter the connection details. You can set up a new, one-time connection, or an existing, reusable connection.
New connection
To add a one-time connection to Redshift, follow these steps:
- Keep Use connection turned off.
- In the JDBC driver name field, enter the name of the driver. Redshift supports two types of JDBC drivers: CData and Amazon. For more information, see Upload a JDBC driver.
- In the Host field, enter the endpoint of the Redshift cluster—for example, `cdf-redshift-new.example-endpoint.eu-west-1.redshift.amazonaws.com`.
- Optional: In the Port field, enter a database port number—for example, `5439`.
- If your Redshift database requires authentication, do the following:
  - In the Username field, enter the username for the database.
  - In the Password field, enter the password for the database.
- Optional: In the Arguments field, enter key-value arguments. To use the CData driver, provide the connection arguments, such as RTK or OEMKey, if applicable.
- In the Name field, enter a name—for example, `SN-PC-Source-01-01-2024`.
- In the Database field, enter the target database name—for example, `datafusiondb`.
Reusable connection
To reuse an existing connection, follow these steps:
- Turn on Use connection.
- Click Browse connections.
- Click the connection name.
- Optional: If a connection doesn't exist and you want to create a new reusable connection, click Add connection and refer to the steps in the New connection tab on this page.
- In the Import query field, enter a query using the schema and table names from your Redshift source—for example, `Select * from "public"."users"`. A more complete example appears after these steps.
- Optional: Enter Advanced properties, such as a bounding query or number of splits. For all property descriptions, see Properties.
- Optional: Click Validate and address any errors found.
- Click Close. Properties are saved and you can continue to build your data pipeline in the Cloud Data Fusion web interface.
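For instance, a narrower import query can limit the columns and rows that are synced. The following sketch assumes a hypothetical `public.users` table with `id`, `email`, and `created_at` columns; adjust the schema, table, and column names to match your Redshift source.

```sql
-- Hypothetical import query: read only selected columns and recent rows
-- from a table in the public schema of the Redshift database.
SELECT id, email, created_at
FROM "public"."users"
WHERE created_at >= '2024-01-01';
```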
Properties
Property | Supports macros for automation | Required property | Description |
---|---|---|---|
Label | No | Yes | The name of the node in your data pipeline. |
Use connection | No | No | Browse for an existing connection to the source. If Use connection is turned on, you don't need to provide credentials. |
Connection | Yes | Yes | Name of the connection to use. This field appears if Use connection is turned on. The connection provides the database and table information. |
JDBC driver name | Yes | Yes | Name of the JDBC driver to use. This field appears if Use connection is turned off. |
Host | Yes | Yes | The endpoint of the Amazon Redshift cluster. This field appears if Use connection is turned off. |
Port | Yes | No | Port that Redshift is running on. This field appears if Use connection is turned off. |
Username | Yes | No | User identity for connecting to the specified database. This field appears if Use connection is turned off. |
Password | Yes | No | Password to use to connect to the specified database. This field appears if Use connection is turned off. |
Connection arguments | Yes | No | A list of arbitrary string key-value pairs to pass to the JDBC driver as connection arguments, for drivers that might need additional configuration. This field appears if Use connection is turned off. |
Reference name | No | Yes | Uniquely identifies this source for lineage, annotating metadata, and other services. |
Database | Yes | Yes | Redshift database name. To select data, click Browse database. |
Import query | Yes | Yes | The SELECT query to use to import data from the specified table. |
Bounding query | Yes | No | SQL query that returns the minimum and maximum values of the split column. For example, `SELECT MIN(id), MAX(id) FROM table`. Not required if Number of splits is set to one. |
Split column | Yes | No | Field name that's used to generate splits. Not required if Number of splits is set to one. |
Number of splits | Yes | No | Number of splits to generate. |
Size | Yes | No | The number of rows to fetch at a time per split. A larger fetch size can result in a faster import, at the cost of higher memory usage. If unspecified, the default is 1000. |
Data type mappings
The following table lists Redshift data types and the corresponding CDAP schema data types:
Redshift data type | CDAP schema data type |
---|---|
bigint | long |
boolean | boolean |
character | string |
character varying | string |
date | date |
double precision | double |
geometry | bytes |
hllsketch | string |
integer | int |
json | string |
numeric(precision, scale) / decimal(precision, scale) | decimal |
numeric with precision 0 | string |
real | float |
smallint | int |
super | string |
text | string |
time [ (p) ] without time zone | time |
time [ (p) ] with time zone | string |
timestamp [ (p) ] without time zone | timestamp |
timestamp [ (p) ] with time zone | timestamp |
varbyte | bytes |
xml | string |
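To see how these mappings play out, consider a hypothetical Redshift table (the table and column names below are illustrative, not from this guide). Based on the table above, its columns would surface in the CDAP output schema as long, string, timestamp, decimal, and boolean fields.

```sql
-- Hypothetical Redshift DDL used only to illustrate the type mappings above.
CREATE TABLE public.users (
    id          bigint,                       -- CDAP type: long
    username    character varying(100),       -- CDAP type: string
    signup_ts   timestamp without time zone,  -- CDAP type: timestamp
    balance     numeric(12, 2),               -- CDAP type: decimal
    is_active   boolean                       -- CDAP type: boolean
);
```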
Best practices
The following best practices apply when you connect to a Redshift cluster from Google Cloud.
Use IP address allowlists
To prevent access from unauthorized sources and restrict access to specific IP addresses, enable access controls on the Redshift cluster.
If you use Redshift access controls, to access the cluster in Cloud Data Fusion, follow these steps:
- Obtain the external IP addresses of the services or machines on Google Cloud that must connect to the Redshift cluster, such as the proxy server IP (see Viewing IP addresses). For Dataproc clusters, obtain the IP addresses of all master and worker nodes.
- Add the IP addresses to an allowlist in the security groups by creating inbound rules for the Google Cloud machine IP addresses.
- Add the connection properties in Wrangler and test them:
  - Open the Cloud Data Fusion instance in the web interface.
  - Click Wrangler > Add connection and create the new connection for Redshift.
  - Enter all connection properties.
  - Click Test connection and resolve any issues.
To create multiple splits, use bounding queries
To create multiple splits, use a bounding query to manage the multi-node cluster. In scenarios where you extract data from Redshift and want to distribute the load uniformly across each node, configure a bounding query in the Redshift source connector properties.
- In your Cloud Data Fusion pipeline on the Studio page, go to the Redshift node and click Properties.
- In the Advanced properties, specify the following:
  - Enter the number of splits to create.
  - Enter the fetch size for each split.
  - Enter a bounding query to apply to the multi-node Redshift cluster.
  - Enter the Split column field name.
For example, assume you have the following use case:
- You have a table that contains 10 million records.
- It has a unique ID column called `id`.
- The Redshift cluster has 4 nodes.

Objective: To take advantage of the cluster's potential, you plan to generate multiple splits. To achieve this, use the following property configurations (a sketch of the resulting queries appears after these steps):
- In the Bounding query field, enter the following query: `SELECT MIN(id), MAX(id) FROM tableName`. In this query, `id` is the name of the column where the splits are applied.
- In the Split column field, enter the column name, `id`.
- Enter the number of splits and fetch size. These properties are interconnected, letting you calculate splits based on a fetch size, or the other way around. For this example, enter the following:
  - In the Number of splits field, enter `40`. Because the table has ten million records, creating 40 splits results in each split containing 250,000 records.
  - In the Fetch size field, enter `250,000`.
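CDAP's JDBC-based sources typically expect the import query to contain a `$CONDITIONS` placeholder when more than one split is used; the plugin replaces it with a range predicate on the split column. Assuming the Redshift source follows that convention, the two queries for this example might look like the following sketch, where `public.users` is an illustrative table name.

```sql
-- Bounding query: returns the overall id range that is divided into 40 splits.
SELECT MIN(id), MAX(id) FROM "public"."users";

-- Import query: $CONDITIONS is replaced at runtime with a predicate such as
-- id >= 1 AND id < 250001, so each of the 40 splits reads a disjoint range.
SELECT * FROM "public"."users" WHERE $CONDITIONS;
```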
What's next
- Look through the Cloud Data Fusion plugins.