Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub (Stream) template

The Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub template is a streaming pipeline that reads Pub/Sub messages with change data from a MySQL database and writes the records to BigQuery. A Debezium connector captures changes to the MySQL database and publishes the changed data to Pub/Sub. The template then reads the Pub/Sub messages and writes them to BigQuery.

You can use this template to keep MySQL databases and BigQuery tables in sync. The pipeline writes the changed data to a BigQuery staging table and periodically updates a BigQuery table that replicates the MySQL database, at the interval set by updateFrequencySecs.
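
At a high level, the template is an Apache Beam streaming pipeline: it reads the change messages from Pub/Sub, converts them to BigQuery rows, and appends them to the staging (changelog) tables. The Java sketch below shows only that shape, assuming placeholder project, subscription, dataset, and table names; the real template decodes Beam Rows and manages table schemas rather than writing the raw payload as text.

    import com.google.api.services.bigquery.model.TableRow;
    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class CdcPipelineSketch {
      public static void main(String[] args) {
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
        options.setStreaming(true);
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            // Read change messages that the Debezium connector published to Pub/Sub.
            .apply("ReadChanges",
                PubsubIO.readMessages()
                    .fromSubscription("projects/my-project/subscriptions/my-subscription"))
            // Map each message to a BigQuery row. The real template decodes the
            // payload as a Beam Row; here the payload is kept as plain text.
            .apply("ToTableRow",
                MapElements.into(TypeDescriptor.of(TableRow.class))
                    .via((PubsubMessage msg) ->
                        new TableRow().set("payload",
                            new String(msg.getPayload(), StandardCharsets.UTF_8))))
            // Append every change record to the staging (changelog) table.
            .apply("WriteChangelog",
                BigQueryIO.writeTableRows()
                    .to("my-project:changelog_dataset.my_table_changelog")
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND));

        pipeline.run();
      }
    }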

Pipeline requirements

  • The Debezium connector must be deployed.
  • The Pub/Sub messages must be serialized as Beam Rows.
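
For context on the second requirement, the sketch below builds a Beam Row for a hypothetical change record and encodes it to bytes with Beam's RowCoder, which is the kind of payload the pipeline expects in each Pub/Sub message. The field names are placeholders, and the actual schema and wire format are produced by the Debezium connector, so treat this purely as an illustration of Row serialization.

    import org.apache.beam.sdk.coders.RowCoder;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.util.CoderUtils;
    import org.apache.beam.sdk.values.Row;

    public class RowPayloadSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical schema for one change record; the real schema is derived
        // from the captured MySQL table by the Debezium connector.
        Schema schema = Schema.builder()
            .addStringField("operation")
            .addInt64Field("id")
            .addStringField("name")
            .build();

        Row change = Row.withSchema(schema)
            .addValues("UPDATE", 42L, "new value")
            .build();

        // Encode the Row to bytes, as a Pub/Sub message payload would carry it,
        // then decode it back to verify the round trip.
        byte[] payload = CoderUtils.encodeToByteArray(RowCoder.of(schema), change);
        Row decoded = CoderUtils.decodeFromByteArray(RowCoder.of(schema), payload);
        System.out.println(decoded);
      }
    }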

Template parameters

Parameter             Description
inputSubscriptions    The comma-separated list of Pub/Sub input subscriptions to read from, in the format <subscription>,<subscription>, ...
changeLogDataset      The BigQuery dataset to store the staging (changelog) tables, in the format <my-dataset>.
replicaDataset        The BigQuery dataset to store the replica tables, in the format <my-dataset>.
updateFrequencySecs   (Optional) The interval, in seconds, at which the pipeline updates the BigQuery table replicating the MySQL database.
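
The updateFrequencySecs interval governs how often the changes accumulated in the staging tables are folded into the replica tables. Conceptually that refresh is a MERGE of the latest change per primary key into the replica table; the sketch below issues such a statement with the BigQuery Java client, using hypothetical project, dataset, table, and column names. It only illustrates the idea and is not the template's internal implementation (which, among other things, must also apply deletes).

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.QueryJobConfiguration;

    public class ReplicaMergeSketch {
      public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Fold the latest change per primary key from the changelog (staging)
        // table into the replica table. Delete handling is omitted for brevity.
        String sql =
            "MERGE `my-project.replica_dataset.customers` AS replica\n"
                + "USING (\n"
                + "  SELECT * EXCEPT(row_num) FROM (\n"
                + "    SELECT *, ROW_NUMBER() OVER (\n"
                + "      PARTITION BY id ORDER BY change_ts DESC) AS row_num\n"
                + "    FROM `my-project.changelog_dataset.customers_changelog`)\n"
                + "  WHERE row_num = 1) AS changes\n"
                + "ON replica.id = changes.id\n"
                + "WHEN MATCHED THEN UPDATE SET name = changes.name\n"
                + "WHEN NOT MATCHED THEN INSERT (id, name) VALUES (changes.id, changes.name)";

        bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
      }
    }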

Run the template

To run this template, perform the following steps:

  1. On your local machine, clone the DataflowTemplates repository.
  2. Change to the v2/cdc-parent directory.
  3. Ensure that the Debezium connector is deployed.
  4. Using Maven, run the Dataflow template:
    mvn exec:java -pl cdc-change-applier -Dexec.args="--runner=DataflowRunner \
        --inputSubscriptions=SUBSCRIPTIONS \
        --updateFrequencySecs=300 \
        --changeLogDataset=CHANGELOG_DATASET \
        --replicaDataset=REPLICA_DATASET \
        --project=PROJECT_ID \
        --region=REGION_NAME"
      

    Replace the following:

    • PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
    • SUBSCRIPTIONS: your comma-separated list of Pub/Sub subscription names
    • CHANGELOG_DATASET: your BigQuery dataset for changelog data
    • REPLICA_DATASET: your BigQuery dataset for replica tables
    • REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1

What's next