Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub (Stream) template
The Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub
template is a streaming pipeline that reads Pub/Sub messages with change data from
a MySQL database and writes the records to BigQuery. A Debezium connector captures
changes to the MySQL database and publishes the changed data to Pub/Sub. The
template then reads the Pub/Sub messages and writes them to BigQuery.
You can use this template to sync MySQL databases and BigQuery tables. The
pipeline writes the changed data to a BigQuery staging table and intermittently
updates a BigQuery table replicating the MySQL database.
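Before you run the template, the Pub/Sub subscriptions it reads from and the two BigQuery datasets it writes to typically need to exist. The following is a minimal sketch for creating them with the gcloud and bq CLIs; all names are placeholders, and the topics you actually need depend on how your Debezium connector is configured:

```bash
# Placeholder names; adjust to match your Debezium connector configuration.
gcloud pubsub topics create mysql-cdc-topic
gcloud pubsub subscriptions create mysql-cdc-subscription --topic=mysql-cdc-topic

# Datasets for the staging (changelog) tables and the replica tables.
bq mk --dataset PROJECT_ID:changelog_dataset
bq mk --dataset PROJECT_ID:replica_dataset
```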
Pipeline requirements

- The Debezium connector must be deployed (https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector).
- The Pub/Sub messages must be serialized as a Beam Row (https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/Row.html).

Template parameters

Required parameters:

- inputSubscriptions: The comma-separated list of Pub/Sub input subscriptions to read from, in the format <SUBSCRIPTION_NAME>,<SUBSCRIPTION_NAME>, ...
- changeLogDataset: The BigQuery dataset to store the staging tables in, in the format <DATASET_NAME>.
- replicaDataset: The location of the BigQuery dataset to store the replica tables in, in the format <DATASET_NAME>.

Optional parameters:

- inputTopics: The comma-separated list of Pub/Sub topics that CDC data is pushed to.
- updateFrequencySecs: The interval, in seconds, at which the pipeline updates the BigQuery table replicating the MySQL database.
- useSingleTopic: Set this to true if you configure your Debezium connector to publish all table updates to a single topic. The default value is false.
- useStorageWriteApi: If true, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is false. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api).
- useStorageWriteApiAtLeastOnce: When using the Storage Write API, specifies the write semantics. To use at-least-once semantics (https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics), set this parameter to true. To use exactly-once semantics, set the parameter to false. This parameter applies only when useStorageWriteApi is true. The default value is false.
- numStorageWriteApiStreams: When using the Storage Write API, specifies the number of write streams. If useStorageWriteApi is true and useStorageWriteApiAtLeastOnce is false, then you must set this parameter. The default value is 0.
- storageWriteApiTriggeringFrequencySec: When using the Storage Write API, specifies the triggering frequency, in seconds. If useStorageWriteApi is true and useStorageWriteApiAtLeastOnce is false, then you must set this parameter.
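For example, to write to BigQuery through the Storage Write API with exactly-once semantics, you might append these parameters to the Maven command shown in the Run the template section. This is an illustrative sketch; the stream count and triggering frequency below are placeholder values, not recommendations:

```bash
# Illustrative only: enables the Storage Write API with exactly-once semantics.
# Tune numStorageWriteApiStreams and storageWriteApiTriggeringFrequencySec for your workload.
mvn exec:java -pl cdc-change-applier -Dexec.args="--runner=DataflowRunner \
    --inputSubscriptions=SUBSCRIPTIONS \
    --changeLogDataset=CHANGELOG_DATASET \
    --replicaDataset=REPLICA_DATASET \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --useStorageWriteApi=true \
    --useStorageWriteApiAtLeastOnce=false \
    --numStorageWriteApiStreams=2 \
    --storageWriteApiTriggeringFrequencySec=10"
```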
Run the template
To run this template, perform the following steps:
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-26 UTC."],[[["\u003cp\u003eThis template streams change data from a MySQL database to BigQuery using Debezium, which captures changes and publishes them to Pub/Sub, and then the pipeline reads these messages and writes them to BigQuery.\u003c/p\u003e\n"],["\u003cp\u003eThe pipeline synchronizes MySQL databases with BigQuery tables by writing changed data to a staging table and intermittently updating a BigQuery table that mirrors the MySQL database.\u003c/p\u003e\n"],["\u003cp\u003eTo run the pipeline, a Debezium connector must be deployed, and Pub/Sub messages must be serialized in Beam Row format.\u003c/p\u003e\n"],["\u003cp\u003eRequired parameters for running the template include a comma-separated list of Pub/Sub input subscriptions, a BigQuery dataset for staging tables, and a BigQuery dataset for replica tables.\u003c/p\u003e\n"],["\u003cp\u003eThe pipeline supports the optional use of the BigQuery Storage Write API, offering both at-least-once and exactly-once semantics, along with configurable options for stream numbers and triggering frequency.\u003c/p\u003e\n"]]],[],null,["# Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub (Stream) template\n\nThe Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub\ntemplate is a streaming pipeline that reads Pub/Sub messages with change data from\na MySQL database and writes the records to BigQuery. A Debezium connector captures\nchanges to the MySQL database and publishes the changed data to Pub/Sub. The\ntemplate then reads the Pub/Sub messages and writes them to BigQuery.\n\nYou can use this template to sync MySQL databases and BigQuery tables. The\npipeline writes the changed data to a BigQuery staging table and intermittently\nupdates a BigQuery table replicating the MySQL database.\n\nPipeline requirements\n---------------------\n\n- The Debezium connector must be [deployed](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector).\n- The Pub/Sub messages must be serialized in a [Beam Row](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/Row.html).\n\nTemplate parameters\n-------------------\n\n### Required parameters\n\n- **inputSubscriptions** : The comma-separated list of Pub/Sub input subscriptions to read from, in the format `\u003cSUBSCRIPTION_NAME\u003e,\u003cSUBSCRIPTION_NAME\u003e, ...`.\n- **changeLogDataset**: The BigQuery dataset to store the staging tables in, in the format \\\u003cDATASET_NAME\\\u003e.\n- **replicaDataset**: The location of the BigQuery dataset to store the replica tables in, in the format \\\u003cDATASET_NAME\\\u003e.\n\n### Optional parameters\n\n- **inputTopics**: Comma-separated list of PubSub topics to where CDC data is being pushed.\n- **updateFrequencySecs**: The interval at which the pipeline updates the BigQuery table replicating the MySQL database.\n- **useSingleTopic** : Set this to `true` if you configure your Debezium connector to publish all table updates to a single topic. 
Defaults to: false.\n- **useStorageWriteApi** : If true, the pipeline uses the BigQuery Storage Write API (\u003chttps://cloud.google.com/bigquery/docs/write-api\u003e). The default value is `false`. For more information, see Using the Storage Write API (\u003chttps://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api\u003e).\n- **useStorageWriteApiAtLeastOnce** : When using the Storage Write API, specifies the write semantics. To use at-least once semantics (\u003chttps://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics\u003e), set this parameter to `true`. To use exactly-once semantics, set the parameter to `false`. This parameter applies only when `useStorageWriteApi` is `true`. The default value is `false`.\n- **numStorageWriteApiStreams** : When using the Storage Write API, specifies the number of write streams. If `useStorageWriteApi` is `true` and `useStorageWriteApiAtLeastOnce` is `false`, then you must set this parameter. Defaults to: 0.\n- **storageWriteApiTriggeringFrequencySec** : When using the Storage Write API, specifies the triggering frequency, in seconds. If `useStorageWriteApi` is `true` and `useStorageWriteApiAtLeastOnce` is `false`, then you must set this parameter.\n\nRun the template\n----------------\n\nTo run this template, perform the following steps:\n\n1. On your local machine, clone the [DataflowTemplates repository](https://github.com/GoogleCloudPlatform/DataflowTemplates).\n2. Change to the `v2/cdc-parent` directory.\n3. Ensure that the Debezium connector is [deployed](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector).\n4. Using Maven, run the Dataflow template: \n\n ```bash\n mvn exec:java -pl cdc-change-applier -Dexec.args=\"--runner=DataflowRunner \\\n --inputSubscriptions=\u003cvar translate=\"no\"\u003eSUBSCRIPTIONS\u003c/var\u003e \\\n --updateFrequencySecs=300 \\\n --changeLogDataset=\u003cvar translate=\"no\"\u003eCHANGELOG_DATASET\u003c/var\u003e \\\n --replicaDataset=\u003cvar translate=\"no\"\u003eREPLICA_DATASET\u003c/var\u003e \\\n --project=\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e \\\n --region=REGION_NAME\"\n \n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: the Google Cloud project ID where you want to run the Dataflow job\n - \u003cvar translate=\"no\"\u003eSUBSCRIPTIONS\u003c/var\u003e: your comma-separated list of Pub/Sub subscription names\n - \u003cvar translate=\"no\"\u003eCHANGELOG_DATASET\u003c/var\u003e: your BigQuery dataset for changelog data\n - \u003cvar translate=\"no\"\u003eREPLICA_DATASET\u003c/var\u003e: your BigQuery dataset for replica tables\n\nWhat's next\n-----------\n\n- Learn about [Dataflow templates](/dataflow/docs/concepts/dataflow-templates).\n- See the list of [Google-provided templates](/dataflow/docs/guides/templates/provided-templates).\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e"]]