Change Data Capture dari MySQL ke BigQuery menggunakan template Debezium dan Pub/Sub (Stream)
Tetap teratur dengan koleksi
Simpan dan kategorikan konten berdasarkan preferensi Anda.
Template Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub adalah pipeline streaming yang membaca pesan Pub/Sub dengan data perubahan dari database MySQL dan menuliskan data ke BigQuery. Konektor Debezium merekam perubahan pada database MySQL dan memublikasikan data yang diubah ke Pub/Sub. Template kemudian membaca pesan Pub/Sub dan menuliskannya ke BigQuery.
Anda dapat menggunakan template ini untuk menyinkronkan database MySQL dan tabel BigQuery. Pipeline
menulis data yang diubah ke tabel penyiapan BigQuery dan secara berkala
memperbarui tabel BigQuery yang mereplikasi database MySQL.
Pesan Pub/Sub harus diserialisasi dalam Baris Beam.
Parameter template
Parameter yang diperlukan
inputSubscriptions: Daftar langganan input Pub/Sub yang dipisahkan koma untuk dibaca, dalam format <SUBSCRIPTION_NAME>,<SUBSCRIPTION_NAME>, ....
changeLogDataset: Set data BigQuery untuk menyimpan tabel penyiapan, dalam format <DATASET_NAME>.
replicaDataset: Lokasi set data BigQuery untuk menyimpan tabel replika, dalam format <DATASET_NAME>.
Parameter opsional
inputTopics: Daftar topik PubSub yang dipisahkan koma tempat data CDC dikirim.
updateFrequencySecs: Interval saat pipeline memperbarui tabel BigQuery yang mereplikasi database MySQL.
useSingleTopic: Setel ini ke true jika Anda mengonfigurasi konektor Debezium untuk memublikasikan semua pembaruan tabel ke satu topik. Nilai defaultnya adalah: false.
useStorageWriteApiAtLeastOnce: Saat menggunakan Storage Write API, menentukan semantik penulisan. Untuk menggunakan semantik minimal satu kali (https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics), tetapkan parameter ini ke true. Untuk menggunakan semantik tepat satu kali, tetapkan parameter ke false. Parameter ini hanya berlaku jika useStorageWriteApi adalah true. Nilai defaultnya adalah false.
numStorageWriteApiStreams: Saat menggunakan Storage Write API, tentukan jumlah aliran penulisan. Jika useStorageWriteApi adalah true dan useStorageWriteApiAtLeastOnce adalah false, Anda harus menetapkan parameter ini. Nilai default: 0.
storageWriteApiTriggeringFrequencySec: Saat menggunakan Storage Write API, menentukan frekuensi pemicuan, dalam detik. Jika useStorageWriteApi adalah true dan useStorageWriteApiAtLeastOnce adalah false, Anda harus menetapkan parameter ini.
Menjalankan template
Untuk menjalankan template ini, lakukan langkah-langkah berikut:
[[["Mudah dipahami","easyToUnderstand","thumb-up"],["Memecahkan masalah saya","solvedMyProblem","thumb-up"],["Lainnya","otherUp","thumb-up"]],[["Sulit dipahami","hardToUnderstand","thumb-down"],["Informasi atau kode contoh salah","incorrectInformationOrSampleCode","thumb-down"],["Informasi/contoh yang saya butuhkan tidak ada","missingTheInformationSamplesINeed","thumb-down"],["Masalah terjemahan","translationIssue","thumb-down"],["Lainnya","otherDown","thumb-down"]],["Terakhir diperbarui pada 2025-08-18 UTC."],[[["\u003cp\u003eThis template streams change data from a MySQL database to BigQuery using Debezium, which captures changes and publishes them to Pub/Sub, and then the pipeline reads these messages and writes them to BigQuery.\u003c/p\u003e\n"],["\u003cp\u003eThe pipeline synchronizes MySQL databases with BigQuery tables by writing changed data to a staging table and intermittently updating a BigQuery table that mirrors the MySQL database.\u003c/p\u003e\n"],["\u003cp\u003eTo run the pipeline, a Debezium connector must be deployed, and Pub/Sub messages must be serialized in Beam Row format.\u003c/p\u003e\n"],["\u003cp\u003eRequired parameters for running the template include a comma-separated list of Pub/Sub input subscriptions, a BigQuery dataset for staging tables, and a BigQuery dataset for replica tables.\u003c/p\u003e\n"],["\u003cp\u003eThe pipeline supports the optional use of the BigQuery Storage Write API, offering both at-least-once and exactly-once semantics, along with configurable options for stream numbers and triggering frequency.\u003c/p\u003e\n"]]],[],null,["# Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub (Stream) template\n\nThe Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub\ntemplate is a streaming pipeline that reads Pub/Sub messages with change data from\na MySQL database and writes the records to BigQuery. A Debezium connector captures\nchanges to the MySQL database and publishes the changed data to Pub/Sub. The\ntemplate then reads the Pub/Sub messages and writes them to BigQuery.\n\nYou can use this template to sync MySQL databases and BigQuery tables. The\npipeline writes the changed data to a BigQuery staging table and intermittently\nupdates a BigQuery table replicating the MySQL database.\n\nPipeline requirements\n---------------------\n\n- The Debezium connector must be [deployed](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector).\n- The Pub/Sub messages must be serialized in a [Beam Row](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/Row.html).\n\nTemplate parameters\n-------------------\n\n### Required parameters\n\n- **inputSubscriptions** : The comma-separated list of Pub/Sub input subscriptions to read from, in the format `\u003cSUBSCRIPTION_NAME\u003e,\u003cSUBSCRIPTION_NAME\u003e, ...`.\n- **changeLogDataset**: The BigQuery dataset to store the staging tables in, in the format \\\u003cDATASET_NAME\\\u003e.\n- **replicaDataset**: The location of the BigQuery dataset to store the replica tables in, in the format \\\u003cDATASET_NAME\\\u003e.\n\n### Optional parameters\n\n- **inputTopics**: Comma-separated list of PubSub topics to where CDC data is being pushed.\n- **updateFrequencySecs**: The interval at which the pipeline updates the BigQuery table replicating the MySQL database.\n- **useSingleTopic** : Set this to `true` if you configure your Debezium connector to publish all table updates to a single topic. Defaults to: false.\n- **useStorageWriteApi** : If true, the pipeline uses the BigQuery Storage Write API (\u003chttps://cloud.google.com/bigquery/docs/write-api\u003e). The default value is `false`. For more information, see Using the Storage Write API (\u003chttps://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api\u003e).\n- **useStorageWriteApiAtLeastOnce** : When using the Storage Write API, specifies the write semantics. To use at-least once semantics (\u003chttps://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics\u003e), set this parameter to `true`. To use exactly-once semantics, set the parameter to `false`. This parameter applies only when `useStorageWriteApi` is `true`. The default value is `false`.\n- **numStorageWriteApiStreams** : When using the Storage Write API, specifies the number of write streams. If `useStorageWriteApi` is `true` and `useStorageWriteApiAtLeastOnce` is `false`, then you must set this parameter. Defaults to: 0.\n- **storageWriteApiTriggeringFrequencySec** : When using the Storage Write API, specifies the triggering frequency, in seconds. If `useStorageWriteApi` is `true` and `useStorageWriteApiAtLeastOnce` is `false`, then you must set this parameter.\n\nRun the template\n----------------\n\nTo run this template, perform the following steps:\n\n1. On your local machine, clone the [DataflowTemplates repository](https://github.com/GoogleCloudPlatform/DataflowTemplates).\n2. Change to the `v2/cdc-parent` directory.\n3. Ensure that the Debezium connector is [deployed](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector).\n4. Using Maven, run the Dataflow template: \n\n ```bash\n mvn exec:java -pl cdc-change-applier -Dexec.args=\"--runner=DataflowRunner \\\n --inputSubscriptions=\u003cvar translate=\"no\"\u003eSUBSCRIPTIONS\u003c/var\u003e \\\n --updateFrequencySecs=300 \\\n --changeLogDataset=\u003cvar translate=\"no\"\u003eCHANGELOG_DATASET\u003c/var\u003e \\\n --replicaDataset=\u003cvar translate=\"no\"\u003eREPLICA_DATASET\u003c/var\u003e \\\n --project=\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e \\\n --region=REGION_NAME\"\n \n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: the Google Cloud project ID where you want to run the Dataflow job\n - \u003cvar translate=\"no\"\u003eSUBSCRIPTIONS\u003c/var\u003e: your comma-separated list of Pub/Sub subscription names\n - \u003cvar translate=\"no\"\u003eCHANGELOG_DATASET\u003c/var\u003e: your BigQuery dataset for changelog data\n - \u003cvar translate=\"no\"\u003eREPLICA_DATASET\u003c/var\u003e: your BigQuery dataset for replica tables\n\nWhat's next\n-----------\n\n- Learn about [Dataflow templates](/dataflow/docs/concepts/dataflow-templates).\n- See the list of [Google-provided templates](/dataflow/docs/guides/templates/provided-templates).\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e"]]