Change Data Capture (CDC) processing

This page guides you through Change Data Capture (CDC) within Google Cloud Cortex Framework in BigQuery. BigQuery is designed for efficiently storing and analyzing new data.

CDC process

When data changes in your source data system (like SAP), BigQuery doesn't modify existing records. Instead, the updated information is added as a new record. To avoid duplicates, a merge operation needs to be applied afterwards. This process is called Change Data Capture (CDC) processing.

The Data Foundation for SAP includes the option to create scripts for Cloud Composer or Apache Airflow to merge or upsert the new records resulting from updates and only keep the latest version in a new dataset. For these scripts to work the tables need to have some specific fields:

  • operation_flag: This flag tells the script whether a record was inserted, updated, or deleted.
  • recordstamp: This timestamp helps identify the most recent version of a record. This flag indicates whether the record is:
    • Inserted (I)
    • Updated (U)
    • Deleted (D)

By utilizing CDC processing, you can ensure that your BigQuery data accurately reflects the latest state of your source system. This eliminates duplicate entries and provides a reliable foundation for your data analysis.

Dataset structure

For all supported data sources, data from upstream systems are first replicated into a BigQuery dataset (_source_ or _replicated dataset_), and the updated or merged results are inserted into another dataset (_CDC dataset_). The reporting views select data from the CDC dataset, to ensure the reporting tools and applications always have the latest version of a table.

The following flow shows how the CDC processing for SAP, dependent on the operational_flag and recordstamp.

CDC processing example for SAP

Figure 1. CDC processing example for SAP.

The following flow depicts the integration from APIs into Raw data and CDC processing for Salesforce, dependent on the Id and SystemModStamp fields produced by Salesforce APIs.

Integration from APIs into Raw data and CDC processing for Salesforce

Figure 2. Integration from APIs into Raw data and CDC processing for Salesforce.

Some replication tools can merge or upsert the records when inserting them into BigQuery, so the generation of these scripts is optional. In this case, the setup only has a single dataset. The reporting dataset fetches updated records for reporting from that dataset.