Change Data Capture (CDC) processing
This page describes Change Data Capture (CDC) processing in BigQuery within
Google Cloud Cortex Framework. BigQuery is designed for efficiently
storing and analyzing new data.
CDC process
When data changes in your source system (such as SAP), BigQuery doesn't
modify existing records. Instead, the updated information is added as a
new record. To avoid duplicates, a merge operation must be applied
afterwards. This process is called Change Data Capture (CDC) processing.
The Data Foundation for SAP includes the option to create scripts for
Cloud Composer or Apache Airflow to merge or upsert the new records
resulting from updates and keep only the latest version in a new dataset.
For these scripts to work, the tables need to have the following fields:
operation_flag: This flag tells the script whether a record was:
- Inserted (I)
- Updated (U)
- Deleted (D)
recordstamp: This timestamp helps identify the most recent version of a
record.
By utilizing CDC processing, you can ensure that your BigQuery
data accurately reflects the latest state of your source system.
This eliminates duplicate entries and provides a reliable foundation for
your data analysis.
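The merge described above can be sketched in plain Python, without BigQuery. This is a minimal illustration and not the script that Cortex Framework generates: the function name apply_cdc, the sample rows, and the id and name columns are hypothetical. Only operation_flag and recordstamp come from this page.

```python
from datetime import datetime, timezone


def apply_cdc(raw_rows, key_fields):
    """Collapse replicated change rows into the latest state per key."""
    latest = {}
    for row in raw_rows:
        key = tuple(row[field] for field in key_fields)
        current = latest.get(key)
        # Keep only the row with the newest recordstamp for each key.
        if current is None or row["recordstamp"] > current["recordstamp"]:
            latest[key] = row
    # If the newest change for a key is a delete (D), the record no longer
    # exists in the source, so it is left out of the merged result.
    return [row for row in latest.values() if row["operation_flag"] != "D"]


def ts(hour):
    # Helper for readable sample timestamps (hypothetical data).
    return datetime(2024, 1, 1, hour, tzinfo=timezone.utc)


raw = [
    {"id": 1, "name": "bolt", "operation_flag": "I", "recordstamp": ts(1)},
    {"id": 1, "name": "bolt M8", "operation_flag": "U", "recordstamp": ts(2)},
    {"id": 2, "name": "nut", "operation_flag": "I", "recordstamp": ts(1)},
    {"id": 2, "name": "nut", "operation_flag": "D", "recordstamp": ts(3)},
]

merged = apply_cdc(raw, ["id"])
for row in merged:
    print(row["id"], row["name"], row["operation_flag"])
```

In this sample, record 1 survives in its updated form and record 2 is dropped because its most recent change is a delete.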
Dataset structure
For all supported data sources, data from upstream systems is first
replicated into a BigQuery dataset (the source or replicated dataset),
and the updated or merged results are inserted into another dataset
(the CDC dataset). The reporting views select data from the CDC dataset,
so that reporting tools and applications always have the latest version
of a table.
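To make the relationship between the replicated dataset and the CDC dataset more concrete, the following Python sketch builds a BigQuery MERGE statement as a string. It is an illustration under stated assumptions, not the script Cortex Framework generates: the function name build_merge_sql, the project, dataset, table, and column names, the assumption that both tables share the same schema, and the choice to treat a NULL operation_flag as an insert are all hypothetical.

```python
def build_merge_sql(raw_table, cdc_table, keys, columns):
    """Return a BigQuery MERGE statement that upserts the latest raw rows.

    Assumes (hypothetically) that the raw and CDC tables have identical
    schemas, including operation_flag and recordstamp.
    """
    on_clause = " AND ".join(f"T.{k} = S.{k}" for k in keys)
    partition = ", ".join(keys)
    # Key columns identify the row, so only non-key columns are updated.
    updates = ", ".join(f"{c} = S.{c}" for c in columns if c not in keys)
    return f"""
MERGE `{cdc_table}` T
USING (
  -- Latest change per key from the replicated (raw) dataset.
  SELECT * EXCEPT (rn) FROM (
    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY {partition} ORDER BY recordstamp DESC) AS rn
    FROM `{raw_table}`)
  WHERE rn = 1
) S
ON {on_clause}
WHEN MATCHED AND S.operation_flag = 'D' THEN DELETE
WHEN MATCHED AND S.recordstamp > T.recordstamp THEN UPDATE SET {updates}
WHEN NOT MATCHED AND IFNULL(S.operation_flag, 'I') != 'D' THEN INSERT ROW
""".strip()


# Hypothetical project, datasets, table, and columns.
sql = build_merge_sql(
    raw_table="my_project.raw_dataset.materials",
    cdc_table="my_project.cdc_dataset.materials",
    keys=["client", "material_id"],
    columns=["client", "material_id", "material_type",
             "operation_flag", "recordstamp"],
)
print(sql)
```

The reporting views would then read from the CDC table only, which holds one row per key.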
The following flow shows how CDC processing works for SAP, based on
the operation_flag and recordstamp fields.

Figure 1. CDC processing example for SAP.
The following flow depicts the integration from APIs into Raw data and
CDC processing for Salesforce, based on the Id and SystemModStamp
fields produced by Salesforce APIs.

Figure 2. Integration from APIs into Raw data and CDC processing for Salesforce.
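For Salesforce, the same idea applies with Id as the record key and SystemModStamp as the ordering field. The following short Python sketch assumes the raw rows are already available as dictionaries with timezone-aware timestamps. The function name latest_by_id, the Name column, and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def latest_by_id(raw_rows):
    """Keep the most recently modified raw row for each Salesforce Id."""
    latest = {}
    for row in raw_rows:
        seen = latest.get(row["Id"])
        if seen is None or row["SystemModStamp"] > seen["SystemModStamp"]:
            latest[row["Id"]] = row
    return latest


# Hypothetical raw rows fetched from the API on two different days.
raw_accounts = [
    {"Id": "001A", "Name": "Acme",
     "SystemModStamp": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"Id": "001A", "Name": "Acme Corp",
     "SystemModStamp": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"Id": "001B", "Name": "Globex",
     "SystemModStamp": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]

current = latest_by_id(raw_accounts)
for account_id, row in sorted(current.items()):
    print(account_id, row["Name"])
```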
Some replication tools can merge or upsert the records when
inserting them into BigQuery, so generating these scripts is optional.
In this case, the setup has only a single dataset instead of separate
source and CDC datasets. The reporting dataset fetches updated records
for reporting from that dataset.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-08-25 UTC.