Change Data Capture (CDC) processing
This page guides you through Change Data Capture (CDC) within Google Cloud Cortex Framework in BigQuery. BigQuery is designed for efficiently storing and analyzing new data.
CDC process
When data changes in your source data system (such as SAP), BigQuery doesn't modify existing records. Instead, the updated information is added as a new record. To avoid duplicates, a merge operation needs to be applied afterwards. This process is called Change Data Capture (CDC) processing.
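The append-only behavior described above can be sketched as follows. This is a minimal illustration, not Cortex Framework code; the key, field names, and table shape are assumptions for the example:

```python
# Append-only replication: each change arrives as a new row, so the raw
# table accumulates multiple versions of the same key over time.
raw_table = []

def replicate_change(key, values, operation_flag, recordstamp):
    # Nothing is modified in place; the change is simply appended.
    raw_table.append({"key": key, "operation_flag": operation_flag,
                      "recordstamp": recordstamp, **values})

replicate_change("4711", {"status": "open"}, "I", 1)
replicate_change("4711", {"status": "closed"}, "U", 2)
# raw_table now holds two rows for key "4711"; a later merge step
# must keep only the row with the highest recordstamp.
```

This is why a downstream merge or upsert step is needed before the data is usable for reporting.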
The Data Foundation for SAP includes an option to create scripts for Cloud Composer or Apache Airflow that merge or `upsert` the new records resulting from updates, keeping only the latest version in a new dataset. For these scripts to work, the tables need to include some specific fields:
- `operation_flag`: This flag tells the script whether a record was:
  - Inserted (I)
  - Updated (U)
  - Deleted (D)
- `recordstamp`: This timestamp helps identify the most recent version of a record.
By using CDC processing, you can ensure that your BigQuery data accurately reflects the latest state of your source system. This eliminates duplicate entries and provides a reliable foundation for your data analysis.
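As a minimal sketch of what the generated merge scripts do (the field names follow this page; the single-column key, `RawRecord` type, and payload shape are assumptions for illustration, and real deployments run a BigQuery `MERGE` statement instead of Python):

```python
from dataclasses import dataclass

@dataclass
class RawRecord:
    key: str              # primary key of the source table (assumed single-column)
    operation_flag: str   # 'I' (inserted), 'U' (updated), or 'D' (deleted)
    recordstamp: int      # replication timestamp; higher means more recent
    payload: dict         # remaining column values

def merge_cdc(cdc_table: dict, raw_records: list) -> dict:
    """Apply raw change records to a CDC table, keeping only the
    latest version of each key."""
    # Reduce to the newest change per key, using recordstamp.
    latest = {}
    for rec in raw_records:
        if rec.key not in latest or rec.recordstamp > latest[rec.key].recordstamp:
            latest[rec.key] = rec
    # Apply the surviving change per key.
    for rec in latest.values():
        if rec.operation_flag == "D":
            cdc_table.pop(rec.key, None)   # deletes remove the row
        else:                               # 'I' and 'U' both upsert
            cdc_table[rec.key] = rec.payload
    return cdc_table
```

For example, an insert followed by an update of key "A" leaves only the updated payload, and an insert followed by a delete of key "B" leaves no row at all.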
Dataset structure
For all supported data sources, data from upstream systems is first replicated into a BigQuery dataset (the `source` or replicated dataset), and the updated or merged results are inserted into another dataset (the CDC dataset). The reporting views select data from the CDC dataset, to ensure that reporting tools and applications always use the latest version of each table.
The following flow shows the CDC processing for SAP, which depends on the `operation_flag` and `recordstamp` fields.

Figure 1. CDC processing example for SAP.
The following flow depicts the integration from APIs into raw data and the CDC processing for Salesforce, which depends on the `Id` and `SystemModStamp` fields produced by the Salesforce APIs.

Figure 2. Integration from APIs into raw data and CDC processing for Salesforce.
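For Salesforce, the same idea applies with `Id` as the record key and `SystemModStamp` as the version marker. A minimal sketch, assuming raw rows arrive as dictionaries with those fields and ISO 8601 timestamps (which compare correctly as strings):

```python
def latest_by_id(raw_rows: list) -> list:
    """Keep only the most recent version of each Salesforce record,
    keyed by Id and versioned by SystemModStamp."""
    newest = {}
    for row in raw_rows:
        rid = row["Id"]
        # ISO 8601 timestamps sort lexicographically, so plain string
        # comparison picks the latest modification.
        if rid not in newest or row["SystemModStamp"] > newest[rid]["SystemModStamp"]:
            newest[rid] = row
    return list(newest.values())
```

The CDC dataset then holds one row per `Id`, which is what the reporting views select from.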
Some replication tools can merge or upsert records as they insert them into BigQuery, so generating these scripts is optional. In that case, the setup has only a single dataset, and the reporting dataset fetches updated records for reporting from it.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated (UTC): 2025-08-18.