Overview of the BigQuery Connector for SAP

This guide provides an overview of BigQuery Connector for SAP, describing its capabilities and how it integrates data between SAP systems and BigQuery. BigQuery Connector for SAP replicates your SAP data into BigQuery in near real time. With your data in BigQuery, you can perform analytics with artificial intelligence or machine learning on live SAP application data. In BigQuery, you can also integrate your SAP data with data from other sources.

BigQuery Connector for SAP installs into SAP Landscape Transformation Replication Server (SAP LT Replication Server) and uses the standard interfaces and capabilities of SAP LT Replication Server. BigQuery Connector for SAP supports all of the data sources that SAP LT Replication Server supports.

Google Cloud developed BigQuery Connector for SAP as an implementation of an SAP Business Add-In (BAdI) for SAP LT Replication Server, following the SAP LT Replication Server SDK process. The connector uses the change data capture (CDC) feature of SAP LT Replication Server.

Data processing

BigQuery Connector for SAP allows only minimal data transformation, conforming to an extract, load, and transform (ELT) model, as opposed to an extract, transform, and load (ETL) model. For more information about these models, see Procedures and patterns for data pipelines.

The connector automatically suggests the fields, field names, and data types for the target table based on the source data. Before the target BigQuery table is created, you can modify the suggested values as needed. The suggested names for the target fields are based on the descriptions of the fields in the source table. For more information, see Default naming options for fields.

Although BigQuery Connector for SAP automatically suggests the BigQuery data types for most fields, boolean, timestamp, and hexadecimal values cannot be interpreted automatically, so you need to map them manually. For more information, see Data type mapping.
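
For illustration, the following sketch shows the kind of manual mapping decisions this implies. The field names and target types are hypothetical examples, not the connector's defaults:

```python
# Hypothetical SAP fields whose BigQuery types must be chosen manually,
# because the source representation doesn't identify the semantic type.
manual_type_overrides = {
    "ZDELETED": "BOOLEAN",       # ABAP CHAR(1) flag ('X' or ' ')
    "ZCHANGED_AT": "TIMESTAMP",  # packed-decimal timestamp value
    "ZDOC_GUID": "BYTES",        # RAW (hexadecimal) value
}
```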

BigQuery Connector for SAP provides several enhancement spots where you can insert your own custom enhancements into BigQuery Connector for SAP code. For more information, see Enhancement exits.

If a target table doesn't already exist in BigQuery, then BigQuery Connector for SAP creates one. For CDC replication through Pub/Sub, the BigQuery Connector for SAP also auto-creates the necessary Pub/Sub topic, schema, and subscription.
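
As a rough sketch of what the connector automates here, the following shows how equivalent Pub/Sub objects could be created manually with the Python client library; all project, schema, topic, table, and subscription names are placeholders:

```python
from google.cloud import pubsub_v1
from google.pubsub_v1 import SchemaServiceClient
from google.pubsub_v1.types import Encoding, Schema

project_id = "my-project"                 # placeholder names throughout
topic_id = "sap-mara-topic"
schema_id = "sap-mara-schema"
subscription_id = "sap-mara-bq-sub"
bq_table = "my-project.sap_dataset.mara"  # target table for the subscription

# 1. Register an Avro schema that describes the replicated records
#    (a real schema would list every replicated field).
schema_client = SchemaServiceClient()
avro_definition = (
    '{"type": "record", "name": "mara", '
    '"fields": [{"name": "MATNR", "type": "string"}]}'
)
schema = schema_client.create_schema(
    request={
        "parent": f"projects/{project_id}",
        "schema": Schema(type_=Schema.Type.AVRO, definition=avro_definition),
        "schema_id": schema_id,
    }
)

# 2. Create a topic that validates published messages against the schema.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
publisher.create_topic(
    request={
        "name": topic_path,
        "schema_settings": {"schema": schema.name, "encoding": Encoding.JSON},
    }
)

# 3. Create a BigQuery subscription that streams messages into the table.
subscriber = pubsub_v1.SubscriberClient()
subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path(project_id, subscription_id),
        "topic": topic_path,
        "bigquery_config": {"table": bq_table, "use_topic_schema": True},
    }
)
```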

Supported replication paths

The BigQuery Connector for SAP supports the following replication paths:

Change Data Capture (CDC) replication through Pub/Sub

With Change Data Capture (CDC) replication through Pub/Sub, your BigQuery table directly reflects changes from the source SAP tables, updating existing data and adding new records to keep the data synchronized. This approach uses the CDC capabilities of the Storage Write API, which eliminates the need for manual deduplication. CDC focuses on capturing and applying only the changes (updates, deletions, and new insertions) that occur in your source SAP tables.

When a record in an SAP table is updated or deleted, BigQuery Connector for SAP publishes the change to a Pub/Sub topic. A BigQuery subscription then streams the change directly to the corresponding BigQuery table and applies it without duplicates.

Streaming data replication

With streaming data replication, your BigQuery table continuously appends new records, reflecting every change as a distinct entry in an insert-only mode. If BigQuery receives an update to a record that already exists in the BigQuery table, BigQuery inserts a new instance of the record into that table without modifying the existing instance. The most recently inserted instance of the record reflects the current state of the record in the source table.
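
Because every change arrives as a new row, reading the current state from an insert-only table means selecting only the newest row for each key. The following is a minimal sketch using the Python BigQuery client, assuming a hypothetical table sap_dataset.mara keyed on MATNR with a recordstamp timestamp column:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Keep only the most recent version of each record; the table, key,
# and timestamp column names are hypothetical.
query = """
    SELECT * EXCEPT (row_num)
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY MATNR
                                ORDER BY recordstamp DESC) AS row_num
      FROM `my-project.sap_dataset.mara`
    )
    WHERE row_num = 1
"""
for row in client.query(query).result():
    print(row)
```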

For streaming data replication, BigQuery Connector for SAP uses the BigQuery streaming API.
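
As an illustration, a single call to that API through the Python client library looks like the following sketch; insert_rows_json wraps the streaming (tabledata.insertAll) API, and the table and rows are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# In insert-only mode, an update in SAP arrives as another new row.
rows = [
    {"MATNR": "000000000000000023", "MAKTX": "Ball bearing"},
    {"MATNR": "000000000000000023", "MAKTX": "Ball bearing, steel"},  # later update
]
errors = client.insert_rows_json("my-project.sap_dataset.mara", rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```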

The following diagram illustrates both data replication paths from SAP to BigQuery through BigQuery Connector for SAP:

Replication paths to BigQuery

The breakdown of the data flow and components for each path is as follows:

Data extraction and initial processing (Common to both paths)

  • SAP data source: Originates the data within the SAP environment.
  • SAP LT Replication Server: Connects to the SAP data source through an RFC connection. Its primary role is to extract CDC data from the SAP source.
  • BigQuery Connector for SAP: Receives the extracted CDC data and determines, based on your configuration, which replication path the data takes.

CDC replication path

This path focuses on replicating changes to a BigQuery CDC table through Pub/Sub:

  • Pub/Sub REST API: BigQuery Connector for SAP publishes the CDC data to Pub/Sub by using the Pub/Sub REST API (see the publish sketch after this list).
  • Pub/Sub topic: This acts as a central message broker, receiving the published CDC data.
  • Pub/Sub schema: Associated with the Pub/Sub topic, the schema enforces the data structure, which maintains data consistency.
  • Pub/Sub BigQuery subscription: Subscribes to the Pub/Sub topic and streams the CDC data to the BigQuery CDC table.
  • BigQuery CDC table: The final destination for the CDC data in BigQuery. It applies changes (updates/deletes) and inserts new records, maintaining a deduplicated, up-to-date view of the SAP data.
  • Dead-letter topic: A Pub/Sub topic that receives messages that the BigQuery subscription fails to process, so that they can be investigated manually and no data is lost.
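
As a simplified sketch of the publish step, the following shows one JSON-encoded change message being published with the Python client library. The connector itself calls the REST API from ABAP, and the topic name and payload here are hypothetical:

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sap-mara-topic")  # placeholders

# One change record, JSON-encoded so that the topic can validate it
# against its Avro schema.
message = {"MATNR": "000000000000000023", "_CHANGE_TYPE": "UPSERT"}
future = publisher.publish(topic_path, data=json.dumps(message).encode("utf-8"))
print(f"Published message ID: {future.result()}")  # blocks until acknowledged
```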

Streaming data replication (Insert-only) path

This path is designed for continuously inserting new records into a BigQuery staging table, preserving every change as a new entry:

  • BigQuery Streaming REST API: BigQuery Connector for SAP streams the data directly to BigQuery by using the BigQuery streaming REST API.
  • BigQuery staging table: The destination for the streaming data. In this insert-only mode, every change (including updates and deletes to existing SAP records) results in a new row being appended to this table.

Choose your replication path

This section compares the replication paths to help you decide which best fits your specific data needs and operational requirements.

| Factor | CDC replication through Pub/Sub | Streaming data replication (Insert-only mode) |
|---|---|---|
| How it works | Captures and applies only the changes to the BigQuery table. | Inserts every change as a new record into the BigQuery table. |
| BigQuery result | Maintains a single, up-to-date record by natively updating or deleting existing rows. | Each change creates a new entry, leading to multiple versions of the same record. |
| Key benefits | Provides high data consistency with the latest data. | Useful for basic ingestion and provides a historical audit trail. |

Deployment architectures

You install the BigQuery Connector for SAP as a Business Add-In (BAdI) within an SAP LT Replication Server instance.

The SAP LT Replication Server instance can be on Google Cloud, on-premises, or on another cloud provider. We recommend that you place your SAP LT Replication Server instance as close to the SAP data source as possible. Work with your network team to make sure that there is low latency and high throughput between the source SAP system, SAP LT Replication Server, and your BigQuery dataset.

The following sections describe common, recommended architectures for BigQuery Connector for SAP.

Deployment architectures for CDC replication through Pub/Sub

Architecture for SAP data sources on Google Cloud

The following diagram shows two example SAP LT Replication Server installations on Google Cloud, with the SAP data source on Google Cloud.

To show an example of each architecture, one installation uses an SAP LT Replication Server standalone architecture, in which SAP LT Replication Server is installed on a separate server, and the other uses an SAP LT Replication Server embedded architecture, in which SAP LT Replication Server is installed in the SAP source system server.

Because the SAP LT Replication Server instances are installed on Google Cloud, BigQuery Connector for SAP connects to the Pub/Sub API endpoint directly, without requiring a Cloud Interconnect or Cloud VPN connection.

Architecture for SAP data sources on Google Cloud

Architecture for SAP data sources on-premises or on another cloud provider

The following diagram shows two example SAP LT Replication Server installations that are running either on-premises or on another cloud provider.

To show an example of each architecture, one installation uses an SAP LT Replication Server standalone architecture and the other uses an SAP LT Replication Server embedded architecture.

In both examples, SAP LT Replication Server is installed in the same environment as the SAP data source.

The connection from BigQuery Connector for SAP in SAP LT Replication Server to Pub/Sub is provided by either a Cloud Interconnect connection or a Cloud VPN connection.

Architecture for SAP data sources on-premises or on another cloud provider

Detailed architectural view of the data flow

The following diagram shows where BigQuery Connector for SAP fits into the SAP LT Replication Server data flow:

Detailed architectural view of the data flow

The following numbered explanations correspond to the numbers in the diagram:

  1. After SAP LT Replication Server is initialized, when records in the source tables are inserted, updated, or deleted, the database trigger records the changes in the logging tables.
  2. SAP LT Replication Server continuously checks the logging tables for new entries using RFC calls.
  3. If SAP LT Replication Server finds new entries, then the Read engine reads the records and calls the Mapping and transformation engine.
  4. The Mapping and transformation engine calls the Write engine, which invokes BigQuery Connector for SAP.
  5. The Write engine passes the processed data to the BigQuery Connector for SAP. Here, the connector's custom BAdI implementation transforms the SAP change records into an Avro-compliant JSON format. It populates specific metadata fields as follows:
    1. _CHANGE_TYPE: Populated based on the SAP SLT operation. For example, UPSERT for inserts or updates, DELETE for deletions.
    2. _CHANGE_SEQUENCE_NUMBER: A granular timestamp for chronological ordering and conflict resolution in BigQuery.
    The connector also maps tables and fields (names and data types) from SAP to BigQuery. A sketch of a resulting change message follows this list.
  6. BigQuery Connector for SAP then publishes the transformed messages to Pub/Sub through the Pub/Sub REST API over HTTPS.
  7. Upon receiving the data, Pub/Sub takes the following actions:
    1. Performs validation checks according to the schema.
    2. Sends an HTTP 200 (OK) status code back to BigQuery Connector for SAP for valid messages.
    3. Inserts the records into the BigQuery target table through the BigQuery subscription.
    4. Captures the messages that fail BigQuery ingestion in a dead-letter topic, which prevents data loss and simplifies troubleshooting.
  8. The BigQuery Storage Write API uses the _CHANGE_TYPE and _CHANGE_SEQUENCE_NUMBER fields in the messages to apply the changes. The API performs insert, update, or delete operations, thereby maintaining synchronized data in the BigQuery tables for data analytics.
  9. BigQuery Connector for SAP passes the HTTP OK status code back to SAP LT Replication Server, which deletes the replicated entries from the logging table and frees resources on the SAP source system.
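
To make steps 5 and 8 concrete, the following sketch shows an illustrative mapping from SLT operation codes to change types, and the shape of a resulting change message. The connector implements this logic in ABAP; the codes, field names, and values shown here are examples rather than the connector's exact output:

```python
# Illustrative mapping from SAP SLT operation codes to BigQuery CDC
# change types (the connector implements this in ABAP).
def change_type_for(slt_operation: str) -> str:
    if slt_operation in ("I", "U"):  # insert or update
        return "UPSERT"
    if slt_operation == "D":         # delete
        return "DELETE"
    raise ValueError(f"Unexpected SLT operation: {slt_operation}")

# A hypothetical Avro-compliant JSON payload for one changed record.
message = {
    "MATNR": "000000000000000023",                 # replicated business fields
    "_CHANGE_TYPE": change_type_for("U"),          # "UPSERT"
    "_CHANGE_SEQUENCE_NUMBER": "1720000000000/0",  # example ordering value
}
```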

Deployment architectures for streaming data replication

The following diagram shows two example SAP LT Replication Server installations on Google Cloud, with the SAP data source on Google Cloud.

To show an example of each architecture, one deployment uses an SAP LT Replication Server standalone architecture, in which SAP LT Replication Server is installed on a separate server, and the other uses an SAP LT Replication Server embedded architecture, in which SAP LT Replication Server is installed in the SAP source system server.

Because the SAP LT Replication Server instances are installed on Google Cloud, BigQuery Connector for SAP connects to the BigQuery API endpoint directly, without requiring a Cloud Interconnect or Cloud VPN connection.

In the diagram, the SAP systems and BigQuery are shown in different Google Cloud projects, but you can use the same project for both, if necessary.

Architecture for SAP data sources on Google Cloud

Architecture for SAP data sources on-premises or on another cloud provider

The following diagram shows two example SAP LT Replication Server installations that are running either on-premises or on another cloud provider.

To show an example of each architecture, one installation uses an SAP LT Replication Server standalone architecture and the other uses an SAP LT Replication Server embedded architecture.

In both examples, SAP LT Replication Server is installed in the same environment as the SAP data source.

The connection from BigQuery Connector for SAP in SAP LT Replication Server to BigQuery is provided by either a Cloud Interconnect connection or a Cloud VPN connection.

Architecture for SAP data sources on-premises or on another cloud provider

Detailed architectural view of the data flow

The following diagram shows where BigQuery Connector for SAP fits into the SAP LT Replication Server data flow:

Detailed architectural view of the data flow

The following numbered explanations correspond to the numbers in the diagram:

  1. After SAP LT Replication Server is initialized, when records in the source tables are inserted, updated, or deleted, the database trigger records the changes in the logging tables.
  2. SAP LT Replication Server continuously checks the logging tables for new entries using RFC calls.
  3. If SAP LT Replication Server finds new entries, then the Read engine reads the records and calls the Mapping and transformation engine.
  4. The Mapping and transformation engine calls the Write engine, which invokes BigQuery Connector for SAP.
  5. BigQuery Connector for SAP performs the following actions:
    1. Maps the SAP data to the target table name, field names, and BigQuery data types.
    2. Creates the BigQuery table, if necessary.
    3. Sends the records to BigQuery in chunks through the BigQuery streaming API (see the chunking sketch after this list).
  6. Upon receiving the data, BigQuery takes the following actions:
    1. Performs validation checks.
    2. Inserts the records into the target table.
    3. Sends an HTTP 200 (OK) status code back to BigQuery Connector for SAP.
  7. BigQuery Connector for SAP passes the HTTP OK status code back to SAP LT Replication Server, which deletes the replicated entries from the logging table and frees resources on the SAP source system.
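
As a rough illustration of step 5.3, the following sketch streams rows in fixed-size chunks by using the Python client library. The chunk size, table name, and helper function are hypothetical; the connector performs this step in ABAP, with a configurable chunk size:

```python
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.sap_dataset.mara"  # placeholder target table
CHUNK_SIZE = 500  # hypothetical; choose a size that stays under API limits

def stream_in_chunks(rows: list[dict]) -> None:
    """Send records to BigQuery in chunks through the streaming API."""
    for start in range(0, len(rows), CHUNK_SIZE):
        chunk = rows[start:start + CHUNK_SIZE]
        errors = client.insert_rows_json(TABLE_ID, chunk)
        if errors:  # surface per-row errors so failed chunks can be retried
            raise RuntimeError(f"Chunk at offset {start} failed: {errors}")
```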

What's next

For information about planning the installation and configuration of the BigQuery Connector for SAP, see the BigQuery Connector for SAP planning guide.