Move Kafka data in Google Cloud

This document helps developers, architects, and decision-makers choose options for moving Kafka data to Google Cloud, for use cases such as change data capture (CDC) from external and internal databases and disaster recovery (DR).

Within Google Cloud, you can move Kafka data to a Google Cloud Managed Service for Apache Kafka cluster or to another Google Cloud product, such as a BigQuery table or a Cloud Storage location. The following table summarizes the options.

| Use case | Data source | Data destination | Recommended solution |
| --- | --- | --- | --- |
| Copy data | Self-managed Apache Kafka cluster | Managed Service for Apache Kafka cluster | Access Kafka data in Managed Service for Apache Kafka using a Dataflow template. |
| Analyze data in a data warehouse | Managed Service for Apache Kafka cluster | BigQuery | Access Kafka data in BigQuery using a Dataflow template. |
| Capture database changes | Relational database, Spanner, or Bigtable | Managed Service for Apache Kafka | Apache Beam pipeline run on Dataflow. |
| Back up data | Managed Service for Apache Kafka | Cloud Storage | Access Kafka data in Cloud Storage using a Dataflow template. |
| Process data | Managed Service for Apache Kafka | Apache Spark | Run Dataproc on Compute Engine. |
| Process data | Managed Service for Apache Kafka | Apache Flink | BigQuery Engine for Apache Flink. |
| Migrate data with bi-directional synchronization | Self-managed Kafka cluster | Managed Service for Apache Kafka cluster | Self-managed Kafka Connect with the MirrorMaker 2 connector. |
| Replicate a cluster across regions | Managed Service for Apache Kafka cluster (region A) | Managed Service for Apache Kafka cluster (region B) | Self-managed Kafka Connect with the MirrorMaker 2 connector. |

Google Cloud offers several solutions for integrating your Kafka data, each with its own advantages. The best integration method for you depends on your existing systems, current skill set, and capacity for managing infrastructure.

  • Dataflow: Google Cloud's serverless data processing service offers stream and batch data integration and processing. We recommend Dataflow with Managed Service for Apache Kafka for the most common data integration tasks. For more information, see Export Kafka data to Google Cloud using Dataflow.

  • BigQuery Engine for Apache Flink (Preview): This fully managed service for Apache Flink workloads is ideal if you have existing Flink pipelines and require advanced stream processing capabilities. For more information, see Export Apache Flink data to Google Cloud.

  • Dataproc: This fully managed service for Apache Spark and Hadoop workloads is a good fit if you have existing Spark pipelines. For more information, see Export Apache Spark data to Google Cloud.

  • Kafka Connect: This open-source tool lets Kafka users stream data in and out of Kafka clusters. However, it requires managing your deployment, including infrastructure, updates, and security. For more information, see Use Kafka Connect.

Export Kafka data to Google Cloud using Dataflow

Dataflow is the recommended solution for moving Kafka data to sinks such as BigQuery datasets or Cloud Storage buckets. You can deploy Dataflow pipelines by using a Dataflow template or by writing an Apache Beam pipeline. Choose your Dataflow deployment based on the following factors:

  • For simpler, faster deployments, especially for common data integration tasks, choose a prebuilt Dataflow template, which you can deploy directly from the Google Cloud console.

  • For maximum flexibility, control, and complex use cases requiring custom logic or integrations, choose an Apache Beam pipeline.

Use Dataflow templates

Dataflow templates are predefined Apache Beam pipelines that you can deploy through a code-free, easy-to-use job wizard. Dataflow provides three templates for exporting Kafka data to Google Cloud; these templates are listed in the table earlier in this document.

Use Apache Beam pipelines

The Dataflow templates discussed in the previous section might not meet all of your requirements. For example, you might need to integrate your Kafka data with a source or sink that is not supported by these templates. You might also have to perform transformations, normalizations, or mutations on records.

For these scenarios, you can use the Apache Beam SDK to author pipelines that can be run on Dataflow.
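
For example, the following is a minimal sketch, using the Apache Beam Python SDK, of a streaming pipeline that reads messages from a Kafka topic and writes them to a BigQuery table. The project, region, bucket, broker, topic, and table names are placeholders; adapt them, the schema, and the decoding logic to your own data.

```python
# Minimal sketch: read from Kafka and write to BigQuery with the Beam Python SDK.
# All names (project, bucket, broker, topic, table) are placeholders.
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",                  # placeholder project ID
    region="us-central1",                  # placeholder region
    temp_location="gs://my-bucket/temp",   # placeholder staging bucket
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # ReadFromKafka is a multi-language transform backed by Beam's Java Kafka connector.
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},  # placeholder broker
            topics=["my-topic"],                                   # placeholder topic
        )
        # Records arrive as (key, value) byte pairs; keep only the decoded value.
        | "DecodeValue" >> beam.Map(lambda record: {"message": record[1].decode("utf-8")})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.kafka_messages",          # placeholder table
            schema="message:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

Writing the pipeline yourself lets you insert custom transformations, normalizations, or mutations between the read and write steps.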

For information about the Apache Beam programming model, including key concepts like pipelines, PCollections, transforms, and runners, see Programming model for Apache Beam.

For resources on getting started with Apache Beam programming, including installing the SDK, programming guides, and interactive environments, see Use Apache Beam to build pipelines. That document also provides links to resources for designing, creating, and testing your pipeline, along with example streaming pipelines.

If you need to write change streams (change data capture) to Kafka, we recommend using the following components as part of your Dataflow pipeline:

Dataflow provides several resources for using Kafka with your Dataflow pipeline:

Deploy Flink pipelines using BigQuery Engine for Apache Flink

To run existing Flink applications in Google Cloud, use BigQuery Engine for Apache Flink.

To create an end-to-end streaming pipeline with BigQuery Engine for Apache Flink, Managed Service for Apache Kafka, and BigQuery, see Process real-time data from Managed Service for Apache Kafka. That tutorial also covers creating a Kafka cluster, running Flink jobs that write messages to and read messages from Kafka, and processing the data in BigQuery.

To read data from and write data to Kafka topics, use the Kafka connector provided by Apache Flink.
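
As an illustration only, the following is a minimal PyFlink sketch that uses the Flink Kafka connector to read from one topic and write to another. The broker address and topic names are placeholders, and the Kafka connector JAR must be available to the job.

```python
# Minimal sketch: read from and write to Kafka with the Apache Flink Kafka
# connector (PyFlink DataStream API). Broker and topic names are placeholders;
# the flink-sql-connector-kafka JAR must be on the job's classpath.
from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.typeinfo import Types
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetsInitializer,
    KafkaRecordSerializationSchema,
    KafkaSink,
    KafkaSource,
)

env = StreamExecutionEnvironment.get_execution_environment()

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("broker:9092")                 # placeholder broker
    .set_topics("input-topic")                            # placeholder topic
    .set_group_id("flink-kafka-example")
    .set_starting_offsets(KafkaOffsetsInitializer.earliest())
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

sink = (
    KafkaSink.builder()
    .set_bootstrap_servers("broker:9092")                 # placeholder broker
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("output-topic")                        # placeholder topic
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )
    .build()
)

stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "kafka-source")
stream.map(lambda value: value.upper(), output_type=Types.STRING()).sink_to(sink)

env.execute("kafka-passthrough-example")
```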

Deploy Spark pipelines using Dataproc

Apache Spark users can connect existing Spark deployments to Managed Service for Apache Kafka by using Dataproc. Dataproc is Google Cloud's fully managed service for Spark pipelines and includes tools for cluster lifecycle management.

Apache Spark provides an integration guide for using Spark Streaming with Kafka.
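
As a minimal sketch, assuming a reachable broker and an existing topic, the following PySpark Structured Streaming job reads messages from Kafka and prints them to the console. The broker address and topic name are placeholders, and the spark-sql-kafka connector package must be supplied when you submit the job.

```python
# Minimal sketch: read a Kafka topic with PySpark Structured Streaming and
# print messages to the console. Broker and topic names are placeholders;
# submit the job with the matching spark-sql-kafka connector package.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-console-example").getOrCreate()

messages = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "my-topic")                   # placeholder topic
    .load()
    # Kafka keys and values arrive as binary; cast them to strings for display.
    .select(col("key").cast("string"), col("value").cast("string"))
)

query = messages.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```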

To help deploy common Kafka workflows, Dataproc provides a number of open-source templates.

Use Kafka Connect

Kafka Connect is an open-source tool that allows users to integrate data between Apache Kafka and other systems. Kafka Connect provides many advantages, including hundreds of open-source connectors, automatic offset management, and a REST interface for operations.

To deploy a running Kafka Connect application, follow the Apache Kafka quickstart.

Kafka Connect is not offered as part of Managed Service for Apache Kafka. You must deploy and manage Kafka Connect yourself, for example on Compute Engine or Google Kubernetes Engine.

To perform tasks like migration and disaster recovery, we recommend deploying the MirrorMaker 2 connector using Kafka Connect. MirrorMaker 2 (MM2) is a connector for cluster replication. MM2 deploys source and sink connectors that read data from the original cluster and write it to a target cluster. MM2 automatically detects new topics and partitions, synchronizes topic configurations, and replicates offsets between clusters.
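
For illustration, the following sketch registers a MirrorMaker 2 source connector through the Kafka Connect REST interface by using Python. The Connect worker URL, cluster aliases, bootstrap servers, and topic pattern are placeholders for your own deployment; consult the MirrorMaker 2 documentation for the full set of configuration options.

```python
# Hedged sketch: register a MirrorMaker 2 source connector with a running
# Kafka Connect worker through its REST API. All endpoints, aliases, broker
# addresses, and the topic pattern are placeholders.
import requests

connector = {
    "name": "mm2-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
        "source.cluster.alias": "source",
        "target.cluster.alias": "target",
        "source.cluster.bootstrap.servers": "source-broker:9092",  # placeholder
        "target.cluster.bootstrap.servers": "target-broker:9092",  # placeholder
        "topics": ".*",                        # replicate all topics; narrow as needed
        "sync.topic.configs.enabled": "true",  # keep topic configurations in sync
    },
}

# POST the connector definition to the Connect worker's REST endpoint.
response = requests.post(
    "http://connect-worker:8083/connectors",   # placeholder Connect worker URL
    json=connector,
)
response.raise_for_status()
print(response.json())
```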