Move Kafka data in Google Cloud

This document helps developers, architects, and decision-makers choose among options for migrating external Kafka data to Google Cloud, setting up disaster recovery (DR), integrating with BigQuery, and capturing changes from databases with change data capture (CDC).

Within Google Cloud, you can move Kafka data to a Google Cloud Managed Service for Apache Kafka cluster or to another Google Cloud product such as a BigQuery table or a Cloud Storage location. For a summary, see the following table.

Use case | Data source | Data destination | Recommended solution
Copy data | Self-managed Apache Kafka cluster | Managed Service for Apache Kafka cluster | Create a MirrorMaker 2.0 source connector in a Connect cluster.
Analyze data in a data warehouse | Managed Service for Apache Kafka cluster | BigQuery | Create a BigQuery sink connector in a Connect cluster.
Migrate data with synchronization | Self-managed Kafka cluster | Managed Service for Apache Kafka cluster | Create a MirrorMaker 2.0 source connector in a Connect cluster.
Replicate a cluster across regions | Managed Service for Apache Kafka cluster (region A) | Managed Service for Apache Kafka cluster (region B) | Create a MirrorMaker 2.0 source connector in a Connect cluster.
Back up data | Managed Service for Apache Kafka cluster | Cloud Storage | Create a Cloud Storage sink connector in a Connect cluster.
Capture database changes | Relational database, Spanner, or Bigtable | Managed Service for Apache Kafka cluster | Run an Apache Beam pipeline on Dataflow.
Process data with Apache Spark | Managed Service for Apache Kafka cluster | Apache Spark | Run Dataproc on Compute Engine.

Google Cloud offers several solutions for integrating your Kafka data, and each has unique advantages. The best integration method for you depends on your existing systems, your team's current skill set, and your capacity for managing infrastructure.

Use Kafka Connect

Google Cloud Managed Service for Apache Kafka lets you provision clusters that run Kafka Connect. The primary goal of Kafka Connect is to connect your Managed Service for Apache Kafka cluster to other systems for use cases like migration, backup, disaster recovery, high availability, and data integration. We recommend Kafka Connect for the most common data integration tasks for Managed Service for Apache Kafka. Kafka Connect offers several advantages:

  • Connect your Kafka clusters to various Google Cloud data sources and sinks using built-in connectors. These Kafka clusters can include Managed Service for Apache Kafka clusters, on-premises clusters, and custom cloud deployments. Supported connectors include the following:

    • MirrorMaker 2.0 connectors (a sample configuration sketch appears later in this section)

    • BigQuery sink

    • Cloud Storage sink

    • Pub/Sub source

    • Pub/Sub sink

  • Benefit from the scalability and reliability of Google Cloud infrastructure, which helps your data pipelines handle growing data volumes and maintain high availability.

  • Offload the operational burden of managing Kafka Connect infrastructure to Google Cloud.

  • Monitor and manage your Kafka Connect clusters using Google Cloud's monitoring and logging tools.
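
For example, a MirrorMaker 2.0 source connector is configured with a small set of standard MirrorMaker 2.0 properties. The following sketch shows what such a configuration might look like, expressed here as a Python dictionary; the cluster aliases, bootstrap addresses, topic pattern, and task count are placeholders that you replace with your own values when you create the connector in a Connect cluster.

```python
# A sketch of a MirrorMaker 2.0 source connector configuration, expressed as a
# Python dictionary. All aliases, addresses, and the topic pattern below are
# placeholders.
mirror_maker_config = {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "source",                           # placeholder alias
    "target.cluster.alias": "target",                           # placeholder alias
    "source.cluster.bootstrap.servers": "source-broker:9092",   # placeholder address
    "target.cluster.bootstrap.servers": "target-broker:9092",   # placeholder address
    "topics": ".*",       # regular expression selecting the topics to replicate
    "tasks.max": "3",     # maximum parallelism for the replication tasks
}
```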

To learn more about Kafka Connect, see the Kafka Connect overview.

Use Dataflow

Dataflow, Google Cloud's serverless data processing service, offers stream and batch data integration. You can use Dataflow to move Kafka data to different sinks such as BigQuery datasets or Cloud Storage buckets. You can deploy Dataflow pipelines by using a prebuilt Dataflow template or by writing your own Apache Beam pipeline. Choose your Dataflow deployment based on the following factors:

  • For simpler, faster deployments, especially for common data integration tasks, choose prebuilt Dataflow templates that are deployable directly from the console.

  • For maximum flexibility, control, and complex use cases requiring custom logic, choose an Apache Beam pipeline.

Built-in templates

Built-in Dataflow templates are predefined Apache Beam pipelines that you can deploy through a code-free, easy-to-use job wizard. Dataflow provides several templates for exporting Kafka data to Google Cloud.

Custom Apache Beam pipelines

The Dataflow templates discussed in the previous section might not meet all of your requirements. For example, you might need to integrate your Kafka data with a source or sink that is not supported by these templates. You might also have to perform transformations, normalizations, or mutations on records.

For these scenarios, you can use the Apache Beam SDK to author pipelines that can be run on Dataflow.
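
As an illustration, the following sketch shows a custom Apache Beam pipeline, written with the Beam Python SDK, that reads JSON records from a Kafka topic, reshapes each record, and writes the results to a BigQuery table. The project, region, bucket, broker, topic, and table names are placeholders, and the record schema is hypothetical; for local testing you can omit the Dataflow-specific options and run the pipeline with the direct runner.

```python
# A minimal sketch of a streaming Beam pipeline: Kafka -> transform -> BigQuery.
# All resource names below are placeholders.
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions


def parse_record(kv):
    """Decode a Kafka (key, value) pair and reshape it into a BigQuery row."""
    _, value = kv
    payload = json.loads(value.decode("utf-8"))
    return {"user_id": payload.get("user_id"), "amount": payload.get("amount")}


options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",              # run the pipeline on Dataflow
    project="my-project",                 # placeholder project ID
    region="us-central1",                 # placeholder region
    temp_location="gs://my-bucket/tmp",   # placeholder staging bucket
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker-1:9092"},  # placeholder
            topics=["orders"],                                       # placeholder topic
        )
        | "ParseRecords" >> beam.Map(parse_record)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.orders",                          # placeholder table
            schema="user_id:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```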

For information about the Apache Beam programming model, including key concepts like pipelines, PCollections, transforms, and runners, see Programming model for Apache Beam.

For resources on getting started with Apache Beam programming, including installing the SDK, programming guides, and interactive environments, see Use Apache Beam to build pipelines. That document also provides links to guides for designing, creating, and testing your pipeline, along with example streaming pipelines.

If you need to write change streams (change data capture) to Kafka, we recommend building your Dataflow pipeline with the Apache Beam change streams connectors for Spanner or Bigtable as the source and the Apache Beam Kafka I/O connector as the sink.

Dataflow also provides guidance and connectors for reading Kafka data into your pipeline and writing pipeline output back to Kafka.

Choose between Kafka Connect and Dataflow

When transferring data between Kafka clusters, especially to Managed Service for Apache Kafka, Kafka Connect is typically the preferred solution. MirrorMaker 2.0, which runs on Kafka Connect, is well suited for tasks like cluster migration, backup, and disaster recovery. For basic, per-message modifications, Kafka Connect supports record-at-a-time transformations.

For high-volume data migration requiring complex transformations, Dataflow is the more appropriate choice. Dataflow's strength lies in its ability to perform complex, stream-based operations, including data cleaning, enrichment, and aggregation, before the data reaches the target Managed Service for Apache Kafka cluster. Dataflow enables joining multiple data streams with advanced windowing and alignment logic, which are essential for complex data correlation and aggregation. This capability differentiates Dataflow from Kafka Connect, which is limited to basic, per-message modifications.
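
To illustrate the difference, the following sketch shows the kind of windowed, multi-stream join that Dataflow can express but per-message Kafka Connect transformations cannot. The stream contents and event times are hypothetical, and beam.Create stands in for Kafka sources to keep the example self-contained.

```python
# A minimal sketch of joining two streams by key within fixed one-minute windows.
# In a real pipeline the two inputs would be read from Kafka topics.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as pipeline:
    clicks = (
        pipeline
        | "CreateClicks" >> beam.Create([("user-1", "ad-42")])
        | "TimestampClicks" >> beam.Map(lambda kv: TimestampedValue(kv, 10))
        | "WindowClicks" >> beam.WindowInto(FixedWindows(60))
    )
    purchases = (
        pipeline
        | "CreatePurchases" >> beam.Create([("user-1", 9.99)])
        | "TimestampPurchases" >> beam.Map(lambda kv: TimestampedValue(kv, 15))
        | "WindowPurchases" >> beam.WindowInto(FixedWindows(60))
    )

    # Correlate the two streams by key within each one-minute window.
    joined = (
        {"clicks": clicks, "purchases": purchases}
        | "JoinStreams" >> beam.CoGroupByKey()
        | "PrintJoined" >> beam.Map(print)
    )
```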

Use Dataproc for Spark pipelines

Dataproc, Google Cloud's fully managed service for Apache Spark and Hadoop workloads, is a good fit if you have existing Spark pipelines. You can connect existing Spark deployments to Managed Service for Apache Kafka by using Dataproc, which also includes tools for cluster lifecycle management. For example, if you have a Spark application that processes streaming data from Kafka and you want to migrate that application to Google Cloud, Dataproc is a suitable choice.
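
For example, the following sketch shows a small Spark Structured Streaming job, of the kind you could run on Dataproc, that subscribes to a Kafka topic and prints the messages it receives. The broker address and topic name are placeholders, and the job assumes the Spark Kafka integration package is available on the cluster; a real job would typically also configure authentication and write to a sink such as BigQuery or Cloud Storage instead of the console.

```python
# A minimal sketch of a Spark Structured Streaming job that reads from Kafka.
# The bootstrap server and topic are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-streaming-example").getOrCreate()

# Subscribe to a Kafka topic. For Managed Service for Apache Kafka, point
# kafka.bootstrap.servers at the cluster's bootstrap address and add the
# appropriate authentication options.
messages = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder address
    .option("subscribe", "orders")                        # placeholder topic
    .load()
)

# Kafka keys and values arrive as binary; cast them to strings for processing.
parsed = messages.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Write the stream to the console for demonstration purposes.
query = parsed.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```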

For Spark Streaming and Kafka, Apache Spark provides an integration guide. To help deploy common Kafka workflows, Dataproc provides a number of open-source templates.