Access Kafka data in BigQuery

If you need to load data from an Apache Kafka for BigQuery topic to a BigQuery table, you can do so with a Dataflow template using the Google Cloud console or the Google Cloud CLI. This method works if the messages in the Kafka topic are published using Avro or JSON schema and encoding.

Google Cloud products used

The Kafka to BigQuery Dataflow template uses the following billable Google Cloud products. Use the Pricing calculator to generate a cost estimate based on your projected usage.

  • Apache Kafka for BigQuery: A Google Cloud service that helps you run Apache Kafka. Apache Kafka for BigQuery lets you focus on building event-driven systems and streaming data pipelines, rather than managing the infrastructure.

  • BigQuery: Google Cloud's serverless data warehouse, ideal for running SQL queries and advanced analytics. Integrating Kafka data with BigQuery unlocks various possibilities:

    • Data accessibility: Makes data available to analysts and data scientists who may not have direct Kafka access.

    • Sample analysis: Allows analyzing specific data sets over a defined period, using Apache Avro.

    • Machine learning: Combine BigQuery ML with your Kafka data to build machine learning models using SQL.

    • Business intelligence: Integrate Kafka data with your BigQuery-powered BI applications.

    • Prototyping: Experiment with Avro data outside your production environment.

  • Dataflow: A fully managed service that provisions virtual machines for data processing. Features such as optimization and self-healing help ensure continuous and reliable data processing. The Kafka to BigQuery template comes pre-packaged, so you can run pipelines with custom parameters. You can launch it from the Google Cloud console, the Google Cloud CLI, or the REST API.

  • Cloud Storage: This service stores your objects (immutable data files) in containers called buckets. Buckets can contain managed folders for organizing and controlling access to groups of objects.

Before you begin

Before launching your Kafka to BigQuery pipeline, ensure you have completed the following:

  1. Create an Apache Kafka for BigQuery cluster and topic.

    One way of creating a cluster and a topic is to follow the Apache Kafka for BigQuery quickstart.

    If your topic contains Avro records, see Specify the message format for additional resource requirements.

  2. Enable the following Google Cloud APIs:

    • Dataflow

    • BigQuery

    • Cloud Storage

    gcloud services enable dataflow.googleapis.com bigquery.googleapis.com \
    storage-api.googleapis.com
    
  3. Create a BigQuery dataset and a table. Ensure that the schema of the table matches the schema of your Kafka input topic.

    If you use multiple schemas within the same topic and write to multiple tables, you don't need to create a table before configuring the pipeline.

    For more information on how to create a BigQuery dataset and a table, see Create a dataset and Create an empty table with a schema definition.
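
    For example, here is a minimal sketch of creating a dataset and a table with the bq command-line tool. The dataset, table, and field names are placeholders; use a schema that matches the messages in your Kafka topic.

    bq mk --dataset PROJECT_ID:DATASET_ID
    bq mk --table PROJECT_ID:DATASET_ID.TABLE_NAME name:STRING,value:INTEGER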

Launch the Kafka to BigQuery Dataflow template

You can launch the Kafka to BigQuery Dataflow template from the cluster details page in the console.

  1. In the Google Cloud console, go to the Clusters page.

    Go to Clusters

    The clusters you created in a project are listed.

  2. To view the cluster details page, click a cluster name.
  3. On the cluster details page, click Export Data to BigQuery.

    The Create job from template page opens.
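
Alternatively, you can launch the same template with the Google Cloud CLI by using the gcloud dataflow flex-template run command. The following is a minimal sketch for a JSON topic written to a single table, where REGION is the Dataflow region (ideally the same region as your Kafka cluster). The template path is the publicly hosted flex template, and the parameter names (readBootstrapServerAndTopic, kafkaReadAuthenticationMode, messageFormat, writeMode, outputTableSpec) are assumptions; verify the current path and parameter names on the Create job from template page before running the command.

    gcloud dataflow flex-template run JOB_NAME \
      --region=REGION \
      --template-file-gcs-location=gs://dataflow-templates-REGION/latest/flex/Kafka_to_BigQuery_Flex \
      --parameters=readBootstrapServerAndTopic=BOOTSTRAP_SERVER_AND_TOPIC,kafkaReadAuthenticationMode=APPLICATION_DEFAULT_CREDENTIALS,messageFormat=JSON,writeMode=SINGLE_TABLE_NAME,outputTableSpec=PROJECT_ID:DATASET_ID.TABLE_NAME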

Choose a regional endpoint for your pipeline

Dataflow workers can run independently of the region of your Kafka cluster. However, you incur inter-regional egress costs if you launch workers outside your Kafka cluster's region.

We strongly recommend that you configure the region of your Dataflow pipeline to be the same as your Kafka cluster, unless you have data processing requirements that require otherwise.

Grant the Managed Kafka client role to the Dataflow worker service account

Use Application Default Credentials as the Kafka source authentication mode.

When doing so, you must also grant the roles/managedkafka.client role to the Dataflow worker service account. The Dataflow worker service account is used as the identity for all worker VMs, and all requests that originate from the VM use the worker service account.

For more information about the worker service account, see Security and permissions for pipelines on Google Cloud.

The Managed Kafka client role includes the managedkafka.clusters.connect permission, which is required for all connections. To grant the Managed Kafka client role to the Dataflow worker service account, follow these steps:

Console

  1. In the Google Cloud console, go to the IAM page.
    Go to IAM
  2. Check that the project is set to the consumer project that the Apache Kafka for BigQuery client would be accessing.
  3. Click Grant access.
  4. On the new page, for Add Principals, enter the email address of the Dataflow worker service account that you are using.
  5. For Assign roles, select the Managed Kafka client role.
  6. Click Save.

gcloud CLI

  1. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  2. Run the gcloud projects add-iam-policy-binding command:

    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member serviceAccount:SERVICE_ACCOUNT_EMAIL \
      --role roles/managedkafka.client
    

    Replace the following:

    • PROJECT_ID is the project ID.

    • SERVICE_ACCOUNT_EMAIL is the email address of the Dataflow worker service account.

Configure the Kafka message format

The Dataflow template supports three message formats:

  • Avro (Confluent wire format): Each Kafka message includes a magic byte, a schema ID, and the Avro binary-encoded record. By default, the message key is not loaded into BigQuery. If you enable the Persist the Kafka message key to the BigQuery table option, you must also add a field named _key of type BYTES to the target table (see the example after this list). While the key may represent structured data, the template treats it as a byte array.

    For Avro (Confluent wire format) formats, you can use either a single schema or multiple schemas:

    • Single schema: All messages adhere to a single predefined Avro schema.

    • Multiple schemas: Messages can use different schemas. This is only supported for the Avro (Confluent wire format).

  • Avro (binary-encoded): Messages contain only the payload of the record without any metadata. You must provide an Avro schema file (.avsc) uploaded to Cloud Storage. All messages must adhere to this single schema.

  • JSON: Records don't require a predefined schema. Records that don't conform to the BigQuery table schema are sent to the dead-letter queue (if configured), or an error message is logged. The supported format is {"field": "value"}. The format [{"name": "field", "value": "value"}] is not supported.
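
If you enable the Persist the Kafka message key to the BigQuery table option, you can add the required _key column to an existing table with a DDL statement. The following is a minimal sketch using the bq command-line tool; the dataset and table names are placeholders.

    bq query --use_legacy_sql=false \
      'ALTER TABLE DATASET_ID.TABLE_NAME ADD COLUMN _key BYTES'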

Configure the Kafka schema source (Avro only)

You can provide the Avro schema through two methods:

  • Single schema file: Upload your Avro schema file (.avsc) to Cloud Storage. This option is available for both Avro (Confluent wire format) and Avro (binary-encoded) formats.

  • Schema registry: Provide the schema registry connection URL. Ensure that the schema registry is accessible to the Apache Kafka for BigQuery cluster network and is hosted in the same region as your Dataflow workers. You can use a schema registry with both single and multiple schema scenarios.
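
For the single schema file option, the schema must be available in Cloud Storage before you launch the pipeline. The following is a minimal sketch; the schema, file name, and bucket name are hypothetical. First, an example Avro schema file, order_schema.avsc:

    {
      "type": "record",
      "name": "Order",
      "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"}
      ]
    }

Upload the file with the gcloud CLI and pass the resulting gs:// path to the template:

    gcloud storage cp order_schema.avsc gs://BUCKET_NAME/schemas/order_schema.avsc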

Specify the Kafka offset

The Dataflow pipeline offers two starting offset options for data processing, found in the "Optional Source Parameters" section:

  • Earliest: Processes messages from the beginning of the Kafka topic.

  • Latest: Processes messages starting from the latest available offset.

To restart a pipeline from where it previously stopped, enable the offset commit policy. When using this option, specify the consumer group name for your pipeline.
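
If you launch the template with the gcloud CLI instead of the console, the same options are set through template parameters. The following sketch assumes the parameter names kafkaReadOffset, enableCommitOffsets, and consumerGroupId; verify them against the template's optional source parameters before use.

    # Append to the --parameters value of the gcloud dataflow flex-template run command:
    kafkaReadOffset=earliest,enableCommitOffsets=true,consumerGroupId=CONSUMER_GROUP_NAME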

Schema and BigQuery table compatibility

If you need to modify the pipeline's schema, create a BigQuery destination table that encompasses all fields from the incoming topics. This ensures that the BigQuery table can accommodate all data from the Kafka messages.

Configure dead-letter handling

Sometimes messages can't be processed because of corruption, incompatible data types, or schema mismatches with the destination BigQuery table.

To handle these cases, enable the dead-letter queue in the template and provide a table name. Erroneous messages are then written to that separate BigQuery table, which the template creates using a standardized schema.
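
If you launch the template with the gcloud CLI, the dead-letter queue is controlled through template parameters. The parameter names useBigQueryDLQ and outputDeadletterTable are assumptions; verify them on the Create job from template page before use.

    # Append to the --parameters value of the gcloud dataflow flex-template run command:
    useBigQueryDLQ=true,outputDeadletterTable=PROJECT_ID:DATASET_ID.ERROR_TABLE_NAME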

Configure encryption

By default, all data at rest and in transit is encrypted with a Google-managed encryption key. If you have customer-managed encryption keys (CMEK), you can select your own keys. For more information about how to configure CMEK, see Configure message encryption.
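
If you launch the template with the gcloud CLI, you can pass your Cloud KMS key with the --dataflow-kms-key flag. The following is a minimal sketch; the key path components are placeholders.

    # Add this flag to the gcloud dataflow flex-template run command:
    --dataflow-kms-key=projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING/cryptoKeys/KEY_NAME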

Configure networking

You must specify the cluster's network and subnetwork in the Dataflow template. The template's Optional parameters section lets you define the network for your Dataflow workers.

The Kafka to BigQuery Dataflow template provisions Dataflow workers in your project's default network, by default. To let your Apache Kafka for BigQuery cluster send data to BigQuery through Dataflow, ensure that your Dataflow workers can access your cluster's network.

Here are a few ways to ensure network accessibility:

  • Recommended: If your Kafka cluster is not already connected to a subnet in your project's default network, connect it to one. Because the template provisions Dataflow workers in the default network by default, this ensures that the workers can reach your cluster.

  • Set up VPC network peering. This method connects your Apache Kafka for BigQuery network to your Dataflow worker network. This offers more flexibility but adds complexity.

For more information on setting up networking with your Dataflow pipeline, see the Guidelines for specifying a network parameter.

If you encounter challenges configuring your Dataflow networking, see the Dataflow networking troubleshooting guide.
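
If you launch the template with the gcloud CLI, you can place the Dataflow workers on your Kafka cluster's network with the --network and --subnetwork flags. The following is a minimal sketch; the network and subnetwork names are placeholders.

    # Add these flags to the gcloud dataflow flex-template run command:
    --network=NETWORK_NAME \
    --subnetwork=regions/REGION/subnetworks/SUBNETWORK_NAME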

Monitoring

The Dataflow template for Kafka to BigQuery provides a monitoring experience that lets you explore logs, metrics, and errors within the console. This suite of monitoring tools is available as part of the Dataflow user interface.

The Job Metrics tab lets you create custom dashboards. For the Kafka to BigQuery template, we recommend setting up a Job Metrics dashboard that monitors the following:

  • Throughput: The volume of data processed at any point in time. This is useful for monitoring data flow through your job and identifying potential performance issues.

    For more information, see Dataflow throughput monitoring.

  • Data freshness: The difference in seconds between the timestamp on the data element and the time the event is processed in your pipeline. This helps identify performance and data source bottlenecks or frequent retries.

    For more information, see Dataflow data freshness monitoring.

  • Backlog: The number of bytes waiting to be processed. This information informs autoscaling decisions.

For more information about Dataflow monitoring, see the Dataflow monitoring documentation.

Troubleshooting

If you encounter performance issues with your Dataflow pipeline, Dataflow provides a comprehensive set of troubleshooting and diagnostic tools.

For a general overview of debugging Dataflow pipelines, review the Dataflow debugging guide.

Known limitations

  • The template does not support passing credentials for authentication to your Schema Registry.

  • When you create the Kafka to BigQuery Dataflow job, ensure that the Google Cloud project is set to the same project that contains the Apache Kafka for BigQuery cluster.

Apache Kafka® is a registered trademark of The Apache Software Foundation or its affiliates in the United States and/or other countries.

What's next