Managed Service for Apache Kafka overview

Managed Service for Apache Kafka is a Google Cloud service that helps you run secure, scalable open source Apache Kafka clusters. This page is an overview of what the service automates and simplifies for you. For more information about Apache Kafka, see the Apache Kafka website.

Simple sizing and scaling

To size or scale a Managed Service for Apache Kafka cluster, you need only set the total vCPU count and RAM size for the cluster. Management of brokers, including storage, is fully automated. To keep up with demands of clients, you can monitor vCPU and RAM utilization and adjust them up or down.

When you set the vCPU count and RAM size, the service automates broker provisioning, storage management, and rebalancing.

Broker provisioning

When you configure the total vCPU and RAM size for the cluster, the service provisions new brokers and scales existing brokers. For a typical cluster configuration, the total vCPU and RAM size is split evenly across all brokers. This means that fractional vCPU counts per broker are allowed, although a minimum of a single vCPU per broker is required. All clusters are distributed across three zones. This means that a minimum of 3 vCPU and 3 GiB of RAM per cluster is required.

As you increase the cluster size, brokers are scaled vertically up to 15 vCPU per broker. After this limit is reached, the service creates new brokers. When you decrease the cluster size, existing brokers are scaled down to a single vCPU, but not deleted.

The maximum broker size might change at any time. This limit was chosen to maintain linear scaling of broker throughput with vCPU count. You can examine individual broker configurations using Apache Kafka command line tools as well as Google Cloud topic metrics.
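
The scaling rules above can be sketched as a small calculation. This is only an illustration of the described behavior, not the service's actual provisioning logic: the function name `estimate_brokers` is hypothetical, the three-broker minimum is an assumption derived from the three-zone layout, and the 15 vCPU limit might change.

```python
import math

MAX_VCPU_PER_BROKER = 15  # limit quoted in the text; the service may change it
MIN_BROKERS = 3           # assumption: at least one broker in each of three zones


def estimate_brokers(total_vcpu, existing_brokers=MIN_BROKERS):
    """Return (broker_count, vcpu_per_broker) for a requested cluster size."""
    if total_vcpu < MIN_BROKERS:
        raise ValueError("a cluster needs at least 3 vCPU")
    # Brokers scale vertically first; new brokers appear only past the limit.
    needed = max(MIN_BROKERS, math.ceil(total_vcpu / MAX_VCPU_PER_BROKER))
    # Scale-down shrinks brokers but never deletes them.
    brokers = max(needed, existing_brokers)
    if total_vcpu < brokers:
        raise ValueError("each broker needs at least 1 vCPU")
    # The total is split evenly, so fractional per-broker values are possible.
    return brokers, total_vcpu / brokers
```

For example, under these assumptions a 48 vCPU cluster would need four brokers of 12 vCPU each, and shrinking that cluster to 6 vCPU would keep the four brokers at 1.5 vCPU each.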

Storage management

Storage management is automated. In most situations, you are responsible for setting the retention time on individual topics to control cost or satisfy your data retention policies. You don't need to provision and manage persistent disks.

The service relies on tiered storage (KIP-405). Tiered storage combines pre-provisioned persistent disk volumes attached to brokers with virtually unlimited object storage. As of this writing, the service uses 100 GiB of SSD persistent disk for each vCPU to balance performance, availability, and cost. Each partition leader buffers messages in segment files on these persistent disks. After a segment is rolled, it is moved to persistent object storage backed by regional Cloud Storage. The size of these segment files is determined by the log.roll.ms and log.segment.bytes settings.

While these details are useful to understand, storage is managed by the service. The specific configurations, such as amount of persistent disk capacity per vCPU, are implementation details that might change. You don't have direct access to Cloud Storage buckets used for persistent storage.
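
To make the storage figures concrete, the following minimal sketch estimates the local disk capacity of a cluster and how long a partition's active segment stays on local disk before it is rolled. The function names are hypothetical. The default values are the open source Apache Kafka defaults (a 1 GiB segment size and a 7-day roll time), which the service might override. The 100 GiB per vCPU figure is the implementation detail quoted above and might change.

```python
DISK_GIB_PER_VCPU = 100  # figure quoted in the text; might change


def local_disk_gib(total_vcpu):
    """Approximate local SSD capacity provisioned for the whole cluster."""
    return total_vcpu * DISK_GIB_PER_VCPU


def seconds_until_roll(ingress_bytes_per_sec,
                       segment_bytes=1024 ** 3,           # log.segment.bytes
                       roll_ms=7 * 24 * 3600 * 1000):     # log.roll.ms
    """Estimate how long one partition's active segment stays open.

    A segment is rolled when it reaches the size limit or the time
    limit, whichever comes first.
    """
    if ingress_bytes_per_sec <= 0:
        return roll_ms / 1000
    return min(segment_bytes / ingress_bytes_per_sec, roll_ms / 1000)
```

With these assumed values, a partition that receives 1 MiB per second rolls a segment roughly every 1024 seconds, so only a small window of recent data stays on local disk.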

Rebalancing

For newly provisioned brokers to be useful in maintaining performance, some traffic from existing brokers must be moved to these new machines. To make this easier, you can turn on automatic rebalancing.

With automatic rebalancing turned on, when a new broker is provisioned, the service automatically rebalances the partitions from existing brokers. The tiered storage model ensures that a relatively small amount of data must be copied to new brokers, speeding up rebalancing.

The rebalancing algorithm is based on the count of partitions. It does not account for the actual traffic served by each partition.
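
As an illustration of what count-based rebalancing means, the sketch below evens out the number of partitions per broker without looking at traffic. It is not the service's algorithm: the function name is hypothetical, the input must contain at least one broker, and replica placement and zones are ignored for brevity.

```python
def rebalance_by_count(assignment):
    """Even out partition counts across brokers.

    assignment maps a broker name to a list of partition names.
    Returns (new_assignment, moves), where each move is a
    (partition, source_broker, target_broker) tuple.
    """
    # Work on a copy so the caller's data is left untouched.
    assignment = {broker: list(parts) for broker, parts in assignment.items()}
    moves = []
    while True:
        busiest = max(assignment, key=lambda b: len(assignment[b]))
        idlest = min(assignment, key=lambda b: len(assignment[b]))
        # Stop once no broker holds more than one extra partition.
        if len(assignment[busiest]) - len(assignment[idlest]) <= 1:
            return assignment, moves
        partition = assignment[busiest].pop()
        assignment[idlest].append(partition)
        moves.append((partition, busiest, idlest))
```

Because only counts are compared, a broker that hosts a few very busy partitions can still end up more loaded than its peers after rebalancing.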

Flexible networking

The service makes a cluster accessible from any VPC securely. This includes access from multiple VPCs, projects, and regions.

To configure networking for a cluster, you provide the set of subnets where the cluster is accessible. The service provisions private IP addresses for the bootstrap servers and brokers in each subnet. It also sets up private Cloud DNS with URLs for each IP address. The bootstrap servers have a load balancer, so there is a single bootstrap URL per cluster. The URLs are the same across all VPCs so client configurations can be consistent across environments.

This level of flexibility is achieved thanks to Private Service Connect (PSC). Each IP address allocated for a cluster requires a PSC endpoint. The endpoints are provisioned automatically.
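
For capacity planning, the number of PSC endpoints can be estimated from the description above. This sketch assumes one IP address per broker plus one for the load-balanced bootstrap address in every subnet. The function name is hypothetical, and the actual allocation is up to the service.

```python
def psc_endpoint_count(subnets, brokers):
    """Estimate PSC endpoints: one per allocated IP address."""
    # Assumption: each subnet gets one address per broker
    # plus one address for the bootstrap load balancer.
    return subnets * (brokers + 1)
```

Under this assumption, a three-broker cluster that is accessible from two subnets would use eight endpoints.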

Secure clusters

The service does much of the work to keep your cluster secure. This includes authentication, authorization, encryption, patching, and resource isolation. It also disallows unauthenticated or unencrypted connections and unencrypted storage.

All connections to managed clusters and the administrative APIs are authenticated with an IAM identity. Human, service and federated accounts are supported. You don't have to manage these identities in the clusters.

Authorization is supported at two levels. You can rely on IAM to manage access to clusters and administrative APIs for managing them. If you need to control access to individual topics, you can use Kafka ACLs.

Encryption is required. All connections to clusters must be made with TLS. The TLS certificates presented by the brokers are signed by the Public Certificate Authority. Stored data is always encrypted. You can choose whether to use Google-managed or Customer-managed encryption keys (CMEK) for encryption at rest.
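
Because unencrypted connections are rejected, it can help to check client properties before deploying them. The following minimal sketch checks only the standard Kafka `security.protocol` client setting. The function name is hypothetical, and the sketch does not check authentication, since the mechanism the service expects is not described here.

```python
# Standard Kafka security.protocol values that use TLS.
ENCRYPTED_PROTOCOLS = {"SSL", "SASL_SSL"}


def uses_tls(client_props):
    """Return True if the client properties select a TLS-based protocol."""
    # Kafka clients fall back to PLAINTEXT when the setting is absent.
    protocol = client_props.get("security.protocol", "PLAINTEXT")
    return protocol.upper() in ENCRYPTED_PROTOCOLS
```

A configuration that omits the setting, or that uses PLAINTEXT or SASL_PLAINTEXT, fails this check.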

The service team keeps track of security vulnerabilities discovered in the open source code. When vulnerabilities are discovered, your clusters are patched automatically.

Another security feature of the service is resource isolation. The managed service deploys clusters in tenant projects in a private VPC inaccessible through public IP addresses. Each of your projects has a dedicated tenant project, with a dedicated service agent account. This helps limit the scope of access granted to the service.

High availability clusters

The goal of the service is to provide regional clusters for mission-critical applications. Specifically, the service protects you from failures of individual zones or brokers.

To achieve this, all clusters are provisioned in a rack-aware three-zone configuration. The default topic configuration requires at least three replicas. Rack-awareness makes sure that replicas are created in different zones. The default minimum number of in-sync replicas is two. This means that your cluster can tolerate complete loss of a zone or a broker.
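
The fault tolerance of the default settings follows from simple arithmetic. The sketch below assumes a producer that uses `acks=all` and one replica per zone, which is what rack-aware placement gives when the replication factor equals the number of zones. The function names are hypothetical.

```python
def tolerated_replica_losses(replication_factor=3, min_insync_replicas=2):
    """Replicas a partition can lose while still accepting acks=all writes."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min_insync_replicas must be between 1 and replication_factor")
    return replication_factor - min_insync_replicas


def can_accept_writes(zones_up, replication_factor=3, min_insync_replicas=2):
    """Check whether acks=all writes succeed with the given number of healthy zones."""
    # Assumption: one replica per zone, so live replicas track healthy zones.
    live_replicas = min(zones_up, replication_factor)
    return live_replicas >= min_insync_replicas
```

With the defaults, a partition tolerates the loss of one replica: writes continue with two healthy zones and stop when only one zone remains.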

When a broker fails because of a software, hardware, or networking failure, it is replaced automatically. The service detects the failure and restarts the broker, on a different machine if necessary. After the broker is available, Apache Kafka integrates the broker into the cluster. Complete zone failure might make it impossible to create a new broker. However, the cluster continues operating as long as the other two zones remain available.

In addition to these specific features, a growing list of internal tools and processes proactively maintain the health of the service, Apache Kafka code, and updates. Data and metadata backups are maintained at multiple levels, allowing the service to recover from many human errors and software failures.

The service does not provide protection from regional or dual-zone failures. For applications that require this level of protection, we recommend running two separate regional clusters. You can synchronize the data between two clusters by using tools such as MirrorMaker 2.0.

Tools for your style of administration

The service aims to offer a complete set of tools for your style of cluster management and troubleshooting. This includes tools for administering, monitoring, and logging.

Managed Service for Apache Kafka is exposed as a Google Cloud API. This means that you can manage clusters and cluster resources using REST and gRPC APIs. Several clients and interfaces are provided for these APIs, including:

  • Terraform providers if you prefer the infrastructure as code approach.
  • UI in Google Cloud console for interactive work in a browser.
  • The gcloud CLI for interactive work in a shell.
  • Client libraries in Java, Python, Go and other languages for custom development and scripting.

For monitoring and troubleshooting, the service exports metrics to Cloud Monitoring. Some of the metrics are available in the service UI. A complete set is available in Cloud Monitoring for interactive work, configuring alerts, and export to other systems.

The service also exports broker logs to Cloud Logging. These are searchable and can be used to create log-based metrics and alerts.

Automatic upgrades and patches

The service aims to keep all clusters updated to the latest stable version of Apache Kafka and all underlying software within several months of release. The service remains one minor version behind the latest Apache Kafka version to avoid stability issues.

Updates to the underlying infrastructure, including the operating system and orchestration layers, are also continuous and automatic. Brokers are updated with a rolling restart, with no downtime to the overall cluster. All updates are tested before they are made available and are monitored for stability. Upgrades require no manual intervention.

Transparent cost

The pricing model for Managed Service for Apache Kafka is similar to the charges you see when you run Apache Kafka yourself on Compute Engine. You pay for the resources you provision (vCPU, RAM, and local storage) and for the resources you consume (persistent storage and data transfer). vCPU, RAM, and persistent storage cost more with Managed Service for Apache Kafka compared to setting up a similar system yourself. Data transfer and local storage don't. For more information about pricing, see the pricing guide.

Compatible because we run Apache Kafka

Finally, Managed Service for Apache Kafka runs the same open source software you may already run in your environment. You don't have to change your application code to migrate it to the service.

Apache Kafka® is a registered trademark of The Apache Software Foundation or its affiliates in the United States and/or other countries.