This document describes how to estimate the capacity that you need for a Managed Service for Apache Kafka cluster, and how to adjust the size of an existing cluster.
When you create a Managed Service for Apache Kafka cluster, you choose the following parameters for the size of the cluster:
vCPUs: The number of vCPUs in the cluster. The minimum vCPU count is 3.
Memory: The amount of memory per vCPU. You must provision between 1 GiB and 8 GiB per vCPU.
You can update these values after the cluster is created.
Choose the initial cluster size
To choose the initial cluster size, start by estimating the following values, based on your particular workload.
- Write throughput: The total rate at which producers send data to the cluster, in MBps.
- Read throughput: The total rate at which consumers read data from the cluster, in MBps.
To estimate the size of a cluster necessary to handle this throughput, perform the following steps:
Calculate the total write bandwidth, including replication.
Total write bandwidth = produce rate * replicas
This value includes bandwidth from the client to the leader broker, and from the leader to the replica brokers. The default number of replicas is 3.
Calculate the total read bandwidth, including replication.
Total read bandwidth = consume rate + produce rate * (replicas - 1)
This value includes bandwidth for the client's read operations (consume rate), plus the bandwidth needed for the replicas to stay synchronized. Replicas synchronize by reading data from the partition leader. The (replicas - 1) term is used because the partition leader doesn't read from any replicas.
Calculate the write-equivalent data rate.
As a general rule, read bandwidth is 4 times more efficient to process than write bandwidth. To account for this difference, calculate the write-equivalent data rate as follows:
Write-equivalent rate = (total write bandwidth) + (total read bandwidth / 4)
Determine your target vCPU utilization. This value represents average vCPU utilization as a percentage of vCPU capacity. The actual utilization might spike or dip over time.
- As a baseline, start with a utilization target of 50%.
- If you know the expected traffic patterns, set the utilization target equal to the ratio of the average write-equivalent bandwidth to the peak bandwidth you must accommodate.
Generally, increasing utilization lowers the cost of your cluster by reducing its size, but is also riskier if the traffic exceeds estimates. Excessive vCPU utilization can cause high latencies and errors.
Calculate the number of vCPUs.
vCPU count = ceiling(write-equivalent rate / 20 MBps / utilization)
The estimated capacity for a single vCPU in a single zone is 20 MBps. Therefore, if the vCPUs ran at 100% utilization, you would need (write-equivalent rate / 20) vCPUs. To get the actual number, divide that value by the target utilization and round up.
Also, sending messages in batches smaller than 10 KB reduces throughput per vCPU, relative to the 20 MBps benchmark used here. In that case, either account for the reduced throughput capacity or consider sending larger batches.
Estimate the required memory. We recommend 4 GiB of RAM for each vCPU.
Memory = vCPU count * 4 GiB
Test with your real workload for the most accurate sizing. Monitor the cluster's resource usage and scale up if needed.
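The sizing steps above can be sketched as a short Python helper. The 20 MBps-per-vCPU estimate and the 4 GiB-per-vCPU recommendation come from this guide; the function itself is only an illustration, not part of any service API:

```python
import math

def estimate_cluster_size(produce_mbps, consume_mbps, replicas=3, utilization=0.5):
    """Estimate the vCPU count and memory for a cluster.

    Follows the steps in this guide: compute replication-inclusive
    write and read bandwidth, fold reads into a write-equivalent rate
    (reads are ~4x cheaper), then size vCPUs against the ~20 MBps
    per-vCPU benchmark at the chosen target utilization.
    """
    total_write = produce_mbps * replicas
    total_read = consume_mbps + produce_mbps * (replicas - 1)
    write_equivalent = total_write + total_read / 4
    vcpus = math.ceil(write_equivalent / 20 / utilization)
    memory_gib = vcpus * 4  # recommended 4 GiB of RAM per vCPU
    return vcpus, memory_gib
```

For example, a workload with 50 MBps of writes and 100 MBps of reads at a 50% utilization target yields 20 vCPUs and 80 GiB of memory.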
Example size calculation
Assume that a workload has a write rate of 50 MBps and a read rate of 100 MBps, with 3 replicas and a target vCPU utilization of 50%.
Total write bandwidth = 50 MBps * 3 replicas = 150 MBps
Total read bandwidth = 100 MBps + 50 MBps * (3 - 1) = 200 MBps
Write-equivalent rate = 150 MBps + (200 MBps / 4) = 200 MBps
Target utilization = 0.5
Number of vCPUs = ceiling(200 MBps / 20 MBps / 0.5) = 20 vCPUs
Memory = 20 vCPUs * 4 GiB = 80 GiB
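The arithmetic in this example can be verified with a few lines of Python (a plain restatement of the numbers above, not a service call):

```python
import math

produce, consume, replicas, util = 50, 100, 3, 0.5

total_write = produce * replicas                 # 150 MBps
total_read = consume + produce * (replicas - 1)  # 200 MBps
write_equiv = total_write + total_read / 4       # 200 MBps
vcpus = math.ceil(write_equiv / 20 / util)       # 20 vCPUs
memory = vcpus * 4                               # 80 GiB
```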
Brokers
When you create a cluster, the system provisions at least one broker in each of three zones. Brokers are as evenly distributed across zones as possible, and all brokers have the same number of vCPUs. The number of brokers can be calculated with the following formula:
number of brokers = max(3, ceiling(vCPUs / 15))
For example, a cluster with 75 vCPUs starts with 5 brokers.
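The broker-count formula can be expressed directly in Python:

```python
import math

def broker_count(vcpus):
    # At least 3 brokers (one per zone), at most 15 vCPUs per broker.
    return max(3, math.ceil(vcpus / 15))
```

For instance, `broker_count(75)` returns 5, matching the example above, and `broker_count(76)` returns 6 because a sixth broker is needed to stay within 15 vCPUs per broker.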
If you change the number of vCPUs, they are distributed across the existing brokers, up to a maximum of 15 vCPUs per broker. If you increase the cluster size beyond 15 vCPUs per broker, the system provisions a new broker. Once a new broker is provisioned, it can be scaled down to 1 vCPU, but it cannot be deleted.
Update the cluster size
After you create a Managed Service for Apache Kafka cluster, you can adjust the vCPU count and memory to accommodate your needs. When you update an existing cluster, the following rules apply:
The cluster's overall vCPU-to-memory ratio must always remain between 1:1 and 1:8.
If you downscale, there must be at least 1 vCPU and 1 GiB of memory for each existing broker. The number of brokers never decreases.
If you upscale, and the change results in adding new brokers, the average vCPU and memory per broker can't decrease by more than 10% compared to the averages before the update.
For example, if you try to upscale a cluster from 45 vCPUs (3 brokers) to 48 vCPUs (4 brokers), the operation fails. This is because the average vCPU per broker decreases from 15 to 12, which is a 20% reduction, exceeding the 10% limit.
If an upscale would decrease the average vCPU count per broker by more than 10%, we recommend making the change in several stages. After each update, monitor resource utilization and rebalance partitions if needed.
However, if you are confident that your brokers will have enough capacity after the update, you can disable this check. To disable the check, set the allow_broker_downscale_on_cluster_upscale flag to true in the gcloud managed-kafka clusters update command. This flag signals that you accept the potential performance risk.
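The 10% safety check can be sketched in Python. This is only an illustration of the rule as described above, not service code, and it assumes the current broker count follows the max(3, ceiling(vCPUs / 15)) formula:

```python
import math

def upscale_allowed(cur_vcpus, cur_mem_gib, new_vcpus, new_mem_gib):
    """Check the 10% per-broker rule for an upscale that adds brokers.

    Assumes broker counts are derived from the formula in this guide;
    a cluster that was previously scaled down may have more brokers.
    """
    cur_brokers = max(3, math.ceil(cur_vcpus / 15))
    new_brokers = max(3, math.ceil(new_vcpus / 15))
    if new_brokers <= cur_brokers:
        return True  # no new brokers, so the check doesn't apply

    # Relative drop in the per-broker averages after adding brokers.
    vcpu_drop = 1 - (new_vcpus / new_brokers) / (cur_vcpus / cur_brokers)
    mem_drop = 1 - (new_mem_gib / new_brokers) / (cur_mem_gib / cur_brokers)
    return vcpu_drop <= 0.10 and mem_drop <= 0.10
```

Applied to the clusters in the examples that follow, upscaling from 75 vCPUs and 130 GiB to 80 vCPUs and 140 GiB fails the check, while upscaling to 85 vCPUs and 150 GiB passes.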
To update a cluster, see Update a Managed Service for Apache Kafka cluster.
Example update operations
The following examples start with a cluster that has 75 vCPUs, 130 GiB RAM, and 5 brokers.
Example of a failed upscale operation
Upscale the cluster to 80 vCPUs and 140 GiB RAM.
The service determines whether a new broker is needed.
- ceiling(80 vCPUs / 15) = 6 brokers
The cluster would grow from 5 to 6 brokers, so the 10% safety check is triggered.
The current averages per broker are:
75 vCPUs / 5 brokers = 15 vCPUs per broker
130 GiB / 5 brokers = 26 GiB per broker
With 6 brokers, the new averages are:
80 vCPUs / 6 brokers = 13.33 vCPUs per broker, an 11.1% reduction
140 GiB / 6 brokers = 23.33 GiB per broker, a 10.2% reduction
The operation fails, because these reductions exceed the 10% limit.
Example of a successful upscale operation
Upscale the cluster to 85 vCPUs and 150 GiB RAM.
The service determines whether a new broker is needed.
- ceiling(85 vCPUs / 15) = 6 brokers
The cluster would grow from 5 to 6 brokers, so the 10% safety check is triggered.
The current averages per broker are:
75 vCPUs / 5 brokers = 15 vCPUs per broker
130 GiB / 5 brokers = 26 GiB per broker
With 6 brokers, the new averages are:
85 vCPUs / 6 brokers = 14.17 vCPUs per broker, a 5.5% reduction
150 GiB / 6 brokers = 25 GiB per broker, a 3.8% reduction
This operation succeeds because the reduction in average vCPU and memory per broker is within the 10% limit.
What's next
- Create a Managed Service for Apache Kafka cluster
- Monitor a Managed Service for Apache Kafka cluster
- Update a Managed Service for Apache Kafka cluster