Troubleshoot bottlenecks in Dataflow

A bottleneck occurs when one step, stage, or worker slows down the overall job. Bottlenecks can lead to idle workers and increased latency.

If Dataflow detects a bottleneck, the job graph shows an alert, and the Step Info panel lists the kind of bottleneck and its cause, if known. Dataflow also exports bottleneck detection information to a Cloud Monitoring metric, which presents the data as a time series. This lets you view bottlenecks over time, including after they have resolved.

Understand bottlenecks

When Dataflow runs a streaming pipeline, the job consists of a series of components, such as streaming shuffles, user-defined function (DoFn) processing threads, and persistent state checkpointing. To facilitate the flow of data, Dataflow uses queues to connect these components. Data is pushed from upstream to downstream.

In many pipelines, the overall throughput capacity is constrained by a single component, creating a bottleneck in the pipeline. The rate at which data can move through a bottleneck limits how quickly the pipeline can accept and process input data.

For example, consider a pipeline where DoFn processing occurs downstream of a streaming shuffle. A queue between them buffers the shuffled but unprocessed data. If the DoFn processing can't consume data as quickly as the streaming shuffle produces it, then the queue grows. A prolonged bottleneck can cause the queue to reach its capacity. At that point, further shuffling is paused, and the backlog propagates upstream. Queues further upstream also accumulate backlogs, eventually causing a slowdown that extends to the data source, meaning the whole pipeline can't keep pace with the input.

When a bottleneck happens, a substantial portion of the pipeline might appear to be unhealthy, even though a single point in the pipeline is causing the backlog. This behavior can make it hard to debug bottlenecks. The goal of bottleneck detection is to identify the precise location and cause, eliminating guesswork, so that you can fix the root cause.

Dataflow detects a bottleneck when a delay exceeds a threshold of five minutes. Delays that don't cross this threshold aren't reported as bottlenecks.

Whether a detected bottleneck requires action depends on your use case. A pipeline can operate normally with transient delays of more than five minutes. If that level of delay is acceptable for your use case, you might not need to resolve the reported bottlenecks.

Kinds of bottleneck

When Dataflow detects a bottleneck, the monitoring interface indicates the severity of the problem. Bottlenecks fall into the following categories:

Processing is stuck and not making progress
The progress of the pipeline is completely halted at this step.
Processing is ongoing but falling behind
The pipeline can't process incoming data as quickly as it arrives. The backlog is growing as a result.
Processing is ongoing but the backlog is steady
The pipeline is making progress, and the processing rate is comparable to the input rate. Processing is fast enough that the backlog is not growing, but the accumulated backlog is also not significantly decreasing.
Processing is ongoing and catching up from a backlog
The backlog is decreasing, but the current bottleneck prevents the pipeline from catching up any faster. If you start a pipeline with a backlog, this state might be normal and not require any intervention. Monitor progress to see if the backlog continues to decrease.

Causes of bottlenecks

The monitoring interface shows the cause of the bottleneck, if known. Use this information to resolve the issue.

Long processing time operations

The computation has a long processing time. This issue occurs when an input bundle is sent to the worker that runs the DoFn and a significant amount of time elapses without results becoming available.

This problem is most often the result of a single long-running operation in user code. Other issues can also manifest as long processing times: for example, errors thrown and retried inside the DoFn, long retry periods, or worker harness crashes caused by factors such as out-of-memory (OOM) errors.

If the affected computation is in user code, look for ways to optimize the code or bound the execution time. To help with debugging, the worker logs show stack traces for any operations that are stuck for longer than five minutes.
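
For example, the following Java sketch shows one way to bound the execution time of a slow operation inside a DoFn. The slowLookup method and the 30-second timeout are illustrative placeholders for user code, not part of the Dataflow API.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch: bound the time spent on a slow operation so a single element can't
// stall the processing thread indefinitely.
class BoundedLookupFn extends DoFn<String, String> {

  private transient ExecutorService executor;

  @Setup
  public void setup() {
    executor = Executors.newSingleThreadExecutor();
  }

  @ProcessElement
  public void process(@Element String element, OutputReceiver<String> out) throws Exception {
    Future<String> result = executor.submit(() -> slowLookup(element));
    try {
      // Fail fast instead of blocking for an unbounded amount of time.
      out.output(result.get(30, TimeUnit.SECONDS));
    } catch (TimeoutException e) {
      result.cancel(true);
      // Handle the timeout: drop the element, emit to a dead-letter output, or retry with backoff.
    }
  }

  @Teardown
  public void teardown() {
    executor.shutdownNow();
  }

  // Hypothetical stand-in for a slow user operation, such as an external RPC.
  private static String slowLookup(String key) throws InterruptedException {
    Thread.sleep(60_000);
    return key;
  }
}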

Slow persistent state read

The computation is spending a significant amount of time reading persistent state while executing the DoFn. This might be the result of excessively large persistent state or too many reads. Consider reducing the persisted state size or the frequency of reads. This might also be a transient issue caused by slowness in the underlying persistent state.
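
As an illustration, the following sketch keeps a compact running aggregate in a ValueState instead of buffering every element in a BagState and re-reading the whole buffer for each new element. The key and value types are placeholders; the general point is to keep the value that must be read per element small.

import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Sketch: store a small running sum per key so that each state read fetches a
// single long rather than an ever-growing buffer of elements.
class RunningSumFn extends DoFn<KV<String, Long>, KV<String, Long>> {

  @StateId("sum")
  private final StateSpec<ValueState<Long>> sumSpec = StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void process(
      @Element KV<String, Long> element,
      @StateId("sum") ValueState<Long> sum,
      OutputReceiver<KV<String, Long>> out) {
    Long previous = sum.read();  // Reads only a compact value.
    long updated = (previous == null ? 0L : previous) + element.getValue();
    sum.write(updated);
    out.output(KV.of(element.getKey(), updated));
  }
}

Keeping the persisted value compact also reduces the amount of state that must be written when results are committed.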

Slow persistent state write

The computation is spending a significant amount of time writing persistent state during the commit of processing results. This might be the result of excessively large persistent state. Consider reducing the persisted state size. This might also be a transient issue caused by slowness in the underlying persistent state.
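
As an illustration, the following sketch writes state only when the stored value actually changes, so unchanged keys don't add to the size of each commit. The key and value types are placeholders.

import java.util.Objects;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Sketch: write state only when the value actually changes, so unchanged keys
// don't increase the size of the commit.
class LatestStatusFn extends DoFn<KV<String, String>, KV<String, String>> {

  @StateId("status")
  private final StateSpec<ValueState<String>> statusSpec =
      StateSpecs.value(StringUtf8Coder.of());

  @ProcessElement
  public void process(
      @Element KV<String, String> element,
      @StateId("status") ValueState<String> status,
      OutputReceiver<KV<String, String>> out) {
    String previous = status.read();
    if (!Objects.equals(previous, element.getValue())) {
      status.write(element.getValue());  // Persist only genuine changes.
      out.output(element);
    }
  }
}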

Rejected commit

The results of data processing can't be committed to persistent state because they are invalid. This is usually caused by exceeding one of the operational limits. Check the logs for more details, or contact support.

Insufficient Apache Kafka source partitions

The Apache Kafka source computation has insufficient partitions. To resolve this problem, try the following:

  • Increase the number of Kafka partitions.
  • Use a Redistribute transform to redistribute and parallelize the data more efficiently, as shown in the sketch later in this section.

For more information, see Parallelism in the page Read from Apache Kafka to Dataflow.
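
The following Java sketch shows the Redistribute option from the preceding list: adding a Redistribute transform after the Kafka read so that downstream parallelism isn't limited by the partition count. The broker address and topic name are placeholders, and the Redistribute transform requires a recent Apache Beam SDK version.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Redistribute;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaRedistributeExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        .apply("ReadFromKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker-1:9092")   // Placeholder broker.
                .withTopic("my-topic")                   // Placeholder topic.
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
        // Spread the records across more keys than there are Kafka partitions.
        .apply("Redistribute", Redistribute.<KV<String, String>>arbitrarily());

    pipeline.run();
  }
}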

Insufficient source parallelism

A source computation has insufficient parallelism. If possible, increase the parallelization within the source. If you can't increase parallelization and the job uses at-least-once mode, try adding a Redistribute transform to the pipeline.

Hot keys or insufficient key parallelism

The job has hot keys or insufficient key parallelism.

For each sharding key, Dataflow processes messages serially. While Dataflow is processing a batch of messages for a given key, other incoming messages for that key are queued until the current batch is completed.

If Dataflow can't process enough distinct keys in parallel, it can cause a bottleneck. For example, the data might have too few distinct keys, or certain keys might be overrepresented in the data ("hot keys"). For more information, see Troubleshoot slow or stuck jobs.
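
One common mitigation, which isn't specific to Dataflow, is to add a random subkey (a "salt") so that a single hot key is spread across several parallel keys; downstream aggregations then need a second step to recombine the salted partial results. In the following sketch, the fan-out factor and key format are illustrative.

import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Sketch: spread a hot key across FAN_OUT subkeys so its messages can be
// processed on more threads in parallel. Partial results per subkey must be
// combined again downstream.
class SaltKeyFn extends DoFn<KV<String, Long>, KV<String, Long>> {

  private static final int FAN_OUT = 10;  // Illustrative fan-out factor.

  @ProcessElement
  public void process(
      @Element KV<String, Long> element, OutputReceiver<KV<String, Long>> out) {
    int salt = ThreadLocalRandom.current().nextInt(FAN_OUT);
    out.output(KV.of(element.getKey() + "#" + salt, element.getValue()));
  }
}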

Underprovisioned vCPUs

The job doesn't have enough worker vCPUs. This situation occurs when the job has already scaled to its maximum number of workers, vCPU utilization is high, and there is still a backlog.
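
One possible remediation, sketched below, is to raise the maximum number of workers when you launch the job. The Dataflow runner options are assumed, and the value 100 is a placeholder; the same setting is available as the --maxNumWorkers (Java) or --max_num_workers (Python) pipeline option.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchWithHigherCeiling {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    // Raise the autoscaling ceiling; 100 workers is an illustrative value.
    options.setMaxNumWorkers(100);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the rest of the pipeline here.
    pipeline.run();
  }
}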

Problem communicating with workers

Dataflow cannot communicate with all of the worker VMs. Check the status of the job's worker VMs. Possible causes include:

  • There is a problem provisioning the worker VMs.
  • The worker VM pool was deleted while the job was running.
  • There are networking issues.

Pub/Sub source has insufficient parallelism

The Pub/Sub source computation has an insufficient number of Pub/Sub keys. If you see this warning, contact support.

Pub/Sub source throttled for unknown reason

The Pub/Sub source computation is throttled while reading from Pub/Sub, for an unknown reason. This issue might be transient. Check for Pub/Sub configuration issues, missing IAM permissions, or quota limits. If none of these areas is the root cause and the issue persists, contact support.

Pub/Sub sink publish slow or stuck

The Pub/Sub sink computation is slow or stuck. This problem might be caused by a configuration issue or a quota limit.

High work queue time

The oldest eligible work age is high due to a large number of keys and the rate at which keys are processed. In this situation, each individual operation might not be abnormally long, but the overall queueing delay is high.

Dataflow uses a single processing thread per sharding key, and the number of processing threads is limited. The queueing delay is approximately equal to the number of keys divided by the total number of worker harness threads, multiplied by the on-thread latency per bundle for each key:

(key count / total harness threads) * latency per bundle
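
For example, with illustrative values of 1,000,000 active keys, 10,000 total worker harness threads, and 200 ms of on-thread latency per bundle, the approximate queueing delay is:

(1,000,000 keys / 10,000 threads) * 200 ms per bundle = 20 seconds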

Try the following remediations:

  • Increase the number of workers. See Streaming autoscaling.
  • Increase the number of worker harness threads. Set the numberOfWorkerHarnessThreads (Java) or number_of_worker_harness_threads (Python) pipeline option, as shown in the sketch after this list.
  • Decrease the number of keys.
  • Decrease the operation latency.
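
The following Java sketch shows the second remediation from the list: raising the worker harness thread count when you launch the job. The value 500 is a placeholder that you would tune for your workload, and the DataflowPipelineDebugOptions interface is assumed; on the command line, the equivalent flags are --numberOfWorkerHarnessThreads (Java) and --number_of_worker_harness_threads (Python).

import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchWithMoreHarnessThreads {
  public static void main(String[] args) {
    DataflowPipelineDebugOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineDebugOptions.class);
    // Equivalent to passing --numberOfWorkerHarnessThreads=500 on the command line.
    options.setNumberOfWorkerHarnessThreads(500);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the rest of the pipeline here.
    pipeline.run();
  }
}
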
A transient issue with the Streaming Engine backend

There is a configuration or operational issue with the Streaming Engine backend. This issue might be transient. If the issue persists, contact support.

What's next