Execution details

Dataflow provides an Execution details tab in its web-based monitoring interface. Use the information in this tab to optimize performance for your jobs and to diagnose why a job might be slow or stuck. After you launch your job, you can open the Execution details tab from the Dataflow monitoring interface. For more information, see Accessing the Dataflow monitoring interface.

This feature doesn't cause additional CPU or network usage on your VMs. The execution details are collected by the Dataflow backend monitoring system, so collecting them doesn't affect the performance of the job.

This page provides a high-level summary of the execution details feature and its user interface layout. For additional troubleshooting information, see Pipeline troubleshooting and debugging.

When to use execution details

The following are common scenarios for using execution details when running Dataflow jobs:

  • Your pipeline is stuck and you want to troubleshoot the issue.
  • Your pipeline is slow and you want to target pipeline optimization.
  • Nothing needs to be fixed, but you want to see the execution details of your pipeline to understand your job.

Terminology

To use execution details effectively, it's helpful to understand how the following concepts apply to Dataflow jobs.

Dataflow terminology

  • Fusion optimization: The process of Dataflow fusing multiple steps or transforms. This process optimizes user-submitted pipelines. For more information, see Fusion optimization.
  • Stages: Units of one or more fused steps in a Dataflow pipeline.
  • Last stage: A final node in a Dataflow pipeline. A pipeline can have multiple final nodes.
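
To make the fusion concept concrete, here is a toy sketch (an illustration, not Dataflow's actual optimizer): consecutive element-wise steps are composed into a single fused stage, so each element flows through every step without intermediate collections being materialized between them. The step names are hypothetical.

```python
def fuse(*steps):
    """Compose a chain of per-element steps into one fused stage."""
    def fused_stage(element):
        for step in steps:
            element = step(element)
        return element
    return fused_stage

# Hypothetical user-submitted steps.
parse = int
double = lambda x: x * 2
to_label = lambda x: f"value={x}"

# After fusion, the three steps execute as a single stage per element.
stage = fuse(parse, double, to_label)
print([stage(e) for e in ["1", "2", "3"]])  # ['value=2', 'value=4', 'value=6']
```

In a real pipeline, the benefit of fusion is that intermediate outputs between fused steps never need to be written out; the composition above mimics that by never building intermediate lists.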

Batch terminology

  • Critical paths: The sequence of stages of a pipeline that contribute to the overall job runtime. For example, this sequence excludes the following stages:
    • Branches of the pipeline that finished earlier than the overall job.
    • Inputs that did not delay downstream processing.
  • Workers: Compute Engine VM instances running a Dataflow job.
  • Work items: The units of work, each corresponding to a bundle selected by Dataflow.
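
The critical-path idea can be sketched with a minimal model (assumed stage timings and dependency map, not the algorithm Dataflow uses): walk backward from the stage that finishes last, at each step following the predecessor that finished latest. Early-finishing branches never appear on the path.

```python
def critical_path(stages, deps):
    """stages: {name: (start, end)}; deps: {name: [predecessor names]}.

    Returns the chain of stages that gated the overall job runtime.
    """
    # Start from the stage that determines the overall job end time.
    current = max(stages, key=lambda s: stages[s][1])
    path = [current]
    while deps.get(current):
        # The predecessor that finished last is the one this stage waited on.
        current = max(deps[current], key=lambda s: stages[s][1])
        path.append(current)
    return list(reversed(path))

# Hypothetical timings: "side_input" finishes early, so it is excluded.
stages = {"read": (0, 4), "side_input": (0, 1), "process": (4, 10), "write": (10, 12)}
deps = {"process": ["read", "side_input"], "write": ["process"]}
print(critical_path(stages, deps))  # ['read', 'process', 'write']
```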

Streaming terminology

  • Data freshness: The amount of time between real time and the output watermark. For more information, see Data freshness.
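
A minimal sketch of the data-freshness concept (the timestamps are hypothetical, and this is the definition from above expressed as code, not the monitoring system's implementation):

```python
from datetime import datetime, timedelta, timezone

def data_freshness(now, output_watermark):
    """Time elapsed between the output watermark and real time."""
    return now - output_watermark

# Hypothetical example: the watermark lags real time by five minutes.
now = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
watermark = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(data_freshness(now, watermark))  # 0:05:00
```

A growing freshness value means the stage's output watermark is falling further behind real time.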

Use the Execution details tab

The Execution details tab includes four views: Stage progress, Stage info panel (within Stage progress), Stage workflow, and Worker progress.

The Stage workflow view is automatically enabled for all batch and streaming jobs. Both batch and streaming jobs also have the Stage progress view, and batch jobs have an additional Worker progress view.

This section walks you through each view and provides examples of successful and unsuccessful Dataflow jobs.

Stage progress for batch jobs

The Stage progress view for batch jobs shows the execution stages of the job, arranged by their start and end times. Each stage's duration is represented by a bar, so you can visually identify the longest-running stages of a pipeline by finding the longest bars.

Each bar includes a sparkline that shows the progress of the stage over time. To highlight the stages that contributed to the overall runtime of the job, click the Critical path toggle. You can also use Filter stages to select only the stages you're interested in.

An example of the Stage progress view for batch jobs, showing a visualization of the length of time for six different execution stages.

Stage progress for streaming jobs

The Stage progress view for streaming jobs is divided into two sections. The first half of the view shows a chart of the data freshness for each execution stage of the job. Hold the pointer over the chart to see the data freshness value at a specific instant in time.

The second half of the view shows the execution stages of the job, arranged in topological order: stages with no ancestor stages are shown first, followed by their descendants. This ordering makes it easier to identify the stages of a pipeline that take longer than expected. The bars are sized relative to the longest data freshness across the entire time domain.
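
The topological ordering described here can be sketched with Kahn's algorithm over an assumed stage-dependency map (a toy model, not Dataflow's code): a stage is emitted only after all of its ancestors have been emitted.

```python
from collections import deque

def topological_order(deps):
    """deps: {stage: [predecessor stages]}.

    Returns the stages ordered so that every stage appears after all of
    its ancestors (Kahn's algorithm).
    """
    stages = set(deps) | {p for ps in deps.values() for p in ps}
    remaining = {s: set(deps.get(s, [])) for s in stages}
    # Stages with no ancestors are ready first.
    ready = deque(sorted(s for s, ps in remaining.items() if not ps))
    order = []
    while ready:
        s = ready.popleft()
        order.append(s)
        for t, ps in remaining.items():
            if s in ps:
                ps.remove(s)
                if not ps:           # All ancestors of t are now emitted.
                    ready.append(t)
    return order

print(topological_order({"process": ["read"], "write": ["process"]}))
# ['read', 'process', 'write']
```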

Streaming jobs run until they are canceled, drained, or updated.

  • Use the time picker shown with the chart to scope the domain to a more useful time range.
  • Use the Filter stages menu to select the stages you're interested in.

The Stage progress view helps you identify when your streaming job is slow or stuck in two different ways:

  • The Data freshness by stages chart includes anomaly detection, which automatically displays windows of time when the data freshness looks unhealthy. The chart highlights potential stuckness when data freshness exceeds the 99th percentile for the selected time window. Likewise, it highlights potential slowness when data freshness exceeds the 95th percentile.

  • Detect bottlenecks by hovering over a time in the chart that shows unexpected results. Longer bars indicate slower stages. Alternatively, click the x-axis of the chart to display the data at that instant in time. A common approach to finding the stage causing stuckness or slowness is to find the most upstream (topmost) or the most downstream (bottommost) stage causing the data freshness to spike. This approach doesn't suit all scenarios, and you might need to debug further to pinpoint the exact cause.
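
The percentile rule described above can be sketched as follows (an illustration with hypothetical freshness samples in seconds, not the monitoring backend's implementation): within the selected window, samples above the 99th percentile are flagged as potential stuckness, and samples above the 95th percentile as potential slowness.

```python
from statistics import quantiles

def classify_freshness(samples_sec):
    """Label each data-freshness sample relative to the window's percentiles."""
    # quantiles(..., n=100) returns the 1st..99th percentile cut points.
    cuts = quantiles(samples_sec, n=100)
    p95, p99 = cuts[94], cuts[98]
    labels = []
    for s in samples_sec:
        if s > p99:
            labels.append("possible stuckness")
        elif s > p95:
            labels.append("possible slowness")
        else:
            labels.append("healthy")
    return labels

# Hypothetical window: mostly 10 s freshness, a few 30 s spikes, one 120 s outlier.
samples = [10] * 95 + [30] * 4 + [120]
labels = classify_freshness(samples)
print(labels.count("healthy"), labels.count("possible slowness"),
      labels.count("possible stuckness"))  # 95 4 1
```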

An example of the Stage progress view for streaming jobs, showing a visualization of the length of time for one execution stage and a possible slowness anomaly.

Stage info panel

The Stage info panel displays a list of steps associated with a fused stage, ranked by descending wall time. To open the panel, hold the pointer over one of the bars in the Stage progress view and click View details.

An example of the Stage info panel.

Stage workflow

Stage workflow shows the execution stages of the job, represented as a workflow graph. To show only the stages that directly contributed to the overall runtime of the job, click the Critical path toggle.

An example of the Stage workflow view, showing the hierarchy of the different
execution stages of a job.

Worker progress

For batch jobs, Worker progress shows the workers for a particular stage. This view is not available for streaming jobs.

Each bar maps to a work item scheduled to a worker. Each worker also includes a sparkline that tracks its CPU utilization, making it easier to spot underutilization issues.
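
Spotting underutilization from CPU sparklines can be sketched as a simple average-utilization check (the worker names, samples, and threshold here are hypothetical; the view itself lets you do this by eye):

```python
def underutilized_workers(cpu_by_worker, threshold=0.25):
    """Return workers whose average CPU utilization falls below the threshold."""
    return [worker for worker, samples in cpu_by_worker.items()
            if sum(samples) / len(samples) < threshold]

# Hypothetical sparkline samples (fraction of CPU in use over time).
cpu_by_worker = {
    "worker-0": [0.90, 0.85, 0.95],   # busy
    "worker-1": [0.05, 0.10, 0.08],   # mostly idle
}
print(underutilized_workers(cpu_by_worker))  # ['worker-1']
```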

Due to the density of this visualization, you must filter this view by pre-selecting a stage. First, identify a stage in the Stage progress view. Hold the pointer over that stage, and then click View workers to enter the Worker progress view.

An example of the Worker progress view. The workers have bars and sparklines that correspond to work-item scheduling and CPU utilization.

What's next