Stay organized with collections
Save and categorize content based on your preferences.
Batch jobs use Dataflow shuffle by default.
Dataflow shuffle
moves the shuffle operation out of the worker VMs and into the
Dataflow service backend.
The information on this page applies to batch jobs. Streaming jobs use a
different shuffle mechanism, called
streaming shuffle.
About Dataflow shuffle
Dataflow shuffle is the base operation behind
Dataflow transforms such as GroupByKey, CoGroupByKey, and
Combine.
The Dataflow shuffle operation partitions and groups
data by key in a scalable, efficient, fault-tolerant manner.
Benefits of Dataflow shuffle
The service-based Dataflow shuffle has the following benefits:
Faster execution time of batch pipelines for the majority of pipeline job
types.
A reduction in consumed CPU, memory, and Persistent Disk storage resources
on the worker VMs.
Better Horizontal Autoscaling, because
VMs don't hold any shuffle data and can therefore be scaled down earlier.
Better fault tolerance, because an unhealthy VM holding Dataflow
shuffle data doesn't cause the entire job to fail.
Support and limitations
This feature is available in all regions where Dataflow is supported. To see available locations, read Dataflow locations. There might be performance differences between regions.
Workers must be deployed in the same region as the Dataflow job.
Don't specify the zone pipeline option. Instead, specify the region, and
set the value to one of the available regions. Dataflow
automatically selects the zone in the region you specified.
If you specify the zone
pipeline option and set it to a zone outside of the available regions, the
Dataflow job returns an error. If you set an incompatible combination
of region and zone, your job can't use Dataflow shuffle.
For Python, Dataflow shuffle requires Apache Beam SDK
for Python version 2.1.0 or later.
Disk size considerations
The default boot disk size for each batch job is 25 GB. For some batch jobs,
you might be required to modify the size of the disk. Consider the following:
A worker VM uses part of the 25 GB of disk space for the operating system,
binaries, logs, and containers. Jobs that use a significant amount of disk and
exceed the remaining disk capacity may fail when you use
Dataflow shuffle.
Jobs that use a lot of disk I/O may be slow due to the performance of the
small disk. For more information about performance differences between disk
sizes, see
Compute Engine Persistent Disk Performance.
To specify a larger disk size for a Dataflow shuffle job, you can
use the --disk_size_gb
parameter.
Pricing
Most of the reduction in worker resources comes from offloading the shuffle work
to the Dataflow service. For that reason, there is a
charge associated with the use of Dataflow
shuffle. The execution times might vary from run to run. If you are running
a pipeline that has important deadlines, we recommend allocating sufficient
buffer time before the deadline.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-26 UTC."],[[["\u003cp\u003eDataflow shuffle, which is used by default for batch jobs, moves shuffle operations to the Dataflow service backend, improving efficiency.\u003c/p\u003e\n"],["\u003cp\u003eDataflow shuffle is the foundational operation for Dataflow transforms like \u003ccode\u003eGroupByKey\u003c/code\u003e, \u003ccode\u003eCoGroupByKey\u003c/code\u003e, and \u003ccode\u003eCombine\u003c/code\u003e, enabling scalable and fault-tolerant data partitioning and grouping.\u003c/p\u003e\n"],["\u003cp\u003eBenefits of Dataflow shuffle include faster batch pipeline execution, reduced worker VM resource consumption, improved horizontal autoscaling, and better fault tolerance.\u003c/p\u003e\n"],["\u003cp\u003eDataflow shuffle is not available for streaming jobs and requires worker VMs to be deployed in the same region as the Dataflow job, avoiding the specification of a \u003ccode\u003ezone\u003c/code\u003e pipeline option.\u003c/p\u003e\n"],["\u003cp\u003eDataflow shuffle charges apply, and the default 25 GB boot disk size may need to be increased for jobs with heavy disk I/O to ensure optimal performance.\u003c/p\u003e\n"]]],[],null,["# Dataflow shuffle for batch jobs\n\nBatch jobs use Dataflow shuffle by default.\nDataflow shuffle\nmoves the shuffle operation out of the worker VMs and into the\nDataflow service backend.\n\nThe information on this page applies to batch jobs. Streaming jobs use a\ndifferent shuffle mechanism, called\n[streaming shuffle](/dataflow/docs/concepts/exactly-once#streaming-shuffle).\n\nAbout Dataflow shuffle\n----------------------\n\n- Dataflow shuffle is the base operation behind Dataflow transforms such as `GroupByKey`, `CoGroupByKey`, and `Combine`.\n- The Dataflow shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner.\n\nBenefits of Dataflow shuffle\n----------------------------\n\nThe service-based Dataflow shuffle has the following benefits:\n\n- Faster execution time of batch pipelines for the majority of pipeline job types.\n- A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.\n- Better [Horizontal Autoscaling](/dataflow/docs/horizontal-autoscaling), because VMs don't hold any shuffle data and can therefore be scaled down earlier.\n- Better fault tolerance, because an unhealthy VM holding Dataflow shuffle data doesn't cause the entire job to fail.\n\nSupport and limitations\n-----------------------\n\n- This feature is available in all regions where Dataflow is supported. To see available locations, read [Dataflow locations](/dataflow/docs/resources/locations). There might be performance differences between regions.\n- Workers must be deployed in the same region as the Dataflow job.\n- Don't specify the `zone` pipeline option. Instead, specify the `region`, and\n set the value to one of the available regions. Dataflow\n automatically selects the zone in the region you specified.\n\n If you specify the `zone`\n pipeline option and set it to a zone outside of the available regions, the\n Dataflow job returns an error. If you set an incompatible combination\n of `region` and `zone`, your job can't use Dataflow shuffle.\n- For Python, Dataflow shuffle requires Apache Beam SDK\n for Python version 2.1.0 or later.\n\nDisk size considerations\n------------------------\n\nThe default boot disk size for each batch job is 25 GB. For some batch jobs,\nyou might be required to modify the size of the disk. Consider the following:\n\n- A worker VM uses part of the 25 GB of disk space for the operating system, binaries, logs, and containers. Jobs that use a significant amount of disk and exceed the remaining disk capacity may fail when you use Dataflow shuffle.\n- Jobs that use a lot of disk I/O may be slow due to the performance of the small disk. For more information about performance differences between disk sizes, see [Compute Engine Persistent Disk Performance](/compute/docs/disks/performance).\n\nTo specify a larger disk size for a Dataflow shuffle job, you can\nuse the [`--disk_size_gb`](/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options)\nparameter.\n\nPricing\n-------\n\nMost of the reduction in worker resources comes from offloading the shuffle work\nto the Dataflow service. For that reason, there is a\n[charge](/dataflow/pricing) associated with the use of Dataflow\nshuffle. The execution times might vary from run to run. If you are running\na pipeline that has important deadlines, we recommend allocating sufficient\nbuffer time before the deadline."]]