Configure Dataflow worker VMs

This document describes how to configure the worker VMs for a Dataflow job.

By default, Dataflow selects the machine type for the worker VMs that run your job, along with the size and type of Persistent Disk. To configure the worker VMs, set the following pipeline options when you create the job.

Machine type

The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use x86 or Arm machine types, including custom machine types.

Java

Set the workerMachineType pipeline option.

Python

Set the machine_type pipeline option.

Go

Set the worker_machine_type pipeline option.
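For example, with the Apache Beam Python SDK you can pass the option when constructing the pipeline options. This is a minimal sketch; the runner, project, region, bucket, and machine type values are placeholders, not recommendations:

  from apache_beam.options.pipeline_options import PipelineOptions

  # All values here are placeholders; replace them with values for your job.
  options = PipelineOptions(flags=[
      '--runner=DataflowRunner',
      '--project=PROJECT_ID',
      '--region=us-central1',
      '--temp_location=gs://BUCKET/temp',
      '--machine_type=n2-standard-4',
  ])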

Custom machine types

To specify a custom machine type, use the following format: FAMILY-vCPU-MEMORY. Replace the following:

  • FAMILY. Use one of the following values:
    Machine series    Value
    N1                custom
    N2                n2-custom
    N2D               n2d-custom
    N4                n4-custom
    E2                e2-custom
  • vCPU. The number of vCPUs.
  • MEMORY. The memory, in MB.

For example, n2-custom-6-3072 specifies an N2 machine type with 6 vCPUs and 3072 MB of memory. To enable extended memory, append -ext to the machine type, for example n2-custom-2-32768-ext.
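As an illustration, with the Python SDK a custom machine type is passed through the same machine_type option. The value shown here is only an example:

  from apache_beam.options.pipeline_options import PipelineOptions

  # n2-custom-6-3072 follows the FAMILY-vCPU-MEMORY format:
  # N2 machine series, 6 vCPUs, 3072 MB of memory.
  options = PipelineOptions(flags=['--machine_type=n2-custom-6-3072'])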

For more information about valid custom machine types, see Custom machine types in the Compute Engine documentation.

Disk type

The type of Persistent Disk to use.

Don't specify a Persistent Disk when using Streaming Engine.

Java

Set the workerDiskType pipeline option.

Python

Set the worker_disk_type pipeline option.

Go

Set the disk_type pipeline option.

To specify the disk type, use the following format: compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/DISK_TYPE.

Replace the following:

  • PROJECT_ID: your project ID
  • ZONE: the zone for the Persistent Disk, for example us-central1-b
  • DISK_TYPE: the disk type, either pd-ssd or pd-standard

For more information, see the Compute Engine API reference page for diskTypes.
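For example, with the Python SDK you can set the worker_disk_type option to the full disk type string. The project ID, zone, and disk type below are placeholders:

  from apache_beam.options.pipeline_options import PipelineOptions

  # Replace the project ID, zone, and disk type with your own values.
  options = PipelineOptions(flags=[
      '--worker_disk_type=compute.googleapis.com/projects/PROJECT_ID/zones/us-central1-b/diskTypes/pd-ssd',
  ])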

Disk size

The Persistent Disk size.

Java

Set the diskSizeGb pipeline option.

Python

Set the disk_size_gb pipeline option.

Go

Set the disk_size_gb pipeline option.

If you set this option, specify at least 30 GB to account for the worker boot image and local logs.
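For example, with the Python SDK you can set the disk_size_gb option. The value shown is only an illustration and should be at least 30 GB:

  from apache_beam.options.pipeline_options import PipelineOptions

  # 50 GB is only an example value; specify at least 30 GB.
  options = PipelineOptions(flags=['--disk_size_gb=50'])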

Lowering the disk size reduces available shuffle I/O. For shuffle-bound jobs that don't use Dataflow Shuffle or Streaming Engine, a smaller disk can increase runtime and job cost.

Batch jobs

For batch jobs using Dataflow Shuffle, this option sets the size of the worker VM boot disk. For batch jobs not using Dataflow Shuffle, this option sets the size of the disks used to store shuffled data; the boot disk size is not affected.

If a batch job uses Dataflow Shuffle, then the default disk size is 25 GB. Otherwise, the default is 250 GB.

Streaming jobs

For streaming jobs using Streaming Engine, this option sets the size of the boot disks. For streaming jobs not using Streaming Engine, this option sets the size of each additional Persistent Disk created by the Dataflow service; the boot disk is not affected.

If a streaming job does not use Streaming Engine, you can set the boot disk size with the experiment flag streaming_boot_disk_size_gb. For example, specify --experiments=streaming_boot_disk_size_gb=80 to create boot disks of 80 GB.
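With the Python SDK, the same experiment can be passed through the pipeline options; the 80 GB value is simply the example used above:

  from apache_beam.options.pipeline_options import PipelineOptions

  # Creates 80 GB boot disks for streaming workers that don't use Streaming Engine.
  options = PipelineOptions(flags=['--experiments=streaming_boot_disk_size_gb=80'])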

If a streaming job uses Streaming Engine, then the default disk size is 30 GB. Otherwise, the default is 400 GB.

What's next