Best practices for working with Dataflow GPUs

This page describes best practices for building pipelines by using GPUs.

For information and examples about how to enable GPUs in your Dataflow jobs, see Run a pipeline with GPUs and Processing Landsat satellite images with GPUs.

Prerequisites for using GPUs in Dataflow


When designing both your test and production environments, consider the following factors.

Local development

Using Apache Beam with NVIDIA GPUs lets you create large-scale data processing pipelines that handle preprocessing and inference. When you're using GPUs for local development, consider the following information:

  • Often, data processing workflows use additional libraries that you need to install in the launch environment and in the execution environment on Dataflow workers. This configuration adds steps to the development workflow for configuring pipeline requirements or for using custom containers in Dataflow. It's beneficial to have a local development environment that mimics the production environment as closely as possible.

  • If your workflow meets both of the following criteria, you don't need to build custom containers or change your development workflow to configure pipeline requirements:

    • You're using a library that implicitly uses NVIDIA GPUs.
    • Your code doesn't require any changes to support GPU.
  • Some libraries don't switch transparently between the CPU and GPU usage, and hence require specific builds and different code paths. To replicate the code-run-code development lifecycle for this scenario, additional steps are required.

  • When running local experiments, replicate the environment of the Dataflow worker as closely as possible. Depending on the library, you might need a machine with a GPU and the required GPU libraries installed. This type of machine might not be available in your local environment. You can emulate the Dataflow runner environment using a container running on a GPU-equipped Google Cloud virtual machine.

Machine types specifications

For details about machine type support for each GPU model, see GPU platforms. GPUs that are supported with N1 machine types also support the custom N1 machine types.

The type and number of GPUs define the upper bound restrictions on the available amounts of vCPU and memory that workers can have. To find the corresponding restrictions, see Availability.

Specifying a higher number of CPUs or memory might require that you specify a higher number of GPUs.

For more details, read GPUs on Compute Engine.

Optimize resource usage

Most pipelines are not composed entirely of transformations that require a GPU. A typical pipeline has an ingestion stage that uses one of the many sources provided by Apache Beam. That stage is followed by data manipulation or shaping transforms, which then feed into a GPU transform.

Right fitting uses Apache Beam resource hints to customize worker resources for your batch pipelines. When right fitting is enabled, Dataflow only uses GPUs for the stages of the pipeline that need them. Consequently, this feature improves pipeline flexibility and capability while potentially reducing costs.

For more information, see Right fitting.

GPUs and worker parallelism

For Python pipelines that use the Dataflow Runner v2 architecture, Dataflow launches one Apache Beam SDK process per VM core. Each SDK process runs in its own Docker container and in turn spawns many threads, each of which processes incoming data.

GPUs use multiple process architecture, and GPUs in Dataflow workers are visible to all processes and threads.

If you're running multiple SDK processes on a shared GPU, you can improve GPU efficiency and utilization by enabling the NVIDIA Multi-Process Service (MPS). MPS improves worker parallelism and overall throughput for GPU pipelines, especially for workloads with low GPU resource usage. For more information, see Improve performance on a shared GPU by using NVIDIA MPS.

To avoid GPU memory oversubscription, you might need to manage GPU access. If you're using TensorFlow, either of the following suggestions can help you avoid GPU memory oversubscription:

  • Configure the Dataflow workers to start only one containerized Python process, regardless of the worker vCPU count. To make this configuration, when launching your job, use the following pipeline options:

    • --experiments=no_use_multiple_sdk_containers
    • --number_of_worker_harness_threads

    For more information about how many threads to use, see Reduce the number of threads.

  • Enable MPS.

Inference workloads

When you're using machine learning (ML) models to do local and remote inference, use the built-in Apache Beam RunInference transform. The RunInference API is a PTransform optimized for machine learning inferences. Using the RunInference transform can improve efficiency when you use ML models in your pipelines.


The following two-stage workflow shows how to build a pipeline using GPUs. This flow takes care of GPU and non-GPU related issues separately and shortens the feedback loop.

  1. Create a pipeline

    Create a pipeline that can run on Dataflow. Replace the transforms that require GPUs with the transforms that don't use GPUs, but are functionally the same:

    1. Create all transformations that surround the GPU usage, such as data ingestion and manipulation.

    2. Create a stub for the GPU transform with a pass-through or schema change.

  2. Test locally

    Test the GPU portion of the pipeline code in the environment that mimics the Dataflow worker execution environment. The following steps describe one of the methods to run this test:

    1. Create a Docker image with all necessary libraries.

    2. Start development of the GPU code.

    3. Begin the code-run-code cycle using a Google Cloud virtual machine with the Docker image. To rule out library incompatibilities, run the GPU code in a local Python process separately from an Apache Beam pipeline. Then, run the entire pipeline on the direct runner, or launch the pipeline on Dataflow.

Use a VM running container-optimized operating system

For a minimum environment, use a container-optimized virtual machine (VM). For more information, see Create a VM with attached GPUs.

The general flow is:

  1. Create a VM.

  2. Connect to the VM and run the following commands:

    sudo cos-extensions install gpu -- -version latest
    sudo mount --bind /var/lib/nvidia /var/lib/nvidia
    sudo mount -o remount,exec /var/lib/nvidia
  3. Confirm that GPUs are available:

  4. Start a Docker container with GPU drivers from the VM mounted as volumes. For example:

    sudo docker run --rm -it --entrypoint /bin/bash
    --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64
    --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin

For a sample Dockerfile, see Build a custom container image. Add all the dependencies that you need for your pipeline to the Dockerfile.

For more information about using a Docker image that is pre-configured for GPU usage, see Use an existing image configured for GPU usage.

Tools for working with container-optimized systems

  • To configure Docker CLI to use docker-credential-gcr as a credential helper for the default set of Google Container Registries (GCR), use:

    sudo docker-credential-gcr configure-docker

    For more information about setting up Docker credentials, see docker-credential-gcr.

  • To copy files, such as pipeline code, to or from a VM, use toolbox. This technique is useful when using a Custom-Optimized image. For example:

    toolbox /google-cloud-sdk/bin/gsutil cp gs://bucket/gpu/image_process/* /media/root/home/<userid>/opencv/

    For more information, see Debugging node issues using toolbox.

What's next