Build Dataflow pipelines

This page describes the different ways to build pipelines for Dataflow, and the advantages of each method.

Google-provided templates

Google provides open source templates for dozens of prebuilt Dataflow pipelines. You can run these templates from the Google Cloud console or from the command line. Some templates support adding a user-defined function (UDF), so that you can transform the data before writing it to the output destination.
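
For example, the following command is a minimal sketch of running the Google-provided Word Count template from the command line. The job name, region, and output bucket are placeholder values:

```sh
gcloud dataflow jobs run wordcount-example \
    --gcs-location=gs://dataflow-templates/latest/Word_Count \
    --region=us-central1 \
    --parameters=inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://my-bucket/results/output
```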

Consider using a template if there is one that matches your scenario. For a complete list, see Google-provided templates.

Job builder

The job builder is a visual UI for building and running Dataflow pipelines in the Google Cloud console, without writing any code. In the job builder, you create a pipeline by selecting sources, sinks, and transforms, and connecting them to form a graph. The job builder also lets you save and load your pipelines as YAML files.
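
For illustration, here is a minimal sketch of the kind of file the job builder can save and load, assuming Beam YAML syntax. The file paths and the filter column are placeholders:

```yaml
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: gs://my-bucket/input/*.csv      # placeholder input path
    - type: Filter
      config:
        language: python
        keep: "score > 100"                   # keep rows where a hypothetical score column exceeds 100
    - type: WriteToJson
      config:
        path: gs://my-bucket/output/results   # placeholder output path
```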

Consider the job builder for the following scenarios:

  • Create custom pipelines when a Google-provided template doesn't match your scenario.
  • Build pipelines without coding.
  • Create pipelines with multiple sources or sinks.
  • Create quick prototypes.

For more information, see Job builder UI overview.

The job builder supports a subset of the sources and sinks that are available in Apache Beam. If you need one that isn't supported in the job builder, try a Google-provided template or use the Apache Beam SDK.

Apache Beam SDK

Dataflow pipelines are built on the open-source Apache Beam SDK. By using the SDK to write your pipeline, you get the full power of Apache Beam for your workloads. Pipelines can be written in Java, Python, or Go.
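
For example, the following Python sketch shows the shape of a minimal Beam pipeline that counts words in a text file. The project ID and bucket paths are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket values; replace with your own.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://dataflow-samples/shakespeare/kinglear.txt")
        | "Split" >> beam.FlatMap(str.split)                              # one element per word
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/results/counts")
    )
```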

Consider using the Apache Beam SDK if neither a Google-provided template nor the job builder covers your scenario. For example:

  • More complex pipelines that need the full Apache Beam feature set.
  • Streaming pipelines that require more sophisticated strategies for handling late data, such as reprocessing (see the sketch after this list).
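
As a sketch of the late-data case, the following Python fragment applies fixed windows with an allowed-lateness setting and a trigger that re-fires as late elements arrive. The Pub/Sub topic is a placeholder:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

options = PipelineOptions(streaming=True)  # streaming pipeline

with beam.Pipeline(options=options) as pipeline:
    counts = (
        pipeline
        | "Read" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/events")      # placeholder topic
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                        # one-minute windows
            trigger=trigger.AfterWatermark(
                late=trigger.AfterProcessingTime(30)),      # re-fire after late data arrives
            allowed_lateness=300,                           # accept elements up to five minutes late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,  # emit updated results
        )
        | "Pair" >> beam.Map(lambda message: (message, 1))
        | "Count" >> beam.CombinePerKey(sum)
    )
```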

For more information, see Use Apache Beam to build pipelines.

Notebooks

You can run Apache Beam Python code in a JupyterLab notebook. These notebooks are made available through Vertex AI Workbench, a service that hosts notebook VMs pre-installed with the latest data science and machine learning frameworks. With notebooks, you don't need to set up a development environment, and you can quickly iterate on your pipeline code. Notebooks run in a test environment, but you can export the code for production use.
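
For example, a notebook cell might use the interactive runner to build a pipeline and inspect its results inline. This is a minimal sketch:

```python
import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

pipeline = beam.Pipeline(InteractiveRunner())

counts = (
    pipeline
    | "Create" >> beam.Create(["to", "be", "or", "not", "to", "be"])
    | "Count" >> beam.combiners.Count.PerElement()
)

ib.show(counts)  # materialize and display the PCollection in the notebook
```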

For more information, see Develop Apache Beam notebooks.