The job builder lets you create custom batch and streaming Dataflow jobs. You can also save job builder jobs as Apache Beam YAML files to share and reuse.
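A saved job builder job is an Apache Beam YAML file. As a rough sketch of what such a file can contain, the following minimal pipeline reads CSV files, filters rows with a Python expression, and writes the results. The bucket paths, transform names, and the score field are placeholders, not values the job builder generates for you.

# Minimal Beam YAML pipeline sketch (paths, names, and the score field are placeholders).
pipeline:
  transforms:
    - type: ReadFromCsv
      name: ReadInput
      config:
        path: gs://example-bucket/input/*.csv
    - type: Filter
      name: KeepValidRows
      input: ReadInput
      config:
        language: python
        keep: "score > 0"
    - type: WriteToCsv
      name: WriteOutput
      input: KeepValidRows
      config:
        path: gs://example-bucket/output/results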
Create a new pipeline
To create a new pipeline in the job builder, follow these steps:
Go to the Jobs page in the Google Cloud console.
Click Create job from builder.
For Job name, enter a name for the job.
Select either Batch or Streaming.
If you select Streaming, select a windowing mode. Then enter a specification for the window, as follows:
- Fixed window: Enter a window size, in seconds.
- Sliding window: Enter a window size and window period, in seconds.
- Session window: Enter a session gap, in seconds.
For more information about windowing, see Windows and windowing functions.
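If you save the pipeline as Beam YAML, the windowing you select is represented as a windowing block. The following sketch shows what a 60-second fixed window can look like when applied with a WindowInto transform; the transform and input names are placeholders.

# Apply 60-second fixed windows to the output of a (placeholder) earlier step.
# A sliding window typically also takes a period; a session window takes a gap instead of a size.
- type: WindowInto
  name: FixedWindows
  input: ReadEvents
  windowing:
    type: fixed
    size: 60s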
Next, add sources, transforms, and sinks to the pipeline, as described in the following sections.
Add a source to the pipeline
A pipeline must have at least one source. Initially, the job builder is populated with an empty source. To configure the source, perform the following steps:
In the Source name box, enter a name for the source or use the default name. The name appears in the job graph when you run the job.
In the Source type list, select the type of data source.
Depending on the source type, provide additional configuration information. For example, if you select BigQuery, specify the table to read from.
If you select Pub/Sub, specify a message schema. Enter the name and data type of each field that you want to read from Pub/Sub messages. The pipeline drops any fields that aren't specified in the schema.
Optional: For some source types, you can click Preview source data to preview the source data.
To add another source to the pipeline, click Add a source. To combine data from multiple sources, add a SQL or Join transform to your pipeline, as in the sketch that follows.
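For example, a Pub/Sub source with a message schema and a BigQuery source can be combined with a SQL transform along the lines of the following Beam YAML sketch. The project, topic, table, and field names are placeholders.

# Two sources combined with a SQL transform (project, topic, table, and field names are placeholders).
- type: ReadFromPubSub
  name: ReadOrders
  config:
    topic: projects/example-project/topics/orders
    format: JSON
    schema:
      type: object
      properties:
        user_id: {type: string}
        amount: {type: number}
- type: ReadFromBigQuery
  name: ReadUsers
  config:
    table: example-project.example_dataset.users
- type: Sql
  name: JoinOrdersToUsers
  input:
    orders: ReadOrders
    users: ReadUsers
  config:
    query: |
      SELECT users.user_id, orders.amount
      FROM orders
      JOIN users ON orders.user_id = users.user_id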
Add a transform to the pipeline
Optionally, add one or more transforms to the pipeline. To add a transform:
Click Add a transform.
In the Transform name box, enter a name for the transform or use the default name. The name appears in the job graph when you run the job.
In the Transform type list, select the type of transform.
Depending on the transform type, provide additional configuration information. For example, if you select Filter (Python), enter a Python expression to use as the filter.
Select the input step for the transform. The input step is the source or transform whose output provides the input for this transform.
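For example, a Filter (Python) transform corresponds roughly to the following Beam YAML sketch when you save the job; the transform name, input step, and amount field are placeholders.

# Keep only rows whose amount field exceeds 100 (names and field are placeholders).
- type: Filter
  name: KeepLargeOrders
  input: ReadOrders
  config:
    language: python
    keep: "amount > 100"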
Add a sink to the pipeline
A pipeline must have at least one sink. Initially, the job builder is populated with an empty sink. To configure the sink, perform the following steps:
In the Sink name box, enter a name for the sink or use the default name. The name appears in the job graph when you run the job.
In the Sink type list, select the type of sink.
Depending on the sink type, provide additional configuration information. For example, if you select the BigQuery sink, select the BigQuery table to write to.
Select the input step for the sink. The input step is the source or transform whose output provides the input for this sink.
To add another sink to the pipeline, click Add a sink.
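For example, a BigQuery sink corresponds roughly to a WriteToBigQuery transform in the saved Beam YAML, as in the following sketch; the input step and table name are placeholders.

# Write the output of a previous step to a BigQuery table (input and table are placeholders).
- type: WriteToBigQuery
  name: WriteResults
  input: KeepLargeOrders
  config:
    table: example-project.example_dataset.results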
Run the pipeline
To run a pipeline from the job builder, perform the following steps:
Optional: Set Dataflow job options. To expand the Dataflow options section, click the expander arrow.
Click Run job. The job builder navigates to the job graph for the submitted job. You can use the job graph to monitor the status of the job.
Validate the pipeline before launching
For pipelines with complex configuration, such as Python filters and SQL expressions, it can be helpful to check the pipeline configuration for syntax errors before launching. To validate the pipeline syntax, perform the following steps:
- Click Validate to open Cloud Shell and start the validation service.
- Click Start Validating.
- If an error is found during validation, a red exclamation mark appears.
- Fix any detected errors and verify the fixes by clicking Validate. If no error is found, a green checkmark appears.
Run with the gcloud CLI
You can also run Beam YAML pipelines by using the gcloud CLI. To run a job builder pipeline with the gcloud CLI:
Click Save to open the Save YAML window.
Perform one of the following actions:
- To save to Cloud Storage, enter a Cloud Storage path and click Save.
- To download a local file, click Download.
Run the following command in your shell or terminal:
gcloud dataflow yaml run my-job-builder-job --yaml-pipeline-file=YAML_FILE_PATH
Replace YAML_FILE_PATH with the path of your YAML file, either a local path or a Cloud Storage path.
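For example, the following command runs a pipeline that was saved to Cloud Storage; the job name, bucket path, and region are placeholders.

# Run a saved Beam YAML pipeline directly from Cloud Storage (path and region are examples).
gcloud dataflow yaml run my-job-builder-job \
    --yaml-pipeline-file=gs://example-bucket/pipelines/my-pipeline.yaml \
    --region=us-central1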
What's next
- Use the Dataflow job monitoring interface.
- Save and load YAML job definitions in the job builder.
- Learn more about Beam YAML.