Run a pipeline by using the job builder

This quickstart shows you how to run a Dataflow job by using the Dataflow job builder. The job builder is a visual UI for building and running Dataflow pipelines in the Google Cloud console, without writing any code.

In this quickstart, you load an example pipeline into the job builder, run a job, and verify that the job created output.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Google Cloud Storage JSON, and Resource Manager APIs.

    Enable the APIs

  5. Create a Cloud Storage bucket:
    1. In the Google Cloud console, go to the Cloud Storage Buckets page.

      Go to Buckets page

    2. Click Create bucket.
    3. On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
      • For Name your bucket, enter a unique bucket name. Don't include sensitive information in the bucket name, because the bucket namespace is global and publicly visible.
      • For Choose where to store your data, do the following:
        • Select a Location type option.
        • Select a Location option.
      • For Choose a default storage class for your data, select Standard.
      • For Choose how to control access to objects, select an Access control option.
      • For Advanced settings (optional), specify an encryption method, a retention policy, or bucket labels.
    4. Click Create.
  6. To complete the steps in this quickstart, your user account must have the Dataflow Admin role and the Service Account User role. The Compute Engine default service account must have the Dataflow Worker and Storage Object Admin roles. To add the required roles in the Google Cloud console:

    1. Go to the IAM page.
      Go to IAM
    2. Select your project.
    3. In the row containing your user account, click Edit principal.
    4. Click Add another role, and in the drop-down list, select Dataflow Admin.
    5. Click Add another role, and in the drop-down list, select Service Account User.
    6. Click Save.
    7. In the row containing the Compute Engine default service account, click Edit principal.
    8. Click Add another role, and in the drop-down list, select Dataflow Worker.
    9. Click Add another role, and in the drop-down list, select Storage Object Admin.
    10. Click Save.

      For more information about granting roles, see Grant an IAM role by using the console.

  7. By default, each new project starts with a default network. If the default network for your project is disabled or was deleted, your project must have a network for which your user account has the Compute Network User role (roles/compute.networkUser).

Load the example pipeline

In this step, you load an example pipeline that counts the words in Shakespeare's King Lear.

  1. Go to the Jobs page in the Google Cloud console.

    Go to Jobs

  2. Click Create job from template.

  3. Click Job builder.

  4. Click Load.

  5. Click Word Count. The job builder is populated with a graphical representation of the pipeline.

For each pipeline step, the job builder displays a card that specifies the configuration parameters for that step. For example, the first step reads text files from Cloud Storage. The location of the source data is pre-populated in the Text location box.

A screenshot of the job builder
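
The Word Count pipeline itself is conceptually simple: split the input text into words, then count how many times each word occurs. As a rough illustration only (plain Python, not the actual Apache Beam pipeline that Dataflow runs at scale), the counting logic amounts to:

```python
import re
from collections import Counter

def count_words(text):
    """Split text into words and count occurrences, loosely
    mirroring what the Word Count pipeline does. The word
    pattern here is illustrative; the real pipeline's
    tokenization may differ."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

sample = "Nothing will come of nothing: speak again."
counts = count_words(sample)
print(counts["nothing"])  # 2
```

Dataflow distributes this same split-and-count work across many workers, which is what makes it practical for inputs far larger than King Lear.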

Set the output location

In this step, you specify a Cloud Storage bucket where the pipeline writes output.

  1. Locate the card titled New sink. You might need to scroll.

  2. In the Text location box, click Browse.

  3. Select the name of the Cloud Storage bucket that you created in Before you begin.

  4. Click View child resources.

  5. In the Filename box, enter words.

  6. Click Select.
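
The Filename value you enter is a prefix, not the final file name. Apache Beam's text sink, which Dataflow uses here, typically appends a shard suffix of the form -SSSSS-of-NNNNN, which is why the output in this quickstart appears as words-00000-of-00001. The helper below is a hypothetical sketch of that naming convention, assuming the default shard template:

```python
def shard_file_name(prefix, shard_index, num_shards):
    """Build an output file name using the default Beam-style
    shard template -SSSSS-of-NNNNN (illustrative helper; not
    part of any Google Cloud library)."""
    return f"{prefix}-{shard_index:05d}-of-{num_shards:05d}"

print(shard_file_name("words", 0, 1))  # words-00000-of-00001
```

A pipeline that writes multiple shards would produce words-00000-of-00003, words-00001-of-00003, and so on.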

Run the job

Click Run job. The job builder creates a Dataflow job and then navigates to the job graph. When the job starts, the job graph shows a graphical representation of the pipeline, similar to the one shown in the job builder. As each step of the pipeline runs, the status is updated in the job graph.

The Job info panel shows the overall status of the job. If the job completes successfully, the Job status field updates to Succeeded.

Examine the job output

When the job completes, perform the following steps to see the output from the pipeline:

  1. In the Google Cloud console, go to the Cloud Storage Buckets page.

    Go to Buckets

  2. In the bucket list, click the name of the bucket that you created in Before you begin.

  3. Click the file named words-00000-of-00001.

  4. On the Object details page, click the authenticated URL to view the pipeline output.
  4. On the Object details page, click the authenticated URL to view the pipeline output.

The output should look similar to the following:

brother: 20
deeper: 1
wrinkles: 1
'alack: 1
territory: 1
dismiss'd: 1
[....]
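
Each output line is a word and its count, separated by a colon and a space. If you download the file, a few lines of Python are enough to read it back into a dictionary (a sketch assuming the line format shown above; parse_counts is a hypothetical helper):

```python
def parse_counts(lines):
    """Parse 'word: count' lines, as in the quickstart's
    example pipeline output, into a dict of word -> count."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # rpartition handles words that themselves contain
        # apostrophes or other characters, e.g. "dismiss'd".
        word, _, count = line.rpartition(": ")
        counts[word] = int(count)
    return counts

sample = ["brother: 20", "deeper: 1", "dismiss'd: 1"]
print(parse_counts(sample)["brother"])  # 20
```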

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the project

The easiest way to eliminate billing is to delete the Google Cloud project that you created for the quickstart.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the individual resources

If you want to keep the Google Cloud project that you used in this quickstart, then delete the Cloud Storage bucket:

  1. In the Google Cloud console, go to the Cloud Storage Buckets page.

    Go to Buckets

  2. Click the checkbox for the bucket that you want to delete.
  3. To delete the bucket, click Delete, and then follow the instructions.

What's next