Orchestrate data preparations

This document describes how to orchestrate data preparation pipelines, including how to perform manual and scheduled runs.

Data preparations are powered by Dataform.

Data preparations run using custom Dataform service accounts, which you select when you configure schedules or test runs. For more information, see About service accounts in Dataform.

Changes you make to the data preparation steps aren't automatically saved. You must save and deploy the changes before they can be executed with a schedule. Schedules always run the latest deployed version of your data preparation and exclude any undeployed changes you might be developing.

Before you begin

Create a data preparation before you follow the procedures in this document.

Required roles

To run data preparations, grant the required roles to the service account that you plan to use for the data preparation runs. For more information, see the required roles.
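As a rough sketch, project-level role grants can be scripted with the `gcloud projects add-iam-policy-binding` command. The Python helper below only builds the command lines and doesn't execute them. The function name `build_grant_commands`, the project ID, the service account email, and the role ID are hypothetical placeholders; the roles to grant are the ones listed in the required roles documentation.

```python
import shlex


def build_grant_commands(project_id, service_account_email, roles):
    """Return one gcloud command (as an argument list) per role to grant."""
    member = f"serviceAccount:{service_account_email}"
    return [
        [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ]
        for role in roles
    ]


# Placeholder values for illustration only; substitute your own project,
# service account, and the roles from the required roles documentation.
commands = build_grant_commands(
    project_id="my-project",
    service_account_email="data-prep-runner@my-project.iam.gserviceaccount.com",
    roles=["roles/EXAMPLE_ROLE"],
)

for command in commands:
    # Print a copy-pasteable command line instead of executing it.
    print(shlex.join(command))
```

Building the commands as argument lists keeps quoting explicit, so you can review each grant before you run it in a shell.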

Develop a data preparation

As you develop a data preparation, you can manually run the steps and inspect the output before you deploy the changes to production. You can test the version that you're developing on your data while BigQuery continues to run the latest deployed version on its schedule. Before you can run the data preparation, you must configure the destination and fix any validation errors.
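The relationship between the version in development and the deployed version can be pictured with a small toy model. This is a conceptual sketch only, not a BigQuery or Dataform API: the class `DataPreparation`, its methods, and the step names are hypothetical, and the model folds saving into deploying for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataPreparation:
    """Toy model: a working copy of the steps plus the last deployed copy."""
    working_steps: List[str] = field(default_factory=list)
    deployed_steps: Optional[List[str]] = None  # None until the first deploy

    def edit(self, steps):
        # Editing changes only the version under development.
        self.working_steps = list(steps)

    def deploy(self):
        # Deploying snapshots the working version for scheduled runs.
        self.deployed_steps = list(self.working_steps)

    def manual_run(self):
        # A manual run from the editor tests the version in development.
        return list(self.working_steps)

    def scheduled_run(self):
        # A schedule runs only the latest deployed version.
        if self.deployed_steps is None:
            raise RuntimeError("Deploy the data preparation before scheduling it.")
        return list(self.deployed_steps)


prep = DataPreparation()
prep.edit(["filter rows", "rename column"])
prep.deploy()
prep.edit(["filter rows", "rename column", "join table"])

print(prep.manual_run())     # the working version, including the new step
print(prep.scheduled_run())  # the deployed version, without the new step
```

In this model, the third step reaches scheduled runs only after another call to `deploy()`, which mirrors the requirement to deploy changes before a schedule can execute them.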

Manually run a data preparation in development

To test your data preparation steps and validate the results in your destination table, run the data preparation manually from the data preparation editor:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project and the Data preparations folder. Click the name of the data preparation that you want to run.

  3. Configure the permissions on the service account for the run:

    1. In the data preparation editor toolbar, hold the pointer over the disabled Run option.
    2. In the dialog that appears with information about configuring the service account, click Configure.
    3. In the Service account settings dialog, select a service account.
    4. If the service account needs additional permissions, grant it the required roles by clicking Grant all.
    5. Click Save.
  4. Optional: To update the service account for future runs, go to the data preparation editor toolbar and click More > Configure run now experience, and then update and save the service account settings.

  5. Fix any validation errors that appear.

  6. From the data preparation editor toolbar, click Run.

  7. In the Run now dialog, click Confirm to acknowledge that this manual run writes data to a destination table, which you might also be using for scheduled runs.

    The run then executes your steps and loads the output to the destination.

  8. Optional: After the run is complete, you can view the details about the execution in the Executions pane.

Deploy a data preparation

To schedule runs for a version of your data preparation, you must first deploy it. Schedules run the most recently deployed version.

To deploy a data preparation, follow these steps:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project and the Data preparations folder. Click the name of the data preparation that you want to deploy.

    The data preparation editor opens.

  3. In the data preparation editor toolbar, click Deploy.

Create a schedule

A schedule executes the deployed data preparation steps and loads the prepared data into the destination table. Before you schedule a run, you must configure the destination and fix any validation errors.

To create a schedule, follow these steps:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project and the Data preparations folder. Click the name of the data preparation that you want to schedule.

  3. From the data preparation editor toolbar, click Schedule.

  4. Enter a schedule name.

  5. Enter the service account name associated with the execution.

  6. Select a schedule frequency.

  7. Click Create schedule.

Manually run a scheduled data preparation

When you manually run a scheduled data preparation, BigQuery executes it once, independently of the schedule.

To manually run a scheduled data preparation, follow these steps:

  1. In the Google Cloud console, go to the Scheduling page.

    Go to Scheduling

  2. Click the name of the selected data preparation schedule.

  3. On the Schedule details page, click Run.

View schedules

You can view data preparation schedules from the data preparation editor or the Scheduling page.

Data preparation editor

To view the schedule for a data preparation, follow these steps:

  1. In the data preparation editor toolbar, click View schedule.
  2. Optional: To view the schedule history, click View past executions.

Scheduling page

To view all data preparation schedules in your project, follow these steps:

  1. In the Google Cloud console, go to the Scheduling page.

    Go to Scheduling

  2. Optional: To view the run history and details of a schedule, click the name of the schedule. The history of manual runs isn't shown.

Edit a schedule

You can edit a schedule from the data preparation editor or the Scheduling page.

Data preparation editor

To edit a schedule, follow these steps:

  1. In the data preparation editor toolbar, click View schedule.
  2. In the Schedule data preparation dialog, click Edit and then update the schedule.
  3. Click Update schedule.

Scheduling page

To edit a schedule, follow these steps:

  1. In the Google Cloud console, go to the Scheduling page.

    Go to Scheduling

  2. Click the name of the selected data preparation schedule.

  3. On the Schedule details page, click Edit.

  4. Click View schedule.

  5. In the Schedule data preparation dialog, click Edit and then update the schedule.

  6. Click Update schedule.

Delete a schedule

To permanently delete a schedule for a selected data preparation, follow these steps:

  1. In the Google Cloud console, go to the Scheduling page.

    Go to Scheduling

  2. In the row that contains the schedule, click Actions > Delete.

What's next