Overview of Dataform features

Dataform is a serverless service that lets data analysts develop and deploy tables, incremental tables, and views to BigQuery. It offers a web environment for SQL workflow development, integration with GitHub, GitLab, Azure DevOps Services, and Bitbucket, continuous integration and continuous deployment, and workflow execution.

Repositories

Each Dataform project is stored in a repository. A Dataform repository houses a collection of JSON configuration files, SQLX files, and JavaScript files.

Dataform repositories contain the following types of files:

  • Config files

    JSON or SQLX config files let you configure your SQL workflows. They contain general configuration, execution schedules, or schemas for creating new tables and views.

  • Definitions

    Definitions are SQLX and JavaScript files that define new tables, views, and additional SQL operations to run in BigQuery.

  • Includes

    Includes are JavaScript files where you can define variables and functions to use in your project.
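
As an illustration of the includes described above, the following is a minimal sketch of what such a file might look like. The file name `includes/constants.js`, the variable `REPORTING_START_DATE`, and the helper `regionFor` are hypothetical names chosen for this example; they are not part of Dataform itself.

```javascript
// includes/constants.js (hypothetical file name and contents)

// A shared variable that definition files can reference.
const REPORTING_START_DATE = "2024-01-01";

// A helper that returns a SQL CASE expression mapping a
// country-code column to a coarse region label.
function regionFor(countryColumn) {
  return `CASE
    WHEN ${countryColumn} IN ('US', 'CA', 'MX') THEN 'NA'
    WHEN ${countryColumn} IN ('DE', 'FR', 'GB') THEN 'EU'
    ELSE 'OTHER'
  END`;
}

// Export everything that other files should be able to use.
module.exports = { REPORTING_START_DATE, regionFor };
```

In a SQLX definition file, exported values like these would typically be referenced through the include's file name without the extension, for example `${constants.REPORTING_START_DATE}` or `${constants.regionFor("country_code")}`.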

Each Dataform repository is connected to a service account. You can select a service account when you create a repository or edit the service account later.

By default, Dataform uses a service account derived from your project number in the following format:

service-YOUR_PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com
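
The format above can be sketched as a small helper that builds the default service account address from a project number. The function name `defaultDataformServiceAccount` and the sample project number are assumptions made for illustration only.

```javascript
// Builds the default Dataform service account email for a project number.
// The function name is hypothetical; only the address format comes from the text above.
function defaultDataformServiceAccount(projectNumber) {
  const digits = String(projectNumber);
  // Project numbers are numeric, so reject anything else early.
  if (!/^\d+$/.test(digits)) {
    throw new Error(`Expected a numeric project number, got "${projectNumber}"`);
  }
  return `service-${digits}@gcp-sa-dataform.iam.gserviceaccount.com`;
}

// Example with a made-up project number.
console.log(defaultDataformServiceAccount("123456789012"));
// prints "service-123456789012@gcp-sa-dataform.iam.gserviceaccount.com"
```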

Version control

Dataform uses the Git version control system to maintain a record of each change made to project files and to manage file versions.

Each Dataform repository can manage its own Git repository, or be connected to a remote third-party Git repository. You can connect a Dataform repository to a GitHub, GitLab, Azure DevOps Services, or Bitbucket repository.

You version control your SQL workflow code inside Dataform workspaces. In a Dataform workspace, you can pull changes from the repository, commit all or selected changes, and push them to Git branches of the repository.

Workflow development

In Dataform, you make changes to files and directories inside a development workspace. A development workspace is a virtual, editable copy of the contents of a Git repository. Dataform preserves the state of files in your development workspace between sessions.

In a development workspace, you can develop SQL workflow actions by using Dataform core with SQLX and JavaScript, or exclusively with JavaScript. You can automatically format your Dataform core or JavaScript code.

Each element of a Dataform SQL workflow, such as a table or assertion, corresponds to an action that Dataform performs in BigQuery. For example, a table definition file corresponds to the action of creating or updating that table in BigQuery.

In a Dataform workspace, you can develop the following SQL workflow actions:

You can use JavaScript to reuse your Dataform SQL workflow code in the following ways:

Dataform compiles the SQL workflow code in your workspace in real time. In your workspace, you can view the compiled queries and the details of the actions in each file. You can also view the compilation status and errors for the file you are editing or for the whole repository.

To test the output of a compiled SQL query before you execute it in BigQuery, you can run a preview of the query in your Dataform workspace.

To inspect the entire SQL workflow defined in your workspace, you can view an interactive compiled graph that shows all compiled actions in your SQL workflow and relationships between them.

Workflow compilation

Dataform uses default compilation settings, configured in dataform.json, to compile the SQL workflow code in your workspace to SQL in real time. The output is a compilation result of the workspace.

You can override compilation settings to customize how Dataform compiles your SQL workflow into a compilation result.

With workspace compilation overrides, you can configure compilation overrides for all workspaces in a repository. You can set dynamic workspace overrides to create a custom compilation result for each workspace, which turns workspaces into isolated development environments. You can override the Google Cloud project in which Dataform executes the contents of a workspace, add a prefix to the names of all compiled tables, and add a suffix to the default schema.

With release configurations, you can configure templates of compilation settings for creating compilation results of a Dataform repository. In a release configuration, you can override the Google Cloud project in which Dataform executes compilation results, add a prefix to the names of all compiled tables, add a suffix to the default schema, and add compilation variables. You can also set the frequency of creating compilation results. To schedule executions of compilation results created in a selected release configuration, you can create a workflow configuration.
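
To make the effect of these overrides concrete, the following sketch models how a project override, a table prefix, and a schema suffix might change the fully qualified name of a compiled table. This is not Dataform's implementation: the function names, the option names `defaultDatabase`, `tablePrefix`, and `schemaSuffix`, and the use of an underscore as the joining character are all assumptions made for illustration.

```javascript
// Illustrative model only: applies compilation overrides to a table target.
// The option names and the underscore separator are assumptions.
function applyCompilationOverrides(target, overrides = {}) {
  const { defaultDatabase, tablePrefix, schemaSuffix } = overrides;
  return {
    // Override the Google Cloud project if one is given.
    database: defaultDatabase || target.database,
    // Append the suffix to the schema (dataset) name.
    schema: schemaSuffix ? `${target.schema}_${schemaSuffix}` : target.schema,
    // Prepend the prefix to the table name.
    name: tablePrefix ? `${tablePrefix}_${target.name}` : target.name,
  };
}

// Formats a target as project.schema.table.
function qualifiedName(target) {
  return `${target.database}.${target.schema}.${target.name}`;
}

// A made-up table as it would compile with default settings.
const base = { database: "analytics-prod", schema: "reporting", name: "orders" };

// The same table compiled for an isolated development workspace.
const dev = applyCompilationOverrides(base, {
  defaultDatabase: "analytics-dev",
  tablePrefix: "alice",
  schemaSuffix: "dev",
});

console.log(qualifiedName(base)); // prints "analytics-prod.reporting.orders"
console.log(qualifiedName(dev));  // prints "analytics-dev.reporting_dev.alice_orders"
```

Because each workspace can resolve to its own project, schema, and table names, two developers can execute the same workflow code without overwriting each other's tables or the production tables.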

Workflow execution

During workflow execution, Dataform executes compilation results of SQL workflows to create or update assets in BigQuery.

To create or refresh the tables and views defined in your SQL workflow in BigQuery, you can start a workflow execution manually in a development workspace or schedule executions.

You can schedule Dataform executions in BigQuery in the following ways:

To debug errors, you can monitor executions in the following ways:

What's next