Dataform overview

This document introduces you to Dataform concepts and processes.

Dataform is a service for data analysts to develop, test, version control, and schedule complex SQL workflows for data transformation in BigQuery.

Dataform lets you manage data transformation in the Extraction, Loading, and Transformation (ELT) process for data integration. After raw data is extracted from source systems and loaded into BigQuery, Dataform helps you to transform it into a well-defined, tested, and documented suite of data tables.

Dataform lets you perform the following data transformation actions:

  • Develop and execute SQL workflows for data transformation.
  • Collaborate with team members on SQL workflow development through Git.
  • Manage a large number of tables and their dependencies.
  • Declare source data and manage table dependencies.
  • View a visualization of the dependency tree of your SQL workflow.
  • Manage data with SQL code in a central repository.
  • Reuse code with JavaScript.
  • Test data correctness with quality tests on source and output tables.
  • Version control SQL code.
  • Document data tables inside SQL code.

Data transformation processes in Dataform

The data transformation workflow for Dataform is as follows:

  1. Dataform lets you create repositories to manage your code.
  2. Dataform lets you create workspaces for development.
  3. Dataform lets you develop SQL workflows in a development workspace.
  4. Dataform compiles Dataform core into SQL.
  5. Dataform executes the dependency tree.

Dataform lets you create repositories to manage your code

In a Dataform repository, you use Dataform core, an extension of SQL, to write SQLX files in which you define your workflow. Dataform repositories support version-control. You can link a Dataform repository to a third-party Git provider.

Dataform lets you create workspaces for development

You can create development workspaces inside a Dataform repository for Dataform core development. In a development workspace, you can make changes to the repository, compile, test, and push them to the main repository through Git.

Dataform lets you develop Dataform core in a development workspace

In a development workspace, you can define and document tables, their dependencies, and transformation logic to build your SQL workflow. You can also configure actions in JavaScript.

Dataform compiles Dataform core

During compilation, Dataform performs the following tasks:

  • Compiles Dataform core into a SQL workflow of Standard SQL.
  • Adds boilerplate SQL statements, such as CREATE TABLE or INSERT, to the code inline with your query configuration.
  • Transpiles (compiles source-to-source) JavaScript into SQL.
  • Resolves dependencies and checks for errors including missing or circular dependencies.
  • Builds the dependency tree of all actions to be run in BigQuery.

Dataform compilation is hermetic to ensure compilation consistency, meaning that the same code compiles to the same SQL compilation result every time. Dataform compiles your code in a sandbox environment with no internet access. No additional actions, such as calling external APIs, are available during compilation.

To debug in real-time, you can inspect the compiled SQL workflow of your project in an interactive graph in your development workspace.

Dataform executes the dependency tree

In BigQuery, Dataform performs the following tasks:

  • Executes SQL commands, following the order of the dependency tree.
  • Executes assertion queries against your tables and views to check data correctness.
  • Executes other SQL operations that you defined.

After the execution, you can use your tables and views for all your analytics purposes.

You can view logs to see what tables were created, if assertions passed or failed, how long each action took to complete, and other information. You can also view the exact SQL code that was run in BigQuery.

Dataform modeling framework

Dataform provides an open source data modeling framework, consisting of Dataform core and Dataform CLI, that you can use outside Google Cloud.

Limitations

Dataform has the following known limitations:

  • Dataform in Google Cloud runs on a plain V8 runtime and does not support additional capabilities and modules provided by Node.js. If your existing codebase requires any Node.js modules, you need to remove these dependencies.

    Projects without a name field in package.json generate diffs on package-lock.json every time packages are installed. To avoid this outcome, you need to add a name property in package.json.

  • git+https:// URLs for dependencies in package.json are not supported.

    Convert such URLs to plain https:// archive URLs. For example, convert git+https://github.com/dataform-co/dataform-segment.git#1.5 to https://github.com/dataform-co/dataform-segment/archive/1.5.tar.gz.

  • Manually running unit tests is not available.

  • Searching for file content in development workspaces is not available.

  • As of Dataform core 3.0.0., Dataform doesn't distribute a Docker image. You can build your own Docker image of Dataform, which you can use to run the equivalent of Dataform CLI commands. To build your own Docker image, see Containerize an application in the Docker documentation.

  • The following Dataform API methods don't comply with the AIP.134 guidelines by treating the * wildcard entry as a bad request and by updating all fields instead of set fields when field_mask is omitted:

What's next