Overview of best practices in Dataform

This document provides an overview of best practices for managing repository size, repository structure, and code lifecycle in Dataform.

Best practices for repository size

Repository size impacts multiple aspects of development in Dataform, such as:

  • Collaboration
  • Codebase readability
  • Development processes
  • Workflow compilation
  • Workflow execution

Dataform enforces API quotas and limits on compilation resources. Large repository size can cause your repository to exceed these quotas and limits. This can lead to failed compilation and execution of your SQL workflow.

To mitigate that risk, we recommend splitting large repositories. When you split a large repository, you divide a large SQL workflow into a number of smaller SQL workflows housed in different repositories and connected by cross-repository dependencies.
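For example, a downstream repository can reference a table that an upstream repository produces by declaring it as a data source. The project, schema, and table names below are hypothetical; this is a sketch of a cross-repository dependency, not a definitive setup:

```sql
-- definitions/sources/upstream_orders.sqlx
-- Declares a table produced by a separate upstream repository, so that
-- this repository's workflow can reference it with ${ref("orders")}.
config {
  type: "declaration",
  database: "my-upstream-project",
  schema: "upstream_output",
  name: "orders"
}
```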

This approach lets you stay within Dataform quotas and limits, apply fine-grained processes and permissions, and improve codebase readability and collaboration. However, managing multiple split repositories can be more challenging than managing a single repository.

To learn more about the impact of repository size in Dataform and best practices for splitting repositories, see Splitting repositories.

Best practices for repository structure

We recommend structuring files in the definitions directory to reflect the stages of your workflow. Keep in mind that you can adopt a custom structure that best fits your needs.

The following recommended structure of definitions subdirectories reflects the key stages of most SQL workflows:

  • sources, storing data source declarations
  • intermediate, storing data transformation logic
  • output, storing definitions of output tables
  • Optional: extras, storing additional files

Names of all files in Dataform must conform to BigQuery table naming guidelines. We recommend that the names of files in the definitions directory in a Dataform repository reflect the subdirectory structure.
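For example, a repository that follows this structure and naming convention might look like the following; the file names are illustrative:

```
definitions/
  sources/
    source_orders.sqlx
    source_customers.sqlx
  intermediate/
    intermediate_orders_enriched.sqlx
  output/
    output_daily_revenue.sqlx
  extras/
    custom_functions.js
```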

To learn more about best practices for structuring and naming files in a repository, see Structuring code in a repository.

Best practices for code lifecycle

The default code lifecycle in Dataform consists of the following phases:

  • Development of workflow code in a workspace
  • Compilation of workflow code into a compilation result
  • Execution of the compilation result in BigQuery

To manage code lifecycle in Dataform, you can create execution environments, for example, development, staging, and production.

To learn more about code lifecycle in Dataform, see Introduction to code lifecycle in Dataform.

You can choose to keep your execution environments in a single repository or in multiple repositories.

Execution environments in a single repository

You can create isolated execution environments such as development, staging, and production in a single Dataform repository with workspace compilation overrides and release configurations.

You can create isolated execution environments in the following ways:

  • Split development and production tables by schema
  • Split development and production tables by schema and Google Cloud project
  • Split development, staging, and production tables per Google Cloud project

Then, you can schedule executions in staging and production environments with workflow configurations. We recommend triggering executions manually in the development environment.
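For example, to split development and production tables by schema, you can define tables against the production schema and rely on a schema suffix in development. The schema name and suffix below are hypothetical, and this is a sketch of how compilation overrides behave rather than a complete configuration:

```sql
-- definitions/output/daily_revenue.sqlx
-- Compiles to analytics.daily_revenue by default. With a workspace
-- compilation override that sets the schema suffix to "dev", the same
-- code compiles to analytics_dev.daily_revenue instead, keeping
-- development output isolated from production tables.
config {
  type: "table",
  schema: "analytics"
}
SELECT order_date, SUM(amount) AS revenue
FROM ${ref("orders")}
GROUP BY order_date
```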

To learn more about best practices for managing code lifecycle in Dataform, see Managing code lifecycle.

Code lifecycle in multiple repositories

To tailor Identity and Access Management permissions to each stage of the code lifecycle, you can create multiple copies of a repository and store them in different Google Cloud projects.

Each Google Cloud project serves as an execution environment that corresponds to a stage of your code lifecycle, for example, development and production.

In this approach, we recommend keeping the codebase of the repository the same in all Google Cloud projects. To customize compilation and execution in each copy of the repository, use workspace compilation overrides, release configurations, and workflow configurations.
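For example, each copy of the repository can keep an identical codebase while a release configuration overrides the default Google Cloud project per environment. In Dataform core, the default project is set in the repository's settings file; the values below are hypothetical, and this is a minimal sketch rather than a complete settings file:

```json
{
  "defaultDatabase": "my-project-dev",
  "defaultSchema": "dataform",
  "defaultLocation": "US"
}
```

A release configuration in the production copy could then override `defaultDatabase` to a value such as `my-project-prod` without any change to the workflow code itself.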

What's next