Overview of repository size

This document helps you understand how repository size affects SQL workflow development and the use of Dataform compilation resources, and how to estimate the compilation resource usage of your repository.

About repository size in Dataform

The size of a repository impacts the following aspects of development in Dataform:

Collaboration
Multiple collaborators working on a large repository can create an excessive number of pull requests, increasing the risk of merge conflicts.
Codebase readability
The more files that make up a SQL workflow in a single repository, the harder the repository is to navigate.
Development processes
Some areas of a large SQL workflow in a single repository might require custom permissions or processes, such as scheduling, different from the permissions and processes applied to the rest of the SQL workflow. Large repository size makes it difficult to tailor development processes to specific areas of the SQL workflow.
Workflow compilation
Dataform enforces usage limits on compilation resources. A large repository can exceed these limits, which causes compilation to fail.
Workflow execution
During execution, Dataform executes repository code inside your workspace and deploys assets to BigQuery. The larger the repository, the more time it takes Dataform to execute it.

If the large size of your repository negatively impacts your development in Dataform, you can split the repository into multiple smaller repositories.

About repository compilation resources limits

During development, Dataform compiles all repository code inside your workspace to generate a representation of the SQL workflow in your repository, called a compilation result. Dataform enforces usage limits on compilation resources.

Your repository might exceed the usage limits for the following reasons:

  • An infinite loop bug in the repository code.
  • A memory leak bug in the repository code.
  • A large repository, roughly more than 1,000 SQL workflow actions.

For more information on usage limits on compilation resources, see Dataform compilation resources limits.
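
As a very rough way to see how a repository compares with the action count mentioned above, you could count its SQLX files. The following sketch assumes the conventional definitions directory and that most files define a single action. JavaScript files, and files that declare several actions, are not counted, so treat the result only as a proxy. The function name count_sqlx_files is hypothetical and is not part of the Dataform CLI.

```shell
# Count .sqlx files under a directory (defaults to ./definitions).
# Assumption: one .sqlx file roughly corresponds to one workflow action.
count_sqlx_files() {
  # tr strips the padding that some wc implementations (for example on macOS) add.
  find "${1:-definitions}" -type f -name '*.sqlx' | wc -l | tr -d ' '
}

# Example, run from the repository root:
#   count_sqlx_files
#   count_sqlx_files definitions/reporting
```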

Estimate compilation resources usage of your repository

You can estimate the usage of the following compilation resources for your repository:

  • CPU time usage
  • Maximum total serialized data size of the generated graph of actions defined in your repository

To obtain a rough approximation of the CPU time that compiling your repository currently uses, you can time the compilation of your Dataform SQL workflow on a local Linux or macOS machine.

  • To time the compilation of your SQL workflow, inside your repository, execute the Dataform CLI dataform compile command in the following format:
time dataform compile

The following code sample shows a result of executing the time dataform compile command:

real    0m3.480s
user    0m1.828s
sys     0m0.260s

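You can treat the real result as a rough indicator of the CPU time usage for the compilation of your repository. Note that real is elapsed wall-clock time, while the sum of user and sys is the CPU time that the process consumed, so the two values can differ.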
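
If you also want the CPU time that the process itself reports, you can add the user and sys lines together. The following is a minimal sketch that assumes the default bash output format of time shown in the sample (for example 0m1.828s). Other shells, such as zsh, print a different format that this sketch does not parse. The function name cpu_seconds is hypothetical.

```shell
# Read bash `time` output on stdin and print user + sys in seconds.
# Assumption: values look like 0m1.828s (the default bash format).
cpu_seconds() {
  awk '$1 == "user" || $1 == "sys" {
         split($2, t, /[ms]/)        # "0m1.828s" -> t[1]="0", t[2]="1.828"
         total += t[1] * 60 + t[2]
       }
       END { printf "%.3f\n", total }'
}

# Example, run inside the repository with bash:
#   { time dataform compile > /dev/null ; } 2>&1 | cpu_seconds
```

For the sample output above, this sum is 2.088 seconds, compared with a real value of 3.480 seconds.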

To obtain a rough approximation of the total size of the generated graph of actions in your repository, you can write the compiled graph to a JSON file. You can treat the size of the uncompressed JSON file as a rough indicator of the total graph size.

  • To write the output of the compiled graph of your SQL workflow to a JSON file, inside your repository, execute the following Dataform CLI command:
dataform compile --json > graph.json
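
After the file is written, you can read its size with standard tools. The following sketch assumes the file is named graph.json as in the command above. The helper names graph_size_bytes and bytes_to_mib are hypothetical.

```shell
# Print the size of a file in bytes (works with GNU and BSD wc).
graph_size_bytes() {
  wc -c < "$1" | tr -d ' '
}

# Convert a byte count to mebibytes with two decimal places.
bytes_to_mib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

# Example, run after `dataform compile --json > graph.json`:
#   bytes_to_mib "$(graph_size_bytes graph.json)"
```

You can then compare the reported size with the limit listed in Dataform compilation resources limits.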

What's next