This document helps you understand the concept of repositories in Dataform and how to create a new repository.
About Dataform repositories
Each Dataform repository houses a collection of SQLX and JavaScript files that make up your workflow, as well as Dataform configuration files and packages. You interact with the contents of your repository in a development workspace.
Dataform displays your repositories on the Dataform page in the alphabetical order of repository IDs. You can sort and filter them.
To view your repositories, in the Google Cloud console, go to the Dataform page.
Each Dataform repository is connected to a service account. You can select a service account when you create a repository, or edit the service account later.
By default, Dataform uses a service account derived from your project number in the following format:
service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com
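As an illustration of the format above, the following minimal sketch derives the default service account address from a project number. The function name and the sample project number are hypothetical and used only for this example.

```python
def default_dataform_service_account(project_number: int) -> str:
    """Build the default Dataform service account email for a project number."""
    number = str(project_number)
    # Project numbers are numeric; reject anything else (for example, a negative value).
    if not number.isdigit():
        raise ValueError(f"project number must be a non-negative integer, got {project_number!r}")
    return f"service-{number}@gcp-sa-dataform.iam.gserviceaccount.com"


# 123456789012 is a made-up project number.
print(default_dataform_service_account(123456789012))
# prints: service-123456789012@gcp-sa-dataform.iam.gserviceaccount.com
```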
Dataform uses Git to record changes and manage file versions. Each Dataform repository corresponds with a Git repository. After you create a Dataform repository, you can connect it to a remote GitHub, GitLab, or Bitbucket repository.
In a Dataform repository, Dataform stores the repository code. In a connected repository, the third-party repository stores the repository code. Dataform interacts with the third-party repository to allow you to edit and execute its contents in a Dataform development workspace.
A Dataform repository page consists of the following components:
- Development workspaces tab
- Displays development workspaces created in the repository.
- Release configurations tab
- Lets you inspect, create, edit, and delete release configurations.
- Workflow execution logs tab
- Displays Dataform workflow execution logs.
- Workflow configurations tab
- Lets you inspect, create, edit, and delete workflow configurations.
- Settings tab
- Displays the name and location of the repository. For a repository connected to a third-party Git repository, displays the third-party repository source, default branch name, and secret token. Displays the buttons to connect the repository to a third-party Git repository and to edit the Git connection.
- Create development workspace button
- Lets you create a development workspace.
After you create and initialize a development workspace, you can edit your workflow settings file to configure the following Dataform settings of your repository:
- The default database (Google Cloud project ID).
- The default schema (BigQuery dataset ID).
- The default BigQuery location.
- The default schema (BigQuery dataset ID) for assertions.
- The warehouse, which must be set to bigquery.
- User-defined variables that are made available to project code during compilation.
For more information about Dataform repository settings, see IProjectConfig in the Dataform core reference.
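The settings listed above can be sketched as a JSON settings file. This example assumes the IProjectConfig field names `warehouse`, `defaultDatabase`, `defaultSchema`, `defaultLocation`, `assertionSchema`, and `vars` as used in a `dataform.json` file; a `workflow_settings.yaml` file may use different key names. All values (project ID, dataset IDs, location, and variable) are hypothetical placeholders.

```python
import json

# Hypothetical values; replace them with your own project, datasets, and location.
project_config = {
    "warehouse": "bigquery",                   # must be set to bigquery
    "defaultDatabase": "my-project-id",        # default database (Google Cloud project ID)
    "defaultSchema": "dataform",               # default schema (BigQuery dataset ID)
    "defaultLocation": "US",                   # default BigQuery location
    "assertionSchema": "dataform_assertions",  # default schema for assertions
    "vars": {"environment": "dev"},            # user-defined compilation variables
}


def render_project_config(config: dict) -> str:
    """Return the settings as formatted JSON, checking the warehouse value first."""
    if config.get("warehouse") != "bigquery":
        raise ValueError("warehouse must be set to 'bigquery'")
    return json.dumps(config, indent=2)


print(render_project_config(project_config))
```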
Repository settings
When you create a Dataform repository, you need to set the following repository settings:
- Repository ID
- A unique ID of the repository. IDs can only include numbers, letters, hyphens, and underscores.
- Region
Dataform region for storing the repository and its contents.
This storage region can be different from the processing region where Dataform processes your code and stores the output of executions. By default, the processing region is set to your default BigQuery dataset region. You can edit the processing region in the workflow settings file after creating the repository. For more information, see Configure Dataform settings.
- Service account
Service account associated with the repository. You can select the default Dataform service account, a service account associated with your Google Cloud project, or manually enter a different service account. By default, Dataform uses a service account derived from your project number in the following format:
service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com
Dataform uses the default service account for all repository operations. You can use a different service account to execute workflows in your repository, but the default service account is still used for all other repository operations.
- Encryption
Encryption method for the repository. You can use the default encryption, a unique customer-managed Cloud KMS encryption key, or a default Dataform CMEK key. For more information about using customer-managed encryption keys (CMEK) in Dataform, see Use customer-managed encryption keys.
After you create a repository, you can connect it to a remote GitHub, GitLab, or Bitbucket repository.
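The repository ID rule stated above (numbers, letters, hyphens, and underscores only) can be checked before you submit the form. This is a minimal sketch that assumes ASCII letters and covers only the character set; it does not check uniqueness or any length limit that the service might enforce. The function name is hypothetical.

```python
import re

# Letters, digits, hyphens, and underscores only; at least one character.
_REPOSITORY_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_repository_id(repository_id: str) -> bool:
    """Return True if the ID uses only the characters allowed for repository IDs."""
    return _REPOSITORY_ID_PATTERN.fullmatch(repository_id) is not None


print(is_valid_repository_id("analytics_repo-1"))  # prints: True
print(is_valid_repository_id("my repo"))           # prints: False (contains a space)
```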
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the BigQuery and Dataform APIs.
- To use CMEK encryption for the repository, enable CMEK encryption of Dataform repositories.
Required roles
To get the permissions that you need to create and delete a repository, ask your administrator to grant you the Dataform Admin (roles/dataform.admin) IAM role on repositories.
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
To use a service account other than the default Dataform service account, grant access to the custom service account.
After you create a Dataform repository, Dataform automatically grants you the Dataform Admin role on that repository.
Create a repository
To create a Dataform repository, follow these steps:
In the Google Cloud console, go to the Dataform page.
Click Create repository.
On the Create repository page, in the Repository ID field, enter a unique ID.
IDs can only include numbers, letters, hyphens, and underscores.
In the Region drop-down list, select a Dataform region for storing the repository and its contents. Select the Dataform region nearest to your location.
For a list of available Dataform regions, see Locations. The repository region does not have to match the location of your BigQuery datasets.
In the workflow_settings.yaml file, you can set the processing region where Dataform processes your code and stores the output of executions. The processing region has to match the location of your BigQuery datasets, but does not need to match the repository region. For more information, see Configure Dataform settings.
In the Service account drop-down, select a service account for the repository.
In the drop-down, you can select the default Dataform service account or any service account associated with your Google Cloud project that you have access to. Keep in mind that custom service accounts are used only for workflow execution. All other repository operations are still performed by the default Dataform service account.
- Optional: To select a service account that is not displayed in the drop-down, click Enter manually and enter a service account ID.
Configure your selected encryption mechanism for the repository:
Default CMEK key
Dataform displays the Use the default KMS key checkbox and selects it by default.
- To encrypt the repository with the default Dataform CMEK key, leave the Use the default KMS key checkbox selected.
Unique CMEK key
To encrypt the repository with a unique CMEK key, do the following:
- If the Use the default KMS key checkbox is selected by default, deselect the checkbox.
- In the Encryption section, select the Customer-managed encryption keys (CMEK) option.
- In the Select a customer-managed key drop-down, select a unique CMEK key.
Encryption at rest
- To use the default encryption, in the Encryption section, select the Google-managed encryption key option.
Click Create, and then click Done.
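The console steps above have a programmatic counterpart in the Dataform REST API. The following sketch only builds the HTTP request and does not send it. It assumes the `v1beta1` API version, the `repositoryId` query parameter, and the `serviceAccount` and `kmsKeyName` fields of the repository resource; check these against the API reference before relying on them. The function name and sample values are hypothetical, and authentication (an OAuth 2.0 access token in the `Authorization` header) is left out.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "https://dataform.googleapis.com/v1beta1"  # assumed API version


def build_create_repository_request(
    project_id: str,
    region: str,
    repository_id: str,
    service_account: Optional[str] = None,
    kms_key_name: Optional[str] = None,
) -> Request:
    """Build (but do not send) a request that creates a Dataform repository."""
    parent = f"projects/{project_id}/locations/{region}"
    query = urlencode({"repositoryId": repository_id})
    body = {}
    if service_account:
        body["serviceAccount"] = service_account  # custom service account for workflow execution
    if kms_key_name:
        body["kmsKeyName"] = kms_key_name  # unique CMEK key for the repository
    return Request(
        f"{API_ROOT}/{parent}/repositories?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical project, region, and repository ID.
request = build_create_repository_request("my-project", "us-central1", "my_repo")
print(request.get_method(), request.full_url)
```

Omitting both optional arguments corresponds to choosing the default Dataform service account and the default encryption in the console.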
Edit the service account
You can associate a custom service account with a Dataform repository for workflow execution. All other repository operations are still performed by the default Dataform service account.
To edit the service account for a Dataform repository, follow these steps:
In the Google Cloud console, go to the Dataform page.
Select a repository, and then click Settings.
By the Service account field, click Edit Service account.
In the Service account drop-down, select a service account for the repository.
In the drop-down, you can select the default Dataform service account or any service account associated with your Google Cloud project that you have access to.
- Optional: To select a service account that is not displayed in the drop-down, click Enter manually and enter a service account ID.
Click Save.
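The same change can be sketched as a REST request. This example only builds the request and does not send it. It assumes the `v1beta1` API version, a `PATCH` on the repository resource name, an `updateMask` query parameter, and a `serviceAccount` field; treat each of these as an assumption to confirm in the API reference. The function name and sample values are hypothetical, and authentication is omitted.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "https://dataform.googleapis.com/v1beta1"  # assumed API version


def build_update_service_account_request(
    project_id: str, region: str, repository_id: str, service_account: str
) -> Request:
    """Build (but do not send) a request that changes a repository's service account."""
    name = f"projects/{project_id}/locations/{region}/repositories/{repository_id}"
    # Limit the update to the service account field so other settings stay unchanged.
    query = urlencode({"updateMask": "serviceAccount"})
    return Request(
        f"{API_ROOT}/{name}?{query}",
        data=json.dumps({"serviceAccount": service_account}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


# Hypothetical identifiers.
request = build_update_service_account_request(
    "my-project", "us-central1", "my_repo",
    "workflow-runner@my-project.iam.gserviceaccount.com",
)
print(request.get_method(), request.full_url)
```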
Delete a repository
To delete a repository and all its contents, follow these steps:
In the Google Cloud console, go to the Dataform page.
By the repository that you want to delete, click the More menu, and then select Delete.
In the Delete repository window, enter the name of the repository to confirm deletion.
Click Delete.
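For scripted cleanup, the deletion can be sketched as a REST request. This example only builds the request and does not send it. It assumes the `v1beta1` API version and a `force` query parameter that asks the service to also delete the repository's child resources; confirm both in the API reference. The function name and sample values are hypothetical, and authentication is omitted.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "https://dataform.googleapis.com/v1beta1"  # assumed API version


def build_delete_repository_request(
    project_id: str, region: str, repository_id: str, force: bool = True
) -> Request:
    """Build (but do not send) a request that deletes a Dataform repository."""
    name = f"projects/{project_id}/locations/{region}/repositories/{repository_id}"
    url = f"{API_ROOT}/{name}"
    if force:
        # Assumed flag: also delete the repository's contents.
        url = f"{url}?{urlencode({'force': 'true'})}"
    return Request(url, method="DELETE")


# Hypothetical identifiers.
request = build_delete_repository_request("my-project", "us-central1", "my_repo")
print(request.get_method(), request.full_url)
```

Deletion is irreversible, so a script like this would typically require an explicit confirmation step, as the console does.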
What's next
- To learn how to connect a Dataform repository to a third-party Git repository, see Connect to a third-party Git repository.
- To learn more about how repository size affects development in Dataform, see Overview of repository size.
- To learn more about splitting a repository in Dataform, see Introduction to splitting repositories.
- To learn how to configure Dataform processing settings, see Configure Dataform settings.
- To learn how to create and initialize a workspace, see Create a workspace.