An enterprise data management and analytics platform provides an enclave where you can store, analyze, and manipulate sensitive information while maintaining security controls. You can use the enterprise data mesh architecture to deploy a platform on Google Cloud for data management and analytics. The architecture is designed to work in a hybrid environment, where Google Cloud components interact with your existing on-premises components and operating processes.
The enterprise data mesh architecture includes the following:
- A GitHub repository that contains a set of Terraform configurations, scripts, and code to build the following:
- A governance project that lets you use Google's implementation of the Cloud Data Management Capabilities (CDMC) Key Controls Framework.
- A data platform example that supports interactive and production workflows.
- A producer environment within the data platform that supports multiple data domains. Data domains are logical groupings of data elements.
- A consumer environment within the data platform that supports multiple consumer projects.
- A data transfer service that uses Workload Identity Federation and the Tink encryption library to help you transfer data into Google Cloud in a secure manner.
- A data domain example that contains ingestion, non-confidential, and confidential projects.
- An example of a data access system that lets data consumers request access to data sets and data owners grant access to those data sets. The example also includes a workflow manager that changes the IAM permissions of those data sets accordingly.
- A guide (this document) to the architecture, design, security controls, and operational processes that you implement with this architecture.
The enterprise data mesh architecture is designed to be compatible with the enterprise foundations blueprint. The enterprise foundations blueprint provides a number of base-level services that this architecture relies on, such as VPC networks and logging. You can deploy this architecture without deploying the enterprise foundations blueprint if your Google Cloud environment provides the necessary functionality.
This document is intended for cloud architects, data scientists, data engineers, and security architects who can use the architecture to build and deploy comprehensive data services on Google Cloud. This document assumes that you are familiar with the concepts of data meshes, Google Cloud data services, and the Google Cloud implementation of the CDMC framework.
Architecture
The enterprise data mesh architecture takes a layered approach to provide the capabilities that enable data ingestion, data processing, and governance. The architecture is intended to be deployed and controlled through a CI/CD workflow. The following diagram shows how the data layer that is deployed by this architecture relates to other layers in your environment.
This diagram includes the following:
- Google Cloud infrastructure provides security capabilities such as encryption at rest and encryption in transit, as well as basic building blocks such as compute and storage.
- The enterprise foundation provides a baseline of resources such as identity, networking, logging, monitoring, and deployment systems that enable you to adopt Google Cloud for your data workloads.
- The data layer provides various capabilities such as data ingestion, data storage, data access control, data governance, data monitoring, and data sharing.
- The application layer represents the various applications that use the data layer assets.
- CI/CD provides the tools to automate the provisioning, configuration, management, and deployment of infrastructure, workflows, and software components. These components help you ensure consistent, reliable, and auditable deployments; minimize manual errors; and accelerate the overall development cycle.
To show how the data environment is used, the architecture includes a sample data workflow. The sample data workflow takes you through the following processes: data governance, data ingestion, data processing, data sharing, and data consumption.
Key architectural decisions
The following table summarizes the high-level decisions of the architecture.
| Decision area | Decision |
|---|---|
| Google Cloud architecture | |
| Resource hierarchy | The architecture uses the resource hierarchy from the enterprise foundations blueprint. |
| Networking | The architecture includes an example data transfer service that uses Workload Identity Federation and the Tink encryption library. |
| Roles and IAM permissions | The architecture includes segmented data producer roles, data consumer roles, data governance roles, and data platform roles. |
| Common data services | |
| Metadata | The architecture uses Data Catalog to manage data metadata. |
| Central policy management | To manage policies, the architecture uses Google Cloud's implementation of the CDMC framework. |
| Data access management | To control access to data, the architecture includes an independent process that requires data consumers to request access to data assets from the data owner. |
| Data quality | The architecture uses the Cloud Data Quality Engine to define and run data quality rules on specified table columns, measuring data quality based on metrics like correctness and completeness. |
| Data security | The architecture uses tagging, encryption, masking, tokenization, and IAM controls to provide data security. |
| Data domain | |
| Data environments | The architecture includes three environments. Two environments (non-production and production) are operational environments that are driven by pipelines. One environment (development) is an interactive environment. |
| Data owners | Data owners ingest, process, expose, and grant access to data assets. |
| Data consumers | Data consumers request access to data assets. |
| Onboarding and operations | |
| Pipelines | The architecture uses the foundation, infrastructure, Service Catalog, and artifact pipelines to deploy resources. |
| Repositories | Each pipeline uses a separate repository to enable segregation of responsibility. |
| Process flow | The process requires that changes to the production environment include a submitter and an approver. |
| Cloud operations | |
| Data product scorecards | The Report Engine generates data product scorecards. |
| Cloud Logging | The architecture uses the logging infrastructure from the enterprise foundations blueprint. |
| Cloud Monitoring | The architecture uses the monitoring infrastructure from the enterprise foundations blueprint. |
Identity: Mapping roles to groups
The data mesh leverages the enterprise foundations blueprint's existing identity lifecycle management, authorization, and authentication architecture. Users are not assigned roles directly; instead, groups are the primary method of assigning roles and permissions in IAM. IAM roles and permissions are assigned during project creation through the foundation pipeline.
The data mesh associates groups with one of four key areas: infrastructure, data governance, domain-based data producers, and domain-based consumers.
The permission scopes for these groups are the following:
- The infrastructure group's permission scope is the data mesh as a whole.
- The data governance groups' permission scope is the data governance project.
- Domain-based producers' and consumers' permissions are scoped to their data domain.
The following tables show the various roles used in this data mesh implementation and their associated permissions.
Infrastructure
| Group | Description | Roles |
|---|---|---|
| | Overall administrators of the data mesh | |
Data governance
| Group | Description | Roles |
|---|---|---|
| | Administrators of the data governance project | |
| | Developers who build and maintain the data governance components | Multiple roles on the data governance project |
| | Readers of data governance information | |
| | Security administrators of the governance project | |
| | Group with permission to use tag templates | |
| | Group with permission to use tag templates and add tags | |
| | Service account group for Security Command Center notifications | None. This is a group for membership, and a service account is created with this name, which has the necessary permissions. |
Domain-based data producers
| Group | Description | Roles |
|---|---|---|
| | Administrators of a specific data domain | |
| | Developers who build and maintain data products within a data domain | Multiple roles on the data domain project |
| | Readers of the data domain information | |
| | Editors of Data Catalog entries | Roles to edit Data Catalog entries |
| | Data stewards for the data domain | Roles to manage metadata and data governance aspects |
Domain-based data consumers
| Group | Description | Roles |
|---|---|---|
| | Administrators of a specific consumer project | |
| | Developers working within a consumer project | Multiple roles on the consumer project |
| | Readers of the consumer project information | |
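As a concrete illustration of this group-based model, the following minimal Terraform sketch grants a read-only role set to a domain reader group at the project level. The project ID, group address, and role list are hypothetical placeholders, not values from the blueprint's repository.

```hcl
# Minimal sketch: grant roles to a group rather than to individual users.
# The project ID, group address, and role list are illustrative placeholders.

locals {
  data_domain_viewer_roles = [
    "roles/bigquery.dataViewer",
    "roles/datacatalog.viewer",
  ]
}

resource "google_project_iam_member" "domain_viewers" {
  for_each = toset(local.data_domain_viewer_roles)

  project = "example-data-domain-prod"              # hypothetical project ID
  role    = each.value
  member  = "group:data-domain-viewers@example.com" # hypothetical group
}
```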
Organization structure
To differentiate between production operations and production data, the architecture uses different environments to develop and release workflows. Production operations include the governance, traceability, and repeatability of a workflow and the auditability of the results of the workflow. Production data refers to possibly sensitive data that you need to run your organization. All environments are designed to have security controls that let you ingest and operate your data.
To help data scientists and engineers, the architecture includes an interactive environment where developers can work with the environment directly and add services through a curated catalog of solutions. Operational environments are driven through pipelines, which deploy a codified architecture and configuration.
This architecture uses the organizational structure of the enterprise foundations blueprint as a basis for deploying data workloads. The following diagram shows the top-level folders and projects used in the enterprise data mesh architecture.
The following table describes the top-level folders and projects that are part of the architecture.
| Folder | Component | Description |
|---|---|---|
| | | Contains the deployment pipeline that's used to build out the code artifacts of the architecture. |
| | | Contains the infrastructure used by the Service Catalog to deploy resources in the interactive environment. |
| | | Contains all the resources used by Google Cloud's implementation of the CDMC framework. |
| | | Contains the projects and resources of the data platform for developing use cases in interactive mode. |
| | | Contains the projects and resources of the data platform for testing use cases that you want to deploy in an operational environment. |
| | | Contains the projects and resources of the data platform for deployment into production. |
Data platform folder
The data platform folder contains all the data plane components. The CDMC resources are split between the data platform folder and the data governance project. The following diagram shows the folders and projects that are deployed in the data platform folder.
There is a data platform folder for each environment (development, non-production, and production). The following table describes the folders within each data platform folder.
| Folders | Description |
|---|---|
| Producers | Contains the data domains. |
| Consumers | Contains the consumer projects. |
| Data domain | Contains the projects associated with a particular domain. |
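The following minimal Terraform sketch illustrates, under assumed names and folder IDs, how the producers and consumers folders and a data domain folder could be expressed. It is not the blueprint's actual folder module.

```hcl
# Minimal sketch of the producer and consumer folders under one environment
# folder. Folder names and the parent folder ID are illustrative placeholders.

resource "google_folder" "producers" {
  display_name = "producers"
  parent       = "folders/123456789012" # hypothetical environment folder ID
}

resource "google_folder" "consumers" {
  display_name = "consumers"
  parent       = "folders/123456789012"
}

resource "google_folder" "sales_domain" {
  display_name = "data-domain-sales" # hypothetical data domain
  parent       = google_folder.producers.name
}
```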
Producers folder
Each producers folder includes one or more data domains. A data domain refers to a logical grouping of data elements that share a common meaning, purpose, or business context. Data domains let you categorize and organize data assets within an organization. The following diagram shows the structure of a data domain.
The following table describes the projects that are deployed in the data platform folder for each environment.
| Project | Description |
|---|---|
| Ingestion | The ingestion project ingests data into the data domain. The architecture provides examples of how you can stream data into BigQuery, Cloud Storage, and Pub/Sub. The ingestion project also contains examples of Dataflow and Cloud Composer that you can use to orchestrate the transformation and movement of ingested data. |
| Non-confidential | The non-confidential project contains data that has been de-identified. You can mask, containerize, encrypt, tokenize, or obfuscate data. Use policy tags to control how the data is presented. |
| Confidential | The confidential project contains plaintext data. You can control access through IAM permissions. |
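Because the non-confidential project relies on policy tags to control how de-identified data is presented, the following minimal Terraform sketch shows one way to define a sensitivity taxonomy, a policy tag, and a fine-grained reader grant. The project, region, and group names are hypothetical placeholders, not values from the blueprint.

```hcl
# Minimal sketch, assuming a taxonomy with a single "confidential" policy tag
# that is attached to BigQuery columns in the non-confidential project.

resource "google_data_catalog_taxonomy" "sensitivity" {
  project                = "example-domain-nonconf" # hypothetical project ID
  region                 = "us-central1"
  display_name           = "sensitivity"
  activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
}

resource "google_data_catalog_policy_tag" "confidential" {
  taxonomy     = google_data_catalog_taxonomy.sensitivity.id
  display_name = "confidential"
}

# Only members of this group can read columns that carry the policy tag.
resource "google_data_catalog_policy_tag_iam_member" "fine_grained_readers" {
  policy_tag = google_data_catalog_policy_tag.confidential.name
  role       = "roles/datacatalog.categoryFineGrainedReader"
  member     = "group:confidential-data-readers@example.com" # hypothetical group
}
```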
Consumer folder
The consumer folder contains consumer projects. Consumer projects provide a mechanism to segment data users based on their required trust boundary. Each project is assigned to a separate user group and the group is assigned access to the required data assets on a project-by-project basis. You can use the consumer project to collect, analyze, and augment the data for the group.
Common folder
The common folder contains the services that are used by different environments and projects. This section describes the capabilities that are added to the common folder to enable the enterprise data mesh.
CDMC architecture
The architecture uses the CDMC architecture for data governance. The data governance functions reside in the data governance project in the common folder. The following diagram shows the components of the CDMC architecture. The numbers in the diagram represent the key controls that are addressed with Google Cloud services.
The following table describes the components of the CDMC architecture that the enterprise data mesh architecture uses.
| CDMC component | Google Cloud service | Description |
|---|---|---|
| Access and lifecycle components | | |
| Key management | Cloud KMS | A service that securely manages encryption keys that protect sensitive data. |
| Record Manager | Cloud Run | An application that maintains comprehensive logs and records of data processing activities, ensuring organizations can track and audit data usage. |
| Archiving policy | BigQuery | A BigQuery table that contains the storage policy for data. |
| Entitlements | BigQuery | A BigQuery table that stores information about who can access sensitive data. This table ensures that only authorized users can access specific data based on their roles and privileges. |
| Scanning components | | |
| Data loss | Sensitive Data Protection | Service used to inspect assets for sensitive data. |
| DLP findings | BigQuery | A BigQuery table that catalogs data classifications within the data platform. |
| Policies | BigQuery | A BigQuery table that contains consistent data governance practices (for example, data access types). |
| Billing export | BigQuery | A table that stores cost information that is exported from Cloud Billing to enable the analysis of cost metrics that are associated with data assets. |
| Cloud Data Quality Engine | Cloud Run | An application that runs data quality checks for tables and columns. |
| Data quality findings | BigQuery | A BigQuery table that records identified discrepancies between the defined data quality rules and the actual quality of the data assets. |
| Reporting components | | |
| Scheduler | Cloud Scheduler | A service that controls when the Cloud Data Quality Engine runs and when the Sensitive Data Protection inspection occurs. |
| Report Engine | Cloud Run | An application that generates reports that help track and measure adherence to the CDMC framework's controls. |
| Findings and assets | BigQuery and Pub/Sub | A BigQuery report of discrepancies or inconsistencies in data management controls, such as missing tags, incorrect classifications, or non-compliant storage locations. |
| Tag exports | BigQuery | A BigQuery table that contains extracted tag information from Data Catalog. |
| Other components | | |
| Policy management | Organization Policy Service | A service that defines and enforces restrictions on where data can be stored geographically. |
| Attribute-based access policies | Access Context Manager | A service that defines and enforces granular, attribute-based access policies so that only authorized users from permitted locations and devices can access sensitive information. |
| Metadata | Data Catalog | A service that stores metadata information about the tables that are in use in the data mesh. |
| Tag Engine | Cloud Run | An application that adds tags to data in BigQuery tables. |
| CDMC reports | Looker Studio | Dashboards that let your analysts view reports that were generated by the CDMC architecture engines. |
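To illustrate how the Scheduler component can drive the Cloud Run engines, here is a minimal Terraform sketch of a Cloud Scheduler job that calls a Cloud Run service over HTTP. The service URL, service account, project, and schedule are hypothetical placeholders rather than the blueprint's actual configuration.

```hcl
# Minimal sketch of a Cloud Scheduler job that invokes a Cloud Run service
# (for example, a data quality engine) on a schedule.

resource "google_cloud_scheduler_job" "data_quality_run" {
  name      = "run-data-quality-engine"
  project   = "example-data-governance" # hypothetical project ID
  region    = "us-central1"
  schedule  = "0 2 * * *" # daily at 02:00
  time_zone = "Etc/UTC"

  http_target {
    http_method = "POST"
    uri         = "https://dq-engine-abc123-uc.a.run.app/run" # hypothetical Cloud Run URL

    oidc_token {
      service_account_email = "dq-scheduler@example-data-governance.iam.gserviceaccount.com"
    }
  }
}
```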
CDMC implementation
The following table describes how the architecture implements the key controls in the CDMC framework.
| CDMC control requirement | Implementation |
|---|---|
| | The Report Engine detects non-compliant data assets and publishes findings to a Pub/Sub topic. These findings are also loaded into BigQuery for reporting using Looker Studio. |
| Data ownership is established for both migrated and cloud-generated data | Data Catalog automatically captures technical metadata from BigQuery. Tag Engine applies business metadata tags like owner name and sensitivity level from a reference table, which helps ensure that all sensitive data is tagged with owner information for compliance. This automated tagging process helps provide data governance and compliance by identifying and labeling sensitive data with the appropriate owner information. |
| Data sourcing and consumption are governed and supported by automation | Data Catalog classifies data assets by tagging them with an |
| | Organization Policy Service defines permitted storage regions for data assets and Access Context Manager restricts access based on user location. Data Catalog stores the approved storage locations as metadata tags. Report Engine compares these tags against the actual location of the data assets in BigQuery and publishes any discrepancies as findings using Pub/Sub. Security Command Center provides an additional layer of monitoring by generating vulnerability findings if data is stored or accessed outside the defined policies. |
| | Data Catalog stores and updates the technical metadata for all BigQuery data assets, effectively creating a continuously synchronized data catalog. Data Catalog ensures that any new or modified tables and views are immediately added to the catalog, maintaining an up-to-date inventory of data assets. |
| | Sensitive Data Protection inspects BigQuery data and identifies sensitive information types. These findings are then ranked based on a classification reference table, and the highest sensitivity level is assigned as a tag in Data Catalog at the column and table levels. Tag Engine manages this process by updating the Data Catalog with sensitivity tags whenever new data assets are added or existing ones are modified. This process ensures a constantly updated classification of data based on sensitivity, which you can monitor and report on using Pub/Sub and integrated reporting tools. |
| | BigQuery policy tags control access to sensitive data at the column level, ensuring only authorized users can access specific data based on their assigned policy tag. IAM manages overall access to the data warehouse, while Data Catalog stores sensitivity classifications. Regular checks are performed to ensure all sensitive data has corresponding policy tags, with any discrepancies reported using Pub/Sub for remediation. |
| | Data sharing agreements for both providers and consumers are stored in a dedicated BigQuery data warehouse to control consumption purposes. Data Catalog labels data assets with the provider agreement information, while consumer agreements are linked to IAM bindings for access control. Query labels enforce consumption purposes, requiring consumers to specify a valid purpose when querying sensitive data, which is validated against their entitlements in BigQuery. An audit trail in BigQuery tracks all data access and ensures compliance with the data sharing agreements. |
| | Google's default encryption at rest helps protect data that is stored on disk. Cloud KMS supports customer-managed encryption keys (CMEK) for enhanced key management. BigQuery implements column-level dynamic data masking for de-identification and supports application-level de-identification during data ingestion. Data Catalog stores metadata tags for encryption and de-identification techniques that are applied to data assets. Automated checks ensure that the encryption and de-identification methods align with predefined security policies, with any discrepancies reported as findings using Pub/Sub. |
| | Data Catalog tags sensitive data assets with relevant information for impact assessment, such as subject location and assessment report links. Tag Engine applies these tags based on data sensitivity and a policy table in BigQuery, which defines the assessment requirements based on data and subject residency. This automated tagging process allows for continuous monitoring and reporting of compliance with impact assessment requirements, ensuring that data protection impact assessments (DPIAs) or privacy impact assessments (PIAs) are conducted when necessary. |
| | Data Catalog labels data assets with retention policies, specifying retention periods and expiration actions (such as archive or purge). Record Manager automates the enforcement of these policies by purging or archiving BigQuery tables based on the defined tags. This enforcement ensures adherence to the data lifecycle policies and maintains compliance with data retention requirements, with any discrepancies detected and reported using Pub/Sub. |
| | The Cloud Data Quality Engine defines and runs data quality rules on specified table columns, measuring data quality based on metrics like correctness and completeness. Results from these checks, including success percentages and thresholds, are stored as tags in Data Catalog. Storing these results allows for continuous monitoring and reporting of data quality, with any issues or deviations from acceptable thresholds published as findings using Pub/Sub. |
| | Data Catalog stores cost-related metrics for data assets, such as query costs, storage costs, and data egress costs, which are calculated using billing information exported from Cloud Billing to BigQuery. Storing cost-related metrics allows for comprehensive cost tracking and analysis, ensuring adherence to cost policies and efficient resource utilization, with any anomalies reported using Pub/Sub. |
| | Data Catalog's built-in data lineage features track the provenance and lineage of data assets, visually representing the flow of data. Additionally, data ingestion scripts identify and tag the original source of the data in Data Catalog, enhancing the traceability of data back to its origin. |
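Many of these controls depend on Data Catalog tags that Tag Engine populates. The following minimal Terraform sketch shows what a tag template for owner and sensitivity metadata could look like. The template ID, fields, and values are illustrative assumptions, not the CDMC blueprint's actual template definition.

```hcl
# Minimal sketch of a tag template that a tagging engine could populate with
# owner and sensitivity metadata for BigQuery assets.

resource "google_data_catalog_tag_template" "cdmc_controls" {
  project         = "example-data-governance" # hypothetical project ID
  region          = "us-central1"
  tag_template_id = "cdmc_controls"
  display_name    = "CDMC controls"
  force_delete    = false

  fields {
    field_id     = "data_owner"
    display_name = "Data owner"
    is_required  = true
    type {
      primitive_type = "STRING"
    }
  }

  fields {
    field_id     = "sensitivity"
    display_name = "Sensitivity classification"
    type {
      enum_type {
        allowed_values { display_name = "Public" }
        allowed_values { display_name = "Internal" }
        allowed_values { display_name = "Confidential" }
      }
    }
  }
}
```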
Data access management
In this architecture, access to data is controlled through an independent process that separates operational control (for example, running Dataflow jobs) from data access control. A user's access to a Google Cloud service is an environmental or operational concern, and it is provisioned and approved by a cloud engineering group. A user's access to Google Cloud data assets (for example, a BigQuery table) is a privacy, regulatory, or governance concern; it is subject to an access agreement between the producing and consuming parties and is controlled through the following processes. The following diagram shows how data access is provisioned through the interaction of different software components.
As shown in the previous diagram, onboarding of data accesses is handled by the following processes:
- Cloud data assets are collected and inventoried by Data Catalog.
- The workflow manager retrieves the data assets from Data Catalog.
- Data owners are onboarded to workflow manager.
The operation of the data access management is as follows:
- A data consumer makes a request for a specific asset.
- The data owner of the asset is alerted to the request.
- The data owner approves or rejects the request.
- If the request is approved, the workflow manager passes the group, asset, and associated tag to the IAM mapper.
- The IAM mapper translates the workflow manager tags into IAM permissions, and gives the specified group IAM permissions for the data asset.
- When a user wants to access the data asset, IAM evaluates access to the Google Cloud asset based on the permissions of the group.
- If permitted, the user accesses the data asset.
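The end result of an approval is an IAM binding on the requested asset. The following minimal Terraform sketch shows the kind of binding the IAM mapper might create for a consumer group on a BigQuery dataset; the project, dataset, and group names are hypothetical placeholders.

```hcl
# Minimal sketch of a binding that grants an approved consumer group read
# access to a single dataset. All names are illustrative placeholders.

resource "google_bigquery_dataset_iam_member" "approved_consumer_access" {
  project    = "example-domain-nonconf"               # hypothetical project ID
  dataset_id = "sales_curated"                        # hypothetical dataset
  role       = "roles/bigquery.dataViewer"
  member     = "group:consumer-analytics@example.com" # hypothetical consumer group
}
```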
Networking
The data security process starts at the source application, which might reside on-premises or in another environment external to the target Google Cloud project. Before any network transfer occurs, this application uses Workload Identity Federation to securely authenticate itself to Google Cloud APIs. Using these credentials, it interacts with Cloud KMS to obtain or wrap the necessary keys and then uses the Tink library to perform initial encryption and de-identification on the sensitive data payload according to predefined templates.
After the data payload is protected, the payload must be securely transferred into the Google Cloud ingestion project. For on-premises applications, you can use Cloud Interconnect or, potentially, Cloud VPN. Within the Google Cloud network, use Private Service Connect to route the data towards the ingestion endpoint within the target project's VPC network. Private Service Connect lets the source application connect to Google APIs using private IP addresses, ensuring that traffic isn't exposed to the internet.
The entire network path and the target ingestion services (Cloud Storage, BigQuery, and Pub/Sub) within the ingestion project are secured by a VPC Service Controls perimeter. This perimeter enforces a security boundary, ensuring that the protected data originating from the source can only be ingested into the authorized Google Cloud services within that specific project.
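The following minimal Terraform sketch shows one way to create a Private Service Connect endpoint for Google APIs in the ingestion project's VPC network, under assumed project, network, and address values. It is not the blueprint's actual networking module.

```hcl
# Minimal sketch of a Private Service Connect endpoint for Google APIs.
# The project, network, names, and internal IP address are illustrative
# placeholders.

resource "google_compute_global_address" "psc_googleapis" {
  project      = "example-ingestion" # hypothetical project ID
  name         = "psc-googleapis-ip"
  purpose      = "PRIVATE_SERVICE_CONNECT"
  address_type = "INTERNAL"
  network      = "projects/example-ingestion/global/networks/ingest-vpc"
  address      = "10.10.0.5"
}

resource "google_compute_global_forwarding_rule" "psc_googleapis" {
  project               = "example-ingestion"
  name                  = "pscgoogleapis" # also becomes the PSC endpoint name
  target                = "all-apis"
  network               = "projects/example-ingestion/global/networks/ingest-vpc"
  ip_address            = google_compute_global_address.psc_googleapis.id
  load_balancing_scheme = ""
}
```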
Logging
This architecture uses the Cloud Logging capabilities that are provided by the enterprise foundations blueprint.
Pipelines
The enterprise data mesh architecture uses a series of pipelines to provision the infrastructure, orchestration, data sets, data pipelines, and application components. The architecture's resource deployment pipelines use Terraform as the infrastructure as code (IaC) tool and Cloud Build as the CI/CD service to deploy the Terraform configurations into the architecture environment. The following diagram shows the relationship between the pipelines.
The foundation pipeline and the infrastructure pipeline are part of the enterprise foundations blueprint. The following table describes the purpose of the pipelines and the resources that they provision.
| Pipeline | Provisioned by | Resources |
|---|---|---|
| Foundation pipeline | Bootstrap | |
| Infrastructure pipeline | Foundation pipeline | |
| Service Catalog pipeline | Infrastructure pipeline | |
| Artifact pipelines | Infrastructure pipeline | Artifact pipelines produce the various containers and other components of the codebase used by the data mesh. |
Each pipeline has its own set of repositories that it pulls code and configuration files from. Each repository enforces a separation of duties in which the submitters and approvers of operational code deployments belong to different groups.
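For example, a deployment stage of such a pipeline could be expressed as a Cloud Build trigger that runs a Terraform apply build when changes land on an approved branch. The following minimal Terraform sketch uses hypothetical project, repository, branch, and build file names.

```hcl
# Minimal sketch of a Cloud Build trigger for a Terraform apply stage.
# The project, repository, branch, and build file names are illustrative
# placeholders, not the blueprint's actual pipeline definition.

resource "google_cloudbuild_trigger" "data_mesh_apply" {
  project  = "example-cicd"             # hypothetical CI/CD project ID
  name     = "data-mesh-tf-apply"
  filename = "cloudbuild-tf-apply.yaml" # hypothetical build config in the repo

  trigger_template {
    repo_name   = "data-mesh-infra"     # hypothetical Cloud Source Repositories repo
    branch_name = "^production$"
  }
}
```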
Interactive deployment through Service Catalog
The interactive environment is the development environment within the architecture and exists under the development folder. The main interface for the interactive environment is Service Catalog, which lets developers use preconfigured templates to instantiate Google Cloud services. These preconfigured templates are known as service templates. Service templates help you enforce your security posture, such as making CMEK encryption mandatory, and also prevent your users from having direct access to Google APIs.
The following diagram shows the components of the interactive environment and how data scientists deploy resources.
To deploy resources using the Service Catalog, the following steps occur:
- The MLOps engineer puts a Terraform resource template for Google Cloud into a Git repository.
- A Git commit triggers a Cloud Build pipeline.
- Cloud Build copies the template and any associated configuration files to Cloud Storage.
- The MLOps engineer sets up the Service Catalog solutions and Service Catalog manually. The engineer then shares the Service Catalog with a service project in the interactive environment.
- The data scientist selects a resource from the Service Catalog.
- Service Catalog deploys the template into the interactive environment.
- The resource pulls any necessary configuration scripts.
- The data scientist interacts with the resources.
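As an illustration of a service template that enforces CMEK, the following minimal Terraform sketch defines a Cloud Storage bucket whose encryption key must be supplied by the platform. The variables and values are hypothetical placeholders, not the blueprint's published Service Catalog templates.

```hcl
# Minimal sketch of a CMEK-enforcing service template: the bucket cannot be
# created without a Cloud KMS key. Variable names are illustrative.

variable "project_id" { type = string }
variable "bucket_name" { type = string }
variable "cmek_key_id" { type = string } # a Cloud KMS crypto key resource ID

resource "google_storage_bucket" "interactive_bucket" {
  project                     = var.project_id
  name                        = var.bucket_name
  location                    = "US"
  uniform_bucket_level_access = true

  encryption {
    default_kms_key_name = var.cmek_key_id # CMEK is mandatory in this template
  }
}
```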
Artifact pipelines
The data ingestion process uses Cloud Composer and Dataflow to orchestrate the movement and transformation of data within the data domain. The artifact pipeline builds all necessary resources for data ingestion and moves the resources to the appropriate location for the services to access them. The artifact pipeline creates the container artifacts that the orchestrator uses.
Security controls
The enterprise data mesh architecture uses a layered defense-in-depth security model that includes default Google Cloud capabilities, Google Cloud services, and security capabilities that are configured through the enterprise foundations blueprint. The following diagram shows the layering of the various security controls for the architecture.
The following table describes the security controls that are associated with the resources in each layer.
| Layer | Resource | Security control |
|---|---|---|
| CDMC framework | Google Cloud CDMC implementation | Provides a governance framework that helps you secure, manage, and control your data assets. See CDMC Key Controls Framework for more information. |
| Deployment | Infrastructure pipeline | Provides a series of pipelines that deploy infrastructure, build containers, and create data pipelines. The use of pipelines allows for auditability, traceability, and repeatability. |
| | Artifact pipeline | Deploys various components not deployed by the infrastructure pipeline. |
| | Terraform templates | Build out the system infrastructure. |
| | Open Policy Agent | Helps ensure that the platform conforms to selected policies. |
| Network | Private Service Connect | Provides data exfiltration protections around the architecture resources at the API layer and the IP layer. Lets you communicate with Google Cloud APIs using private IP addresses so that you can avoid exposing traffic to the internet. |
| | VPC network with private IP addresses | Helps remove exposure to internet-facing threats. |
| | VPC Service Controls | Helps protect sensitive resources against data exfiltration. |
| | Firewall | Helps protect the VPC network against unauthorized access. |
| Access management | Access Context Manager | Controls who can access what resources and helps prevent unauthorized use of your resources. |
| | Workload Identity Federation | Removes the need for external credentials to transfer data onto the platform from on-premises environments. |
| | Data Catalog | Provides an index of assets available to users. |
| | IAM | Provides fine-grained access control. |
| Encryption | Cloud KMS | Lets you manage your encryption keys and secrets, and helps protect your data through encryption at rest and encryption in transit. |
| | Secret Manager | Provides a secret store for pipelines that is controlled by IAM. |
| | Encryption at rest | By default, Google Cloud encrypts data at rest. |
| | Encryption in transit | By default, Google Cloud encrypts data in transit. |
| Detective | Security Command Center | Helps you detect misconfigurations and malicious activity in your Google Cloud organization. |
| | Continuous architecture | Continually checks your Google Cloud organization against a series of OPA policies that you have defined. |
| | IAM Recommender | Analyzes user permissions and provides suggestions about reducing permissions to help enforce the principle of least privilege. |
| | Firewall Insights | Analyzes firewall rules, identifies overly permissive rules, and suggests more restrictive rules to help strengthen your overall security posture. |
| | Cloud Logging | Provides visibility into system activity and helps enable the detection of anomalies and malicious activity. |
| | Cloud Monitoring | Tracks key signals and events that can help identify suspicious activity. |
| Preventative | Organization Policy | Lets you control and restrict actions within your Google Cloud organization. |
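As one example of a preventative control, the following minimal Terraform sketch constrains where resources can be created by using the Organization Policy Service. The parent resource and value group are hypothetical placeholders.

```hcl
# Minimal sketch of a resource location constraint applied to one project.
# The project and value group are illustrative placeholders; the policy can
# also be attached to a folder or to the organization.

resource "google_org_policy_policy" "resource_locations" {
  name   = "projects/example-domain-conf/policies/gcp.resourceLocations"
  parent = "projects/example-domain-conf"

  spec {
    rules {
      values {
        allowed_values = ["in:us-locations"]
      }
    }
  }
}
```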
Workflows
The following sections outline the data producer workflow and the data consumer workflow, which enforce appropriate access controls based on data sensitivity and user roles.
Data producer workflow
The following diagram shows how data is protected as it is transferred to BigQuery.
The workflow for data transfer is the following:
- An application that is integrated with Workload Identity Federation uses Cloud KMS to decrypt a wrapped encryption key.
- The application uses the Tink library to de-identify or encrypt the data using a template.
- The application transfers data to the ingestion project in Google Cloud.
- The data arrives in Cloud Storage, BigQuery, or Pub/Sub.
- In the ingestion project, the data is decrypted or re-identified using a template.
- The decrypted data is encrypted or masked based on another de-identification template and then placed in the non-confidential project. Tag Engine applies tags as appropriate.
- Data from the non-confidential project is transferred to the confidential project and re-identified.
The following data access is permitted:
- Users who have access to the confidential project can access all the raw plaintext data.
- Users who have access to the non-confidential project can access masked, tokenized, or encrypted data based on the tags associated with the data and their permissions.
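The wrapped key that the source application decrypts in the first step is protected by Cloud KMS. The following minimal Terraform sketch shows a key ring, a wrapping key, and a grant to a federated workload identity; the project, names, rotation period, and identity pool are hypothetical placeholders.

```hcl
# Minimal sketch of a Cloud KMS key that an external application could use to
# unwrap its data encryption key before Tink de-identifies the payload.

resource "google_kms_key_ring" "ingestion" {
  project  = "example-ingestion" # hypothetical project ID
  name     = "ingestion-keyring"
  location = "us-central1"
}

resource "google_kms_crypto_key" "tink_wrapping_key" {
  name            = "tink-wrapping-key"
  key_ring        = google_kms_key_ring.ingestion.id
  purpose         = "ENCRYPT_DECRYPT"
  rotation_period = "7776000s" # 90 days
}

# The external workload's federated identity only needs to use the key, not
# administer it. The pool reference below is a hypothetical example.
resource "google_kms_crypto_key_iam_member" "wrapper_user" {
  crypto_key_id = google_kms_crypto_key.tink_wrapping_key.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "principalSet://iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/onprem-pool/*"
}
```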
Data consumer workflow
The following steps describe how a consumer can access data that is stored in BigQuery.
- The data consumer searches for data assets using Data Catalog.
- After the consumer finds the assets that they are looking for, the data consumer requests access to the data assets.
- The data owner decides whether to provide access to the assets.
- If the consumer obtains access, the consumer can use a notebook and the Solution Catalog to create an environment in which they can analyze and transform the data assets.
Bringing it all together
The GitHub repository provides detailed instructions for deploying the data mesh on Google Cloud after you have deployed the enterprise foundations blueprint. The process to deploy the architecture involves modifying your existing infrastructure repositories and deploying new data mesh-specific components.
Complete the following:
- Complete all prerequisites, including the following:
- Install the Google Cloud CLI, Terraform, Tink, Java, and Go.
- Deploy the enterprise foundations blueprint (v4.1).
- Maintain the following local repositories:
gcp-data-mesh-foundations
gcp-bootstrap
gcp-environments
gcp-networks
gcp-org
gcp-projects
- Modify the existing foundation blueprint and then deploy the data mesh applications. For each item, complete the following:
    - In your target repository, check out the plan branch.
    - To add data mesh components, copy the relevant files and directories from gcp-data-mesh-foundations into the appropriate foundation directory. Overwrite files when required.
    - Update the data mesh variables, roles, and settings in the Terraform files (for example, *.tfvars and *.tf). Set the GitHub tokens as environment variables.
    - Perform the Terraform initialize, plan, and apply operations on each repository.
    - Commit your changes, push the code to your remote repository, create pull requests, and merge to your development, non-production, and production environments.
What's next
- Read about the architecture and functions in a data mesh.
- Import data from Google Cloud into a secured BigQuery data warehouse.
- Implement the CDMC key controls framework in a BigQuery data warehouse.
- Read about the enterprise foundations blueprint.