Import data from an external network into a secured BigQuery data warehouse

Last reviewed 2023-08-15 UTC

Many organizations deploy data warehouses that store confidential information so that they can analyze the data for various business purposes. This document is intended for data engineers and security administrators who deploy and secure data warehouses using BigQuery. It's part of a security blueprint that includes the following:

  • A GitHub repository that contains a set of Terraform configurations and scripts. The Terraform configuration sets up an environment in Google Cloud that supports a data warehouse that stores confidential data.
  • A guide to the architecture, design, and security controls that you implement with this blueprint (this document).

This document discusses the following:

  • The architecture and Google Cloud services that you can use to help secure a data warehouse in a production environment.
  • Best practices for importing data into BigQuery from an external network such as an on-premises environment.
  • Best practices for data governance when creating, deploying, and operating a data warehouse in Google Cloud, including column-level encryption, differential handling of confidential data, and column-level access controls.

This document assumes that you have already configured a foundational set of security controls as described in the Google Cloud enterprise foundations blueprint. It helps you to layer additional controls onto your existing security controls to help protect confidential data in a data warehouse.

Data warehouse use cases

The blueprint supports use cases such as importing confidential data from an external network (for example, an on-premises environment) into a secured BigQuery data warehouse.

Overview

Data warehouses such as BigQuery let businesses analyze their business data for insights. Analysts access the business data that is stored in data warehouses to create those insights. If your data warehouse includes data that you consider confidential, you must take measures to preserve the security, confidentiality, integrity, and availability of that data while it's imported, while it's stored, while it's in transit, and while it's being analyzed. In this blueprint, you do the following:

  • Encrypt your source data that's located outside of Google Cloud (for example, in an on-premises environment) and import it into BigQuery.
  • Configure controls that help secure access to confidential data.
  • Configure controls that help secure the data pipeline.
  • Configure an appropriate separation of duties for different personas.
  • Set up appropriate security controls and logging to help protect confidential data.
  • Use data classification, policy tags, dynamic data masking, and column-level encryption to restrict access to specific columns in the data warehouse.

Architecture

To create a confidential data warehouse, you need to import data securely and then store the data in a VPC Service Controls perimeter. The following image shows how data is ingested and stored.

The secured data warehouse architecture for external networks.

The architecture uses a combination of the following Google Cloud services and features:

  • Dedicated Interconnect lets you move data between your network and Google Cloud. You can use another connectivity option, as described in Choosing a Network Connectivity product.

  • Identity and Access Management (IAM) and Resource Manager restrict access and segment resources. The access controls and resource hierarchy follow the principle of least privilege.

  • VPC Service Controls creates security perimeters that isolate services and resources by setting up authorization, access controls, and secure data exchange. The perimeters are as follows:

    • A data ingestion perimeter that accepts incoming data (in batch or stream). A separate perimeter helps to protect the rest of your workloads from incoming data.
    • A data perimeter that isolates the encrypted data from other workloads.
    • A governance perimeter that stores the encryption keys and defines what is considered confidential data.

    These perimeters are designed to protect incoming content, isolate confidential data by setting up additional access controls and monitoring, and separate your governance from the actual data in the warehouse. Your governance includes key management, data catalog management, and logging. A minimal Terraform sketch of the data perimeter appears after this list.

  • Cloud Storage and Pub/Sub receive incoming data: Cloud Storage receives and stores batch data that is uploaded to the ingestion bucket, and Pub/Sub receives and stores streaming data.

  • Cloud Functions is triggered by Cloud Storage and writes the batch data that is uploaded to the ingestion bucket into BigQuery.

  • A Dataflow pipeline writes streaming data into BigQuery. To protect data, Dataflow uses a unique service account and access controls. To help secure pipeline execution, Dataflow uses Streaming Engine, which moves pipeline execution from the worker VMs into the Dataflow service backend. For more information, see Dataflow security and permissions.

  • Cloud Data Loss Prevention (Cloud DLP) scans data that is stored in BigQuery to find any sensitive data that isn't protected. For more information, see Using Cloud DLP to scan BigQuery data.

  • Cloud HSM hosts the key encryption key (KEK). Cloud HSM is a cloud-based Hardware Security Module (HSM) service. You use Cloud HSM to generate the encryption key that you use to encrypt the data in your network before sending it to Google Cloud.

  • Data Catalog automatically categorizes confidential data with metadata, also known as policy tags, when it's discovered in BigQuery. Data Catalog also uses metadata to manage access to confidential data. For more information, see Data Catalog overview. To control access to data within the data warehouse, you apply policy tags to columns that include confidential data.

  • BigQuery stores the encrypted data and the wrapped encryption key in separate tables.

    BigQuery uses various security controls to help protect content, including access controls, column-level encryption, column-level security, and data encryption.

  • Security Command Center monitors and reviews security findings from across your Google Cloud environment in a central location.

  • Cloud Logging collects all the logs from Google Cloud services for storage and retrieval by your analysis and investigation tools.

  • Cloud Monitoring collects and stores performance information and metrics about Google Cloud services.

  • Data Profiler for BigQuery automatically scans for sensitive data in all BigQuery tables and columns across the entire organization, including all folders and projects.
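
The following is a minimal Terraform sketch of the data perimeter described earlier. It's an illustration rather than the blueprint's actual configuration: the access policy ID, project number, and restricted services are hypothetical placeholders, and a real deployment also defines ingress and egress rules for the perimeter.

    # Minimal sketch of a VPC Service Controls perimeter around the Data project.
    # var.access_policy_id and var.data_project_number are hypothetical inputs.
    variable "access_policy_id" {
      type = string
    }

    variable "data_project_number" {
      type = string
    }

    resource "google_access_context_manager_service_perimeter" "data_perimeter" {
      parent = "accessPolicies/${var.access_policy_id}"
      name   = "accessPolicies/${var.access_policy_id}/servicePerimeters/data_perimeter"
      title  = "data_perimeter"

      status {
        # Projects that the perimeter protects.
        resources = ["projects/${var.data_project_number}"]

        # Services that can be used only inside the perimeter.
        restricted_services = [
          "bigquery.googleapis.com",
          "storage.googleapis.com",
        ]
      }
    }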

Organization structure

You group your organization's resources so that you can manage them and separate your testing environments from your production environment. Resource Manager lets you logically group resources by project, folder, and organization.

The following diagram shows you a resource hierarchy with folders that represent different environments such as bootstrap, common, production, non-production (or staging), and development. This hierarchy aligns with the organization structure used by the enterprise foundations blueprint. You deploy most of the projects in this blueprint into the Production folder, and you deploy the Data governance project into the Common folder, which is used for centralized governance services.

The resource hierarchy for a secured data warehouse for external networks.

For alternative resource hierarchies, see Decide a resource hierarchy for your Google Cloud landing zone.

Folders

You use folders to isolate your production environment and governance services from your non-production and testing environments. The following table describes the folders from the enterprise foundations blueprint that are used by this blueprint.

Folder | Description
Bootstrap | Contains resources required to deploy the enterprise foundations blueprint.
Common | Contains centralized services for the organization, such as the Data governance project.
Production | Contains projects that have cloud resources that have been tested and are ready to use. In this blueprint, the Production folder contains the Data ingestion project and Data project.
Non-production | Contains projects that have cloud resources that are currently being tested and staged for release. In this blueprint, the Non-production folder contains the Data ingestion project and Data project.
Development | Contains projects that have cloud resources that are currently being developed. In this blueprint, the Development folder contains the Data ingestion project and Data project.

You can change the names of these folders to align with your organization's folder structure, but we recommend that you maintain a similar structure. For more information, see the Google Cloud enterprise foundations blueprint.
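
In this blueprint, these folders are created by the enterprise foundations blueprint rather than by this Terraform configuration. Purely as an illustration of the hierarchy, the following sketch shows how a Production folder and a project nested inside it could be declared; the organization ID, billing account, and resource names are hypothetical.

    variable "org_id" {
      type = string
    }

    variable "billing_account" {
      type = string
    }

    # Hypothetical sketch: a Production folder at the organization root.
    resource "google_folder" "production" {
      display_name = "Production"
      parent       = "organizations/${var.org_id}"
    }

    # A project for the data warehouse, nested under the Production folder.
    resource "google_project" "data" {
      name            = "Data"
      project_id      = "example-data-project"
      folder_id       = google_folder.production.name
      billing_account = var.billing_account
    }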

Projects

You isolate parts of your environment using projects. The following table describes the projects that are needed within the organization. You create these projects when you run the Terraform code. You can change the names of these projects, but we recommend that you maintain a similar project structure.

Project | Description
Data ingestion | Contains services that are required to receive data and write it to BigQuery.
Data governance | Contains services that provide key management, logging, and data cataloging capabilities.
Data | Contains services that are required to store data.

In addition to these projects, your environment must include a project that hosts a Dataflow Flex Template job. The Flex Template job is required for the streaming data pipeline.
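
As an illustration of how that project is used, the following sketch launches a streaming job from a Flex Template with the google_dataflow_flex_template_job resource, which is available in the google-beta provider. The template spec path and parameter names are hypothetical; the real values depend on the Flex Template that the blueprint builds.

    terraform {
      required_providers {
        google-beta = {
          source = "hashicorp/google-beta"
        }
      }
    }

    # Hypothetical sketch of launching the streaming pipeline from a Flex Template.
    resource "google_dataflow_flex_template_job" "streaming_ingest" {
      provider = google-beta

      name                    = "streaming-ingest"
      region                  = "us-central1"
      container_spec_gcs_path = "gs://example-template-bucket/streaming-ingest.json"

      parameters = {
        # Example parameters; actual names are defined by the template's metadata.
        inputSubscription = "projects/example-data-ingestion/subscriptions/ingest-sub"
        outputTable       = "example-data-project:example_dataset.example_table"
      }
    }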

Mapping roles and groups to projects

You must give different user groups in your organization access to the projects that make up the confidential data warehouse. The following sections describe the blueprint recommendations for user groups and role assignments in the projects that you create. You can customize the groups to match your organization's existing structure, but we recommend that you maintain a similar segregation of duties and role assignment.

Data analyst group

Data analysts view and analyze the data in the warehouse. This group can view data after it has been loaded into the data warehouse and perform the same operations as the Encrypted data viewer group. This group requires roles in different projects, as described in the following table.

Scope of assignment Roles
Data ingestion project
Data project
Data policy level Masked Reader (roles/bigquerydatapolicy.maskedReader)

Encrypted data viewer group

The Encrypted data viewer group can view encrypted data from BigQuery reporting tables through Looker Studio and other reporting tools, such as SAP Business Objects. The Encrypted data viewer group can't view cleartext data from encrypted columns.

This group requires the BigQuery Job User (roles/bigquery.jobUser) role in the Data project. This group also requires the Masked Reader (roles/bigquerydatapolicy.maskedReader) role at the data policy level.
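
A minimal Terraform sketch of these two grants follows. The group address, project variable, data policy location, and data policy ID are hypothetical, and depending on your provider version the data policy IAM resource may require the google-beta provider.

    variable "data_project_id" {
      type = string
    }

    # Lets the group run query jobs in the Data project.
    resource "google_project_iam_member" "viewer_job_user" {
      project = var.data_project_id
      role    = "roles/bigquery.jobUser"
      member  = "group:encrypted-data-viewers@example.com"
    }

    # Lets the group read masked values governed by an existing data policy.
    resource "google_bigquery_datapolicy_data_policy_iam_member" "viewer_masked_reader" {
      project        = var.data_project_id
      location       = "us-central1"
      data_policy_id = "example_masking_policy"
      role           = "roles/bigquerydatapolicy.maskedReader"
      member         = "group:encrypted-data-viewers@example.com"
    }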

Plaintext reader group

The Plaintext reader group has the required permission to call the decryption user-defined function (UDF) to view plaintext data, as well as the additional permission to read unmasked data. This group requires roles in the Data project.

In addition, this group requires the Fine-Grained Reader (roles/datacatalog.categoryFineGrainedReader) role at the Data Catalog level.
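
A minimal sketch of granting that role on a policy tag follows, including a hypothetical taxonomy and tag that would live in the Data governance project. The project variable, region, display names, and group address are placeholders, and depending on your provider version the Data Catalog resources may require the google-beta provider.

    variable "governance_project_id" {
      type = string
    }

    # A taxonomy that holds the policy tags used to classify confidential columns.
    resource "google_data_catalog_taxonomy" "confidential" {
      project                = var.governance_project_id
      region                 = "us-central1"
      display_name           = "confidential-data"
      activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
    }

    # A policy tag that is applied to confidential BigQuery columns.
    resource "google_data_catalog_policy_tag" "high" {
      taxonomy     = google_data_catalog_taxonomy.confidential.id
      display_name = "high"
    }

    # Grants the Plaintext reader group fine-grained read access on the tag.
    resource "google_data_catalog_policy_tag_iam_member" "plaintext_readers" {
      policy_tag = google_data_catalog_policy_tag.high.name
      role       = "roles/datacatalog.categoryFineGrainedReader"
      member     = "group:plaintext-readers@example.com"
    }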

Data engineer group

Data engineers set up and maintain the data pipeline and warehouse. This group requires roles in different projects, as described in the following table.

Scope of assignment Roles
Data ingestion project
Data project
  • BigQuery Data Editor (roles/bigquery.dataEditor)
  • BigQuery Job User (roles/bigquery.jobUser)
  • Cloud Build Editor (roles/cloudbuild.builds.editor)
  • Cloud KMS Viewer (roles/cloudkms.viewer)
  • Compute Network User (roles/compute.networkUser)
  • Dataflow Admin (roles/dataflow.admin)
  • DLP Administrator (roles/dlp.admin)
  • Logs Viewer (roles/logging.viewer)
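
As a sketch, these role grants could be expressed as a single for_each binding per project. The project variable and group address are hypothetical.

    variable "data_project_id" {
      type = string
    }

    locals {
      data_engineer_roles = [
        "roles/bigquery.dataEditor",
        "roles/bigquery.jobUser",
        "roles/cloudbuild.builds.editor",
        "roles/cloudkms.viewer",
        "roles/compute.networkUser",
        "roles/dataflow.admin",
        "roles/dlp.admin",
        "roles/logging.viewer",
      ]
    }

    # One binding per role for the data engineer group in the Data project.
    resource "google_project_iam_member" "data_engineers" {
      for_each = toset(local.data_engineer_roles)

      project = var.data_project_id
      role    = each.value
      member  = "group:data-engineers@example.com"
    }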

Network administrator group

Network administrators configure the network. Typically, they are members of the networking team.

Network administrators require the following roles at the organization level:

  • Compute Network Admin (roles/compute.networkAdmin)
  • Logs Viewer (roles/logging.viewer)

Security administrator group

Security administrators administer security controls such as access policies, encryption keys, firewall rules, VPC Service Controls, and Security Command Center.

Security administrators require the following roles at the organization level:

Security analyst group

Security analysts monitor and respond to security incidents and Cloud DLP findings.

Security analysts require the following roles at the organization level:

Example group access flows

The following sections describe access flows for two groups within the secured data warehouse solution.

Access flow for Encrypted data viewer group

The following diagram shows what occurs when a user from the Encrypted data viewer group tries to access encrypted data in BigQuery.

The flow for the encrypted data viewer group.

The steps to access data in BigQuery are as follows:

  1. The Encrypted data viewer executes the following query on BigQuery to access confidential data:

    SELECT ssn, pan FROM cc_card_table
    
  2. BigQuery verifies access as follows:

    • The user is authenticated using valid, unexpired Google Cloud credentials.
    • The user identity and the IP address that the request originated from are part of the allowlist in the Access Level/Ingress rule on the VPC Service Controls perimeter.
    • IAM verifies that the user has the appropriate roles and is authorized to access selected encrypted columns on the BigQuery table.

BigQuery returns the confidential data in encrypted format.
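
The user and IP allowlist checked in step 2 is typically expressed as an access level that the perimeter's ingress rule references. The following is a minimal sketch; the access policy ID, IP range, and user identity are hypothetical.

    variable "access_policy_id" {
      type = string
    }

    # Hypothetical access level that allows a corporate IP range and a specific user.
    resource "google_access_context_manager_access_level" "corp_allowlist" {
      parent = "accessPolicies/${var.access_policy_id}"
      name   = "accessPolicies/${var.access_policy_id}/accessLevels/corp_allowlist"
      title  = "corp_allowlist"

      basic {
        conditions {
          ip_subnetworks = ["203.0.113.0/24"]
          members        = ["user:data-viewer@example.com"]
        }
      }
    }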