Protecting confidential data in Vertex AI Workbench user-managed notebooks

Last reviewed 2021-04-29 UTC

This document suggests controls and layers of security that you can use to help protect confidential data in Vertex AI Workbench user-managed notebooks. It's part of a blueprint solution that also includes a GitHub repository of Terraform scripts, which implement the security controls described here.

In this document, confidential data refers to sensitive information that someone in your enterprise would need higher levels of privilege to access. This document is intended for teams that administer user-managed notebooks.

This document assumes that you have already configured a foundational set of security controls to protect your cloud infrastructure deployment. The blueprint helps you layer additional controls onto these existing security controls to protect confidential data in user-managed notebooks. For more information about best practices for building security into your Google Cloud deployments, see the Google Cloud enterprise foundations blueprint.


Applying data governance and security policies to help protect user-managed notebooks with confidential data often requires you to balance the following objectives:

  • Helping protect data used by notebook instances by using the same data governance and security practices and controls that you apply across your enterprise.
  • Ensuring that data scientists in your enterprise have access to the data that they need to provide meaningful insights.

Before you give data scientists in your enterprise access to data in user-managed notebooks, you must understand the following:

  • How the data flows through your environment.
  • Who is accessing the data.

Consider the following to help your understanding:

  • How to deploy your Google Cloud resource hierarchy to isolate your data.
  • Which IAM groups are authorized to use data from BigQuery.
  • How your data governance policy influences your environment.

The Terraform scripts in the GitHub repository associated with the blueprint implement the security controls that are described in this document. The repository also contains sample data to illustrate data governance practices. For more information about data governance within Google Cloud, see What is data governance?


The following architectural diagram shows the project hierarchy and resources such as user-managed notebooks and encryption keys.

Architecture of the blueprint.

The perimeter in this architecture is referred to as the higher trust boundary. It helps protect confidential data used in the Virtual Private Cloud (VPC). Data scientists must access data through the higher trust boundary. For more information, see VPC Service Controls.

The higher trust boundary contains every cloud resource that interacts with confidential data, which can help you to manage your data governance controls. Services such as user-managed notebooks, BigQuery, and Cloud Storage have the same trust level within the boundary.

The architecture also creates security controls that help you to do the following:

  • Mitigate the risk of data exfiltration to a device that is used by data scientists in your enterprise.
  • Protect your notebook instances from external network traffic.
  • Limit access to the VM that hosts the notebook instances.

Organization structure

Resource Manager lets you logically group resources by project, folder, and organization. The following diagram shows you a resource hierarchy with folders that represent different environments such as production or development.

Resource hierarchy with production and developer folders.

In your production folder, you create a new folder that represents your trusted environment.

You add organization policies to the trusted folder that you create. The following sections describe how information is organized within the folder, subfolders, and projects.

Trusted folder

The blueprint helps you isolate data by introducing a new subfolder within your production folder for user-managed notebooks and any data that the notebook instances use from BigQuery. The following table describes the relationships of the folders within the organization and lists the folders that are used by this blueprint.

Folder | Description
production | Contains projects which have cloud resources that have been tested and are ready to use.
trusted | Contains projects and resources for notebook instances with confidential data. This folder is a child of the production folder.

Projects within the organization

The blueprint helps you isolate parts of your environment using projects. Because these projects don't have a project owner, you must create explicit IAM policy bindings for the appropriate IAM groups.

The following table describes where you create the projects that are needed within the organization.

Project | Parent folder | Description
trusted-kms | trusted | Contains services that manage the encryption key that protects your data (for example, Cloud HSM). This project is in the higher trust boundary.
trusted-data | trusted | Contains services that handle confidential data (for example, BigQuery). This project is in the higher trust boundary.
trusted-analytics | trusted | Contains the user-managed notebooks that are used by data scientists. This project is in the higher trust boundary.
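
Organization policies that you attach to the trusted folder are inherited by the projects beneath it, so it can help to reason about the hierarchy as a simple parent map. The following is a minimal sketch of that idea, not part of the blueprint's Terraform scripts. The `PARENT` dictionary, the `organization` root name, and both helper functions are hypothetical names introduced for illustration.

```python
# Hypothetical in-memory model of the folders and projects in the tables above.
# "organization" stands in for the organization node at the root.
PARENT = {
    "production": "organization",
    "trusted": "production",
    "trusted-kms": "trusted",
    "trusted-data": "trusted",
    "trusted-analytics": "trusted",
}


def ancestors(resource):
    """Return the chain of parents from the resource up to the root."""
    chain = []
    while resource in PARENT:
        resource = PARENT[resource]
        chain.append(resource)
    return chain


def inherits_from(resource, folder):
    """True if a policy set on `folder` would be inherited by `resource`."""
    return folder in ancestors(resource)


print(ancestors("trusted-data"))  # prints ['trusted', 'production', 'organization']
```

A check such as `inherits_from("trusted-analytics", "trusted")` can confirm that a project sits under the folder where the constraints are applied.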

Understanding the security controls that you apply

This section discusses the security controls within Google Cloud that help you protect your notebook instances. The approach discussed in this document uses multiple layers of control to help secure your sensitive data. We recommend that you adapt these layers of control as required by your enterprise.

Organization policy setup

You use the Organization Policy Service to configure restrictions on supported resources within your Google Cloud organization. You configure constraints that are applied to the trusted folder as described in the following table. For more information about the policy constraints, see Organization policy constraints.

Policy constraint | Description | Recommended value
gcp.resourceLocations (list) | Defines constraints on how resources are deployed to particular regions. For additional values, see valid region groups. | ["in:us-locations", "in:eu-locations"]
iam.disableServiceAccountCreation (boolean) | When the value is true, prevents the creation of a service account. | true
iam.disableServiceAccountKeyCreation (boolean) | When the value is true, prevents the creation of service account keys. | true
iam.automaticIamGrantsForDefaultServiceAccounts (boolean) | When the value is true, prevents default service accounts from being granted any IAM role on the project when the accounts are created. | true
compute.requireOsLogin (boolean) | When the value is true, enables OS Login. For more information, see OS Login. | true
compute.restrictProtocolForwardingCreationForTypes (list) | Limits new forwarding rules to be internal only. | ["is:INTERNAL"]
compute.restrictSharedVpcSubnetworks (list) | Defines the set of shared VPC subnetworks that eligible resources can use. Provide the name of the project that has your shared VPC subnet. | Replace VPC_SUBNET with the resource ID of the private subnet that you want user-managed notebooks to use.
compute.vmExternalIpAccess (list) | Defines the set of Compute Engine VM instances that have permission to use external IP addresses. | deny all=true
compute.skipDefaultNetworkCreation (boolean) | When the value is true, causes Google Cloud to skip the creation of the default network and related resources during Google Cloud resource creation. | true
compute.disableSerialPortAccess (boolean) | When the value is true, prevents serial port access to Compute Engine VMs. | true
compute.disableSerialPortLogging (boolean) | When the value is true, prevents serial port logging to Cloud Logging from Compute Engine VMs. | true

For more information about additional policy controls, see the Google Cloud enterprise foundations blueprint.
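
One way to keep the recommended values reviewable is to hold them as data and compare them against whatever your tooling reports as the effective policy on the trusted folder. The following is a minimal sketch under that assumption. It doesn't call any Google Cloud API, and the `effective` dictionary that you pass in is a hypothetical export that you would produce yourself. The two constraints whose values are site-specific (`compute.restrictSharedVpcSubnetworks` and `compute.vmExternalIpAccess`) are left out.

```python
# Recommended values copied from the table above. Site-specific constraints
# (shared VPC subnetworks, external IP access) are intentionally omitted.
RECOMMENDED = {
    "gcp.resourceLocations": ["in:us-locations", "in:eu-locations"],
    "iam.disableServiceAccountCreation": True,
    "iam.disableServiceAccountKeyCreation": True,
    "iam.automaticIamGrantsForDefaultServiceAccounts": True,
    "compute.requireOsLogin": True,
    "compute.restrictProtocolForwardingCreationForTypes": ["is:INTERNAL"],
    "compute.skipDefaultNetworkCreation": True,
    "compute.disableSerialPortAccess": True,
    "compute.disableSerialPortLogging": True,
}


def find_deviations(effective):
    """Return {constraint: (expected, actual)} for every mismatch.

    `effective` is a hypothetical dict of constraint name to configured value.
    A constraint that is missing from `effective` is reported with actual=None.
    """
    deviations = {}
    for name, expected in RECOMMENDED.items():
        actual = effective.get(name)
        if actual != expected:
            deviations[name] = (expected, actual)
    return deviations
```

An empty result means that every constraint in the sketch matches the recommended value. Anything else lists the expected and actual values side by side for review.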

Authentication and authorization

The blueprint helps you establish IAM controls and access patterns that you can apply to user-managed notebooks. The blueprint helps you define access patterns in the following ways:

  • Using a higher trust data scientist group. Individual identities do not have permissions assigned to access the data.
  • Defining a custom IAM role called restrictedDataViewer.
  • Using least privilege principles to limit access to your data.

Users and groups

The higher trust boundary has the following two personas:

  • The data owner, who is responsible for classifying the data within BigQuery.
  • The trusted data scientist, who is allowed to handle confidential data.

You associate these personas with groups. Instead of granting roles to individual identities, you add each identity that matches a persona to the corresponding group.

The blueprint helps you enforce least privilege by defining a one-to-one mapping between data scientists and their notebook instances so that only a single data scientist identity can access the notebook instance. Individual data scientists are not granted editor permissions to a notebook instance.

The following table shows this information:

  • The personas that you assign to each group.
  • The IAM roles that you assign to each group at the project level.

Group | Description | Roles | Project
Data owner group | Members are responsible for data classification and managing data within BigQuery. | roles/bigquery.dataOwner | trusted-data
Higher trust data scientist group | Members are allowed to access data that is within the trusted folder. | roles/restrictedDataViewer (custom) | trusted-data
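
The group-based pattern can be illustrated with the shape of an IAM policy binding, where each member string carries a type prefix such as `group:` or `user:`. The following is a minimal sketch, not the blueprint's implementation. The group email addresses are hypothetical placeholders, and the helper names are introduced only for this example.

```python
def group_binding(role, group_email):
    """Build a binding dict that grants `role` to a group, never to a user."""
    return {"role": role, "members": ["group:" + group_email]}


def non_group_members(bindings):
    """Return every member across the bindings that is not a group."""
    return [
        member
        for binding in bindings
        for member in binding["members"]
        if not member.startswith("group:")
    ]


# Hypothetical group addresses; replace them with the groups in your directory.
bindings = [
    group_binding("roles/bigquery.dataOwner", "data-owners@example.com"),
    group_binding("roles/restrictedDataViewer", "trusted-data-scientists@example.com"),
]
print(non_group_members(bindings))  # prints []
```

Running `non_group_members` over the bindings for the trusted-data project gives a quick review signal: any entry in the result is an identity that was granted a role directly.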

User-managed service accounts

You create a user-managed service account for user-managed notebooks to use instead of the Compute Engine default service account. The roles for the service account for notebook instances are defined in the following table.

Service account | Description | Roles | Project
User-managed notebook service account | A service account used by Vertex AI for provisioning notebook instances. | | trusted-analytics

The blueprint also helps you configure the Google-managed service account that represents your user-managed notebooks by granting it access to the specified customer-managed encryption keys (CMEKs). This resource-specific grant applies least privilege to the key that is used by user-managed notebooks.

Because the projects don't have a project owner defined, data scientists aren't permitted to manage the keys.

Custom roles

In the blueprint, you create a roles/restrictedDataViewer custom role. The custom role is based on the predefined BigQuery dataViewer role, which lets users read data from BigQuery tables, with the export permission removed. You assign this role to the group. The following table shows the permissions that are used by the roles/restrictedDataViewer role.

Custom role name | Description | Permissions
roles/restrictedDataViewer | Lets notebook instances within the higher trust boundary view sensitive data from BigQuery. | Based on the roles/bigquery.dataViewer role without the export permission (for example, bigquery.models.export).
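
Deriving the custom role amounts to filtering a permission list. The following is a minimal sketch of that step. The `BASE_PERMISSIONS` list is an illustrative subset assumed for this example, not the authoritative definition of roles/bigquery.dataViewer, so check the current role definition before you rely on it. Only bigquery.models.export is named in this document.

```python
# Illustrative subset of viewer permissions (an assumption for this sketch).
BASE_PERMISSIONS = [
    "bigquery.datasets.get",
    "bigquery.models.export",
    "bigquery.models.getData",
    "bigquery.models.getMetadata",
    "bigquery.models.list",
    "bigquery.tables.export",
    "bigquery.tables.get",
    "bigquery.tables.getData",
    "bigquery.tables.list",
]


def without_export(permissions):
    """Drop every permission whose final segment is 'export'."""
    return [p for p in permissions if not p.endswith(".export")]


restricted = without_export(BASE_PERMISSIONS)
print(len(restricted))  # prints 7
```

Filtering by suffix rather than listing the removed permissions by name means that the sketch also drops any other `.export` permission that appears in the base list.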

Least privilege

The blueprint helps you grant roles that have the minimum level of privilege. For example, you configure a one-to-one mapping between a single data scientist identity and a notebook instance, rather than a shared mapping with a service account. Restricting privilege helps prevent data scientists from logging in directly to the VMs that host their notebook instances.
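
The one-to-one mapping can be checked mechanically if you keep an inventory of which identity is assigned to which notebook instance. The following is a minimal sketch that assumes such an inventory exists as a list of pairs. The function name and the sample identifiers are hypothetical.

```python
from collections import Counter


def mapping_violations(assignments):
    """Find departures from a one-to-one mapping.

    `assignments` is a list of (data_scientist, notebook_instance) pairs.
    Duplicate pairs are ignored before counting.
    """
    unique = set(assignments)
    per_instance = Counter(instance for _, instance in unique)
    per_user = Counter(user for user, _ in unique)
    return {
        "shared_instances": sorted(i for i, n in per_instance.items() if n > 1),
        "users_with_multiple_instances": sorted(
            u for u, n in per_user.items() if n > 1
        ),
    }


print(mapping_violations([("alice", "nb-1"), ("bob", "nb-2")]))
# prints {'shared_instances': [], 'users_with_multiple_instances': []}
```

Whether a data scientist who holds more than one instance counts as a violation is a policy decision for your enterprise. The sketch reports that case separately so that you can choose.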

Privileged access

Users in the higher trust data scientist group have privileged access. This level of access means that these users have identities that can access confidential data. Work with your identity team to provide hardware security keys with 2-Step Verification (2SV) enabled for these data scientist identities.


You specify a shared VPC environment for your notebooks, such as one defined by the Google Cloud enterprise foundations network scripts.

The network for the notebook instances has the following properties:

  • A shared VPC using a private restricted network with no external IP address.
  • Restrictive firewall rules.
  • A VPC Service Controls perimeter that encompasses all the services and projects that your user-managed notebooks interact with.
  • An Access Context Manager policy.

Restricted shared VPC

You configure user-managed notebooks to use the shared VPC that you specify. Because OS Login is required, your shared VPC minimizes access to the notebook instances. You can configure explicit access for your data scientists using Identity-Aware Proxy (IAP).

You also configure the private connectivity to Google APIs and services in your shared VPC using the domain. This configuration enables the services in your environment to support VPC Service Controls.

For an example of how to set up your shared restricted VPC, see the security foundation blueprint network configuration Terraform scripts.

VPC Service Controls perimeter

The blueprint helps you establish the higher trust boundary for your trusted environment by using VPC Service Controls.

Service perimeters are an organization-level control that you can use to help protect Google Cloud services in your projects by mitigating the risk of data exfiltration.

The following table describes how you configure your VPC Service Controls perimeter.

Attribute | Consideration | Value
projects | Include all projects that contain data accessed by data scientists that use user-managed notebooks, including keys. | ["trusted-kms"
services | Add additional services as necessary. | ["",
access_level | Add Access Context Manager policies that align with your security requirements and add more detailed endpoint verification policies. For more information, see Access Context Manager. | ACCESS_POLICIES
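
Because the perimeter must include every project that holds data or keys used by the notebooks, a simple set comparison can flag gaps. The following is a minimal sketch. It assumes that the perimeter contains the three projects that this document places in the higher trust boundary, and the list of accessed projects is a hypothetical input that you would gather yourself.

```python
# Assumption: the three projects described earlier as being in the
# higher trust boundary make up the perimeter.
PERIMETER_PROJECTS = {"trusted-kms", "trusted-data", "trusted-analytics"}


def projects_outside_perimeter(accessed, perimeter=PERIMETER_PROJECTS):
    """Return the accessed projects that the perimeter does not cover."""
    return sorted(set(accessed) - set(perimeter))


print(projects_outside_perimeter(["trusted-data", "trusted-kms"]))  # prints []
```

A non-empty result points to a project that notebooks depend on but that sits outside the higher trust boundary.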

Access Context Manager

The blueprint helps you configure Access Context Manager with your VPC Service Controls perimeter. Access Context Manager lets you define fine-grained, attribute-based access control for projects and resources. You use Endpoint Verification and configure the policy to align with your corporate governance requirements for accessing data. Work with your administrator to create an access policy for the data scientists in your enterprise.

We recommend that you use the values shown in the following table for your access policy.

Condition | Consideration | Values
ip_subnetworks | Use IP ranges that are trusted by your enterprise. | (list) CIDR ranges allowed access to resources within the perimeter.
members | Add highly privileged users that can access the perimeter. | (list) Privileged identities of data scientists and Terraform service account for provisioning.
device_policy.require_screen_lock | Devices must have screen lock enabled. | true
device_policy.require_corp_owned | Only allow corporate devices to access user-managed notebooks. | true
device_policy.allowed_encryption_statuses | Only allow data scientists to use devices that encrypt data at rest. | (list) ENCRYPTED
regions | Maintain regionalization where data scientists can access their notebook instances. Limit to the smallest set of regions where you expect data scientists to work. | Valid region codes
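
To see how these conditions combine, it can help to model them locally. The following is a minimal sketch that treats every condition in the table as required. It is a plain Python model, not the Access Context Manager API, and the `Request` and `AccessLevel` classes, the sample CIDR range, the member address, and the region code are all hypothetical values chosen for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical description of an incoming request (not an API type)."""
    member: str
    source_ip: str
    region: str
    screen_lock: bool
    corp_owned: bool
    encryption_status: str


@dataclass
class AccessLevel:
    """Conditions named after the table rows; the values are site-specific."""
    ip_subnetworks: list
    members: list
    regions: list
    require_screen_lock: bool = True
    require_corp_owned: bool = True
    allowed_encryption_statuses: tuple = ("ENCRYPTED",)

    def failed_conditions(self, req):
        """Return the names of the conditions that the request doesn't meet."""
        ip = ipaddress.ip_address(req.source_ip)
        checks = {
            "ip_subnetworks": any(
                ip in ipaddress.ip_network(cidr) for cidr in self.ip_subnetworks
            ),
            "members": req.member in self.members,
            "device_policy.require_screen_lock":
                req.screen_lock or not self.require_screen_lock,
            "device_policy.require_corp_owned":
                req.corp_owned or not self.require_corp_owned,
            "device_policy.allowed_encryption_statuses":
                req.encryption_status in self.allowed_encryption_statuses,
            "regions": req.region in self.regions,
        }
        return [name for name, ok in checks.items() if not ok]


level = AccessLevel(
    ip_subnetworks=["203.0.113.0/24"],           # documentation-only range
    members=["user:data-scientist@example.com"],  # hypothetical identity
    regions=["US"],
)
ok = Request("user:data-scientist@example.com", "203.0.113.10", "US",
             True, True, "ENCRYPTED")
print(level.failed_conditions(ok))  # prints []
```

Returning the list of failed conditions, rather than a single boolean, makes it easier to explain to a data scientist why a device or location was rejected.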

BigQuery least privilege

The blueprint shows you how to configure access to datasets in BigQuery that are used by data scientists. In the configuration that you set, data scientists must have a notebook instance to access datasets in BigQuery.

The configuration that you set also helps you add layers of security to datasets in BigQuery in the following ways:

  • Granting access to the service account of the notebook instance. Data scientists must have a notebook instance to directly access datasets in BigQuery.
  • Mitigating the risk of data scientists creating copies of data that don't meet the data governance requirements of your enterprise. Data scientists that need to directly interact with BigQuery must be added to the group.

Alternatively, to provide limited access to BigQuery for data scientists, you can use fine-grained access contr