BigQuery Engine for Apache Flink security and permissions

When you use BigQuery Engine for Apache Flink to create deployments and to run jobs, you use a permissions system to maintain secure access to files and resources. This document explains the following concepts:

  • What the Managed Flink service agent is and how it's used
  • What the Managed Flink Default Workload Identity is and how it's used
  • Roles and permissions required for creating deployments and jobs
  • Roles and permissions required for accessing resources used by your jobs
  • Types of data used with BigQuery Engine for Apache Flink and data security

BigQuery Engine for Apache Flink service accounts

A service account is a special account used by an application or a virtual machine (VM) instance, not a person. You can give permissions to a service account to allow the service account to access resources or applications. A service agent is a type of service account that is provided automatically by Google. The service agent enables a service to access resources on your behalf.

As part of running BigQuery Engine for Apache Flink deployments and jobs, the BigQuery Engine for Apache Flink service manipulates resources on your behalf. When you create your deployment and run your job on the BigQuery Engine for Apache Flink service, the service uses service agents. For example, BigQuery Engine for Apache Flink uses a service agent to verify whether the subnetwork provided for a job exists.

When you use BigQuery Engine for Apache Flink, two service accounts are created: the Managed Flink service agent and the Managed Flink Default Workload Identity.

Managed Flink service agent

The Managed Flink service agent is created automatically and managed by Google. It enables BigQuery Engine for Apache Flink to access resources on your behalf. The Managed Flink service agent has the following email address:

service-PROJECT_NUMBER@gcp-sa-managedflink.iam.gserviceaccount.com

  • This account is used exclusively by BigQuery Engine for Apache Flink and is specific to your project.
  • Because Google Cloud services expect to have read and write access to the project and its resources, it's recommended that you don't change the default permissions automatically established for your project. If a Managed Flink service agent loses permissions to a project, BigQuery Engine for Apache Flink cannot perform management tasks.
  • If you remove the permissions for the service agent from the Identity and Access Management (IAM) policy, the account remains present, because it's owned by the BigQuery Engine for Apache Flink service.

Managed Flink Default Workload Identity

The Managed Flink Default Workload Identity is a service account that is created automatically. To run your jobs successfully, you need to give this service account permission to access the resources that your job uses, such as Cloud Storage buckets and BigQuery tables.

The Managed Flink Default Workload Identity has the following email address:

gmf-PROJECT_NUMBER-default@gcp-sa-managedflink-wi.iam.gserviceaccount.com

The first time you create a deployment or job, BigQuery Engine for Apache Flink creates the Managed Flink Default Workload Identity. After it is created, you need to add roles to this service account so that it can access the resources used by your job. For more information, see the Access Google Cloud resources section of this document. BigQuery Engine for Apache Flink uses IAM to manage access to resources.

To add roles to your Managed Flink Default Workload Identity, follow these steps. Until you grant at least one role to this service account, it doesn't appear in the Google Cloud console, so grant the first role by using the gcloud CLI instructions. After you add one role, you can use the Google Cloud console to grant additional roles.

Google Cloud console

  1. In the Google Cloud console, go to the IAM page.

    Go to IAM

  2. Select your project.

  3. Select Include Google-provided role grants.

  4. In the row containing your Managed Flink Default Workload Identity, click Edit principal, and then click Add another role.

  5. In the drop-down list, select the role that you want to add. For example, if your BigQuery Engine for Apache Flink job needs to read from a Cloud Storage bucket, add the Storage Object Viewer role.

  6. Repeat for any roles required by resources used in your job, and then click Save.

For more information about granting roles, see Grant an IAM role by using the console.

gcloud CLI

To grant roles to your Managed Flink Default Workload Identity, run the following command for any roles required by resources used in your job. For example, if your BigQuery Engine for Apache Flink job needs to read from a Cloud Storage bucket, add roles/storage.objectViewer.

 gcloud projects add-iam-policy-binding PROJECT_ID \
     --member="serviceAccount:gmf-PROJECT_NUMBER-default@gcp-sa-managedflink-wi.iam.gserviceaccount.com" \
     --role=SERVICE_ACCOUNT_ROLE

  • Replace PROJECT_ID with your project ID.
  • Replace PROJECT_NUMBER with your project number. To find your project number, see Identify projects or use the gcloud projects describe command.
  • Replace SERVICE_ACCOUNT_ROLE with the role that you want to grant. Run the command once for each role.
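Granting several roles one command at a time can be scripted. The following sketch uses a hypothetical project ID, project number, and role list; it builds the workload identity member string and prints one binding command per role rather than running it, so you can review the commands first.

```shell
#!/usr/bin/env bash
# Sketch: grant several roles to the Managed Flink Default Workload Identity.
# PROJECT_ID, PROJECT_NUMBER, and the role list are placeholder assumptions.
PROJECT_ID="my-project"
PROJECT_NUMBER="123456789012"

# In a real script, you could derive the number from the ID:
#   PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)")

MEMBER="serviceAccount:gmf-${PROJECT_NUMBER}-default@gcp-sa-managedflink-wi.iam.gserviceaccount.com"

for ROLE in roles/storage.objectViewer roles/bigquery.dataEditor; do
  # Print the command for review; remove the leading "echo" to apply the binding.
  echo gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="$MEMBER" --role="$ROLE"
done
```

Because each `add-iam-policy-binding` call is additive, rerunning the loop with an expanded role list is safe.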

Grant roles to your user account

You need to grant one of the following roles to the Google Cloud account that you use to run the BigQuery Engine for Apache Flink job:

  • Managed Flink Developer (roles/managedflink.developer)
  • Managed Flink Admin (roles/managedflink.admin)

For more information, see Access control with IAM.

Access Google Cloud resources

Your BigQuery Engine for Apache Flink jobs can access Google Cloud resources, either in the same Google Cloud project or in other projects. These resources might include the following:

  • Managed Service for Apache Kafka clusters
  • BigQuery datasets
  • Cloud Storage buckets
  • Pub/Sub topics and subscriptions

To ensure that your BigQuery Engine for Apache Flink jobs can access these resources, you need to use the resources' respective access control mechanisms to explicitly authorize access for your Managed Flink Default Workload Identity.

BigQuery Engine for Apache Flink also uses firewall rules to allow or deny traffic to and from the resources that your jobs use. You need to configure firewall rules so that your job can access your resources. For more information, see Firewall rules.

If you use Assured Workloads features with BigQuery Engine for Apache Flink, such as EU Regions and Support with Sovereignty Controls, all Cloud Storage, BigQuery, Pub/Sub, I/O connectors, and other resources that your job accesses must be located in your organization's Assured Workloads project or folder.

Access Managed Service for Apache Kafka

Managed Service for Apache Kafka uses two levels of access control:

  • IAM roles from Google Cloud: These roles control who can manage your Managed Service for Apache Kafka cluster using Google Cloud APIs and tools. You manage these roles through the Google Cloud console or the IAM API. For more information, see the IAM documentation.

  • Apache Kafka ACLs from open source Apache Kafka: These ACLs control access to Managed Service for Apache Kafka topics and operations within your cluster, enforced at the Apache Kafka API level. You manage them using the Apache Kafka authorization CLI.

If you're using Google Cloud to manage your cluster, IAM roles are the primary mechanism for controlling access. You can restrict access to individual resources by using Apache Kafka ACLs in addition to IAM permissions. For more information, see Access control with IAM and Apache Kafka ACLs.
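As a sketch of the second layer, the open source kafka-acls.sh tool can grant a principal read access to a single topic. The bootstrap address, principal, and topic name below are placeholder assumptions, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: allow one principal to read from one topic using Apache Kafka ACLs.
# BOOTSTRAP, PRINCIPAL, and TOPIC are placeholder assumptions for illustration.
BOOTSTRAP="my-cluster-bootstrap.example.com:9092"
PRINCIPAL="User:gmf-123456789012-default@gcp-sa-managedflink-wi.iam.gserviceaccount.com"
TOPIC="orders"

# Print the command for review; remove "echo" to apply the ACL to the cluster.
echo kafka-acls.sh --bootstrap-server "$BOOTSTRAP" \
  --add --allow-principal "$PRINCIPAL" \
  --operation Read --topic "$TOPIC"
```

Because ACLs are enforced at the Apache Kafka API level, this restriction applies even to principals whose IAM roles would otherwise allow broader access.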

Access BigQuery datasets

You can access BigQuery datasets, either in the same project where you're using BigQuery Engine for Apache Flink or in a different project. For the BigQuery source and sink to operate properly, the Managed Flink Default Workload Identity must have access to any BigQuery datasets that your BigQuery Engine for Apache Flink job reads from or writes to.

You might need to configure BigQuery to explicitly grant access to this account. See BigQuery Access Control for more information about granting access to BigQuery datasets using either the BigQuery page or the BigQuery API.

Among the required BigQuery permissions, the bigquery.datasets.get IAM permission is required by the job to access a BigQuery dataset. Typically, most BigQuery IAM roles include the bigquery.datasets.get permission, but the roles/bigquery.jobUser role is an exception.
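You can check whether a given predefined role includes bigquery.datasets.get before granting it. The following sketch prints a gcloud iam roles describe command for an example role; piping the command's real output through grep would show whether the permission is present.

```shell
#!/usr/bin/env bash
# Sketch: inspect which permissions a predefined BigQuery role contains.
# The role name is an example; any predefined role can be inspected this way.
ROLE="roles/bigquery.dataViewer"
PERMISSION="bigquery.datasets.get"

# Print the command for review; remove "echo" to query IAM directly, for
# example: gcloud iam roles describe "$ROLE" | grep "$PERMISSION"
echo gcloud iam roles describe "$ROLE" --format="value(includedPermissions)"
```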

Access Cloud Storage buckets

To grant your BigQuery Engine for Apache Flink job access to a Cloud Storage bucket, make the bucket accessible to the Managed Flink Default Workload Identity. At a minimum, this account needs to have read and write permissions to both the bucket and its contents. You can use IAM permissions for Cloud Storage to grant the required access.

If you use the gcloud CLI to upload a local file, such as a JAR file or a SQL file, to Cloud Storage, the user account running the job needs the storage.objects.create permission to write to the Cloud Storage bucket.
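Such an upload can be done with the gcloud storage cp command. The bucket name and file path in the sketch below are placeholder assumptions, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: stage a local job artifact in Cloud Storage before running a job.
# The bucket name and local file path are placeholder assumptions.
BUCKET="gs://my-flink-artifacts"
JAR="target/my-pipeline.jar"

# Print the command for review; remove "echo" to perform the upload. The user
# account running this command needs storage.objects.create on the bucket.
echo gcloud storage cp "$JAR" "$BUCKET/jars/my-pipeline.jar"
```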

IAM controls permissioning throughout Google Cloud and lets you grant permissions at the bucket and project levels. For a list of IAM roles that are associated with Cloud Storage and the permissions that are contained in each role, see IAM roles for Cloud Storage. If you need more control over permissions, create a custom role.

To give your Managed Flink Default Workload Identity the necessary permissions to read from and write to a bucket, use the gcloud storage buckets add-iam-policy-binding command. This command adds the service account to a bucket-level policy.
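The following sketch shows such a bucket-level binding. The bucket name and project number are placeholder assumptions, roles/storage.objectAdmin is used here as one role that grants both read and write access to objects, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: give the Managed Flink Default Workload Identity read and write
# access to one bucket. Bucket name and project number are placeholder assumptions.
BUCKET="gs://my-flink-artifacts"
MEMBER="serviceAccount:gmf-123456789012-default@gcp-sa-managedflink-wi.iam.gserviceaccount.com"

# roles/storage.objectAdmin grants full control over the bucket's objects.
# Print the command for review; remove "echo" to apply the binding.
echo gcloud storage buckets add-iam-policy-binding "$BUCKET" \
  --member="$MEMBER" --role=roles/storage.objectAdmin
```

A bucket-level binding like this is narrower than a project-level grant, which is preferable when the job only needs one bucket.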

To retrieve a list of the Cloud Storage buckets in a Google Cloud project, use the gcloud storage buckets list command:

gcloud storage buckets list --project=PROJECT_ID

Replace PROJECT_ID with the ID of the project.

Unless you're restricted by organizational policies that limit resource sharing, you can access a bucket that resides in a different project than your BigQuery Engine for Apache Flink job. For more information about domain restrictions, see Restricting identities by domain.

You can also set bucket permissions from the Google Cloud console. For more information, see Setting bucket permissions.

Access Pub/Sub topics and subscriptions

To access a Pub/Sub topic or subscription, use the Identity and Access Management features of Pub/Sub to set up permissions for the Managed Flink Default Workload Identity.

Permissions from the following Pub/Sub roles are relevant:

  • roles/pubsub.subscriber is required to consume data.
  • roles/pubsub.editor is required to create a Pub/Sub subscription.
  • roles/pubsub.viewer is recommended so that BigQuery Engine for Apache Flink can query the configurations of topics and subscriptions.

If VPC Service Controls is enabled on the project that owns the subscription or topic, IP address-based ingress rules don't allow BigQuery Engine for Apache Flink to query the configurations. In this case, an ingress rule based on the service agent is required.
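These grants can also be made at the level of an individual resource instead of the whole project. The sketch below, with a placeholder subscription name and project number, shows a subscription-level binding; the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: grant subscriber access on a single subscription rather than
# project-wide. Subscription name and project number are placeholder assumptions.
SUBSCRIPTION="my-input-subscription"
MEMBER="serviceAccount:gmf-123456789012-default@gcp-sa-managedflink-wi.iam.gserviceaccount.com"

# Print the command for review; remove "echo" to apply the binding.
echo gcloud pubsub subscriptions add-iam-policy-binding "$SUBSCRIPTION" \
  --member="$MEMBER" --role=roles/pubsub.subscriber
```

An analogous gcloud pubsub topics add-iam-policy-binding command exists for topic-level grants.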

Data access and security

The BigQuery Engine for Apache Flink service works with two kinds of data:

  • End-user data. This data is processed by a BigQuery Engine for Apache Flink job. A typical job reads data from one or more sources, implements transformations of the data, and writes the results to one or more sinks. All the sources and sinks are storage services that are not directly managed by BigQuery Engine for Apache Flink.

  • Operational data. This data includes all the metadata that is required for managing a BigQuery Engine for Apache Flink job. This data includes both user-provided metadata, such as a job name, and also system-generated metadata, such as a job ID.

The BigQuery Engine for Apache Flink service uses several security mechanisms to help keep your data secure and private. These mechanisms apply to the following scenarios:

  • Submitting a job to the service
  • Evaluating a job
  • Requesting access to telemetry and metrics during and after a job execution

Data locality

All of the core data processing for the BigQuery Engine for Apache Flink service happens in the region that is specified for the job. Although a job can optionally read from and write to sources and sinks in other regions, the data processing itself occurs only in the job's region. We recommend that you locate your job and its resources in the same region.

BigQuery Engine for Apache Flink is a regional service. For more information about data locality and regions, see BigQuery Engine for Apache Flink regions.

Data in job submission

The IAM permissions for your Google Cloud project control access to the BigQuery Engine for Apache Flink service. Any principals who are granted the Managed Flink Admin role or the Managed Flink Developer role can submit jobs to the service. To submit a job, you must authenticate by using the Google Cloud CLI. After you're authenticated, your jobs are submitted using the HTTPS protocol.

Data in job evaluation

As part of evaluating a job, temporary data might be generated and stored locally in the deployment. Temporary data is encrypted at rest and does not persist after job evaluation concludes.

Data in job logs and telemetry

Information stored in Cloud Logging is primarily generated by the code in your BigQuery Engine for Apache Flink program. The BigQuery Engine for Apache Flink service might also generate warning and error data in Cloud Logging, but this data is the only intermediate data that the service adds to logs. Cloud Logging is a global service.

Telemetry data and associated metrics are encrypted at rest, and access to this data is controlled by your Google Cloud project's read permissions.

We recommend that you use the security mechanisms available in the underlying cloud resources of your job. These mechanisms include the data security capabilities of data sources and sinks such as BigQuery and Managed Service for Apache Kafka. It's also best not to mix different trust levels in a single project.