Data plane identity
Dataproc on GKE uses GKE workload identity to allow pods within the Dataproc on GKE cluster to act with the authority of the default Dataproc VM service account (data plane identity). Workload identity requires the following permissions to update IAM policies on the GSA used by your Dataproc on GKE virtual cluster:
- compute.projects.get
- iam.serviceAccounts.getIamPolicy
- iam.serviceAccounts.setIamPolicy
GKE workload identity links the following Kubernetes service accounts (KSAs) to the Dataproc VM service account:
- agent KSA (interacts with the Dataproc control plane):
  serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/agent]
- spark-driver KSA (runs Spark drivers):
  serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-driver]
- spark-executor KSA (runs Spark executors):
  serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-executor]
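As a sketch, the KSA member strings above can be generated in shell. The project ID and namespace values below are placeholder assumptions; substitute your own.

```shell
# Placeholder assumptions -- replace with your project ID and namespace.
PROJECT="my-project"
DPGKE_NAMESPACE="my-dpgke-cluster"   # default namespace = cluster name

# Build the Workload Identity member string for a given KSA name.
wi_member() {
  echo "serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/$1]"
}

# Print the member string for each of the three KSAs.
for ksa in agent spark-driver spark-executor; do
  wi_member "$ksa"
done
```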
Assign roles
Grant permissions to the Dataproc VM service account to allow the spark-driver and spark-executor to access project resources, data sources, data sinks, and any other services required by your workload.
Example:
The following commands assign roles to the default Dataproc VM service account to allow Spark workloads running on Dataproc on GKE cluster VMs to access Cloud Storage buckets and BigQuery datasets in the project. Because gcloud projects add-iam-policy-binding accepts a single --role flag per invocation, run one command per role:

gcloud projects add-iam-policy-binding \
    --role=roles/storage.objectAdmin \
    --member="serviceAccount:project-number-compute@developer.gserviceaccount.com" \
    "${PROJECT}"

gcloud projects add-iam-policy-binding \
    --role=roles/bigquery.dataEditor \
    --member="serviceAccount:project-number-compute@developer.gserviceaccount.com" \
    "${PROJECT}"
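When granting several roles to the same member, a small loop avoids repeating the command by hand. This is a sketch that only prints the commands; the project ID is a placeholder assumption, and you would remove the `echo` to apply the bindings for real.

```shell
# Placeholder assumptions -- replace with your project ID and member address.
PROJECT="my-project"
SA="serviceAccount:project-number-compute@developer.gserviceaccount.com"

# Print the binding command for one role; remove `echo` to actually run it.
grant_role() {
  echo gcloud projects add-iam-policy-binding \
      --role="$1" --member="$SA" "$PROJECT"
}

grant_role roles/storage.objectAdmin
grant_role roles/bigquery.dataEditor
```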
Custom IAM configuration
Dataproc on GKE uses GKE workload identity to link the default Dataproc VM service account (data plane identity) to the three Kubernetes service accounts (KSAs).
To create and use a different Google service account (GSA) to link to the KSAs:
Create the GSA (see Creating and managing service accounts).
gcloud CLI example:
gcloud iam service-accounts create "dataproc-${USER}" \
    --description "Used by Dataproc on GKE workloads."

Notes:
- The example sets the GSA name as "dataproc-${USER}", but you can use a different name.
Set environment variables:

PROJECT=project-id
DPGKE_GSA="dataproc-${USER}@${PROJECT}.iam.gserviceaccount.com"
DPGKE_NAMESPACE=GKE-namespace

Notes:
- DPGKE_GSA: The examples set and use DPGKE_GSA as the name of the variable that contains the email address of your GSA. You can set and use a different variable name.
- DPGKE_NAMESPACE: The default GKE namespace is the name of your Dataproc on GKE cluster.
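An optional sanity check, sketched below with placeholder values, confirms that the GSA email has the expected project-scoped shape before it is wired into later commands:

```shell
# Report whether a GSA email belongs to the given project.
# Usage: check_gsa PROJECT GSA_EMAIL
check_gsa() {
  case "$2" in
    *@"$1".iam.gserviceaccount.com) echo valid ;;
    *) echo invalid ;;
  esac
}

PROJECT="my-project"                                           # assumption
DPGKE_GSA="dataproc-alice@${PROJECT}.iam.gserviceaccount.com"  # assumption
check_gsa "$PROJECT" "$DPGKE_GSA"
```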
When you create the Dataproc on GKE cluster, add the following properties for Dataproc to use your GSA instead of the default GSA:
--properties "dataproc:dataproc.gke.agent.google-service-account=${DPGKE_GSA}" \
--properties "dataproc:dataproc.gke.spark.driver.google-service-account=${DPGKE_GSA}" \
--properties "dataproc:dataproc.gke.spark.executor.google-service-account=${DPGKE_GSA}"

Run the following commands to assign the necessary Workload Identity permissions to the service accounts:
- Assign your GSA the dataproc.worker role to allow it to act as the agent:

  gcloud projects add-iam-policy-binding \
      --role=roles/dataproc.worker \
      --member="serviceAccount:${DPGKE_GSA}" \
      "${PROJECT}"

- Grant the agent KSA the iam.workloadIdentityUser role to allow it to act as your GSA:

  gcloud iam service-accounts add-iam-policy-binding \
      --role=roles/iam.workloadIdentityUser \
      --member="serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/agent]" \
      "${DPGKE_GSA}"

- Grant the spark-driver KSA the iam.workloadIdentityUser role to allow it to act as your GSA:

  gcloud iam service-accounts add-iam-policy-binding \
      --role=roles/iam.workloadIdentityUser \
      --member="serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-driver]" \
      "${DPGKE_GSA}"

- Grant the spark-executor KSA the iam.workloadIdentityUser role to allow it to act as your GSA:

  gcloud iam service-accounts add-iam-policy-binding \
      --role=roles/iam.workloadIdentityUser \
      --member="serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-executor]" \
      "${DPGKE_GSA}"
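The three GSA-override properties follow one naming pattern, so they can be assembled programmatically. The sketch below prints a cluster-create command line; the GSA address and cluster name are placeholder assumptions.

```shell
# Placeholder assumption -- replace with your GSA email address.
DPGKE_GSA="dataproc-alice@my-project.iam.gserviceaccount.com"

# Emit one --properties flag per component (agent, spark.driver, spark.executor).
build_props() {
  for component in agent spark.driver spark.executor; do
    printf ' --properties "dataproc:dataproc.gke.%s.google-service-account=%s"' \
        "$component" "$DPGKE_GSA"
  done
}

# Print the assembled command; the cluster name is a placeholder.
echo "gcloud dataproc clusters gke create my-dpgke-cluster$(build_props)"
```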