When you use the Dataproc service to create clusters and run
jobs on your clusters, the service sets up the necessary
Dataproc roles and permissions
in your project to access and use the Google Cloud resources it needs to accomplish
these tasks. However, if you do cross-project work, for example, to access data
in another project, you must set up the roles and permissions needed
to access cross-project resources.
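For example, a common cross-project setup grants a cluster's VM service account (described later in this document) read access to a Cloud Storage bucket owned by another project. The following is a minimal sketch; the bucket name and service account address are placeholders:

  # Grant a Dataproc cluster's VM service account read access to a bucket
  # owned by a different project (bucket and account names are placeholders).
  gcloud storage buckets add-iam-policy-binding gs://other-project-bucket \
      --member="serviceAccount:project-number-compute@developer.gserviceaccount.com" \
      --role="roles/storage.objectViewer"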
To help you do cross-project work successfully, this document lists the different
principals that use the Dataproc service
and the roles that contain the necessary permissions for those principals to access
and use Google Cloud resources.
There are three principals (identities) that access and use the Dataproc service:
1. User Identity
2. Control Plane Identity
3. Data Plane Identity
Dataproc API User (User identity)
Example: username@example.com
This is the user that calls the Dataproc service to create
clusters, submit jobs, and make other requests to the service. The user
is usually an individual, but it can also be a
service account
if Dataproc is invoked through an API client or from another
Google Cloud service such as Compute Engine, Cloud Run functions, or Cloud Composer.
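For example, the user identity is the account that issues requests such as the following. This is a minimal sketch; the cluster and region names are placeholders, and the SparkPi job assumes the Spark examples jar shipped on the cluster image:

  # Requests like these run as the calling user identity.
  gcloud dataproc clusters create example-cluster --region=us-central1
  gcloud dataproc jobs submit spark --cluster=example-cluster --region=us-central1 \
      --class=org.apache.spark.examples.SparkPi \
      --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000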
Notes:
- Dataproc clusters inherit project-wide Compute Engine SSH metadata unless
you explicitly block it by setting --metadata=block-project-ssh-keys=true
when you create your cluster (see Cluster metadata and the example after these notes).
- HDFS user directories are created for each project-level SSH user at cluster
deployment time; an SSH user added after deployment is not given an HDFS
directory on existing clusters.
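A minimal sketch of blocking project-wide SSH keys at cluster creation follows; the cluster name and region are placeholders:

  # Prevent this cluster from inheriting project-wide Compute Engine SSH keys.
  gcloud dataproc clusters create example-cluster \
      --region=us-central1 \
      --metadata=block-project-ssh-keys=true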
Dataproc Service Agent (Control Plane identity)
Example: service-project-number@dataproc-accounts.iam.gserviceaccount.com
The Dataproc Service Agent service account
performs a broad set of system operations on resources located
in the project where a Dataproc cluster is created, including:
- Creation of Compute Engine resources, including VM instances,
instance groups, and instance templates
- get and list operations to confirm the configuration of
resources such as images, firewalls, Dataproc initialization
actions, and Cloud Storage buckets
- Auto-creation of the Dataproc staging and temp buckets
if the user does not specify a staging or temp bucket (see the example after this list)
- Writing cluster configuration metadata to the staging bucket
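You can avoid auto-creation by naming existing buckets when you create the cluster. The following is a minimal sketch; the cluster, region, and bucket names are placeholders:

  # Use pre-existing staging and temp buckets so the service agent does not
  # auto-create them (cluster, region, and bucket names are placeholders).
  gcloud dataproc clusters create example-cluster \
      --region=us-central1 \
      --bucket=example-staging-bucket \
      --temp-bucket=example-temp-bucket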
Dataproc VM service account (Data Plane identity)
Example: project-number-compute@developer.gserviceaccount.com
Your application code runs as the
VM service account
on Dataproc VMs. User jobs are granted the roles (with their
associated permissions) of this service account.
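By default, clusters use the Compute Engine default service account shown in the example above. If you want jobs to run with a different identity, you can specify a custom VM service account at cluster creation. The following is a minimal sketch; the cluster, region, and service account names are placeholders, and the account is assumed to already hold the permissions your jobs need (for example, through the Dataproc Worker role):

  # Run cluster VMs, and therefore user jobs, as a custom service account
  # instead of the Compute Engine default (names are placeholders).
  gcloud dataproc clusters create example-cluster \
      --region=us-central1 \
      --service-account=custom-vm-sa@example-project.iam.gserviceaccount.com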
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-25 UTC."],[[["\u003cp\u003eDataproc sets up necessary roles and permissions for accessing Google Cloud resources within the same project, but cross-project access requires additional setup.\u003c/p\u003e\n"],["\u003cp\u003eThere are three primary identities (principals) that interact with the Dataproc service: User Identity, Control Plane Identity, and Data Plane Identity.\u003c/p\u003e\n"],["\u003cp\u003eThe User Identity (Dataproc API User) is the individual or service account initiating actions like cluster creation and job submission.\u003c/p\u003e\n"],["\u003cp\u003eThe Control Plane Identity (Dataproc Service Agent) handles system operations on resources within the project, including VM creation and bucket management.\u003c/p\u003e\n"],["\u003cp\u003eThe Data Plane Identity (Dataproc VM Service Account) executes application code on Dataproc VMs, interacting with the control plane, staging buckets, and other Google Cloud resources as needed by jobs.\u003c/p\u003e\n"]]],[],null,["# Dataproc principals\n\nWhen you use the Dataproc service to create clusters and run\njobs on your clusters, the service sets up the necessary\n[Dataproc roles and permissions](/dataproc/docs/concepts/iam/iam)\nin your project to access and use the Google Cloud resources it needs to accomplish\nthese tasks. However, if you do cross-project work, for example to access data\nin another project, you will need to set up the necessary roles and permissions\nto access cross-project resources.\n\nTo help you do cross-project work successfully, this document lists the different\nprincipals that use the Dataproc service\nand the roles that contain the necessary permissions for those principals to access\nand use Google Cloud resources.\n\nThere are three principals (identities) that access and use the Dataproc:\n\n1. User Identity\n2. Control Plane Identity\n3. Data Plane Identity\n\nDataproc API User (User identity)\n---------------------------------\n\nExample: *username@example.com*\n\nThis is the user that calls the Dataproc service to create\nclusters, submit jobs, and make other requests to the service. The user\nis usually an individual, but it can also be a\n[service account](/iam/docs/understanding-service-accounts)\nif Dataproc is invoked through an API client or from another\nGoogle Cloud service such as Compute Engine, Cloud Run functions, or Cloud Composer.\n\n**Related roles**\n\n- [Dataproc roles](/dataproc/docs/concepts/iam/iam#roles), [Project roles](/dataproc/docs/concepts/iam/iam#project_roles)\n\n**Notes**\n\n- Dataproc API-submitted jobs run as `root` on Linux.\n- Dataproc clusters inherit project-wide Compute Engine\n SSH metadata unless explicitly blocked by setting\n `--metadata=block-project-ssh-keys=true` when you create your cluster\n (see\n [Cluster metadata](/dataproc/docs/concepts/configuring-clusters/metadata)).\n\n- HDFS user directories are created for each project-level SSH user. 
These\n HDFS directories are created at cluster deployment time, and a new (post-deployment)\n SSH user is not given an HDFS directory on existing clusters.\n\nDataproc Service Agent (Control Plane identity)\n-----------------------------------------------\n\nExample: *service-\u003cvar translate=\"no\"\u003eproject-number\u003c/var\u003e@dataproc-accounts.iam.gserviceaccount.com*\n\nThe\n[Dataproc Service Agent service account](/dataproc/docs/concepts/configuring-clusters/service-accounts#service_agent_account)\nis used to perform a broad set of system operations on resources located\nin the project where a Dataproc cluster is created, including:\n\n- Creation of Compute Engine resources, including VM instances, instance groups, and instance templates\n- `get` and `list` operations to confirm the configuration of resources such as images, firewalls, Dataproc initialization actions, and Cloud Storage buckets\n- Auto-creation of the Dataproc [staging and temp buckets](/dataproc/docs/concepts/configuring-clusters/staging-bucket) if the staging or temp bucket is not specified by the user\n- Writing cluster configuration metadata to the staging bucket\n- Accessing [VPC networks in a host project](/dataproc/docs/concepts/configuring-clusters/network#create_a_cluster_that_uses_a_network_in_another_project)\n\n**Related roles**\n\n- [Dataproc Service Agent](/iam/docs/understanding-roles#dataproc.serviceAgent)\n\nDataproc VM service account (Data Plane identity)\n-------------------------------------------------\n\nExample: *\u003cvar translate=\"no\"\u003eproject-number\u003c/var\u003e-compute@developer.gserviceaccount.com*\n\nYour application code runs as the\n[VM service account](/dataproc/docs/concepts/configuring-clusters/service-accounts#VM_service_account)\non Dataproc VMs. User jobs are granted the roles (with their\nassociated permissions) of this service account.\n\nThe VM service account does the following:\n\n- Communicates with the [Dataproc control plane](#service-agent).\n- Reads and writes data from and to the [Dataproc staging and temp buckets](/dataproc/docs/concepts/configuring-clusters/staging-bucket).\n- As needed by your Dataproc jobs, reads and writes data from and to Cloud Storage, BigQuery, Cloud Logging, and other Google Cloud resources.\n\n**Related roles**\n\n- [Dataproc Worker](/dataproc/docs/concepts/iam/iam#roles)\n- [Cloud Storage roles](/storage/docs/access-control/iam-roles#standard-roles)\n- [BigQuery roles](/bigquery/docs/access-control#bigquery)\n\n| **Note:** For interactive workloads, users can opt to use their user identity to access Cloud Storage objects in buckets owned by the same project that contains the cluster (see [Dataproc Personal Cluster Authentication](/dataproc/docs/concepts/iam/personal-auth)).\n\nWhat's next\n-----------\n\n- Learn more about [Dataproc roles and permissions](/dataproc/docs/concepts/iam/iam).\n- Learn more about [Dataproc service accounts](/dataproc/docs/concepts/configuring-clusters/service-accounts).\n- See [BigQuery Access Control](/bigquery/docs/access-control).\n- See [Cloud Storage Access Control options](/storage/docs/access-control)."]]