The Cloud Shell walkthrough in this
tutorial provides authentication by using your Google Cloud project credentials.
When you run code locally, the recommended practice is to use
service account credentials
to authenticate your code.
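When you run locally with a service account, one option is to load the key file explicitly and pass the credentials to the client constructor. The following is a minimal sketch, not part of the walkthrough script; the key file path and region endpoint are placeholders.

from google.cloud import dataproc_v1
from google.oauth2 import service_account

# Load credentials from a downloaded service account key file (placeholder path).
credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"
)

# Pass the credentials explicitly when constructing a Dataproc client.
cluster_client = dataproc_v1.ClusterControllerClient(
    credentials=credentials,
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"},
)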
Create a Dataproc cluster
The following values are set to create the cluster:
The project in which the cluster will be created
The region where the cluster will be created
The name of the cluster
The cluster config, which specifies one master and two primary
workers
Default config settings are used for the remaining cluster settings.
You can override default cluster config settings. For example, you
can add secondary VMs (default = 0) or specify a non-default
VPC network for the cluster. For more information, see
CreateCluster.
# Imports used by the full quickstart script (re and storage are used in the
# job-submission snippet below).
import re

from google.cloud import dataproc_v1
from google.cloud import storage


def quickstart(project_id, region, cluster_name, gcs_bucket, pyspark_file):
    # Create the cluster client.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    print(f"Cluster created successfully: {result.cluster_name}")
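As noted earlier, you can override default settings such as the number of secondary VMs or the VPC network. The following is a minimal sketch of such a config, not part of the walkthrough script; the project and network names are placeholders, and the field names follow the v1 ClusterConfig message.

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        # Place the cluster on a non-default VPC network (placeholder name).
        "gce_cluster_config": {
            "network_uri": "projects/my-project/global/networks/my-custom-vpc"
        },
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        # Add secondary workers; the default is 0.
        "secondary_worker_config": {"num_instances": 2},
    },
}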
Submit a job
The following values are set to submit the job:
The project that contains the cluster
The region where the cluster is located
The job config, which specifies the cluster name and the Cloud Storage
filepath (URI) of the PySpark job
# Create the job client.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Create the job config.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": f"gs://{gcs_bucket}/{pyspark_file}"},
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()

# Dataproc job output is saved to the Cloud Storage bucket
# allocated to the job. Use regex to obtain the bucket and blob info.
matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)

output = (
    storage.Client()
    .get_bucket(matches.group(1))
    .blob(f"{matches.group(2)}.000000000")
    .download_as_bytes()
    .decode("utf-8")
)

print(f"Job finished successfully: {output}\r\n")
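The snippet above reads only the first driver output file (the part with the .000000000 suffix). Output from longer jobs can span multiple numbered parts; a minimal sketch of collecting all of them, reusing the matches object from the snippet above, might look like this (not part of the walkthrough script):

# Read every driver output part instead of only the first one.
blobs = storage.Client().list_blobs(matches.group(1), prefix=matches.group(2))
full_output = "".join(
    blob.download_as_bytes().decode("utf-8") for blob in blobs
)
print(full_output)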
Delete the cluster
The following values are set to delete the cluster:
The project that contains the cluster
The region where the cluster is located
The name of the cluster
For more information, see DeleteCluster.
# Delete the cluster once the job has terminated.
operation = cluster_client.delete_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": cluster_name,
    }
)
operation.result()

print(f"Cluster {cluster_name} successfully deleted.")
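The job-submission and cluster-deletion snippets continue the body of the quickstart function defined at the start of the walkthrough. A minimal sketch of invoking the function from the command line might look like the following; the flag names are illustrative and are not part of the walkthrough script.

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Create a Dataproc cluster, run a PySpark job, then delete the cluster."
    )
    parser.add_argument("--project_id", required=True, help="Google Cloud project ID")
    parser.add_argument("--region", required=True, help="Region for the cluster, for example us-central1")
    parser.add_argument("--cluster_name", required=True, help="Name of the cluster to create")
    parser.add_argument("--gcs_bucket", required=True, help="Cloud Storage bucket that holds the PySpark file")
    parser.add_argument("--pyspark_file", required=True, help="Name of the PySpark file in the bucket")
    args = parser.parse_args()

    quickstart(
        args.project_id,
        args.region,
        args.cluster_name,
        args.gcs_bucket,
        args.pyspark_file,
    )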
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-25 UTC."],[[["\u003cp\u003eThis tutorial guides users through a Cloud Shell walkthrough to interact with Dataproc gRPC APIs using Google Cloud client libraries for Python.\u003c/p\u003e\n"],["\u003cp\u003eThe walkthrough code demonstrates how to programmatically create a Dataproc cluster, submit a job to the cluster, and then delete the cluster.\u003c/p\u003e\n"],["\u003cp\u003eThe tutorial details the required values to set when creating a cluster, such as project ID, region, cluster name, and cluster configuration, allowing for default setting overides.\u003c/p\u003e\n"],["\u003cp\u003eThe tutorial also describes the necessary values to submit a job, including project ID, region, cluster name, and the Cloud Storage filepath of the PySpark job.\u003c/p\u003e\n"],["\u003cp\u003eUsers can utilize an inline workflow to perform all actions with one API request, rather than making separate requests, as shown in the provided example.\u003c/p\u003e\n"]]],[],null,["# Use the Cloud Client Libraries for Python\n\nThis tutorial includes a [Cloud Shell walkthrough](/shell/docs/tutorials) that uses the\n[Google Cloud client libraries for Python](/python/docs/reference/dataproc/latest)\nto programmatically call\n[Dataproc gRPC APIs](/dataproc/docs/reference/rpc) to create\na cluster and submit a job to the cluster.\n\nThe following sections explain the operation of the walkthrough code contained\nin the GitHub\n[GoogleCloudPlatform/python-dataproc](https://github.com/googleapis/python-dataproc/tree/master/samples/snippets) repository.\n| The walkthrough tutorial code makes separate API requests to create a cluster, submit a job to the cluster, then delete the cluster. 
You can use an [inline workflow](/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.WorkflowTemplateService.InstantiateInlineWorkflowTemplate) to accomplish these tasks with one API request (see [instantiate_inline_workflow_template.py](https://github.com/googleapis/python-dataproc/blob/master/samples/snippets/instantiate_inline_workflow_template.py) for an example).\n\nRun the Cloud Shell walkthrough\n-------------------------------\n\nClick **Open in Cloud Shell** to run the walkthrough.\n\n[Open in Cloud Shell](https://ssh.cloud.google.com/cloudshell/open?cloudshell_git_repo=https://github.com/googleapis/python-dataproc&cloudshell_working_dir=samples/snippets&tutorial=python-api-walkthrough.md&cloudshell_open_in_editor=submit_job_to_cluster.py)\n\nUnderstand the code\n-------------------\n\n### Application Default Credentials\n\nThe [Cloud Shell walkthrough](#run_the_cloud_shell_walkthrough) in this\ntutorial provides authentication by using your Google Cloud project credentials.\nWhen you run code locally, the recommended practice is to use\n[service account credentials](/docs/authentication/production#obtaining_and_providing_service_account_credentials_manually)\nto authenticate your code.\n\n### Create a Dataproc cluster\n\nThe following values are set to create the cluster:\n\n- The project in which the cluster will be created\n- The region where the cluster will be created\n- The name of the cluster\n- The cluster config, which specifies one master and two primary workers\n\nDefault config settings are used for the remaining cluster settings.\nYou can override default cluster config settings. For example, you\ncan add secondary VMs (default = 0) or specify a non-default\nVPC network for the cluster. For more information, see\n[CreateCluster](/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.ClusterController.CreateCluster). \n\n def quickstart(project_id, region, cluster_name, gcs_bucket, pyspark_file):\n # Create the cluster client.\n cluster_client = dataproc_v1.ClusterControllerClient(\n client_options={\"api_endpoint\": f\"{region}-dataproc.googleapis.com:443\"}\n )\n\n # Create the cluster config.\n cluster = {\n \"project_id\": project_id,\n \"cluster_name\": cluster_name,\n \"config\": {\n \"master_config\": {\"num_instances\": 1, \"machine_type_uri\": \"n1-standard-2\"},\n \"worker_config\": {\"num_instances\": 2, \"machine_type_uri\": \"n1-standard-2\"},\n },\n }\n\n # Create the cluster.\n operation = cluster_client.create_cluster(\n request={\"project_id\": project_id, \"region\": region, \"cluster\": cluster}\n )\n result = operation.result()\n\n print(f\"Cluster created successfully: {result.cluster_name}\")\n\n\u003cbr /\u003e\n\n### Submit a job\n\nThe following values are set to submit the job:\n\n- The project in which the cluster will be created\n- The region where the cluster will be created\n- The job config, which specifies the cluster name and the Cloud Storage filepath (URI) of the PySpark job\n\nSee [SubmitJob](/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.JobController.SubmitJob)\nfor more information. 
\n\n # Create the job client.\n job_client = dataproc_v1.JobControllerClient(\n client_options={\"api_endpoint\": f\"{region}-dataproc.googleapis.com:443\"}\n )\n\n # Create the job config.\n job = {\n \"placement\": {\"cluster_name\": cluster_name},\n \"pyspark_job\": {\"main_python_file_uri\": f\"gs://{gcs_bucket}/{spark_filename}\"},\n }\n\n operation = job_client.submit_job_as_operation(\n request={\"project_id\": project_id, \"region\": region, \"job\": job}\n )\n response = operation.result()\n\n # Dataproc job output is saved to the Cloud Storage bucket\n # allocated to the job. Use regex to obtain the bucket and blob info.\n matches = re.match(\"gs://(.*?)/(.*)\", response.driver_output_resource_uri)\n\n output = (\n storage.Client()\n .get_bucket(matches.group(1))\n .blob(f\"{matches.group(2)}.000000000\")\n .download_as_bytes()\n .decode(\"utf-8\")\n )\n\n print(f\"Job finished successfully: {output}\\r\\n\")\n\n\u003cbr /\u003e\n\n### Delete the cluster\n\nThe following values are set to delete the cluster:\n\n- The project in which the cluster will be created\n- The region where the cluster will be created\n- The name of the cluster\n\nFor more information, see the [DeleteCluster](/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.JobController.DeleteCluster). \n\n # Delete the cluster once the job has terminated.\n operation = cluster_client.delete_cluster(\n request={\n \"project_id\": project_id,\n \"region\": region,\n \"cluster_name\": cluster_name,\n }\n )\n operation.result()\n\n print(f\"Cluster {cluster_name} successfully deleted.\")\n\n\u003cbr /\u003e"]]