Life of a Dataproc job

This page describes the sequence of steps involved in the submission,
execution, and completion of a Dataproc job. It also discusses job
throttling and debugging.

Key terms and associated actions in the Dataproc jobs flow:

- Dataproc service: receives job requests from the user and submits the
  request to the dataproc agent.
- dataproc agent: runs on the VM, receives job requests from the
  Dataproc service, and spawns the driver.
- driver: runs customer-supplied code, such as Hadoop jar, spark-submit,
  beeline, and pig applications.

Dataproc jobs flow

1. The user submits a job to Dataproc. JobStatus.State is marked as
   PENDING.
2. The job waits to be acquired by the dataproc agent.
   - If the job is acquired, JobStatus.State is marked as RUNNING.
   - If the job is not acquired due to agent failure, a Compute Engine
     network failure, or another cause, the job is marked ERROR.
3. After the agent acquires the job, it verifies that sufficient
   resources are available on the Dataproc cluster's master node to
   start the driver.
   - If sufficient resources are not available, the job is delayed
     (throttled). JobStatus.Substate shows the job as QUEUED, and
     Job.JobStatus.details provides information on the cause of the
     delay.
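The admission check described above can be sketched as a simple slot
test. This is an illustrative model, not Dataproc agent code: the
`admit_job` helper and its return values are assumptions for
illustration, and the `QUEUED` string mirrors the JobStatus.Substate
value named in the text.

```python
def admit_job(running_drivers, max_concurrent_jobs):
    """Illustrative admission check: start a driver only if a slot is free.

    Returns the substate the job would report: "" (driver starts
    normally) or "QUEUED" (job is throttled until a driver slot frees
    up on the master node).
    """
    return "" if running_drivers < max_concurrent_jobs else "QUEUED"

# A job submitted to a cluster already at its concurrency cap is queued:
print(admit_job(running_drivers=11, max_concurrent_jobs=11))  # prints "QUEUED"
```

Once a running driver exits and frees a slot, a queued job can be
admitted on a later pass of the same check.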
4. If sufficient resources are available, the dataproc agent starts the
   job driver process.
   - At this stage, there are typically one or more applications running
     in Apache Hadoop YARN. However, YARN applications may not start
     until the driver finishes scanning Cloud Storage directories or
     performing other start-up tasks.
5. The dataproc agent periodically sends updates to Dataproc on job
   progress, cluster metrics, and YARN applications associated with the
   job (see Job monitoring and debugging).
6. The YARN application(s) complete.
   - The job continues to be reported as RUNNING while the driver
     performs any job completion tasks, such as materializing
     collections.
   - An unhandled or uncaught failure in the main thread can leave the
     driver in a zombie state (marked as RUNNING without information
     about the cause of the failure).
7. The driver exits, and the dataproc agent reports completion to
   Dataproc.
   - Dataproc reports the job as DONE.
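The observable state flow above can be summarized as a small transition
table. This is an illustrative sketch, not Dataproc code; the state
names match the JobStatus.State values used in the flow, and the event
names are assumptions for illustration. Note that QUEUED is a substate
reported while the job is RUNNING-pending, not a separate top-level
state.

```python
# Illustrative sketch of the observable JobStatus.State flow:
# PENDING -> RUNNING when the agent acquires the job, or -> ERROR if
# acquisition fails; RUNNING -> DONE when the driver exits cleanly,
# or -> ERROR on failure.
TRANSITIONS = {
    "PENDING": {"acquired": "RUNNING", "acquire_failed": "ERROR"},
    "RUNNING": {"driver_exited": "DONE", "driver_failed": "ERROR"},
}

def next_state(state, event):
    """Return the next JobStatus.State for an event, or raise on an invalid move."""
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event!r}")

print(next_state("PENDING", "acquired"))       # prints "RUNNING"
print(next_state("RUNNING", "driver_exited"))  # prints "DONE"
```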
Job concurrency
You can configure the maximum number of concurrent Dataproc jobs
with the
dataproc:dataproc.scheduler.max-concurrent-jobs
cluster property when you create a cluster. If this property value is not set,
the upper limit on concurrent jobs is calculated as
max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5). masterMemoryMb
is determined by the master VM's machine type. masterMemoryMbPerJob is
1024 by default, but is configurable at cluster creation with the
dataproc:dataproc.scheduler.driver-size-mb
cluster property.
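As an illustration, the default limit from the formula above can be
computed for a given master machine type. This is a sketch assuming the
division in the formula is integer (floor) division; the memory values
in the examples are illustrative, not authoritative per-machine-type
figures.

```python
def max_concurrent_jobs(master_memory_mb, master_memory_mb_per_job=1024):
    """Default concurrent-job cap when
    dataproc:dataproc.scheduler.max-concurrent-jobs is unset.

    Mirrors max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5),
    assuming floor division; 3584 MB is reserved off the top of the
    master's memory before driver slots are counted.
    """
    return max((master_memory_mb - 3584) // master_memory_mb_per_job, 5)

# e.g. a master VM with 15360 MB of memory:
print(max_concurrent_jobs(15360))  # (15360 - 3584) // 1024 = 11
# a small master still gets the floor of 5 concurrent jobs:
print(max_concurrent_jobs(4096))   # max(0, 5) = 5
```

Raising dataproc:dataproc.scheduler.driver-size-mb (the
masterMemoryMbPerJob term) lowers the computed cap, since each driver
is budgeted more master memory.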
For information on troubleshooting job delays due to excessive job
concurrency and other causes, see Troubleshoot job delays.

What's next

- See Troubleshoot jobs.

Last updated 2025-09-01 UTC.