This page describes the sequence of steps involved in the submission, execution, and completion of a Dataproc job. It also discusses job throttling and debugging.
Dataproc jobs flow
- User submits job to Dataproc.
  - JobStatus.State is marked as PENDING.
- Job waits to be acquired by the dataproc agent.
  - If the job is acquired, JobStatus.State is marked as RUNNING.
  - If the job is not acquired due to agent failure, Compute Engine network failure, or other cause, the job is marked ERROR.
- Once a job is acquired by the agent, the agent verifies that there are
sufficient resources available on the Dataproc cluster's master node
to start the driver.
  - If sufficient resources are not available, the job is delayed (throttled). JobStatus.Substate shows the job as QUEUED, and Job.JobStatus.details provides information on the cause of the delay.
- If sufficient resources are available, the dataproc agent starts the job driver process.
  - At this stage, typically there are one or more applications running in Apache Hadoop YARN. However, YARN applications may not start until the driver finishes scanning Cloud Storage directories or performing other start-up job tasks.
- The dataproc agent periodically sends updates to Dataproc on job progress, cluster metrics, and YARN applications associated with the job (see Job monitoring and debugging).
- YARN application(s) complete.
  - Job continues to be reported as RUNNING while the driver performs any job completion tasks, such as materializing collections.
  - An unhandled or uncaught failure in the main thread can leave the driver in a zombie state (marked as RUNNING without information as to the cause of the failure).
- Driver exits. The dataproc agent reports completion to Dataproc.
  - Dataproc reports job as DONE.
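The flow above can be summarized as a small state model. The following is a minimal sketch for illustration only: the function `job_flow` and its boolean parameters are hypothetical and are not part of the Dataproc API. It simply encodes the state and substate values named in the steps above.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[str]]  # (JobStatus.State, JobStatus.Substate)


def job_flow(acquired: bool, resources_available: bool, driver_exited: bool) -> List[Step]:
    """Return the (state, substate) sequence a job passes through.

    Hypothetical model of the documented flow, not a Dataproc client call.
    """
    steps: List[Step] = [("PENDING", None)]  # user submitted the job

    if not acquired:
        # Agent failure, network failure, or other cause.
        steps.append(("ERROR", None))
        return steps

    steps.append(("RUNNING", None))  # job acquired by the agent

    if not resources_available:
        # Throttled: the driver is not started yet.
        steps.append(("RUNNING", "QUEUED"))
        return steps

    if driver_exited:
        steps.append(("DONE", None))  # agent reported completion

    # If the driver never exits (for example, a zombie driver),
    # the job stays in RUNNING.
    return steps


for state, substate in job_flow(acquired=True, resources_available=True, driver_exited=True):
    print(state, substate)
```

Note that in this model a driver that never exits leaves the job in RUNNING, which mirrors the zombie-state case described above.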
Job concurrency
You can configure the maximum number of concurrent Dataproc jobs with the dataproc:dataproc.scheduler.max-concurrent-jobs cluster property when you create a cluster. If this property value is not set, the upper limit on concurrent jobs is calculated as max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5). masterMemoryMb is determined by the master VM's machine type. masterMemoryMbPerJob is 1024 by default, but is configurable at cluster creation with the dataproc:dataproc.scheduler.driver-size-mb cluster property.
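As a worked example of the default limit, the formula above can be evaluated directly. This is a sketch under stated assumptions: the function name `max_concurrent_jobs` is hypothetical, the 15360 MB master memory is an arbitrary sample value, and the use of floor division is an assumption, since the formula does not say how fractional results are rounded.

```python
def max_concurrent_jobs(master_memory_mb: int, master_memory_mb_per_job: int = 1024) -> int:
    """Evaluate max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5).

    Floor division is an assumption; the documented formula does not
    specify rounding.
    """
    return max((master_memory_mb - 3584) // master_memory_mb_per_job, 5)


# Sample master with 15360 MB of memory and the default 1024 MB per job.
print(max_concurrent_jobs(15360))        # 11

# Same master with a 2048 MB driver size: (15360 - 3584) // 2048 = 5.
print(max_concurrent_jobs(15360, 2048))  # 5

# Small master: the computed value is 0, so the floor of 5 applies.
print(max_concurrent_jobs(4096))         # 5
```

The last two calls show that 5 acts as a lower bound: on masters with little memory, or with a large per-job driver size, the limit never drops below five concurrent jobs.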