Life of a Dataproc job

This page describes the sequence of steps involved in the submission, execution, and completion of a Dataproc job. It also discusses job throttling and debugging.

Key terms: the Dataproc service receives job requests from the user and submits the request to the dataproc agent; the dataproc agent runs on the VM, receives job requests from the Dataproc service, and spawns the driver; the driver runs customer-supplied code, such as Hadoop jar, spark-submit, beeline, and pig applications.

Dataproc jobs flow
After a user submits a job to Dataproc, JobStatus.State is marked as PENDING while the job waits to be acquired by the dataproc agent.
If the job is acquired, JobStatus.State is marked as RUNNING.
If the job is not acquired because of an agent failure, a Compute Engine network failure, or another cause, the job is marked as ERROR.
Once the job is acquired by the agent, the agent verifies that there are
sufficient resources available on the Dataproc cluster's master node
to start the driver.
If sufficient resources are not available, the job is delayed (throttled).
JobStatus.Substate shows the job as QUEUED, and Job.JobStatus.details provides information on the cause of the delay.
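The agent's admission check can be sketched as a simple memory test. This is an illustrative model only, not Dataproc's actual scheduler code; the function name and arguments are hypothetical:

```python
def admit_or_queue(master_free_mb: int, driver_size_mb: int = 1024) -> str:
    """Illustrative model of the agent's admission check (names are hypothetical).

    If the master node lacks enough free memory to host another job
    driver, the job is throttled and reported with substate QUEUED;
    otherwise the driver process is started.
    """
    if master_free_mb < driver_size_mb:
        return "QUEUED"   # delayed: JobStatus.Substate shows QUEUED
    return "RUNNING"      # driver process is started


# A master with only 512 MB free cannot fit a 1024 MB driver, so the job queues.
print(admit_or_queue(512))    # QUEUED
print(admit_or_queue(4096))   # RUNNING
```

The 1024 MB default mirrors the masterMemoryMbPerJob value described under Job concurrency below.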
If sufficient resources are available, the dataproc agent starts the job driver process.
At this stage, there are typically one or more applications running in Apache Hadoop YARN. However, the Yarn applications may not start until the driver finishes scanning Cloud Storage directories or performing other job start-up tasks.
The dataproc agent periodically sends updates to Dataproc on job progress,
cluster metrics, and the Yarn applications associated with the job
(see Job monitoring and debugging).
The Yarn application(s) complete.
The job continues to be reported as RUNNING while the driver performs any job completion tasks, such as materializing collections.
An unhandled or uncaught failure in the main thread can leave the driver in a zombie state (marked as RUNNING without information about the cause of the failure).
The driver exits.
The dataproc agent reports completion to Dataproc.
Dataproc reports the job as DONE.
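The job lifecycle above can be summarized as a small state machine. The sketch below is for illustration and assumes only the states and transitions described on this page:

```python
# Valid JobStatus.State transitions described on this page (illustrative only).
TRANSITIONS = {
    "PENDING": {"RUNNING", "ERROR"},  # acquired by the agent, or acquisition failed
    "RUNNING": {"DONE", "ERROR"},     # driver exits cleanly, or the job fails
    "DONE": set(),                    # terminal
    "ERROR": set(),                   # terminal
}


def is_valid_path(states):
    """Check that a sequence of reported states follows the documented flow."""
    return all(b in TRANSITIONS[a] for a, b in zip(states, states[1:]))


print(is_valid_path(["PENDING", "RUNNING", "DONE"]))   # True
print(is_valid_path(["PENDING", "DONE"]))              # False: a job must run first
```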
Job concurrency
You can configure the maximum number of concurrent Dataproc jobs with the dataproc:dataproc.scheduler.max-concurrent-jobs cluster property when you create a cluster. If this property value is not set,
the upper limit on concurrent jobs is calculated as
max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5). masterMemoryMb is determined by the master VM's machine type. masterMemoryMbPerJob is
1024 by default, but is configurable at cluster creation with the
dataproc:dataproc.scheduler.driver-size-mb
cluster property.
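As a worked example, the default upper limit can be computed as follows. The memory value used here (15360 MB, corresponding to an n1-standard-4 master) is an illustrative assumption, and the sketch assumes the division is floored:

```python
def max_concurrent_jobs(master_memory_mb: int,
                        master_memory_mb_per_job: int = 1024) -> int:
    """Default upper limit on concurrent jobs when
    dataproc:dataproc.scheduler.max-concurrent-jobs is not set:
    max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5).
    """
    return max((master_memory_mb - 3584) // master_memory_mb_per_job, 5)


# Assumed 15360 MB master (n1-standard-4): (15360 - 3584) // 1024 = 11
print(max_concurrent_jobs(15360))   # 11
# Small masters still get a floor of 5 concurrent jobs.
print(max_concurrent_jobs(4096))    # max(0, 5) = 5
```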
[[["Fácil de entender","easyToUnderstand","thumb-up"],["Meu problema foi resolvido","solvedMyProblem","thumb-up"],["Outro","otherUp","thumb-up"]],[["Difícil de entender","hardToUnderstand","thumb-down"],["Informações incorretas ou exemplo de código","incorrectInformationOrSampleCode","thumb-down"],["Não contém as informações/amostras de que eu preciso","missingTheInformationSamplesINeed","thumb-down"],["Problema na tradução","translationIssue","thumb-down"],["Outro","otherDown","thumb-down"]],["Última atualização 2025-08-22 UTC."],[[["\u003cp\u003eThis document outlines the process of submitting, executing, and completing a Dataproc job, including the interactions between the Dataproc service, agent, and driver.\u003c/p\u003e\n"],["\u003cp\u003eThe job status transitions through states like PENDING, RUNNING, and DONE, with potential for ERROR if issues arise and QUEUED if resources are insufficient for the driver.\u003c/p\u003e\n"],["\u003cp\u003eThe Dataproc agent manages the job lifecycle, verifies resource availability, starts the driver process, and communicates updates on the job's progress to the Dataproc service.\u003c/p\u003e\n"],["\u003cp\u003eJob concurrency is controlled by a configurable property, and the maximum number of concurrent jobs is determined by available master node memory if the property is not specified.\u003c/p\u003e\n"],["\u003cp\u003eThe document includes details on job throttling due to resource constraints, and information on job monitoring and debugging.\u003c/p\u003e\n"]]],[],null,["# Life of a Dataproc job\n\nThis page delineates the sequence of steps involved with the submission,\nexecution, and completion of a Dataproc job. It also discusses job\nthrottling and debugging.\n| **Key Terms** . 
Below are key terms and associated actions related to the Dataproc jobs flow.\n|\n| - **Dataproc service** -- receives job requests from the user and submits the request to the `dataproc agent`\n| - **dataproc agent** -- runs on the VM, receives job requests from the Dataproc service, and spawns `driver`\n| - **driver** -- runs customer-supplied code, such as Hadoop jar, spark-submit, beeline, and pig applications\n\nDataproc jobs flow\n------------------\n\n1. User submits job to Dataproc.\n - [JobStatus.State](/dataproc/docs/reference/rest/v1/projects.regions.jobs#State) is marked as `PENDING`.\n2. Job waits to be acquired by the `dataproc` agent.\n - If the job is acquired, [JobStatus.State](/dataproc/docs/reference/rest/v1/projects.regions.jobs#State) is marked as `RUNNING`.\n - If the job is not acquired due to agent failure, Compute Engine network failure, or other cause, the job is marked `ERROR`.\n3. Once a job is acquired by the agent, the agent verifies that there are sufficient resources available on the Dataproc cluster's master node to start the driver.\n - If sufficient resources are not available, the job is delayed (throttled). [JobStatus.Substate](/dataproc/docs/reference/rest/v1/projects.regions.jobs#substate) shows the job as `QUEUED`, and [Job.JobStatus.details](/dataproc/docs/reference/rest/v1/projects.regions.jobs#jobstatus) provides information on the cause of the delay.\n4. If sufficient resources are available, the `dataproc` agent starts the job driver process.\n - At this stage, typically there are one or more applications running in [Apache Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html). However, Yarn applications may not start until the driver finishes scanning Cloud Storage directories or performing other start-up job tasks.\n5. 
The `dataproc` agent periodically sends updates to Dataproc on job progress, cluster metrics, and Yarn applications associated with the job (see [Job monitoring and debugging](/dataproc/docs/concepts/jobs/troubleshoot-jobs#job_monitoring_and_debugging)).\n6. Yarn application(s) complete.\n - Job continues to be reported as `RUNNING` while driver performs any job completion tasks, such as materializing collections.\n - An unhandled or uncaught failure in the Main thread can leave the driver in a zombie state (marked as `RUNNING` without information as to the cause of the failure).\n7. Driver exits. `dataproc` agent reports completion to Dataproc.\n - Dataproc reports job as `DONE`.\n\nJob concurrency\n---------------\n\nYou can configure the maximum number of concurrent Dataproc jobs\nwith the\n[dataproc:dataproc.scheduler.max-concurrent-jobs](/dataproc/docs/concepts/configuring-clusters/cluster-properties#service_properties)\ncluster property when you create a cluster. If this property value is not set,\nthe upper limit on concurrent jobs is calculated as\n`max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5)`. `masterMemoryMb`\nis determined by the master VM's machine type. `masterMemoryMbPerJob` is\n`1024` by default, but is configurable at cluster creation with the\n[dataproc:dataproc.scheduler.driver-size-mb](/dataproc/docs/concepts/configuring-clusters/cluster-properties#service_properties)\ncluster property.\n| For information on troubleshooting job delays due to excessive job concurrency and other causes, see [Troubleshoot job delays](/dataproc/docs/concepts/jobs/troubleshoot-job-delays).\n\nWhats next\n----------\n\n- See [Troubleshoot jobs](/dataproc/docs/concepts/jobs/troubleshoot-jobs)"]]