This page describes the sequence of steps involved in the submission, execution, and completion of a Dataproc job. It also discusses job throttling and debugging.

Key terms: the following terms and associated actions relate to the Dataproc jobs flow.

- Dataproc service: receives job requests from the user and submits the request to the dataproc agent.
- dataproc agent: runs on the VM, receives job requests from the Dataproc service, and spawns the driver.
- driver: runs customer-supplied code, such as Hadoop jar, spark-submit, beeline, and pig applications.

Dataproc jobs flow
1. A user submits a job to Dataproc.
   - JobStatus.State is marked as PENDING.
2. The job waits to be acquired by the dataproc agent.
   - If the job is acquired, JobStatus.State is marked as RUNNING.
   - If the job is not acquired due to agent failure, Compute Engine network failure, or another cause, the job is marked ERROR.
3. Once a job is acquired by the agent, the agent verifies that sufficient resources are available on the Dataproc cluster's master node to start the driver.
   - If sufficient resources are not available, the job is delayed (throttled). JobStatus.Substate shows the job as QUEUED, and Job.JobStatus.details provides information on the cause of the delay.
4. If sufficient resources are available, the dataproc agent starts the job driver process.
   - At this stage, there are typically one or more applications running in Apache Hadoop YARN. However, YARN applications may not start until the driver finishes scanning Cloud Storage directories or performing other start-up tasks.
5. The dataproc agent periodically sends updates to Dataproc on job progress, cluster metrics, and the YARN applications associated with the job (see Job monitoring and debugging).
6. The YARN application(s) complete.
   - The job continues to be reported as RUNNING while the driver performs any job completion tasks, such as materializing collections.
   - An unhandled or uncaught failure in the main thread can leave the driver in a zombie state (marked as RUNNING without information on the cause of the failure).
7. The driver exits. The dataproc agent reports completion to Dataproc.
   - Dataproc reports the job as DONE.
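As a sketch, the state transitions above can be observed from the command line with gcloud; the cluster name, region, and job arguments below are example values, and JOB_ID is a placeholder for the ID returned on submission. The `status.state` and `status.substate` fields correspond to JobStatus.State and JobStatus.Substate.

```shell
# Submit a Spark example job asynchronously (names and region are examples).
gcloud dataproc jobs submit spark \
    --cluster=my-cluster --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --async -- 1000

# Poll the job status: expect PENDING -> RUNNING -> DONE,
# or a QUEUED substate if the driver is being throttled.
gcloud dataproc jobs describe JOB_ID --region=us-central1 \
    --format='value(status.state,status.substate)'
```

These commands require an existing cluster and credentials; they are illustrative rather than a complete workflow.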
Job concurrency
You can configure the maximum number of concurrent Dataproc jobs with the dataproc:dataproc.scheduler.max-concurrent-jobs cluster property when you create a cluster. If this property value is not set, the upper limit on concurrent jobs is calculated as max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5). masterMemoryMb is determined by the master VM's machine type. masterMemoryMbPerJob is 1024 by default, but can be configured at cluster creation with the dataproc:dataproc.scheduler.driver-size-mb cluster property.
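A minimal sketch of the default limit calculation, using an assumed master with 15360 MB of RAM (for example, an n1-standard-4) and the default 1024 MB driver size:

```shell
# Default concurrent-job limit: max((masterMemoryMb - 3584) / masterMemoryMbPerJob, 5)
master_memory_mb=15360        # example master machine-type memory
master_memory_mb_per_job=1024 # default dataproc:dataproc.scheduler.driver-size-mb
limit=$(( (master_memory_mb - 3584) / master_memory_mb_per_job ))
if [ "$limit" -lt 5 ]; then limit=5; fi  # floor of 5 concurrent jobs
echo "$limit"  # prints 11
```

To set the limit explicitly instead, pass the property at cluster creation, for example `--properties='dataproc:dataproc.scheduler.max-concurrent-jobs=20'` on `gcloud dataproc clusters create`.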
For information on troubleshooting job delays due to excessive job concurrency and other causes, see Troubleshoot job delays.

What's next

- See Troubleshoot jobs.

Last updated 2025-09-04 UTC.