The DataprocFileOutputCommitter feature is an enhanced
version of the open source FileOutputCommitter. It
enables concurrent writes by Apache Spark jobs to an output location.
Limitations
The DataprocFileOutputCommitter feature supports Spark jobs run on
Dataproc Compute Engine clusters created with
the following image versions:
- 2.1.10 and later
- 2.0.62 and later
Use the DataprocFileOutputCommitter
Set the following properties as job properties when you submit a Spark job
to the cluster:
spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false
Setting spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs to false is required to prevent concurrent jobs from conflicting on the success marker files they would otherwise create.
Google Cloud CLI example:
gcloud dataproc jobs submit spark \
    --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
    --region=REGION \
    other args ...
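PySpark example: the following is a minimal sketch, not taken from the Dataproc documentation. It sets the same two properties programmatically when building the SparkSession, which should be equivalent to passing them at job submission because spark.hadoop.* settings are copied into the Hadoop configuration. The application name, sample data, and gs://BUCKET output path are placeholders.

# Minimal sketch (assumptions noted above); property names come from this page.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("concurrent-write-example")  # hypothetical application name
    # Create output committers through the Dataproc factory.
    .config("spark.hadoop.mapreduce.outputcommitter.factory.class",
            "org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
    # Required: disable success marker files to avoid conflicts between concurrent jobs.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
# Multiple jobs configured this way can write to the same output location.
df.write.mode("append").parquet("gs://BUCKET/output/")  # BUCKET is a placeholder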
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-03-21 UTC."],[[["The DataprocFileOutputCommitter is an enhanced version of FileOutputCommitter, designed to enable concurrent writes by Apache Spark jobs to an output location."],["This feature is available for Dataproc Compute Engine clusters running image versions 2.1.10 and higher, or 2.0.62 and higher."],["To utilize DataprocFileOutputCommitter, set `spark.hadoop.mapreduce.outputcommitter.factory.class` to `org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory` and `spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs` to `false` when submitting a Spark job."],["When using the Dataproc file output committer, it is required that `spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs` is set to false in order to prevent conflicts with the created success marker files."]]],[]]