[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-09-02。"],[[["\u003cp\u003eThis document outlines Dataproc best practices for running reliable and efficient data processing jobs in production environments.\u003c/p\u003e\n"],["\u003cp\u003eFor production, specify a \u003ccode\u003emajor.minor\u003c/code\u003e Dataproc image version when creating a cluster to ensure consistency, and consider testing new minor versions with preview images beforehand.\u003c/p\u003e\n"],["\u003cp\u003eUse custom images to incorporate dependencies, security measures, or virus protection software, and regularly update these images to the latest sub-minor version.\u003c/p\u003e\n"],["\u003cp\u003eSubmit jobs to the Dataproc service for benefits like simplified networking, easy IAM management, and streamlined job status tracking, while bundling job-specific dependencies.\u003c/p\u003e\n"],["\u003cp\u003eMonitor Dataproc release notes for updates and use the cluster's staging bucket to diagnose cluster and job-related issues.\u003c/p\u003e\n"]]],[],null,["This document discusses Dataproc best practices that can help you\nrun reliable, efficient, and insightful data processing jobs on\nDataproc clusters in production environments.\n\nSpecify cluster image versions\n\nDataproc uses [image versions](/dataproc/docs/concepts/versioning/dataproc-versions)\nto bundle operating system, big data [components](/dataproc/docs/concepts/components/overview),\nand Google Cloud connectors into a package that is deployed on a cluster.\nIf you don't specify an image version when creating a cluster, Dataproc\ndefaults to the most recent stable image version.\n\nFor production environments, associate your cluster with a specific\n`major.minor` Dataproc image version, as\nshown in the following gcloud CLI command. \n\n```\ngcloud dataproc clusters create CLUSTER_NAME \\\n --region=region \\\n --image-version=2.0\n```\n\nDataproc resolves the `major.minor` version to the latest sub-minor version version\n(`2.0` is resolved to `2.0.x`). Note: if you need to rely on a specific sub-minor version for your cluster,\nyou can specify it: for example, `--image-version=2.0.x`. See\n[How versioning works](/dataproc/docs/concepts/versioning/overview#how_versioning_works) for\nmore information.\n| Each supported minor image version page, such as [2.0.x release versions](/dataproc/docs/concepts/versioning/dataproc-release-2.0), lists the component versions available with the current and previous four sub-minor image releases.\n\nDataproc preview image versions\n\nNew minor versions of Dataproc\nimages are available in a `preview` version prior to release\nin the standard minor image version track. 
Dataproc preview image versions

New minor versions of Dataproc images are available in a `preview`
version prior to release in the standard minor image version track.
Use a preview image to test and validate your jobs against a new minor
image version before adopting the standard minor image version in production.
See [Dataproc versioning](/dataproc/docs/concepts/versioning/overview)
for more information.

Use custom images when necessary

If you have dependencies to add to the cluster, such as native
Python libraries, or security hardening or virus protection software,
[create a custom image](/dataproc/docs/guides/dataproc-images) from the **latest image**
in your target minor image version track. This practice lets you meet dependency requirements
when you create clusters using your custom image. When you rebuild your custom image to
update dependency requirements, use the latest available sub-minor image version within the minor image track.

Submit jobs to the Dataproc service

Submit jobs to the Dataproc service with a
[jobs.submit](/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit)
call using the
[gcloud CLI](/sdk/gcloud/reference/dataproc/jobs/submit)
or the Google Cloud console. Set job and cluster permissions by granting
[Dataproc roles](/dataproc/docs/concepts/iam/iam#roles). Use
custom roles to separate cluster access from job submit permissions.

Benefits of submitting jobs to the Dataproc service:

- No complicated networking settings are required; the API is widely reachable.
- IAM permissions and roles are easy to manage.
- Job status is easy to track, with no Dataproc job metadata to complicate results.

In production, run jobs that depend only on cluster-level
dependencies at a fixed minor image version (for example, `--image-version=2.0`). Bundle
job-specific dependencies with jobs when the jobs are submitted. Submitting
an [uber jar](https://imagej.net/Uber-JAR) to
Spark or MapReduce is a common way to do this.

- Example: If a job jar depends on `args4j` and `spark-sql`, with `args4j` specific to the job and `spark-sql` a cluster-level dependency, bundle `args4j` in the job's uber jar.
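The following sketch shows one way to submit such a job with the gcloud CLI. The Cloud Storage path, jar name, main class, and job arguments after `--` are hypothetical placeholders.

```
# Sketch: submit a Spark job whose uber jar bundles job-specific
# dependencies (such as args4j), while cluster-level dependencies
# (such as spark-sql) come from the cluster's fixed image version.
# The jar path and main class below are hypothetical placeholders.
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --class=com.example.JobMain \
    --jars=gs://my-bucket/jobs/my-job-uber.jar \
    -- arg1 arg2
```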
Control initialization action locations

[Initialization actions](/dataproc/docs/concepts/configuring-clusters/init-actions)
allow you to automatically run scripts or install
components when you create a Dataproc cluster (see the
[dataproc-initialization-actions](https://github.com/GoogleCloudPlatform/dataproc-initialization-actions)
GitHub repository for common Dataproc initialization actions).
When using cluster initialization actions in a production
environment, copy initialization scripts to Cloud Storage
rather than sourcing them from a public repository. This practice avoids running
initialization scripts that are subject to modification by others.

Monitor Dataproc release notes

Dataproc regularly releases new sub-minor image versions.
View or subscribe to [Dataproc release notes](/dataproc/docs/release-notes)
to be aware of the latest Dataproc image version releases and other
announcements, changes, and fixes.

View the staging bucket to investigate failures

1. Look at your cluster's
   [staging bucket](/dataproc/docs/concepts/configuring-clusters/staging-bucket)
   to investigate cluster and job error messages.
   Typically, the staging bucket Cloud Storage location is shown in
   error messages, as in the following sample error message:

   ```
   ERROR:
   (gcloud.dataproc.clusters.create) Operation ... failed:
   ...
   - Initialization action failed. Failed action ... see output in:
   gs://dataproc-<BUCKET_ID>-us-central1/google-cloud-dataproc-metainfo/<CLUSTER_ID>/dataproc-initialization-script-0_output
   ```

2. Use the gcloud CLI to view the output file in the staging bucket:

   ```
   gcloud storage cat gs://STAGING_BUCKET/OBJECT_PATH
   ```

   Sample output:

   ```
   + readonly RANGER_VERSION=1.2.0
   ... Ranger admin password not set. Please use metadata flag - default-password
   ```

Get support

Google Cloud supports your production OSS workloads and helps you meet your
business SLAs through [tiers of support](/support). Also, Google Cloud
[Consulting Services](/consulting) can provide guidance on best practices
for your team's production deployments.

For more information

- Read the Google Cloud blog [Dataproc best practices guide](https://cloud.google.com/blog/topics/developers-practitioners/dataproc-best-practices-guide).

- View [Democratizing Dataproc](https://www.youtube.com/watch?v=2ksD7udWFys) on YouTube.