When you create a cluster, HDFS is used as the default filesystem. You can override this behavior by setting the defaultFS to a Cloud Storage bucket. By default, Dataproc also creates a Cloud Storage staging bucket and a Cloud Storage temp bucket in your project, or reuses existing Dataproc-created staging and temp buckets from previous cluster creation requests.
Staging bucket: Used to stage cluster job dependencies, job driver output, and cluster config files.
Temp bucket: Used to store ephemeral cluster and job data, such as Spark and MapReduce history files.
If you do not specify a staging or temp bucket when you create a cluster, Dataproc sets a Cloud Storage location in US, ASIA, or EU for your cluster's staging and temp buckets according to the Compute Engine zone where your cluster is deployed, and then creates and manages these project-level, per-location buckets.
Dataproc-created staging and temp buckets are shared among clusters in the same region, and are created with a Cloud Storage soft delete retention duration set to 0 seconds.
The temp bucket contains ephemeral data and has a TTL of 90 days.
The staging bucket, which can contain configuration data and dependency files needed by multiple clusters, does not have a TTL. However, you can apply a lifecycle rule to your dependency files (files with a ".jar" filename extension located in the staging bucket folder) to schedule the removal of your dependency files when they are no longer needed by your clusters.
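For example, the following is a minimal sketch of such a lifecycle configuration, assuming you want matching ".jar" files deleted 30 days after creation (the 30-day age, the lifecycle.json file name, and the bucket name are illustrative assumptions, not values from this page):

{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30, "matchesSuffix": [".jar"]}
    }
  ]
}

You could then apply it to your staging bucket with a command such as:

gcloud storage buckets update gs://staging-bucket-name --lifecycle-file=lifecycle.json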
Create your own staging and temp buckets
Instead of relying on the creation of default staging and temp buckets, you can specify existing Cloud Storage buckets that Dataproc will use as your cluster's staging and temp bucket.
gcloud command
Run the gcloud dataproc clusters create command with the --bucket and/or --temp-bucket flags locally in a terminal window or in Cloud Shell to specify your cluster's staging and/or temp bucket.
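For example, where cluster-name, region, staging-bucket-name, and temp-bucket-name are placeholders for your own values:

gcloud dataproc clusters create cluster-name \
    --region=region \
    --bucket=staging-bucket-name \
    --temp-bucket=temp-bucket-name \
    other args ...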
Console
In the Google Cloud console, open the Dataproc Create a cluster page. Select the Customize cluster panel, then use the File storage field to specify or select the cluster's staging bucket.
Note: Currently, specifying a temp bucket using the Google Cloud console is not supported.
Dataproc uses a defined folder structure for Cloud Storage buckets attached to clusters. Dataproc also supports attaching more than one cluster to a Cloud Storage bucket. The folder structure used for saving job driver output in Cloud Storage is:
cloud-storage-bucket-name
  - google-cloud-dataproc-metainfo
    - list of cluster IDs
      - list of job IDs
        - list of output logs for a job
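As an illustration, assuming a hypothetical bucket named my-dataproc-bucket attached to a cluster, you could list the per-cluster folders with:

gcloud storage ls gs://my-dataproc-bucket/google-cloud-dataproc-metainfo/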
You can use the gcloud command-line tool, the Dataproc API, or the Google Cloud console to list the names of a cluster's staging and temp buckets.
Console
View cluster details, which include the name of the cluster's staging bucket, on the Dataproc Clusters page in the Google Cloud console.
On the Google Cloud console Cloud Storage Browser page, filter results that contain "dataproc-temp-".
gcloud command
Run the gcloud dataproc clusters describe command locally in a terminal window or in Cloud Shell. The staging and temp buckets associated with your cluster are listed in the output.
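For example, where cluster-name and region are placeholders and the output shown is abbreviated:

gcloud dataproc clusters describe cluster-name \
    --region=region
...
clusterName: cluster-name
clusterUuid: daa40b3f-5ff5-4e89-9bf1-bcbfec ...
config:
  configBucket: dataproc-...
  ...
  tempBucket: dataproc-temp...

The configBucket and tempBucket fields contain the names of the cluster's staging and temp buckets.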
defaultFS
You can set core:fs.defaultFS to a bucket location in Cloud Storage (gs://defaultFS-bucket-name) to set Cloud Storage as the default filesystem. This also sets core:fs.gs.reported.permissions, the reported permission returned by the Cloud Storage connector for all files, to 777.
If Cloud Storage is not set as the default filesystem, HDFS will be used, and the core:fs.gs.reported.permissions property will return 700, the default value.
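For example, the following cluster creation command sets Cloud Storage as the default filesystem, where defaultFS-bucket-name, cluster-name, region, staging-bucket-name, and temp-bucket-name are placeholders:

gcloud dataproc clusters create cluster-name \
    --properties=core:fs.defaultFS=gs://defaultFS-bucket-name \
    --region=region \
    --bucket=staging-bucket-name \
    --temp-bucket=temp-bucket-name \
    other args ...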
[[["Mudah dipahami","easyToUnderstand","thumb-up"],["Memecahkan masalah saya","solvedMyProblem","thumb-up"],["Lainnya","otherUp","thumb-up"]],[["Sulit dipahami","hardToUnderstand","thumb-down"],["Informasi atau kode contoh salah","incorrectInformationOrSampleCode","thumb-down"],["Informasi/contoh yang saya butuhkan tidak ada","missingTheInformationSamplesINeed","thumb-down"],["Masalah terjemahan","translationIssue","thumb-down"],["Lainnya","otherDown","thumb-down"]],["Terakhir diperbarui pada 2025-08-22 UTC."],[[["\u003cp\u003eDataproc uses HDFS as the default filesystem when creating a cluster, but you can override this by setting a Cloud Storage bucket as the defaultFS.\u003c/p\u003e\n"],["\u003cp\u003eDataproc creates or reuses Cloud Storage staging and temp buckets for clusters, with the staging bucket storing job dependencies and the temp bucket storing ephemeral data.\u003c/p\u003e\n"],["\u003cp\u003eUsers can specify their own existing Cloud Storage buckets for staging and temp instead of relying on Dataproc's default creation, which will allow for more control over data storage.\u003c/p\u003e\n"],["\u003cp\u003eYou can use the gcloud CLI, REST API, or the Google Cloud console to define and also list the names of the staging and temp buckets associated with your cluster.\u003c/p\u003e\n"],["\u003cp\u003eWhen using Assured Workloads for regulatory compliance, the cluster, VPC network, and Cloud Storage buckets must be contained within that specific environment.\u003c/p\u003e\n"]]],[],null,["# Dataproc staging and temp buckets\n\nWhen you create a cluster, HDFS is used as the default filesystem. You can\noverride this behavior by setting the defaultFS as a Cloud Storage [bucket](/storage/docs/buckets). By\ndefault, Dataproc also creates a Cloud Storage staging and a\nCloud Storage temp bucket in your project or reuses existing\nDataproc-created staging and temp buckets from previous cluster\ncreation requests.\n\n- Staging bucket: Used to stage cluster job dependencies,\n [job driver output](/dataproc/docs/guides/dataproc-job-output),\n and cluster config files. Also receives output from\n [Snapshot diagnostic data collection](/dataproc/docs/support/diagnose-clusters#snapshot_diagnostic_data_collection).\n\n- Temp bucket: Used to store ephemeral cluster and jobs data,\n such as Spark and MapReduce history files. Also stores\n [checkpoint diagnostic data](/dataproc/docs/support/diagnose-clusters#checkpoint_diagnostic_data_collection)\n collected during the lifecycle of a cluster.\n\nIf you do not specify a staging or temp bucket when you create a cluster,\nDataproc sets a [Cloud Storage location in US, ASIA,\nor EU](/storage/docs/locations#location-mr) for your cluster's staging and temp buckets\naccording to the Compute Engine zone where your cluster is deployed,\nand then creates and manages these project-level, per-location buckets.\nDataproc-created staging and temp buckets are\nshared among clusters in the same region, and are created with a\nCloud Storage [soft delete retention](/storage/docs/soft-delete#retention-duration)\nduration set to 0 seconds.\n\nThe temp bucket contains ephemeral data, and has a TTL of 90 days.\nThe staging bucket, which can contain configuration data\nand dependency files needed by multiple clusters,\ndoes not have a TTL. 
However, you can [apply a lifecycle rule to\nyour dependency files](/storage/docs/lifecycle#matchesprefix-suffix)\n(files with a \".jar\" filename extension located in the staging bucket folder)\nto schedule the removal of your dependency files when they are no longer\nneeded by your clusters.\n| To locate the default Dataproc staging and temp buckets using the Google Cloud console **[Cloud Storage Browser](https://console.cloud.google.com/storage/browser)**, filter results using the \"dataproc-staging-\" and \"dataproc-temp-\" prefixes.\n\nCreate your own staging and temp buckets\n----------------------------------------\n\nInstead of relying on the creation of a default\nstaging and temp bucket, you can specify existing Cloud Storage buckets that\nDataproc will use as your cluster's staging and temp bucket.\n**Note:** When you use an [Assured Workloads environment](/assured-workloads/docs/deploy-resource) for regulatory compliance, the cluster, VPC network, and Cloud Storage buckets must be contained within the Assured Workloads environment. \n\n### gcloud command\n\nRun the `gcloud dataproc clusters create` command with the\n[`--bucket`](/sdk/gcloud/reference/dataproc/clusters/create#--bucket)\nand/or\n[`--temp-bucket`](/sdk/gcloud/reference/dataproc/clusters/create#--temp-bucket)\nflags locally in a terminal window or in\n[Cloud Shell](https://console.cloud.google.com/?cloudshell=true)\nto specify your cluster's staging and/or temp bucket. \n\n```\ngcloud dataproc clusters create cluster-name \\\n --region=region \\\n --bucket=bucket-name \\\n --temp-bucket=bucket-name \\\n other args ...\n```\n\n### REST API\n\nUse the [`ClusterConfig.configBucket`](/dataproc/docs/reference/rest/v1/ClusterConfig#FIELDS.config_bucket) and\n[`ClusterConfig.tempBucket`](/dataproc/docs/reference/rest/v1/ClusterConfig#FIELDS.temp_bucket)\nfields\nin a [clusters.create](/dataproc/docs/reference/rest/v1/projects.regions.clusters/create)\nrequest to specify your cluster's staging and temp buckets.\n\n### Console\n\nIn the Google Cloud console, open the Dataproc\n[Create a cluster](https://console.cloud.google.com/dataproc/clustersAdd)\npage. Select the Customize cluster panel, then\nuse the File storage field to specify or select the cluster's staging\nbucket.\n\nNote: Currently, specifying a temp bucket using the Google Cloud console\nis not supported.\n\nDataproc uses a defined folder structure for Cloud Storage buckets\nattached to clusters. Dataproc also supports attaching more than one\ncluster to a Cloud Storage bucket. The folder structure used for saving job\ndriver output in Cloud Storage is: \n\n```\ncloud-storage-bucket-name\n - google-cloud-dataproc-metainfo\n - list of cluster IDs\n - list of job IDs\n - list of output logs for a job\n```\n\nYou can use the `gcloud` command line tool, Dataproc API, or\nGoogle Cloud console to list the name of a cluster's staging and temp buckets. 
\n\n### Console\n\n- \\\\View cluster details, which includeas the name of the cluster's staging bucket, on the Dataproc [Clusters](https://console.cloud.google.com/project/_/dataproc/clusters) page in the Google Cloud console.\n- On the Google Cloud console **[Cloud Storage Browser](https://console.cloud.google.com/storage/browser)** page, filter results that contain \"dataproc-temp-\".\n\n### gcloud command\n\nRun the\n[`gcloud dataproc clusters describe`](/sdk/gcloud/reference/dataproc/clusters/describe)\ncommand locally in a terminal window or in\n[Cloud Shell](https://console.cloud.google.com/?cloudshell=true).\nThe staging and temp buckets associated with your cluster are listed in the\noutput. \n\n```\ngcloud dataproc clusters describe cluster-name \\\n --region=region \\\n...\nclusterName: cluster-name\nclusterUuid: daa40b3f-5ff5-4e89-9bf1-bcbfec ...\nconfig:\n configBucket: dataproc-...\n ...\n tempBucket: dataproc-temp...\n```\n\n### REST API\n\nCall [clusters.get](/dataproc/docs/reference/rest/v1/projects.regions.clusters/get)\nto list the cluster details, including the name of the cluster's staging and temp buckets. \n\n```\n{\n \"projectId\": \"vigilant-sunup-163401\",\n \"clusterName\": \"cluster-name\",\n \"config\": {\n \"configBucket\": \"dataproc-...\",\n...\n \"tempBucket\": \"dataproc-temp-...\",\n}\n```\n\ndefaultFS\n---------\n\nYou can set `core:fs.defaultFS` to a bucket location in Cloud Storage (`gs://`\u003cvar translate=\"no\"\u003edefaultFS-bucket-name\u003c/var\u003e) to set Cloud Storage as the default filesystem. This also sets `core:fs.gs.reported.permissions`, the reported permission returned by the Cloud Storage connector for all files, to `777`.\n| **Note:** When you use an [Assured Workloads environment](/assured-workloads/docs/deploy-resource) for regulatory compliance, the cluster, VPC network, and Cloud Storage buckets must be contained within the Assured Workloads environment.\n\nIf Cloud Storage is not set as the default filesystem, HDFS will be used, and the `core:fs.gs.reported.permissions` property will return `700`, the default value. \n\n```\ngcloud dataproc clusters create cluster-name \\\n --properties=core:fs.defaultFS=gs://defaultFS-bucket-name \\\n --region=region \\\n --bucket=staging-bucket-name \\\n --temp-bucket=temp-bucket-name \\\n other args ...\n```\n\n\u003cbr /\u003e\n\n| **Note:** Currently, console display of the defaultFS bucket is not supported."]]