Cluster caching
When you enable Dataproc cluster caching, the cluster caches Cloud Storage data frequently accessed by your Spark jobs.
Benefits
Improved performance: caching can improve job performance by reducing the time spent retrieving data from storage.
Reduced storage costs: because hot data is cached on local disk, fewer API calls are made to storage to retrieve data.
Spark job applicability: when cluster caching is enabled on a cluster, it applies to all Spark jobs run on the cluster, whether they are submitted to the Dataproc service or run independently on the cluster (see the sketch after this list).
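To illustrate that last point, caching requires no changes to job code. The following is a minimal PySpark sketch; the gs:// path is a hypothetical placeholder, and nothing in the job refers to caching, because the cluster handles it transparently once the feature is enabled.

```
# Minimal PySpark sketch: an ordinary Spark job that reads Cloud Storage data.
# The gs:// path is a hypothetical placeholder; nothing caching-specific is
# needed in the job itself, because caching is handled by the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

# Reads go through the Cloud Storage connector; with cluster caching enabled,
# frequently accessed data is served from the NVME local SSD cache.
df = spark.read.parquet("gs://example-bucket/events/")
print(df.count())

spark.stop()
```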
Limitations and requirements
Caching applies only to Dataproc Spark jobs.
Only Cloud Storage data is cached.
Caching applies only to clusters that meet the following requirements (a verification sketch follows this list):
The cluster has one master and n workers; High Availability (HA) and single-node clusters are not supported.
This feature is available in Dataproc on Compute Engine image versions 2.0.72+, 2.1.20+, and 2.2.0+.
Each cluster node must have local SSDs attached with the NVME (Non-Volatile Memory Express) interface; Persistent Disks (PDs) are not supported. Data is cached on NVME local SSDs only.
The cluster uses the default VM service account for authentication; custom VM service accounts are not supported.
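One way to confirm after creation that a cluster satisfies these requirements is to inspect its configuration. The sketch below assumes the google-cloud-dataproc Python client library, with placeholder project, region, and cluster names; it is illustrative, not an official verification tool.

```
# Hedged sketch: inspect an existing cluster's configuration to confirm that
# the caching property is set and that workers have NVME local SSDs attached.
# PROJECT_ID, REGION, and CLUSTER_NAME are hypothetical placeholders.
from google.cloud import dataproc_v1

PROJECT_ID = "example-project"
REGION = "us-central1"
CLUSTER_NAME = "example-cluster"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)
cluster = client.get_cluster(
    project_id=PROJECT_ID, region=REGION, cluster_name=CLUSTER_NAME
)

# The caching feature is controlled by this cluster property.
props = cluster.config.software_config.properties
caching_enabled = props.get("dataproc:dataproc.cluster.caching.enabled") == "true"

# Data is cached only on NVME local SSDs, so workers need them attached.
disk = cluster.config.worker_config.disk_config
nvme_ssds = disk.num_local_ssds > 0 and disk.local_ssd_interface.lower() == "nvme"

print(f"caching property set: {caching_enabled}; worker NVME local SSDs: {nvme_ssds}")
```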
Enable cluster caching
You can enable cluster caching when you create a Dataproc cluster using the Google Cloud console, the Google Cloud CLI, or the Dataproc API.

Google Cloud console
Open the Dataproc Create a cluster on Compute Engine page in the Google Cloud console.
The Set up cluster panel is selected. In the Spark performance enhancements section, select Enable Google Cloud Storage caching.
After confirming and specifying cluster details in the cluster create panels, click Create.
[[["Fácil de entender","easyToUnderstand","thumb-up"],["Meu problema foi resolvido","solvedMyProblem","thumb-up"],["Outro","otherUp","thumb-up"]],[["Difícil de entender","hardToUnderstand","thumb-down"],["Informações incorretas ou exemplo de código","incorrectInformationOrSampleCode","thumb-down"],["Não contém as informações/amostras de que eu preciso","missingTheInformationSamplesINeed","thumb-down"],["Problema na tradução","translationIssue","thumb-down"],["Outro","otherDown","thumb-down"]],["Última atualização 2025-08-22 UTC."],[[["\u003cp\u003eEnabling Dataproc cluster caching improves Spark job performance by caching frequently accessed Cloud Storage data on local SSDs, reducing data retrieval time and storage costs.\u003c/p\u003e\n"],["\u003cp\u003eCluster caching applies to all Spark jobs on the cluster, whether submitted to the Dataproc service or run independently, and this applies only to Cloud Storage data.\u003c/p\u003e\n"],["\u003cp\u003eCluster caching is only compatible with clusters meeting specific criteria, such as having one master and \u003ccode\u003en\u003c/code\u003e workers, supported image versions (\u003ccode\u003e2.0.72+\u003c/code\u003e, \u003ccode\u003e2.1.20+\u003c/code\u003e, \u003ccode\u003e2.2.0+\u003c/code\u003e), NVME local SSDs, and the default VM service account.\u003c/p\u003e\n"],["\u003cp\u003eYou can enable cluster caching during Dataproc cluster creation through the Google Cloud console, gcloud CLI, or the Dataproc API, using the property \u003ccode\u003edataproc:dataproc.cluster.caching.enabled=true\u003c/code\u003e.\u003c/p\u003e\n"]]],[],null,["# Cluster caching\n\nWhen you enable Dataproc cluster caching, the cluster caches\nCloud Storage data frequently accessed by your Spark jobs.\n\nBenefits\n--------\n\n- **Improved performance:** Caching can improve job performance by reducing the amount of time spent retrieving data from storage.\n- **Reduced storage costs:** Since hot data is cached on local disk, fewer API calls are made to storage to retrieve data.\n- **Spark job applicability**: When cluster caching is enabled on a cluster, it applies to all Spark jobs run on the cluster, whether submitted to the Dataproc service or run independently on the cluster.\n\nLimitations and requirements\n----------------------------\n\n- Caching applies to Dataproc Spark jobs only.\n- Only Cloud Storage data is cached.\n- Caching only applies to clusters that meet the following requirements:\n - The cluster has one master and `n` workers ([High Availability (HA)](/dataproc/docs/concepts/configuring-clusters/high-availability) and [single node](/dataproc/docs/concepts/configuring-clusters/single-node-clusters) clusters are not supported).\n - This feature is available in Dataproc on Compute Engine [image versions](/dataproc/docs/concepts/versioning/dataproc-version-clusters#supported-dataproc-image-versions) `2.0.72+`, `2.1.20+`, and `2.2.0+`.\n - Each cluster node must have [local SSDs](/dataproc/docs/concepts/compute/dataproc-local-ssds) attached with the [NVME (Non-Volatile Memory Express)](/compute/docs/disks/local-ssd#nvme) interface (Persistent Disks (PDs) are not supported). Data is cached on NVME local SSDs only.\n - The cluster uses the [default VM service account](/dataproc/docs/concepts/configuring-clusters/service-accounts#VM_service_account) for authentication. 
[Custom VM service accounts](/dataproc/docs/concepts/configuring-clusters/service-accounts#create_a_cluster_with_a_custom_vm_service_account) are not supported.\n\nEnable cluster caching\n----------------------\n\nYou can enable cluster caching when you create a Dataproc cluster\nusing the Google Cloud console, Google Cloud CLI, or the Dataproc API. \n\n### Google Cloud console\n\n- Open the Dataproc [**Create a cluster on Compute Engine**](https://console.cloud.google.com/dataproc/clustersAdd) page in the Google Cloud console.\n- The **Set up cluster** panel is selected. In the **Spark performance enhancements** section, select **Enable Google Cloud Storage caching**.\n- After confirming and specifying cluster details in the cluster create panels, click **Create**.\n\n### gcloud CLI\n\nRun the [gcloud dataproc clusters create](/sdk/gcloud/reference/dataproc/clusters/create)\ncommand locally in a terminal window or in\n[Cloud Shell](https://console.cloud.google.com/?cloudshell=true)\nusing the `dataproc:dataproc.cluster.caching.enabled=true`\n[cluster property](/dataproc/docs/concepts/configuring-clusters/cluster-properties#dataproc_service_properties_table).\n\nExample: \n\n```\ngcloud dataproc clusters create CLUSTER_NAME \\\n --region=REGION \\\n --properties dataproc:dataproc.cluster.caching.enabled=true \\\n --num-master-local-ssds=2 \\\n --master-local-ssd-interface=NVME \\\n --num-worker-local-ssds=2 \\\n --worker-local-ssd-interface=NVME \\\n other args ...\n \n```\n\n### REST API\n\nSet [SoftwareConfig.properties](/static/dataproc/docs/reference/rest/v1/ClusterConfig#SoftwareConfig.FIELDS.properties)\nto include the `\"dataproc:dataproc.cluster.caching.enabled\": \"true\"`\n[cluster property](/dataproc/docs/concepts/configuring-clusters/cluster-properties#dataproc_service_properties_table)\nas part of a\n[clusters.create](/dataproc/docs/reference/rest/v1/projects.regions.clusters/create)\nrequest."]]
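The same request can also be made through the Cloud Client Libraries rather than raw REST. The following minimal Python sketch assumes the google-cloud-dataproc library; the project ID, region, cluster name, machine types, and local SSD counts are placeholder values that mirror the gcloud example above.

```
# Hedged sketch: create a cluster with the caching property and NVME local SSDs
# using the google-cloud-dataproc client library (an alternative to raw REST).
# All names, machine types, and counts below are placeholders.
from google.cloud import dataproc_v1

PROJECT_ID = "example-project"
REGION = "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": "cache-enabled-cluster",
    "config": {
        # Equivalent of --properties dataproc:dataproc.cluster.caching.enabled=true
        "software_config": {
            "properties": {"dataproc:dataproc.cluster.caching.enabled": "true"}
        },
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n2-standard-4",
            "disk_config": {"num_local_ssds": 2, "local_ssd_interface": "nvme"},
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n2-standard-4",
            "disk_config": {"num_local_ssds": 2, "local_ssd_interface": "nvme"},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
print(operation.result())  # blocks until the cluster is created
```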