Dataproc cluster configuration

In Cloud Data Fusion, cluster configuration refers to defining how your data processing pipelines utilize computational resources when running Spark jobs on Dataproc. This page describes the main approaches to cluster configuration.

Default ephemeral clusters (recommended)

Using the default clusters is the recommended approach for Cloud Data Fusion pipelines.

Cloud Data Fusion automatically provisions and manages ephemeral Dataproc clusters for each pipeline execution. It creates a cluster at the beginning of the pipeline run, and then deletes it after the pipeline run completes.
Benefits of ephemeral clusters:
- Simplicity: you don't need to manually configure or manage the cluster.
- Cost-effectiveness: you only pay for the resources used during pipeline execution.

To adjust clusters and tune performance, see Cluster sizing.

Static clusters (for specific scenarios)

In the following scenarios, you can use static clusters:

Long-running pipelines: for pipelines that run continuously or for extended periods, a static cluster can be more cost-effective than repeatedly creating and tearing down ephemeral clusters.
Centralized cluster management: if your organization requires centralized control over cluster creation and management policies, static clusters can be used alongside tools like Terraform.
Cluster creation time: when the time it takes to create a new cluster for every pipeline is prohibitive for your use case.

However, static clusters require more manual configuration and involve managing the cluster lifecycle yourself.

To use a static cluster, you must set the following properties on the Dataproc cluster:

dataproc:dataproc.conscrypt.provider.enable=false
capacity-scheduler:yarn.scheduler.capacity.resource-calculator="org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator"

Cluster configuration options for static clusters

If you choose to use static clusters, Cloud Data Fusion offers configuration options for the following aspects:

Worker machine type: specify the virtual machine type for the worker nodes in your cluster. This determines the vCPUs and memory available for each worker.
Number of workers: define the initial number of worker nodes in your cluster. Dataproc might still autoscale this number, based on workload.
Zone: select your cluster's Google Cloud zone. Location can affect data locality and network performance.
Additional configurations: you can configure advanced options for your static cluster, such as preemption settings, network settings, and initialization actions.

Best practices

When creating a static cluster for your pipelines, use the following configurations.

Parameters	Description
`yarn.nodemanager.delete.debug-delay-sec`	Retains YARN logs. Recommended value: `86400` (equivalent to one day)
`yarn.nodemanager.pmem-check-enabled`	Enables YARN to check for physical memory limits and kill containers if they go beyond physical memory. Recommended value: `false`
`yarn.nodemanager.vmem-check-enabled`	Enables YARN to check for virtual memory limits and kill containers if they go beyond physical memory. Recommended value: `false`
`dataproc.scheduler.driver-size-mb`	The average memory footprint of the driver makes Dataproc queue the job if the master node lacks sufficient memory to run the driver process. This can impact job concurrency but can be mitigated by using a master node with more memory. Recommended value: `2048`

For more information, see Run a pipeline against an existing Dataproc cluster.

Reusing clusters

You can reuse Dataproc clusters between runs to improve processing time. Cluster reuse is implemented in a model similar to connection pooling or thread pooling. Any cluster is kept up and running for a specified time after the run is finished. When a new run is started, it will try to find an idle cluster available that matches the configuration of the compute profile. If one is present, it will be used, otherwise a new cluster will be started.

Considerations for reusing clusters

Clusters are not shared. Similar to the regular ephemeral cluster provisioning model, a cluster runs a single pipeline run at a time. A cluster is reused only if it is idle.
If you enable cluster reuse for all your runs, the necessary number of clusters to process all your runs will be created as needed. Similar to the ephemeral Dataproc provisioner, there is no direct control on the number of clusters created. You can still use Google Cloud quotes to manage resources. For example, if you run 100 runs with 7 maximum parallel runs, you will have up to 7 clusters at a given point of time.
Clusters are reused between different pipelines as soon as those pipelines are using the same profile and share the same profile settings. If profile customization is used, clusters will still be reused, but only if customizations are exactly the same, including all cluster settings like cluster labeling.
When cluster reuse is enabled, there are two main cost considerations:
- Less resources are used for cluster startup and initialization.
- More resources are used for clusters to sit idle between the pipeline runs and after the last pipeline run.

While it's hard to predict the cost effect of cluster reuse, you can employ a strategy to get maximum savings. The strategy is to identify a critical path for chained pipelines and enable cluster reuse for this critical path. This would ensure the cluster is immediately reused, no idle time is wasted and maximum performance benefits are achieved.

Enable Cluster Reuse

In the Compute Config section of deployed pipeline configuration or when creating new compute profile:

Enable Skip Cluster Delete.
Max Idle Time is the time up to which a cluster waits for the next pipeline to reuse it. The default Max Idle Time is 30 minutes. For Max Idle Time, consider the cost versus cluster availability for reuse. The higher the value of Max Idle Time, the more clusters sit idle, ready for a run.

Troubleshoot: Version compatibility

Problem: The version of your Cloud Data Fusion environment might not be compatible with the version of your Dataproc cluster.

Recommended: Upgrade to the latest Cloud Data Fusion version and use one of the supported Dataproc versions.

Earlier versions of Cloud Data Fusion are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates and support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with one created with a supported version.

Cloud Data Fusion version	Dataproc version
6.10.1.1	2.2^**, 2.1, 2.0 ^
6.10	2.1, 2.0 ^*
6.9	2.1, 2.0, 1.5 ^*
6.7-6.8	2.0, 1.5 ^*
6.4-6.6	2.0 ^, 1.3 ^*
6.1-6.3	1.3^**

^* Cloud Data Fusion versions 6.4 and later are compatible with supported versions of Dataproc. Unless specific OS features are needed, the recommended practice is to specify the major.minor image version.
To specify the OS version used in your Dataproc cluster, the OS version must be compatible with one of the supported Dataproc versions for your Cloud Data Fusion in the preceding table.

^** Cloud Data Fusion versions 6.1 to 6.6 are compatible with unsupported Dataproc version 1.3.

^*** Certain issues are detected with this image version. This Dataproc image version is not recommended for production use.

Troubleshoot: Container exited with a non-zero exit code 3

Problem: An autoscaling policy isn't used, and the static Dataproc clusters are encountering memory pressure, causing an out of memory exception to appear in the logs: Container exited with a non-zero exit code 3.

Recommended: Increase the executor memory.

Increase the memory by adding a task.executor.system.resources.memory runtime argument to the pipeline. The following example runtime argument sets the memory to 4096 MB:

"task.executor.system.resources.memory": 4096

For more information, see Cluster sizing.

What's next

Refer to the How to change Dataproc image version.