This page provides information on Dataproc on Compute Engine VM out-of-memory (OOM) errors, and explains steps you can take to troubleshoot and resolve OOM errors.
OOM error effects
When Dataproc on Compute Engine VMs encounter out-of-memory (OOM) errors, the effects include the following conditions:
- Master and worker VMs freeze for a period of time. 
- Master VM OOM errors cause jobs to fail with "task not acquired" errors. 
- Worker VM OOM errors cause the loss of the node in YARN and HDFS, which delays Dataproc job execution. 
YARN memory controls
Apache YARN provides the following types of memory controls:
- Polling based (legacy)
- Strict
- Elastic
By default, Dataproc doesn't set `yarn.nodemanager.resource.memory.enabled` to enable YARN memory controls, for the following reasons (for a cluster-property sketch, see the example after this list):
- If container sizes aren't configured correctly, strict memory control can terminate containers even when there is sufficient memory.
- Elastic memory control requirements can adversely affect job execution.
- YARN memory controls can fail to prevent OOM errors when processes aggressively consume memory.
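If you choose to enable a control despite these caveats, YARN properties can be supplied at cluster creation through the `yarn:` prefix of the `--properties` flag. The following is a sketch only; the cluster name and region are placeholder assumptions, and the property shown is the one named above:

```
# Sketch: create a cluster with the YARN memory control property named above.
# Enabling it is subject to the caveats listed above; values are assumptions.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --properties='yarn:yarn.nodemanager.resource.memory.enabled=true'
```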
Dataproc memory protection
When a Dataproc cluster VM is under memory pressure, Dataproc memory protection terminates processes or containers until the OOM condition is removed.
Dataproc provides memory protection for the following cluster nodes on these Dataproc on Compute Engine image versions:
| Role | 1.5 | 2.0 | 2.1 | 2.2 |
|---|---|---|---|---|
| Master VM | 1.5.74+ | 2.0.48+ | All | All |
| Worker VM | Not available | 2.0.76+ | 2.1.24+ | All |
| Driver pool VM | Not available | 2.0.76+ | 2.1.24+ | All |
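For context, the following is a minimal cluster creation sketch that lands on an image version covered by the table above; the cluster name, region, and the specific `2.2` image track are illustrative assumptions:

```
# Sketch: create a cluster on a 2.2 image, where memory protection is
# available for master, worker, and driver pool VMs per the table above.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --image-version=2.2-debian12
```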
Identify and confirm memory protection terminations
You can use the following information to identify and confirm job terminations due to memory pressure.
Process terminations
- Processes that Dataproc memory protection terminates exit with code `137` or `143`.
- When Dataproc terminates a process due to memory pressure, the following actions or conditions can occur:
  - Dataproc increments the `dataproc.googleapis.com/node/problem_count` cumulative metric, and sets the `reason` to `ProcessKilledDueToMemoryPressure`. See Dataproc resource metric collection.
  - Dataproc writes a `google.dataproc.oom-killer` log with the message `"A process is killed due to memory pressure: process name"`. To view these messages, enable Logging, then use the following log filter (for a command-line version, see the sketch after this list):

    ```
    resource.type="cloud_dataproc_cluster"
    resource.labels.cluster_name="CLUSTER_NAME"
    resource.labels.cluster_uuid="CLUSTER_UUID"
    jsonPayload.message:"A process is killed due to memory pressure:"
    ```
 
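The following is a minimal command-line sketch for the filter above, assuming the `gcloud` CLI is installed and authenticated; `PROJECT_ID`, `CLUSTER_NAME`, and `CLUSTER_UUID` are placeholders you replace:

```
# Sketch: list recent Dataproc memory-protection terminations for a cluster.
gcloud logging read '
  resource.type="cloud_dataproc_cluster"
  resource.labels.cluster_name="CLUSTER_NAME"
  resource.labels.cluster_uuid="CLUSTER_UUID"
  jsonPayload.message:"A process is killed due to memory pressure:"' \
    --project=PROJECT_ID \
    --limit=20
```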
Master node or driver node pool job terminations
- When a Dataproc master node or driver node pool job terminates due to memory pressure, the job fails with the error `Driver received SIGTERM/SIGKILL signal and exited with INT code`. To view these messages, enable Logging, then use the following log filter:

  ```
  resource.type="cloud_dataproc_cluster"
  resource.labels.cluster_name="CLUSTER_NAME"
  resource.labels.cluster_uuid="CLUSTER_UUID"
  jsonPayload.message:"Driver received SIGTERM/SIGKILL signal and exited with"
  ```

- Check the `google.dataproc.oom-killer` log or the `dataproc.googleapis.com/node/problem_count` metric to confirm that Dataproc memory protection terminated the job (see Process terminations).
- Solutions (for a command-line sketch, see the example after this list):
  - If the cluster has a driver pool, increase `driver-required-memory-mb` to match actual job memory usage.
  - If the cluster doesn't have a driver pool, recreate the cluster, lowering the maximum number of concurrent jobs that run on the cluster.
  - Use a master node machine type with more memory.
 
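The following sketch shows two of these solutions applied from the command line. The cluster and job names, region, machine type, and numeric values are illustrative assumptions; confirm the `--driver-required-memory-mb` flag and the `dataproc:dataproc.scheduler.max-concurrent-jobs` cluster property against the current gcloud and Dataproc cluster-property documentation before relying on them:

```
# Sketch 1 (assumes a driver pool): resubmit the job with a larger driver
# memory reservation. The 4096 MB value is an illustrative assumption.
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --driver-required-memory-mb=4096 \
    --class=org.example.ExampleJob \
    --jars=gs://example-bucket/example-job.jar

# Sketch 2 (no driver pool): recreate the cluster with more master memory and
# a lower cap on concurrent jobs. Property name and values are assumptions.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --master-machine-type=n2-highmem-4 \
    --properties='dataproc:dataproc.scheduler.max-concurrent-jobs=3'
```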
Worker node YARN container terminations
- Dataproc writes the following message in the YARN resource manager logs: `container id exited with code EXIT_CODE`. To view these messages, enable Logging, then use the following log filter:

  ```
  resource.type="cloud_dataproc_cluster"
  resource.labels.cluster_name="CLUSTER_NAME"
  resource.labels.cluster_uuid="CLUSTER_UUID"
  jsonPayload.message:"container" AND "exited with code" AND "which potentially signifies memory pressure on NODE"
  ```

- If a container exited with code `INT`, check the `google.dataproc.oom-killer` log or the `dataproc.googleapis.com/node/problem_count` metric to confirm that Dataproc memory protection terminated the job (see Process terminations).
- Solutions (for a command-line sketch, see the example after this list):
  - Check that container sizes are configured correctly.
  - Consider lowering `yarn.nodemanager.resource.memory-mb`. This property controls the amount of memory used for scheduling YARN containers.
  - If job containers consistently fail, check whether data skew is causing increased usage of specific containers. If so, repartition the job or increase worker size to accommodate additional memory requirements.
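The following is a minimal sketch of lowering the NodeManager memory allotment at cluster creation; the cluster name, region, machine type, and the 12288 MB value are assumptions that you should size for your own workload:

```
# Sketch: recreate the cluster so YARN schedules less container memory per
# worker, leaving more headroom for non-YARN processes on the node.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --worker-machine-type=n2-highmem-8 \
    --properties='yarn:yarn.nodemanager.resource.memory-mb=12288'
```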