Troubleshoot batch and session creation failures
This document provides guidance on troubleshooting common issues that prevent
Google Cloud Serverless for Apache Spark batch workloads and interactive sessions from starting.
Overview
Typically, when a batch or session fails to start, it reports the
following error message:
Driver compute node failed to initialize for batch in 600 seconds
This error message indicates that the Spark driver
couldn't start within the default timeout period of 600 seconds (10 minutes).
Common causes are related to service account permissions, resource availability,
network configuration, or Spark properties.
Batch and session start failure causes and troubleshooting steps
The following sections list common causes of batch and session start failures with
troubleshooting tips to help you resolve the issues.
Insufficient service account permissions
The service account used by your Serverless for Apache Spark batch or session requires specific
IAM roles
that include permissions for Serverless for Apache Spark operation and access
to Google Cloud resources. If the service account lacks the necessary roles,
the Spark driver for the batch or session can fail to initialize.
Required Worker role: The batch or session service account must have the
Dataproc Worker role (roles/dataproc.worker). This role contains
the minimum permissions needed for Serverless for Apache Spark to provision and
manage compute resources.
Data Access Permissions: If your Spark application reads from or
writes to Cloud Storage or BigQuery, the service
account needs roles related to those services:
Cloud Storage: The Storage Object Viewer role (roles/storage.objectViewer)
is needed for reading, and the Storage Object Creator role (roles/storage.objectCreator)
or Storage Object Admin role (roles/storage.objectAdmin) is needed for writing.
BigQuery: The BigQuery Data Viewer role (roles/bigquery.dataViewer)
is needed for reading and the BigQuery Data Editor role (roles/bigquery.dataEditor)
is needed for writing.
Logging Permissions: The service account needs a role with
permission to write logs to Cloud Logging. Typically, the
Logs Writer role (roles/logging.logWriter) is sufficient.
Troubleshooting tips:
Identify the batch or session service account. If you don't specify one, it
defaults to the Compute Engine default service account.
Go to the IAM & Admin > IAM page in the Google Cloud console, find the batch or
session service account, and then verify that it has the necessary roles.
Grant any missing roles.
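You can also check and grant roles from the command line. The following commands
are a minimal sketch; PROJECT_ID and SA_EMAIL are placeholders for your project ID
and the batch or session service account email:
# List the roles currently granted to the service account.
gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:SA_EMAIL" \
    --format="table(bindings.role)"
# Grant the Dataproc Worker role if it is missing.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SA_EMAIL" \
    --role="roles/dataproc.worker"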
Insufficient quota
Exceeding project or region-specific quotas for Google Cloud Serverless for Apache Spark
or other Google Cloud resources can prevent new batches or sessions from starting.
Troubleshooting tips:
Review the Google Cloud Serverless for Apache Spark quotas to understand limits
on concurrent batches, DCUs, and shuffle storage.
You can also use the gcloud CLI to view current quota usage and limits for the
Dataproc service in your project:
gcloud quotas info list --service=dataproc.googleapis.com --project=PROJECT_ID
If you repeatedly hit quota limits, consider requesting a quota
increase through the Google Cloud console.
Network configuration issues
Incorrect network settings, such as VPC configuration, Private Google Access,
or firewall rules, can block the Spark driver from initializing or connecting to
necessary services.
Troubleshooting tips:
Verify that the VPC network and subnet specified for your batch or session are
correctly configured and have sufficient IP addresses available.
If your batch or session needs to access Google APIs
and services without traversing the public internet, verify that
Private Google Access is enabled for the subnet. Serverless for Apache Spark
batch workloads and interactive sessions run on VMs with internal IP addresses
only, on a regional subnet with Private Google Access automatically enabled on
the session subnet. You can confirm the setting with the command shown after
these tips.
Review your VPC firewall rules to verify that they don't
inadvertently block internal communication or egress to Google APIs or
external services that are required by your Spark application.
To diagnose network connectivity issues, also see Troubleshoot batch and
session connectivity.
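As a quick check, the following command prints whether Private Google Access is
enabled on a subnet; SUBNET and REGION are placeholders for the subnet and region
used by your batch or session:
# Prints "True" when Private Google Access is enabled on the subnet.
gcloud compute networks subnets describe SUBNET \
    --region=REGION \
    --format="get(privateIpGoogleAccess)"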
Invalid Spark properties or application code issues
Misconfigured Spark properties, particularly those related to driver resources,
or issues within your Spark application code can lead to startup failures.
Troubleshooting tips:
Check spark.driver.memory and spark.driver.cores values.
Verify they are within reasonable limits and align with available DCUs.
Excessively large values for these properties can lead to resource
exhaustion and initialization failures. Remove any unnecessary or
experimental Spark properties to simplify debugging.
Try running a "Hello World" Spark application to determine whether the issue
is with your environment setup or is due to code complexity or errors, as shown
in the example that follows these tips.
Verify that all application JARs, Python files,
or dependencies specified for your batch or session are correctly
located in Cloud Storage and are accessible by the
batch or session service account.
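For example, the following sketch creates and submits a minimal PySpark batch;
BUCKET and REGION are placeholders, and the driver property values shown are
illustrative, not required:
# Create a minimal PySpark application and copy it to Cloud Storage.
cat > hello.py <<'EOF'
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hello").getOrCreate()
print(spark.range(10).count())
spark.stop()
EOF
gcloud storage cp hello.py gs://BUCKET/hello.py
# Submit the batch with explicit driver resource properties.
gcloud dataproc batches submit pyspark gs://BUCKET/hello.py \
    --region=REGION \
    --properties=spark.driver.cores=4,spark.driver.memory=9600m
If this minimal batch starts successfully, your service account, network, and
quota setup are likely sound, and the failure is more likely caused by your
application code or Spark properties.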
Check logs
A critical step in diagnosing batch and session creation failures is to examine
the detailed logs in Cloud Logging.
Go to the Cloud Logging page
in the Google Cloud console.
Filter for Serverless for Apache Spark Batches or Sessions:
In the Resource drop-down, select Cloud Dataproc Batch or
Cloud Dataproc Session.
Filter by batch_id or session_id for the failed batch or session.
You can also filter by project_id and location (region).
Look for log entries with jsonPayload.component="driver".
These logs often contain specific error messages or stack traces that
can pinpoint the reason for the driver initialization failure
before the 600-second timeout occurs.
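You can also query these logs from the command line. The following sketch assumes
the logging resource type cloud_dataproc_batch and a BATCH_ID placeholder; for
sessions, use cloud_dataproc_session and the session_id label instead:
# Read driver logs for the failed batch.
gcloud logging read \
    'resource.type="cloud_dataproc_batch" AND resource.labels.batch_id="BATCH_ID" AND jsonPayload.component="driver"' \
    --project=PROJECT_ID \
    --limit=50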
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-28 UTC."],[],[],null,["# Troubleshoot batch and session creation failures\n\nThis document provides guidance on troubleshooting common issues that prevent\nGoogle Cloud Serverless for Apache Spark Spark batch workloads and interactive sessions from starting.\n\nOverview\n--------\n\nTypically, when a batch or session fails to start, it reports the\nfollowing error message: \n\n```\nDriver compute node failed to initialize for batch in 600 seconds\n```\n\nThis error message indicates that the Spark driver\ncouldn't start within the default timeout period of 600 seconds (10 minutes).\nCommon causes are related to service account permissions, resource availability,\nnetwork configuration, or Spark properties.\n\nBatch and session start failure causes and troubleshooting steps\n----------------------------------------------------------------\n\nThe following sections list common causes of batch and session start failures with\ntroubleshooting tips to help you resolve the issues.\n\n### Insufficient service account permissions\n\nThe service account used by your Serverless for Apache Spark batch or session requires specific\nIAM roles\nthat include permissions for Serverless for Apache Spark operation and access\nto Google Cloud resources. If the service account lacks the necessary roles,\nthe Spark driver for the batch or session can fail to initialize.\n\n- Required Worker role: The batch or session service account must have the **Dataproc Worker role** (`roles/dataproc.worker`). This role contains the minimum permissions needed for Serverless for Apache Spark to provision and manage compute resources.\n- Data Access Permissions: If your Spark application reads from or writes to Cloud Storage or BigQuery, the service account needs roles related to those services:\n - Cloud Storage: The **`Storage Object Viewer` role** (`roles/storage.objectViewer`) is needed for reading, and the **`Storage Object Creator` role** (`roles/storage.objectCreator`) or **`Storage Object Admin` role** (`roles/storage.admin`) is needed for writing.\n - BigQuery: The **`BigQuery Data Viewer` role** (`roles/bigquery.dataViewer`) is needed for reading and the **`BigQuery Data Editor` role** (`roles/bigquery.dataEditor`) is needed for writing.\n- Logging Permissions: The service account needs a role with permission to write logs to Cloud Logging. Typically, the **`Logging Writer` role** (`roles/logging.logWriter`) is sufficient.\n\nTroubleshooting tips:\n\n- Identify the batch or session [service account](/dataproc-serverless/docs/concepts/service-account). If not specified, it defaults to the [Compute Engine default service account](/compute/docs/access/service-accounts#default_service_account).\n- Go to the [**IAM \\& Admin \\\u003e IAM**](https://console.cloud.google.com/iam-admin/iam) page in the Google Cloud console, find the batch or session service account, and then verify that it has the necessary roles needed for operations. 
Grant any missing roles.\n\n### Insufficient quota\n\nExceeding project or region-specific quotas for Google Cloud Serverless for Apache Spark\nor other Google Cloud resources can prevent new batches or session from starting.\n\nTroubleshooting tips:\n\n- Review the [Google Cloud Serverless for Apache Spark quotas](/dataproc-serverless/quotas) page\n to understand limits on concurrent batches, DCUs, and shuffle storage.\n\n - You can also use the `gcloud compute quotas list` command to view current usage and limits for your project and region: \n\n ```\n gcloud compute quotas list --project=PROJECT_ID --filter=\"service:dataproc.googleapis.com\"\n ```\n- If you repeatedly hit quota limits, consider requesting a quota\n increase through the Google Cloud console.\n\n### Network configuration issues\n\nIncorrect network settings, such as VPC configuration, Private Google Access,\nor firewall rules, can block the Spark driver from initializing or connecting to\nnecessary services.\n\nTroubleshooting tips:\n\n- Verify that the VPC network and subnet specified for your batch or session are\n correctly configured and have sufficient IP addresses available.\n\n- If your batch or session needs to access Google APIs\n and services without traversing the public internet, verify\n Private Google Access is enabled for the subnet.\n\n | Serverless for Apache Spark batch workloads and interactive sessions run on VMs with internal IP addresses only and on a regional subnet with Private Google Access automatically enabled on the session subnet.\n\n \u003cbr /\u003e\n\n- Review your VPC firewall rules to verify they don't\n inadvertently block internal communication or egress to Google APIs or\n external services that are required by your Spark application.\n\n| **Tip:** To diagnose batch and network connectivity issues, also see [Troubleshoot batch and session connectivity](/dataproc-serverless/docs/support/troubleshoot-connectivity#network-configuration-issues).\n\n### Invalid spark properties or application code issues\n\nMisconfigured Spark properties, particularly those related to driver resources,\nor issues within your Spark application code can lead to startup failures.\n\nTroubleshooting tips:\n\n- Check [`spark.driver.memory` and `spark.driver.cores`](/dataproc-serverless/docs/concepts/properties#resource_allocation_properties) values.\n Verify they are within reasonable limits and align with available DCUs.\n Excessively large values for these properties can lead to resource\n exhaustion and initialization failures. Remove any unnecessary or\n experimental Spark properties to simplify debugging.\n\n- Try running a \"Hello World\" Spark application to determine if the issue\n is with your environment setup or due to code complexity or errors.\n\n- Verify that all application JARs, Python files,\n or dependencies specified for your batch or session are correctly\n located in Cloud Storage and are accessible by the\n batch or session service account.\n\nCheck logs\n----------\n\nA critical step in diagnosing batch creation failures is to examine\nthe detailed logs in Cloud Logging.\n\n1. Go to the [**Cloud Logging**](https://console.cloud.google.com/logs/viewer) page in the Google Cloud console.\n2. Filter for Serverless for Apache Spark Batches or Sessions:\n 1. In the **Resource** drop-down, select `Cloud Dataproc Batch` or `Cloud Dataproc Session`.\n 2. Filter by `batch_id` or `session_id` for the failed batch or session. You can also filter by `project_id` and `location` (region).\n3. 
Look for log entries with `jsonPayload.component=\"driver\"`. These logs often contain specific error messages or stack traces that can pinpoint the reason for the driver initialization failure before the 600-second timeout occurs."]]