This document describes Spark properties and how to set them. Serverless for Apache Spark uses Spark properties to determine the compute, memory, and disk resources to allocate to your batch workload. These property settings can affect workload quota consumption and cost. For more information, see Serverless for Apache Spark quotas and Serverless for Apache Spark pricing.
Set Spark batch workload properties
You can specify Spark properties when you submit a Serverless for Apache Spark batch workload using the Google Cloud console, gcloud CLI, or the Dataproc API.
Console
- In the Google Cloud console, go to the Dataproc create batch page. 
- In the Properties section, click Add Property. 
- Enter the Key (name) and Value of a supported Spark property.
gcloud
gcloud CLI batch submission example:
gcloud dataproc batches submit spark \
    --properties=spark.checkpoint.compress=true \
    --region=region \
    other args ...
API
Set RuntimeConfig.properties with supported Spark properties as part of a batches.create request.
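For example, the request body carries the properties in the runtimeConfig.properties field. The following curl sketch uses placeholder PROJECT_ID and REGION values, and gs://your-bucket/your-app.jar stands in for your application JAR:
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
      "sparkBatch": {
        "mainJarFileUri": "gs://your-bucket/your-app.jar"
      },
      "runtimeConfig": {
        "properties": {
          "spark.checkpoint.compress": "true"
        }
      }
    }' \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches"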
Supported Spark properties
Serverless for Apache Spark supports most Spark properties, but it
does not support YARN-related and shuffle-related Spark properties, such as
spark.master=yarn and spark.shuffle.service.enabled. If Spark application
code sets a YARN or shuffle property, the application will fail.
Runtime environment properties
Serverless for Apache Spark supports the following custom Spark properties for configuring the runtime environment:
| Property | Description | 
|---|---|
| spark.dataproc.driverEnv.ENVIRONMENT_VARIABLE_NAME | Add ENVIRONMENT_VARIABLE_NAME to the driver process. You can specify multiple environment variables. | 
| spark.executorEnv.ENVIRONMENT_VARIABLE_NAME | Add ENVIRONMENT_VARIABLE_NAME to the executor process. You can specify multiple environment variables. | 
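For example, environment variables can be passed with the gcloud CLI at batch submission. In this sketch, APP_MODE and its value are placeholders:
gcloud dataproc batches submit spark \
    --region=REGION \
    --properties=spark.dataproc.driverEnv.APP_MODE=batch,spark.executorEnv.APP_MODE=batch \
    other args ...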
Tier property
| Property | Description | Default | 
|---|---|---|
| dataproc.tier | The tier on which a batch workload runs, either standard or premium (see Google Cloud Serverless for Apache Spark tiers). Interactive sessions always run at the premium dataproc.tier. | standard | 
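For example, a batch workload can be placed on the premium tier at submission time. This is a minimal sketch with placeholder arguments:
gcloud dataproc batches submit spark \
    --region=REGION \
    --properties=dataproc.tier=premium \
    other args ...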
Engine and runtime properties
| Property | Description | Default | 
|---|---|---|
| spark.dataproc.engine | The engine to use to run the batch workload or the interactive session: either lightningEngine (see Lightning Engine) or the default engine. | | 
| spark.dataproc.lightningEngine.runtime | The runtime to use when Lightning Engine is selected for a batch workload or interactive session: default or native (Native Query Execution). | default | 
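For example, Lightning Engine with Native Query Execution can be requested at submission. Whether these values apply depends on your tier and runtime version; the other arguments are placeholders:
gcloud dataproc batches submit spark \
    --region=REGION \
    --properties=spark.dataproc.engine=lightningEngine,spark.dataproc.lightningEngine.runtime=native \
    other args ...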
Resource allocation properties
Serverless for Apache Spark supports the following Spark properties for configuring resource allocation:
| Property | Description | Default | Examples | 
|---|---|---|---|
| spark.driver.cores | The number of cores (vCPUs) to allocate to the Spark driver. Valid values: 4, 8, 16. | 4 | |
| spark.driver.memory | The amount of memory to allocate to the Spark driver process, specified in JVM memory string format with a size unit suffix ("m", "g", or "t"). Total driver memory per driver core, including driver memory overhead, must fall within the allowed per-core range. | | 512m, 2g |
| spark.driver.memoryOverhead | The amount of additional JVM memory to allocate to the Spark driver process, specified in JVM memory string format with a size unit suffix ("m", "g", or "t"). This is non-heap memory associated with JVM overheads, internal strings, and other native overheads, and includes memory used by other driver processes, such as PySpark driver processes, and memory used by other non-driver processes running in the container. The maximum memory size of the container in which the driver runs is determined by the sum of spark.driver.memory and spark.driver.memoryOverhead. Total driver memory per driver core, including driver memory overhead, must fall within the allowed per-core range. | 10% of driver memory, except for PySpark batch workloads, which default to 40% of driver memory | 512m, 2g | 
| spark.dataproc.driver.compute.tier | The compute tier to use on the driver. The Premium compute tier offers higher per-core performance, but it is billed at a higher rate. | standard | standard, premium | 
| spark.dataproc.driver.disk.size | The amount of disk space allocated to the driver, specified with a size unit suffix ("k", "m", "g", or "t"). Must be at least 250GiB. If the Premium disk tier is selected on the driver, valid sizes are 375g, 750g, 1500g, 3000g, 6000g, or 9000g. If the Premium disk tier and 16 driver cores are selected, the minimum disk size is 750g. | 100GiB per core | 1024g, 2t | 
| spark.dataproc.driver.disk.tier | The disk tier to use for local and shuffle storage on the driver. The Premium disk tier offers better performance in IOPS and throughput, but it is billed at a higher rate. If the Premium disk tier is selected on the driver, the Premium compute tier must also be selected using spark.dataproc.driver.compute.tier=premium, and the amount of disk space must be specified using spark.dataproc.driver.disk.size. If the Premium disk tier is selected, the driver allocates an additional 50GiB of disk space for system storage, which is not usable by user applications. | standard | standard, premium | 
| spark.executor.cores | The number of cores (vCPUs) to allocate to each Spark executor. Valid values: 4, 8, 16. | 4 | |
| spark.executor.memory | The amount of memory to allocate to each Spark executor process, specified in JVM memory string format with a size unit suffix ("m", "g", or "t"). Total executor memory per executor core, including executor memory overhead, must fall within the allowed per-core range. | | 512m, 2g |
| spark.executor.memoryOverhead | The amount of additional JVM memory to allocate to the Spark executor process, specified in JVM memory string format with a size unit suffix ("m", "g", or "t"). This is non-heap memory used for JVM overheads, internal strings, and other native overheads, and includes PySpark executor memory and memory used by other non-executor processes running in the container. The maximum memory size of the container in which the executor runs is determined by the sum of spark.executor.memory and spark.executor.memoryOverhead. Total executor memory per executor core, including executor memory overhead, must fall within the allowed per-core range. | 10% of executor memory, except for PySpark batch workloads, which default to 40% of executor memory | 512m, 2g | 
| spark.dataproc.executor.compute.tier | The compute tier to use on the executors. The Premium compute tier offers higher per-core performance, but it is billed at a higher rate. | standard | standard, premium | 
| spark.dataproc.executor.disk.size | The amount of disk space allocated to each executor, specified with a size unit suffix ("k", "m", "g", or "t"). Executor disk space may be used for shuffle data and to stage dependencies. Must be at least 250GiB. If the Premium disk tier is selected on the executor, valid sizes are 375g, 750g, 1500g, 3000g, 6000g, or 9000g. If the Premium disk tier and 16 executor cores are selected, the minimum disk size is 750g. | 100GiB per core | 1024g, 2t | 
| spark.dataproc.executor.disk.tier | The disk tier to use for local and shuffle storage on executors. The Premium disk tier offers better performance in IOPS and throughput, but it is billed at a higher rate. If the Premium disk tier is selected on the executor, the Premium compute tier must also be selected using spark.dataproc.executor.compute.tier=premium, and the amount of disk space must be specified using spark.dataproc.executor.disk.size. If the Premium disk tier is selected, each executor is allocated an additional 50GiB of disk space for system storage, which is not usable by user applications. | standard | standard, premium | 
| spark.executor.instances | The initial number of executors to allocate. After a batch workload starts, autoscaling may change the number of active executors. Must be at least 2 and at most 2000. | | |
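For example, several of the preceding properties can be combined in one submission. The values below are illustrative only and must respect the per-core limits and minimum disk sizes noted in the table:
gcloud dataproc batches submit spark \
    --region=REGION \
    --properties=spark.driver.cores=8,spark.driver.memory=16g,spark.executor.cores=8,spark.executor.memory=16g,spark.executor.instances=4,spark.dataproc.executor.disk.size=500g \
    other args ...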
Autoscaling properties
See Spark dynamic allocation properties for a list of Spark properties you can use to configure Serverless for Apache Spark autoscaling.
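For example, the standard Spark dynamic allocation bounds are set the same way as other properties. This sketch assumes spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors are among the properties listed on that page, and the values are illustrative:
gcloud dataproc batches submit spark \
    --region=REGION \
    --properties=spark.dynamicAllocation.minExecutors=2,spark.dynamicAllocation.maxExecutors=100 \
    other args ...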
Logging properties
| Property | Description | Default | Examples | 
|---|---|---|---|
| spark.log.level | When set, overrides any user-defined log settings with the effect of a call to SparkContext.setLogLevel() at Spark startup. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN. | | INFO, DEBUG |
| spark.executor.syncLogLevel.enabled | When set to true, the log level applied through the SparkContext.setLogLevel() method is propagated to all executors. | false | true, false | 
| spark.log.level.PackageName | When set, overrides any user-defined log settings with the effect of a call to SparkContext.setLogLevel(PackageName, level) at Spark startup. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN. | | spark.log.level.org.apache.spark=error | 
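For example, a workload-wide log level and a package-level override can be set together. The DEBUG and ERROR values here are illustrative:
gcloud dataproc batches submit spark \
    --region=REGION \
    --properties=spark.log.level=DEBUG,spark.log.level.org.apache.spark=ERROR,spark.executor.syncLogLevel.enabled=true \
    other args ...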
Scheduling properties
| Property | Description | Default | Examples | 
|---|---|---|---|
| spark.scheduler.excludeShuffleSkewExecutors | Exclude shuffle map skewed executors when scheduling, which can reduce long shuffle fetch wait times caused by shuffle write skew. | false | true | 
| spark.scheduler.shuffleSkew.minFinishedTasks | Minimum number of finished shuffle map tasks on an executor to treat as skew. | 10 | 100 | 
| spark.scheduler.shuffleSkew.maxExecutorsNumber | Maximum number of executors to treat as skew. Skewed executors are excluded from the current scheduling round. | 5 | 10 | 
| spark.scheduler.shuffleSkew.maxExecutorsRatio | Maximum ratio of total executors to treat as skew. Skewed executors are excluded from scheduling. | 0.05 | 0.1 | 
| spark.scheduler.shuffleSkew.ratio | A multiple of the average finished shuffle map tasks on an executor to treat as skew. | 1.5 | 2.0 | 
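For example, shuffle-skew executor exclusion can be enabled together with a stricter skew ratio. The 2.0 value is illustrative:
gcloud dataproc batches submit spark \
    --region=REGION \
    --properties=spark.scheduler.excludeShuffleSkewExecutors=true,spark.scheduler.shuffleSkew.ratio=2.0 \
    other args ...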
Other properties
| Property | Description | 
|---|---|
| dataproc.diagnostics.enabled | Enable this property to run diagnostics on a batch workload failure or cancellation. If diagnostics are enabled, your batch workload continues to use compute resources after the workload is complete until diagnostics are finished. A URI pointing to the location of the diagnostics tar file is listed in the Batch.RuntimeInfo.diagnosticOutputUri API field. | 
| dataproc.gcsConnector.version | Use this property to upgrade to a Cloud Storage connector version that is different from the version installed with your batch workload's runtime version. | 
| dataproc.sparkBqConnector.version | Use this property to upgrade to a Spark BigQuery connector version that is different from the version installed with your batch workload's runtime version (see Use the BigQuery connector with Serverless for Apache Spark). | 
| dataproc.profiling.enabled | Set this property to true to enable profiling for the Serverless for Apache Spark workload. | 
| dataproc.profiling.name | Use this property to set the name used to create a profile on the Profiler service. | 
| spark.jars | Use this property to set the comma-separated list of jars to include on the driver and executor classpaths. | 
| spark.archives | Use this property to set the comma-separated list of archives to be extracted into the working directory of each executor. .jar, .tar.gz, .tgz, and .zip are supported. For serverless interactive sessions, add this property when creating an interactive session or template. |
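For example, diagnostics and profiling can be enabled on a single batch. The my-batch-profile value is a placeholder profile name:
gcloud dataproc batches submit spark \
    --region=REGION \
    --properties=dataproc.diagnostics.enabled=true,dataproc.profiling.enabled=true,dataproc.profiling.name=my-batch-profile \
    other args ...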