This document describes when and how to enable Native Query Execution to accelerate Serverless for Apache Spark batch workloads and interactive sessions.
Native Query Execution requirements
Serverless for Apache Spark Native Query Execution is available only with batch workloads and interactive sessions using the 1.2.26+, 2.2.26+, or a later Spark runtime version, running in the Serverless for Apache Spark premium pricing tier.
Premium tier pricing is charged at a higher cost than standard tier pricing,
but there is no additional charge for Native Query Execution.
For pricing information, see Serverless for Apache Spark pricing.
Native Query Execution properties
This section lists required and optional Spark resource allocation properties you can use to enable and customize Native Query Execution for your batch workload or interactive session.
Required property settings
- spark.dataproc.runtimeEngine=native: The workload runtime engine must be set to native to override the default spark runtime engine.
- spark.dataproc.driver.compute.tier=premium and spark.dataproc.executor.compute.tier=premium: These pricing tier properties must be set to the premium pricing tier.
Optional resource allocation properties
- spark.dataproc.driver.disk.tier, spark.dataproc.driver.disk.size, spark.dataproc.executor.disk.tier, and spark.dataproc.executor.disk.size: Use these properties to set and configure the premium disk tier and size for the Spark driver and executor processes. Premium disk tiers use columnar instead of row-based shuffle to provide better performance. For better shuffle I/O throughput, use the driver and executor premium disk tiers with a sufficiently large disk size to accommodate shuffle files.
- spark.driver.memory, spark.driver.memoryOverhead, spark.executor.memory, spark.executor.memoryOverhead, and spark.memory.offHeap.size: Use these properties to tune the memory provided to Spark driver and executor processes.

  You can configure memory in either of the following ways:

  - Option 1: Configure off-heap memory only (spark.memory.offHeap.size) with a specified value. Native Query Execution uses the specified value as off-heap memory, and allocates an additional 1/7th of the off-heap memory value as on-heap memory (spark.executor.memory).
  - Option 2: Configure both on-heap memory (spark.executor.memory) and off-heap memory (spark.memory.offHeap.size). The amount you allocate to off-heap memory must be greater than the amount you allocate to on-heap memory.

  If you don't configure both off-heap memory (spark.memory.offHeap.size) and on-heap memory (spark.executor.memory), the Native Query Execution engine divides a default 4g amount of memory in a 6:1 ratio between off-heap and on-heap memory.

  Recommendation: Allocate off-heap to on-heap memory in a 6:1 ratio.

  Examples:
| Memory settings without Native Query Execution | Recommended memory settings with Native Query Execution | |
|---|---|---|
| spark.executor.memory | spark.memory.offHeap.size | spark.executor.memory |
| 7g | 6g | 1g |
| 14g | 12g | 2g |
| 28g | 24g | 4g |
| 56g | 48g | 8g |
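For illustration, the following gcloud command is a minimal sketch that combines the required Native Query Execution properties with premium disk tiers and the 24g off-heap / 4g on-heap memory row from the table above. The memory values are only one possible choice, and OTHER_FLAGS_AS_NEEDED stands in for the remaining workload flags (see Submit a Spark batch workload).

```
# Sketch: enable Native Query Execution with premium disk tiers and a 6:1
# off-heap to on-heap memory split (24g/4g, taken from the table above).
gcloud dataproc batches submit spark \
    --project=PROJECT_ID \
    --region=REGION \
    --properties=spark.dataproc.runtimeEngine=native,spark.dataproc.driver.compute.tier=premium,spark.dataproc.executor.compute.tier=premium,spark.dataproc.driver.disk.tier=premium,spark.dataproc.executor.disk.tier=premium,spark.memory.offHeap.size=24g,spark.executor.memory=4g \
    OTHER_FLAGS_AS_NEEDED
```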
Run the qualification tool
To identify batch workloads that can achieve faster runtimes with Native Query Execution (NQE), you can use the qualification tool. The tool analyzes Spark event logs to estimate potential runtime savings and identify any operations that are not supported by the NQE engine.
Google Cloud provides two methods for running the qualification analysis: qualification job and qualification script. The recommended approach for most users is the qualification job, which automates the discovery and analysis of batch workloads. The alternative qualification script is available for the specific use case of analyzing a known event log file. Choose the method that best fits your use case:
Qualification Job (Recommended): This is the primary and recommended method. It is a PySpark job that automatically discovers and analyzes recent batch workloads across one or more Google Cloud projects and regions. Use this method when you want to perform a broad analysis without needing to manually locate individual event log files. This approach is ideal for large-scale evaluation of NQE suitability.
Qualification Script (Alternative): This is an alternative method for advanced or specific use cases. It is a shell script that analyzes a single Spark event log file or all event logs within a specific Cloud Storage directory. Use this method if you have the Cloud Storage path to the event logs you want to analyze.
Qualification job
The qualification job simplifies large-scale analysis by programmatically scanning for Serverless for Apache Spark batch workloads and submitting a distributed analysis job. The tool evaluates jobs across your organization, eliminating the need to manually find and specify event log paths.
Grant IAM roles
For the qualification job to access batch workload metadata and read Spark event logs in Cloud Logging, the service account that runs the workload must have the following IAM roles granted in all projects to be analyzed:
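As an illustration, the following command is a minimal sketch of granting a single role to that service account in one project to be analyzed; PROJECT_ID, SERVICE_ACCOUNT_EMAIL, and ROLE are placeholders, and you repeat the command for each required role and project.

```
# Sketch: grant one IAM role to the workload's service account in a project
# to be analyzed. Repeat for each required role and each project in scope.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="ROLE"
```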
Submit the qualification job
You submit the qualification job using the gcloud CLI tool. The job includes a PySpark script and a JAR file that are hosted in a public Cloud Storage bucket.
You can run the job in either of the following execution environments:
As a Serverless for Apache Spark batch workload. This approach provides simple, standalone job execution.
As a job that runs on a Dataproc on Compute Engine cluster. This approach can be useful for integrating the job into a workflow.
Job arguments
| Argument | Description | Required? | Default Value |
|---|---|---|---|
| --project-ids | A single project ID or a comma-separated list of Google Cloud project IDs to scan for batch workloads. | No | The project where the qualification job is running. |
| --regions | A single region or a comma-separated list of regions to scan within the specified projects. | No | All regions within the specified projects. |
| --start-time | The start date for filtering batches. Only batches created on or after this date (format: YYYY-MM-DD) will be analyzed. | No | No start date filter is applied. |
| --end-time | The end date for filtering batches. Only batches created on or before this date (format: YYYY-MM-DD) will be analyzed. | No | No end date filter is applied. |
| --limit | The maximum number of batches to analyze per region. The most recent batches are analyzed first. | No | All batches that match the other filter criteria are analyzed. |
| --output-gcs-path | The Cloud Storage path (for example, gs://your-bucket/output/) where the result files will be written. | Yes | None. |
| --input-file | The Cloud Storage path to a text file for bulk analysis. If provided, this argument overrides all other scope-defining arguments (--project-ids, --regions, --start-time, --end-time, --limit). | No | None. |
Qualification job examples
A Serverless for Apache Spark batch job to perform simple, ad hoc analysis. Job arguments are listed after the -- separator.

```
gcloud dataproc batches submit pyspark gs://qualification-tool/performance-boost-qualification.py \
    --project=PROJECT_ID \
    --region=REGION \
    --jars=gs://qualification-tool/dataproc-perfboost-qualification-1.2.jar \
    -- \
    --project-ids=COMMA_SEPARATED_PROJECT_IDS \
    --regions=COMMA_SEPARATED_REGIONS \
    --limit=MAX_BATCHES \
    --output-gcs-path=gs://BUCKET
```
A Serverless for Apache Spark batch job to analyze up to 50 of the most recent batches found in sample_project in the us-central1 region. The results are written to a bucket in Cloud Storage. Job arguments are listed after the -- separator.

```
gcloud dataproc batches submit pyspark gs://qualification-tool/performance-boost-qualification.py \
    --project=PROJECT_ID \
    --region=US-CENTRAL1 \
    --jars=gs://qualification-tool/dataproc-perfboost-qualification-1.2.jar \
    -- \
    --project-ids=PROJECT_ID \
    --regions=US-CENTRAL1 \
    --limit=50 \
    --output-gcs-path=gs://BUCKET/
```
A Dataproc on Compute Engine job submitted to a Dataproc cluster for bulk analysis in a large-scale, repeatable, or automated analysis workflow. Job arguments are placed in an INPUT_FILE that is uploaded to a BUCKET in Cloud Storage. This method is ideal for scanning different date ranges or batch limits across different projects and regions in a single run.
```
gcloud dataproc jobs submit pyspark gs://qualification-tool/performance-boost-qualification.py \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --jars=gs://qualification-tool/dataproc-perfboost-qualification-1.2.jar \
    -- \
    --input-file=gs://INPUT_FILE \
    --output-gcs-path=gs://BUCKET
```
Notes:
INPUT_FILE: Each line in the file represents a distinct analysis request and uses a format of single-letter flags followed by their values, such as -p PROJECT-ID -r REGION -s START_DATE -e END_DATE -l LIMITS.

Example input file content:

```
-p project1 -r us-central1 -s 2024-12-01 -e 2024-12-15 -l 100
-p project2 -r europe-west1 -s 2024-11-15 -l 50
```
These arguments direct the tool to analyze the following two scopes:
- Up to 100 batches in project1 in the us-central1 region created between December 1, 2024 and December 15, 2024.
- Up to 50 batches in project2 in the europe-west1 region created on or after November 15, 2024.
Qualification script
Use this method if you have the direct Cloud Storage path to a specific
Spark event log that you want to analyze. This approach requires you to download
and run a shell script, run_qualification_tool.sh
, on a local machine or a
Compute Engine VM that is configured with access to the event log file
in Cloud Storage.
Perform the following steps to run the script against Serverless for Apache Spark batch workload event files.
1. Copy the run_qualification_tool.sh script into a local directory that contains the Spark event files to analyze.

2. Run the qualification script to analyze one event file or a set of event files contained in the script directory.

```
./run_qualification_tool.sh -f EVENT_FILE_PATH/EVENT_FILE_NAME \
    -o CUSTOM_OUTPUT_DIRECTORY_PATH \
    -k SERVICE_ACCOUNT_KEY \
    -x MEMORY_ALLOCATEDg \
    -t PARALLEL_THREADS_TO_RUN
```
Flags and values:
- -f (required): See Spark event file locations to locate Spark workload event files.
  - EVENT_FILE_PATH (required unless EVENT_FILE_NAME is specified): Path of the event file to analyze. If not provided, the event file path is assumed to be the current directory.
  - EVENT_FILE_NAME (required unless EVENT_FILE_PATH is specified): Name of the event file to analyze. If not provided, the event files found recursively in the EVENT_FILE_PATH are analyzed.
- -o (optional): If not provided, the tool creates or uses an existing output directory under the current directory to place output files.
  - CUSTOM_OUTPUT_DIRECTORY_PATH: Output directory path for output files.
- -k (optional):
  - SERVICE_ACCOUNT_KEY: The service account key in JSON format, if needed to access the EVENT_FILE_PATH.
- -x (optional):
  - MEMORY_ALLOCATED: Memory in gigabytes to allocate to the tool. By default, the tool uses 80% of the free memory available in the system and all available machine cores.
- -t (optional):
  - PARALLEL_THREADS_TO_RUN: The number of parallel threads for the tool to execute. By default, the tool executes all cores.
Example command usage:
```
./run_qualification_tool.sh -f gs://dataproc-temp-us-east1-9779/spark-job-history \
    -o perfboost-output -k /keys/event-file-key -x 34g -t 5
```

In this example, the qualification tool traverses the gs://dataproc-temp-us-east1-9779/spark-job-history directory and analyzes the Spark event files contained in this directory and its subdirectories. Access to the directory is provided using the /keys/event-file-key service account key. The tool uses 34 GB of memory for execution and runs 5 parallel threads.

Spark event file locations
Perform any of the following steps to find the Spark event files for Serverless for Apache Spark batch workloads:
In Cloud Storage, find the spark.eventLog.dir for the workload, then download it.
  - If you can't find the spark.eventLog.dir, set the spark.eventLog.dir to a Cloud Storage location, and then rerun the workload and download the spark.eventLog.dir (see the sketch after this list).
If you have configured Spark History Server for the batch job:
- Go to the Spark History Server, then select the workload.
- Click Download in the Event Log column.
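If you need to set spark.eventLog.dir and rerun the workload, the following commands are a minimal sketch; the gs://BUCKET/event-logs/ path is a placeholder, spark.eventLog.enabled=true is included on the assumption that event logging isn't already enabled for the workload, and OTHER_FLAGS_AS_NEEDED stands in for the remaining workload flags.

```
# Sketch: rerun the batch workload with Spark event logs written to Cloud Storage.
gcloud dataproc batches submit spark \
    --project=PROJECT_ID \
    --region=REGION \
    --properties=spark.eventLog.enabled=true,spark.eventLog.dir=gs://BUCKET/event-logs/ \
    OTHER_FLAGS_AS_NEEDED

# Download the event logs for analysis with the qualification script.
gcloud storage cp -r gs://BUCKET/event-logs/ .
```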
Qualification tool output files
Once the qualification job or script analysis is complete, the qualification tool
places the following output files in a
perfboost-output
directory in the current directory:
- AppsRecommendedForBoost.tsv: A tab-separated list of applications recommended for use with Native Query Execution.
- UnsupportedOperators.tsv: A tab-separated list of operators used in workload applications that are not supported by Native Query Execution.
AppsRecommendedForBoost.tsv output file
The following table shows the contents of a sample AppsRecommendedForBoost.tsv output file. It contains a row for each analyzed application.
Sample AppsRecommendedForBoost.tsv output file:

| applicationId | applicationName | rddPercentage | unsupportedSqlPercentage | totalTaskTime | supportedTaskTime | supportedSqlPercentage | recommendedForBoost | expectedRuntimeReduction |
|---|---|---|---|---|---|---|---|---|
| app-2024081/batches/083f6196248043938-000 | projects/example.com:dev/locations/us-central1 6b4d6cae140f883c011c8e | 0.00% | 0.00% | 548924253 | 548924253 | 100.00% | TRUE | 30.00% |
| app-2024081/batches/60381cab738021457-000 | projects/example.com:dev/locations/us-central1 474113a1462b426bfb3aeb | 0.00% | 0.00% | 514401703 | 514401703 | 100.00% | TRUE | 30.00% |
Column descriptions:
- applicationId: The ApplicationID of the Spark application. Use this to identify the corresponding batch workload.
- applicationName: The name of the Spark application.
- rddPercentage: The percentage of RDD operations in the application. RDD operations are not supported by Native Query Execution.
- unsupportedSqlPercentage: The percentage of SQL operations not supported by Native Query Execution.
- totalTaskTime: Cumulative task time of all tasks executed during the application run.
- supportedTaskTime: The total task time supported by Native Query Execution.

The following columns provide important information to help you determine if Native Query Execution can benefit your batch workload:

- supportedSqlPercentage: The percentage of SQL operations supported by Native Query Execution. The higher the percentage, the greater the runtime reduction that can be achieved by running the application with Native Query Execution.
- recommendedForBoost: If TRUE, running the application with Native Query Execution is recommended. If recommendedForBoost is FALSE, don't use Native Query Execution on the batch workload.
- expectedRuntimeReduction: The expected percentage reduction in application runtime when you run the application with Native Query Execution.
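If you want a quick summary of the recommended applications, the following command is a minimal sketch that uses standard shell tools and assumes the column order shown in the sample output above (applicationId in column 1, recommendedForBoost in column 8, expectedRuntimeReduction in column 9).

```
# Sketch: print applicationId and expectedRuntimeReduction for applications
# recommended for Native Query Execution, skipping the header row.
awk -F'\t' 'NR > 1 && $8 == "TRUE" { print $1, $9 }' \
    perfboost-output/AppsRecommendedForBoost.tsv
```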
UnsupportedOperators.tsv output file
The UnsupportedOperators.tsv
output file contains a list of operators used in
workload applications that are not supported by Native Query Execution.
Each row in the output file lists an unsupported operator.
Column descriptions:
- unsupportedOperator: The name of the operator that is not supported by Native Query Execution.
- cumulativeCpuMs: The number of CPU milliseconds consumed during the execution of the operator. This value reflects the relative importance of the operator in the application.
- count: The number of times the operator is used in the application.
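Similarly, the following command is a minimal sketch that ranks unsupported operators by cumulative CPU time, assuming the column order listed above (unsupportedOperator, cumulativeCpuMs, count).

```
# Sketch: list unsupported operators ordered by cumulativeCpuMs, descending,
# skipping the header row.
awk -F'\t' 'NR > 1 { print $2 "\t" $1 "\t" $3 }' \
    perfboost-output/UnsupportedOperators.tsv | sort -rn
```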
Use Native Query Execution
You can use Native Query Execution with your application by setting Native Query Execution properties when you create the batch workload, interactive session, or session template that runs your application.
Use Native Query Execution with batch workloads
You can use the Google Cloud console, Google Cloud CLI, or Dataproc API to enable Native Query Execution on a batch workload.
Console
Use the Google Cloud console to enable Native Query Execution on a batch workload.
In the Google Cloud console:
- Go to Dataproc Batches.
- Click Create to open the Create batch page.
Select and fill in the following fields to configure the batch for Native Query Execution:
- Container:
  - Runtime version: Select 1.2, 2.2, or a higher major.minor version number. See Supported Serverless for Apache Spark runtime versions.
- Executor and Driver Tier Configuration:
  - Select Premium for all tiers (Driver Compute Tier, Executor Compute Tier).
- Properties: Enter Key (property name) and Value pairs to specify Native Query Execution properties:

  | Key | Value |
  |---|---|
  | spark.dataproc.runtimeEngine | native |
Fill in, select, or confirm other batch workload settings. See Submit a Spark batch workload.
Click Submit to run the Spark batch workload.
gcloud
Set the following gcloud CLI gcloud dataproc batches submit spark command flags to configure a batch workload for Native Query Execution:

```
gcloud dataproc batches submit spark \
    --project=PROJECT_ID \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    --properties=spark.dataproc.runtimeEngine=native,spark.dataproc.driver.compute.tier=premium,spark.dataproc.executor.compute.tier=premium \
    OTHER_FLAGS_AS_NEEDED
```
Notes:
- PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
- REGION: An available Compute Engine region to run the workload.
- OTHER_FLAGS_AS_NEEDED: See Submit a Spark batch workload.
API
Set the following Dataproc API fields to configure a batch workload for Native Query Execution:
RuntimeConfig.properties: Set the following Native Query Execution properties:
"spark.dataproc.runtimeEngine":"native" "spark.dataproc.driver.compute.tier":"premium" "spark.dataproc.executor.compute".tier:"premium"
Notes:
- See Submit a Spark batch workload to set other batch workload API fields.
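For illustration, the following curl command is a minimal sketch of a batches.create request that sets these properties. It reuses the SparkPi example jar from the gcloud example, and PROJECT_ID and REGION are placeholders; adjust the request body for your own workload.

```
# Sketch: create a batch workload with Native Query Execution enabled.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
          "sparkBatch": {
            "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "mainClass": "org.apache.spark.examples.SparkPi"
          },
          "runtimeConfig": {
            "properties": {
              "spark.dataproc.runtimeEngine": "native",
              "spark.dataproc.driver.compute.tier": "premium",
              "spark.dataproc.executor.compute.tier": "premium"
            }
          }
        }' \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches"
```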
When to use Native Query Execution
Use Native Query Execution in the following scenarios:
Spark Dataframe APIs, Spark Dataset APIs, and Spark SQL queries that read data from Parquet and ORC files. The output file format doesn't affect Native Query Execution performance.
Workloads recommended by the Native Query Execution qualification tool.
When not to use Native Query Execution
Don't use Native Query Execution with inputs of the following data types:
- Byte: ORC and Parquet
- Timestamp: ORC
- Struct, Array, Map: Parquet
Limitations
Enabling Native Query Execution in the following scenarios can cause exceptions, Spark incompatibilities, or workload fallback to the default Spark engine.
Fallbacks
Using Native Query Execution in the following cases can result in workload fallback to the Spark execution engine, which can cause regression or failure.
- ANSI: If ANSI mode is enabled, execution falls back to Spark.
- Case-sensitive mode: Native Query Execution supports the Spark default case-insensitive mode only. If case-sensitive mode is enabled, incorrect results can occur.
- Partitioned table scan: Native Query Execution supports partitioned table scans only when the path contains the partition information; otherwise, the workload falls back to the Spark execution engine.
Incompatible behavior
Incompatible behavior or incorrect results can occur when you use Native Query Execution in the following cases:
- JSON functions: Native Query Execution supports strings surrounded by double quotes, not single quotes. Incorrect results occur with single quotes. Using "*" in the path with the get_json_object function returns NULL.
- Parquet read configuration:
  - Native Query Execution treats spark.files.ignoreCorruptFiles as set to the default false value, even when set to true.
  - Native Query Execution ignores spark.sql.parquet.datetimeRebaseModeInRead, and returns only the Parquet file contents. Differences between the legacy hybrid (Julian Gregorian) calendar and the Proleptic Gregorian calendar are not considered. Spark results can differ.
- NaN: Not supported. Unexpected results can occur, for example, when using NaN in a numeric comparison.
- Spark columnar reading: A fatal error can occur because the Spark columnar vector is incompatible with Native Query Execution.
- Spill: When shuffle partitions are set to a large number, the spill-to-disk feature can trigger an OutOfMemoryException. If this occurs, reducing the number of partitions can eliminate this exception, as shown in the sketch below.
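For example, the following gcloud command is a minimal sketch of resubmitting the workload with a lower shuffle partition count; the value 200 is an illustrative assumption, not a recommendation, and OTHER_FLAGS_AS_NEEDED stands in for the remaining workload flags.

```
# Sketch: lower the shuffle partition count to reduce spill pressure.
# The value 200 is illustrative; tune it for your workload.
gcloud dataproc batches submit spark \
    --project=PROJECT_ID \
    --region=REGION \
    --properties=spark.dataproc.runtimeEngine=native,spark.dataproc.driver.compute.tier=premium,spark.dataproc.executor.compute.tier=premium,spark.sql.shuffle.partitions=200 \
    OTHER_FLAGS_AS_NEEDED
```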