Profile Dataproc Serverless for Spark resource usage

Cloud Profiler continuously gathers and reports application CPU usage and memory allocation information. You can enable profiling when you submit a batch workload or create a session workload by setting the profiling properties described below. Dataproc Serverless for Spark appends the related JVM options to the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions configurations used for the workload.

Option: dataproc.profiling.enabled
Description: Enables profiling of the workload
Value: true or false
Default: false

Option: dataproc.profiling.name
Description: Profile name on the Profiler service
Value: PROFILE_NAME
Default: spark-WORKLOAD_TYPE-WORKLOAD_ID, where:
  • WORKLOAD_TYPE is set to batch or session
  • WORKLOAD_ID is set to batchId or sessionId
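
If you submit batch workloads programmatically, you can set the same properties through the Dataproc client library. The following Python sketch assumes the google-cloud-dataproc package; the project, region, Cloud Storage path, and batch ID values are placeholders, not values from this guide.

from google.cloud import dataproc_v1

project_id = "PROJECT_ID"  # placeholder
region = "REGION"          # placeholder, for example us-central1

# The batch controller uses a regional endpoint.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://BUCKET/workload.py"  # placeholder
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        properties={
            # Profiling properties described above.
            "dataproc.profiling.enabled": "true",
            "dataproc.profiling.name": "PROFILE_NAME",
        }
    ),
)

operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="BATCH_ID",  # placeholder
)
operation.result()  # waits for the batch workload to finish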

Notes:

  • Dataproc Serverless for Spark sets the profiler version to either the batch UUID or the session UUID.
  • Profiler supports the following Spark workload types: Spark, PySpark, SparkSql, and SparkR.
  • A workload must run for more than three minutes to allow Profiler to collect and upload data to a project.
  • You can override the profiling options submitted with a workload by constructing a SparkConf and setting extraJavaOptions in your application code (see the sketch after this list). Setting extraJavaOptions properties at workload submission time doesn't override profiling options submitted with the workload.
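
The following PySpark sketch illustrates the override pattern described in the last note above. The JVM options shown are placeholders for illustration only; they are not the profiling options that Dataproc Serverless for Spark appends.

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# Options set here in application code take precedence over extraJavaOptions
# values that were set when the workload was submitted.
conf.set("spark.driver.extraJavaOptions", "-Dplaceholder.option=driver")      # placeholder option
conf.set("spark.executor.extraJavaOptions", "-Dplaceholder.option=executor")  # placeholder option

spark = SparkSession.builder.config(conf=conf).getOrCreate()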

For an example of profiler options used with a batch submission, see the PySpark batch workload example.

Enable profiling

Complete the following steps to enable profiling on a workload:

  1. Enable the Profiler.
  2. If you are using a custom VM service account, grant the Cloud Profiler Agent role (roles/cloudprofiler.agent) to the custom VM service account. This role contains the required Profiler permissions. The gcloud sketch after these steps shows steps 1 and 2.
  3. Set profiling properties when you submit a batch workload or create a session template.
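
The following is a minimal gcloud CLI sketch of steps 1 and 2; PROJECT_ID and SERVICE_ACCOUNT_EMAIL are placeholders.

# Step 1: Enable the Cloud Profiler API.
gcloud services enable cloudprofiler.googleapis.com --project=PROJECT_ID

# Step 2: Grant the Cloud Profiler Agent role to the custom VM service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/cloudprofiler.agent"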

PySpark batch workload example

The following example uses the gcloud CLI to submit a PySpark batch workload with profiling enabled.

gcloud dataproc batches submit pyspark PYTHON_WORKLOAD_FILE \
    --region=REGION \
    --properties=dataproc.profiling.enabled=true,dataproc.profiling.name=PROFILE_NAME \
    --  other args

Two profiles are created:

  • PROFILE_NAME-driver to profile Spark driver tasks
  • PROFILE_NAME-executor to profile Spark executor tasks

View profiles

You can view profiles on the Profiler page in the Google Cloud console.
