Compare to Google Cloud Serverless for Apache Spark

For Google Cloud customers who rely on Apache Spark to run their data processing and analytics workloads, a key decision is choosing between Dataproc on Compute Engine (referred to as "Dataproc" in this document) and Serverless for Apache Spark. Both services offer a managed, highly scalable, production-ready, and secure Spark environment that is compatible with open source Spark and common open data formats, but the two platforms differ fundamentally in how the underlying infrastructure is managed and billed.

This document compares Google Cloud Serverless for Apache Spark to Dataproc and lists their features and capabilities to help you choose the best service for your Spark workloads.

Compare Serverless for Apache Spark to Dataproc

If you want to provision and manage infrastructure, and then run workloads on Spark and other open source processing frameworks, use Dataproc on Compute Engine. The following table lists key differences between Dataproc on Compute Engine and Serverless for Apache Spark.

| Capability | Serverless for Apache Spark | Dataproc on Compute Engine |
| --- | --- | --- |
| Processing frameworks | Batch workloads and interactive sessions: Spark | Spark, plus other open source frameworks such as Hive, Flink, Trino, and Kafka |
| Serverless | Yes | No |
| Startup time | 50s | 120s |
| Infrastructure control | No | Yes |
| Resource management | Serverless | YARN |
| GPU support | Yes | Yes |
| Interactive sessions | Yes | No |
| Custom containers | Yes | No |
| VM access (SSH) | No | Yes |
| Java versions | Java 17, 21 | Java 17 and previous versions |

Decide on the best Spark service

This section outlines the core strengths and primary use cases for each service to help you select the best service for your Spark workloads.

Overview

Dataproc and Serverless for Apache Spark differ in the degree of control, infrastructure management, and billing model that each service offers.

  • Dataproc-managed Spark: Dataproc offers Spark-clusters-as-a-service, running managed Spark on your Compute Engine infrastructure. You pay for cluster uptime.
  • Serverless for Apache Spark: Serverless for Apache Spark offers Spark-jobs-as-a-service, running Spark on fully managed Google Cloud infrastructure. You pay for job runtime (see the cost sketch after this list).

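To make the billing difference concrete, the following minimal Python sketch contrasts uptime-based and runtime-based charges. All rates, durations, and DCU figures are hypothetical placeholders chosen for illustration; they are not published Google Cloud prices.

```python
# Rough cost sketch contrasting the two billing models.
# All numbers below are hypothetical placeholders, not published pricing.

CLUSTER_HOURLY_RATE = 3.00  # assumed all-in cost of a small cluster per uptime hour
DCU_HOURLY_RATE = 0.07      # assumed price per Data Compute Unit (DCU) hour
JOB_DCU_HOURS = 4.0         # assumed DCU-hours consumed by one job run

# Dataproc: billed for cluster uptime, whether or not jobs are running.
cluster_uptime_hours = 8.0  # e.g., a cluster kept up for a working day
dataproc_cost = cluster_uptime_hours * CLUSTER_HOURLY_RATE

# Serverless for Apache Spark: billed only for resources jobs consume.
jobs_per_day = 3
serverless_cost = jobs_per_day * JOB_DCU_HOURS * DCU_HOURLY_RATE

print(f"Dataproc, uptime-billed:  ${dataproc_cost:.2f}/day")
print(f"Serverless, job-billed:   ${serverless_cost:.2f}/day")
# A busy, shared cluster amortizes its uptime across many jobs and teams,
# which is why the multi-tenancy model described later can still win.
```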
Because of these differences, each service is best suited to the following use cases:

| Service | Use cases |
| --- | --- |
| Dataproc | Long-running, shared environments. Workloads requiring granular control over infrastructure. Migrating legacy Hadoop and Spark environments. |
| Serverless for Apache Spark | Dedicated environments for individual jobs. Scheduled batch workloads. Code management prioritized over infrastructure management. |

Key differences

| Feature | Dataproc | Serverless for Apache Spark |
| --- | --- | --- |
| Management model | Cluster-based. You provision and manage clusters. | Fully managed, serverless execution environment. |
| Control & customization | Greater control over cluster configuration, machine types, and software. Ability to use Spot VMs and to reuse reservations and Compute Engine resource capacity. Suitable for workloads that depend on specific VM shapes, such as particular CPU architectures. | Less infrastructure control; the focus is on submitting code and specifying Spark parameters. |
| Use cases | Long-running, shared clusters; migrating existing Hadoop and Spark workloads with custom configurations; workloads requiring deep customization. | Ad hoc queries, interactive analysis, new Spark pipelines, and workloads with unpredictable resource needs. |
| Operational overhead | Higher overhead: you manage cluster scaling and maintenance. | Lower overhead. Google Cloud manages the infrastructure, scaling, and provisioning, enabling a NoOps model. Gemini Cloud Assist simplifies troubleshooting, and Serverless for Apache Spark autotuning helps provide optimal performance. |
| Efficiency model | Efficiency gained by sharing clusters across jobs and teams through a multi-tenancy model. | No idle compute overhead: compute resources are allocated only while a job is running, with no startup or shutdown cost. Shared interactive sessions are supported for improved efficiency. |
| Location control | Clusters are zonal. The zone can be auto-selected during cluster creation. | Workloads are regional, at no extra cost, for improved reliability and availability. |
| Cost | Billed for the time the cluster is running, including startup and teardown, based on the number of nodes. Includes the Dataproc license fee plus infrastructure cost. | Billed only for the duration of Spark job execution, not including startup and teardown, based on resources consumed. Billed as Data Compute Units (DCUs) used, plus other infrastructure costs. |
| Committed use discounts (CUDs) | Compute Engine CUDs apply to all resource usage. | BigQuery spend-based CUDs apply to Serverless for Apache Spark jobs. |
| Image and runtime control | You can pin to minor and subminor Dataproc image versions. | You can pin to minor Serverless for Apache Spark runtime versions; subminor versions are managed by the service. |
| Resource management | YARN | Serverless |
| GPU support | Yes | Yes |
| Interactive sessions | No | Yes |
| Custom containers | No | Yes |
| VM access (SSH) | Yes | No |
| Java versions | Java 17 and previous versions | Java 17, 21 |
| Startup time | 120s | 50s |
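The "Control & customization" and "Image and runtime control" rows translate into different API surfaces. The following sketch, using the google-cloud-dataproc Python client, shows a minor-plus-subminor image pin on a Dataproc cluster next to a minor-only runtime pin on a serverless batch. The project name, bucket path, and version strings are hypothetical placeholders.

```python
from google.cloud import dataproc_v1

# Dataproc: pin to a specific minor.subminor image version.
cluster = dataproc_v1.Cluster(
    project_id="my-project",  # hypothetical project ID
    cluster_name="pinned-cluster",
    config=dataproc_v1.ClusterConfig(
        software_config=dataproc_v1.SoftwareConfig(
            image_version="2.2.26-debian12",  # hypothetical image version string
        ),
    ),
)

# Serverless for Apache Spark: pin only the minor runtime version; the
# service manages subminor updates. Tuning happens through Spark
# parameters rather than VM shapes.
batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/job.py",  # hypothetical job file
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        version="2.2",
        properties={"spark.executor.cores": "4"},
    ),
)
```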

When to choose Dataproc

Dataproc is a managed service you can use to run Apache Spark and other open source data processing frameworks. It offers a high degree of control and flexibility, making it the preferred choice in the following scenarios:

  • Migrating existing Hadoop and Spark workloads: Supports migrating on-premises Hadoop or Spark clusters to Google Cloud. You can replicate existing configurations with minimal code changes, particularly when using older Spark versions.
  • Deep customization and control: Lets you customize cluster machine types, disk sizes, and network configurations (see the sketch after this list). This level of control is critical for performance tuning and optimizing resource utilization for complex, long-running jobs.
  • Long-running and persistent clusters: Supports continuous, long-running Spark jobs and persistent clusters for multiple teams and projects.
  • Diverse open source ecosystem: Provides a unified environment for data processing pipelines that combine Hadoop ecosystem tools, such as Hive, Pig, or Presto, with your Spark workloads.
  • Security compliance: Enables control over infrastructure to meet specific security or compliance standards, such as safeguarding personally identifiable information (PII) or protected health information (PHI).
  • Infrastructure flexibility: Offers Spot VMs and the ability to reuse reservations and Compute Engine resource capacity to balance resource use and facilitate your cloud infrastructure strategy.
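
As a sketch of that knob-level control, the following example uses the google-cloud-dataproc Python client to create a cluster with explicit machine types and disk sizes. The project ID, region, cluster name, and machine shapes are hypothetical placeholders, not recommended values.

```python
from google.cloud import dataproc_v1

REGION = "us-central1"  # hypothetical region
PROJECT = "my-project"  # hypothetical project ID

# Regional endpoint for the Dataproc cluster API.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Cluster spec with explicit machine types and disk sizes -- the kind of
# infrastructure-level control that Dataproc exposes and Serverless does not.
cluster = {
    "project_id": PROJECT,
    "cluster_name": "custom-cluster",
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n2-standard-8",
            "disk_config": {"boot_disk_size_gb": 500},
        },
        "worker_config": {
            "num_instances": 4,
            "machine_type_uri": "n2-highmem-16",
            "disk_config": {"boot_disk_size_gb": 1000},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```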

When to choose Serverless for Apache Spark

Serverless for Apache Spark abstracts away the complexities of cluster management, allowing you to focus on Spark code. This makes it an excellent choice for use in the following data processing scenarios:

  • Ad-hoc and interactive analysis: For data scientists and analysts who run interactive queries and exploratory analysis using Spark, the serverless model provides a quick way to get started without focusing on infrastructure.
  • Spark-based applications and pipelines: When building new data pipelines or applications on Spark, Serverless for Apache Spark can significantly accelerate development by removing the operational overhead of cluster management (see the batch submission sketch after this list).
  • Workloads with sporadic or unpredictable demand: For intermittent Spark jobs or jobs with fluctuating resource requirements, Serverless for Apache Spark autoscaling and pay-per-use pricing, which charges only for the resources a job consumes, can significantly reduce costs.
  • Developer productivity focus: By eliminating cluster provisioning and management, Serverless for Apache Spark lets teams spend their time on business logic, delivers insights faster, and increases productivity.
  • Simplified operations and reduced overhead: Serverless for Apache Spark infrastructure management reduces operational burdens and costs.
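
The following minimal sketch shows what "focus on Spark code" looks like in practice: submitting a batch with the google-cloud-dataproc Python client declares only the job file and an ID, with no cluster, machine type, or node count. The project ID, region, bucket path, and batch ID are hypothetical placeholders.

```python
from google.cloud import dataproc_v1

REGION = "us-central1"  # hypothetical region
PROJECT = "my-project"  # hypothetical project ID

# Regional endpoint for the serverless batch API.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# The batch spec references only code; there is no cluster to declare.
batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/etl_job.py",  # hypothetical job file
    ),
)

operation = client.create_batch(
    parent=f"projects/{PROJECT}/locations/{REGION}",
    batch=batch,
    batch_id="nightly-etl-run",  # hypothetical batch ID
)
response = operation.result()  # blocks until the batch finishes
print(f"Batch {response.name} finished in state {response.state.name}")
```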

Summing up

The decision between Dataproc and Serverless for Apache Spark depends on your workload requirements, operational preferences, and the level of control you need.

  • Choose Dataproc when you need maximum control, need to migrate Hadoop or Spark workloads, or require a persistent, customized, shared cluster environment.
  • Choose Serverless for Apache Spark for its ease of use, cost-efficiency for intermittent workloads, and its ability to accelerate development for new Spark applications by removing the overhead of infrastructure management.

After evaluating the factors listed in this section, select the service that runs your Spark workloads most efficiently and cost-effectively.