Dataproc

Dataproc is a fully managed, highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and more than 30 other open source tools and frameworks. Use Dataproc for data lake modernization, ETL, and secure data science at scale, integrated with Google Cloud, at a fraction of the cost of on-premises data lakes.

Open: Run open source data analytics at scale, with enterprise grade security
Flexible: Use serverless, or manage clusters on Compute Engine and Google Kubernetes Engine (GKE)
Intelligent: Enable data users through integrations with Vertex AI, BigQuery, and Dataplex
Secure: Configure advanced security such as Kerberos, Apache Ranger, and Personal Authentication
Cost-effective: Realize 54% lower TCO compared to on-prem data lakes with per-second pricing

Dataproc supports popular OSS like Apache Spark, Presto, Flink, and more.

Benefits

Modernize your open source data processing

Serverless deployment, logging, and monitoring let you focus on your data and analytics, not on your infrastructure. Reduce TCO of Apache Spark management by up to 54%. Build and train models 5X faster.

Intelligent and seamless OSS for data science

Enable data scientists and data analysts to seamlessly perform data science jobs through native integrations with BigQuery, Dataplex, Vertex AI, and OSS notebooks like JupyterLab.

Enterprise security integrated with Google Cloud

Dataproc supports Google Cloud security features such as default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK). You can also enable Hadoop Secure Mode via Kerberos by adding a security configuration.

Key features

Fully managed and automated big data open source software

Serverless deployment, logging, and monitoring let you focus on your data and analytics, not on your infrastructure. Reduce TCO of Apache Spark management by up to 54%. Enable data scientists and engineers to build and train models 5X faster, compared to traditional notebooks, through integration with Vertex AI Workbench. The Dataproc Jobs API makes it easy to incorporate big data processing into custom applications, while Dataproc Metastore eliminates the need to run your own Hive metastore or catalog service.
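
As a rough illustration of calling the Jobs API from a custom application, the sketch below builds a PySpark job request and submits it with the `google-cloud-dataproc` Python client. The project, region, cluster name, and Cloud Storage path are hypothetical placeholders, and the helper names `build_pyspark_job` and `submit_job` are introduced here for illustration only. The client library import is deferred so that the request-building part can be exercised without credentials.

```python
from typing import Optional, Sequence


def build_pyspark_job(cluster_name: str,
                      main_python_file_uri: str,
                      args: Optional[Sequence[str]] = None) -> dict:
    """Build a job description for a PySpark job on an existing cluster."""
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": main_python_file_uri},
    }
    if args:
        job["pyspark_job"]["args"] = list(args)
    return job


def submit_job(project_id: str, region: str, job: dict):
    """Submit the job and block until it finishes.

    Assumes the google-cloud-dataproc package is installed and that
    application default credentials are configured.
    """
    from google.cloud import dataproc_v1  # imported lazily on purpose

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    return operation.result()


if __name__ == "__main__":
    # Hypothetical values; replace with your own project, region, and cluster.
    job = build_pyspark_job("example-cluster", "gs://example-bucket/wordcount.py")
    print(job)
```

In an application, `submit_job("example-project", "us-central1", job)` would run the job on the named cluster, assuming that cluster already exists.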

Containerize Apache Spark jobs with Kubernetes

Build your Apache Spark jobs using Dataproc on Kubernetes so you can use Dataproc with Google Kubernetes Engine (GKE) to provide job portability and isolation.

Enterprise security integrated with Google Cloud

When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a security configuration. The Google Cloud security features most commonly used with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK).

The best of open source with the best of Google Cloud

Dataproc lets you keep the open source tools, algorithms, and programming languages that you use today and makes it easy to apply them to cloud-scale datasets. At the same time, Dataproc integrates out of the box with the rest of the Google Cloud analytics, database, and AI ecosystem. Data scientists and engineers can quickly access data and build data applications by connecting Dataproc to BigQuery, Vertex AI, Spanner, Pub/Sub, or Data Fusion.

Demo: See how Dataproc and Cloud Storage can help accelerate loan processing (video, 3:39)

Customers

Learn from customers using Dataproc

Blog post: Broadcom modernizes its data lake with Dataproc and unlocks flexible data management (5-min read)

Case study: Dataproc provides Wayfair high-performance, low-maintenance access to unstructured data at scale (8-min read)

Video: Vodafone Group moves 600 on-premises Apache Hadoop servers to the cloud (47:17)

Case study: Twitter moved from on-premises Hadoop to Google Cloud to more cost-effectively store and query data (49:57)

Case study: Pandora migrated 7 PB+ of data from their on-prem Hadoop to Google Cloud to help scale and lower costs (50:51)

Case study: Spinning up and down Dataproc clusters helped METRO reduce infrastructure costs by 30% to 50% (5-min read)

What's new

Serverless Spark is now generally available. Sign up for the preview of other Spark on Google Cloud services.

Blog post: Serverless Spark jobs made seamless for all data users

Blog post: Converging architectures: Bringing data lakes and data warehouses together

Blog post: New Dataproc best practices guide

Blog post: New GA Dataproc features extend data science and ML capabilities

Documentation

Serverless Spark

Submit Spark jobs that auto-provision and autoscale. See the Serverless Spark quickstart for details.

Dataproc initialization actions

Add other OSS projects to your Dataproc clusters with pre-built initialization actions.

Open source connectors

Libraries and tools for Apache Hadoop interoperability.

Dataproc Workflow Templates

The Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism for managing and executing workflows.
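
To make the idea of a workflow concrete, here is a minimal sketch of a workflow as a graph of jobs, where each step lists the steps that must finish before it starts. This is a plain Python model, not a call to the WorkflowTemplates API; the step names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key is a step, each value lists the
# steps that must complete before that step can run.
workflow = {
    "ingest": [],
    "clean": ["ingest"],
    "aggregate": ["clean"],
    "train_model": ["clean"],
    "report": ["aggregate", "train_model"],
}


def execution_order(steps: dict) -> list:
    """Return one valid order in which the steps could be run."""
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    print(execution_order(workflow))
```

Steps with no dependency between them, such as `aggregate` and `train_model` here, could run in parallel; the sorter only guarantees that every step comes after its prerequisites.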

Use cases

Use case

Move your Hadoop and Spark clusters to the cloud

Enterprises are migrating their existing on-premises Apache Hadoop and Spark clusters over to Dataproc to manage costs and unlock the power of elastic scale. With Dataproc, enterprises get a fully managed, purpose-built cluster that can autoscale to support any data or analytics processing job.

Best practice

Apache Spark migration guide

Move to Google Cloud without rewriting your Spark code.

Best practice

Migrate HDFS data to Google Cloud

Learn when and how you should migrate your on-premises HDFS data to Google Cloud Storage.

Best practice

Moving security controls from on-premises to Dataproc

Migrate existing security controls to Dataproc to help achieve enterprise and industry compliance.

Use case

Data science on Dataproc

Create your ideal data science environment by spinning up a purpose-built Dataproc cluster. Integrate open source software like Apache Spark, NVIDIA RAPIDS, and Jupyter notebooks with Google Cloud AI services and GPUs to help accelerate your machine learning and AI development.

Tutorial

Use Dataproc and Apache Spark ML for machine learning

Integrate Dataproc with other Google Cloud services to build an end-to-end data science experience.

Best practice

IT-governed open source data science with Dataproc Hub

Learn how Dataproc Hub can provide your data scientists with all the open source tools they need in an IT-governed, cost-controlled way.

Tutorial

Dataproc meets TensorFlow on YARN

Learn how to orchestrate distributed TensorFlow with TonY.

All features

Serverless Spark	Deploy Spark applications and pipelines that autoscale without any manual infrastructure provisioning or tuning.
Resizable clusters	Create and scale clusters quickly with various virtual machine types, disk sizes, number of nodes, and networking options.
Autoscaling clusters	Dataproc autoscaling automates cluster resource management by adding and removing cluster workers (nodes) automatically.
Cloud integrated	Built-in integration with Cloud Storage, BigQuery, Dataplex, Vertex AI, Composer, Bigtable, Cloud Logging, and Cloud Monitoring, giving you a more complete and robust data platform.
Automatic or manual configuration	Dataproc automatically configures hardware and software but also gives you manual control.
Developer tools	Multiple ways to manage a cluster, including an easy-to-use web UI, the Cloud SDK, RESTful APIs, and SSH access.
Initialization actions	Run initialization actions to install or customize the settings and libraries you need when your cluster is created.
Optional components	Use optional components to install and configure additional components on the cluster. Optional components are integrated with Dataproc components and offer fully configured environments for Zeppelin, Presto, and other open source software components related to the Apache Hadoop and Apache Spark ecosystem.
Custom containers and images	Dataproc serverless Spark can be provisioned with custom Docker containers. Dataproc clusters can be provisioned with a custom image that includes your pre-installed Linux operating system packages.
Flexible virtual machines	Clusters can use custom machine types and preemptible virtual machines to make them the perfect size for your needs.
Workflow templates	Dataproc workflow templates provide a flexible and easy-to-use mechanism for managing and executing workflows. A workflow template is a reusable workflow configuration that defines a graph of jobs with information on where to run those jobs.
Automated policy management	Standardize security, cost, and infrastructure policies across a fleet of clusters. You can create policies for resource management, security, or networking at the project level. You can also make it easy for users to pick the correct images, components, metastore, and other peripheral services, so that you can manage your fleet of clusters and serverless Spark policies in the future.
Smart alerts	Dataproc recommended alerts let customers adjust the thresholds of preconfigured alerts for idle or runaway clusters and jobs, overutilized clusters, and more. Customers can further customize these alerts and even build advanced cluster and job management capabilities to manage their fleet at scale.
Dataproc on Google Distributed Cloud (GDC)	Dataproc on GDC enables you to run Spark on the GDC Edge Appliance in your data center. You can run the same Spark applications on Google Cloud and on sensitive data in your data center.
Multi-regional Dataproc Metastore	Dataproc Metastore is a fully managed, highly available Hive metastore (HMS) with fine-grained access control. Multi-regional Dataproc Metastore provides active-active DR and resilience against regional outages.

Pricing

Dataproc pricing is based on the number of vCPUs and the length of time that they run. Although prices are listed as hourly rates, usage is charged down to the second, so you only pay for what you use.

Example: A cluster with 6 nodes (1 main + 5 workers) of 4 vCPUs each that runs for 2 hours costs $0.48. Dataproc charge = number of vCPUs × hours × Dataproc price = 24 × 2 × $0.01 = $0.48.
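
The same arithmetic can be written as a small function. This is a sketch only: the default rate of $0.01 per vCPU-hour is the figure used in the example above and may not match current pricing, and the function name `dataproc_charge` is hypothetical. Runtime is taken in seconds to reflect per-second billing.

```python
from decimal import Decimal

# Assumption: the per-vCPU hourly rate from the example above.
DATAPROC_PRICE_PER_VCPU_HOUR = Decimal("0.01")


def dataproc_charge(nodes: int,
                    vcpus_per_node: int,
                    runtime_seconds: int,
                    price_per_vcpu_hour: Decimal = DATAPROC_PRICE_PER_VCPU_HOUR) -> Decimal:
    """Dataproc charge = number of vCPUs * hours * Dataproc price."""
    total_vcpus = nodes * vcpus_per_node
    hours = Decimal(runtime_seconds) / Decimal(3600)
    return total_vcpus * hours * price_per_vcpu_hour


if __name__ == "__main__":
    # 6 nodes of 4 vCPUs each, running for 2 hours (7200 seconds).
    print(dataproc_charge(6, 4, 2 * 3600))
```

Using `Decimal` rather than floats keeps the currency arithmetic exact for rates such as 0.01.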

See the pricing page for details.

Partners

Dataproc integrates with key partners to complement your existing investments and skill sets.