Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks. Use Dataproc for data lake modernization, ETL, and secure data science, at scale, integrated with Google Cloud, at a fraction of the cost.
Enterprise security integrated with Google Cloud
Dataproc includes security features such as default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK). You can enable Hadoop Secure Mode via Kerberos by adding a security configuration.
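As a rough illustration of what "adding a security configuration" can look like, the sketch below builds a cluster request body as a plain Python dictionary with Kerberos enabled. The field names (`securityConfig`, `kerberosConfig`, `enableKerberos`, `kmsKeyUri`, `rootPrincipalPasswordUri`) follow the Dataproc REST cluster resource as best understood here and should be checked against the current API reference. The helper name and every project, key, and bucket value are hypothetical placeholders.

```python
import json


def kerberized_cluster(project_id, cluster_name, kms_key_uri, password_uri):
    """Return a cluster request body with Hadoop Secure Mode (Kerberos) enabled.

    This only builds a dictionary; it does not call any Google Cloud API.
    """
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            "securityConfig": {
                "kerberosConfig": {
                    "enableKerberos": True,
                    # Cloud KMS key assumed to have encrypted the password file.
                    "kmsKeyUri": kms_key_uri,
                    # Cloud Storage URI of the encrypted root principal password.
                    "rootPrincipalPasswordUri": password_uri,
                }
            }
        },
    }


if __name__ == "__main__":
    # All identifiers below are placeholders, not real resources.
    body = kerberized_cluster(
        "example-project",
        "secure-cluster",
        "projects/example-project/locations/global/keyRings/kr/cryptoKeys/key",
        "gs://example-bucket/root-password.encrypted",
    )
    print(json.dumps(body, indent=2))
```

The resulting dictionary could be passed to whichever client you use to create clusters. Keeping it as plain data makes the security settings easy to review and to keep under version control.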
Submit Spark jobs that auto-provision and autoscale.
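To make the "no cluster to provision" idea concrete, here is a minimal sketch of a serverless batch request body for a PySpark workload. It assumes the Dataproc batches resource uses the fields `pysparkBatch`, `mainPythonFileUri`, `args`, and `runtimeConfig.properties`. The helper name, the bucket path, and the Spark property shown are illustrative assumptions only.

```python
def pyspark_batch(main_python_file_uri, args=(), properties=None):
    """Return a batch request body for a serverless PySpark workload.

    Note what is absent: there is no machine type, node count, or cluster
    name, because the service is expected to provision and scale resources.
    """
    batch = {
        "pysparkBatch": {
            "mainPythonFileUri": main_python_file_uri,
            "args": list(args),
        }
    }
    if properties:
        # Optional Spark properties, passed through unchanged.
        batch["runtimeConfig"] = {"properties": dict(properties)}
    return batch


if __name__ == "__main__":
    # Placeholder URI and property; replace with values from your project.
    request = pyspark_batch(
        "gs://example-bucket/jobs/wordcount.py",
        args=["gs://example-bucket/input/"],
        properties={"spark.executor.cores": "4"},
    )
    print(request)
```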
Dataproc initialization actions
Add other OSS projects to your Dataproc clusters with pre-built initialization actions.
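One way to picture how initialization actions attach to a cluster is as a list of scripts in the cluster configuration. The sketch below assumes the REST field names `initializationActions`, `executableFile`, and `executionTimeout`. The helper name and the script URI are hypothetical placeholders rather than real pre-built actions.

```python
def with_initialization_actions(config, script_uris, timeout_seconds=600):
    """Return a copy of a cluster config with extra initialization actions.

    Each action points at an executable in Cloud Storage and carries a
    timeout expressed as a duration string such as "600s".
    """
    actions = [
        {"executableFile": uri, "executionTimeout": f"{timeout_seconds}s"}
        for uri in script_uris
    ]
    updated = dict(config)  # shallow copy so the caller's dict is not mutated
    updated["initializationActions"] = (
        list(config.get("initializationActions", [])) + actions
    )
    return updated


if __name__ == "__main__":
    base = {"softwareConfig": {"imageVersion": "2.2"}}  # placeholder version
    cfg = with_initialization_actions(
        base, ["gs://example-bucket/init/install-tool.sh"]
    )
    print(cfg["initializationActions"])
```

Actions are appended in the order given, so existing entries in the config keep their position ahead of the new ones.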
Open source connectors
Libraries and tools for Apache Hadoop interoperability.
Dataproc Workflow Templates
The Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism for managing and executing workflows.
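A workflow template can be thought of as a small dependency graph of jobs. The sketch below builds a template as a dictionary, assuming the REST field names `placement`, `managedCluster`, `jobs`, `stepId`, and `prerequisiteStepIds`. It then uses Python's standard `graphlib` module to compute one valid execution order. The step names and file URIs are made up, and the ordering function only illustrates the dependency semantics; it is not how Dataproc itself schedules jobs.

```python
from graphlib import TopologicalSorter


def build_template(template_id, cluster_name):
    """Return a workflow template with four steps and their dependencies."""
    def pyspark(uri):
        return {"mainPythonFileUri": uri}

    return {
        "id": template_id,
        # The jobs run on an ephemeral cluster managed by the workflow.
        "placement": {"managedCluster": {"clusterName": cluster_name, "config": {}}},
        "jobs": [
            {"stepId": "ingest", "pysparkJob": pyspark("gs://example-bucket/ingest.py")},
            {"stepId": "transform", "prerequisiteStepIds": ["ingest"],
             "pysparkJob": pyspark("gs://example-bucket/transform.py")},
            {"stepId": "validate", "prerequisiteStepIds": ["ingest"],
             "pysparkJob": pyspark("gs://example-bucket/validate.py")},
            {"stepId": "report", "prerequisiteStepIds": ["transform", "validate"],
             "pysparkJob": pyspark("gs://example-bucket/report.py")},
        ],
    }


def execution_order(template):
    """Return one order of step IDs that respects every prerequisite."""
    graph = {
        job["stepId"]: set(job.get("prerequisiteStepIds", []))
        for job in template["jobs"]
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(build_template("nightly-etl", "ephemeral-cluster")))
```

Because `transform` and `validate` both depend only on `ingest`, either may come first in the computed order, which reflects that such steps are free to run in parallel.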
Enterprises are migrating their existing on-premises Apache Hadoop and Spark clusters over to Dataproc to manage costs and unlock the power of elastic scale. With Dataproc, enterprises get a fully managed, purpose-built cluster that can autoscale to support any data or analytics processing job.
Create your ideal data science environment by spinning up a purpose-built Dataproc cluster. Integrate open source software like Apache Spark, NVIDIA RAPIDS, and Jupyter notebooks with Google Cloud AI services and GPUs to help accelerate your machine learning and AI development.
|Serverless Spark||Deploy Spark applications and pipelines that autoscale without any manual infrastructure provisioning or tuning.|
|Resizable clusters||Create and scale clusters quickly with various virtual machine types, disk sizes, number of nodes, and networking options.|
|Autoscaling clusters||Dataproc autoscaling automates cluster resource management by automatically adding and removing cluster workers (nodes).|
|Cloud integrated||Built-in integration with Cloud Storage, BigQuery, Dataplex, Vertex AI, Composer, Cloud Bigtable, Cloud Logging, and Cloud Monitoring, giving you a more complete and robust data platform.|
|Automatic or manual configuration||Dataproc automatically configures hardware and software but also gives you manual control.|
|Developer tools||Multiple ways to manage a cluster, including an easy-to-use web UI, the Cloud SDK, RESTful APIs, and SSH access.|
|Initialization actions||Run initialization actions to install or customize the settings and libraries you need when your cluster is created.|
|Optional components||Use optional components to install and configure additional components on the cluster. Optional components are integrated with Dataproc components and offer fully configured environments for Zeppelin, Presto, and other open source software components related to the Apache Hadoop and Apache Spark ecosystem.|
|Custom containers and images||Dataproc Serverless Spark can be provisioned with custom Docker containers. Dataproc clusters can be provisioned with a custom image that includes your pre-installed Linux operating system packages.|
|Flexible virtual machines||Clusters can use custom machine types to match the size of your workload and preemptible virtual machines to lower cost.|
|Workflow templates||Dataproc workflow templates provide a flexible and easy-to-use mechanism for managing and executing workflows. A workflow template is a reusable workflow configuration that defines a graph of jobs with information on where to run those jobs.|
|Automated policy management||Standardize security, cost, and infrastructure policies across a fleet of clusters. You can create policies for resource management, security, or networking at the project level. You can also make it easy for users to pick the correct images, components, metastore, and other peripheral services, which helps you manage policies for your fleet of clusters and serverless Spark workloads in the future.|
|Smart alerts||Dataproc recommended alerts are pre-configured alerts with adjustable thresholds that notify you about idle or runaway clusters and jobs, overutilized clusters, and more. You can customize these alerts further and build advanced cluster and job management capabilities on top of them to manage your fleet at scale.|
|Dataproc on Google Distributed Cloud (GDC)||Dataproc on GDC enables you to run Spark on the GDC Edge Appliance in your data center. You can run the same Spark applications on Google Cloud and on sensitive data in your data center.|
|Multi-regional Dataproc Metastore||Dataproc Metastore is a fully managed, highly available Hive metastore (HMS) with fine-grained access control. Multi-regional Dataproc Metastore provides active-active disaster recovery (DR) and resilience against regional outages.|
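The autoscaling row in the table above describes workers being added and removed automatically. A minimal sketch of the policy that might drive this is shown below as a plain dictionary. It assumes the autoscaling policy resource uses `workerConfig` (`minInstances`, `maxInstances`) and `basicAlgorithm.yarnConfig` (`scaleUpFactor`, `scaleDownFactor`, `gracefulDecommissionTimeout`). The helper name, the default values, and the sanity checks are illustrative assumptions; the service enforces its own limits, which are documented in the API reference.

```python
def autoscaling_policy(policy_id, min_workers, max_workers,
                       scale_up=0.5, scale_down=0.5, graceful_timeout_s=3600):
    """Return an autoscaling policy body after basic sanity checks."""
    if not 0 <= min_workers <= max_workers:
        raise ValueError("expected 0 <= min_workers <= max_workers")
    for name, factor in (("scale_up", scale_up), ("scale_down", scale_down)):
        # Assumed range: a factor is a fraction between 0.0 and 1.0.
        if not 0.0 <= factor <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return {
        "id": policy_id,
        "workerConfig": {"minInstances": min_workers, "maxInstances": max_workers},
        "basicAlgorithm": {
            "yarnConfig": {
                "scaleUpFactor": scale_up,
                "scaleDownFactor": scale_down,
                # How long a worker may finish in-flight work before removal.
                "gracefulDecommissionTimeout": f"{graceful_timeout_s}s",
            }
        },
    }


if __name__ == "__main__":
    print(autoscaling_policy("example-policy", min_workers=2, max_workers=20))
```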