To get the most out of Dataproc, it's helpful to understand its fundamental building blocks. This guide explains core Dataproc concepts and features and the benefits they provide.
The cluster-based model
This is the standard, infrastructure-centric way of using Dataproc. It gives you full control over a dedicated set of virtual machines for your data processing tasks.
- Clusters: A cluster is your personal data processing engine, made up of Google Cloud virtual machines. You create a cluster to run open-source frameworks such as Apache Spark and Apache Hadoop. You have full control over cluster size, machine types, and configuration.
- Jobs: A job is a specific task, such as a PySpark script or Hadoop query. Instead of running a job directly on a cluster, you submit the job to the Dataproc service, which manages job execution for you. You can submit multiple jobs to the cluster.
- Workflow Templates: A workflow template is a reusable definition that orchestrates a series of jobs (a workflow). It can define dependencies between jobs, for example to run a machine learning job only after a data cleaning job successfully completes. The templated workflow can run on an existing cluster or on a temporary (ephemeral) cluster that is created to run the workflow, and then deleted after the workflow completes. You can use the template to run the defined workflow whenever needed.
- Autoscaling policies: An autoscaling policy contains rules you define for adding or removing worker machines from a cluster based on cluster workload, dynamically optimizing cluster cost and performance. (Client-library sketches of these cluster-model concepts follow this list.)
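To make the cluster and job concepts concrete, the following sketch creates a small cluster and then submits a PySpark job to it with the google-cloud-dataproc Python client library. The project ID, region, machine types, and Cloud Storage script path are placeholder assumptions; adapt them to your environment.

```python
from google.cloud import dataproc_v1

project_id = "my-project"         # placeholder project ID
region = "us-central1"            # placeholder region
cluster_name = "example-cluster"  # placeholder cluster name

# Dataproc uses regional endpoints.
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

# Create a small cluster: one master node and two workers.
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()  # Block until the cluster is running.

# Submit a PySpark job; the Dataproc service manages its execution.
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/scripts/etl.py"},  # placeholder script
}
finished_job = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()  # Block until the job completes.
print(f"Job finished; driver output: {finished_job.driver_output_resource_uri}")
```

You can submit additional jobs to the same cluster with further submit_job_as_operation calls, and remove the cluster with cluster_client.delete_cluster when you no longer need it.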
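A workflow template can also be defined inline and instantiated in a single call. The sketch below runs a data-cleaning step followed by a dependent training step on an ephemeral managed cluster; the template ID, step IDs, and script paths are illustrative placeholders.

```python
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"     # placeholder

workflow_client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

template = {
    "id": "clean-then-train",  # placeholder template ID
    "placement": {
        # A managed (ephemeral) cluster is created for the workflow and
        # deleted after the workflow completes.
        "managed_cluster": {
            "cluster_name": "workflow-cluster",
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
            },
        }
    },
    "jobs": [
        {
            "step_id": "clean-data",
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/clean.py"},  # placeholder
        },
        {
            "step_id": "train-model",
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/train.py"},  # placeholder
            # Run only after the cleaning step succeeds.
            "prerequisite_step_ids": ["clean-data"],
        },
    ],
}

workflow_client.instantiate_inline_workflow_template(
    request={"parent": f"projects/{project_id}/regions/{region}", "template": template}
).result()  # Blocks until the workflow and its ephemeral cluster finish.
```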
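Finally, the following sketch defines a basic YARN-based autoscaling policy with the same client library. The policy ID, instance bounds, and scaling factors are placeholder values chosen for illustration, not recommendations.

```python
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"     # placeholder

policy_client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = {
    "id": "scale-on-yarn-pressure",  # placeholder policy ID
    # Let the primary worker group grow from 2 up to 20 instances.
    "worker_config": {"min_instances": 2, "max_instances": 20},
    "basic_algorithm": {
        "cooldown_period": {"seconds": 240},
        "yarn_config": {
            # Add capacity aggressively, remove it more conservatively.
            "scale_up_factor": 0.5,
            "scale_down_factor": 0.25,
            "graceful_decommission_timeout": {"seconds": 3600},
        },
    },
}

created = policy_client.create_autoscaling_policy(
    request={"parent": f"projects/{project_id}/locations/{region}", "policy": policy}
)
print(f"Created autoscaling policy: {created.name}")
```

To use the policy, reference it from a cluster's autoscaling_config.policy_uri when you create or update the cluster.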
The serverless model
Serverless for Apache Spark is the modern, automated-execution Dataproc model. It lets you run jobs without provisioning, managing, or scaling the underlying infrastructure: Serverless for Apache Spark handles the details for you.
- Batches: A batch (also called a batch workload) is the serverless equivalent of a Dataproc job. You submit your code, such as a Spark job, to the service. Serverless for Apache Spark provisions the necessary resources on demand, runs the job, and then tears them down. You don't create or manage cluster or job resources; the service does the work for you. (See the sketch after this list.)
- Interactive sessions: Interactive sessions provide a live, on-demand environment for exploratory data analysis, typically within a Jupyter notebook. Interactive sessions provide the convenience of a temporary, serverless workspace that you can use to run queries and develop code without having to provision and manage cluster and notebook resources.
- Session templates: A session template is a reusable configuration you can use to define interactive sessions. The template contains session settings, such as Spark properties and library dependencies. You use the template to create interactive session environments for development, typically within a Jupyter notebook.
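As an example of the batch concept above, the sketch below submits a PySpark batch workload with the google-cloud-dataproc Python client library; no cluster is created or managed by you. The project ID, region, script path, and batch ID are placeholder assumptions.

```python
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"     # placeholder

batch_client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = {
    # The code to run; Serverless for Apache Spark provisions the resources.
    "pyspark_batch": {
        "main_python_file_uri": "gs://my-bucket/scripts/etl.py",  # placeholder script
        "args": ["--date=2024-01-01"],
    },
}

operation = batch_client.create_batch(
    request={
        "parent": f"projects/{project_id}/locations/{region}",
        "batch": batch,
        "batch_id": "daily-etl-20240101",  # placeholder; must be unique in the project and region
    }
)
finished = operation.result()  # Resources are provisioned, the job runs, then resources are torn down.
print(f"Batch finished in state: {finished.state.name}")
```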
Metastore services
Dataproc provides managed services for handling metadata, which is the data about your data.
- Metastore: A metastore acts as a central catalog for data schema, such as table and column names and data types. A metastore allows different services, clusters, and jobs to understand the structure of your data. Typically, the data described by the catalog is stored in Cloud Storage. (See the sketch after this list for attaching a metastore service to a cluster.)
- Federation: Metadata federation is an advanced feature that lets you access and query data from multiple metastores as if you were accessing a single, unified metastore.
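As a sketch of the metastore concept, the example below attaches an existing Dataproc Metastore service to a new cluster at creation time so the cluster's jobs resolve table definitions from the shared catalog. It assumes a metastore service named shared-metastore already exists in the same project and region; all names are placeholders.

```python
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"     # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "analytics-cluster",  # placeholder
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Point the cluster at an existing Dataproc Metastore service so its
        # jobs share the central catalog with other clusters and services.
        "metastore_config": {
            "dataproc_metastore_service": (
                f"projects/{project_id}/locations/{region}/services/shared-metastore"
            )
        },
    },
}

cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()
```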
Notebook and development environments
Dataproc integrates with notebook and development environments where you can write and execute your code.
- BigQuery Studio & Workbench: These are unified analytics and notebook environments. They allow you to write code (for example in a Jupyter notebook) and use a Dataproc cluster or serverless session as the powerful backend engine to execute your code on large datasets.
- Dataproc JupyterLab Plugin: This official JupyterLab extension acts as a control panel for Dataproc inside your notebook environment. It simplifies your workflow by allowing you to browse, create, and manage clusters and submit jobs without having to leave the Jupyter interface.
- Dataproc Spark Connect Python Connector: This Python library streamlines the process of using Spark Connect with Dataproc. It handles authentication and endpoint configuration, making it much simpler to connect your local Python environment, such as a notebook or IDE, to a remote Dataproc cluster for interactive development. (See the sketch after this list.)
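For context on what the connector automates, the sketch below shows a bare Spark Connect session created with standard PySpark (3.4 or later). The endpoint URL is hypothetical; with the Dataproc Spark Connect connector, the session endpoint and authentication are resolved for you instead of being hard-coded like this.

```python
from pyspark.sql import SparkSession

# Plain Spark Connect: the client sends logical plans to a remote Spark server.
# "sc://example-endpoint:15002" is a hypothetical endpoint; the Dataproc
# Spark Connect connector removes the need to build this URL and wire up
# credentials yourself.
spark = SparkSession.builder.remote("sc://example-endpoint:15002").getOrCreate()

df = spark.range(1_000)
print(df.count())  # Executed by the remote Spark server, not your local machine.
```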
Environment customization
Dataproc offers tools and components for customizing your environment to fit specific needs. The Utilities section in the Google Cloud console also contains helpful tools for working with your Dataproc environment.
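For example, a common form of customization is to enable optional open-source components and run initialization actions when a cluster is created. The sketch below shows both with the google-cloud-dataproc Python client library; the image version, component names, and script path are placeholder assumptions.

```python
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"     # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "customized-cluster",  # placeholder
    "config": {
        "software_config": {
            # Pin an image version and enable optional components on the cluster.
            "image_version": "2.2-debian12",
            "optional_components": ["JUPYTER", "ZEPPELIN"],
        },
        # Initialization actions run a script on each node at creation time,
        # for example to install extra libraries.
        "initialization_actions": [
            {"executable_file": "gs://my-bucket/init/install-deps.sh"}  # placeholder script
        ],
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()
```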