Migrate Hadoop

To migrate Apache Hadoop workflows and data to Google Cloud and Dataproc, see the following documents:

  • Migrating On-Premises Hadoop Infrastructure to Google Cloud
  • Migrating HDFS Data from On-Premises to Google Cloud

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-05-10 UTC.
