Create a Dataproc cluster by using the gcloud CLI
This page shows you how to use the Google Cloud CLI (the gcloud command-line tool) to create a Dataproc cluster, run an Apache Spark job in the cluster, and then modify the number of workers in the cluster.
You can find out how to do the same or similar tasks in Quickstarts Using the API Explorer, with the Google Cloud console in Create a Dataproc cluster by using the Google Cloud console, and with the client libraries in Create a Dataproc cluster by using client libraries.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Dataproc API.
Create a cluster
To create a cluster called example-cluster, run the following command, replacing REGION with the region where you want the cluster to run:
gcloud dataproc clusters create example-cluster --region=REGION
The command output confirms cluster creation:
Waiting for cluster creation operation...done.
Created [... example-cluster]
For information on selecting a region, see Available regions & zones. To see a list of available regions, you can run the gcloud compute regions list command. To learn about regional endpoints, see Regional endpoints.
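The REGION placeholder appears in every command on this page, so it can help to set it once in a shell variable. The following is a minimal sketch, not part of the original quickstart: the us-central1 default and the create_cluster_cmd helper name are assumptions chosen for illustration. The helper only prints the command text so you can review it before running it yourself.

```shell
# Assumption: us-central1 is only an illustrative default; use your own region.
REGION="${REGION:-us-central1}"

# Hypothetical helper: print the create command for a cluster name and region
# without executing it, so the substituted value can be checked first.
create_cluster_cmd() {
  printf 'gcloud dataproc clusters create %s --region=%s\n' "$1" "$2"
}

create_cluster_cmd example-cluster "$REGION"
```

Alternatively, the gcloud CLI has a dataproc/region configuration property that can be set with gcloud config set, which supplies a default region when the --region flag is omitted.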
Submit a job
To submit a sample Spark job that calculates a rough value for pi, run the following command:
gcloud dataproc jobs submit spark --cluster example-cluster \
    --region=REGION \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
This command specifies the following:
- You want to run a spark job on the example-cluster cluster in the specified region
- The class containing the main method for the job's pi-calculating application
- The location of the jar file containing your job's code
- Any parameters you want to pass to the job, in this case the number of tasks, which is 1000
The job's running and final output is displayed in the terminal window:
Waiting for job output...
...
Pi is roughly 3.14118528
...
Job finished successfully.
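If you want to capture only the computed value from the job output, a small text filter is enough. The sketch below is an illustration rather than part of the quickstart: extract_pi is a hypothetical helper name, and it assumes the result line has the "Pi is roughly ..." form shown in the sample output.

```shell
# Hypothetical helper: print the number that follows "Pi is roughly"
# on any line read from standard input, and nothing for other lines.
extract_pi() {
  sed -n 's/.*Pi is roughly \([0-9.][0-9.]*\).*/\1/p'
}

# Sample line copied from the job output shown on this page.
printf 'Pi is roughly 3.14118528\n' | extract_pi
```

To apply this to a real job, pipe the output of the gcloud dataproc jobs submit spark command into extract_pi. Adding 2>&1 before the pipe is a cautious choice here, on the assumption that the job output may be written to either standard output or standard error.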
Update a cluster
To change the number of workers in the cluster to five, run the following command:
gcloud dataproc clusters update example-cluster \
    --region=REGION \
    --num-workers 5
The command output displays your cluster's details. For example:
workerConfig:
...
  instanceNames:
  - example-cluster-w-0
  - example-cluster-w-1
  - example-cluster-w-2
  - example-cluster-w-3
  - example-cluster-w-4
  numInstances: 5
statusHistory:
...
- detail: Add 3 workers.
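To check the new worker count without reading the whole output, you can filter the cluster details for the numInstances field under workerConfig. The following is a rough sketch under stated assumptions: worker_count is a hypothetical helper, and it assumes the details are in the YAML-like layout shown in the example output, with numInstances appearing after the workerConfig key.

```shell
# Hypothetical helper: read cluster details on standard input and print the
# first numInstances value that appears after the workerConfig key.
worker_count() {
  awk '
    $1 == "workerConfig:" { in_worker = 1; next }
    in_worker && $1 == "numInstances:" { print $2; exit }
  '
}

# Abbreviated sample based on the update output shown on this page.
printf '%s\n' \
  'workerConfig:' \
  '  instanceNames:' \
  '  - example-cluster-w-0' \
  '  numInstances: 5' \
  'statusHistory:' | worker_count
```

Matching on the first field rather than on the whole line keeps the filter independent of indentation, and a key such as secondaryWorkerConfig does not trigger it because the field must equal workerConfig: exactly.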
To decrease the number of worker nodes to the original value, use the same command with --num-workers set to 2:
gcloud dataproc clusters update example-cluster \
    --region=REGION \
    --num-workers 2
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
To delete your example-cluster, run the clusters delete command:
gcloud dataproc clusters delete example-cluster \
    --region=REGION
To confirm and complete the cluster deletion, press y and then press Enter when prompted.
What's next
- Learn how to write and run a Spark Scala job.