Use Spark SQL with Dataproc Metastore

This page shows you an example of using Spark SQL with a Dataproc Metastore service. In this example, you launch a Spark SQL session on a Dataproc cluster and run some sample commands to create a database and table.

Before you begin

Connect to Spark SQL

To start using Spark SQL, use SSH to connect to the Dataproc cluster that's associated with your Dataproc Metastore service. After you connect to the cluster with SSH, you can run Spark commands to manage your metadata.

To connect to Spark SQL

  1. In the Google Cloud console, go to the VM Instances page.
  2. In the list of virtual machine instances, click SSH in the row of the Dataproc VM instance that you want to connect to.
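
Alternatively, you can connect from a local terminal with the gcloud CLI. This is a minimal sketch; the cluster's master node name (cluster-1-m), project ID, and zone shown here are placeholders that you replace with your own values:

    gcloud compute ssh cluster-1-m \
        --project=my-project \
        --zone=us-central1-a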

A browser window opens in your home directory on the node with an output similar to the following:

Connected, host fingerprint: ssh-rsa ...
Linux cluster-1-m 3.16.0-0.bpo.4-amd64 ...

To start the Spark shell and create a database and table, run the following commands in the SSH session:

  1. Start the Spark shell.

    spark-shell
  2. Create a database called myDB.

    spark.sql("create database myDB");
  3. Use the database you created.

    spark.sql("use myDB");
  4. Create a table called myTable.

    spark.sql("create table myTable(id int,name string)");
  5. List the tables under myDB.

    spark.sql("show tables").show();
  6. Describe the columns in the table you created.

    spark.sql("desc myTable").show();
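
The steps above can also be run as a single Scala program. This is a minimal sketch, assuming a Spark environment configured with Hive metastore support; the explicit SparkSession bootstrap is only needed outside spark-shell, which predefines spark:

    import org.apache.spark.sql.SparkSession

    // Outside spark-shell, build a session with Hive (metastore) support.
    val spark = SparkSession.builder()
      .appName("metastore-example")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("create database if not exists myDB")
    spark.sql("use myDB")
    spark.sql("create table if not exists myTable(id int, name string)")
    spark.sql("show tables").show()   // lists myTable under myDB
    spark.sql("desc myTable").show()  // shows the id and name columns

    spark.stop()

The `if not exists` clauses make the program safe to rerun against the same metastore.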

Running these commands shows an output similar to the following:


scala> spark.sql("create database myDB");

scala> spark.sql("use myDB");

scala> spark.sql("create table myTable(id int,name string)");

scala> spark.sql("show tables").show();
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|    myDB|  myTable|      false|
+--------+---------+-----------+

scala> spark.sql("desc myTable").show();
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|      id|      int|   null|
|    name|   string|   null|
+--------+---------+-------+

What's next