Dataproc optional Delta Lake component

You can install additional components, such as Delta Lake, when you create a Dataproc cluster by using the Optional components feature. This page describes how to install the Delta Lake component on a Dataproc cluster.

The Delta Lake component installs Delta Lake libraries on the cluster and configures Spark and Hive to work with Delta Lake.

Compatible Dataproc image versions

You can install the Delta Lake component on clusters created with Dataproc image version 2.2.46 or later.

See Supported Dataproc versions for the Delta Lake component version included in Dataproc image releases.

When you create a Dataproc cluster with the Delta Lake component enabled, the following Spark properties are configured to work with Delta Lake.

Config file                           Property                           Default value
/etc/spark/conf/spark-defaults.conf   spark.sql.extensions               io.delta.sql.DeltaSparkSessionExtension
/etc/spark/conf/spark-defaults.conf   spark.sql.catalog.spark_catalog    org.apache.spark.sql.delta.catalog.DeltaCatalog
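
You can verify these settings from a Spark session on the cluster, for example in the pyspark shell. The following is a minimal check, assuming the defaults have not been overridden:

# Print the Delta Lake Spark SQL properties set by the component.
print(spark.conf.get("spark.sql.extensions"))
print(spark.conf.get("spark.sql.catalog.spark_catalog"))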

Install the component

Install the component when you create a Dataproc cluster using the Google Cloud console, Google Cloud CLI, or the Dataproc API.

Console

  1. In the Google Cloud console, go to the Dataproc Create a cluster page.

    Go to Create a cluster

    The Set up cluster panel is selected.

  2. In the Components section, under Optional components, select Delta Lake and other optional components to install on your cluster.

gcloud CLI

To create a Dataproc cluster that includes the Delta Lake component, use the gcloud dataproc clusters create command with the --optional-components flag.

gcloud dataproc clusters create CLUSTER_NAME \
    --optional-components=DELTA \
    --region=REGION \
    ... other flags

Notes:

  • CLUSTER_NAME: Specify the name of the cluster.
  • REGION: Specify a Compute Engine region where the cluster will be located.
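
For example, the following command creates a cluster with the Delta Lake component on a 2.2 image; the cluster name, region, and image version shown are illustrative:

gcloud dataproc clusters create my-delta-cluster \
    --optional-components=DELTA \
    --region=us-central1 \
    --image-version=2.2-debian12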

REST API

Specify the Delta Lake component through the Dataproc API using SoftwareConfig.Component as part of a clusters.create request.
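
For example, the relevant portion of a clusters.create request body might look like the following sketch (the cluster name is illustrative):

{
  "clusterName": "my-delta-cluster",
  "config": {
    "softwareConfig": {
      "optionalComponents": ["DELTA"]
    }
  }
}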

Usage examples

This section provides examples of writing data to and reading data from Delta Lake tables.

Delta Lake table

Write to a Delta Lake table

You can use a Spark DataFrame to write data to a Delta Lake table. The following examples create a DataFrame with sample data, create a my_delta_table Delta Lake table in Cloud Storage, and then write the data to the Delta Lake table.

PySpark

# Create a DataFrame with sample data.
data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Create a Delta Lake table in Cloud Storage.
spark.sql("""CREATE TABLE IF NOT EXISTS my_delta_table (
    id integer,
    name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table'""")

# Write the DataFrame to the Delta Lake table in Cloud Storage.
data.writeTo("my_delta_table").append()

Scala

// Create a DataFrame with sample data.
val data = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

// Create a Delta Lake table in Cloud Storage.
spark.sql("""CREATE TABLE IF NOT EXISTS my_delta_table (
    id integer,
    name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table'""")

// Write the DataFrame to the Delta Lake table in Cloud Storage.
data.write.format("delta").mode("append").saveAsTable("my_delta_table")

Spark SQL

CREATE TABLE IF NOT EXISTS my_delta_table (
    id integer,
    name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';

INSERT INTO my_delta_table VALUES (1, "Alice"), (2, "Bob");

Read from a Delta Lake table

The following examples read data from my_delta_table and display its contents.

PySpark

# Read the Delta Lake table into a DataFrame.
df = spark.table("my_delta_table")

# Display the data.
df.show()
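
You can also load the table directly from its Cloud Storage location instead of by table name. The following sketch assumes the same path used when the table was created:

# Read the Delta Lake table from its Cloud Storage location.
df = spark.read.format("delta").load(
    "gs://delta-gcs-demo/example-prefix/default/my_delta_table")

# Display the data.
df.show()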

Scala

// Read the Delta Lake table into a DataFrame.
val df = spark.table("my_delta_table")

// Display the data.
df.show()

Spark SQL

SELECT * FROM my_delta_table;

Hive with Delta Lake

Write to a Delta Lake table in Hive

The Dataproc Delta Lake optional component is pre-configured to work with Hive external tables.

For more information, see Hive connector.

Run the Hive examples that follow in a beeline client.

beeline -u jdbc:hive2://

Create a Spark Delta Lake table

The Delta Lake table must be created using Spark before a Hive external table can reference it. Run the following statements in Spark SQL, for example in the spark-sql shell, not in beeline.

CREATE TABLE IF NOT EXISTS my_delta_table (
    id integer,
    name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';

INSERT INTO my_delta_table VALUES (1, "Alice"), (2, "Bob");

Create a Hive external table

SET hive.input.format=io.delta.hive.HiveInputFormat;
SET hive.tez.input.format=io.delta.hive.HiveInputFormat;

CREATE EXTERNAL TABLE deltaTable(id INT, name STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';

Notes:

  • The io.delta.hive.DeltaStorageHandler class implements the Hive data source APIs. It can load a Delta table and extract its metadata. If the table schema in the CREATE TABLE statement is not consistent with the underlying Delta Lake metadata, an error is thrown.

Read from a Delta Lake table in Hive

To read data from a Delta table, use a SELECT statement:

SELECT * FROM deltaTable;

Drop a Delta Lake table

To drop the deltaTable external table, use the DROP TABLE statement. Dropping a Hive external table removes its metadata but does not delete the underlying Delta Lake files in Cloud Storage:

DROP TABLE deltaTable;