You can install additional components like Delta Lake when you create a Dataproc cluster using the Optional components feature. This page describes how you can optionally install the Delta Lake component on a Dataproc cluster.
When installed on a Dataproc cluster, the Delta Lake component installs Delta Lake libraries and configures Spark and Hive in the cluster to work with Delta Lake.
Compatible Dataproc image versions
You can install the Delta Lake component on Dataproc clusters created with image version 2.2.46 or later.
See Supported Dataproc versions for the Delta Lake component version included in Dataproc image releases.
Delta Lake related properties
When you create a Dataproc cluster with the Delta Lake component enabled, the following Spark properties are configured to work with Delta Lake.
Config file | Property | Default value |
---|---|---|
/etc/spark/conf/spark-defaults.conf | spark.sql.extensions | io.delta.sql.DeltaSparkSessionExtension |
/etc/spark/conf/spark-defaults.conf | spark.sql.catalog.spark_catalog | org.apache.spark.sql.delta.catalog.DeltaCatalog |
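On clusters created with the component, these defaults are applied automatically, so no additional Spark configuration is needed. If you build a Spark session yourself and want to set the same properties programmatically, a minimal PySpark sketch looks like the following; it assumes the Delta Lake libraries installed by the component are already on the Spark classpath.

from pyspark.sql import SparkSession

# Build a Spark session with the same Delta Lake settings that the component
# writes to /etc/spark/conf/spark-defaults.conf.
spark = (
    SparkSession.builder.appName("delta-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)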
Install the component
Install the component when you create a Dataproc cluster using the Google Cloud console, Google Cloud CLI, or the Dataproc API.
Console
- In the Google Cloud console, go to the Dataproc Create a cluster page. The Set up cluster panel is selected.
- In the Components section, under Optional components, select Delta Lake and other optional components to install on your cluster.
gcloud CLI
To create a Dataproc cluster that includes the Delta Lake component, use the gcloud dataproc clusters create command with the --optional-components flag.

gcloud dataproc clusters create CLUSTER_NAME \
    --optional-components=DELTA \
    --region=REGION \
    ... other flags
Notes:
- CLUSTER_NAME: Specify the name of the cluster.
- REGION: Specify a Compute Engine region where the cluster will be located.
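For example, the following command creates a cluster with the Delta Lake component enabled. The cluster name and region are hypothetical values, and the image version shown selects the 2.2 image track, which includes releases that support the component.

gcloud dataproc clusters create my-delta-cluster \
    --optional-components=DELTA \
    --region=us-central1 \
    --image-version=2.2-debian12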
REST API
Specify the Delta Lake component through the Dataproc API using SoftwareConfig.Component as part of a clusters.create request.
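For example, the cluster portion of a clusters.create request body might look like the following sketch; the cluster name and image version are illustrative values.

{
  "clusterName": "my-delta-cluster",
  "config": {
    "softwareConfig": {
      "imageVersion": "2.2-debian12",
      "optionalComponents": ["DELTA"]
    }
  }
}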
Usage examples
This section provides data read and write examples using Delta Lake tables.
Delta Lake table
Write to a Delta Lake table
You can use a Spark DataFrame to write data to a Delta Lake table. The following examples create a DataFrame with sample data, create a my_delta_table Delta Lake table in Cloud Storage, and then write the data to the Delta Lake table.
PySpark
# Create a DataFrame with sample data.
data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
# Create a Delta Lake table in Cloud Storage.
spark.sql("""CREATE TABLE IF NOT EXISTS my_delta_table (
id integer,
name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table'""")
# Write the DataFrame to the Delta Lake table in Cloud Storage.
data.writeTo("my_delta_table").append()
Scala
// Import implicits to enable Seq.toDF (already imported in spark-shell).
import spark.implicits._

// Create a DataFrame with sample data.
val data = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")
// Create a Delta Lake table in Cloud Storage.
spark.sql("""CREATE TABLE IF NOT EXISTS my_delta_table (
id integer,
name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table'""")
// Write the DataFrame to the Delta Lake table in Cloud Storage.
data.write.format("delta").mode("append").saveAsTable("my_delta_table")
Spark SQL
CREATE TABLE IF NOT EXISTS my_delta_table (
id integer,
name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';
INSERT INTO my_delta_table VALUES (1, "Alice"), (2, "Bob");
Read from a Delta Lake table
The following examples read the my_delta_table table and display its contents.
PySpark
# Read the Delta Lake table into a DataFrame.
df = spark.table("my_delta_table")
# Display the data.
df.show()
Scala
// Read the Delta Lake table into a DataFrame.
val df = spark.table("my_delta_table")
// Display the data.
df.show()
Spark SQL
SELECT * FROM my_delta_table;
Hive with Delta Lake
Write to a Delta Lake table in Hive
The Dataproc Delta Lake optional component is pre-configured to work with Hive external tables.
For more information, see Hive connector.
Run the examples in a beeline client.
beeline -u jdbc:hive2://
Create a Spark Delta Lake table
The Delta Lake table must be created using Spark before a Hive external table can reference it.
CREATE TABLE IF NOT EXISTS my_delta_table (
id integer,
name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';
INSERT INTO my_delta_table VALUES (1, "Alice"), (2, "Bob");
Create a Hive external table
SET hive.input.format=io.delta.hive.HiveInputFormat;
SET hive.tez.input.format=io.delta.hive.HiveInputFormat;
CREATE EXTERNAL TABLE deltaTable(id INT, name STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';
Notes:
- The io.delta.hive.DeltaStorageHandler class implements the Hive data source APIs. It can load a Delta table and extract its metadata. If the table schema in the CREATE TABLE statement is not consistent with the underlying Delta Lake metadata, an error is thrown.
Read from a Delta Lake table in Hive
To read data from a Delta table, use a SELECT
statement:
SELECT * FROM deltaTable;
Drop a Delta Lake table
To drop a Delta table, use the DROP TABLE
statement:
DROP TABLE deltaTable;