Dataproc optional Iceberg component

You can install additional components, such as Iceberg, when you create a Dataproc cluster by using the Optional components feature. This page describes how to install the Iceberg component on a Dataproc cluster.

Overview

Apache Iceberg is an open table format for large analytic datasets. It brings the reliability and simplicity of SQL tables to Big Data, while making it possible for engines such as Spark, Trino, PrestoDB, Flink, and Hive to safely work with the same tables at the same time.

When installed on a Dataproc cluster, the Apache Iceberg component installs Iceberg libraries and configures Spark and Hive to work with Iceberg on the cluster.

Key Iceberg features

Iceberg features include the following:

  • Schema evolution: Add, remove, or rename columns without rewriting the entire table.
  • Time travel: Query historical table snapshots for auditing or rollback purposes (see the example after this list).
  • Hidden partitioning: Optimize data layout for faster queries without exposing partition details to users.
  • ACID transactions: Ensure data consistency and prevent conflicts.
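
For example, time travel can be expressed directly in Spark SQL. The following PySpark sketch is illustrative only: it assumes a Spark session that is already configured for Iceberg (see the configuration steps later on this page), and the catalog, namespace, table name, and snapshot ID are hypothetical.

# Hypothetical time travel example (PySpark). Assumes "my_catalog" is an
# Iceberg catalog configured as described later on this page.

# Query the current state of the table.
spark.sql("SELECT * FROM my_catalog.default.events").show()

# Query an earlier snapshot by snapshot ID ...
spark.sql("SELECT * FROM my_catalog.default.events VERSION AS OF 4348241217270083322").show()

# ... or by timestamp.
spark.sql("SELECT * FROM my_catalog.default.events TIMESTAMP AS OF '2024-01-01 00:00:00'").show()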

Compatible Dataproc image versions

You can install the Iceberg component on Dataproc clusters created with image version 2.2.47 or later. The Iceberg version installed on the cluster is listed on the 2.2 release versions page.

When you create a Dataproc cluster with the Iceberg component, the following Spark and Hive properties are configured to work with Iceberg.

Config file: /etc/spark/conf/spark-defaults.conf

  • spark.sql.extensions = org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  • spark.driver.extraClassPath = /usr/lib/iceberg/lib/iceberg-spark-runtime-spark-version_scala-version.jar
  • spark.executor.extraClassPath = /usr/lib/iceberg/lib/iceberg-spark-runtime-spark-version_scala-version.jar

Config file: /etc/hive/conf/hive-site.xml

  • hive.aux.jars.path = file:///usr/lib/iceberg/lib/iceberg-hive-runtime.jar
  • iceberg.engine.hive.enabled = true
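
To confirm these defaults on a running cluster, you can print them from a Spark session on the master node. The following pyspark lines are a minimal sketch; the property names come from the preceding list, and the values vary by image version.

# Sketch: print the Iceberg-related Spark defaults set by the component.
conf = spark.sparkContext.getConf()
print(conf.get("spark.sql.extensions"))
print(conf.get("spark.driver.extraClassPath"))
print(conf.get("spark.executor.extraClassPath"))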

Install the Iceberg optional component

Install the Iceberg component when you create a Dataproc cluster. The Dataproc cluster image version list pages show the Iceberg component version included in the latest Dataproc cluster image versions.

Google Cloud console

To create a Dataproc cluster that installs the Iceberg component, complete the following steps in the Google Cloud console:

  1. Open the Dataproc Create a cluster page. The Set up cluster panel is selected.
  2. In the Components section, under Optional components, select the Iceberg component.
  3. Confirm or specify other cluster settings, then click Create.

Google Cloud CLI

To create a Dataproc cluster that installs the Iceberg component, use the gcloud dataproc clusters create command with the --optional-components flag.

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --optional-components=ICEBERG \
     other flags ...

Replace the following:

  • CLUSTER_NAME: The cluster name.
  • REGION: The Compute Engine region.
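
For example, a command with hypothetical values (the cluster name, region, and image version shown here are placeholders to adapt) might look like the following:

gcloud dataproc clusters create example-iceberg-cluster \
    --region=us-central1 \
    --image-version=2.2 \
    --optional-components=ICEBERG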

REST API

To create a Dataproc cluster that installs the Iceberg optional component, specify the Iceberg SoftwareConfig.Component as part of a clusters.create request.
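
For example, the following request body fragment sketches where the component is specified. It is illustrative only; other required cluster configuration fields are omitted.

{
  "clusterName": "CLUSTER_NAME",
  "config": {
    "softwareConfig": {
      "optionalComponents": ["ICEBERG"]
    }
  }
}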

Use Iceberg tables with Spark and Hive

After you create a Dataproc cluster with the Iceberg optional component installed, you can use Spark and Hive to read and write Iceberg table data.

Spark

Configure a Spark session for Iceberg

You can use the gcloud CLI locally, or the spark-shell or pyspark REPLs (read-eval-print loops) running on the Dataproc cluster master node, to enable Iceberg's Spark extensions and set up the Spark catalog to use Iceberg tables.

gcloud

Run the following gcloud CLI example in a local terminal window or in Cloud Shell to submit a Spark job and set Spark properties to configure the Spark session for Iceberg.

gcloud dataproc jobs submit spark  \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties="spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,spark.sql.catalog.CATALOG_NAME.type=hadoop,spark.sql.catalog.CATALOG_NAME.warehouse=gs://BUCKET/FOLDER" \
     other flags ...

Replace the following:

  • CLUSTER_NAME: The cluster name.
  • REGION: The Compute Engine region.
  • CATALOG_NAME: Iceberg catalog name.
  • BUCKET and FOLDER: The Iceberg catalog location in Cloud Storage.
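
The properties in the preceding command configure the Spark session that your job code receives. As an illustration, if you submit a PySpark job instead (gcloud dataproc jobs submit pyspark) with the same --properties, a minimal job script (a hypothetical iceberg_job.py) could use the catalog as follows:

# iceberg_job.py: hypothetical job script. The catalog name must match the
# spark.sql.catalog.* properties passed with --properties at submission time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-example").getOrCreate()

# List the namespaces in the Iceberg catalog configured through job properties.
spark.sql("SHOW NAMESPACES IN CATALOG_NAME").show()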

spark-shell

To configure a Spark session for Iceberg using the spark-shell REPL on the Dataproc cluster, complete the following steps:

  1. Use SSH to connect to the master node of the Dataproc cluster.

  2. Run the following command in the SSH session terminal to configure the Spark session for Iceberg.

spark-shell \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.CATALOG_NAME.type=hadoop" \
    --conf "spark.sql.catalog.CATALOG_NAME.warehouse=gs://BUCKET/FOLDER"

Replace the following:

  • CATALOG_NAME: Iceberg catalog name.
  • BUCKET and FOLDER: The Iceberg catalog location in Cloud Storage.

pyspark shell

To configure a Spark session for Iceberg using the pyspark REPL on the Dataproc cluster, complete the following steps:

  1. Use SSH to connect to the master node of the Dataproc cluster.

  2. Run the following command in the SSH session terminal to configure the Spark session for Iceberg:

pyspark \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.CATALOG_NAME.type=hadoop" \
    --conf "spark.sql.catalog.CATALOG_NAME.warehouse=gs://BUCKET/FOLDER"

Replace the following:

  • CATALOG_NAME: Iceberg catalog name.
  • BUCKET and FOLDER: The Iceberg catalog location in Cloud Storage.

Write data to an Iceberg Table

You can write data to an Iceberg table using Spark. The following code snippets create a DataFrame with sample data, create an Iceberg table in Cloud Storage, and then write the data to the Iceberg table.

PySpark

# Create a DataFrame with sample data.
data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Create an Iceberg table in Cloud Storage.
spark.sql("""CREATE TABLE IF NOT EXISTS CATALOG_NAME.NAMESPACE.TABLE_NAME (
    id integer,
    name string)
USING iceberg
LOCATION 'gs://BUCKET/FOLDER/NAMESPACE/TABLE_NAME'""")

# Write the DataFrame to the Iceberg table in Cloud Storage.
data.writeTo("CATALOG_NAME.NAMESPACE.TABLE_NAME").append()
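
As an alternative sketch, you can let Spark create the Iceberg table from the DataFrame schema instead of running CREATE TABLE first; createOrReplace() is part of Spark's DataFrameWriterV2 API, and the table is created in the catalog's configured warehouse location.

# Alternative sketch: create (or replace) the Iceberg table directly from the
# DataFrame schema; the data is written in the same call.
data.writeTo("CATALOG_NAME.NAMESPACE.TABLE_NAME").using("iceberg").createOrReplace()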

Scala

// Create a DataFrame with sample data.
val data = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

// Create an Iceberg table in Cloud Storage.
spark.sql("""CREATE TABLE IF NOT EXISTS CATALOG_NAME.NAMESPACE.TABLE_NAME (
    id integer,
    name string)
USING iceberg
LOCATION 'gs://BUCKET/FOLDER/NAMESPACE/TABLE_NAME'""")

// Write the DataFrame to the Iceberg table in Cloud Storage.
data.writeTo("CATALOG_NAME.NAMESPACE.TABLE_NAME").append()

Read data from an Iceberg Table

You can read data from an Iceberg table using Spark. The following code snippets read the table, and then display its contents.

PySpark

# Read Iceberg table data into a DataFrame.
df = spark.read.format("iceberg").load("CATALOG_NAME.NAMESPACE.TABLE_NAME")
# Display the data.
df.show()
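
If you need the DataFrame-API equivalent of the time-travel SQL shown earlier, Iceberg Spark reads also accept snapshot options. The following sketch uses a hypothetical snapshot ID.

# Sketch: read an earlier snapshot of the table (hypothetical snapshot ID).
df_at_snapshot = (
    spark.read.format("iceberg")
    .option("snapshot-id", 4348241217270083322)
    .load("CATALOG_NAME.NAMESPACE.TABLE_NAME")
)
df_at_snapshot.show()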

Scala

// Read Iceberg table data into a DataFrame.
val df = spark.read.format("iceberg").load("CATALOG_NAME.NAMESPACE.TABLE_NAME")

// Display the data.
df.show()

Spark SQL

SELECT * FROM CATALOG_NAME.NAMESPACE.TABLE_NAME

Hive

Create an Iceberg Table in Hive

Dataproc clusters pre-configure Hive to work with Iceberg.

To run the code snippets in this section, complete the following steps:

  1. Use SSH to connect to the master node of your Dataproc cluster.

  2. Start beeline in the SSH terminal window:

    beeline -u jdbc:hive2://
    

You can create an unpartitioned or partitioned Iceberg table in Hive.

Unpartitioned table

Create an unpartitioned Iceberg table in Hive.

CREATE TABLE my_table (
  id INT,
  name STRING,
  created_at TIMESTAMP
) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';

Partitioned table

Create a partitioned Iceberg table in Hive by specifying the partition columns in the PARTITIONED BY clause.

CREATE TABLE my_partitioned_table (
  id INT,
  name STRING
) PARTITIONED BY (date_sk INT)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';

Insert data into an Iceberg Table in Hive

You can insert data into an Iceberg table using standard Hive INSERT statements.

SET hive.execution.engine=mr;

INSERT INTO my_table
SELECT 1, 'Alice', current_timestamp();
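
Inserting into the partitioned table created earlier follows a similar sketch. With the Iceberg Hive integration, the partition column is supplied like any other column value, without a PARTITION clause; this example assumes the my_partitioned_table definition shown previously and the MR engine setting above.

INSERT INTO my_partitioned_table
SELECT 2, 'Bob', 20240101;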

Limitations

  • Only the MR (MapReduce) execution engine is supported for DML (data manipulation language) operations.
  • The MR execution engine is deprecated in Hive 3.1.3.

Read data from an Iceberg Table in Hive

To read data from an Iceberg table, use a SELECT statement.

SELECT * FROM my_table;

Drop an Iceberg table in Hive

To drop an Iceberg table in Hive, use the DROP TABLE statement.

DROP TABLE my_table;

For more information, see the Apache Iceberg documentation.