You can install additional components like Iceberg when you create a Dataproc cluster using the Optional components feature. This page describes how you can optionally install the Iceberg component on a Dataproc cluster.
Overview
Apache Iceberg is an open table format for large analytic datasets. It brings the reliability and simplicity of SQL tables to Big Data, while making it possible for engines such as Spark, Trino, PrestoDB, Flink, and Hive to safely work with the same tables at the same time.
When installed on a Dataproc cluster, the Apache Iceberg component installs Iceberg libraries and configures Spark and Hive to work with Iceberg on the cluster.
Key Iceberg features
Iceberg features include the following:
- Schema evolution: Add, remove, or rename columns without rewriting the entire table.
- Time travel: Query historical table snapshots for auditing or rollback purposes (see the example after this list).
- Hidden partitioning: Optimize data layout for faster queries without exposing partition details to users.
- ACID transactions: Ensure data consistency and prevent conflicts.
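For example, time travel lets you query a table as of an earlier snapshot. The following is a minimal PySpark sketch, assuming an existing Iceberg table and a Spark session already configured for Iceberg as described later on this page; the catalog, namespace, table name, and snapshot ID are placeholder values.

# List the table's snapshots by querying the Iceberg snapshots metadata table.
spark.sql(
    "SELECT snapshot_id, committed_at FROM my_catalog.my_namespace.my_table.snapshots"
).show()

# Read the table as of a specific snapshot ID taken from the query above.
df = (
    spark.read.format("iceberg")
    .option("snapshot-id", 1234567890123456789)
    .load("my_catalog.my_namespace.my_table")
)
df.show()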
Compatible Dataproc image versions
You can install the Iceberg component on Dataproc clusters created with image versions 2.2.47 and later. The Iceberg version installed on the cluster is listed on the 2.2 release versions page.
Iceberg related properties
When you create a Dataproc cluster with the Iceberg component, the following Spark and Hive properties are configured to work with Iceberg.
Config file | Property | Default value
---|---|---
/etc/spark/conf/spark-defaults.conf | spark.sql.extensions | org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
/etc/spark/conf/spark-defaults.conf | spark.driver.extraClassPath | /usr/lib/iceberg/lib/iceberg-spark-runtime-spark-version_scala-version.jar
/etc/spark/conf/spark-defaults.conf | spark.executor.extraClassPath | /usr/lib/iceberg/lib/iceberg-spark-runtime-spark-version_scala-version.jar
/etc/hive/conf/hive-site.xml | hive.aux.jars.path | file:///usr/lib/iceberg/lib/iceberg-hive-runtime.jar
/etc/hive/conf/hive-site.xml | iceberg.engine.hive.enabled | true
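To confirm these defaults on a running cluster, you can read them back from a Spark session. A minimal PySpark sketch, run in the pyspark REPL on the cluster master node, where spark is the prebuilt session:

# Print the Iceberg-related Spark properties set by the component.
print(spark.conf.get("spark.sql.extensions"))
print(spark.conf.get("spark.driver.extraClassPath"))
print(spark.conf.get("spark.executor.extraClassPath"))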
Install the Iceberg optional component
Install the Iceberg component when you create a Dataproc cluster. The Dataproc cluster image version list pages show the Iceberg component version included in the latest Dataproc cluster image versions.
Google Cloud console
To create a Dataproc cluster that installs the Iceberg component, complete the following steps in the Google Cloud console:
- Open the Dataproc Create a cluster page. The Set up cluster panel is selected.
- In the Components section, under Optional components, select the Iceberg component.
- Confirm or specify other cluster settings, then click Create.
Google Cloud CLI
To create a Dataproc cluster that installs the Iceberg component, use the
gcloud dataproc clusters create
command with the --optional-components
flag.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --optional-components=ICEBERG \
    other flags ...
Replace the following:
- CLUSTER_NAME: The new cluster name.
- REGION: The cluster region.
REST API
To create a Dataproc cluster that installs the Iceberg optional component,
specify the Iceberg
SoftwareConfig.Component
as part of a
clusters.create
request.
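For illustration, a clusters.create request body that enables the component might include the following softwareConfig snippet (other required cluster fields are omitted):

{
  "clusterName": "CLUSTER_NAME",
  "config": {
    "softwareConfig": {
      "optionalComponents": ["ICEBERG"]
    }
  }
}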
Use Iceberg tables with Spark and Hive
After you create a Dataproc cluster with the Iceberg optional component installed, you can use Spark and Hive to read and write Iceberg table data.
Spark
Configure a Spark session for Iceberg
You can use the gcloud CLI locally, or the spark-shell or pyspark REPLs (Read-Eval-Print Loops) running on the Dataproc cluster master node, to enable Iceberg's Spark extensions and set up the Spark catalog to use Iceberg tables.
gcloud
Run the following gcloud CLI example in a local terminal window or in Cloud Shell to submit a Spark job and set Spark properties to configure the Spark session for Iceberg.
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties="spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --properties="spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog" \
    --properties="spark.sql.catalog.CATALOG_NAME.type=hadoop" \
    --properties="spark.sql.catalog.CATALOG_NAME.warehouse=gs://BUCKET/FOLDER" \
    other flags ...
Replace the following:
- CLUSTER_NAME: The cluster name.
- REGION: The Compute Engine region.
- CATALOG_NAME: Iceberg catalog name.
- BUCKET and FOLDER: The Iceberg catalog location in Cloud Storage.
spark-shell
To configure a Spark session for Iceberg using the spark-shell
REPL on the
Dataproc cluster, complete the following steps:
Use SSH to connect to the master node of the Dataproc cluster.
Run the following command in the SSH session terminal to configure the Spark session for Iceberg.
spark-shell \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.CATALOG_NAME.type=hadoop" \
    --conf "spark.sql.catalog.CATALOG_NAME.warehouse=gs://BUCKET/FOLDER"
Replace the following:
- CATALOG_NAME: Iceberg catalog name.
- BUCKET and FOLDER: The Iceberg catalog location in Cloud Storage.
pyspark shell
To configure a Spark session for Iceberg using the pyspark
REPL on the
Dataproc cluster, complete the following steps:
Use SSH to connect to the master node of the Dataproc cluster.
Run the following command in the SSH session terminal to configure the Spark session for Iceberg:
pyspark \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.CATALOG_NAME.type=hadoop" \
    --conf "spark.sql.catalog.CATALOG_NAME.warehouse=gs://BUCKET/FOLDER"
Replace the following:
- CATALOG_NAME: Iceberg catalog name.
- BUCKET and FOLDER: The Iceberg catalog location in Cloud Storage.
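If you configure the session in application code rather than on the command line, you can set the same properties when building the SparkSession. A minimal PySpark sketch, assuming a script that you submit to the cluster as a Spark job; the catalog name, bucket, and folder are placeholders:

from pyspark.sql import SparkSession

# Build a Spark session with the same Iceberg catalog settings shown above.
spark = (
    SparkSession.builder.appName("iceberg-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hadoop")
    .config("spark.sql.catalog.my_catalog.warehouse", "gs://BUCKET/FOLDER")
    .getOrCreate()
)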
Write data to an Iceberg Table
You can write data to an Iceberg table using Spark. The following code snippets create a
DataFrame
with sample data, create an Iceberg table in Cloud Storage,
and then write the data to the Iceberg table.
PySpark
# Create a DataFrame with sample data.
data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Create an Iceberg table in Cloud Storage.
spark.sql("""CREATE TABLE IF NOT EXISTS CATALOG_NAME.NAMESPACE.TABLE_NAME (
    id integer,
    name string)
USING iceberg
LOCATION 'gs://BUCKET/FOLDER/NAMESPACE/TABLE_NAME'""")

# Write the DataFrame to the Iceberg table in Cloud Storage.
data.writeTo("CATALOG_NAME.NAMESPACE.TABLE_NAME").append()
Scala
// Create a DataFrame with sample data.
val data = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

// Create an Iceberg table in Cloud Storage.
spark.sql("""CREATE TABLE IF NOT EXISTS CATALOG_NAME.NAMESPACE.TABLE_NAME (
    id integer,
    name string)
USING iceberg
LOCATION 'gs://BUCKET/FOLDER/NAMESPACE/TABLE_NAME'""")

// Write the DataFrame to the Iceberg table in Cloud Storage.
data.writeTo("CATALOG_NAME.NAMESPACE.TABLE_NAME").append()
Read data from an Iceberg Table
You can read data from an Iceberg table using Spark. The following code snippets read the table, and then display its contents.
PySpark
# Read Iceberg table data into a DataFrame.
df = spark.read.format("iceberg").load("CATALOG_NAME.NAMESPACE.TABLE_NAME")

# Display the data.
df.show()
Scala
// Read Iceberg table data into a DataFrame.
val df = spark.read.format("iceberg").load("CATALOG_NAME.NAMESPACE.TABLE_NAME")

// Display the data.
df.show()
Spark SQL
SELECT * FROM CATALOG_NAME.NAMESPACE.TABLE_NAME
Hive
Create an Iceberg Table in Hive
Dataproc clusters pre-configure Hive to work with Iceberg.
To run the code snippets in this section, complete the following steps:
Use SSH to connect to the master node of your Dataproc cluster.
Start beeline in the SSH terminal window:
beeline -u jdbc:hive2://
You can create an unpartitioned or partitioned Iceberg table in Hive.
Unpartitioned table
Create an unpartitioned Iceberg table in Hive.
CREATE TABLE my_table (
    id INT,
    name STRING,
    created_at TIMESTAMP
) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
Partitioned table
Create a partitioned Iceberg table in Hive by specifying the partition columns in the
PARTITIONED BY
clause.
CREATE TABLE my_partitioned_table (
    id INT,
    name STRING
) PARTITIONED BY (date_sk INT)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
Insert data into an Iceberg Table in Hive
You can insert data into an Iceberg table using standard Hive INSERT
statements.
SET hive.execution.engine=mr;

INSERT INTO my_table
SELECT 1, 'Alice', current_timestamp();
Limitations
- Only the MapReduce (MR) execution engine is supported for DML (data manipulation language) operations.
- MR execution is deprecated in Hive 3.1.3.
Read data from an Iceberg Table in Hive
To read data from an Iceberg table, use a SELECT
statement.
SELECT * FROM my_table;
Drop an Iceberg table in Hive
To drop an Iceberg table in Hive, use the DROP TABLE
statement.
DROP TABLE my_table;