Using Iceberg Catalogs with HiveCatalog and HadoopCatalog: Table Creation, Data Ingestion, and Querying
This article explains the concept of Iceberg catalogs, compares HiveCatalog and HadoopCatalog, and provides step‑by‑step Spark examples for downloading the Iceberg jar, creating tables, loading data, querying, and examining the underlying metadata and directory structures.
Iceberg manages tables through a catalog component that handles create, drop, and rename operations; the two supported catalogs are HiveCatalog, which stores the table metadata location in a metastore, and HadoopCatalog, which records the metadata path directly in a file system directory.
First, download the Iceberg Spark runtime JAR (e.g., iceberg-spark-runtime-0.8.0-incubating.jar) and place it in Spark’s classpath, then start Spark shell with
./spark-shell --jars ../../iceberg-spark-runtime-0.8.0-incubating.jar.
HiveCatalog example: after creating a Hive database, import the necessary Iceberg and Spark classes, configure the Hive metastore URIs, define a Spark schema, convert it to an Iceberg schema, build a partition spec, and create the table with catalog.createTable(name, icebergSchema, spec). The created table can be inspected in Hive using show tables and show create table. Data is inserted with
df.write.format("iceberg").mode("append").save("hive_iceberg.action_logs_35") and queried via
spark.read.format("iceberg").load("hive_iceberg.action_logs_35").show().
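The HiveCatalog steps above can be sketched in the Spark shell roughly as follows. This is a sketch against the Iceberg 0.8.0-era Java API; the column names, the metastore URI, and the day-based partition choice are illustrative assumptions, not values from the original article.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.PartitionSpec
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.spark.sql.types._

// Spark-side schema for the table (example columns).
val sparkSchema = StructType(Seq(
  StructField("user_id", StringType),
  StructField("event_time", TimestampType),
  StructField("action", StringType)))

// Convert the Spark schema to an Iceberg schema and partition by day(event_time).
val icebergSchema = SparkSchemaUtil.convert(sparkSchema)
val spec = PartitionSpec.builderFor(icebergSchema).day("event_time").build()

// Point the catalog at the Hive metastore (URI is an example value).
val conf = new Configuration()
conf.set("hive.metastore.uris", "thrift://metastore-host:9083")
val catalog = new HiveCatalog(conf)

// Create the table in the hive_iceberg database.
val name = TableIdentifier.of("hive_iceberg", "action_logs_35")
catalog.createTable(name, icebergSchema, spec)
```

With the table registered in the metastore, the df.write/spark.read calls shown above address it by its database-qualified name.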
HadoopCatalog example: import org.apache.iceberg.hadoop.HadoopCatalog and a Hadoop Configuration, instantiate the catalog with the HDFS base path, and create the table similarly. Data is written with
df.write.format("iceberg").mode("append").save("hdfs://.../hadoop_iceberg/action_logs") and read with the same Spark format.
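The HadoopCatalog variant differs mainly in how the catalog is constructed: instead of a metastore URI, it takes a warehouse directory on the file system. A minimal sketch, again against the 0.8.0-era API, with an assumed namenode address and reusing an icebergSchema and spec built as in the HiveCatalog example:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hadoop.HadoopCatalog

// The warehouse path is an example; substitute your own HDFS base directory.
val conf = new Configuration()
val catalog = new HadoopCatalog(conf, "hdfs://namenode:8020/hadoop_iceberg")

// Table metadata will live under <warehouse>/<namespace>/<table> on HDFS.
val name = TableIdentifier.of("hadoop_iceberg", "action_logs")
catalog.createTable(name, icebergSchema, spec)
```

Because no metastore is involved, the table is addressed by its path on HDFS rather than by a database-qualified name, which is why the save/load calls above use the full hdfs:// URI.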
The Iceberg table layout consists of a data directory holding Parquet files and a metadata directory containing JSON and Avro files that record the schema, partition spec, snapshots, manifest lists, and file‑level metadata. For HadoopCatalog, a version-hint.text file records the current version of the table metadata file; HiveCatalog stores this pointer in the metastore table property metadata_location.
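An illustrative directory tree for a HadoopCatalog table makes the layout concrete; the partition value and file names below are placeholders, not output from the article's example:

```
action_logs/
├── data/
│   └── event_time_day=2020-01-01/
│       └── 00000-0-<uuid>.parquet        <- data files
└── metadata/
    ├── v1.metadata.json                  <- table metadata (schema, spec, snapshots)
    ├── snap-<snapshot-id>-<uuid>.avro    <- manifest list for one snapshot
    ├── <uuid>-m0.avro                    <- manifest with file-level metadata
    └── version-hint.text                 <- current metadata version (HadoopCatalog only)
```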
HiveCatalog relies on the metastore for snapshot tracking and works across various storage systems (HDFS, S3, OSS) but introduces a strong dependency on the metastore, whereas HadoopCatalog depends only on HDFS atomic rename semantics, avoiding metastore coupling but limiting it to HDFS‑based storage. The article argues for choosing HadoopCatalog in environments that primarily use HDFS and prefer reduced metastore reliance.
In summary, the guide demonstrates how to set up Iceberg with both catalog types, perform basic CRUD operations, and understand the underlying metadata structures, helping readers decide which catalog best fits their data‑lake architecture.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
