Using Iceberg Catalogs with HiveCatalog and HadoopCatalog: Table Creation, Data Ingestion, and Querying
This article explains the concept of Iceberg catalogs, compares HiveCatalog and HadoopCatalog, and provides step‑by‑step Spark examples for downloading the Iceberg jar, creating tables, loading data, querying, and examining the underlying metadata and directory structures.
Iceberg manages tables through a catalog component that handles create, drop, and rename operations; the two supported catalogs are HiveCatalog, which stores the table metadata location in a metastore, and HadoopCatalog, which records the metadata path directly in a file system directory.
First, download the Iceberg Spark runtime JAR (e.g., iceberg-spark-runtime-0.8.0-incubating.jar) and place it in Spark’s classpath, then start Spark shell with
./spark-shell --jars ../../iceberg-spark-runtime-0.8.0-incubating.jar.
HiveCatalog example: after creating a Hive database, import the necessary Iceberg and Spark classes, configure the Hive metastore URIs, define a Spark schema, convert it to an Iceberg schema, build a partition spec, and create the table with catalog.createTable(name, icebergSchema, spec). The created table can be inspected in Hive using show tables and show create table. Data is inserted with
df.write.format("iceberg").mode("append").save("hive_iceberg.action_logs_35") and queried via
spark.read.format("iceberg").load("hive_iceberg.action_logs_35").show().
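The HiveCatalog steps above can be sketched in the Spark shell roughly as follows. This is a sketch against the Iceberg 0.8.0-era Java API; the column names, the metastore URI, and the day-based partition choice are illustrative assumptions, not values from the original article.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.PartitionSpec
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.spark.sql.types._

// Spark-side schema for the table (example columns).
val sparkSchema = StructType(Seq(
  StructField("user_id", StringType),
  StructField("event_time", TimestampType),
  StructField("action", StringType)))

// Convert the Spark schema to an Iceberg schema and partition by day(event_time).
val icebergSchema = SparkSchemaUtil.convert(sparkSchema)
val spec = PartitionSpec.builderFor(icebergSchema).day("event_time").build()

// Point the catalog at the Hive metastore (URI is an example value).
val conf = new Configuration()
conf.set("hive.metastore.uris", "thrift://metastore-host:9083")
val catalog = new HiveCatalog(conf)

// Create the table in the hive_iceberg database.
val name = TableIdentifier.of("hive_iceberg", "action_logs_35")
catalog.createTable(name, icebergSchema, spec)
```

With the table registered in the metastore, the df.write/spark.read calls shown above address it by its database-qualified name.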
HadoopCatalog example: import org.apache.iceberg.hadoop.HadoopCatalog and a Hadoop Configuration, instantiate the catalog with the HDFS base path, and create the table similarly. Data is written with
df.write.format("iceberg").mode("append").save("hdfs://.../hadoop_iceberg/action_logs") and read with the same Spark format.
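The HadoopCatalog variant differs mainly in how the catalog is constructed: instead of a metastore URI, it takes a warehouse directory on the file system. A minimal sketch, again against the 0.8.0-era API, with an assumed namenode address and reusing an icebergSchema and spec built as in the HiveCatalog example:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hadoop.HadoopCatalog

// The warehouse path is an example; substitute your own HDFS base directory.
val conf = new Configuration()
val catalog = new HadoopCatalog(conf, "hdfs://namenode:8020/hadoop_iceberg")

// Table metadata will live under <warehouse>/<namespace>/<table> on HDFS.
val name = TableIdentifier.of("hadoop_iceberg", "action_logs")
catalog.createTable(name, icebergSchema, spec)
```

Because no metastore is involved, the table is addressed by its path on HDFS rather than by a database-qualified name, which is why the save/load calls above use the full hdfs:// URI.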
The Iceberg table layout consists of a data directory holding Parquet files and a metadata directory containing JSON and Avro files that record the schema, partition spec, snapshots, manifest lists, and file‑level metadata. For HadoopCatalog, a version-hint.text file records the current version of the table metadata file; HiveCatalog stores this pointer in the metastore table property metadata_location.
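An illustrative directory tree for a HadoopCatalog table makes the layout concrete; the partition value and file names below are placeholders, not output from the article's example:

```
action_logs/
├── data/
│   └── event_time_day=2020-01-01/
│       └── 00000-0-<uuid>.parquet        <- data files
└── metadata/
    ├── v1.metadata.json                  <- table metadata (schema, spec, snapshots)
    ├── snap-<snapshot-id>-<uuid>.avro    <- manifest list for one snapshot
    ├── <uuid>-m0.avro                    <- manifest with file-level metadata
    └── version-hint.text                 <- current metadata version (HadoopCatalog only)
```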
HiveCatalog relies on the metastore for snapshot tracking and works across various storage systems (HDFS, S3, OSS) but introduces a strong dependency on the metastore, whereas HadoopCatalog depends only on HDFS atomic rename semantics, avoiding metastore coupling but limiting it to HDFS‑based storage. The article argues for choosing HadoopCatalog in environments that primarily use HDFS and prefer reduced metastore reliance.
In summary, the guide demonstrates how to set up Iceberg with both catalog types, perform basic CRUD operations, and understand the underlying metadata structures, helping readers decide which catalog best fits their data‑lake architecture.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
