How to Build a High‑Performance Lakehouse with StarRocks and Apache Hive
This guide walks through the core concepts of Apache Hive, its architecture and key features, then shows how to integrate Hive with StarRocks via the Hive Catalog, construct ODS/DWD/DWS/ADS tables, enable DataCache, use materialized views, and handle automatic partition detection for fast lakehouse analytics.
Apache Hive Overview
Apache Hive is a distributed, fault‑tolerant data‑warehouse system built on Hadoop. It stores table and partition metadata in the Hive Metastore (HMS), enabling SQL‑based read/write of petabyte‑scale datasets on HDFS, S3, ADLS, GCS, and other storage systems.
Hive Architecture & Key Features
HiveServer2 (HS2): Supports concurrent clients, JDBC/ODBC, and authentication.
Hive Metastore Server (HMS): Central metadata store compatible with Spark, Presto, and other engines.
Hive ACID: Full ACID support for ORC tables and insert‑only support for other formats.
Hive Iceberg: Native StorageHandler for Apache Iceberg tables, suited for cloud‑native workloads.
Security & Observability: Kerberos authentication, integration with Apache Ranger and Apache Atlas.
Hive LLAP: Low‑latency, interactive SQL with data caching.
Query Optimization: Cost‑Based Optimizer (CBO) based on Apache Calcite.
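For instance, the Calcite-based CBO relies on table and column statistics gathered up front; a hedged HiveQL sketch (the database, table, and partition values are illustrative, taken from the schema used later in this guide):

```sql
-- Collect table-level statistics for one partition so the CBO can
-- estimate row counts and choose better join orders:
ANALYZE TABLE orders.ods_orders PARTITION (order_date='2024-08-05')
  COMPUTE STATISTICS;
-- Column-level statistics further improve selectivity estimates:
ANALYZE TABLE orders.ods_orders COMPUTE STATISTICS FOR COLUMNS;
```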
Advantages of Hive
Provides fundamental data‑warehouse capabilities such as databases, tables, and partitions.
Supports multiple execution engines (MapReduce, Tez, Spark) for performance tuning.
Extensible via UDFs and integrates with Hadoop ecosystem tools (Flume, Sqoop, Oozie).
Well‑suited for batch processing, reporting, and ETL workloads.
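As a small illustration of engine selection and SQL extensibility, a hedged sketch (session-scoped settings; the table name comes from the DDL later in this guide):

```sql
-- Pick the execution engine per session (mr, tez, or spark):
SET hive.execution.engine=tez;
-- Built-in functions (and UDFs registered the same way) extend plain SQL:
SELECT order_id, date_format(order_time, 'yyyy-MM-dd') AS order_day
FROM ods_orders_text
LIMIT 10;
```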
StarRocks Hive Catalog Integration
StarRocks is an MPP database that delivers fast, real‑time analytics on both local and lake data. Using the StarRocks Hive Catalog, users can query data stored in Hive, Iceberg, Hudi, Delta Lake, etc., without data migration. The catalog bridges storage (lake) and compute (StarRocks), providing a unified "single source of truth" for BI/AI workloads.
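For example, once a Hive catalog is registered (the creation statement appears later in this guide), a lake table is queried in place with a three-part name; a minimal sketch, assuming a catalog named `hive_catalog`:

```sql
-- Query the Hive table directly from StarRocks -- no ingestion job:
SELECT order_status, COUNT(*) AS cnt
FROM hive_catalog.orders.ods_orders
GROUP BY order_status;
```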
Data Model & Sample DDL
CREATE DATABASE IF NOT EXISTS orders;
USE orders;
-- ODS tables (raw data)
CREATE EXTERNAL TABLE IF NOT EXISTS ods_orders_text (
order_id STRING,
user_id STRING,
order_time STRING,
product_id STRING,
quantity INT,
price DECIMAL(10,2),
order_status STRING
) COMMENT 'ODS table for raw order data' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS ods_orders (
order_id INT,
user_id INT,
order_time STRING,
product_id INT,
quantity INT,
price DOUBLE,
order_status STRING
) COMMENT 'ODS order table' PARTITIONED BY (order_date STRING) STORED AS PARQUET;
CREATE TABLE IF NOT EXISTS dim_products (
product_id INT,
product_name STRING,
category_id INT,
price DECIMAL(10,2),
product_description STRING
) COMMENT 'Product dimension table' STORED AS PARQUET;
CREATE TABLE IF NOT EXISTS dim_categories (
category_id INT,
category_name STRING,
category_description STRING
) COMMENT 'Category dimension table' STORED AS PARQUET;
-- DWD fact table
CREATE TABLE IF NOT EXISTS dwd_order_facts (
order_id STRING,
user_id STRING,
order_time STRING,
product_id STRING,
quantity INT,
price DECIMAL(10,2),
order_status STRING,
product_name STRING,
category_id STRING,
category_name STRING
) COMMENT 'Order fact table' PARTITIONED BY (order_date DATE) STORED AS PARQUET;
Data Construction Workflow
Directly query Hive tables.
Enable DataCache to accelerate Hive reads.
Combine DataCache with asynchronous materialized views for further speed‑up.
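Before StarRocks reads it, the DWD fact table defined above is populated on the Hive side from ODS and the dimension tables; a hedged HiveQL sketch (the dynamic-partition setting and the explicit casts are assumptions to match the column types in the DDL above):

```sql
-- Allow dynamic partitioning on the order_date partition column:
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE dwd_order_facts PARTITION (order_date)
SELECT o.order_id, o.user_id, o.order_time, o.product_id,
       o.quantity, CAST(o.price AS DECIMAL(10,2)), o.order_status,
       p.product_name, p.category_id, c.category_name,
       CAST(o.order_date AS DATE) AS order_date
FROM ods_orders o
JOIN dim_products   p ON o.product_id = p.product_id
JOIN dim_categories c ON p.category_id = c.category_id
WHERE o.price > 0;
```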
Configuration Steps
# Copy the Hive/Hadoop client configs to every FE and BE conf directory
scp hive-site.xml hdfs-site.xml core-site.xml sr@node:/home/disk1/sr/fe/conf
scp hdfs-site.xml core-site.xml sr@node:/home/disk1/sr/be/conf
# Restart each BE and FE so the new configuration takes effect
./bin/stop_be.sh
./bin/start_be.sh --daemon
./bin/stop_fe.sh
./bin/start_fe.sh --daemon
Hive Catalog Usage Example
CREATE EXTERNAL CATALOG `hive_catalog_krb5`
PROPERTIES (
"hive.metastore.type" = "hive",
"hive.metastore.uris" = "thrift://cs02.starrocks.com:9083",
"type" = "hive"
);
SET CATALOG hive_catalog_krb5;
USE orders;
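To see what the external catalog exposes before querying, the usual SHOW statements accept three-part and two-part names:

```sql
SHOW DATABASES FROM hive_catalog_krb5;
SHOW TABLES FROM hive_catalog_krb5.orders;
```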
SELECT * FROM dws_order_aggregates;
DataCache Configuration
SET enable_scan_datacache = true;
BE configuration (be.conf):
datacache_disk_path = /data2/datacache
datacache_enable = true
datacache_disk_size = 200G
Asynchronous Materialized View
CREATE MATERIALIZED VIEW dwd_order_facts_mv
PARTITION BY str2date(order_date,'%Y-%m-%d')
DISTRIBUTED BY HASH(`order_id`) BUCKETS 12
PROPERTIES ("replication_num" = "3")
REFRESH ASYNC START('2024-08-01 01:00:00') EVERY (INTERVAL 1 DAY) AS
SELECT
o.order_date,
o.order_id,
o.user_id,
o.order_time,
o.product_id,
o.quantity,
o.price,
COALESCE(o.order_status,'UNKNOWN') AS order_status,
p.product_name,
p.category_id,
c.category_name
FROM hive_catalog_krb5.orders.ods_orders o
JOIN hive_catalog_krb5.orders.dim_products p ON o.product_id = p.product_id
JOIN hive_catalog_krb5.orders.dim_categories c ON p.category_id = c.category_id
WHERE o.price > 0;
Automatic Partition Detection
After new data (e.g., 2024‑08‑06) is loaded into the Hive ODS table, refresh the materialized view. The view automatically detects the new partition.
-- Load new ODS data for 2024-08-06 into the text staging table
LOAD DATA LOCAL INPATH '/home/disk1/sr/ods_orders_0806.txt' INTO TABLE ods_orders_text;
-- Rewrite into the partitioned Parquet ODS table. The staging table has no
-- order_date column, so filter on order_time; ids are cast to match the
-- INT columns of ods_orders.
INSERT OVERWRITE TABLE ods_orders PARTITION (order_date='2024-08-06')
SELECT CAST(order_id AS INT), CAST(user_id AS INT), order_time,
       CAST(product_id AS INT), quantity, CAST(price AS DOUBLE), order_status
FROM ods_orders_text
WHERE order_time >= '2024-08-06 00:00:00';
REFRESH MATERIALIZED VIEW dwd_order_facts_mv;
Summary
In most scenarios, combining the StarRocks Hive Catalog with DataCache efficiently satisfies lake‑analysis requirements.
Asynchronous materialized views simplify ETL pipelines, reduce complexity, and maintain high query performance.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lakehouse paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
