Designing Hive Data Warehouse Schemas: Fact & Dimension Tables, Partitioning, Tag Aggregation, and ID Mapping
This article explains how to design Hive data warehouse schemas, covering fact and dimension table modeling, partitioned storage strategies, tag aggregation techniques, and ID‑mapping implementations using Hive SQL and UDFs to support user profiling and analytics.
Hive Data Warehouse
Building a user profile requires a data warehouse to store user tag data. Hive, a Hadoop‑based data‑warehouse tool that relies on HDFS, provides an SQL‑like language for querying data stored in HDFS. In practice, Hive is used to store label and user‑feature tables.
According to W.H. Inmon, a data warehouse is "a subject‑oriented, integrated, non‑volatile, time‑variant collection of data to support management decision‑making." 面向主题: Business databases focus on transaction processing and are isolated, while a data warehouse organizes data by subject. 集成: Data is extracted, cleaned, and transformed (ETL) before loading, not merely copied. 非易失: Warehouse data represents historical snapshots and is rarely modified after loading. 随时间变化: Data is periodically appended with a time attribute.
The ETL flow into the warehouse is illustrated below.
During warehouse modeling, fact tables and dimension tables are created.
Fact tables describe business processes and include: 事务事实表: Transaction fact tables (single‑transaction or multi‑transaction) record specific business events such as orders or payments. 周期快照事实表: Periodic snapshot tables measure business state over fixed intervals (e.g., last year’s payments, last 30‑day login days). 累计快照事实表: Cumulative snapshot tables capture intervals between events (e.g., time from purchase to payment).
Dimension tables describe attributes of facts (e.g., product price, brand, model). Slowly changing dimensions are handled by overwriting, versioning, partitioning, or using a link (chain) table.
In the profiling system, Hive stores dimension and fact tables for tags, crowds, and service‑layer data.
Partition Storage
Storing all user tags in a single wide table leads to long ETL runtimes and difficulty adding new tag types.
Solutions include partitioned storage, script performance tuning, and building intermediate tables based on shared data sources.
A partitioned‑storage solution separates tag tables by attribute categories such as demographic, behavior, consumption, risk control, and social attributes:
dw.userprofile_attritube_all (demographic)
dw.userprofile_action_all (behavior)
dw.userprofile_consume_all (consumption)
dw.userprofile_riskmanage_all (risk control)
dw.userprofile_social_all (social)
Creating a wide demographic table in Hive using partitioned tables reduces scan time because queries can target specific partitions.
Tag Aggregation
After inserting tags into partitioned tables, a user’s full tag set is scattered across partitions. Aggregating tags into a single field simplifies analysis.
Example aggregation table definition:
CREATE TABLE `dw.userprofile_userlabel_map_all`
(
`userid` string COMMENT 'userid',
`userlabels` map<string,string> COMMENT 'tagsmap',
)
COMMENT 'userid 用户标签汇聚'
PARTITIONED BY ( `data_date` string COMMENT '数据日期' )A custom UDF cast_to_json converts collected tags into a JSON string. The insertion statement (simplified) is:
insert overwrite table dw.userprofile_userlabel_map_all partition(data_date= "data_date")
select userid,
cast_to_json(concat_ws(',',collect_set(concat(labelid,':',labelweight)))) as userlabels
from "用户各维度的标签表"
where data_date= " data_date "
group by useridID‑MAP
When building user tags, it is crucial to map different identity sources (e.g., userid, cookieid) to the same real user. This eliminates data silos and enables cross‑device behavior analysis.
Slowly changing dimensions are often used for ID‑mapping because a user can have multiple devices and a device can belong to multiple users over time.
Creating a link‑table (chain table) records the history of userid‑cookieid associations with start and end dates.
CREATE TABLE `dw.cookie_user_zippertable`(
`userid` string COMMENT '账号ID',
`cookieid` string COMMENT '设备ID',
`start_date` string COMMENT 'start_date',
`end_date` string COMMENT 'end_date')
COMMENT 'id-map拉链表'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'Daily ETL jobs insert new mappings and update existing records. Example insertion logic:
INSERT OVERWRITE TABLE dw.cookie_user_zippertable
SELECT t.*
FROM (
SELECT t1.userid,
t1.cookieid,
t1.start_date,
CASE WHEN t1.end_date = '99991231' AND t2.userid IS NOT NULL THEN '${data_date}'
ELSE t1.end_date END AS end_date
FROM dw.cookie_user_zippertable t1
LEFT JOIN (SELECT * FROM ods.cookie_user_signin WHERE data_date='${data_date}') t2
ON t1.userid = t2.userid
UNION
SELECT userid,
cookieid,
'${data_date}' AS start_date,
'99991231' AS end_date
FROM ods.cookie_user_signin
WHERE data_date = '${data_date}'
) tQuerying the link table with a specific date returns the active device IDs for a given user.
select *
from dw.cookie_user_zippertable
where start_date <= '20190801' and end_date >= '20190801'Example: retrieve cookie IDs for userid '32101029' on 2019‑08‑01.
select cookieid
from dw.cookie_user_zippertable
where userid='32101029' and start_date <= '20190801' and end_date >= '20190801'When joining userid and cookieid many‑to‑many relationships, apply appropriate filters to avoid data explosion.
Conclusion
This article presented a practical solution for linking userid and cookieid in Hive, illustrating schema design, partitioned storage, tag aggregation, and ID‑mapping techniques that can be extended to cross‑platform user behavior integration.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
