Design and Architecture of a Full‑Chain Data Warehouse for Information Security
The article presents an end‑to‑end data‑warehouse design for information‑security governance. It covers the background motivation, the multi‑layer data architecture, dimension modeling, the bus matrix, real‑time processing (Lambda and Kappa architectures), data‑dictionary integration, and future directions toward unified stream‑batch processing.
Background – Information‑security work has to analyze and validate massive volumes of heterogeneous data (features, policies, user behavior). This calls for a "full‑chain" data warehouse that integrates data from every business line into a densely connected, highly integrated web of data, so that data becomes a productive asset for proactive security work.
Data Layering – The warehouse is divided into six layers:
| Seq | Data Layer | Abbreviation | Purpose |
| --- | --- | --- | --- |
| 1 | Raw Data Layer | RAW | Snapshot of source‑system data, stored daily with full detail. |
| 2 | Basic Data Layer | ODS | Business‑concept organized data with standardized names and codes. |
| 3 | General Data Layer | DWD | Fine‑grained aggregated layer built on star or snowflake models; metrics and dimensions are standardized. |
| 4 | Aggregated Data Layer | DWS | Data marts for specific business needs, designed with star or snowflake schemas. |
| 5 | Dimension Layer | DIM | Dimension tables providing rich attributes, historical traceability, and consistency across common dimensions. |
| 6 | Temporary Layer | TMP | Transient tables to reduce computation difficulty and improve runtime efficiency. |
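As a rough illustration of how the six layers could be enforced in tooling, the sketch below maps a table name to its layer. The `<abbreviation>_<name>` naming convention and the table names are assumptions made for this example; the article only gives the layer names and abbreviations.

```python
from enum import Enum


class Layer(Enum):
    """The six layers from the table, keyed by abbreviation."""
    RAW = "Raw Data Layer"
    ODS = "Basic Data Layer"
    DWD = "General Data Layer"
    DWS = "Aggregated Data Layer"
    DIM = "Dimension Layer"
    TMP = "Temporary Layer"


def layer_of(table_name: str) -> Layer:
    """Return the layer of a table named '<abbreviation>_<name>'.

    The prefix convention is an assumption for this sketch, not something
    the article specifies.
    """
    prefix = table_name.split("_", 1)[0].upper()
    try:
        return Layer[prefix]
    except KeyError:
        raise ValueError(f"unknown layer prefix in {table_name!r}") from None


# Hypothetical table names.
print(layer_of("dwd_login_event").value)
print(layer_of("dim_user").value)
```

A check like this could run in a review step so that a table without a recognized layer prefix is rejected early.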
Dimension Modeling – Two mainstream approaches (normalized vs. dimensional) are compared. Normalized warehouses require heavy upfront work but yield stable long‑term maintenance; dimensional modeling is more agile, suits frequently changing business, and demands less expertise. Four key steps are outlined: selecting business processes, declaring grain, identifying dimensions, and confirming facts.
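The four steps can be walked through on a toy star schema. The sketch below uses Python's built‑in `sqlite3` module purely for illustration; the business process (login attempts), the table and column names, and the sample rows are all hypothetical and do not come from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 1, business process: login attempts (hypothetical example).
-- Step 3, dimensions: who attempted the login, and on which day.
CREATE TABLE dim_user (
    user_key     INTEGER PRIMARY KEY,
    user_name    TEXT,
    account_type TEXT
);
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date TEXT
);
-- Step 2, grain: one row per login attempt.
-- Step 4, fact: is_failed (1 if the attempt failed, else 0).
CREATE TABLE fact_login (
    user_key  INTEGER REFERENCES dim_user(user_key),
    date_key  INTEGER REFERENCES dim_date(date_key),
    is_failed INTEGER
);
""")

conn.executemany("INSERT INTO dim_user VALUES (?, ?, ?)",
                 [(1, "alice", "personal"), (2, "bob", "business")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(20240101, "2024-01-01")])
conn.executemany("INSERT INTO fact_login VALUES (?, ?, ?)",
                 [(1, 20240101, 0), (1, 20240101, 1),
                  (2, 20240101, 1), (2, 20240101, 1)])

# Facts are aggregated by joining to a dimension and grouping on its attributes.
rows = conn.execute("""
    SELECT u.account_type, SUM(f.is_failed) AS failed_logins
    FROM fact_login AS f
    JOIN dim_user AS u ON u.user_key = f.user_key
    GROUP BY u.account_type
    ORDER BY u.account_type
""").fetchall()
print(rows)
```

Declaring the grain before choosing facts matters here: because each row is a single attempt, `is_failed` can be summed along any dimension without double counting.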
Bus Matrix – The bus matrix acts as a map of the warehouse, linking each business process (rows) with common dimensions (columns). It provides a macro view of which processes share which dimensions, enabling quick alignment of data requirements with warehouse structures.
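A bus matrix is small enough to hold in a plain mapping. The sketch below is a minimal illustration: the process names and dimension names are invented for the example, with rows as business processes and the set members as the dimension columns each process uses.

```python
from collections import Counter

# Rows: business processes. Values: the dimension columns each process uses.
# All names here are hypothetical.
bus_matrix = {
    "login": {"user", "device", "date"},
    "post_publish": {"user", "content", "date"},
    "traffic_request": {"device", "date"},
}


def processes_sharing(dimension: str) -> list[str]:
    """Business processes (rows) that use the given dimension (column)."""
    return sorted(p for p, dims in bus_matrix.items() if dimension in dims)


def shared_dimensions() -> list[str]:
    """Dimensions used by more than one process, i.e. candidates for
    a single common definition in the dimension layer."""
    usage = Counter(d for dims in bus_matrix.values() for d in dims)
    return sorted(d for d, n in usage.items() if n > 1)


print(processes_sharing("user"))
print(shared_dimensions())
```

Reading the matrix by column answers "which processes must agree on this dimension", and reading it by row answers "which dimensions does a new requirement on this process need".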
Overall Architecture – The warehouse is split into three logical parts:
General warehouse: stores cross‑business capability data (e.g., hunter‑risk system, cloud authentication).
Business warehouse: built for specific industry‑level analyses.
Subject warehouse: unified, cross‑business subject areas (traffic, content, user, etc.) based on consistent dimensions.
The article likens this layout to an IKEA store: the general warehouse is a public floor that serves developers, while the business warehouse is a dedicated floor that serves analysts.
Real‑Time Evolution – The article compares the Lambda (batch + stream) and Kappa (stream‑only) architectures. Lambda is flexible, but it requires maintaining two engines and risks inconsistent results between the batch and streaming paths. Kappa simplifies the stack to a message queue (e.g., Kafka) plus Flink, with streams written directly into Hive and small files compacted automatically.
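The difference between the two architectures can be shown without any real engine. The sketch below is a toy model in plain Python, not Kafka or Flink code: a list stands in for the message log, and the event data, the cutoff offset, and all function names are assumptions for illustration.

```python
from collections import Counter

# A replayable log of (offset, user) risk events; the data is hypothetical.
events = [(0, "alice"), (1, "bob"), (2, "alice"), (3, "carol"), (4, "alice")]
BATCH_CUTOFF = 3  # offsets below this were covered by the last batch run


def batch_view(log, cutoff):
    """Lambda batch path: recompute counts over the historical part."""
    return Counter(user for offset, user in log if offset < cutoff)


def speed_view(log, cutoff):
    """Lambda speed path: a second implementation for recent events."""
    counts = Counter()
    for offset, user in log:
        if offset >= cutoff:
            counts[user] += 1
    return counts


def lambda_query(log, cutoff):
    """Serving layer: merge the batch and speed views at query time."""
    return batch_view(log, cutoff) + speed_view(log, cutoff)


def kappa_query(log):
    """Kappa: one streaming job; history is handled by replaying the log."""
    state = Counter()
    for _, user in log:
        state[user] += 1
    return state


# The results agree only while both Lambda paths implement identical logic.
print(lambda_query(events, BATCH_CUTOFF) == kappa_query(events))
print(dict(kappa_query(events)))
```

The Lambda side needs two functions that must be kept in sync, which is the maintenance and consistency cost described above; the Kappa side has a single code path but depends on the log being retained long enough to replay.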
Data Dictionary – The data dictionary is the core metadata service (Hive Metastore). It supplies schema information to the streaming platforms, so that feature extraction, model training, and online inference can be set up through configuration rather than code.
Future Outlook – The team is exploring data‑lake‑based stream‑batch integration to replace the current Hive + Kafka pattern. It is also working on emerging security challenges such as attacks carried in unstructured images and text, which require new ways of structuring and linking data.
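One way to picture schema‑driven, zero‑code configuration is the sketch below. It does not call Hive Metastore: an in‑memory dict stands in for the metadata service, and the table name, column names, type names, and `build_extractor` helper are all hypothetical.

```python
# Stand-in for the metadata service: table -> {column: type name}.
# A real deployment would fetch this from the metastore instead.
DATA_DICTIONARY = {
    "dwd_login_event": {
        "user_id": "string",
        "fail_count": "int",
        "risk_score": "double",
    },
}

# How each dictionary type is parsed from a raw stream record.
CASTS = {"string": str, "int": int, "double": float}


def build_extractor(table: str, columns: list[str]):
    """Build a feature extractor from configuration plus dictionary schema.

    The user supplies only a table name and a column list; the types come
    from the dictionary, so no per-feature parsing code is written.
    """
    schema = DATA_DICTIONARY[table]
    missing = [c for c in columns if c not in schema]
    if missing:
        raise KeyError(f"columns not in dictionary for {table}: {missing}")
    casts = [(c, CASTS[schema[c]]) for c in columns]

    def extract(record: dict) -> dict:
        return {c: cast(record[c]) for c, cast in casts}

    return extract


extract = build_extractor("dwd_login_event", ["fail_count", "risk_score"])
features = extract({"user_id": "u1", "fail_count": "3", "risk_score": "0.82"})
print(features)
```

Validating the column list against the dictionary when the extractor is built means a misconfigured feature fails at setup time rather than in the middle of a running stream.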
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.