JD Retail Traffic Data Warehouse Architecture and Processing Practices
This article presents a comprehensive technical overview of JD.com’s retail traffic data processing pipeline, detailing the multi‑layer data warehouse architecture, real‑time and offline data flows, a large‑scale back‑fill case using Iceberg and OLAP, data‑skew detection and mitigation techniques, and future directions involving unified Flink‑Spark streaming‑batch solutions.
Guest and Context: The session, organized by DataFunTalk, features JD data architect Wang Jingjing, who shares JD's data processing solutions for traffic scenarios.
Overview: The talk is divided into three parts: JD retail traffic data warehouse architecture, data processing in JD retail scenarios, and future exploration of the architecture.
Traffic Definition and Sources: Traffic refers to the collection of user actions on JD pages, sourced from mobile apps, PC, offline stores, external procurement, and partners.
Data Collection and Ingestion: Different terminals use distinct collection methods (SDKs for native apps, JavaScript for PC/H5). Data is written both in real time (to Kafka topics configured via a whitelist) and offline (to the CFS distributed file system), with monitoring of file sizes and source IPs to prevent data loss.
Data Warehouse Layering: JD's warehouse consists of five layers: BDM (raw business data, stored permanently), FDM (business-format conversion and back-fill support), GDM (standardized domain models), ADM (the public data layer, split into ADM-D for detail data and ADM-S for aggregated data), and APP (the application layer serving dashboards through pre-computed results and OLAP queries). A separate dimension layer stores shared dimension data.
Offline Warehouse Structure: The offline stack includes a foundational layer (integrating raw logs by channel and type), a public layer (ADM-D and ADM-S for unified metrics), and a thin application layer that focuses on query efficiency, using pre-computation for hot dimensions and OLAP for cold data.
Real-Time Warehouse Structure: Real-time data flows through RDDM (streams organized by channel, site, and log type) and RADM (business-level aggregations such as product detail, source-to-destination analysis, and path trees), eventually feeding a metric market exposed through unified APIs.
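The "path tree" aggregation mentioned above can be illustrated with a minimal sketch: session page sequences are folded into a prefix tree whose nodes count how many sessions passed through each path. This is a hypothetical pure-Python illustration of the idea, not JD's RADM implementation (which runs on streaming infrastructure).

```python
def build_path_tree(sessions):
    """Aggregate page-visit sequences into a path tree: each node
    records how many sessions passed through that page prefix."""
    root = {"count": 0, "children": {}}
    for pages in sessions:
        node = root
        node["count"] += 1  # every session passes through the root
        for page in pages:
            node = node["children"].setdefault(
                page, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root

# Three sessions; two of them follow home -> item.
sessions = [["home", "item", "cart"], ["home", "item"], ["home", "search"]]
tree = build_path_tree(sessions)
print(tree["children"]["home"]["children"]["item"]["count"])  # -> 2
```

In a streaming setting the same fold would be applied incrementally per event, with the tree state kept in the stream processor's managed state.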
Case Study – "刷岗" (Back-fill): JD implements a back-fill process built on Iceberg tables and OLAP. The workflow creates Iceberg tables, computes MD5 hashes over the changing fields to detect differences, performs row-level upserts, and joins the traffic product table with the fact tables. Compared with Hive, this approach reduces latency and adds ACID transaction support.
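The MD5-based change detection can be sketched in a few lines: hash the fields that may change, compare against the stored fingerprint, and upsert only the rows that differ. This is a simplified, single-machine illustration of the idea (field names and helpers are hypothetical); in the described architecture the comparison and upsert would run as a distributed job writing to Iceberg.

```python
import hashlib

def row_fingerprint(row, changing_fields):
    """Concatenate the fields that may change and hash them, so two
    versions of a row can be compared with a single value."""
    payload = "|".join(str(row.get(f, "")) for f in changing_fields)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def diff_for_upsert(existing, incoming, key, changing_fields):
    """Return incoming rows whose fingerprint differs from the stored
    version (or which are new) -- the upsert candidates."""
    stored = {r[key]: row_fingerprint(r, changing_fields) for r in existing}
    return [
        r for r in incoming
        if stored.get(r[key]) != row_fingerprint(r, changing_fields)
    ]

existing = [
    {"sku": 1, "price": 10, "title": "a"},
    {"sku": 2, "price": 20, "title": "b"},
]
incoming = [
    {"sku": 1, "price": 10, "title": "a"},  # unchanged -> skipped
    {"sku": 2, "price": 25, "title": "b"},  # changed   -> upsert
    {"sku": 3, "price": 30, "title": "c"},  # new       -> upsert
]
changed = diff_for_upsert(existing, incoming, "sku", ["price", "title"])
print([r["sku"] for r in changed])  # -> [2, 3]
```

On an Iceberg table the same logic maps naturally onto a `MERGE INTO ... WHEN MATCHED ... WHEN NOT MATCHED` statement, with the fingerprint comparison deciding which matched rows actually need an update.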
Data Skew Challenges and Mitigation: Skew arises from uneven key distribution. JD monitors data in real time, identifies outlier keys using a 3-sigma rule to set skew thresholds, computes bucket counts accordingly, and schedules jobs based on the resource health of compute queues to maximize utilization.
Future Exploration: JD is exploring a unified Flink + Spark batch-stream architecture to replace the traditional Lambda model, aiming for a single codebase that supports both real-time and offline computation while keeping metrics consistent. Remaining challenges include CDC latency and small-file management.
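The "single codebase" goal can be illustrated with a toy sketch: the business transform is written once and invoked by both a bounded (batch) runner and an unbounded (streaming) runner, so the two paths cannot drift apart. The function names and event shape here are hypothetical; in the target architecture the shared logic would be Flink/Spark SQL rather than Python.

```python
def enrich(event):
    """Business logic written exactly once: normalize a raw
    traffic event into the metric schema."""
    return {
        "page": event["page"].lower(),
        "uv": 1,
        "pv": event.get("clicks", 1),
    }

def run_batch(events):
    """Offline path: apply the shared transform to a bounded dataset."""
    return [enrich(e) for e in events]

def run_stream(event_iter):
    """Real-time path: apply the same transform one event at a time."""
    for e in event_iter:
        yield enrich(e)

events = [{"page": "Home", "clicks": 3}, {"page": "Cart"}]
# Both paths produce identical metrics, by construction.
assert run_batch(events) == list(run_stream(iter(events)))
```

This is the core promise of the unified model: metric consistency falls out of sharing the transform, rather than being reconciled after the fact as in a Lambda architecture.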
Q&A Highlights: Answers cover the impact of bucket-based monitoring, Spark's minute-level latency, and strategies for row-level upserts and partition management.
About DataFunTalk: DataFunTalk is dedicated to sharing and discussing applications of big data and AI technology, aiming to empower a million data scientists. It regularly hosts live tech talks and curates articles on big data, recommendation and search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.