Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Architecture and Technology
This article details the evolution of 58.com's commercial data warehouse across three phases (1.0, 2.0, and 3.0), covering its scale, its four-layer architecture, the migration from rsync transfers and MapReduce jobs to Flume/Kafka ingestion, Hive SQL, and Flink streaming, along with code optimizations, monitoring, and productization for real-time business insights.
Introduction – Early data warehouses relied on Oracle, but growing data volumes forced a shift to distributed big‑data technologies. 58.com’s data warehouse team built a Hadoop‑based platform from scratch and iterated through three major versions.
Warehouse Scale – Daily data growth exceeds 25 TB, with over 2 000 scheduled jobs consuming roughly one‑third of the cluster resources. The team consists of more than 15 engineers.
Four‑Layer Architecture
ODS (Source Layer): data collection, transmission, and raw sources for offline, real‑time, and multidimensional analysis.
DWD (Detail Layer): business, customer, advertising, and site‑wide behavior data warehouses.
DWA (Aggregation Layer): common dimensions and metrics to reduce development effort.
APP Layer: business scenarios, dashboards, monitoring platforms, effect data, and feature mining.
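The flow between the four layers can be illustrated with a minimal in-memory sketch. The log format, field names, and function names below are hypothetical and are not taken from 58.com's system; the point is only to show how a metric aggregated once at the DWA layer can be reused by APP-layer queries instead of rescanning detail data.

```python
from collections import defaultdict

# Hypothetical raw log lines as they might land in the ODS layer
# (tab-separated: day, user, ad, action). The format is an assumption.
ODS_CLICKS = [
    "2020-01-01\tuser_a\tad_1\tclick",
    "2020-01-01\tuser_b\tad_1\tclick",
    "2020-01-01\tuser_a\tad_2\timpression",
    "2020-01-02\tuser_a\tad_1\tclick",
]


def build_dwd(ods_lines):
    """DWD: parse raw lines into cleaned detail records."""
    records = []
    for line in ods_lines:
        day, user, ad, action = line.split("\t")
        records.append({"day": day, "user": user, "ad": ad, "action": action})
    return records


def build_dwa(dwd_records):
    """DWA: aggregate a shared metric (clicks per day and ad) once."""
    clicks = defaultdict(int)
    for rec in dwd_records:
        if rec["action"] == "click":
            clicks[(rec["day"], rec["ad"])] += 1
    return dict(clicks)


def app_daily_clicks(dwa_clicks, day):
    """APP: a dashboard-style query that reuses the DWA metric."""
    return sum(n for (d, _ad), n in dwa_clicks.items() if d == day)


dwd = build_dwd(ODS_CLICKS)
dwa = build_dwa(dwd)
print(app_daily_clicks(dwa, "2020-01-01"))  # prints 2
```

In a real warehouse each function would correspond to a scheduled job writing a table, but the dependency direction (ODS to DWD to DWA to APP) is the same.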
Commercial Warehouse 1.0
Business background: the business was at an early stage, starting from sparse data that then grew explosively (≈100% month-over-month).
Technical status: data was transferred via rsync, scheduled with dsap (a cron-like tool), and processed with MapReduce, all of which suffered from stability and timeliness problems.
Scheduling upgrade: introduced file-dependency method, evolving into the 58DP tool platform.
Code upgrade: retained MapReduce for ODS/DWD, shifted to Hive SQL for DWA and APP layers.
Transmission upgrade: replaced rsync with Apache Flume + Kafka for real-time ingestion.
Code optimization: setup and DistributedCache tweaks for MR jobs; Hive optimizations at the APP layer.
Metric standards: unified data and calculation standards defined for downstream applications.
Monitoring: added table-level, job-completion, and metric-drift monitors.
Traffic source classification: used SPM parameters and Nginx logs to distinguish sources.
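As a rough illustration of the traffic source classification item above, the sketch below extracts an `spm` query parameter from an Nginx access-log line and maps it to a source. The log layout, the meaning of the SPM segments, and the prefix-to-source table are all assumptions made for the example; the article does not describe 58.com's actual encoding.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed request field of an Nginx access log: "GET /path?query HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

# Hypothetical mapping from the first SPM segment to a traffic source.
SPM_PREFIX_TO_SOURCE = {
    "s": "search_engine",
    "p": "paid_promotion",
    "a": "app_push",
}


def classify_source(log_line):
    """Return a traffic-source label for one access-log line."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return "unparsed"
    query = parse_qs(urlsplit(match.group(1)).query)
    spm_values = query.get("spm")
    if not spm_values:
        return "natural"  # no SPM parameter: treated here as organic traffic
    prefix = spm_values[0].split(".")[0]
    return SPM_PREFIX_TO_SOURCE.get(prefix, "unknown")


line = ('1.2.3.4 - - [01/Jan/2020:00:00:01 +0800] '
        '"GET /detail/123?spm=s.12.34 HTTP/1.1" 200 512')
print(classify_source(line))  # prints search_engine
```

Keeping the mapping in a lookup table means new sources can be added without touching the parsing logic.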
Commercial Warehouse 2.0
Addressed data richness and real‑time requirements; integrated Flume + Kafka for streaming.
Built full-site behavior data pipelines for PC/M (desktop and mobile web) and APP, standardizing how parameters are passed between pages.
Challenges: complex parameter flows, high debugging cost, low matching rate.
Solution: state‑machine linking using CookieID/IMEI, achieving >95% matching, lower development cost, and high extensibility.
Real-time stack: upgraded from Kafka + Spark Streaming + Druid to a newer Flink-based architecture.
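The article does not spell out how the state-machine linking works, so the following is only a sketch of the general idea under stated assumptions: instead of passing parameters from page to page, events are grouped by device identifier (CookieID on PC/M, IMEI on APP), ordered by time, and walked through a small state machine that attributes a conversion to the most recent list click. The event types, states, and the `slot` field are hypothetical.

```python
from collections import defaultdict

# Hypothetical funnel: list click -> detail view -> contact (conversion).
TRANSITIONS = {
    ("START", "list_click"): "CLICKED",
    ("CLICKED", "list_click"): "CLICKED",   # a newer click replaces the older one
    ("CLICKED", "detail_view"): "VIEWING",
    ("VIEWING", "detail_view"): "VIEWING",
    ("VIEWING", "list_click"): "CLICKED",
    ("VIEWING", "contact"): "START",        # conversion closes the chain
}


def link_events(events):
    """Attribute each contact to the latest list click on the same device.

    Each event is a dict with keys device_id (CookieID or IMEI), ts, type,
    and, for list_click events only, slot.
    """
    by_device = defaultdict(list)
    for ev in events:
        by_device[ev["device_id"]].append(ev)

    linked = []
    for device_id, evs in by_device.items():
        state, slot = "START", None
        for ev in sorted(evs, key=lambda e: e["ts"]):
            next_state = TRANSITIONS.get((state, ev["type"]))
            if next_state is None:
                continue  # event is not valid in the current state; skip it
            if ev["type"] == "list_click":
                slot = ev["slot"]
            elif ev["type"] == "contact":
                linked.append({"device_id": device_id, "slot": slot, "ts": ev["ts"]})
                slot = None
            state = next_state
    return linked


events = [
    {"device_id": "cookie-1", "ts": 1, "type": "list_click", "slot": "top_ad"},
    {"device_id": "imei-9", "ts": 2, "type": "contact"},  # no prior click: ignored
    {"device_id": "cookie-1", "ts": 3, "type": "detail_view"},
    {"device_id": "cookie-1", "ts": 5, "type": "contact"},
]
print(link_events(events))
```

Because the linking depends only on the device identifier and event order, adding a new page type means adding rows to the transition table rather than threading another parameter through every page, which is one plausible reading of the lower development cost and extensibility the article reports.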
Commercial Warehouse 3.0
Systematization: consolidated the siloed ("chimney"-style) DWD warehouses into a data middle platform so that data can be reused.
Productization: standardized ODS formats (e.g., LEGO ad system), built data funnels to DWD, and delivered downstream products such as Effect Data, Merchant Advisor, and Smart Dashboard.
Technical analysis: real-time A/B testing, insight platforms, and monitoring.
Conclusion – The commercial data warehouse has progressed through three stages: 1.0 focused on stability, accuracy, and timeliness; 2.0 enhanced data richness and real‑time capabilities; 3.0 emphasized systematic, product‑oriented architecture and advanced analytics.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
