Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Using Hadoop, Flume, Kafka, Spark, and Flink
This article details the three‑stage evolution of 58.com’s commercial data warehouse, describing its massive scale, four‑layer architecture, technical challenges, migrations from MapReduce to Hive and Flink, real‑time streaming upgrades, and the resulting improvements in stability, accuracy, and timeliness.
Early on, commercial data warehouses were built on Oracle, but as data volumes grew, traditional solutions could not handle the scale, prompting a shift to distributed big‑data technologies.
Warehouse Scale : The 58 data warehouse now ingests over 25 TB of new data daily, runs more than 2,000 jobs, consumes about one‑third of the platform’s resources, and is operated by a team of 15+ engineers.
Four‑Layer Architecture :
ODS (Source Layer): data collection, transmission, and offline/real‑time sources.
DWD (Detail Layer): business, customer, advertising, and user‑behavior data warehouses.
DWA (Aggregation Layer): common dimensions and metrics to reduce development cost.
APP Layer: business scenarios, analysis themes, OLAP engines (e.g., Smart Data, Merchant Advisor, Monitoring, Effect Data, Feature Mining).
Commercial Warehouse 1.0
Business Background : Initial stage, when total data volume was still small but growing rapidly (≈100% month‑over‑month).
Technical Situation : Data transfer relied on rsync (batch‑only), scheduling used dsap (crontab‑like, with no dependency management between jobs), and processing was done with hand-written MapReduce jobs, resulting in low development efficiency.
Scheduling Upgrade :
To address dsap’s instability, a file‑dependency method was introduced, eventually evolving into the 58DP tool platform.
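The file‑dependency idea can be sketched as follows. This is a minimal illustration, not the dsap or 58DP implementation: the marker file name `_SUCCESS`, the directory layout, and the function names are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Iterable

DONE_MARKER = "_SUCCESS"  # hypothetical marker written by a finished upstream job


def upstream_ready(upstream_dirs: Iterable[Path]) -> bool:
    """Return True only if every upstream output directory contains the marker."""
    return all((d / DONE_MARKER).is_file() for d in upstream_dirs)


def run_if_ready(upstream_dirs: Iterable[Path], output_dir: Path,
                 job: Callable[[Path], object]) -> bool:
    """Run `job` once all upstream markers exist, then publish our own marker.

    Returns True if the job ran, False if a dependency was still missing.
    """
    if not upstream_ready(list(upstream_dirs)):
        return False
    output_dir.mkdir(parents=True, exist_ok=True)
    job(output_dir)
    # Write the marker last so downstream jobs never read partial output.
    (output_dir / DONE_MARKER).touch()
    return True
```

A scheduler built this way only needs to retry `run_if_ready` periodically: a job starts when its inputs are complete rather than at a fixed clock time.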
Code Upgrade :
ODS and DWD layers continued using MapReduce, while DWA and APP layers migrated to Hive SQL.
Transmission Upgrade :
rsync was replaced by Apache Flume + Kafka to solve timeliness issues.
Code Optimization :
MapReduce jobs in ODS/DWD were optimized with setup and DistributedCache tweaks; Hive queries in the APP layer were also tuned.
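The source does not say exactly which changes were made. One common pattern that fits the description is to ship a small lookup table to every task through the DistributedCache and load it once in `setup`, so that `map` performs only in‑memory lookups. The sketch below imitates that pattern in Python for readability; the production jobs were MapReduce jobs, and the class name, field layout, and lookup table here are hypothetical.

```python
import csv
import io
from typing import Dict, Iterator, TextIO, Tuple


class EnrichMapper:
    """Python stand-in for a mapper that joins records against a cached table."""

    def __init__(self) -> None:
        self.city_names: Dict[str, str] = {}

    def setup(self, cached_file: TextIO) -> None:
        # Runs once per task: load the small lookup table into memory.
        for row in csv.reader(cached_file):
            if len(row) == 2:
                city_id, city_name = row
                self.city_names[city_id] = city_name

    def map(self, line: str) -> Iterator[Tuple[str, int]]:
        # Runs once per record: an in-memory lookup only, no file I/O.
        city_id, clicks = line.rstrip("\n").split("\t")
        yield self.city_names.get(city_id, "unknown"), int(clicks)


if __name__ == "__main__":
    mapper = EnrichMapper()
    mapper.setup(io.StringIO("1,Beijing\n2,Shanghai\n"))
    for key, value in mapper.map("1\t3\n"):
        print(key, value)
```

The saving comes from moving the table load out of the per‑record path: the cost is paid once per task instead of once per input line.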
Metric Definition :
A unified data standard and metric logic were published at the application layer.
Monitoring :
Added monitoring for downstream data availability, job completion times, and metric fluctuations.
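A metric‑fluctuation check of this kind might look like the sketch below. The trailing‑mean baseline and the 30% threshold are assumptions chosen for illustration; the article does not state which rule was used.

```python
from statistics import mean
from typing import Sequence


def fluctuation_ratio(history: Sequence[float], today: float) -> float:
    """Relative deviation of today's value from the mean of recent history."""
    if not history:
        raise ValueError("history must not be empty")
    baseline = mean(history)
    if baseline == 0:
        return 0.0 if today == 0 else float("inf")
    return abs(today - baseline) / abs(baseline)


def should_alert(history: Sequence[float], today: float,
                 threshold: float = 0.3) -> bool:
    """Flag the metric when it moves more than `threshold` from the baseline."""
    return fluctuation_ratio(history, today) > threshold
```

Comparing against a trailing average rather than only yesterday's value keeps a single noisy day from triggering or suppressing an alert.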
Traffic Source Classification :
Traffic sources were identified from tracking parameters (SPM); when the parameters were missing or insufficient, Nginx logs were used to distinguish the source.
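The parameter‑first, log‑fallback logic can be sketched as below. The query parameter name `spm`, its dot‑separated format, the category labels, and the search‑engine host list are assumptions for the example; the referer is taken as an already extracted Nginx log field.

```python
from urllib.parse import parse_qs, urlparse

# Illustrative list only; a real deployment would maintain its own.
SEARCH_ENGINE_HOSTS = ("baidu.com", "sogou.com", "google.com")


def classify_source(request_url: str, referer: str = "") -> str:
    """Classify a request by its SPM parameter, falling back to the referer."""
    query = parse_qs(urlparse(request_url).query)
    spm = query.get("spm", [""])[0]
    if spm:
        # Assumed format: the first dot-separated segment names the channel.
        return "spm:" + spm.split(".")[0]

    # Fallback: Nginx writes "-" for an empty referer, which has no hostname.
    host = urlparse(referer).hostname or ""
    if not host:
        return "direct"
    if any(host == h or host.endswith("." + h) for h in SEARCH_ENGINE_HOSTS):
        return "search"
    return "referral"
```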
Summary of Phase 1 : Focused on data stability, accuracy, and timeliness.
Commercial Warehouse 2.0
Background : 1.0 suffered from insufficient data richness and limited real‑time capabilities.
Full‑Site Behavior Data Construction :
PC/Mobile: Standard parameter passing from impression to list/detail/behavior pages.
APP: Data transmitted via sidDict parameter.
Key Problems :
High cost of ensuring correct parameter propagation across many processes.
Frequent business‑line iterations causing high debugging cost.
Low user‑behavior matching rate, hitting a bottleneck.
Solution: Adopt a state‑machine approach using user identifiers (Cookie, ID, IMEI) to link data, achieving >95% matching, reduced development cost, high maintainability, and strong extensibility.
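A minimal sketch of such a state machine follows. It assumes a four‑step funnel (impression, click, detail, contact) and that only the impression carries the ad identifier; these states, the `Event` fields, and the function name are hypothetical, since the article does not describe the actual states. The point it illustrates is that later events are linked by user identifier and order alone, with no parameters passed from page to page.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Event:
    user_key: str                 # Cookie, user ID, or IMEI
    ts: int                       # event time, e.g. epoch seconds
    kind: str                     # "impression", "click", "detail", "contact"
    ad_id: Optional[str] = None   # known only on impressions in this sketch


# (current state, incoming event kind) -> next state
NEXT_STATE = {
    ("impression", "click"): "click",
    ("click", "detail"): "detail",
    ("detail", "contact"): "contact",
}


def link_events(events: Iterable[Event]) -> List[Tuple[str, Optional[str], int]]:
    """Attribute each completed funnel to the ad of the user's last impression."""
    state: Dict[str, Tuple[str, Optional[str]]] = {}
    linked: List[Tuple[str, Optional[str], int]] = []
    for ev in sorted(events, key=lambda e: (e.user_key, e.ts)):
        if ev.kind == "impression":
            # A new impression always restarts the funnel for this user.
            state[ev.user_key] = ("impression", ev.ad_id)
            continue
        current = state.get(ev.user_key)
        if current is None:
            continue  # no impression seen yet, so the event cannot be matched
        next_state = NEXT_STATE.get((current[0], ev.kind))
        if next_state is None:
            continue  # unexpected event for this state; keep waiting
        if next_state == "contact":
            linked.append((ev.user_key, current[1], ev.ts))
            del state[ev.user_key]
        else:
            state[ev.user_key] = (next_state, current[1])
    return linked
```

Supporting a new business line then means adding transitions to the table rather than threading new parameters through every page, which is one way to read the claimed gains in maintainability and extensibility.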
Real‑Time Enhancements :
Adopted Kafka + Spark Streaming + Druid, later upgraded to Flink for real‑time AB testing and insight platforms.
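To make the real‑time aggregation concrete, the sketch below counts events per experiment variant in fixed one‑minute (tumbling) windows, which is the kind of computation a streaming job would feed to an AB‑testing dashboard. It is plain Python over an in‑memory list, not Spark Streaming or Flink code, and the event tuple layout is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (event time in epoch seconds, experiment variant, metric name)
AbEvent = Tuple[int, str, str]


def tumbling_counts(events: Iterable[AbEvent],
                    window_seconds: int = 60) -> Dict[Tuple[int, str, str], int]:
    """Count events per (window start, variant, metric) in tumbling windows."""
    counts: Counter = Counter()
    for ts, variant, metric in events:
        # Align the timestamp down to the start of its window.
        window_start = ts - ts % window_seconds
        counts[(window_start, variant, metric)] += 1
    return dict(counts)
```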
Commercial Warehouse 3.0
Systematic Integration : Consolidated independent DWD “chimney” warehouses into a unified data middle platform, improving data reuse.
Productization : Defined standard ODS formats, mapped legacy data, built a data funnel to produce DWD tables, and delivered products such as promotion effect dashboards, merchant advisor, and smart tree analytics.
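Mapping legacy data onto a standard ODS format can be sketched as a per‑source field mapping. Every name below (the standard fields, the source names, the legacy field names) is hypothetical; the article does not publish the actual schema.

```python
from typing import Any, Dict

# Hypothetical standard ODS record layout.
STANDARD_FIELDS = ("user_key", "event_time", "event_type", "page")

# Hypothetical legacy sources and how their fields map to the standard ones.
LEGACY_MAPPINGS: Dict[str, Dict[str, str]] = {
    "pc_log": {"cookie": "user_key", "ts": "event_time",
               "act": "event_type", "url": "page"},
    "app_log": {"imei": "user_key", "time": "event_time",
                "action": "event_type", "screen": "page"},
}


def to_standard(source: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename a legacy record's fields to the standard ODS layout.

    Unmapped legacy fields are dropped; missing standard fields stay None.
    """
    mapping = LEGACY_MAPPINGS[source]
    row: Dict[str, Any] = {name: None for name in STANDARD_FIELDS}
    for legacy_name, value in record.items():
        standard_name = mapping.get(legacy_name)
        if standard_name is not None:
            row[standard_name] = value
    row["source"] = source  # keep lineage so DWD jobs can trace the origin
    return row
```

Once every source is expressed in the same layout, the downstream funnel that builds DWD tables can be written once instead of once per source.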
Technical Stack : Real‑time AB testing, insight platform, and monitoring built on Flink, Kafka, and Druid.
Conclusion : Over three phases, the 58 commercial data warehouse progressed from establishing stability, accuracy, and timeliness (Phase 1) to enriching data and enhancing real‑time capabilities (Phase 2), and finally to systematizing and productizing the platform (Phase 3).
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
