How Baidu Waimai Scaled Traffic Analysis with Apache Kylin: A Deep Dive
This case study covers Baidu Waimai's traffic analysis platform: the data challenges of high dimensionality and volume, the evaluation of OLAP engines, and the adoption of Apache Kylin for pre-computation. It then walks through the end-to-end data modeling, cube construction, incremental builds, and integration with Saiku-Mondrian reporting, and shares practical lessons and performance gains.
Background and Business Scenario
Baidu Waimai's traffic analysis platform monitors user navigation paths across multiple dimensions such as region, city, business district, device, version, and channel. The platform supports funnel, trend, comparison, and distribution analyses, requiring hourly and regional granularity and full‑path visibility.
Data Challenges
The platform faces three major challenges:
High dimensionality: 9 basic dimensions plus 36 path dimensions, leading to a combinatorial explosion.
Massive data volume: 250 million raw log rows per day, 7 million path rows daily, and up to 200 million rows of full‑path data retained for three months, reaching the hundred‑billion‑row scale.
Complex query scenarios: diverse analytical needs (funnel, trend, distinct-count UV, order amount) demand both responses within seconds and flexible aggregation.
Technology Evaluation and Selection
Two OLAP architectures were compared:
MPP (Impala, Presto, Spark): supports arbitrary SQL and high flexibility but cannot guarantee sub-second response at large scale.
Pre-computation (Kylin, Druid): performs heavy aggregation during data loading, offering stable second-level query latency at the cost of flexibility.
Given Baidu Waimai's need for billion‑row support, second‑level latency, and primarily aggregated queries, the pre‑computation approach was chosen.
Why Kylin Over Druid
Kylin provides native SQL/JDBC support, precise COUNT(DISTINCT) via bitmap, simple segment refresh, and better handling of high‑dimensional cubes, whereas Druid focuses on real‑time ingestion and approximate distinct counts, making Kylin a better fit for the platform's daily batch aggregation workflow.
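To make the point about precise COUNT(DISTINCT) concrete, here is a minimal sketch of why UV cannot simply be summed across days and why a mergeable structure such as a bitmap helps. Python sets stand in for the bitmaps; the day labels and user IDs are made up for illustration, and this is not Kylin's actual encoding.

```python
# Stand-in for per-segment bitmaps: one set of user IDs per day (made-up data).
daily_users = {
    "day1": {"u1", "u2", "u3"},
    "day2": {"u2", "u3", "u4"},
    "day3": {"u1", "u4", "u5"},
}


def naive_uv(days):
    """Summing per-day UV double-counts users who return on later days."""
    return sum(len(daily_users[d]) for d in days)


def exact_uv(days):
    """Union the per-day sets first, then count (the analogue of a bitmap OR)."""
    merged = set()
    for d in days:
        merged |= daily_users[d]
    return len(merged)


days = sorted(daily_users)
print(naive_uv(days))  # 9
print(exact_uv(days))  # 5
```

Because the union is exact, the same per-day structures can answer UV for any date range without rescanning raw logs.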
Apache Kylin Overview
Apache Kylin is an open‑source distributed OLAP engine built on Hadoop. It pre‑computes multi‑dimensional cubes from Hive tables using MapReduce or Spark, stores the results in HBase, and serves queries via standard SQL with sub‑second response.
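As a rough illustration of serving queries via standard SQL, the sketch below builds (without sending) a request for Kylin's REST query endpoint. The host, project name, and credentials are hypothetical, and the SQL assumes the fact_flow table and the INDEX_DAY, PATH_ID, and cuid columns mentioned elsewhere in this article.

```python
import base64
import json
import urllib.request

# Hypothetical host; replace with a real Kylin server.
KYLIN_QUERY_URL = "http://kylin.example.com:7070/kylin/api/query"


def build_query_request(sql, project, user, password):
    """Build (but do not send) a POST request carrying a SQL query."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    body = json.dumps({"sql": sql, "project": project, "limit": 1000}).encode()
    return urllib.request.Request(
        KYLIN_QUERY_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Assumed column names; PV and UV per day and path.
sql = (
    "SELECT INDEX_DAY, PATH_ID, COUNT(*) AS pv, COUNT(DISTINCT cuid) AS uv "
    "FROM fact_flow GROUP BY INDEX_DAY, PATH_ID"
)
req = build_query_request(sql, "traffic", "ADMIN", "secret")
# urllib.request.urlopen(req) would send the query to a live server.
```

An aggregate query like this is answered from the pre-computed cube rather than by scanning the Hive table.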
Data Modeling and Cube Construction
The platform adopts a star schema with a fact table fact_flow and a dimension table dim_path. To avoid exponential cube growth, the 36 path dimensions are collapsed into a single PATH_ID, reducing total dimensions from over 40 to 11.
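The effect of collapsing the path dimensions can be checked with simple arithmetic: a full cube over n dimensions has one cuboid per subset of dimensions, i.e. 2^n. The sketch below uses the dimension counts quoted in this article and ignores any further pruning.

```python
def full_cuboid_count(num_dimensions):
    """A full cube materializes one cuboid per subset of dimensions: 2**n."""
    return 2 ** num_dimensions


before = full_cuboid_count(9 + 36)  # 9 basic + 36 path dimensions
after = full_cuboid_count(11)       # after collapsing paths into PATH_ID

print(f"{before:,}")  # 35,184,372,088,832
print(f"{after:,}")   # 2,048
```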
Kylin's dimension optimization techniques—aggregation groups, mandatory dimensions (e.g., INDEX_DAY, PATH_ID), hierarchical dimensions (region, city, business district), and derived dimensions for path names—are applied to keep the cube size manageable.
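A back-of-the-envelope sketch of how these techniques shrink the cube further. It assumes a single aggregation group in which mandatory dimensions appear in every cuboid and a k-level hierarchy contributes k + 1 valid combinations instead of 2^k. The split of the 11 dimensions used here (2 mandatory, 3 hierarchical, 6 normal) is an assumption for illustration, not the platform's actual configuration.

```python
def pruned_cuboid_count(total_dims, mandatory, hierarchy_levels):
    """Estimate cuboids in one aggregation group.

    - Mandatory dimensions are always present, so they add no choices.
    - A hierarchy of k levels allows only k + 1 prefixes
      (none, L1, L1+L2, ...), not 2**k subsets.
    - Every remaining dimension is either in or out of a cuboid.
    """
    free = total_dims - mandatory - hierarchy_levels
    return (2 ** free) * (hierarchy_levels + 1)


print(2 ** 11)                        # 2048 cuboids with no pruning
print(pruned_cuboid_count(11, 2, 3))  # 256 cuboids with the assumed split
```

Derived dimensions are not counted here because they are resolved from the lookup table rather than materialized in the cuboids.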
Key steps:
Define the star model in Kylin's Model UI, linking fact_flow to dim_path via PATH_ID.
Configure dimensions: normal dimensions for most fields, derived dimensions for path attributes.
Configure measures: PV (COUNT), UV (COUNT DISTINCT on cuid), and various monetary sums.
Schedule daily incremental cube builds via Kylin’s REST API after the ETL populates fact_flow in Hive.
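The last step can be sketched as follows: compute the previous day's time range and prepare a build request for Kylin's REST API. The host, cube name, and credentials are hypothetical, the endpoint path and body fields follow Kylin's documented build API as best understood here, and the request is built but not sent.

```python
import base64
import json
import urllib.request
from datetime import date, datetime, timedelta, timezone


def day_range_ms(day):
    """Return [start, end) of a calendar day as UTC epoch milliseconds."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


def build_cube_request(base_url, cube, day, user, password):
    """Build (but do not send) a PUT request that builds one daily segment."""
    start_ms, end_ms = day_range_ms(day)
    body = json.dumps(
        {"startTime": start_ms, "endTime": end_ms, "buildType": "BUILD"}
    ).encode()
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/kylin/api/cubes/{cube}/build",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


yesterday = date.today() - timedelta(days=1)
# Hypothetical host and cube name.
req = build_cube_request(
    "http://kylin.example.com:7070", "traffic_flow_cube", yesterday, "ADMIN", "secret"
)
# A scheduler would call urllib.request.urlopen(req) once the Hive ETL finishes.
```

Triggering the build only after the ETL job completes avoids building a segment over a partially loaded day.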
Incremental Build, Segment Management, and Cleanup
Kylin partitions cubes into time-based Segments. Each daily build creates a new Segment for the previous day's data. To prevent Segment fragmentation, automatic merging combines daily Segments every 7 days, and a retention threshold of about 100 days discards older Segments.
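The interaction of daily builds, 7-day merges, and retention can be pictured with a small simulation. This is a simplified model written for illustration, not Kylin's actual merge algorithm: segments are half-open ranges of day indices, seven consecutive one-day segments are merged into one, and segments that end more than 100 days before the newest data are dropped.

```python
MERGE_DAYS = 7        # merge once seven daily segments have accumulated
RETENTION_DAYS = 100  # drop segments older than this many days


def add_daily_segment(segments, day):
    """Append the segment [day, day + 1), then apply merge and retention."""
    segments = segments + [(day, day + 1)]

    # Merge the trailing run of seven one-day segments into a weekly segment.
    tail = segments[-MERGE_DAYS:]
    if len(tail) == MERGE_DAYS and all(end - start == 1 for start, end in tail):
        segments = segments[:-MERGE_DAYS] + [(tail[0][0], tail[-1][1])]

    # Retention: keep only segments that end within the retention window.
    newest_end = segments[-1][1]
    return [(s, e) for s, e in segments if e > newest_end - RETENTION_DAYS]


segments = []
for day in range(120):
    segments = add_daily_segment(segments, day)

print(len(segments))  # 16
print(segments[0])    # (14, 21)
print(segments[-1])   # (119, 120)
```

After 120 daily builds the cube holds 15 weekly segments and one daily segment instead of 120 small ones, which is the fragmentation the merge setting is meant to prevent.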
Extended Applications
Beyond basic funnel analysis, the platform plans to:
Support full-path analysis (potentially more than 700,000 path combinations) with further dimension optimization.
Perform distribution analyses across city, version, channel, and business district, tracking metrics such as order UV, conversion rates, and subsidies.
Integration with Saiku‑Mondrian Reporting System
The existing reporting stack (Saiku + Mondrian + Impala) was extended to include a Kylin‑backed engine for fixed‑query workloads, reducing cluster pressure and improving latency.
Key integration challenges and solutions:
Adding Kylin dialect support to Mondrian 4.4 (referencing an open‑source GitHub project).
Fixing a COUNT(DISTINCT) SQL generation bug in Mondrian for Kylin.
Addressing a Kylin 2.0 array‑out‑of‑bounds bug (resolved in Kylin 2.1).
Enabling left‑join support in Mondrian to handle nullable fields without massive view rewrites.
Upgrading Mondrian schemas from version 3 to 4 and adapting view‑based queries to Kylin‑compatible models.
Modifying Saiku 3.14 to store schema metadata in MySQL and enforce user/metric permissions.
After these adaptations, users can drag‑and‑drop dimensions and measures in Saiku, generating MDX that Mondrian translates to SQL for Kylin, delivering results in seconds without stressing the Impala cluster.
Conclusion
By adopting Apache Kylin, Baidu Waimai achieved scalable, low‑latency traffic analysis on a hundred‑billion‑row dataset, simplified data modeling through dimension reduction, and off‑loaded repetitive reporting queries from Impala. Ongoing work focuses on full‑path analytics and richer distribution analyses.
ITPUB
