How Vipshop Scaled Real‑Time OLAP: From GreenPlum to Presto, Kylin, and Redis
Vipshop faced massive data growth that broke traditional RDBMS, causing slow OLAP queries, inefficient ETL, and long development cycles, so it iteratively rebuilt its analytics stack—adding Hadoop/Hive, a self‑service UI, Presto, Kylin, and Redis—to achieve sub‑second query responses, higher concurrency, and a flexible, low‑latency BI solution.

Challenges of Massive Real‑Time OLAP
Vipshop’s traffic grew from millions to hundreds of millions of users, increasing data volume by more than 100×. Traditional RDBMS could no longer store or compute the data, leading to three critical problems:
Slow OLAP query response (often >100 s).
ETL pipelines became bottlenecks, taking hours instead of minutes.
Feature development cycles lengthened because new dimensions or metrics required large data refreshes.
Evolution of the OLAP Architecture
Phase 0 – Greenplum + Tableau
Greenplum was used as an MPP data warehouse and Tableau for BI. This worked for modest data sizes but horizontal scaling limits of Greenplum caused storage and compute exhaustion as data grew.
Phase 1 – Hadoop/Hive + Greenplum
Batch preprocessing moved to Hadoop/Hive; results were synchronized to Greenplum for ad‑hoc analysis. This eliminated the storage bottleneck but introduced data‑sync complexity and highlighted Tableau’s inability to support flexible self‑service analytics.
Phase 2 – Self‑Service UI + SQL Parser
A drag‑and‑drop UI let users compose dimensions and metrics. The UI generated a logical data description that a custom SQL parser translated into executable SQL. The parser exposed the need for a faster, more flexible query engine.
Phase 3 – Adoption of Presto
Presto (Facebook’s open‑source MPP OLAP engine) was introduced because it:
Executes queries in‑memory across a distributed cluster.
Supports joins on HDFS metadata without moving data.
Provides low‑latency query execution.
Benchmarking on identical hardware showed Presto reduced typical query time from >100 s (Greenplum) to ~10 s – a 70 % performance gain.
Phase 3.5 – Redis In‑Memory Cache
Redis was added as a first‑level cache. Query flow:
Check Redis for an exact SQL‑to‑result match.
If miss, fall back to Kylin (MOLAP) or Presto (ROLAP).
Overall cache hit rate reached 15 % (up to 60 % for hot topics), freeing compute resources.
Phase 4 – Hybrid Presto + Kylin Architecture
Kylin (eBay’s open‑source MOLAP engine) handles high‑cardinality cube queries, while Presto serves the remaining ROLAP workload. The routing logic (Redis → Kylin → Presto) covers ~90 % of analysis scenarios and reduces core metric latency to a few seconds.
Engine‑Specific Enhancements
Presto Improvements
Added /*+ HINT */ syntax to let users choose join strategies (replicated vs distributed) at compile time.
Implemented join‑skew detection and mitigation to avoid data‑skew hotspots.
Enabled multi‑cluster load balancing for horizontal scaling.
Rewrote the Kafka connector to support hot‑topic updates and offset‑based reads, including protobuf payload handling.
Developed a Kylin connector that automatically pushes sub‑queries matching existing Kylin cubes to Kylin, reducing the amount of data processed by Presto.
Kylin Improvements
Integrated Presto as a fallback for massive dimension lookups that previously caused OOM in Kylin.
Extended Kylin with Vipshop‑specific business logic (e.g., custom dimension hierarchies).
Implemented a “CUBE Advisor” that analyses dimension‑metric combinations and recommends optimal cube designs, improving cube hit rates.
Provided APIs for ETL scheduler integration to refresh cube data and manage cache lifecycles.
Future Directions
Planned upgrades focus on finer‑grained data pruning and real‑time capabilities:
Replace ORC row‑group indexes with row‑ID level indexes to prune data at scan time and increase query concurrency.
Explore streaming cubing in Kylin to ingest real‑time data into cubes.
Investigate Lambda architecture‑style “streaming + batch” cubing for seamless real‑time and offline cube fusion.
Research next‑generation engines that tightly couple raw and aggregated data while scaling compute horizontally without stateful nodes.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
