
OLAP Engine Selection and Challenges in Large-Scale Data at Youku

This article explores the challenges big data brings to traditional data technologies and reviews various OLAP solutions—including MPP, batch processing, pre‑computation, and Hadoop‑based engines—while detailing Youku’s specific business scenarios and how different OLAP engines are selected to meet performance, scalability, and real‑time analysis requirements.

DataFunSummit

Nov 12, 2020


Introduction

Data-driven decision making is essential for development, product, and operations teams. This article examines the challenges big data poses to traditional data technologies and surveys the OLAP engines used at Youku to support diverse analytical needs.

Challenges of Big Data

Processing billions of rows with a traditional MySQL database can take minutes, which is unacceptable for real-time analytics. Two primary mitigation strategies are discussed: increasing concurrency (spreading a query across MPP nodes or parallel batch tasks) and pre-computing results.

Concurrency-Based Solutions

1. MPP Architecture: Engines like Greenplum run multiple PostgreSQL instances and distribute each query across the nodes. This provides parallelism, but horizontal scalability is limited, and because every node takes part in a query, a single slow or failed node can hold up the whole query.
2. Batch-Processing Architecture: Frameworks such as MapReduce and Spark assign tasks to a subset of nodes, decoupling computation from storage. This improves scalability but introduces disk I/O overhead.
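The scatter-gather idea behind concurrency-based engines can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Greenplum or Spark is implemented: the rows, the round-robin sharding, and the function names are all hypothetical, and threads stand in for separate nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows: (video_id, play_count). In a real engine each shard
# would live on a different node; here everything is in one process.
ROWS = [("v1", 3), ("v2", 5), ("v1", 2), ("v3", 7), ("v2", 1), ("v1", 4)]


def split_into_shards(rows, num_shards):
    """Distribute rows round-robin across shards (an assumed scheme)."""
    return [rows[i::num_shards] for i in range(num_shards)]


def partial_sum(shard):
    """Scatter step: each worker aggregates only its own shard."""
    totals = Counter()
    for video_id, plays in shard:
        totals[video_id] += plays
    return totals


def scatter_gather_sum(rows, num_shards=3):
    """Run partial aggregations concurrently, then merge the results."""
    shards = split_into_shards(rows, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partials = list(pool.map(partial_sum, shards))
    # Gather step: the coordinator adds up the partial totals.
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)


if __name__ == "__main__":
    print(scatter_gather_sum(ROWS))
```

The gather step cannot finish until every shard has reported, which is why one slow worker delays the whole query in this style of execution.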

Complementarity of Batch and MPP

Batch processing provides robust, scalable offline data cleaning, whereas MPP delivers faster interactive queries on cleaned data. The two approaches are often combined to balance speed and reliability.

MPP on Hadoop

Technologies like Impala and Presto run MPP-style queries on HDFS, bridging the gap between Hadoop's batch layer and low-latency analytics.

Pre-Computation Solutions

Apache Kylin builds cubes on HBase, enabling sub-second query responses after an upfront modeling step. Druid offers built-in roll-up and time-series storage, providing fast OLAP queries without an external KV store.
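To make the roll-up idea concrete, here is a minimal sketch of pre-aggregating raw events into time buckets at ingestion and answering queries from the aggregated table. It is not Druid's or Kylin's actual implementation; the event layout, the one-minute granularity, and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical raw events: (epoch_seconds, video_id, plays).
EVENTS = [
    (0, "v1", 1), (30, "v1", 2), (59, "v2", 1),
    (60, "v1", 4), (90, "v2", 3), (119, "v2", 2),
]


def roll_up(events, granularity_seconds=60):
    """Pre-aggregate events into (time_bucket, video_id) -> total plays."""
    table = defaultdict(int)
    for ts, video_id, plays in events:
        bucket = ts - ts % granularity_seconds  # truncate to bucket start
        table[(bucket, video_id)] += plays
    return dict(table)


def query_total(rolled, video_id, start, end):
    """Answer a query from the rolled-up table; raw events are not read."""
    return sum(
        plays
        for (bucket, vid), plays in rolled.items()
        if vid == video_id and start <= bucket < end
    )


if __name__ == "__main__":
    rolled = roll_up(EVENTS)
    print(len(EVENTS), "raw events ->", len(rolled), "rolled-up rows")
    print(query_total(rolled, "v1", 0, 120))
```

The trade-off is visible even at this scale: queries scan fewer rows, but detail finer than the chosen granularity is lost once the raw events are discarded.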

OLAP Landscape Summary

The surveyed OLAP solutions fall into two categories: (1) concurrency-based (MPP, batch, and MPP on Hadoop) and (2) pre-computation (Kylin, Druid). Each has trade-offs in latency, scalability, and flexibility.

Youku Business Scenarios

1. Real-time API & Monitoring: Uses a custom pre-computation system to serve high-QPS features with minute-level latency.
2. BI Reporting: Combines batch cleaning with MPP warehouses for complex offline analysis.
3. Ad-hoc Real-time Debugging: Employs an ELK-style stack (Elasticsearch + custom BI) for fast fault isolation.

Conclusion

Choosing the right OLAP engine depends on workload characteristics: throughput, latency, and query complexity. Youku uses a mix of MPP, batch, and pre-computation technologies to meet its diverse analytical requirements.

