
Choosing OLAP Solutions for Large-Scale Data at Youku

The article examines the challenges big data brings to traditional technologies and surveys the major OLAP approaches (MPP, batch processing, and pre‑computation), covering Greenplum, Druid, Kylin, and Hadoop‑based engines. It then outlines Youku’s choices for three use cases: real‑time APIs, BI reporting, and ad‑hoc analysis.

DataFunTalk

Data‑driven decision making is now essential across development, product, and operations, and Youku faces massive data volumes that require robust OLAP engines to support diverse analytical needs.

The article first outlines the challenges big data poses to traditional technologies, such as the inability of a single MySQL instance to process billions of rows within acceptable latency.

Two broad strategies are presented to address these challenges: increasing concurrency (through MPP or batch‑processing architectures) and pre‑computing results.

MPP Architecture – exemplified by Greenplum, which runs multiple PostgreSQL instances managed by a master node and distributes each query across all nodes. MPP delivers parallelism, but its horizontal scalability is limited and its tolerance of hardware faults is weak: because every query runs on every node, a single slow node can hold up the whole query.
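The scatter-gather pattern described above can be sketched in a few lines. This is only an illustration, not Greenplum's implementation: threads stand in for segment nodes, and the names `distribute`, `segment_partial_sum`, `master_query`, `video_id`, and `plays` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, num_segments):
    """Hash-distribute rows across segments by a distribution key (video_id)."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[hash(row["video_id"]) % num_segments].append(row)
    return segments


def segment_partial_sum(segment_rows):
    """Work done locally on one segment: a partial SUM(plays) GROUP BY video_id."""
    partial = Counter()
    for row in segment_rows:
        partial[row["video_id"]] += row["plays"]
    return partial


def master_query(segments):
    """The master sends the same query to every segment, then merges the partials."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_partial_sum, segments))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


rows = [
    {"video_id": "v1", "plays": 10},
    {"video_id": "v2", "plays": 7},
    {"video_id": "v1", "plays": 5},
]
segments = distribute(rows, num_segments=3)
totals = master_query(segments)
print(totals["v1"], totals["v2"])  # 15 7
```

Note that `master_query` cannot return until every segment has answered, which is the reason a single slow node affects the whole query.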

Batch‑Processing Architecture – represented by MapReduce and Spark, which assign tasks to a subset of nodes, allowing better scalability but incurring disk I/O overhead.
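A toy, single-process sketch of the MapReduce style may make the disk I/O cost concrete. It is not Hadoop's or Spark's API: the function names and the file layout are assumptions chosen for illustration, and the tasks run sequentially here, whereas a real scheduler would place them on whichever nodes have free capacity.

```python
import json
import os
import tempfile
from collections import defaultdict


def map_task(task_id, rows, num_reducers, workdir):
    """Partition (key, value) pairs by key and spill each partition to disk."""
    buckets = defaultdict(list)
    for row in rows:
        reducer = hash(row["video_id"]) % num_reducers
        buckets[reducer].append((row["video_id"], row["plays"]))
    for reducer, pairs in buckets.items():
        path = os.path.join(workdir, f"map{task_id}-part{reducer}.json")
        with open(path, "w") as f:
            json.dump(pairs, f)


def reduce_task(reducer, num_mappers, workdir):
    """Read this reducer's partition from every map output and sum per key."""
    totals = defaultdict(int)
    for task_id in range(num_mappers):
        path = os.path.join(workdir, f"map{task_id}-part{reducer}.json")
        if not os.path.exists(path):
            continue  # this map task produced no rows for this reducer
        with open(path) as f:
            for key, value in json.load(f):
                totals[key] += value
    return dict(totals)


def run_job(splits, num_reducers=2):
    """Run all map tasks, then all reduce tasks, exchanging data through files."""
    result = {}
    with tempfile.TemporaryDirectory() as workdir:
        for task_id, rows in enumerate(splits):
            map_task(task_id, rows, num_reducers, workdir)
        for reducer in range(num_reducers):
            result.update(reduce_task(reducer, len(splits), workdir))
    return result


splits = [
    [{"video_id": "v1", "plays": 10}, {"video_id": "v2", "plays": 7}],
    [{"video_id": "v1", "plays": 5}],
]
job_result = run_job(splits)
print(sorted(job_result.items()))  # [('v1', 15), ('v2', 7)]
```

Because tasks communicate only through files, a failed task can be rerun on another node without restarting the job, at the price of writing and rereading the intermediate data.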

The two approaches complement each other: batch processing handles offline data cleaning, while MPP provides faster interactive queries on the cleaned data.

Pre‑computation – solutions like Apache Kylin define cubes that pre‑aggregate data and store results in HBase, achieving sub‑second query latency at the cost of flexibility. Druid offers built‑in storage with roll‑up capabilities, serving as an alternative pre‑computation engine.
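The cube idea can be illustrated with a small in-memory sketch. It is not Kylin's or Druid's API: a plain dict stands in for the HBase-backed store, and the dimension names (`date`, `device`, `channel`) and the `plays` measure are assumptions.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("date", "device", "channel")


def build_cube(rows, dimensions=DIMENSIONS, measure="plays"):
    """Pre-aggregate the measure for every subset of dimensions (one cuboid each)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, **filters):
    """Answer SUM(plays) for equality filters with a lookup instead of a scan."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)


rows = [
    {"date": "2024-01-01", "device": "mobile", "channel": "drama", "plays": 10},
    {"date": "2024-01-01", "device": "tv", "channel": "drama", "plays": 4},
    {"date": "2024-01-02", "device": "mobile", "channel": "sports", "plays": 6},
]
cube = build_cube(rows)
print(len(cube))                     # 8 cuboids for 3 dimensions
print(query(cube, device="mobile"))  # 16
print(query(cube))                   # 20
```

The flexibility cost is visible here: only the dimensions and the measure chosen at build time can be queried, and the number of cuboids doubles with each added dimension.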

An overview diagram categorises OLAP solutions into two groups: those that increase concurrency (MPP and batch) and those that rely on pre‑computation.

Youku’s Scenario‑Based OLAP Selection

1. Real‑time API & monitoring – uses a custom pre‑computation system built on Kafka, Flink, and an internal KV store to meet minute‑level latency for features like exposure counts.

2. BI reporting – combines batch processing (for data cleaning and DWS layer creation) with MPP warehouses to support complex, low‑frequency analytical queries.

3. Ad‑hoc real‑time analysis – adopts an ELK‑style stack (log collection, cleaning, Elasticsearch) for fault‑diagnosis queries that require fast, flexible search and aggregation.
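For the first scenario above (real-time API and monitoring), the pre-computation step can be sketched as a per-minute tumbling-window counter. This is a pure-Python stand-in, not Youku's system: in the setup described, events would arrive from Kafka, the aggregation would run in Flink, and the counts would be written to the internal KV store. The class name `ExposureCounter` and the event fields `item_id` and `ts` (Unix seconds) are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 60


class ExposureCounter:
    """Counts exposures per item per minute as events arrive."""

    def __init__(self):
        # Stand-in for the KV store: (item_id, window_start) -> count
        self.kv = defaultdict(int)

    def on_event(self, event):
        """Assign the event to its one-minute window and bump that counter."""
        window_start = event["ts"] - event["ts"] % WINDOW_SECONDS
        self.kv[(event["item_id"], window_start)] += 1

    def exposures(self, item_id, start_ts, end_ts):
        """Serve an API read by summing pre-computed windows in [start_ts, end_ts)."""
        first = start_ts - start_ts % WINDOW_SECONDS
        return sum(
            self.kv.get((item_id, window), 0)
            for window in range(first, end_ts, WINDOW_SECONDS)
        )


counter = ExposureCounter()
for event in [
    {"item_id": "a", "ts": 0},
    {"item_id": "a", "ts": 30},
    {"item_id": "a", "ts": 61},
    {"item_id": "a", "ts": 125},
    {"item_id": "b", "ts": 10},
]:
    counter.on_event(event)

print(counter.exposures("a", 0, 120))  # 3
print(counter.exposures("a", 0, 180))  # 4
```

A read touches only a handful of small counters rather than the raw event stream, which is what keeps API latency low. The result lags the stream by at most one window, consistent with the minute-level requirement.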

The article concludes with a summary of the discussed OLAP options and thanks the audience.

Tags: Data Engineering, Big Data, OLAP, MPP, Youku, Precomputation
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
