
How Spark Turns Traditional Databases into Powerful OLAP Engines

This article examines why traditional relational databases like MySQL struggle with analytical workloads, compares ROLAP and MOLAP approaches, explains Spark’s architecture and its advantages for OLAP, and details how Alibaba Cloud’s DRDS HTAP leverages a Spark‑based engine to deliver real‑time distributed query processing.

dbaplus Community

Nov 4, 2018


1. Limitations of Traditional Databases

MySQL dominates internet services but is designed for OLTP with a single‑thread‑per‑request model, making complex analytical (OLAP) queries inefficient. Even with thread pools and priority queues, the fundamental one‑thread‑one‑query design prevents effective multi‑core parallelism.

Scaling up hardware alone therefore does not improve OLAP performance: because each query runs on a single thread, adding cores leaves the latency of an individual analytical query largely unchanged.

Distributed sharding solutions such as Alibaba Cloud DRDS provide scale‑out capabilities, supporting distributed transactions, smooth scaling, read/write separation, and permission management.

2. Approaches to OLAP

Two main strategies exist for handling OLAP workloads:

ROLAP – copy data to a separate data warehouse built on a distributed MPP (Massively Parallel Processing) architecture, where data is sharded across nodes.

MOLAP – pre‑aggregate data into a data cube. For example, the three dimensions (year, item, city) yield 2³ = 8 group‑by combinations, ranging from the grand total to the full (year, item, city) grouping. Each combination is computed in advance, which dramatically reduces query cost. However, MOLAP can only serve queries that fit the predefined model.
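To make the cube example concrete, the following is a minimal sketch that precomputes a sum for every subset of the three dimensions. The sample rows, the `SALES` and `DIMENSIONS` names, and the `build_cube` helper are assumptions made for illustration; a real MOLAP engine would store and index the cube very differently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (year, item, city, sales amount).
SALES = [
    (2017, "phone", "Hangzhou", 100),
    (2017, "laptop", "Beijing", 200),
    (2018, "phone", "Hangzhou", 150),
    (2018, "phone", "Beijing", 50),
]
DIMENSIONS = ("year", "item", "city")


def build_cube(rows):
    """Precompute SUM(amount) for every subset of the dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)  # values of the grouped dimensions
                totals[key] += row[-1]             # last column is the amount
            names = tuple(DIMENSIONS[i] for i in dims)
            cube[names] = dict(totals)
    return cube


cube = build_cube(SALES)
print(len(cube))                    # 8 group-by combinations for 3 dimensions
print(cube[()][()])                 # grand total: 500
print(cube[("item",)][("phone",)])  # phone sales over all years and cities: 300
```

Once the cube exists, a query such as "sales by item" becomes a dictionary lookup instead of a scan, which is the cost reduction described above. A grouping that was not modeled in advance simply has no entry, which is the limitation.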

3. Spark and Spark SQL

Spark is a leading big‑data framework offering both programmatic APIs (Java, Scala, Python) and a declarative SQL interface. It runs primarily in memory, does not depend on Hadoop storage, and uses the Resilient Distributed Dataset (RDD) model, which assumes deterministic computation and can recompute lost partitions.
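The recompute-on-loss idea can be sketched with a toy class in plain Python. `ToyRDD` and its methods are hypothetical names for illustration only and are not Spark's API. The sketch assumes, as the RDD model does, that the transformation functions are deterministic.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the lineage needed to rebuild them."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self.parent = parent            # upstream dataset (None for a source)
        self.fn = fn                    # per-partition transformation
        self._source = source_partitions
        if parent is None:
            self.num_partitions = len(source_partitions)
        else:
            self.num_partitions = parent.num_partitions
        self._cache = {}                # materialized partitions (toy only)

    def map(self, fn):
        return ToyRDD(parent=self, fn=lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, index):
        """Return partition `index`, rebuilding it from the parent if it is missing."""
        if index not in self._cache:
            if self.parent is None:
                data = list(self._source[index])
            else:
                data = self.fn(self.parent.compute(index))
            self._cache[index] = data
        return self._cache[index]

    def lose(self, index):
        """Simulate losing one materialized partition."""
        self._cache.pop(index, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


base = ToyRDD(source_partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

before = evens_squared.collect()
evens_squared.lose(1)            # partition 1 disappears
after = evens_squared.collect()  # it is rebuilt from lineage, not from a replica
print(before == after)
```

Because only the lost partition is recomputed, and the result is identical as long as the functions are deterministic, no replicated copy of the intermediate data is needed.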

In Spark’s execution plan, operators like filter and map can be pipelined within a partition, while operators such as join require a shuffle. The shuffle acts as a “pipeline breaker” and splits the plan into separate stages.

During execution, stages are processed according to their dependencies; independent stages can run in parallel, and each stage executes its operators in a pipeline fashion.
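The stage-splitting rule can be sketched in a few lines of plain Python. This is a simplification under stated assumptions: the plan is modeled as a linear chain of operator names rather than a DAG, and the `WIDE_OPERATORS` set, `split_into_stages`, and `run_pipelined` are hypothetical names, not Spark internals.

```python
# Operators assumed to need a shuffle and therefore to start a new stage.
WIDE_OPERATORS = {"join", "groupBy", "repartition"}


def split_into_stages(operators):
    """Split a linear operator chain into stages, cutting before each shuffle operator."""
    stages, current = [], []
    for op in operators:
        if op in WIDE_OPERATORS and current:
            stages.append(current)  # the shuffle boundary closes the running stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


def run_pipelined(partition, predicate, transform):
    """Inside one stage, filter and map are chained lazily: each row passes
    through both steps before the next row is read."""
    filtered = (row for row in partition if predicate(row))
    mapped = (transform(row) for row in filtered)
    return list(mapped)


plan = ["scan", "filter", "map", "join", "map", "groupBy", "map"]
stages = split_into_stages(plan)
for number, stage in enumerate(stages):
    print(number, stage)

rows = run_pipelined([1, 2, 3, 4], lambda x: x % 2 == 0, lambda x: x * 10)
print(rows)
```

In this sketch the plan splits into three stages, and the generator chain in `run_pipelined` never builds an intermediate list between the filter and the map, which is what pipelining within a partition means.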

Spark SQL is a higher‑level layer built on top of the RDD API; the same computation can be expressed either as a SQL statement or with the native API.

4. Why Use Spark?

Execution models can be categorized as:

Single‑node execution – exemplified by MySQL, insufficient for complex analytics.

Single‑node parallel (SMP) – exemplified by PostgreSQL, lacks true scale‑out.

Distributed parallel (MPP) – used by modern data warehouses and big‑data frameworks, providing massive parallelism across nodes. Depending on how fault tolerance is designed, these systems lean toward either MPP‑style execution or batch processing.
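The distributed parallel model can be illustrated with a scatter-gather aggregation: each shard computes a partial result over its own rows, and a coordinator merges the partials. The sketch below uses threads as stand-ins for nodes, so it shows only the structure of the computation, not real multi-machine speedup. The `SHARDS` data and function names are assumptions for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each holds (city, amount) rows, as if on a separate node.
SHARDS = [
    [("Hangzhou", 100), ("Beijing", 200)],
    [("Hangzhou", 150), ("Shanghai", 70)],
    [("Beijing", 50)],
]


def partial_sum(shard):
    """Runs once per shard: aggregate only the local rows."""
    totals = Counter()
    for city, amount in shard:
        totals[city] += amount
    return totals


def distributed_sum(shards):
    """Scatter the partial aggregation to every shard, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_sum, shards))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds the values key by key
    return dict(merged)


result = distributed_sum(SHARDS)
print(result)
```

The single-node model corresponds to running `partial_sum` over all rows on one thread. The distributed model moves only the small partial results between nodes, which is why it scales with the number of shards.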

Compared with Hive, Spark SQL meets the performance and ecosystem requirements for our use case.

5. DRDS HTAP in Practice

Alibaba Cloud DRDS read‑only instances embed the Fireworks engine, a Spark‑based distributed MPP engine customized for HTAP workloads.

The engine is transparent to users: the optimizer decides whether a query needs distributed execution, dispatches the plan to the Fireworks cluster, and returns the result.
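The routing decision could look roughly like the sketch below. This is not DRDS's actual logic, which the article does not describe: the `QueryProfile` fields, the `ROW_THRESHOLD` value, and the `choose_engine` rule are all assumptions chosen only to show the shape of such a decision.

```python
from dataclasses import dataclass


@dataclass
class QueryProfile:
    """Hypothetical summary an optimizer might derive for a query."""
    estimated_rows_scanned: int
    has_join: bool
    has_aggregate: bool


# Assumed cutoff, picked purely for illustration.
ROW_THRESHOLD = 1_000_000


def choose_engine(profile: QueryProfile) -> str:
    """Return "distributed" for heavy analytical queries, "local" otherwise."""
    heavy = profile.estimated_rows_scanned >= ROW_THRESHOLD
    analytical = profile.has_join or profile.has_aggregate
    return "distributed" if heavy and analytical else "local"


point_lookup = QueryProfile(estimated_rows_scanned=1, has_join=False, has_aggregate=False)
big_report = QueryProfile(estimated_rows_scanned=50_000_000, has_join=True, has_aggregate=True)

print(choose_engine(point_lookup))
print(choose_engine(big_report))
```

Because the choice is made inside the optimizer, the application sends the same SQL either way and never addresses the distributed cluster directly, which is what transparency to users amounts to.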

DRDS HTAP does not introduce new storage; it leverages the primary‑secondary binlog replication (millisecond latency) to provide near‑real‑time query capability without adding load to the primary database.

Q&A

Q: If the optimizer receives a complex query that needs Spark computation, does it translate it into Spark’s logical plan, and how is the integration handled?

A: DRDS currently invokes Spark SQL directly, avoiding deep internal integration. Spark’s optimizer may perform an additional optimization pass, but the final execution plan usually aligns with expectations.



