Distributed Databases for OLAP: MPP, Hadoop Ecosystem, and Like‑Mesa (ClickHouse/Palo) Overview
This article examines the evolution and classification of distributed databases for OLAP workloads, comparing traditional RDBMS, MPP solutions such as Teradata and Greenplum, Hadoop‑based ecosystems, and newer architectures like ClickHouse and Palo, while highlighting their architectural traits, strengths, and limitations.
With the rapid growth of large‑scale Internet applications, distributed databases have become a hot topic, especially for banks that face performance bottlenecks with single‑node databases and are considering distributed solutions.
Ivan previously discussed the necessity of distributed databases for banks and received requests to cover MPP products such as Teradata and Greenplum, which are often overlooked in OLTP‑focused discussions.
This article brings OLAP-oriented systems into the discussion, examining "distributed databases" along two dimensions: a horizontal overview of five categories and a vertical deep-dive into OLTP-oriented implementations.
1. RDBMS: Common Roots
Since the 1990s, relational databases (RDBMS) like Sybase, Oracle, and DB2 have dominated, offering ANSI SQL interfaces and ACID transaction guarantees. Subsequent distributed databases trade off these guarantees for other capabilities.
Distributed data storage is defined here by two essential conditions: high performance achieved through horizontal scaling, and high reliability achieved through inexpensive hardware combined with software.
The five categories identified are:
NoSQL
NewSQL
MPP
Hadoop ecosystem
Like‑Mesa
Note: Technologies such as Kafka and ZooKeeper are also forms of distributed storage, but they are excluded from this "database" discussion because their use cases are distinct.
2. Distributed Databases for OLAP Scenarios
During the 1990s and 2000s, banks built nationwide centralized systems on RDBMS. As data accumulated, analytical (OLAP) needs emerged, prompting the rise of MPP solutions.
1. MPP
Massively Parallel Processing (MPP) systems use a shared-nothing architecture; typical products include Teradata (TD) and Greenplum (GPDB). GPDB follows a master-segment design in which the master handles metadata and query scheduling, while the segments store the data and process it in parallel.
Key architectural features:
Horizontal scalability across many nodes
Software‑based high reliability with standby master and mirrored segments
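The master-segment division of labor can be sketched in a few lines of Python. This is only an illustrative model, not Greenplum's implementation: the segment count, the CRC-based distribution function, and names such as `segment_for` and `master_query` are assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key: str) -> int:
    """Hash-distribute a row to a segment by its distribution key."""
    return crc32(key.encode("utf-8")) % NUM_SEGMENTS


def load(rows):
    """Routing step: each row lands on exactly one segment."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for account, amount in rows:
        segments[segment_for(account)].append((account, amount))
    return segments


def segment_scan(local_rows):
    """Segment-side work: aggregate only the locally stored rows."""
    partial = defaultdict(int)
    for account, amount in local_rows:
        partial[account] += amount
    return dict(partial)


def master_query(segments):
    """Master-side work: dispatch to every segment, then merge partial results."""
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(segment_scan, segments))
    total = defaultdict(int)
    for partial in partials:
        for account, amount in partial.items():
            total[account] += amount
    return dict(total)


rows = [("acct-1", 100), ("acct-2", 50), ("acct-1", 25), ("acct-3", 10)]
result = master_query(load(rows))
```

Because rows with the same key always hash to the same segment, each segment can aggregate independently and the master only has to merge small partial results.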
Performance limitations observed in large clusters include:
Straggler nodes that slow batch processing
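Concurrency does not scale with node count: each query typically runs on every segment, so adding nodes does not add capacity for more simultaneous queries, and performance drops under high query concurrency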
Master node can become a bottleneck for query routing
Consequently, MPP (at least GPDB) faces scalability constraints in very large deployments.
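The first two limitations can be illustrated with a deliberately simplified model. It assumes that every query occupies one execution slot on every segment and that a parallel step completes only when its slowest segment does; the function names and numbers are hypothetical and are not measurements of any real cluster.

```python
def batch_elapsed(segment_seconds):
    """A parallel step finishes only when its slowest segment finishes."""
    return max(segment_seconds)


def max_concurrent_queries(num_segments, slots_per_segment):
    """Assumed model: each query takes one slot on every segment."""
    total_slots = num_segments * slots_per_segment
    slots_per_query = num_segments  # one slot per segment, per query
    return total_slots // slots_per_query


healthy = [10.0] * 8                 # eight segments, ten seconds each
with_straggler = [10.0] * 7 + [30.0]  # one slow node among eight

print(batch_elapsed(healthy))
print(batch_elapsed(with_straggler))
print(max_concurrent_queries(8, 4), max_concurrent_queries(64, 4))
```

Under these assumptions a single slow node sets the elapsed time for the whole batch, and the node count cancels out of the concurrency limit, so growing the cluster from 8 to 64 segments leaves the number of simultaneous queries unchanged.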
2. Hadoop Ecosystem
With the advent of the big‑data era, Hadoop lowered the cost of analytical systems. Its stack (HDFS, Spark, Hive, Impala, etc.) emphasizes high‑throughput batch processing and provides SQL‑like interfaces.
Architectural strengths:
High data‑throughput processing
Key drawbacks:
Lower batch processing efficiency compared to MPP due to different data layout strategies
Inability to seamlessly integrate with traditional EDW implementation methodologies (e.g., difficulty handling slowly changing dimensions with history-table designs, sometimes called "zipper tables", which rely on updating existing rows)
Insufficient concurrency for interactive queries, similar to MPP limitations
Hadoop’s design favors scalability over low‑latency, high‑concurrency OLAP workloads.
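To see why slowly changing dimensions are awkward on this stack, consider a history table that keeps one row per validity interval. The Python sketch below is an assumption-laden illustration (the row layout, the `OPEN_END` marker, and `apply_change` are invented for the example): recording a change means closing the end date of an existing row, which is an in-place update that append-oriented storage such as HDFS does not handle naturally.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # conventional "still current" marker (assumed)


def apply_change(history, key, new_value, effective):
    """Return a new history in which the current row for `key` is closed
    at `effective` and a new current row is appended."""
    updated = []
    for row in history:
        if row["key"] == key and row["end"] == OPEN_END:
            if row["value"] == new_value:
                return history  # value unchanged, nothing to record
            row = {**row, "end": effective}  # the "update" step
        updated.append(row)
    updated.append(
        {"key": key, "value": new_value, "start": effective, "end": OPEN_END}
    )
    return updated


history = [
    {"key": "c1", "value": "Beijing", "start": date(2020, 1, 1), "end": OPEN_END}
]
new_history = apply_change(history, "c1", "Shanghai", date(2021, 6, 1))
```

On a store without row-level updates, the same effect usually requires rewriting the affected file or partition, which is part of why the methodology does not carry over directly.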
3. Like‑Mesa
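Google's Mesa (2014) introduced near-real-time analytical data warehousing built on Google infrastructure such as Bigtable, Colossus, and MapReduce. Open-source systems with similar goals include ClickHouse and Palo; Palo's design draws on Mesa, while ClickHouse was developed independently at Yandex.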
Architecture highlights:
In Palo, a Frontend (FE) handles query compilation and coordination, while a Backend (BE) performs execution and storage
Data is pre‑aggregated (materialized roll‑ups) to accelerate query response
Focus on low‑latency, high‑concurrency online queries, diverging from traditional MPP batch‑centric designs
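The pre-aggregation idea can be sketched as follows. This is a toy model rather than how Mesa, Palo, or ClickHouse store data: the column names, the sample rows, and the helpers `build_rollup` and `clicks_for` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_rollup(base_rows, dims):
    """Pre-aggregate the metrics over a subset of dimensions (done at load time)."""
    rollup = defaultdict(lambda: {"clicks": 0, "cost": 0})
    for row in base_rows:
        key = tuple(row[d] for d in dims)
        rollup[key]["clicks"] += row["clicks"]
        rollup[key]["cost"] += row["cost"]
    return dict(rollup)


def clicks_for(rollup, day, advertiser):
    """Query time: a single lookup instead of a scan over the base rows."""
    return rollup.get((day, advertiser), {"clicks": 0})["clicks"]


base_rows = [
    {"day": "2024-01-01", "advertiser": "a", "country": "US", "clicks": 3, "cost": 30},
    {"day": "2024-01-01", "advertiser": "a", "country": "DE", "clicks": 2, "cost": 20},
    {"day": "2024-01-01", "advertiser": "b", "country": "US", "clicks": 5, "cost": 40},
    {"day": "2024-01-02", "advertiser": "a", "country": "US", "clicks": 1, "cost": 10},
]

by_day_advertiser = build_rollup(base_rows, ("day", "advertiser"))
```

The roll-up drops the `country` dimension, so it holds fewer rows than the base data, and a query grouped by day and advertiser can be answered from it directly. The trade-off is extra work and storage at ingestion time in exchange for lower query latency.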
Both ClickHouse and Palo target workloads such as advertising reporting and time-series analytics, and both remain under active development.
