Distributed Databases for OLAP: MPP, Hadoop Ecosystem, and Like‑Mesa (ClickHouse/Palo) Overview

This article examines the evolution and classification of distributed databases for OLAP workloads, comparing traditional RDBMS, MPP solutions such as Teradata and Greenplum, Hadoop‑based ecosystems, and newer architectures like ClickHouse and Palo, while highlighting their architectural traits, strengths, and limitations.

Architecture Digest

Jun 22, 2018

With the rapid growth of large‑scale Internet applications, distributed databases have become a hot topic, especially for banks that face performance bottlenecks with single‑node databases and are considering distributed solutions.

Ivan previously discussed the necessity of distributed databases for banks and received requests to cover MPP products such as Teradata and Greenplum, which are often overlooked in OLTP‑focused discussions.

This article therefore brings OLAP‑oriented distributed databases into the picture and examines "distributed databases" along two dimensions: a horizontal overview of five categories and a vertical deep‑dive into OLTP‑oriented implementations.

1. RDBMS: Common Roots

Since the 1990s, relational databases (RDBMS) such as Sybase, Oracle, and DB2 have dominated, offering ANSI SQL interfaces and ACID transaction guarantees. Later distributed databases relax some of these guarantees in exchange for other capabilities.

Distributed data storage is defined here by two essential conditions: horizontal scalability to deliver high performance, and high reliability achieved through software running on inexpensive hardware.

The five categories identified are:

NoSQL

NewSQL

MPP

Hadoop ecosystem

Like‑Mesa

Note: Technologies such as Kafka and Zookeeper are also distributed storage systems, but they are excluded from the "database" discussion because their use cases are distinct.

2. Distributed Databases for OLAP Scenarios

During the 1990s and 2000s, banks built nationwide centralized systems on RDBMS. As data accumulated, analytical (OLAP) needs emerged, prompting the rise of MPP solutions.

1. MPP

Massively Parallel Processing (MPP) systems use a shared‑nothing architecture, in which each node owns its own CPU, memory, and storage. Typical products include Teradata (TD) and Greenplum (GPDB). GPDB follows a master‑segment design: the master handles metadata and scheduling, while the segments store the data and process it in parallel.

Key architectural features:

Horizontal scalability across many nodes

Software‑based high reliability with standby master and mirrored segments
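The master‑segment split described above can be sketched as a scatter‑gather loop. This is a minimal illustration under stated assumptions, not Greenplum's actual implementation: the segment count, the CRC32 hash, the thread pool standing in for separate machines, and all function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key: str) -> int:
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


def distribute(rows):
    """Load step: spread rows over segments by their distribution key."""
    segments = defaultdict(list)
    for branch, amount in rows:
        segments[segment_for(branch)].append((branch, amount))
    return segments


def segment_scan(local_rows):
    """Segment-side step: aggregate only the rows stored locally."""
    partial = defaultdict(int)
    for branch, amount in local_rows:
        partial[branch] += amount
    return dict(partial)


def master_query(segments):
    """Master-side step: send the same work to every segment, then merge."""
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(segment_scan, segments.values()))
    merged = defaultdict(int)
    for partial in partials:
        for branch, total in partial.items():
            merged[branch] += total
    return dict(merged)


rows = [("beijing", 10), ("shanghai", 5), ("beijing", 7), ("shenzhen", 3)]
totals = master_query(distribute(rows))
print(totals)
```

Because rows with the same key always hash to the same segment, each segment can aggregate its share without talking to the others, and the master only merges small partial results.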

Performance limitations observed in large clusters include:

Straggler nodes that slow batch processing

Supported concurrency does not grow with node count, because each query typically runs on every segment; performance therefore drops at high query concurrency

Master node can become a bottleneck for query routing

Consequently, MPP (at least GPDB) faces scalability constraints in very large deployments.
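The straggler and concurrency limits listed above follow from simple arithmetic. The sketch below uses a deliberately simplified model: the timings, slot counts, and function names are illustrative assumptions, not measurements from any product.

```python
def batch_elapsed(segment_seconds):
    """A parallel step finishes only when its slowest segment finishes."""
    return max(segment_seconds)


def concurrent_capacity(num_nodes, slots_per_node, nodes_touched_per_query):
    """Queries that can run at once if each query holds one slot
    on every node it touches (simplified model)."""
    total_slots = num_nodes * slots_per_node
    return total_slots // nodes_touched_per_query


healthy = [60.0] * 99               # 99 segments each finish in 60 s
with_straggler = healthy + [240.0]  # one degraded segment needs 240 s

print(batch_elapsed(healthy))         # 60.0
print(batch_elapsed(with_straggler))  # 240.0

# MPP-style query: touches every node, so adding nodes adds no capacity.
print(concurrent_capacity(10, 8, nodes_touched_per_query=10))    # 8
print(concurrent_capacity(100, 8, nodes_touched_per_query=100))  # 8

# Single-node lookup, for contrast: capacity grows with the cluster.
print(concurrent_capacity(100, 8, nodes_touched_per_query=1))    # 800
```

Under this model, one slow node out of a hundred quadruples the batch time, and a tenfold increase in nodes leaves the number of concurrent full-cluster queries unchanged.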

2. Hadoop Ecosystem

With the advent of the big‑data era, Hadoop lowered the cost of analytical systems. Its stack (HDFS, Spark, Hive, Impala, etc.) emphasizes high‑throughput batch processing and provides SQL‑like interfaces.

Architectural strengths:

High data‑throughput processing

Key drawbacks:

Lower batch processing efficiency compared to MPP due to different data layout strategies

Inability to seamlessly integrate with traditional EDW implementation methodologies (e.g., difficulty handling slowly changing dimensions with “ladder‑table” designs)

Insufficient concurrency for interactive queries, similar to MPP limitations

Hadoop’s design favors scalability over low‑latency, high‑concurrency OLAP workloads.

3. Like‑Mesa

Google’s Mesa (2014) introduced near‑real‑time analytical warehousing built on BigTable, Colossus, and MapReduce. Open‑source systems in a similar vein include ClickHouse and Palo.

Architecture highlights:

Frontend handles query compilation and coordination; Backend performs execution and storage

Data is pre‑aggregated (materialized roll‑ups) to accelerate query response

Focus on low‑latency, high‑concurrency online queries, diverging from traditional MPP batch‑centric designs
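The pre‑aggregation idea in the list above can be sketched in a few lines. This is an illustration of the general roll‑up technique, not the storage format of Mesa, ClickHouse, or Palo; the table layout, column names, and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (date, advertiser, country, clicks)
raw_rows = [
    ("2018-06-01", "a1", "CN", 3),
    ("2018-06-01", "a1", "US", 2),
    ("2018-06-01", "a2", "CN", 4),
    ("2018-06-02", "a1", "CN", 1),
]

DIMENSIONS = ("date", "advertiser", "country")


def build_rollup(rows, keep):
    """Pre-aggregate clicks over the dimensions named in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    rollup = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        rollup[key] += row[-1]  # the last column is the metric to sum
    return dict(rollup)


# Built when data is loaded, not when a query arrives.
by_date_advertiser = build_rollup(raw_rows, keep=("date", "advertiser"))


def clicks_for(date, advertiser):
    """Serve the query from the roll-up: one lookup instead of a scan."""
    return by_date_advertiser.get((date, advertiser), 0)


print(clicks_for("2018-06-01", "a1"))  # 5
```

The trade‑off is extra work and storage at load time in exchange for cheap, predictable lookups at query time, which is what makes high‑concurrency online queries feasible.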

Both ClickHouse and Palo target advertising and time‑series analytics, with ongoing development.
