
Comparison of Open-Source OLAP Engines for Real-Time Data Warehousing

This article reviews the concepts, criteria, and characteristics of major open‑source OLAP engines—including Hive, HAWQ, Spark SQL, Presto, Kylin, Impala, Druid, Greenplum, and ClickHouse—providing guidance on selecting the most suitable solution for various big‑data analytics scenarios.

Big Data Technology & Architecture

Background: Building real-time data warehouses has recently attracted a surge of attention, and the author has written about the topic before.

Keywords: Real‑time data warehouse, OLAP technology selection

OLAP Overview

OLAP (Online Analytical Processing) is a data‑warehouse technology distinct from OLTP (Online Transaction Processing). It was first proposed by E.F. Codd in 1993 to address the need for multidimensional analysis beyond simple SQL queries.

The OLAP Council defines it as technology that transforms raw data into information users can access quickly, consistently, and interactively from multiple perspectives.
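
To make "access from multiple perspectives" concrete, here is a minimal sketch of two classic multidimensional operations, roll-up and slice, over a tiny in-memory fact table. The table contents, the dimension names, and the helper names `roll_up` and `slice_` are all made up for illustration and do not come from any particular engine.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales amount).
FACTS = [
    ("east", "laptop", "Q1", 120),
    ("east", "phone", "Q1", 80),
    ("west", "laptop", "Q1", 100),
    ("west", "laptop", "Q2", 150),
    ("east", "phone", "Q2", 90),
]
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, keep):
    """Aggregate the measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(facts, dimension, value):
    """Fix one dimension to a single value and return the matching rows."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == value]


# The same facts viewed from two different perspectives.
by_region = roll_up(FACTS, ["region"])
laptops_by_quarter = roll_up(slice_(FACTS, "product", "laptop"), ["quarter"])
```

Each engine discussed below answers this kind of question; they differ mainly in when the aggregation happens (at query time or ahead of time) and where the data lives.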

OLAP Principles (Codd's 12 Rules)

Multidimensional view

Transparency

Accessibility

Consistent reporting performance

Client/server architecture

Generic dimensionality

Dynamic sparse matrix handling

Multi‑user support

Unrestricted cross‑dimensional operations

Intuitive data manipulation

Flexible reporting

Unlimited dimensions and aggregation levels

Open‑Source OLAP Engines

Hive

Hive is a Hadoop-based data-warehouse tool that maps structured files to tables and translates SQL queries into MapReduce jobs. It has a low learning curve and broad SQL support, but latency is high because each query runs as a batch job that typically scans entire tables.
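
The translation from SQL to MapReduce can be pictured with a small, single-process sketch. This is not Hive's actual planner; the `page_views` data, the query, and the function names are hypothetical, and the code only mimics the map, shuffle, and reduce phases that a simple GROUP BY would go through.

```python
from collections import defaultdict

# Hypothetical rows of a "page_views" table backed by a delimited file.
LINES = [
    "2024-01-01,home,3",
    "2024-01-01,cart,1",
    "2024-01-02,home,5",
    "2024-01-02,cart,2",
]


def map_phase(line):
    """Parse one record and emit a (group key, value) pair."""
    _day, page, views = line.split(",")
    yield page, int(views)


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Apply the aggregate function to one group."""
    return key, sum(values)


def run_query(lines):
    # Equivalent in spirit to:
    #   SELECT page, SUM(views) FROM page_views GROUP BY page
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


result = run_query(LINES)
```

Note that `run_query` reads every line even though the answer has only two groups, which hints at why this execution model favors throughput over interactive latency.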

HAWQ

HAWQ is a native Hadoop MPP SQL engine with a cost‑based optimizer, supporting external data sources via PXF and offering SQL UDFs for analytics.

Spark SQL

Spark SQL integrates SQL queries with Spark programs, so a single job can mix SQL with RDD and DataFrame code. It provides ANSI-SQL support, query plans produced by the Catalyst optimizer, and compatibility with Hive data sources.

Presto

Presto is an open-source distributed SQL query engine for interactive analytics across heterogeneous data sources. The project describes itself as follows:

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow "free" solution that requires excessive hardware.
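
The idea of querying data where it lives can be sketched with two independent in-memory SQLite databases standing in for separate sources. This is only an illustration of the federation pattern, not Presto's connector API; the table names, rows, and the `federated_totals` helper are assumptions made for the example.

```python
import sqlite3


def make_source(schema, rows, insert):
    """Create an independent in-memory database standing in for one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    conn.executemany(insert, rows)
    return conn


orders_db = make_source(
    "CREATE TABLE orders (user_id INTEGER, amount REAL)",
    [(1, 20.0), (2, 5.0), (1, 7.5)],
    "INSERT INTO orders VALUES (?, ?)",
)
users_db = make_source(
    "CREATE TABLE users (id INTEGER, name TEXT)",
    [(1, "alice"), (2, "bob")],
    "INSERT INTO users VALUES (?, ?)",
)


def federated_totals(orders_conn, users_conn):
    """Push the aggregation to one source, then join with the other in the engine."""
    totals = orders_conn.execute(
        "SELECT user_id, SUM(amount) FROM orders GROUP BY user_id"
    ).fetchall()
    names = dict(users_conn.execute("SELECT id, name FROM users").fetchall())
    return {names[user_id]: total for user_id, total in totals}


result = federated_totals(orders_db, users_db)
```

Neither source is copied into the other; the coordinating code fetches only what it needs from each and combines the results in memory.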

Kylin

Kylin is a MOLAP system that builds pre‑aggregated cubes for massive datasets (hundreds of billions of rows), offering ANSI‑SQL interfaces and seamless BI integration.
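
The pre-aggregation idea can be sketched in a few lines: compute the aggregate for every combination of dimensions once, then answer GROUP BY queries by lookup. This is a toy model of cube building under assumed data, not Kylin's storage format or build algorithm; `build_cube`, `query`, and the fact rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("country", "device")
# Hypothetical fact rows: one dict per event, with a numeric measure.
FACTS = [
    {"country": "CN", "device": "ios", "amount": 10},
    {"country": "CN", "device": "android", "amount": 20},
    {"country": "US", "device": "ios", "amount": 30},
    {"country": "US", "device": "ios", "amount": 5},
]


def build_cube(facts, dimensions, measure):
    """Precompute SUM(measure) for every subset of dimensions (each cuboid)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in facts:
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY from the stored cuboid without touching raw rows."""
    dims = tuple(d for d in DIMENSIONS if d in group_by)
    return cube[dims]


CUBE = build_cube(FACTS, DIMENSIONS, "amount")
by_country = query(CUBE, {"country"})
grand_total = query(CUBE, set())
```

With n dimensions the full cube holds 2^n cuboids, which is the storage cost paid for fast lookups and the reason this approach suits queries over a fixed set of dimensions.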

Impala

Impala provides fast, interactive SQL on Hadoop using MPP architecture, sharing metadata with Hive and supporting various file formats, compression codecs, UDF/UDAF, and advanced query features.

Druid

Druid delivers sub-second queries on both historical and real-time data and excels at high-throughput ingestion and time-series analysis, though its join support is limited and it is less flexible for complex ad hoc queries.
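
One way to see how fast queries over freshly ingested time-series data can work is to aggregate events into time buckets as they arrive. The sketch below assumes hourly buckets and a single dimension; the `RollupStore` class and its methods are hypothetical and say nothing about Druid's real segment format or query API.

```python
from collections import defaultdict
from datetime import datetime


class RollupStore:
    """Toy store that aggregates events into hourly buckets as they arrive."""

    def __init__(self):
        # (hour bucket, page) -> [event count, total latency in ms]
        self._buckets = defaultdict(lambda: [0, 0])

    def ingest(self, timestamp, page, latency_ms):
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        bucket = self._buckets[(hour, page)]
        bucket[0] += 1
        bucket[1] += latency_ms

    def query(self, start, end, page):
        """Sum pre-aggregated buckets whose hour falls in [start, end)."""
        lo, hi = datetime.fromisoformat(start), datetime.fromisoformat(end)
        count = total = 0
        for (hour, p), (c, t) in self._buckets.items():
            if p == page and lo <= hour < hi:
                count += c
                total += t
        return {"count": count, "avg_latency_ms": total / count if count else None}


store = RollupStore()
store.ingest("2024-01-01T10:05:00", "home", 120)
store.ingest("2024-01-01T10:40:00", "home", 80)
store.ingest("2024-01-01T11:10:00", "home", 100)
store.ingest("2024-01-01T10:15:00", "cart", 300)

answer = store.query("2024-01-01T10:00:00", "2024-01-01T11:00:00", "home")
```

Events become queryable as soon as `ingest` returns, and a query touches one bucket per hour and page rather than one row per event. The trade-off is that the raw events are gone, so only questions expressible over the stored dimensions and measures can be answered.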

Greenplum

Greenplum is an open‑source MPP data‑analysis engine based on PostgreSQL, supporting ANSI‑SQL 2008, ACID transactions, and extensive ecosystem integrations.
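
The MPP pattern behind engines of this kind can be sketched as hash distribution followed by local aggregation and a final merge. The code below runs the "segments" sequentially in one process purely for illustration; the segment count, the sample rows, and the function names are assumptions, not Greenplum internals.

```python
import zlib
from collections import Counter

NUM_SEGMENTS = 3  # hypothetical number of worker segments


def segment_for(key):
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


def distribute(rows):
    """Spread rows over segments; each segment stores only its own share."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for customer, amount in rows:
        segments[segment_for(customer)].append((customer, amount))
    return segments


def local_aggregate(segment_rows):
    """Work done independently on one segment: a partial SUM per customer."""
    partial = Counter()
    for customer, amount in segment_rows:
        partial[customer] += amount
    return partial


def gather(partials):
    """Coordinator step: merge the partial results from all segments."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


ROWS = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3), ("bob", 1)]
SEGMENTS = distribute(ROWS)
totals = gather(local_aggregate(seg) for seg in SEGMENTS)
```

Because rows with the same key always land on the same segment, each partial result is already complete for its keys, and the coordinator only has to combine small summaries. In a real cluster the `local_aggregate` calls would run in parallel on separate machines.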

ClickHouse

ClickHouse is an open-source column-oriented database management system that generates analytical reports in real time using SQL queries.

Key features include columnar storage, data compression, sharding, distributed parallel execution, high availability, PB-scale capacity, real-time data ingestion, and indexing. Limitations noted at the time of writing include no fine-grained row-level updates or deletes, incomplete transaction support, no secondary indexes, limited SQL coverage (especially joins and window functions), and manual metadata management.
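
Why columnar storage and compression help analytical queries can be shown with a small sketch. It pivots hypothetical rows into columns and applies run-length encoding, which is used here only as a simple stand-in for real compression codecs; none of the names below correspond to ClickHouse APIs.

```python
from itertools import groupby

# Hypothetical row-oriented records: (date, country, clicks).
ROWS = [
    ("2024-01-01", "CN", 3),
    ("2024-01-01", "CN", 5),
    ("2024-01-01", "US", 2),
    ("2024-01-02", "US", 7),
    ("2024-01-02", "US", 1),
]


def to_columns(rows, names):
    """Pivot rows into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(values):
    """Run-length encode a column as [value, run length] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]


def rle_decode(runs):
    """Expand [value, run length] pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


COLUMNS = to_columns(ROWS, ("date", "country", "clicks"))
encoded_date = rle_encode(COLUMNS["date"])

# An aggregate over one column reads only that column's values.
total_clicks = sum(COLUMNS["clicks"])
```

Values within one column tend to repeat, especially when the data is sorted, so a column compresses far better than a row of mixed types. A query such as a sum over `clicks` also skips the `date` and `country` columns entirely.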

Summary

The engines can be grouped as follows:

Hive, HAWQ, Impala – SQL on Hadoop

Presto, Spark SQL – In‑memory SQL execution

Kylin – Pre‑computed cubes (space‑for‑time)

Druid – Real‑time ingestion and query

ClickHouse – High‑performance single‑table analytics

Greenplum – PostgreSQL‑style MPP relational analytics

Choosing the right engine depends on workload characteristics: offline Hadoop batch (Hive/HAWQ/Impala), distributed low-latency queries (Presto/Spark SQL), fixed-dimension high-speed aggregation (Kylin/Druid), or ultra-fast single-table queries (ClickHouse). No single OLAP system excels in data volume, performance, and flexibility simultaneously, so selection comes down to deciding which of the three can be traded away.
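
The grouping above can be written down as a simple lookup, which may be a handy starting point for a selection checklist. The workload labels and the `shortlist` function are hypothetical names chosen for this sketch; the engine lists just restate the article's categories and are not a benchmark result.

```python
# Workload categories and candidate engines, restating the grouping above.
CANDIDATES = {
    "offline_hadoop_batch": ["Hive", "HAWQ", "Impala"],
    "distributed_low_latency": ["Presto", "Spark SQL"],
    "fixed_dimension_aggregation": ["Kylin", "Druid"],
    "single_table_speed": ["ClickHouse"],
    "relational_mpp": ["Greenplum"],
}


def shortlist(workload):
    """Return the candidate engines for a workload label, or fail loudly."""
    try:
        return list(CANDIDATES[workload])
    except KeyError:
        known = ", ".join(sorted(CANDIDATES))
        raise ValueError(f"unknown workload {workload!r}; expected one of: {known}")


fixed_dimension = shortlist("fixed_dimension_aggregation")
```

A shortlist like this only narrows the field; the final choice still depends on testing with representative data and queries.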

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
