
Types of OLAP Data Warehouses and Performance Optimization Techniques

This article explains the various classifications of OLAP data warehouses—including MOLAP, ROLAP, HOLAP, and HTAP—based on data volume and modeling, reviews common open‑source ROLAP products, and details performance‑boosting techniques such as MPP architecture, cost‑based optimization, vectorized execution, and storage optimizations.

Big Data Technology Architecture

Types of OLAP Data Warehouses

Data warehouses can be categorized by data volume. Small‑scale warehouses (e.g., MySQL) suit low‑volume analytical workloads, medium‑scale warehouses (hundreds of millions to tens of billions of rows) such as Cloudera Impala, Facebook Presto, and Pivotal Greenplum provide near‑real‑time analytics, while very large‑scale warehouses (hundreds of billions of rows) typically use offline engines like Hive or SparkSQL.

Classification by Modeling Type

OLAP systems are also divided by modeling approach into three classic types: MOLAP (multidimensional), ROLAP (relational), and HOLAP (hybrid). An emerging fourth type, HTAP (Hybrid Transaction/Analytical Processing), extends ROLAP with transactional capabilities.

1. MOLAP

MOLAP is the traditional multidimensional approach introduced by Codd in 1993. It pre‑computes all possible query results and stores them in optimized multidimensional arrays (cubes). Because results are pre‑materialized, queries return quickly and the cube data compresses well, but pre‑computation is time‑consuming and inflexible when business requirements change. MOLAP works best for fixed, large‑scale workloads; Apache Kylin is a notable open‑source MOLAP engine that can query billions of rows with sub‑second latency.
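The idea of pre‑materializing every aggregate combination can be sketched in a few lines of Python. This toy cube (hypothetical table and column names) sums one measure over every subset of two dimensions, so any roll‑up becomes a dictionary lookup instead of a scan:

```python
from itertools import combinations

# Toy fact table: (region, product, sales)
facts = [
    ("east", "widget", 100),
    ("east", "gadget", 50),
    ("west", "widget", 70),
]

def build_cube(rows, num_dims=2):
    """Pre-aggregate SUM(sales) for every combination of dimensions,
    mimicking a MOLAP cube. '*' marks a rolled-up dimension."""
    cube = {}
    for size in range(num_dims + 1):
        for keep in combinations(range(num_dims), size):
            for row in rows:
                key = tuple(row[i] if i in keep else "*" for i in range(num_dims))
                cube[key] = cube.get(key, 0) + row[-1]
    return cube

cube = build_cube(facts)
# Queries now hit pre-computed results instead of scanning the fact table.
print(cube[("east", "*")])  # 150: total sales for region 'east'
print(cube[("*", "*")])     # 220: grand total
```

The trade‑off described above is visible even here: the cube grows with every added dimension, and any change to the facts forces recomputation.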

2. ROLAP

ROLAP operates directly on relational fact and dimension tables, avoiding pre‑computation. It scales well and can ingest incremental data without full recomputation, but query latency can vary widely—seconds for simple TPC‑H queries, minutes or hours for complex TPC‑DS workloads. ROLAP's lower barrier to entry makes it popular; users only need to build star or snowflake schemas and write SQL.
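A minimal star‑schema query can be shown with Python's built‑in sqlite3 (table and column names here are illustrative, not from any particular product). The fact table joins to its dimensions at query time, with no pre‑computation:

```python
import sqlite3

# Minimal star schema: one fact table plus two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date(date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_product(product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales(date_id INTEGER, product_id INTEGER, amount REAL);
    INSERT INTO dim_date VALUES (1, 2023), (2, 2024);
    INSERT INTO dim_product VALUES (10, 'toys'), (20, 'books');
    INSERT INTO fact_sales VALUES (1, 10, 5.0), (2, 10, 7.5), (2, 20, 3.0);
""")

# A typical ROLAP query: join fact to dimensions at query time.
rows = con.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category
""").fetchall()
print(rows)  # [(2023, 'toys', 5.0), (2024, 'books', 3.0), (2024, 'toys', 7.5)]
```

Engines like Presto or Impala execute essentially this shape of query, only distributed across nodes.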

3. HOLAP

HOLAP combines the strengths of MOLAP and ROLAP: frequently accessed, stable queries use pre‑computed cubes, while ad‑hoc or less frequent queries fall back to relational scans. No widely adopted open‑source HOLAP system exists yet, though some large‑scale data service providers have proprietary implementations.

4. HTAP

HTAP (Hybrid Transaction/Analytical Processing) can be considered a variant of ROLAP that adds transactional capabilities; some systems weaken the transaction component (T) to a serving component (S), moving toward HSAP. This article does not explore HTAP further.

Common Open‑Source ROLAP Products

Open‑source ROLAP systems fall into two categories: wide‑table models and multi‑table (star/snowflake) models.

Wide‑table models (e.g., Druid, ClickHouse, Elasticsearch, Solr) provide excellent query performance for certain workloads but have limited SQL support, especially for complex joins. Druid excels at massive data scales with pre‑aggregation and inverted indexes, while ClickHouse offers simple deployment and strong vectorized query performance.

Multi‑table models use star or snowflake schemas and include Greenplum, Presto, and Impala. Built on MPP architectures, they support large data volumes, extensibility, and rich SQL features, though they may have higher latency compared to specialized wide‑table engines.

Performance‑Boosting Techniques for ROLAP

MPP Architecture

Massively Parallel Processing (MPP) distributes query execution across many nodes, delivering query latencies in seconds or milliseconds, unlike MapReduce‑based systems that suffer from heavy I/O and lack of cross‑stage optimization.

Impala’s architecture illustrates a typical MPP workflow: a coordinator node parses the SQL, generates a plan, fragments it into plan fragments (PFs), distributes PFs to executor nodes, each executor scans storage (HDFS/HBase), performs local joins and aggregations, and finally the coordinator aggregates results.
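The coordinator/executor flow above can be sketched as a single‑process scatter‑gather (a real MPP engine runs the plan fragments on separate nodes in parallel; the function names here are illustrative):

```python
# Toy scatter-gather flow mirroring the coordinator/executor split.

def scan_and_aggregate(partition):
    """Plan fragment executed on one node: local scan + partial SUM/COUNT."""
    return sum(partition), len(partition)

def coordinator(data, num_nodes=3):
    """Fragment the work, 'ship' each fragment to an executor, then
    merge the partial results into the final answer."""
    partitions = [data[i::num_nodes] for i in range(num_nodes)]
    partials = [scan_and_aggregate(p) for p in partitions]  # parallel in a real engine
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count  # final aggregation: AVG

print(coordinator(list(range(1, 101))))  # 50.5
```

The key property is that each executor touches only its own partition, so adding nodes shrinks per‑node work roughly linearly.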

Cost‑Based Optimization (CBO)

CBO relies on detailed statistics—row counts, min/max values, histograms, distinct counts—to estimate the cost of alternative execution plans and choose the most efficient one, especially for join ordering and method selection (broadcast vs. partitioned hash join).

Join strategies include broadcast joins for small‑table‑large‑table scenarios and partitioned joins for large‑table‑large‑table cases. The optimizer may also decide between left‑deep trees (LDT) and bushy trees (BYT) based on data characteristics.
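A hedged sketch of how a cost model might choose between the two join methods just described; the cost formulas and constants are illustrative only, not Impala's actual model:

```python
# Pick a join distribution strategy from table statistics.

def choose_join_strategy(left_rows, right_rows, num_nodes, network_cost=1.0):
    # Broadcast: replicate the smaller side to every node.
    small = min(left_rows, right_rows)
    broadcast_cost = small * num_nodes * network_cost
    # Partitioned: shuffle both sides once by the join key.
    partitioned_cost = (left_rows + right_rows) * network_cost
    return "broadcast" if broadcast_cost < partitioned_cost else "partitioned"

print(choose_join_strategy(1_000_000, 1_000, num_nodes=10))      # broadcast
print(choose_join_strategy(1_000_000, 1_000_000, num_nodes=10))  # partitioned
```

This is exactly why the statistics matter: without accurate row counts, the optimizer cannot tell the small‑table case from the large‑table case.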

Vectorized Execution Engine

Vectorized execution processes columns in batches rather than one tuple at a time, reducing virtual function call overhead, improving CPU cache utilization, and enabling SIMD instructions. This approach dramatically speeds up analytical queries on columnar data.
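The structural difference can be sketched in pure Python. Real vectorized engines run the inner loop over columnar batches in native code with SIMD; the point this toy version shows is one function call per batch instead of one per row:

```python
BATCH = 1024

def scan_batches(column, batch_size=BATCH):
    """Produce the column as fixed-size batches."""
    for i in range(0, len(column), batch_size):
        yield column[i:i + batch_size]

def filter_batch(batch, threshold):
    """One call applies the predicate to a whole batch of values."""
    return [v for v in batch if v > threshold]

def vectorized_count(column, threshold):
    """COUNT(*) WHERE value > threshold, batch at a time."""
    return sum(len(filter_batch(b, threshold)) for b in scan_batches(column))

print(vectorized_count(list(range(10_000)), 9_000))  # 999
```

Amortizing per‑call overhead over 1024 values, keeping the batch hot in cache, is where the cited CPU wins come from.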

Dynamic Code Generation

Dynamic code generation emits native code at runtime to eliminate virtual function dispatch and type‑boxing overhead. Impala uses LLVM to generate machine code, while Spark SQL may use reflection or generated Java bytecode. The generated code operates directly on primitive types, further accelerating execution.
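A minimal analogue in Python: specialize a predicate into compiled bytecode once, instead of re‑interpreting an expression tree for every row. (Impala emits LLVM machine code and Spark SQL emits Java bytecode; this sketch only mirrors the idea.)

```python
def codegen_predicate(expr):
    """Compile a predicate over a row dict into a Python function.
    'expr' is a trusted expression string, e.g. from a query planner."""
    src = f"def _pred(row):\n    return {expr}\n"
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["_pred"]

# Compile once, then evaluate per row with no interpretation overhead.
pred = codegen_predicate("row['price'] > 10 and row['qty'] >= 2")
rows = [{"price": 12, "qty": 3}, {"price": 8, "qty": 5}, {"price": 20, "qty": 1}]
print([r for r in rows if pred(r)])  # [{'price': 12, 'qty': 3}]
```

The generated function is monomorphic: it knows the exact comparison and columns involved, so there is no per‑row dispatch on operator or type.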

Storage and I/O Optimizations

Key techniques include columnar storage formats (Parquet, ORC), compression algorithms (zlib, Snappy, LZ4), and lightweight encodings such as Run‑Length Encoding (RLE) and dictionary encoding. Rich metadata (min/max, histograms, row counts) enables predicate push‑down and early data pruning.
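The two lightweight encodings named above are simple enough to sketch directly:

```python
def rle_encode(values):
    """Run-length encoding: collapse runs into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def dict_encode(values):
    """Dictionary encoding: replace repeated strings with small integer ids."""
    mapping = {}
    codes = [mapping.setdefault(v, len(mapping)) for v in values]
    return mapping, codes

print(rle_encode(["a", "a", "a", "b", "b", "c"]))  # [('a', 3), ('b', 2), ('c', 1)]
print(dict_encode(["us", "cn", "us", "us"]))       # ({'us': 0, 'cn': 1}, [0, 1, 0, 0])
```

Both work best on sorted or low‑cardinality columns, which is one reason columnar formats like Parquet and ORC apply them per column rather than per row.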

Fine‑grained storage structures like row groups and local indexes allow the engine to skip irrelevant data blocks, reducing I/O and CPU work.
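On the read path itself, engines such as Impala also benefit from HDFS short‑circuit local reads, which let a client read local block files directly instead of going through the DataNode. The standard hdfs-site.xml settings:

```xml
<property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
</property>
<property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```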

Runtime Data Filtering

Impala’s Runtime Filters (RF) and SparkSQL’s Dynamic Partition Pruning broadcast bloom‑filter information from the build side of a join to the probe side, allowing early elimination of non‑matching rows during scan.
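A toy version of the mechanism: build a Bloom filter from the join keys on the build side, then drop probe‑side rows that cannot possibly match before the join runs. The sizes here are illustrative, and this hand‑rolled filter is only a sketch of the idea:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter over an integer bitset."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

# Build side of the join produces the filter...
rf = BloomFilter()
for k in [1, 7, 42]:
    rf.add(k)

# ...which is broadcast to the probe-side scan for early pruning.
probe_rows = [(1, "a"), (2, "b"), (42, "c"), (99, "d")]
survivors = [r for r in probe_rows if rf.might_contain(r[0])]
print(survivors)
```

Every true match is guaranteed to survive; Bloom filters only admit occasional false positives, so the join afterward still produces exact results.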

Cluster Resource Management

YARN’s application‑master and container allocation introduce latency. Techniques such as long‑living ApplicationMasters and container reuse mitigate this overhead, ensuring resources are ready when queries start.

Conclusion

This article presents the author's understanding of data warehouses and OLAP systems, organized in a Q&A style that reflects the questions which guided the research. The author invites readers to point out any inaccuracies.

Written by Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies