Databases 25 min read

Understanding OLAP Data Warehouse Types, Architectures, and Performance Optimizations

This article provides a comprehensive overview of OLAP data warehouses, covering classification by data volume and modeling, detailed explanations of MOLAP, ROLAP, HOLAP and HTAP, common open‑source implementations, and a deep dive into performance‑boosting techniques such as MPP architectures, cost‑based optimization, vectorized execution, dynamic code generation, storage compression, runtime filters and resource management.

Big Data Technology Architecture

Jun 28, 2020

Understanding OLAP Data Warehouse Types, Architectures, and Performance Optimizations

What types of OLAP data warehouses exist?

Data warehouses can be classified by data volume, ranging from small relational databases (e.g., MySQL) for modest workloads to large‑scale, near‑real‑time analytical warehouses such as Cloudera Impala, Facebook Presto, and Pivotal Greenplum for hundreds of millions to billions of rows. When data exceeds the hundred‑billion‑row threshold, offline warehouses like Hive or Spark are typically chosen.

Classification by modeling type

OLAP systems are commonly divided into three modeling approaches: MOLAP (Multidimensional), ROLAP (Relational), and HOLAP (Hybrid). Each has distinct trade‑offs in pre‑computation, flexibility, and performance.

1. MOLAP

MOLAP stores pre‑computed multidimensional cubes, enabling fast, index‑free queries but requiring costly pre‑aggregation and limited adaptability to new query patterns. Apache Kylin is a notable open‑source MOLAP engine that supports sub‑second queries on billions of rows.

2. ROLAP

ROLAP operates directly on relational fact and dimension tables, offering greater scalability and easier schema evolution. However, query latency can vary widely, especially for complex analytical SQL. Popular open‑source ROLAP systems include Druid, ClickHouse (wide‑table models) and Greenplum, Presto, Impala (star‑schema models).

3. HOLAP

HOLAP combines the strengths of MOLAP and ROLAP: frequently accessed, expensive queries are served from pre‑computed cubes, while ad‑hoc or less frequent queries are executed relationally.

4. HTAP

HTAP (Hybrid Transaction/Analytical Processing) extends ROLAP with transactional capabilities, and some cloud vendors are evolving it toward HSAP (Hybrid Serving/Analytical Processing).

Common open‑source ROLAP products

Wide‑table models (e.g., Druid, ClickHouse) provide excellent query speed for time‑series and monitoring workloads but have limited join capabilities. Star‑schema or snowflake models (e.g., Greenplum, Presto, Impala) support complex analytical queries at the cost of higher latency.

Performance‑boosting techniques for ROLAP

Modern ROLAP systems employ a suite of optimizations:

Massively Parallel Processing (MPP) architectures replace MapReduce, delivering sub‑second query responses.

Cost‑Based Optimizer (CBO) uses detailed statistics (min/max, histograms, distinct counts) to choose join methods, join order, and execution plans.

Join strategies include broadcast (small‑table) and partitioned joins, selected based on table size estimates.

Vectorized execution engines process columns in batches, reducing virtual‑function overhead and improving CPU cache utilization.

Dynamic code generation (e.g., LLVM in Impala, reflection in Spark SQL) eliminates interpreter overhead and further speeds up tight loops.

Storage optimizations such as columnar formats, compression (zlib, Snappy, LZ4) and encoding (RLE, dictionary) reduce I/O and enable predicate push‑down.

Fine‑grained metadata (row‑group statistics, local indexes) allows early data pruning.

Data locality (HDFS short‑circuit reads) minimizes network hops.

Runtime filters (Impala) and dynamic partition pruning (Spark SQL) apply bloom‑filter predicates early to cut scanned data.

Resource management improvements (long‑lived ApplicationMasters, container reuse) reduce query startup latency.

Example of HDFS short‑circuit configuration

<property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
</property>
<property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>

Conclusion

The article summarizes the author's understanding of data warehouses and OLAP systems, presenting a Q&A style overview of types, architectures, and a wide range of optimization techniques, while inviting readers to point out any inaccuracies.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Warehouse OLAP MPP Dynamic Code Generation Vectorized Execution Cost-Based Optimization

Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.