Understanding OLAP Data Warehouse Types, Architectures, and Performance Optimizations
This article provides a comprehensive overview of OLAP data warehouses, covering classification by data volume and modeling, detailed explanations of MOLAP, ROLAP, HOLAP and HTAP, common open‑source implementations, and a deep dive into performance‑boosting techniques such as MPP architectures, cost‑based optimization, vectorized execution, dynamic code generation, storage compression, runtime filters and resource management.
What types of OLAP data warehouses exist?
Data warehouses can be classified by data volume, ranging from small relational databases (e.g., MySQL) for modest workloads to large‑scale, near‑real‑time analytical warehouses such as Cloudera Impala, Facebook Presto, and Pivotal Greenplum for hundreds of millions to billions of rows. When data exceeds the hundred‑billion‑row threshold, offline warehouses like Hive or Spark are typically chosen.
Classification by modeling type
OLAP systems are commonly divided into three modeling approaches: MOLAP (Multidimensional), ROLAP (Relational), and HOLAP (Hybrid). Each has distinct trade‑offs in pre‑computation, flexibility, and performance.
1. MOLAP
MOLAP stores pre‑computed multidimensional cubes, enabling fast, index‑free queries but requiring costly pre‑aggregation and limited adaptability to new query patterns. Apache Kylin is a notable open‑source MOLAP engine that supports sub‑second queries on billions of rows.
2. ROLAP
ROLAP operates directly on relational fact and dimension tables, offering greater scalability and easier schema evolution. However, query latency can vary widely, especially for complex analytical SQL. Popular open‑source ROLAP systems include Druid, ClickHouse (wide‑table models) and Greenplum, Presto, Impala (star‑schema models).
3. HOLAP
HOLAP combines the strengths of MOLAP and ROLAP: frequently accessed, expensive queries are served from pre‑computed cubes, while ad‑hoc or less frequent queries are executed relationally.
4. HTAP
HTAP (Hybrid Transaction/Analytical Processing) extends ROLAP with transactional capabilities, and some cloud vendors are evolving it toward HSAP (Hybrid Serving/Analytical Processing).
Common open‑source ROLAP products
Wide‑table models (e.g., Druid, ClickHouse) provide excellent query speed for time‑series and monitoring workloads but have limited join capabilities. Star‑schema or snowflake models (e.g., Greenplum, Presto, Impala) support complex analytical queries at the cost of higher latency.
Performance‑boosting techniques for ROLAP
Modern ROLAP systems employ a suite of optimizations:
Massively Parallel Processing (MPP) architectures replace MapReduce, delivering sub‑second query responses.
Cost‑Based Optimizer (CBO) uses detailed statistics (min/max, histograms, distinct counts) to choose join methods, join order, and execution plans.
Join strategies include broadcast (small‑table) and partitioned joins, selected based on table size estimates.
Vectorized execution engines process columns in batches, reducing virtual‑function overhead and improving CPU cache utilization.
Dynamic code generation (e.g., LLVM in Impala, reflection in Spark SQL) eliminates interpreter overhead and further speeds up tight loops.
Storage optimizations such as columnar formats, compression (zlib, Snappy, LZ4) and encoding (RLE, dictionary) reduce I/O and enable predicate push‑down.
Fine‑grained metadata (row‑group statistics, local indexes) allows early data pruning.
Data locality (HDFS short‑circuit reads) minimizes network hops.
Runtime filters (Impala) and dynamic partition pruning (Spark SQL) apply bloom‑filter predicates early to cut scanned data.
Resource management improvements (long‑lived ApplicationMasters, container reuse) reduce query startup latency.
Example of HDFS short‑circuit configuration
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>Conclusion
The article summarizes the author's understanding of data warehouses and OLAP systems, presenting a Q&A style overview of types, architectures, and a wide range of optimization techniques, while inviting readers to point out any inaccuracies.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.