Databases 21 min read

Understanding OLAP vs. OLTP and the Fundamentals of Data Warehousing

This article explains the core differences between OLTP and OLAP, evaluates whether traditional OLTP databases like MySQL can handle analytical workloads, introduces benchmark queries, and provides a comprehensive overview of data‑warehouse concepts such as data sources, fact and dimension tables, multi‑dimensional modeling, and common cube operations.

DataFunTalk
DataFunTalk
DataFunTalk
Understanding OLAP vs. OLTP and the Fundamentals of Data Warehousing

Over the past seven years the author has worked on MySQL, MongoDB, and Neo4j at NetEase Hangzhou R&D, focusing on OLTP workloads, and recently began studying OLAP technologies, summarizing the findings as a personal note.

1. What is OLAP and how does it differ from OLTP? OLTP (Online Transaction Processing) handles simple, low‑volume transactions such as banking or e‑commerce operations that modify a few rows in one or two tables. OLAP (Online Analytical Processing) performs complex analytical queries on large data sets, often aggregating across many tables without indexes, and is typically used by analysts, marketers, and business users.

Key distinctions include data volume, query complexity, and the target system: OLTP uses a transactional database, while OLAP operates on a data warehouse.

2. Can MySQL (an OLTP database) handle OLAP workloads? MySQL can execute some analytical queries, but it lacks the sophisticated optimizer needed for large‑scale aggregations. Recent MySQL 8.0 improvements (window functions, CTEs, better joins) make it more capable for small‑to‑medium analytical tasks, especially when run on read‑only replicas.

3. How do OLAP queries differ from OLTP queries? OLTP queries are usually point selects or simple range scans with limits. OLAP queries involve aggregations, multi‑table joins, and non‑indexed predicates, making them CPU‑bound rather than I/O‑bound.

Benchmark examples illustrate typical OLTP (sysbench) and OLAP (TPC‑DS) queries.

local stmt_defs = {
   point_selects = {"SELECT c FROM sbtest%u WHERE id=?", t.INT},
   simple_ranges = {"SELECT c FROM sbtest%u WHERE id BETWEEN ? AND ?", t.INT, t.INT},
   sum_ranges = {"SELECT SUM(k) FROM sbtest%u WHERE id BETWEEN ? AND ?", t.INT, t.INT},
   order_ranges = {"SELECT c FROM sbtest%u WHERE id BETWEEN ? AND ? ORDER BY c", t.INT, t.INT},
   distinct_ranges = {"SELECT DISTINCT c FROM sbtest%u WHERE id BETWEEN ? AND ? ORDER BY c", t.INT, t.INT}
}
SELECT c FROM sbtest10 WHERE id=4352
SELECT c FROM sbtest10 WHERE id BETWEEN 5046 AND 5046+99 ORDER BY c
SELECT c FROM sbtest3 WHERE id BETWEEN 4983 AND 4983+99
SELECT SUM(K) FROM sbtest1 WHERE id BETWEEN 4981 AND 4981+99
SELECT DISTINCT c FROM sbtest3 WHERE id BETWEEN 4989 AND 4989+99 ORDER BY c

TPCC example (simplified):

-- SELECT c_id FROM customer WHERE c_w_id = :c_w_id AND c_d_id = :c_d_id AND c_last = :c_last ORDER BY c_first;
-- SELECT c_balance, c_first, c_middle, c_last FROM customer WHERE c_w_id = :c_w_id AND c_d_id = :c_d_id AND c_id = :c_id;
-- SELECT c_discount, c_last, c_credit, w_tax FROM customer, warehouse WHERE w_id = :w_id AND c_w_id = w_id AND c_d_id = :d_id AND c_id = :c_id;

TP‑DS query example (partial):

-- query68
SELECT "c_last_name", "c_first_name", "ca_city", "bought_city", "ss_ticket_number", "extended_price", "extended_tax", "list_price"
FROM (
   SELECT "ss_ticket_number", "ss_customer_sk", "ca_city" "bought_city",
          SUM("ss_ext_sales_price") "extended_price",
          SUM("ss_ext_list_price") "list_price",
          SUM("ss_ext_tax") "extended_tax"
   FROM ${database}.${schema}.store_sales,
        ${database}.${schema}.date_dim,
        ${database}.${schema}.store,
        ${database}.${schema}.household_demographics,
        ${database}.${schema}.customer_address
   WHERE (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
     AND (store_sales.ss_store_sk = store.s_store_sk)
     AND (store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk)
     AND (store_sales.ss_addr_sk = customer_address.ca_address_sk)
     AND (date_dim.d_dom BETWEEN 1 AND 2)
     AND ((household_demographics.hd_dep_count = 4) OR (household_demographics.hd_vehicle_count = 3))
     AND (date_dim.d_year IN (1999, 2000, 2001))
     AND (store.s_city IN ('Midway', 'Fairview'))
   GROUP BY "ss_ticket_number", "ss_customer_sk", "ss_addr_sk", "ca_city"
) dn, ${database}.${schema}.customer, ${database}.${schema}.customer_address current_addr
WHERE (ss_customer_sk = c_customer_sk)
  AND (customer.c_current_addr_sk = current_addr.ca_address_sk)
  AND (current_addr.ca_city <> bought_city)
ORDER BY "c_last_name" ASC, "ss_ticket_number" ASC
LIMIT 100;

4. Can OLTP and OLAP be unified? HTAP (Hybrid Transaction/Analytical Processing) systems such as PingCAP TiDB attempt to serve both workloads by storing data in multiple replicas (row‑store for TP, column‑store for AP) and separating compute layers, but full unification remains challenging.

5. Basics of a data warehouse – Data originates from three sources: structured data (relational tables), semi‑structured data (JSON, logs, NoSQL docs), and unstructured data (images, audio, video) which are stored as metadata. A warehouse supports ad‑hoc queries, BI reporting, analytics, and data mining.

The warehouse model is multi‑dimensional (star schema) consisting of a central fact table surrounded by dimension tables. Fact tables record events (e.g., sales), while dimension tables describe attributes (e.g., product, time, location). This structure enables efficient aggregation across dimensions.

Data cube operations include Drill‑down (finer granularity), Roll‑up (aggregate to higher level), Slice (select a single dimension value), Dice (select a range or set of values), and Pivot (swap axes). These operations allow analysts to explore data from different perspectives.

The article concludes with a preview of the next installment, which will discuss data‑warehouse types, representative products, and core technologies.

analyticsSQLdata modelingdata warehouseHTAPOLAPOLTP
DataFunTalk
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.