Why SQL Still Dominates Data Analysis: From Relational Algebra to Modern OLAP
This article explains how SQL, built on relational algebra, became the standard analysis language for OLAP engines, covering its history, data models, syntax, functions, aggregation techniques, window functions, subqueries, and practical optimization considerations for modern data warehouses.
Introduction
Data analysis relies on a language that tells an OLAP engine what data to read and how to compute results. Languages can be imperative (e.g., C/C++/Java), where the programmer spells out each step, or declarative (e.g., SQL), where the programmer describes the desired result and the engine chooses the execution plan.
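The contrast can be sketched in a few lines. This is a minimal illustration, assuming Python's built-in sqlite3 module as a stand-in for an OLAP engine; the accesslog table and its values are hypothetical.

```python
import sqlite3

# Hypothetical sample data: (host, latency) pairs.
rows = [("a", 10), ("a", 30), ("b", 20)]

# Imperative style: we spell out how to scan, group, and average.
totals = {}
for host, latency in rows:
    s, n = totals.get(host, (0, 0))
    totals[host] = (s + latency, n + 1)
imperative = {host: s / n for host, (s, n) in totals.items()}

# Declarative style: we state the result we want; the engine picks the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accesslog (host TEXT, latency REAL)")
conn.executemany("INSERT INTO accesslog VALUES (?, ?)", rows)
declarative = dict(
    conn.execute("SELECT host, avg(latency) FROM accesslog GROUP BY host")
)

print(imperative, declarative)
```

Both approaches compute the same per-host averages; only the declarative version leaves the choice of scan and grouping strategy to the engine.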
Features of a Good Analysis Language
Simple syntax and low learning curve
Clear, unambiguous semantics
Rich learning resources
Vibrant ecosystem and tooling
Extensible for complex logic
Why SQL Became the De Facto Standard
Early big-data analysis on Hadoop often required hand-written MapReduce jobs, a cumbersome imperative programming model that demanded both algorithmic and engineering skills. SQL-on-Hadoop systems (e.g., Hive) translated user queries into MapReduce jobs, dramatically improving usability and allowing BI teams to perform self‑service analysis.
SQL originated in the 1970s and has been standardized since SQL‑86, with later revisions (SQL‑92, SQL‑99, …, SQL‑2016) adding features such as window functions, JSON support, and advanced aggregation.
Data Models
Common models include relational, key/value, graph, document, and column‑family. Relational databases use tables (relations) with rows and columns, while OLAP focuses on analytical queries over large tables.
Relational Algebra Foundations
Relational algebra provides operators such as σ (select), Π (project), ∪ (union), ∩ (intersection), − (difference), × (product), ⋈ (join), ρ (rename), δ (duplicate elimination), γ (aggregation), τ (sorting), etc. These operators underpin SQL query planning and optimization.
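To make a few of these operators concrete, here is a toy sketch that models relations as Python lists of dicts. The function names, the students/scores tables, and the naive nested-loop join are illustrative assumptions, not how a real engine implements them.

```python
# Relations are modeled as lists of dicts; all names here are illustrative.
def select(rel, pred):
    """sigma: keep the rows that satisfy a predicate."""
    return [r for r in rel if pred(r)]

def project(rel, cols):
    """Pi: keep only the named columns (bag semantics, like SQL without DISTINCT)."""
    return [{c: r[c] for c in cols} for r in rel]

def join(left, right, key):
    """Equi-join on a shared column, using a naive nested loop."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def distinct(rel):
    """delta: duplicate elimination, preserving first-seen order."""
    seen, out = set(), []
    for r in rel:
        k = tuple(sorted(r.items()))
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

students = [{"sid": 1, "grade": 1}, {"sid": 2, "grade": 2}]
scores = [
    {"sid": 1, "score": 90},
    {"sid": 1, "score": 70},
    {"sid": 2, "score": 80},
]

# Roughly: SELECT grade FROM students JOIN scores USING (sid) WHERE score >= 80
joined = join(students, scores, "sid")
result = project(select(joined, lambda r: r["score"] >= 80), ["grade"])

# Roughly: SELECT DISTINCT grade FROM students JOIN scores USING (sid)
grades = distinct(project(joined, ["grade"]))
print(result, grades)
```

An optimizer works on expressions of this shape, for example by pushing the selection below the join so fewer rows are joined.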
SQL Syntax Overview
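A typical SELECT statement is written with its clauses in the following order (the logical evaluation order differs: FROM and WHERE are processed before GROUP BY, HAVING, and SELECT):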
WITH with_query [...]
SELECT expr
FROM table
WHERE bool_expr
GROUP BY columns
HAVING condition
ORDER BY expr
LIMIT count
The engine parses the query, builds an abstract syntax tree, generates a logical plan, applies optimizations, and finally produces a high‑performance operator graph.
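The clause skeleton above can be exercised end to end. The sketch below assumes Python's built-in sqlite3 module and a hypothetical log table; EXPLAIN QUERY PLAN is SQLite-specific, and its output varies by version, so it is only printed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (grade INTEGER, class TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [(1, "a", 90), (1, "a", 50), (1, "b", 70), (2, "a", 85), (2, "b", 40)],
)

query = """
WITH passed AS (                 -- WITH: a named subquery
    SELECT grade, class, score
    FROM log
    WHERE score >= 60            -- WHERE: filters rows before grouping
)
SELECT grade, count(*) AS n      -- SELECT: evaluated after grouping
FROM passed
GROUP BY grade                   -- GROUP BY: one bucket per grade
HAVING count(*) >= 2             -- HAVING: filters whole groups
ORDER BY n DESC
LIMIT 2
"""
result = conn.execute(query).fetchall()

# Ask SQLite how it intends to run the query (format is engine-specific).
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(result)
for step in plan:
    print(step)
```

Three rows pass the WHERE filter, two of them in grade 1, so only that group survives the HAVING condition.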
SELECT Clause Details
Column reference: SELECT column_name
Scalar function: SELECT round(key,1)
Aggregate function: SELECT avg(value)
Scalar functions operate row‑by‑row without changing row count, while aggregate functions collapse multiple rows into a single result (or one result per GROUP BY bucket).
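The difference in row counts is easy to observe. A small sketch, assuming Python's built-in sqlite3 module and a hypothetical three-row table t:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (key REAL, value REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(1.25, 10), (1.34, 20), (2.51, 60)]
)

# Scalar function: one output row per input row.
scalar_rows = conn.execute("SELECT round(key, 1) FROM t").fetchall()

# Aggregate without GROUP BY: all rows collapse into a single row.
agg_rows = conn.execute("SELECT avg(value) FROM t").fetchall()

# Aggregate with GROUP BY: one output row per bucket.
grouped = conn.execute(
    "SELECT round(key, 0), avg(value) FROM t "
    "GROUP BY round(key, 0) ORDER BY round(key, 0)"
).fetchall()

print(len(scalar_rows), agg_rows, grouped)
```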
Aggregate Functions and Extensions
Custom user-defined aggregate functions (UDAFs) can be registered for specialized aggregation. Selective aggregation allows each aggregate to have its own filter:
SELECT key,
agg1(x) FILTER (WHERE cond1),
agg2(y) FILTER (WHERE cond2)
FROM tbl
GROUP BY key
Distinct aggregation (e.g., COUNT(DISTINCT key)) removes duplicates before aggregation. Null handling: aggregates ignore null inputs, whereas COUNT(*) counts all rows regardless of nulls.
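These three behaviors can be checked on a small table. The sketch assumes Python's built-in sqlite3 module linked against SQLite 3.30 or later (the first release to accept FILTER on aggregates); the table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (key TEXT, x INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [("a", 1, 10), ("a", 5, None), ("b", 7, 30), ("b", 7, 40)],
)

# Selective aggregation: each aggregate sees only the rows its filter accepts.
selective = conn.execute("""
    SELECT key,
           sum(x)   FILTER (WHERE x > 2),
           count(*) FILTER (WHERE y >= 30)
    FROM tbl
    GROUP BY key
    ORDER BY key
""").fetchall()

# count(*) counts rows, count(y) skips nulls, DISTINCT drops duplicates first.
counts = conn.execute(
    "SELECT count(*), count(y), count(DISTINCT x), sum(y) FROM tbl"
).fetchone()

print(selective, counts)
```

The row with a null y is skipped by count(y) and sum(y) but still counted by count(*).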
GROUP BY and Grouping Sets
GROUP BY partitions rows into buckets; each bucket produces its own aggregates. Grouping sets, CUBE, and ROLLUP enable multiple grouping combinations in a single query:
SELECT grade, class, COUNT(1)
FROM log
GROUP BY GROUPING SETS ((grade, class), (grade), (class))
Window Functions
Window functions behave like scalar functions but can use aggregate logic over a defined window, preserving the original row count. Example:
SELECT avg(latency) OVER (PARTITION BY host) AS host_avg
FROM accesslog
Frames can be bounded (e.g., RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING) to limit the rows considered.
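A runnable version of the window example, with a bounded frame added, might look like the sketch below. It assumes Python's built-in sqlite3 module with SQLite 3.25 or later; the ts column and the sample rows are hypothetical. The frame here uses ROWS, which counts physical rows, rather than RANGE, which compares ORDER BY values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accesslog (host TEXT, ts INTEGER, latency REAL)")
conn.executemany(
    "INSERT INTO accesslog VALUES (?, ?, ?)",
    [("h1", 1, 10), ("h1", 2, 20), ("h1", 3, 60), ("h2", 1, 5)],
)

rows = conn.execute("""
    SELECT host, ts, latency,
           -- whole partition: the same value on every row of a host
           avg(latency) OVER (PARTITION BY host) AS host_avg,
           -- bounded frame: the current row and the one before it
           avg(latency) OVER (
               PARTITION BY host ORDER BY ts
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_avg
    FROM accesslog
    ORDER BY host, ts
""").fetchall()

for r in rows:
    print(r)
```

All four input rows appear in the output, which is the key difference from GROUP BY.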
Subqueries
Subqueries can appear in SELECT, FROM, WHERE, HAVING, and ORDER BY clauses. Types include scalar subqueries (single value), multi‑row subqueries, and EXISTS/NOT EXISTS checks. Correlated subqueries reference outer query columns and are often rewritten as joins during optimization.
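The sketch below shows a scalar subquery, a correlated EXISTS check, and a hand-written join that returns the same rows as the EXISTS form. It assumes Python's built-in sqlite3 module; the hosts and accesslog tables are hypothetical, and the join rewrite is done manually here rather than by an optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hosts (host TEXT, region TEXT);
    CREATE TABLE accesslog (host TEXT, latency REAL);
    INSERT INTO hosts VALUES ('h1', 'east'), ('h2', 'west'), ('h3', 'west');
    INSERT INTO accesslog VALUES ('h1', 10), ('h1', 50), ('h2', 20);
""")

# Scalar subquery: a single value compared against every row.
above_avg = conn.execute("""
    SELECT host, latency FROM accesslog
    WHERE latency > (SELECT avg(latency) FROM accesslog)
""").fetchall()

# Correlated EXISTS: the inner query references h.host from the outer query.
with_exists = conn.execute("""
    SELECT h.host FROM hosts h
    WHERE EXISTS (SELECT 1 FROM accesslog a WHERE a.host = h.host)
    ORDER BY h.host
""").fetchall()

# Equivalent join form; DISTINCT removes the duplicates the join introduces.
with_join = conn.execute("""
    SELECT DISTINCT h.host
    FROM hosts h JOIN accesslog a ON a.host = h.host
    ORDER BY h.host
""").fetchall()

print(above_avg, with_exists, with_join)
```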
Null and Unknown Handling
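When a scalar function receives null, it generally returns null; null-handling functions such as COALESCE are the exception. Boolean expressions with null produce an UNKNOWN state, which propagates through logical operators (AND, OR, NOT) according to three‑valued logic: for example, TRUE OR UNKNOWN is TRUE, while TRUE AND UNKNOWN is UNKNOWN. A WHERE clause keeps only rows whose condition is TRUE, so rows that evaluate to UNKNOWN are dropped. IS NULL / IS NOT NULL tests are used to detect nulls.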
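The truth table can be probed directly. This sketch assumes Python's built-in sqlite3 module; SQLite represents TRUE and FALSE as 1 and 0 and UNKNOWN as NULL, which Python sees as None. The helper ev and the table t are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def ev(expr):
    """Evaluate a constant SQL expression and return its single value."""
    return conn.execute(f"SELECT {expr}").fetchone()[0]

logic = {
    "NULL AND 0": ev("NULL AND 0"),      # FALSE wins over UNKNOWN
    "NULL AND 1": ev("NULL AND 1"),      # stays UNKNOWN
    "NULL OR 1": ev("NULL OR 1"),        # TRUE wins over UNKNOWN
    "NULL OR 0": ev("NULL OR 0"),        # stays UNKNOWN
    "NOT NULL": ev("NOT NULL"),          # stays UNKNOWN
    "NULL = NULL": ev("NULL = NULL"),    # comparison with null is UNKNOWN
    "NULL IS NULL": ev("NULL IS NULL"),  # the explicit null test is TRUE
}
scalar_null = ev("abs(NULL)")            # null in, null out
fallback = ev("coalesce(NULL, 0)")       # COALESCE replaces the null

# WHERE keeps only TRUE rows, so the null row matches neither predicate.
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (2,)])
eq = conn.execute("SELECT count(*) FROM t WHERE v = 1").fetchone()[0]
ne = conn.execute("SELECT count(*) FROM t WHERE v <> 1").fetchone()[0]

print(logic, scalar_null, fallback, eq, ne)
```

Because eq and ne together cover only two of the three rows, a predicate and its negation do not partition a table that contains nulls.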
UNNEST Syntax
UNNEST expands a single row containing an array into multiple rows (shown here in Presto/Trino-style syntax; other dialects differ):
SELECT element FROM (VALUES (ARRAY[1,2,3])) AS t(arr) CROSS JOIN UNNEST(arr) AS u(element)
Other SQL Statements
Beyond SELECT, the full SQL language also includes DDL statements such as CREATE and DML statements such as INSERT.
Conclusion
The article provides a comprehensive overview of SQL’s core concepts, from relational algebra foundations to modern extensions like window functions and selective aggregation, equipping readers with the knowledge needed to write efficient analytical queries in contemporary data‑warehouse environments.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
