
Why SQL Still Dominates Data Analysis: From Relational Algebra to Modern OLAP

This article explains how SQL, built on relational algebra, became the standard analysis language for OLAP engines, covering its history, data models, syntax, functions, aggregation techniques, window functions, subqueries, and practical optimization considerations for modern data warehouses.

ITPUB

Apr 23, 2023


Introduction

Data analysis relies on a programming language to tell an OLAP engine which data to read and how to compute results. Languages can be imperative (e.g., C/C++/Java) or declarative (e.g., SQL). A declarative language describes the desired result and leaves the choice of execution plan to the engine.
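As a rough illustration of the imperative/declarative split, the sketch below computes the same per‑service average twice: once with an explicit Python loop and once with a SQL query run through Python's built‑in sqlite3 module. The table name, column names, and sample rows are made up for the example.

```python
import sqlite3

# Hypothetical (service, latency_ms) sample data.
rows = [("web", 120), ("web", 80), ("db", 300)]

def avg_imperative(data):
    """Imperative: spell out HOW to accumulate sums and counts."""
    totals = {}
    for service, latency in data:
        s, n = totals.get(service, (0, 0))
        totals[service] = (s + latency, n + 1)
    return {k: s / n for k, (s, n) in totals.items()}

def avg_declarative(data):
    """Declarative: state WHAT is wanted; the engine chooses the plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (service TEXT, latency_ms INTEGER)")
    conn.executemany("INSERT INTO requests VALUES (?, ?)", data)
    result = dict(conn.execute(
        "SELECT service, avg(latency_ms) FROM requests GROUP BY service"
    ))
    conn.close()
    return result
```

Both functions return the same mapping; only the SQL version leaves room for the engine to reorder or parallelize the work.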

Features of a Good Analysis Language

Simple syntax and low learning curve

Clear, unambiguous semantics

Rich learning resources

Vibrant ecosystem and tooling

Extensible for complex logic

Why SQL Became the De Facto Standard

Early big‑data analysis used MapReduce, an imperative programming model that was cumbersome to use and required both algorithmic and engineering skills. SQL‑on‑Hadoop engines (e.g., Hive) translated user queries into MapReduce jobs, dramatically improving usability and allowing BI teams to perform self‑service analysis.

Since the 1970s, SQL has evolved through ANSI standards (SQL‑86, SQL‑92, SQL‑99, …, SQL‑2016), adding features such as window functions, JSON support, and advanced aggregation.

Data Models

Common models include relational, key/value, graph, document, and column‑family. Relational databases use tables (relations) with rows and columns, while OLAP focuses on analytical queries over large tables.

Relational Algebra Foundations

Relational algebra provides operators such as σ (selection), π (projection), ∪ (union), ∩ (intersection), − (difference), × (Cartesian product), ⋈ (join), and ρ (rename). Extended forms of the algebra add δ (duplicate elimination), γ (grouping and aggregation), and τ (sorting). These operators underpin SQL query planning and optimization.
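To make the correspondence concrete, here is a small sketch that annotates each SQL clause with the algebra operator it roughly maps to. The emp and dept tables and their contents are hypothetical, and SQLite (via Python's sqlite3) is used only because it runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, dept_id INTEGER, salary INTEGER);
    CREATE TABLE dept (dept_id INTEGER, dept_name TEXT);
    INSERT INTO emp VALUES ('ann', 1, 100), ('bob', 1, 200), ('cy', 2, 50);
    INSERT INTO dept VALUES (1, 'eng'), (2, 'ops');
""")

# Roughly: tau( gamma_{dept_name; sum(salary)}( sigma_{salary > 60}( emp JOIN dept ) ) )
algebra_as_sql = """
    SELECT d.dept_name, sum(e.salary) AS total   -- π (projection) of the γ output
    FROM emp AS e                                -- ρ (rename) via AS
    JOIN dept AS d ON e.dept_id = d.dept_id      -- ⋈ (join)
    WHERE e.salary > 60                          -- σ (selection)
    GROUP BY d.dept_name                         -- γ (grouping and aggregation)
    ORDER BY d.dept_name                         -- τ (sorting)
"""
result = conn.execute(algebra_as_sql).fetchall()
```

Only the eng department survives the σ step here, since the single ops employee earns less than the threshold.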

SQL Syntax Overview

A typical SELECT statement is written in this clause order:

WITH with_query [...]
SELECT expr
FROM table
WHERE bool_expr
GROUP BY columns
HAVING condition
ORDER BY expr
LIMIT count

The engine parses the query, builds an abstract syntax tree, generates a logical plan, applies optimizations, and finally produces a physical plan: a graph of executable operators.
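The clause order in which a query is written is not the order in which it is conceptually evaluated. The sketch below runs one query that uses every clause listed above and numbers each clause by its usual logical evaluation step. The log table and its rows are invented for illustration, and real engines are free to reorder the work as long as the result is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (host TEXT, status INTEGER, latency INTEGER)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [("a", 200, 10), ("a", 200, 30), ("a", 500, 99),
     ("b", 200, 50), ("b", 200, 70), ("c", 200, 5)],
)

query = """
    WITH ok AS (                              -- (0) named subquery, usable below
        SELECT host, latency FROM log WHERE status = 200
    )
    SELECT host, avg(latency) AS avg_latency  -- (5) compute output expressions
    FROM ok                                   -- (1) pick the input rows
    WHERE latency > 0                         -- (2) filter individual rows
    GROUP BY host                             -- (3) bucket the surviving rows
    HAVING count(*) >= 2                      -- (4) filter whole buckets
    ORDER BY avg_latency DESC                 -- (6) sort the output
    LIMIT 1                                   -- (7) truncate the output
"""
slowest = conn.execute(query).fetchall()
```

Host c is removed by HAVING because it has a single successful request, and LIMIT keeps only the host with the highest average.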

SELECT Clause Details

Column reference: SELECT column_name

Scalar function: SELECT round(key,1)

Aggregate function: SELECT avg(value)

Scalar functions operate row‑by‑row without changing row count, while aggregate functions collapse multiple rows into a single result (or one result per GROUP BY bucket).
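The effect on row count can be checked directly. In this sketch (table t and its values are hypothetical), a scalar function returns one output row per input row, a plain aggregate returns one row, and a grouped aggregate returns one row per bucket.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (key TEXT, value REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", 1.04), ("a", 2.06), ("b", 3.0)])

# Scalar function: applied to each row independently.
scalar_rows = conn.execute("SELECT round(value, 1) FROM t").fetchall()

# Aggregate function: the whole table collapses into one row.
aggregate_rows = conn.execute("SELECT avg(value) FROM t").fetchall()

# Aggregate with GROUP BY: one row per distinct key.
grouped_rows = conn.execute(
    "SELECT key, avg(value) FROM t GROUP BY key"
).fetchall()
```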

Aggregate Functions and Extensions

Custom user‑defined aggregate functions (UDAFs) can be registered for specialized aggregation. Selective aggregation gives each aggregate its own filter through the FILTER clause:

SELECT key,
       agg1(x) FILTER (WHERE cond1),
       agg2(y) FILTER (WHERE cond2)
FROM tbl
GROUP BY key
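A runnable version of the pattern might look like the following, which also shows the CASE‑based form commonly used on engines without FILTER. The accesslog rows are invented, and the example assumes an SQLite build of version 3.30 or later, where the FILTER clause on aggregates is available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accesslog (host TEXT, status INTEGER, latency INTEGER)")
conn.executemany(
    "INSERT INTO accesslog VALUES (?, ?, ?)",
    [("a", 200, 10), ("a", 500, 30),
     ("b", 200, 20), ("b", 200, 40), ("b", 404, 5)],
)

# Each aggregate sees only the rows that pass its own condition.
with_filter = conn.execute("""
    SELECT host,
           count(*)     FILTER (WHERE status >= 400) AS errors,
           avg(latency) FILTER (WHERE status = 200)  AS ok_latency
    FROM accesslog
    GROUP BY host
    ORDER BY host
""").fetchall()

# Equivalent formulation: CASE yields NULL for excluded rows,
# and aggregates skip NULL inputs.
with_case = conn.execute("""
    SELECT host,
           count(CASE WHEN status >= 400 THEN 1 END)      AS errors,
           avg(CASE WHEN status = 200 THEN latency END)   AS ok_latency
    FROM accesslog
    GROUP BY host
    ORDER BY host
""").fetchall()
```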

Distinct aggregation (e.g., COUNT(DISTINCT key)) removes duplicates before aggregating. Null handling: aggregates such as SUM, AVG, MIN, MAX, and COUNT(column) skip null inputs, while COUNT(*) counts every row regardless of nulls.
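The difference between these counting forms is easy to see on a tiny table that contains duplicates and nulls. The table and values below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (key TEXT, v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("x", 10), ("x", None), ("y", 20), (None, None)])

all_rows, non_null_keys, distinct_keys, avg_v, sum_v = conn.execute("""
    SELECT count(*),              -- every row, nulls included
           count(key),            -- rows where key is not null
           count(DISTINCT key),   -- distinct non-null keys
           avg(v),                -- nulls are skipped, not treated as 0
           sum(v)
    FROM t
""").fetchone()
```

Note that avg(v) divides by the number of non‑null values (2), not by the number of rows (4).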

GROUP BY and Grouping Sets

GROUP BY partitions rows into buckets; each bucket produces its own aggregates. Grouping sets, CUBE, and ROLLUP enable multiple grouping combinations in a single query:

SELECT grade, class, COUNT(1)
FROM log
GROUP BY GROUPING SETS ((grade, class), (grade), (class))
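One way to understand GROUPING SETS is as shorthand for a UNION ALL of separate GROUP BY queries, with NULL filling the columns that a given set does not group on. SQLite does not support GROUPING SETS, so the sketch below runs the expanded form only; the log rows are made up, and a warehouse engine would normally compute all sets in a single pass rather than three scans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (grade INTEGER, class TEXT)")
conn.executemany("INSERT INTO log VALUES (?, ?)",
                 [(1, "A"), (1, "A"), (1, "B"), (2, "A")])

# Expanded equivalent of:
#   GROUP BY GROUPING SETS ((grade, class), (grade), (class))
rows = conn.execute("""
    SELECT grade, class, count(*) FROM log GROUP BY grade, class
    UNION ALL
    SELECT grade, NULL,  count(*) FROM log GROUP BY grade
    UNION ALL
    SELECT NULL,  class, count(*) FROM log GROUP BY class
""").fetchall()
```

In the same spirit, ROLLUP (grade, class) corresponds to the sets (grade, class), (grade), and (), while CUBE (grade, class) adds (class) as well.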

Window Functions

Window functions behave like scalar functions but can use aggregate logic over a defined window, preserving the original row count. Example:

SELECT avg(latency) OVER (PARTITION BY host) AS host_avg
FROM accesslog

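Frames can be bounded to limit the rows considered: ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING counts physical rows around the current row, while RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING uses value offsets on the ORDER BY key. A bounded frame is only meaningful with an ORDER BY inside the OVER clause.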
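The following sketch shows both ideas on a small, invented accesslog table: a per‑partition average with no frame, and a two‑row moving average that uses a ROWS frame (a simpler frame than the one quoted above). It assumes SQLite 3.25 or later, which added window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accesslog (host TEXT, ts INTEGER, latency INTEGER)")
conn.executemany("INSERT INTO accesslog VALUES (?, ?, ?)",
                 [("a", 1, 10), ("a", 2, 20), ("a", 3, 60), ("b", 1, 100)])

rows = conn.execute("""
    SELECT host, ts, latency,
           -- whole-partition average, repeated on every row of the host
           avg(latency) OVER (PARTITION BY host) AS host_avg,
           -- average of the previous row and the current row, in ts order
           avg(latency) OVER (
               PARTITION BY host ORDER BY ts
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_avg
    FROM accesslog
    ORDER BY host, ts
""").fetchall()
```

The query returns four rows for four input rows, which is the key difference from GROUP BY.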

Subqueries

Subqueries can appear in SELECT, FROM, WHERE, HAVING, and ORDER BY clauses. Types include scalar subqueries (single value), multi‑row subqueries, and EXISTS/NOT EXISTS checks. Correlated subqueries reference outer query columns and are often rewritten as joins during optimization.
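As an illustration, the sketch below runs a scalar subquery, a correlated subquery, and a hand‑written join that returns the same rows as the correlated form. The emp table is hypothetical, and the join is only one possible rewrite, not necessarily the one a given optimizer would choose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("ann", "eng", 100), ("bob", "eng", 200),
                  ("cy", "ops", 50), ("di", "ops", 70)])

# Scalar subquery: evaluated once, yields a single value.
above_overall = conn.execute("""
    SELECT name FROM emp
    WHERE salary > (SELECT avg(salary) FROM emp)
    ORDER BY name
""").fetchall()

# Correlated subquery: the inner query references e.dept from the outer row.
correlated = conn.execute("""
    SELECT e.name FROM emp AS e
    WHERE e.salary > (SELECT avg(salary) FROM emp WHERE dept = e.dept)
    ORDER BY e.name
""").fetchall()

# Decorrelated form: aggregate once per dept, then join.
rewritten = conn.execute("""
    SELECT e.name
    FROM emp AS e
    JOIN (SELECT dept, avg(salary) AS dept_avg FROM emp GROUP BY dept) AS d
      ON e.dept = d.dept
    WHERE e.salary > d.dept_avg
    ORDER BY e.name
""").fetchall()
```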

Null and Unknown Handling

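Most scalar functions return null when given a null argument (functions designed to handle nulls, such as COALESCE, are the exception). Comparisons involving null evaluate to UNKNOWN, which logical operators combine under three‑valued logic: UNKNOWN AND FALSE is FALSE, UNKNOWN OR TRUE is TRUE, and NOT UNKNOWN stays UNKNOWN. WHERE and HAVING keep only rows whose condition is TRUE, so rows that evaluate to UNKNOWN are dropped. Use IS NULL / IS NOT NULL to test for nulls, since = NULL never evaluates to TRUE.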
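These rules can be checked with a few literal expressions. In the sketch below, SQLite represents TRUE and FALSE as 1 and 0 and UNKNOWN as NULL (Python None); the table t is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three-valued logic on literals.
truth = conn.execute("""
    SELECT NULL AND 0,    -- FALSE: one FALSE operand decides AND
           NULL AND 1,    -- UNKNOWN
           NULL OR 1,     -- TRUE: one TRUE operand decides OR
           NULL OR 0,     -- UNKNOWN
           NOT (NULL),    -- UNKNOWN
           NULL = NULL,   -- UNKNOWN, not TRUE
           NULL IS NULL   -- TRUE
""").fetchone()

# Scalar functions: abs propagates null, coalesce replaces it.
scalar = conn.execute("SELECT abs(NULL), coalesce(NULL, 0)").fetchone()

# A null row satisfies neither a predicate nor its negation.
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (3,)])
matched = conn.execute("SELECT count(*) FROM t WHERE x > 1").fetchone()[0]
not_matched = conn.execute("SELECT count(*) FROM t WHERE NOT (x > 1)").fetchone()[0]
```

Here matched plus not_matched is 2, one short of the table's three rows, because the null row is excluded from both results.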

UNNEST Syntax

UNNEST expands a single row containing an array into multiple rows:

SELECT element FROM UNNEST(ARRAY[1,2,3]) AS t(element)
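SQLite has no array type, so the semantics are sketched here in plain Python instead: each input row is repeated once per array element, which is roughly what a CROSS JOIN UNNEST does in engines that support it. The function name and sample rows are hypothetical.

```python
from typing import Iterable, Iterator, Sequence, Tuple

def unnest(rows: Iterable[Tuple[str, Sequence[int]]]) -> Iterator[Tuple[str, int]]:
    """Emit one output row per array element, repeating the other column."""
    for key, array in rows:
        for element in array:
            yield (key, element)

# Hypothetical table with a scalar column and an array column.
table = [("a", [1, 2, 3]), ("b", []), ("c", [4])]
expanded = list(unnest(table))
```

A row whose array is empty produces no output rows in this sketch, matching the usual cross‑join behavior.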

Other SQL Statements

Beyond SELECT, the full SQL language includes DDL statements such as CREATE, ALTER, and DROP, and DML statements such as INSERT, UPDATE, and DELETE.

Conclusion

The article provides a comprehensive overview of SQL’s core concepts, from relational algebra foundations to modern extensions like window functions and selective aggregation, equipping readers with the knowledge needed to write efficient analytical queries in contemporary data‑warehouse environments.

