Big Data 24 min read

A Unified View of SQL‑on‑Hadoop Systems: Architecture, Execution Plans, Optimizations, and Storage Formats

The article provides a comprehensive overview of SQL‑on‑Hadoop query engines such as Hive, Impala, Presto and Spark SQL, comparing their runtime frameworks, core components, compilation steps, optimizer strategies, CPU/IO efficiency techniques, storage formats like ORC and Parquet, and resource management in a unified perspective.

Big Data Technology & Architecture

Jun 7, 2020

A Unified View of SQL‑on‑Hadoop Systems: Architecture, Execution Plans, Optimizations, and Storage Formats

Abstract

Since the emergence of Hive, SQL‑on‑Hadoop systems have proliferated and become faster and richer in features. This article does not rank interactive query engines, but instead offers a unified perspective to highlight the common technical aspects across different systems, using Hive and Impala as primary examples while also mentioning Spark SQL, Presto, and TAJO.

0x01 System Architecture

1.1 Runtime Framework vs. MPP

SQL‑on‑Hadoop systems adopt two main architectures:

Build a query engine on top of an existing runtime framework (e.g., Hive).

Design a native MPP‑style engine from scratch (e.g., Impala, Presto).

Speed is a crucial metric. As Hive became popular, interactive query demands arose, leading to the development of MPP‑style engines such as Impala and Presto.

The MPP advantage includes DAG‑based execution, pipeline processing, efficient I/O, and thread‑level concurrency, while its disadvantages are limited scalability and weaker fault tolerance.

Modern Hive can also run on DAG frameworks (Tez, Spark), narrowing the gap with MPP systems.

1.2 Core Components

Typical SQL‑on‑Hadoop engines share common components distributed across master/worker nodes:

UI layer – provides query entry points (Web/CLI/API).

QL layer – parses queries into executable plans.

Execution layer – manages job execution, resource allocation, and result aggregation.

IO layer – interfaces with storage (HDFS, NoSQL, RDBMS) using formats, serializers, and handlers.

Storage layer – usually HDFS, but can also query external stores.

Metadata service – manages table schemas.

0x02 Execution Plan

2.1 Compilation Process

The transformation from SQL to an executable plan consists of five steps:

Convert SQL to an Abstract Syntax Tree ( AST) using tools like Antlr.

Semantic analysis (type checking, existence of tables/columns, etc.).

Generate a logical plan – a DAG of logical operators (e.g., TableScanOperator, GroupByOperator).

Apply logical optimizations.

Convert the logical plan to a physical plan (e.g., MR/Tez jobs for Hive, plan fragments for Impala).

2.2 Examples

Two illustrative examples show the relationship between SQL, logical plan, and physical plan.

2.2.1 Hive on MR

select count(1) from status_updates where ds = '2009-08-01'

Corresponding compilation diagrams (omitted) illustrate the MR job generation.

2.2.2 Presto

Presto’s compilation also follows the five‑step process, producing SubPlans that are distributed to workers.

select c1.rank, count(*) from dim.city c1 join dim.city c2 on c1.id = c2.id where c1.id > 10 group by c1.rank limit 10;

Key SubPlan attributes include planDistribution, outputPartitioning, and partitionBy.

2.3 Optimizer

Early Hive used rule‑based optimizations (predicate push‑down, operator merging). Later, more sophisticated rules such as correlation optimizations reduced redundant scans. Cost‑Based Optimizer (CBO) techniques from relational databases are being introduced via Apache Calcite/Optiq.

0x03 Execution Efficiency

3.1 CPU

CPU‑bound execution can degrade performance. Causes include excessive virtual‑function calls, boxing of primitive types, branch instructions, and cache misses. Solutions involve dynamic code generation (e.g., LLVM for Impala, reflection for Spark/Presto) and vectorization with SIMD.

add(int vecNum, long[] result, long[] col1, int[] col2, int[] selVec) {
    if (selVec == null)
        for (int i = 0; i < vecNum; i++)
            result[i] = col1[i] + col2[i];
    else
        for (int i = 0; i < vecNum; i++) {
            int selIdx = selVec[i];
            result[selIdx] = col1[selIdx] + col2[selIdx];
        }
}

3.2 IO

Since data resides on HDFS, IO optimizations focus on HDFS features such as short‑circuit reads, zero‑copy, and disk‑aware scheduling.

0x04 Storage Formats

4.1 Overview

Columnar storage is optimal for analytical workloads. The two dominant columnar formats in the Hadoop ecosystem are ORC (by Hortonworks/Microsoft) and Parquet (by Cloudera/Twitter).

4.2 ORCFile

ORC improves upon RCFile by adding block statistics, efficient encodings (RLE, dictionary, bitmap, delta), and a hierarchical layout (row groups → strips → streams) with compression.

4.3 Parquet

Parquet shares ORC’s design goals but adds broader compatibility and Dremel‑style nested data support using definition and repetition levels.

0x05 Resource Control

5.1 Runtime Resource Adjustment

Frameworks like Tez introduce a vertex manager to dynamically adjust reduce tasks; TAJO provides progressive query optimization and dynamic join ordering.

5.2 Resource Integration

Integration with YARN introduces latency due to ApplicationMaster startup and container allocation. Tez reuses containers, while Impala waits for all containers before execution; long‑lived AM pools can mitigate startup overhead.

0x06 Other Topics

Additional advanced features include multi‑source queries (Presto, Impala, Hive), approximate query processing (HyperLogLog for distinct counts), and ongoing enhancements to functions, complex types, and optimizer capabilities.

0x07 Conclusion

SQL‑on‑Hadoop systems continue to evolve, adding analytical functions, richer data types, more encoding schemes, and stronger optimizers, promising faster and more convenient data analysis at massive scale.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Optimization Big Data Query Engine Execution Plan storage format SQL on Hadoop

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.