A Unified View of SQL‑on‑Hadoop Systems: Architecture, Execution Plans, Optimizations, and Storage Formats
The article provides a comprehensive overview of SQL‑on‑Hadoop query engines such as Hive, Impala, Presto and Spark SQL, comparing their runtime frameworks, core components, compilation steps, optimizer strategies, CPU/IO efficiency techniques, storage formats like ORC and Parquet, and resource management in a unified perspective.
Abstract
Since the emergence of Hive, SQL‑on‑Hadoop systems have proliferated and become faster and richer in features. This article does not rank interactive query engines, but instead offers a unified perspective to highlight the common technical aspects across different systems, using Hive and Impala as primary examples while also mentioning Spark SQL, Presto, and TAJO.
0x01 System Architecture
1.1 Runtime Framework vs. MPP
SQL‑on‑Hadoop systems adopt two main architectures:
Build a query engine on top of an existing runtime framework (e.g., Hive).
Design a native MPP‑style engine from scratch (e.g., Impala, Presto).
Speed is a crucial metric. As Hive became popular, interactive query demands arose, leading to the development of MPP‑style engines such as Impala and Presto.
The MPP advantage includes DAG‑based execution, pipeline processing, efficient I/O, and thread‑level concurrency, while its disadvantages are limited scalability and weaker fault tolerance.
Modern Hive can also run on DAG frameworks (Tez, Spark), narrowing the gap with MPP systems.
1.2 Core Components
Typical SQL‑on‑Hadoop engines share common components distributed across master/worker nodes:
UI layer – provides query entry points (Web/CLI/API).
QL layer – parses queries into executable plans.
Execution layer – manages job execution, resource allocation, and result aggregation.
IO layer – interfaces with storage (HDFS, NoSQL, RDBMS) using formats, serializers, and handlers.
Storage layer – usually HDFS, but can also query external stores.
Metadata service – manages table schemas.
0x02 Execution Plan
2.1 Compilation Process
The transformation from SQL to an executable plan consists of five steps:
Convert SQL to an Abstract Syntax Tree ( AST) using tools like Antlr.
Semantic analysis (type checking, existence of tables/columns, etc.).
Generate a logical plan – a DAG of logical operators (e.g., TableScanOperator, GroupByOperator).
Apply logical optimizations.
Convert the logical plan to a physical plan (e.g., MR/Tez jobs for Hive, plan fragments for Impala).
2.2 Examples
Two illustrative examples show the relationship between SQL, logical plan, and physical plan.
2.2.1 Hive on MR
select count(1) from status_updates where ds = '2009-08-01'Corresponding compilation diagrams (omitted) illustrate the MR job generation.
2.2.2 Presto
Presto’s compilation also follows the five‑step process, producing SubPlans that are distributed to workers.
select c1.rank, count(*) from dim.city c1 join dim.city c2 on c1.id = c2.id where c1.id > 10 group by c1.rank limit 10;Key SubPlan attributes include planDistribution, outputPartitioning, and partitionBy.
2.3 Optimizer
Early Hive used rule‑based optimizations (predicate push‑down, operator merging). Later, more sophisticated rules such as correlation optimizations reduced redundant scans. Cost‑Based Optimizer (CBO) techniques from relational databases are being introduced via Apache Calcite/Optiq.
0x03 Execution Efficiency
3.1 CPU
CPU‑bound execution can degrade performance. Causes include excessive virtual‑function calls, boxing of primitive types, branch instructions, and cache misses. Solutions involve dynamic code generation (e.g., LLVM for Impala, reflection for Spark/Presto) and vectorization with SIMD.
add(int vecNum, long[] result, long[] col1, int[] col2, int[] selVec) {
if (selVec == null)
for (int i = 0; i < vecNum; i++)
result[i] = col1[i] + col2[i];
else
for (int i = 0; i < vecNum; i++) {
int selIdx = selVec[i];
result[selIdx] = col1[selIdx] + col2[selIdx];
}
}3.2 IO
Since data resides on HDFS, IO optimizations focus on HDFS features such as short‑circuit reads, zero‑copy, and disk‑aware scheduling.
0x04 Storage Formats
4.1 Overview
Columnar storage is optimal for analytical workloads. The two dominant columnar formats in the Hadoop ecosystem are ORC (by Hortonworks/Microsoft) and Parquet (by Cloudera/Twitter).
4.2 ORCFile
ORC improves upon RCFile by adding block statistics, efficient encodings (RLE, dictionary, bitmap, delta), and a hierarchical layout (row groups → strips → streams) with compression.
4.3 Parquet
Parquet shares ORC’s design goals but adds broader compatibility and Dremel‑style nested data support using definition and repetition levels.
0x05 Resource Control
5.1 Runtime Resource Adjustment
Frameworks like Tez introduce a vertex manager to dynamically adjust reduce tasks; TAJO provides progressive query optimization and dynamic join ordering.
5.2 Resource Integration
Integration with YARN introduces latency due to ApplicationMaster startup and container allocation. Tez reuses containers, while Impala waits for all containers before execution; long‑lived AM pools can mitigate startup overhead.
0x06 Other Topics
Additional advanced features include multi‑source queries (Presto, Impala, Hive), approximate query processing (HyperLogLog for distinct counts), and ongoing enhancements to functions, complex types, and optimizer capabilities.
0x07 Conclusion
SQL‑on‑Hadoop systems continue to evolve, adding analytical functions, richer data types, more encoding schemes, and stronger optimizers, promising faster and more convenient data analysis at massive scale.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
