Understanding Hive Execution Engines: MapReduce, Tez, and Spark – Principles, Optimization, and Explain Usage
This article provides a comprehensive overview of Hive's execution engines—including MapReduce, Tez, and Spark—detailing their architectures, the six-stage Hive SQL compilation process, practical Explain syntax examples, and extensive tuning parameters for each engine to improve performance in big‑data environments.
The article begins with a brief history of Hive, noting its evolution from a slow MapReduce‑based system to a flexible platform that now supports multiple execution engines such as MapReduce, Tez, and Spark.
MapReduce Engine: It describes Hive's five core components (UI, DRIVER, COMPILER, METASTORE, EXECUTION ENGINE) and outlines the six compilation stages (lexical and syntax parsing, semantic analysis, logical plan generation, logical optimization, physical plan generation, and physical optimization), illustrated with sample SQL and the corresponding abstract syntax trees.
The first compilation stage is described as follows.
Lexical and syntax parsing: Antlr defines the SQL grammar rules and performs lexical and syntax analysis, converting the SQL into an abstract syntax tree (AST). A full SELECT statement serves as the example: select * from dim.dim_region where dt = '2021-05-23';
Explain Syntax: The article explains Hive's EXPLAIN [EXTENDED|CBO|AST|...] query command and its two output sections (stage dependencies and stage plans), and demonstrates how to interpret operator trees, predicates, and join behavior.
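To make the "stage dependencies" section concrete, here is a minimal Python sketch that reads that section of an EXPLAIN result into a dependency map. The sample text is a hand-written illustration of the usual layout, not output taken from the article, and the helper name parse_stage_dependencies is hypothetical. It handles only the two simplest line forms ("is a root stage" and "depends on stages:"); conditional stages and other variants are ignored.

```python
import re

# Hand-written sample in the usual EXPLAIN layout (illustrative only).
SAMPLE_EXPLAIN = """\
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
"""


def parse_stage_dependencies(explain_text):
    """Return {stage: [stages it depends on]} from the STAGE DEPENDENCIES section."""
    deps = {}
    in_section = False
    for line in explain_text.splitlines():
        stripped = line.strip()
        if stripped == "STAGE DEPENDENCIES:":
            in_section = True
            continue
        if stripped == "STAGE PLANS:":
            break  # the dependency section has ended
        if not in_section or not stripped:
            continue
        root = re.fullmatch(r"(Stage-\d+) is a root stage", stripped)
        if root:
            deps[root.group(1)] = []
            continue
        dep = re.fullmatch(r"(Stage-\d+) depends on stages: (.+)", stripped)
        if dep:
            parents = [s.strip() for s in dep.group(2).split(",")]
            deps[dep.group(1)] = parents
    return deps


stage_deps = parse_stage_dependencies(SAMPLE_EXPLAIN)
for stage, parents in stage_deps.items():
    print(stage, "<-", parents or "root")
```

Reading the map from roots outward gives the order in which Hive would run the stages, which is often the first thing worth checking before reading the per-stage operator trees.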
Tez Engine: It introduces Tez as a DAG‑based framework that splits Map and Reduce tasks into finer‑grained components, lists its six programmable interfaces (Input, Output, Partitioner, Processor, Task, Master), and compares Tez to traditional MapReduce, highlighting reduced I/O and faster execution. Configuration examples such as
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
are provided, along with memory and container tuning parameters.
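As a rough illustration of how the engine switch and the memory and container parameters fit together, the Python sketch below renders a handful of Tez-related settings as hive-site.xml property entries. The property names are standard Hive settings, but every value is an assumed placeholder rather than a recommendation from the article. Sizing the JVM heap at about 80% of the container is a common rule of thumb, assumed here. The helper name render_hive_site is hypothetical.

```python
from xml.sax.saxutils import escape


def render_hive_site(settings):
    """Render a dict of settings as <property> entries inside <configuration>."""
    lines = ["<configuration>"]
    for name, value in settings.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(str(value))}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


# Assumed container size in MB; tune to the cluster's YARN limits.
container_mb = 4096

tez_settings = {
    "hive.execution.engine": "tez",
    "hive.tez.container.size": container_mb,
    # Assumed rule of thumb: heap is roughly 80% of the container.
    "hive.tez.java.opts": f"-Xmx{int(container_mb * 0.8)}m",
    # Container pre-warming; the container count is a placeholder.
    "hive.prewarm.enabled": "true",
    "hive.prewarm.numcontainers": 10,
}

hive_site_xml = render_hive_site(tez_settings)
print(hive_site_xml)
```

Deriving the heap option from the container size in one place keeps the two values from drifting apart when the container size is changed later.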
Spark Engine: The article covers Hive on Spark, describing how Hive tables are converted to RDDs, the role of SparkCompiler in building MapWork and ReduceWork, and the execution flow using foreachAsync. It details Spark‑specific settings (executor memory, cores, dynamic allocation) and Hive‑on‑Spark tuning settings such as hive.auto.convert.join.noconditionaltask.size and hive.stats.collect.rawdatasize.
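The settings named above are typically applied per session with SET statements. The following Python sketch generates such statements from a dictionary, assuming placeholder values throughout: the numbers are illustrative, not tuned recommendations from the article, and the helper name to_set_statements is hypothetical.

```python
def to_set_statements(settings):
    """Turn {key: value} into Hive session statements of the form SET key=value;"""
    return [f"SET {key}={value};" for key, value in settings.items()]


# All values below are assumed placeholders for illustration.
spark_settings = {
    "hive.execution.engine": "spark",
    "spark.executor.memory": "4g",
    "spark.executor.cores": 4,
    "spark.dynamicAllocation.enabled": "true",
    # Size threshold in bytes (20 MB here) below which joins may become map joins.
    "hive.auto.convert.join.noconditionaltask.size": 20 * 1024 * 1024,
    "hive.stats.collect.rawdatasize": "true",
}

statements = to_set_statements(spark_settings)
print("\n".join(statements))
```

Keeping the settings in one dictionary makes it easy to version them alongside the queries they were tuned for.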
Throughout, practical SQL examples (joins, group‑by, sub‑queries) and their corresponding Explain outputs illustrate how Hive optimizes queries across different engines, and the article concludes with best‑practice recommendations for performance tuning and container pre‑warming.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
