Understanding Hive Architecture, Execution Flow, and the Shift to Tez and Spark

This article explains Hive's core components, execution architecture, how HiveQL is transformed into MapReduce jobs, the advantages of Tez over MapReduce in Hive 3.0+, and the integration of Spark with Hive for modern big‑data processing.

Big Data Technology & Architecture

Jul 15, 2021

This article is a supplementary guide to Hive, following the author's earlier, more extensive articles on Hive optimization.

Hive Working Principle and Architecture

Hive's architecture diagram is available on the official website. The overall architecture comprises the CLI (replaced by Beeline in Hive 3.0), JDBC/ODBC, the Thrift Server, the Hive Web Interface, the Metastore, and the Driver (Parser, Compiler, Optimizer, Executor).

Metastore component: stores metadata (table names, databases, owners, columns/partitions, table types, data locations) in a relational database such as Derby or MySQL. It can be deployed as a remote service so that it is decoupled from the other Hive services.

Driver component: consists of the Parser, Compiler, Optimizer, and Executor, which together convert HiveQL into an execution plan and invoke the underlying MapReduce framework.

Parser: converts SQL string to an abstract syntax tree (AST).

Compiler: compiles AST into a logical execution plan.

Optimizer: optimizes the logical plan.

Executor: transforms the logical plan into a physical plan (e.g., MR/Spark).

CLI: the command-line interface.

Thrift Server: provides JDBC/ODBC access, so that clients written in other languages can call Hive.
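The four Driver stages listed above can be pictured as a chain of functions. The following is only a toy sketch in Python, not Hive's actual implementation: the function names (`parse`, `compile_ast`, `optimize`, `execute`, `run_driver`), the `PhysicalPlan` class, and the single supported statement shape (`SELECT ... FROM ... [WHERE ...]`) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class PhysicalPlan:
    """Hypothetical stand-in for a physical plan: an engine plus its operators."""
    engine: str
    operators: list = field(default_factory=list)


def parse(sql):
    """Parser: SQL string -> tiny AST (only SELECT cols FROM table [WHERE cond])."""
    m = re.fullmatch(
        r"\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
        r"(?:\s+WHERE\s+(?P<cond>.+?))?\s*;?\s*",
        sql,
        flags=re.IGNORECASE,
    )
    if m is None:
        raise ValueError(f"unsupported statement: {sql!r}")
    return {
        "select": [c.strip() for c in m["cols"].split(",")],
        "from": m["table"],
        "where": m["cond"],  # None when there is no WHERE clause
    }


def compile_ast(ast):
    """Compiler: AST -> logical plan (an ordered list of operator descriptions)."""
    plan = [f"TableScan({ast['from']})"]
    if ast["where"]:
        plan.append(f"Filter({ast['where']})")
    plan.append(f"Select({', '.join(ast['select'])})")
    return plan


def optimize(logical_plan):
    """Optimizer: drop the no-op projection Select(*), standing in for real rewrite rules."""
    return [op for op in logical_plan if op != "Select(*)"]


def execute(logical_plan, engine="mr"):
    """Executor: bind the logical plan to an engine (a real executor would submit jobs)."""
    return PhysicalPlan(engine=engine, operators=list(logical_plan))


def run_driver(sql, engine="mr"):
    """Driver: run the four stages in order."""
    return execute(optimize(compile_ast(parse(sql))), engine)


plan = run_driver("SELECT name, age FROM users WHERE age > 30")
print(plan.engine, plan.operators)
```

The point of the sketch is the shape of the pipeline: each stage consumes the previous stage's output, and only the last stage knows which engine will run the plan.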

Hive execution workflow steps:

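ExecuteQuery: the interface (CLI, JDBC/ODBC client, and so on) sends the query to the Driver.

GetPlan: the Driver requests an execution plan from the Compiler; the query is parsed and its syntax checked.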

GetMetaData: Compiler requests metadata from Metastore.

SendMetaData: Metastore returns metadata.

SendPlan: Compiler sends optimized plan back to Driver.

ExecutePlan: Driver sends plan to execution engine.

ExecuteJob: MapReduce job is submitted to ResourceManager.

Metadata Ops: execution engine may perform metadata operations via Metastore.

JobDone: the MapReduce job completes.

DFS operations: the execution engine interacts with the NameNode to access the data in HDFS.

FetchResult: execution engine retrieves result set from DataNodes.

SendResults (execution engine to Driver): the results are sent to the Driver.

SendResults (Driver to interface): the Driver returns the results to the Hive UI.
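To make the order of these calls easier to follow, here is a small Python simulation that records each step in a trace list. It is a sketch under loud assumptions: the classes (`Metastore`, `Compiler`, `ExecutionEngine`, `Driver`), the in-memory dictionaries standing in for the Metastore database and HDFS, and the naive table-name extraction are all hypothetical and only mirror the step names above. The metadata and DFS operations that happen during execution are folded into `ExecutionEngine.run` for brevity.

```python
class Metastore:
    """Toy metastore: table name -> metadata dict."""

    def __init__(self, tables, trace):
        self.tables = tables
        self.trace = trace

    def get_metadata(self, table):
        self.trace.append("GetMetaData")
        metadata = self.tables[table]
        self.trace.append("SendMetaData")
        return metadata


class Compiler:
    def __init__(self, metastore, trace):
        self.metastore = metastore
        self.trace = trace

    def get_plan(self, query):
        self.trace.append("GetPlan")
        # Naive assumption: the query has the form 'SELECT ... FROM <table>'.
        table = query.rstrip(";").split()[-1]
        metadata = self.metastore.get_metadata(table)
        self.trace.append("SendPlan")
        return {"table": table, "location": metadata["location"]}


class ExecutionEngine:
    """Toy engine: 'storage' is a dict standing in for files on HDFS."""

    def __init__(self, storage, trace):
        self.storage = storage
        self.trace = trace

    def run(self, plan):
        self.trace.append("ExecuteJob")   # in Hive: the job is submitted to the cluster
        self.trace.append("JobDone")
        self.trace.append("FetchResult")  # read the result set from storage
        return self.storage[plan["location"]]


class Driver:
    def __init__(self, compiler, engine, trace):
        self.compiler = compiler
        self.engine = engine
        self.trace = trace

    def execute_query(self, query):
        self.trace.append("ExecuteQuery")
        plan = self.compiler.get_plan(query)
        self.trace.append("ExecutePlan")
        rows = self.engine.run(plan)
        self.trace.append("SendResults")
        return rows


trace = []
metastore = Metastore({"users": {"location": "/warehouse/users"}}, trace)
engine = ExecutionEngine({"/warehouse/users": [("alice",), ("bob",)]}, trace)
driver = Driver(Compiler(metastore, trace), engine, trace)
rows = driver.execute_query("SELECT * FROM users")
print(" -> ".join(trace))
```

Reading the printed trace from left to right gives the same sequence as the list of steps, with the Compiler's round trip to the Metastore nested inside plan generation.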

HiveSQL to MR Task Conversion Process

The compilation of SQL into tasks occurs in the Compiler component and consists of six stages:

Lexical and syntactic parsing: an ANTLR-defined SQL grammar is used to parse the statement into an AST.

Semantic analysis: traverses AST to extract QueryBlock.

Logical plan generation: translates QueryBlock into an OperatorTree.

Logical plan optimization: rewrites the OperatorTree, merging operators where possible, to reduce the number of MapReduce jobs and the amount of shuffled data.

Physical plan generation: translates OperatorTree into MapReduce tasks.

Physical plan optimization: refines MapReduce tasks to produce the final execution plan.

A complex HiveQL statement may be compiled into multiple MapReduce jobs.
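Why one statement can turn into several jobs becomes clearer with a simplified model of physical plan generation. The Python sketch below assumes a linear operator chain in which every `ReduceSink` operator marks a shuffle boundary: the first one separates the map phase from the reduce phase, and any further one forces a new job. The operator names are simplified and the function `split_into_mr_jobs` is hypothetical; real Hive operator trees are not linear and the real compiler does much more.

```python
def split_into_mr_jobs(operators):
    """Cut a linear operator chain into MapReduce jobs at ReduceSink boundaries.

    Returns a list of jobs, each a dict with a 'map' and a 'reduce' operator list.
    """
    jobs = []
    current = {"map": [], "reduce": []}
    phase = "map"
    for op in operators:
        if op == "ReduceSink":
            if phase == "reduce":
                # A second shuffle is needed: close this job and start a new one,
                # whose map phase reads the previous job's output.
                jobs.append(current)
                current = {"map": [], "reduce": []}
            current["map"].append(op)  # the shuffle is emitted at the end of the map phase
            phase = "reduce"
        else:
            current[phase].append(op)
    jobs.append(current)
    return jobs


# Simplified chain for something like: SELECT k, count(*) ... GROUP BY k ORDER BY k
chain = ["TableScan", "Filter", "ReduceSink", "GroupBy", "ReduceSink", "Select", "FileSink"]
for number, job in enumerate(split_into_mr_jobs(chain), start=1):
    print(f"job {number}: map={job['map']} reduce={job['reduce']}")
```

In this model, an aggregation followed by a global sort needs two shuffles and therefore two jobs, which is the situation the logical optimizer tries to avoid by merging operators.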

HiveSQL to MR? What About Hive 3.0's Tez?

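The conversion described above applies to Hive running on MapReduce, which was the norm before Hive 3.0. MapReduce has been deprecated as a Hive execution engine since Hive 2.0, and Hive 3.0 distributions commonly default to Tez instead, mainly because MapReduce writes intermediate results to HDFS between jobs and is therefore slow for multi-stage queries. The engine is selected with the hive.execution.engine property (mr, tez, or spark).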

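Tez is an open-source Apache framework for executing DAGs of tasks that evolved from the MapReduce model. It breaks Map and Reduce down into finer-grained operations (Input, Processor, Sort, Merge, Output) that can be flexibly recombined into a single complex DAG job, so intermediate results no longer have to be written to HDFS between jobs. The resulting speedup depends heavily on the workload; some tests report gains of up to roughly 100×.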
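The difference between a chain of MapReduce jobs and a single Tez DAG can be illustrated with a rough counting model in Python. Everything here is an assumption made for illustration: the vertex names imitate the style of Tez vertex labels, the example DAG is invented, and the rule "one MapReduce job per shuffle, one HDFS write between consecutive jobs" is a simplification. The standard-library `graphlib` module requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical query DAG: vertex -> set of upstream vertices it reads from.
QUERY_DAG = {
    "Map 1": set(),                      # scan table A
    "Map 2": set(),                      # scan table B
    "Reducer 3": {"Map 1", "Map 2"},     # join
    "Reducer 4": {"Reducer 3"},          # group by
    "Reducer 5": {"Reducer 4"},          # order by
}


def execution_order(dag):
    """One valid order in which the vertices can run (upstream vertices first)."""
    return list(TopologicalSorter(dag).static_order())


def mr_jobs_needed(dag):
    """Simplified rule: every vertex that consumes shuffled input ends one MapReduce job."""
    return sum(1 for upstream in dag.values() if upstream)


def intermediate_hdfs_writes(dag, engine):
    """Count the times intermediate results are materialized to HDFS in this model."""
    if engine == "mr":
        # Each job except the last hands its output to the next job through HDFS.
        return max(mr_jobs_needed(dag) - 1, 0)
    if engine == "tez":
        # The whole query is one DAG; data flows along edges between vertices.
        return 0
    raise ValueError(f"unknown engine: {engine!r}")


print(execution_order(QUERY_DAG))
print("mr :", intermediate_hdfs_writes(QUERY_DAG, "mr"))
print("tez:", intermediate_hdfs_writes(QUERY_DAG, "tez"))
```

The model says nothing about actual run times; it only shows where the extra HDFS round trips of a multi-job MapReduce plan come from and why a single DAG avoids them.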

Spark on Hive Support

With Spark on Hive, Spark SQL runs the Hive statements: it reuses Hive's metadata and configuration, while the actual computation is executed by Spark on RDDs. The typical workflow is:

Load the Hive configuration and retrieve the metadata through Spark SQL.

Access Hive tables and their data using the retrieved metadata.

Operate on the Hive tables through Spark SQL queries.
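In PySpark, the three steps above roughly correspond to creating a session with Hive support and then issuing SQL through it. The sketch below assumes that pyspark is installed and that Spark can find the Hive configuration (typically a hive-site.xml in Spark's conf directory); the helper names `build_session` and `preview_table`, the application name, and the table `default.users` used in the usage note are hypothetical.

```python
import re


def build_session(app_name="spark-on-hive-sketch"):
    """Steps 1 and 2: create a SparkSession that uses the Hive Metastore as its catalog."""
    # Imported lazily so that preview_table() below can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()  # pick up Hive configuration and metadata
        .getOrCreate()
    )


def preview_table(spark, table, limit=10):
    """Step 3: query a Hive table through Spark SQL and return the resulting DataFrame."""
    # Accept only 'table' or 'database.table' to avoid building arbitrary SQL.
    if not re.fullmatch(r"\w+(\.\w+)?", table):
        raise ValueError(f"invalid table name: {table!r}")
    return spark.sql(f"SELECT * FROM {table} LIMIT {int(limit)}")
```

A typical call would be `spark = build_session()` followed by `preview_table(spark, "default.users").show()`. The query is planned and executed by Spark; Hive contributes only the table definitions and data locations.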

Further details can be found in the referenced article "Spark on Hive & Hive on Spark, can't tell them apart".

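1. Hive Fundamentals: A Hadoop-Based Data Warehouse

2. HiveSQL Syntax

3. Hive Performance Optimization

4. Hive Performance Optimization: Data Skew

5. Twelve Techniques for HiveSQL Optimization

6. Hive Interview Questions (Part 1)

7. Frequently Asked Hive/Hadoop Interview Topics (Part 2)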


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
