An Introduction to Apache Hive: Architecture, Workflow, Storage, Advantages, and Comparison with Traditional Databases
This article provides a concise overview of Apache Hive, covering its definition, Hadoop background, architecture, query workflow, storage model, advantages, disadvantages, and a comparison with traditional relational databases, helping readers understand how Hive enables SQL-like queries on data stored in HDFS.
I worked with Hive in a previous job, so I have put together a brief summary and a simple introduction.
Before introducing Hive, let's define a few terms.
Hadoop : Distributed system infrastructure whose core components are HDFS and MapReduce.
HDFS : Hadoop Distributed File System, used for storing large data sets across the machines of a cluster; the processing itself is done by MapReduce.
MapReduce : Programming model for parallel computation on large data sets.
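To make the MapReduce model more concrete, here is a minimal single-process sketch of the classic word-count example. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real job would run the map and reduce functions in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hive on hadoop", "hadoop stores data"]))
```

The point of the split is that each `map_phase` call depends on one line only, and each `reduce_phase` call on one key only, which is what lets a cluster run them independently.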
What is Hive?
Hive is a data warehouse tool built on Hadoop.
Background of Hive
Facebook found that its MySQL‑based data warehouse for reporting could not keep up with the growing data volume, so it moved the data into Hadoop. Querying data in HDFS then meant writing MapReduce jobs, which was costly for users. To lower that cost, Facebook developed a framework that translates SQL into MapReduce tasks, and Hive was born.
Problems Hive Solves
Hive maps structured data files to database tables and provides a simple SQL‑like query language (HiveQL) for querying data stored in HDFS.
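The idea of mapping a structured file to a table can be sketched in a few lines of Python. This is only an illustration of the concept, not how Hive is implemented: the schema, the file contents, and the function names are all hypothetical, and the "query" is a hand-written filter standing in for what Hive would compile from SQL.

```python
import csv
import io

# Hypothetical table definition: column names and Python types standing in
# for a Hive schema such as (id INT, name STRING, dept STRING).
SCHEMA = [("id", int), ("name", str), ("dept", str)]

# Hypothetical contents of a comma-delimited data file.
RAW_FILE = "1,alice,qa\n2,bob,dev\n3,carol,qa\n"


def read_table(raw_text, schema):
    """Apply the schema to each line at read time and yield rows as dicts."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def select_names_in_dept(rows, dept):
    """Rough equivalent of: SELECT name FROM employees WHERE dept = '<dept>'."""
    return [row["name"] for row in rows if row["dept"] == dept]


if __name__ == "__main__":
    print(select_names_in_dept(read_table(RAW_FILE, SCHEMA), "qa"))
```

Note that the file itself carries no column names or types. The schema lives separately and is applied only when the data is read, which is why the same file can be queried as a table without being loaded into a database first.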
Hive Query Workflow
Step 1 (executeQuery): The Hive interface (CLI or Web UI) sends the query to the driver (over JDBC, ODBC, etc.).
Step 2 (getPlan): The driver passes the query to the compiler, which parses and validates it and starts building an execution plan.
Step 3 (getMetadata): Compiler requests metadata from the Metastore.
Step 4 (sendMetadata): Metastore returns metadata to the compiler.
Step 5 (sendPlan): Compiler finalizes the plan and sends it back to the driver; query parsing and compilation are complete.
Step 6 (executePlan): Driver sends the execution plan to the execution engine.
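Step 6.1 (executeJob): Execution engine runs MapReduce jobs via the JobTracker and TaskTrackers (the Hadoop 1 components; under YARN in Hadoop 2 and later, the ResourceManager and NodeManagers play this role).
Step 6.2 (metadataOps): Execution engine can perform metadata operations on the Metastore during job execution.
Step 6.3 (jobDone): Execution engine receives results from the data nodes.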
Step 7 (sendResults): Execution engine returns result values to the driver.
Step 8 (fetchResults): Driver sends the results to the Hive interface.
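The call sequence above can be sketched as a toy simulation. None of these classes correspond to real Hive classes or APIs; the names mirror the components in the steps, the "compiler" only understands queries of the form `SELECT <col> FROM <table>`, and in-memory dictionaries stand in for the Metastore and for files in HDFS.

```python
class Metastore:
    """Holds table metadata: here, a dict of table name -> column names."""

    def __init__(self, tables):
        self.tables = tables

    def get_metadata(self, table):  # steps 3 and 4
        return self.tables[table]


class Compiler:
    def __init__(self, metastore):
        self.metastore = metastore

    def get_plan(self, query):  # step 2
        # Toy parser: accepts only "SELECT <col> FROM <table>".
        tokens = query.split()
        if (len(tokens) != 4 or tokens[0].upper() != "SELECT"
                or tokens[2].upper() != "FROM"):
            raise ValueError("unsupported query")
        column, table = tokens[1], tokens[3]
        columns = self.metastore.get_metadata(table)
        if column not in columns:
            raise ValueError(f"unknown column {column!r}")
        # Step 5: the finished plan goes back to the driver.
        return {"table": table, "column_index": columns.index(column)}


class ExecutionEngine:
    def __init__(self, storage):
        self.storage = storage  # stands in for data files in HDFS

    def execute_plan(self, plan):  # step 6 and its sub-steps
        rows = self.storage[plan["table"]]
        return [row[plan["column_index"]] for row in rows]  # step 7


class Driver:
    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def execute_query(self, query):  # step 1
        plan = self.compiler.get_plan(query)
        return self.engine.execute_plan(plan)  # step 8: results to the interface


def build_driver():
    """Wire the toy components together with hypothetical sample data."""
    metastore = Metastore({"employees": ["id", "name"]})
    engine = ExecutionEngine({"employees": [(1, "alice"), (2, "bob")]})
    return Driver(Compiler(metastore), engine)


if __name__ == "__main__":
    print(build_driver().execute_query("SELECT name FROM employees"))
```

The sketch keeps the same separation of duties as the workflow: only the compiler talks to the Metastore while planning, and only the execution engine touches the stored data.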
Hive Storage
Hive itself does not store data. All table data resides in HDFS, as plain text by default unless another file format is specified; only the table metadata is kept in the Metastore.
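As a rough illustration of what "plain text" means here: to my understanding, a text-format Hive table with default settings separates fields with the Ctrl-A character (`\x01`) and writes NULL as `\N`. The sketch below encodes and decodes one row under that assumption. The helper names are hypothetical, and it ignores escaping and complex types entirely.

```python
FIELD_DELIMITER = "\x01"  # Ctrl-A: assumed default field separator for text tables
NULL_MARKER = "\\N"       # assumed default textual representation of NULL


def encode_row(values):
    """Serialize one row to a line of text (without the trailing newline)."""
    return FIELD_DELIMITER.join(
        NULL_MARKER if value is None else str(value) for value in values
    )


def decode_row(line):
    """Split one line back into fields; every non-NULL field comes back as a string."""
    fields = line.rstrip("\n").split(FIELD_DELIMITER)
    return [None if field == NULL_MARKER else field for field in fields]


if __name__ == "__main__":
    line = encode_row([1, "alice", None])
    print(repr(line))
    print(decode_row(line))
```

Decoding returns strings because the text file carries no type information; types are applied from the table schema when the data is read.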
Advantages and Disadvantages of Hive
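Advantages:
1) Easy to learn and use, since queries are written in a SQL‑like language.
2) Good extensibility, for example through user‑defined functions.
3) Unified metadata management through the Metastore.
Disadvantages:
1) Because the data lives in files on HDFS, which are not designed to be modified in place, Hive traditionally does not support row‑level updates or deletes; changes are made by overwriting or dropping a whole table (or partition). Hive 0.14 and later add row‑level UPDATE and DELETE, but only for transactional tables and with restrictions.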
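The overwrite-only model can be illustrated with an ordinary local file: to change one row, every row is read, the new version of the data is written out in full, and the old file is replaced. This is a sketch of the pattern only, using the local file system rather than HDFS; the function name and the comma-delimited layout are assumptions for the example.

```python
import os
import tempfile


def overwrite_table_file(path, transform):
    """Rewrite every row of a comma-delimited file, replacing the file as a whole.

    `transform` receives one row as a list of fields and returns the new row,
    or None to drop the row. No row is modified in place.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="utf-8") as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False, encoding="utf-8"
    ) as dst:
        for line in src:
            row = transform(line.rstrip("\n").split(","))
            if row is not None:
                dst.write(",".join(row) + "\n")
    # Swap the fully rewritten file in place of the old one.
    os.replace(dst.name, path)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "employees.csv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("1,alice,qa\n2,bob,dev\n")
    # "Update" bob's department by rewriting the whole file.
    overwrite_table_file(
        demo_path, lambda row: [row[0], row[1], "ops"] if row[1] == "bob" else row
    )
    with open(demo_path, encoding="utf-8") as f:
        print(f.read())
```

Even though only one row changes, the cost is proportional to the size of the whole file, which is why this model suits batch rewrites far better than frequent small updates.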
Thank you for taking the time to read this article! The author’s knowledge is limited, so this introduction to Hive is brief; further discussion and knowledge sharing are welcome.
360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.
