Comprehensive Guide to Hive: Fundamentals, SQL Syntax, Performance Tuning, and Interview Preparation

This article introduces Hive as a Hadoop-based data warehouse and covers its architecture, core concepts, DDL/DML syntax, functions, performance-optimization techniques, and data-skew handling. It closes with a collection of common interview questions for Hive practitioners.

Big Data Technology & Architecture

May 23, 2021

Hive Basics and Architecture

Hive is a data-warehouse tool built on Hadoop that runs SQL-like queries (HiveQL, abbreviated HQL) over data stored in HDFS. It translates HQL statements into MapReduce jobs for batch processing.

Key components include the user interface (CLI, JDBC, Thrift Server), the driver (compiler, optimizer, executor), and the Metastore (metadata repository).
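One way to watch the driver's compile and optimize steps is EXPLAIN, which prints the execution plan without running the query. The following is a minimal sketch, assuming the score table from the DDL examples in this article already exists:

```sql
-- Print the stage plan the driver produces; the query itself is not executed.
explain
select s_id, avg(s_score) as avg_score
from score
group by s_id;
```

The exact plan text depends on the Hive version and the execution engine in use, so treat the output as a diagnostic aid rather than a fixed format.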

Hive SQL Syntax

DDL examples:

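-- Create the database only if it does not already exist.
create database if not exists myhive;
-- External table: dropping it removes the metadata but leaves the data files in place.
create external table student (s_id string, s_name string) row format delimited fields terminated by '\t';
-- Partitioned table: each month value is stored in its own directory.
create table score (s_id string, s_score int) partitioned by (month string);
-- Bucketed table: rows are hashed on c_id into 3 buckets.
create table course (c_id string, c_name string) clustered by (c_id) into 3 buckets;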
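To make the partitioned table concrete, here is a small sketch of adding a partition, writing to it, and reading it back. The partition value '202006' and the inserted rows are made-up sample data, and INSERT ... VALUES assumes a reasonably recent Hive release (0.14 or later):

```sql
-- Register a partition explicitly, then list the partitions of the table.
alter table score add if not exists partition (month = '202006');
show partitions score;

-- Static-partition insert with illustrative rows.
insert into table score partition (month = '202006')
values ('01', 80), ('02', 55);

-- Filtering on the partition column lets Hive read only that partition's directory.
select s_id, s_score from score where month = '202006';
```

Filtering on the partition column is what makes partitioning pay off, since the other partitions are never scanned.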

DQL example:

SELECT [ALL|DISTINCT] col1, col2 FROM table_name WHERE condition GROUP BY col_list HAVING condition ORDER BY col LIMIT 10;
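A concrete instance of the template above, written against the score table from the DDL examples. The month value and the threshold of 60 are illustrative assumptions:

```sql
-- Average score per student for one month, keeping only averages of 60 or more.
select s_id, avg(s_score) as avg_score
from score
where month = '202006'          -- row filter, applied before grouping
group by s_id
having avg(s_score) >= 60       -- group filter, applied after aggregation
order by avg_score desc
limit 10;
```

WHERE filters rows before aggregation, while HAVING filters the aggregated groups, which is why the average can only be tested in HAVING.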

Common functions include aggregation (count, sum, avg), conditional (if, coalesce, case), date (unix_timestamp, from_unixtime, date_add), string (length, reverse, concat), and window functions (lag, lead, first_value, last_value).
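A short sketch of several of these functions in use, again assuming the score table from the DDL examples. The literal dates and strings are sample inputs only:

```sql
-- Conditional and window functions over the score table.
select
  s_id,
  month,
  s_score,
  if(s_score >= 60, 'pass', 'fail')                            as result,
  coalesce(s_score, 0)                                         as score_or_zero,
  lag(s_score, 1) over (partition by s_id order by month)      as prev_month_score,
  first_value(s_score) over (partition by s_id order by month) as first_score
from score;

-- Date and string functions on literal inputs.
select
  from_unixtime(unix_timestamp('2021-05-23 00:00:00'), 'yyyy-MM-dd') as day_only,
  date_add('2021-05-23', 7)                                          as one_week_later,
  concat('hive', '_', 'ql')                                          as joined,
  length('hive')                                                     as name_length;
```

lag returns NULL for the first row of each partition because there is no preceding row to read from.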

Performance Optimization Techniques

Optimization areas cover SQL rewriting (e.g., using the multi-insert form FROM ... INSERT INTO ... so the source table is read once, instead of several statements combined with UNION ALL), reducing data skew, choosing appropriate file formats (ORC/Parquet vs. TextFile), merging small files with CombineHiveInputFormat, adjusting map/reduce counts, enabling JVM reuse, and leveraging local mode for small datasets.
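The FROM ... INSERT INTO ... rewrite can be sketched as follows. The target tables score_pass and score_fail are hypothetical names introduced for illustration, and they are stored as ORC to show the file-format choice at the same time:

```sql
-- Hypothetical target tables, stored in a columnar format.
create table if not exists score_pass (s_id string, s_score int) stored as orc;
create table if not exists score_fail (s_id string, s_score int) stored as orc;

-- Multi-insert: the source table is named once and feeds both inserts.
from score
insert into table score_pass
  select s_id, s_score where s_score >= 60
insert into table score_fail
  select s_id, s_score where s_score < 60;
```

Writing this as two separate INSERT ... SELECT statements would read the score table twice, which is the cost the multi-insert form is meant to avoid.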

Example of merging small files:

set mapred.max.split.size=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
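The other session-level settings mentioned in this section might look like the sketch below. The property names are standard Hive and MapReduce settings, but the values are illustrative, and JVM reuse in particular applies to the classic MapReduce runtime, so check what your cluster actually honors:

```sql
-- Merge small output files at the end of map-only and map-reduce jobs.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;       -- target size of merged files, in bytes
set hive.merge.smallfiles.avgsize=16000000;   -- trigger a merge below this average size

-- Let Hive run small queries locally instead of submitting a cluster job.
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.input.files.max=4;

-- Reuse a task JVM for several tasks (classic MapReduce property).
set mapred.job.reuse.jvm.num.tasks=10;
```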

Handling Data Skew

Strategies include filtering null keys, randomizing skewed keys, using map‑side joins for small tables, enabling hive.groupby.skewindata=true, and adjusting reducer memory.
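A few of these strategies can be sketched against the student and score tables from the DDL examples. The 'null_' prefix and the threshold value are arbitrary choices for illustration:

```sql
-- 1. Filter null keys before the join when those rows are not needed.
select a.s_id, a.s_score, b.s_name
from (select s_id, s_score from score where s_id is not null) a
join student b on a.s_id = b.s_id;

-- 2. Keep null-key rows but give each a random key, so they spread across reducers
--    instead of piling onto one. The salted keys match nothing in student.
select a.s_id, a.s_score, b.s_name
from (
  select s_id, s_score,
         case when s_id is null
              then concat('null_', cast(rand() as string))
              else s_id
         end as join_key
  from score
) a
left join student b on a.join_key = b.s_id;

-- 3. Let Hive convert joins against small tables into map-side joins.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;  -- small-table threshold, in bytes

-- 4. Two-stage aggregation for skewed GROUP BY keys.
set hive.groupby.skewindata=true;
```

The second query uses a left join so the null-key rows are still returned, with s_name left as NULL.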

Interview Questions

The article concludes with a curated list of Hive interview questions covering internal vs. external tables, indexing, metadata storage, UDF/UDAF/UDTF differences, bucket tables, fetch optimization, join strategies, and common pitfalls such as cartesian products and count‑distinct performance.
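As one illustration of the count-distinct pitfall, a common rewrite replaces the single-reducer distinct count with a GROUP BY followed by a count. This is a sketch against the score table from the DDL examples, not necessarily the exact answer the original question list gives:

```sql
-- Straightforward form: all distinct values funnel through a single reducer.
select count(distinct s_id) as distinct_ids from score;

-- Rewrite: de-duplicate with GROUP BY across many reducers, then count the result.
-- count(s_id) skips the NULL group, matching count(distinct) semantics.
select count(s_id) as distinct_ids
from (select s_id from score group by s_id) t;
```

The rewrite adds a stage, so it tends to help only when the table is large enough for the single reducer to become the bottleneck.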

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
