Hive and Hadoop Interview Questions and Answers

This article provides a comprehensive collection of interview-style questions and detailed answers covering Hive concepts, Hadoop architecture, MapReduce mechanics, HDFS operations, and performance optimization techniques for big‑data processing environments.

Big Data Technology & Architecture

Apr 15, 2021

1. Hive Table Join Using MapReduce

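If one table is small, use a map-side join: load the small table into memory in each map task and join while scanning the large table, which avoids the shuffle entirely. For two large tables, use a reduce-side join: each mapper emits the join field as the key and tags the value with a flag identifying its source table; records are partitioned and grouped by the join field, and the reducer pairs up the rows from the two sources.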
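The reduce-side approach can be sketched in plain Python by simulating the map, shuffle, and reduce steps in memory. This is not Hadoop API code; the function names, the "L"/"R" tags, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def map_phase(rows, tag, key_index):
    """Emit (join_key, (tag, row)) so the reducer can tell the sources apart."""
    for row in rows:
        yield row[key_index], (tag, row)

def shuffle(pairs):
    """Group map output by join key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each key, pair every 'L' row with every 'R' row (inner join)."""
    joined = []
    for key in sorted(groups):
        left = [row for tag, row in groups[key] if tag == "L"]
        right = [row for tag, row in groups[key] if tag == "R"]
        for l_row in left:
            for r_row in right:
                joined.append((key, l_row[1], r_row[1]))
    return joined

# Hypothetical sample data: (user_id, name) and (user_id, order_total)
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, 30), (1, 12), (3, 7), (4, 99)]

pairs = list(map_phase(users, "L", 0)) + list(map_phase(orders, "R", 0))
result = reduce_phase(shuffle(pairs))
print(result)
```

Keys that appear in only one source (2 and 4 here) produce no output, which is the inner-join behavior; an outer join would emit those rows with a null on the missing side.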

2. Characteristics of Hive and Differences from RDBMS

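Hive is a data-warehouse tool built on Hadoop that maps structured files to tables and translates SQL queries into MapReduce jobs (or Tez or Spark jobs, depending on the configured execution engine). It offers a low learning cost and SQL-like syntax, but its high query latency makes it unsuitable for real-time queries. Unlike a relational database, Hive stores data on HDFS, is optimized for batch analytics over large datasets, and offers only limited transaction support (ACID tables arrived in Hive 0.14, with restrictions).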

3. Meaning of Sort By, Order By, Cluster By, Distribute By

Order By : Global sorting requiring a single reducer, which can be slow for large datasets.

Sort By : Sorts rows within each reducer, so every reducer's output is ordered but the combined result is not globally sorted.

Distribute By : Partitions data across reducers based on the specified column.

Cluster By : Shorthand for Distribute By plus Sort By on the same column; the sort is ascending only.
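The difference between the four clauses is easier to see in a small simulation. The sketch below is plain Python, not HiveQL; it models reducers as lists and assumes an integer column that hashes to itself. The helper names and sample rows are hypothetical.

```python
def distribute_by(rows, key, num_reducers):
    """Route each row to a reducer by hashing the chosen column."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[hash(row[key]) % num_reducers].append(row)
    return reducers

def sort_by(reducers, key):
    """Sort inside each reducer only; no ordering across reducers."""
    return [sorted(part, key=lambda r: r[key]) for part in reducers]

def order_by(rows, key):
    """Global sort: everything passes through one reducer."""
    return sorted(rows, key=lambda r: r[key])

def cluster_by(rows, key, num_reducers):
    """Distribute By and Sort By on the same column."""
    return sort_by(distribute_by(rows, key, num_reducers), key)

rows = [{"id": 5}, {"id": 2}, {"id": 4}, {"id": 1}, {"id": 3}]
clustered = cluster_by(rows, "id", 2)
total = order_by(rows, "id")
print(clustered)
print(total)
```

With two reducers, each output list is sorted on its own, but concatenating them does not give a global order; only order_by does.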

4. Usage of split, coalesce, and collect_list Functions

split('a,b,c,d', ',') returns the array ['a','b','c','d']; the separator is interpreted as a regular expression. coalesce(v1, v2, …) returns the first non-null argument, or NULL if all are null. collect_list(col) aggregates all values of a column into an array without deduplication, e.g., SELECT collect_list(id) FROM my_table.
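For readers more familiar with general-purpose languages, the three functions have rough Python analogues. These are illustrations of the semantics only, not Hive code, and edge cases (for example, how trailing empty strings or regex metacharacters are handled) may differ from Hive.

```python
import re

def hive_split(text, pattern):
    """Analogue of split(str, regex): the separator is a regular expression."""
    return re.split(pattern, text)

def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None

def collect_list(values):
    """Gather a column's values into a list, keeping duplicates."""
    return list(values)

print(hive_split("a,b,c,d", ","))
print(coalesce(None, None, 3, 4))
print(collect_list([1, 1, 2]))
```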

5. Hive Metastore Storage Options

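Hive supports three metastore deployment modes: embedded (the metastore and a Derby database run inside the Hive process; single session only, suitable for unit tests), local (the metastore still runs inside the Hive process but connects to an external database such as MySQL), and remote (the metastore runs as a standalone service that clients reach via Thrift).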

6. Difference Between Internal and External Tables

For an internal (managed) table, Hive owns the data: loading moves it into Hive's warehouse directory, and dropping the table deletes both metadata and data. An external table only records the data location; dropping it removes the metadata while leaving the data in place.

7. UDF, UDAF, and UDTF Differences

UDF processes one input row and produces one output row.

UDAF processes multiple input rows and produces a single aggregated output row.

UDTF processes one input row and can emit multiple output rows.
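The three kinds differ only in row cardinality, which a few plain Python functions can illustrate. These are not Hive UDF classes; the function names are hypothetical and merely mirror the input/output shapes.

```python
def upper_udf(value):
    """UDF shape: one row in, one row out."""
    return value.upper()

def sum_udaf(values):
    """UDAF shape: many rows in, one aggregated row out."""
    total = 0
    for v in values:
        total += v
    return total

def explode_udtf(csv_value):
    """UDTF shape: one row in, zero or more rows out."""
    for part in csv_value.split(","):
        yield part

rows = ["a,b", "c"]
exploded = [out for row in rows for out in explode_udtf(row)]
print(upper_udf("hive"), sum_udaf([1, 2, 3]), exploded)
```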

8. Does Every Hive Query Run as a MapReduce Job?

Not always. Since Hive 0.10.0, simple queries such as SELECT … LIMIT n can be executed using a fetch task without launching a MapReduce job.

9. Understanding Hive Bucket Tables

Bucketed tables hash a specified column and store rows in separate files (buckets); the hash value modulo the number of buckets determines the target file. Bucketing is mainly useful for sampling queries and for joins on the bucketed column.
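The hash-then-modulo assignment can be sketched as follows. This is a plain Python illustration that assumes an integer column whose hash is the value itself; the hash Hive actually applies depends on the column type, so treat the function as hypothetical.

```python
def bucket_for(value, num_buckets):
    """Pick a bucket as (hash & 0x7FFFFFFF) % num_buckets.

    Assumption: the column is an integer that hashes to itself. The mask
    clears the sign bit so negative hashes still map to a valid bucket.
    """
    return (value & 0x7FFFFFFF) % num_buckets

def bucketize(ids, num_buckets):
    """Place each id into the list standing in for its bucket file."""
    buckets = [[] for _ in range(num_buckets)]
    for i in ids:
        buckets[bucket_for(i, num_buckets)].append(i)
    return buckets

buckets = bucketize([1, 2, 3, 4, 5, 6, 7, 8], 4)
print(buckets)
```

Because every row with the same column value lands in the same bucket, a sampling query can read one bucket file instead of scanning the whole table.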

10. Hive Interaction with Databases

Hive reads its data from HDFS and runs queries as MapReduce jobs. A relational database such as MySQL holds only the metastore (table definitions and data locations); the table data itself is never stored there.

11. Hive Local Mode

For small input datasets, Hive can run in local mode on a single machine, reducing job startup overhead. Enable it by setting hive.exec.mode.local.auto=true.

12. File Formats in Hive

TextFile : Row‑oriented, uncompressed by default; high disk and parsing overhead.

SequenceFile : Binary, row‑oriented, supports NONE, RECORD, and BLOCK compression (BLOCK is recommended).

RCFile : Row‑partitioned, column‑oriented storage; good compression but higher load cost.

ORCFile : Column-oriented successor to RCFile, with better compression and faster query performance.

13. Handling Data Skew in Hive Joins

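Skew occurs when key distribution is uneven, so a few reducers receive most of the rows. Mitigation strategies include assigning random values to null keys, using map-side joins when one table is small, processing the skewed keys separately and merging the results, and tuning parameters: hive.optimize.skewjoin=true for joins, and hive.map.aggr=true and hive.groupby.skewindata=true for skewed GROUP BY stages.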
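The null-key trick can be illustrated with a small simulation. The sketch below is plain Python; the toy partitioner, the "null_" prefix, and the sample keys are hypothetical, and it assumes no real key looks like a generated placeholder.

```python
import random
from collections import Counter

def salt_null_key(key, rng, num_salts):
    """Replace a NULL join key with a random placeholder that matches no real key."""
    if key is None:
        return f"null_{rng.randrange(num_salts)}"
    return key

def reducer_for(key, num_reducers):
    """Toy partitioner for the demo: sum of character codes modulo reducer count."""
    return sum(ord(c) for c in str(key)) % num_reducers

rng = random.Random(0)  # fixed seed so the demo is repeatable
keys = [None] * 1000 + ["u1", "u2", "u3"]

before = Counter(reducer_for(k, 4) for k in keys)
after = Counter(reducer_for(salt_null_key(k, rng, 100), 4) for k in keys)
print("max load before:", max(before.values()))
print("max load after:", max(after.values()))
```

Before salting, all 1000 null-keyed rows hash to one reducer; after salting they spread across reducers, and since the placeholders match nothing, the join result for non-null keys is unchanged.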

14. Fetch Task

For queries such as SELECT * FROM table, Hive can bypass MapReduce and read the files directly with a fetch task. The behavior is controlled by hive.fetch.task.conversion; the default in recent releases is more, which also covers column selection, filters, and LIMIT.

15. Hadoop Cluster Bottlenecks

Disk I/O and network bandwidth are the primary performance constraints.

16. Hadoop Running Modes

Standalone, pseudo‑distributed, and fully distributed modes.

17. Hadoop Ecosystem Components

ZooKeeper (coordination), Flume (log collection), HBase (column-family store), Hive (SQL-on-Hadoop), Sqoop (data transfer between RDBMS and HDFS).

18. Hadoop vs. Hadoop Ecosystem

Hadoop refers to the core framework (HDFS, MapReduce, and YARN); the ecosystem includes auxiliary projects such as ZooKeeper, Flume, HBase, Hive, and Sqoop.

19. Hadoop Daemons

NameNode, SecondaryNameNode, DataNode, ResourceManager, and NodeManager (the last two are the YARN successors to Hadoop 1's JobTracker and TaskTracker); in HA deployments, also DFSZKFailoverController and JournalNode.

20. Hadoop Serialization

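Hadoop uses its own Writable serialization because it is more compact and faster than Java's built-in serialization. A custom bean must implement Writable, provide a no-arg constructor (the framework instantiates it by reflection), and implement write() and readFields(), which must handle the fields in the same order. A bean used as a key must also be comparable, typically by implementing WritableComparable, because keys are sorted during the shuffle.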
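The contract (empty construction, symmetric write and read, comparable keys) can be mirrored in a few lines of Python. This is an analogue for illustration, not the Hadoop Writable interface; the PageViewWritable class and its fields are hypothetical.

```python
import io
import struct

class PageViewWritable:
    """Hypothetical bean mirroring the Writable contract in Python."""

    def __init__(self, user_id=0, views=0):
        # Defaults stand in for the no-arg constructor: the framework
        # creates an empty instance first, then fills it from the stream.
        self.user_id = user_id
        self.views = views

    def write(self, out):
        """Serialize fields in a fixed order (big-endian 8-byte, then 4-byte int)."""
        out.write(struct.pack(">qi", self.user_id, self.views))

    def read_fields(self, inp):
        """Deserialize fields in exactly the order write() produced them."""
        self.user_id, self.views = struct.unpack(">qi", inp.read(12))

    def __lt__(self, other):
        """Ordering used when the bean serves as a key (sort by user_id)."""
        return self.user_id < other.user_id

buf = io.BytesIO()
PageViewWritable(42, 7).write(buf)
buf.seek(0)
copy = PageViewWritable()
copy.read_fields(buf)
print(copy.user_id, copy.views, len(buf.getvalue()))
```

The record takes 12 bytes with no class metadata attached, which is the kind of compactness the Writable design aims for.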

21. FileInputFormat Splitting Mechanism

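An input split is a logical slice of an input file; by default its size equals the HDFS block size. Splits are computed per file at job submission and recorded as metadata (path, offset, length, hosts), and the framework (the JobTracker in Hadoop 1, the MRAppMaster under YARN) launches one map task per split.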

22. Determining Map and Reduce Task Numbers

The number of map tasks equals the number of input splits: roughly each file's size divided by the split size, summed over all files. The number of reduce tasks is set with job.setNumReduceTasks(x) and defaults to 1.
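The split arithmetic can be worked through in a short sketch. The code below is plain Python that assumes a non-empty, splittable file; the min/max formula and the 1.1 slop factor follow the behavior commonly described for Hadoop's FileInputFormat and should be checked against the version in use.

```python
SPLIT_SLOP = 1.1  # a final chunk up to 10% over the split size stays in one split

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size is max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, split_size):
    """Count the splits of one non-empty file; files are split independently."""
    splits = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits

MB = 1024 * 1024
split_size = compute_split_size(128 * MB)
for size_mb in (130, 260, 300):
    print(size_mb, "MB ->", count_splits(size_mb * MB, split_size), "map task(s)")
```

A 130 MB file yields one split rather than two, because the 2 MB tail falls within the slop; this avoids launching a map task for a sliver of data.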

23. MapTask Workflow

Read → Map → Collect (partition and buffer) → Spill (sort, optionally run the combiner, write to disk) → Merge (combine the spill files into one partitioned, sorted output file).

24. ReduceTask Workflow

Copy → Merge → Sort → Reduce → Write to HDFS.

25. Types of Sorting in MapReduce

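Partial sort (each output file is sorted internally), total sort (a single globally ordered output), auxiliary sort (a GroupingComparator decides which keys are grouped into one reduce call), and secondary sort (a custom compareTo that orders by more than one field).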

26. Shuffle Phase Optimization

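The shuffle covers partitioning, sorting, spilling, and copying map output to the reducers. To reduce its cost, enlarge the sort buffer so fewer spills occur, add a combiner to shrink map output, and compress map output before it crosses the network.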

27. Combiner Role

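Performs local aggregation on mapper output to reduce network traffic. It must not change the final result, so it suits operations such as sum, count, or max but not a plain average, and its input and output key/value types must both match the mapper's output types.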
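A word-count simulation shows both properties: fewer records cross the shuffle, and the final result is unchanged. The sketch is plain Python rather than Hadoop code, and the function names and input lines are hypothetical.

```python
from collections import Counter

def map_words(line):
    """Mapper: emit (word, 1) for every word."""
    return [(w, 1) for w in line.split()]

def combine(pairs):
    """Combiner: pre-sum counts on the map side; same types in and out."""
    sums = Counter()
    for word, n in pairs:
        sums[word] += n
    return sorted(sums.items())

def reduce_counts(all_pairs):
    """Reducer: final sum per word."""
    sums = Counter()
    for word, n in all_pairs:
        sums[word] += n
    return dict(sums)

splits = ["a b a a", "b b a"]
raw = [map_words(s) for s in splits]
combined = [combine(p) for p in raw]

shuffled_without = sum(len(p) for p in raw)
shuffled_with = sum(len(p) for p in combined)
final_without = reduce_counts(kv for p in raw for kv in p)
final_with = reduce_counts(kv for p in combined for kv in p)
print(shuffled_without, shuffled_with, final_with)
```

Summation works because it is associative and commutative; averaging the per-mapper averages would give a different answer, which is why a combiner cannot be applied blindly.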

28. Default Partitioning

If no custom partitioner is defined, Hadoop uses HashPartitioner, which assigns a record to partition (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
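Hadoop's default HashPartitioner clears the sign bit of the key's hash code before taking the remainder, so negative hash codes still map to a valid partition. The sketch below reproduces that arithmetic in plain Python; it assumes keys hashed like Java's String.hashCode(), which is an illustration choice (Hadoop's Text type hashes its bytes differently).

```python
def java_string_hash(s):
    """Reproduce Java's String.hashCode() for BMP characters as a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def hash_partition(key_hash, num_reduce_tasks):
    """Mask off the sign bit, then take the remainder."""
    return (key_hash & 0x7FFFFFFF) % num_reduce_tasks

for key in ("a", "ab", "hello"):
    print(key, hash_partition(java_string_hash(key), 4))
```

Equal keys always produce the same partition number, which is what guarantees that all values for a key reach the same reducer.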

29. Load Balancing for Skewed Data

Implement custom partitioners or use the hive.groupby.skewindata setting to distribute skewed keys across reducers.

30. Implementing Top‑N in MapReduce

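Make the sort order descending (through the key's compareTo or a custom sort comparator) so the largest records arrive first, use a GroupingComparator if records that differ only in the sorted field must reach the same reduce call, and emit only the first N records in the reducer.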
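A commonly used alternative for a global Top-N, different from the comparator-based approach described here, is to let each mapper keep only its local top N and have a single reducer merge those short lists. The plain Python sketch below simulates that variant; the function names and sample records are hypothetical.

```python
import heapq

def mapper_top_n(records, n):
    """Each mapper keeps only its N largest (score, id) records."""
    return heapq.nlargest(n, records)

def reducer_top_n(partial_lists, n):
    """A single reducer merges the per-mapper lists and keeps the global top N."""
    merged = [rec for part in partial_lists for rec in part]
    return heapq.nlargest(n, merged)

split_a = [(50, "a"), (90, "b"), (10, "c")]
split_b = [(70, "d"), (95, "e"), (20, "f")]
top2 = reducer_top_n([mapper_top_n(split_a, 2), mapper_top_n(split_b, 2)], 2)
print(top2)
```

Each mapper ships at most N records, so the reducer sees N times the number of mappers rather than the full dataset.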

31. DistributedCache for Small Table Broadcast

Cache small tables on each node to enable map‑side joins, dramatically improving performance for joins where one side is tiny.
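The map-side join that the cache enables can be sketched as a dictionary lookup performed inside the map function. This is a plain Python simulation rather than DistributedCache API code; the table contents and function names are hypothetical.

```python
def load_small_table(rows):
    """Build the in-memory lookup that every map task would load from the cache."""
    return {key: value for key, value in rows}

def map_join(big_rows, lookup):
    """Join during the map phase; no shuffle or reducer is needed."""
    for key, payload in big_rows:
        if key in lookup:  # inner join: drop rows without a match
            yield key, payload, lookup[key]

# Hypothetical data: a tiny dimension table and a larger fact table
dept_names = load_small_table([(10, "sales"), (20, "ops")])
employees = [(10, "ann"), (20, "bob"), (30, "cy"), (10, "dee")]
joined = list(map_join(employees, dept_names))
print(joined)
```

The approach only works while the small table fits comfortably in each task's memory; beyond that, a reduce-side join is the fallback.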

32. MapReduce Join Strategies

Reduce‑side join tags records from each source and aggregates them in the reducer; map‑side join loads the small table into memory and joins during the map phase.

33. When MapReduce Is Not Suitable

Small data volumes, many tiny files, index‑driven access, transactional workloads, or single‑node environments.

34. ETL Acronym

Extract, Transform, Load: pulling data from source systems, cleaning and reshaping it, and loading it into the target warehouse.

35. HDFS Replication and Block Size

Default replication factor is 3. Block size was 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.
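As a worked example of what these defaults mean for storage, the sketch below computes the block count and raw disk usage for a single file. It is a back-of-the-envelope helper in plain Python with a hypothetical function name, and it assumes the last block occupies only the bytes it actually holds.

```python
import math

def hdfs_footprint(file_mb, block_mb=128, replication=3):
    """Return (block count, raw MB stored across the cluster) for one file."""
    blocks = math.ceil(file_mb / block_mb)
    # A partially filled last block does not consume a full block on disk,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_mb * replication
    return blocks, raw_mb

print(hdfs_footprint(300))               # Hadoop 2.x defaults
print(hdfs_footprint(300, block_mb=64))  # Hadoop 1.x block size
```

A 300 MB file therefore occupies 3 blocks under the 128 MB default (5 under 64 MB) and 900 MB of raw capacity either way.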

36. HDFS Storage Mechanism

Clients request block placement from the NameNode, which returns DataNode locations. Data is streamed through a pipeline of DataNodes, and blocks are written in packets. Reads follow a similar lookup and streaming process.

37. Secondary NameNode Operation

Periodically checkpoints the namespace by merging the edit log with the fsimage and copying the result back to the NameNode.
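The merge step amounts to replaying the logged operations onto the last snapshot. The sketch below models that idea in plain Python with a dictionary as the namespace; the operation names and tuple layout are hypothetical and far simpler than the real fsimage and edit log formats.

```python
def apply_edits(fsimage, edits):
    """Replay logged operations onto a copy of the namespace snapshot."""
    image = dict(fsimage)
    for op, path, *args in edits:
        if op == "create":
            image[path] = args[0]
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[args[0]] = image.pop(path)
    return image

def checkpoint(fsimage, edits):
    """Merge fsimage with the edit log; the log can then start empty."""
    return apply_edits(fsimage, edits), []

fsimage = {"/a": 1}
edits = [("create", "/b", 2), ("rename", "/a", "/c"), ("delete", "/b")]
new_image, new_edits = checkpoint(fsimage, edits)
print(new_image, new_edits)
```

Checkpointing keeps the edit log short, which in turn keeps NameNode restart time down because fewer operations have to be replayed.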

38. NameNode vs. Secondary NameNode

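The NameNode manages the filesystem namespace and serves client requests. The Secondary NameNode only assists with checkpointing; it is not a hot standby, although its most recent checkpoint can help recover a failed NameNode, minus any edits made after that checkpoint.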

39. HDFS Architecture

Consists of HDFS Client, NameNode, DataNode, and Secondary NameNode.

40. HA NameNode (ZKFailoverController)

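Monitors the health of its local NameNode, maintains a session in ZooKeeper, and manages the active/standby state: when the active NameNode fails, a standby's controller wins the ZooKeeper election and promotes its NameNode to active.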

41. Reasons for Slow MapReduce Jobs

Hardware limitations, I/O inefficiencies, data skew, improper map/reduce counts, excessive small files, large unsplittable files, and frequent spills/merges.

42. MapReduce Optimization Techniques

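Merge small files or read them with CombineFileInputFormat; tune the sort buffer (io.sort.mb, io.sort.spill.percent) and the merge fan-in (io.sort.factor) to reduce spills and merge passes (in Hadoop 2 these are named mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent, and mapreduce.task.io.sort.factor); enable combiners; set appropriate map and reduce counts; let reducers start copying before all maps finish (mapreduce.job.reduce.slowstart.completedmaps); compress intermediate data; and address data skew via sampling, custom partitioners, or combiners.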

43. HDFS Small‑File Optimization

Use Hadoop Archive (HAR), SequenceFiles, or CombineFileInputFormat to pack many small files into larger containers, reducing NameNode memory usage.

44. Hadoop 1 vs. Hadoop 2 Architecture

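Hadoop 2 moved resource management out of MapReduce into YARN and added NameNode high availability with ZooKeeper-based automatic failover. In Hadoop 1, the JobTracker handled both resource scheduling and job tracking, and the NameNode was a single point of failure.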

45. Why YARN Was Created

To decouple application execution from resource management, allowing multiple processing frameworks (MapReduce, Spark, Storm, etc.) to run on the same cluster.

46. HDFS Compression Algorithms

Common codecs: bzip2, gzip, LZO, Snappy (Snappy is widely used in production).

47. Hadoop Scheduler Overview

FIFO (the Hadoop 1 default), Capacity Scheduler (multiple queues with guaranteed capacity; the default in Apache Hadoop 2), and Fair Scheduler (queues that share resources evenly over time).

48. MapReduce 2.0 Fault Tolerance

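If the MRAppMaster itself fails, YARN restarts it (2 attempts by default). The MRAppMaster in turn reschedules failed map and reduce tasks, each of which gets up to 4 attempts by default before the job is marked as failed.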

49. Speculative Execution Algorithm

Estimates task completion time based on progress; launches backup tasks for stragglers, selecting the fastest result while limiting concurrent speculative tasks.
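The estimate-then-pick logic can be sketched as follows. This is a deliberately simplified model in plain Python, assuming a constant progress rate per task; Hadoop's actual speculator applies more conditions, and the function names, task tuples, and cap parameter here are hypothetical.

```python
def estimated_finish(progress, elapsed_s):
    """Estimate total runtime from the progress rate seen so far."""
    rate = progress / elapsed_s            # fraction completed per second
    remaining_s = (1.0 - progress) / rate
    return elapsed_s + remaining_s

def pick_speculation_candidate(tasks, running_backups, max_backups):
    """Return the task id with the latest estimated finish, or None if capped."""
    if running_backups >= max_backups:
        return None
    candidates = [(estimated_finish(p, t), task_id)
                  for task_id, p, t in tasks if 0 < p < 1]
    if not candidates:
        return None
    return max(candidates)[1]

# (task id, progress fraction, seconds elapsed)
tasks = [("m1", 0.9, 90), ("m2", 0.2, 100), ("m3", 0.5, 60)]
print(pick_speculation_candidate(tasks, running_backups=0, max_backups=1))
```

Task m2 has made the least progress per second and so has the latest projected finish, making it the straggler that gets a backup attempt; once the cap on concurrent backups is reached, no further candidates are chosen.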

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
