Hive and Hadoop Interview Questions and Answers
This article provides a comprehensive collection of interview-style questions and detailed answers covering Hive concepts, Hadoop architecture, MapReduce mechanics, HDFS operations, and performance optimization techniques for big‑data processing environments.
1. Hive Table Join Using MapReduce
If one table is small, use a map‑side join: load the small table into memory in each mapper and join while scanning the large table. For two large tables, use a reduce‑side join: create a composite key (join field + flag) so records from each table can be told apart, partition and group by the join field so matching records reach the same reduce call, and perform the join in the reducer.
2. Characteristics of Hive and Differences from RDBMS
Hive is a data‑warehouse tool built on Hadoop that maps structured files to tables and translates SQL queries into MapReduce jobs. It offers low learning cost and SQL‑like syntax but does not support real‑time queries. Unlike relational databases, Hive stores data on HDFS, historically lacked transaction support (row‑level ACID arrived only in later versions and only for certain table types), and is optimized for batch analytics.
3. Meaning of Sort By, Order By, Cluster By, Distribute By
Order By : Global sorting through a single reducer, which can be slow for large datasets.
Sort By : Sorts rows within each reducer, so each reducer's output is ordered but the overall result is not.
Distribute By : Partitions rows across reducers by the specified column, so rows with the same value go to the same reducer.
Cluster By : Equivalent to Distribute By plus Sort By on the same column; ascending order only.
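The difference between the four clauses can be sketched with a small in-memory simulation. This is only an illustration of the semantics, not how Hive executes them; `toy_hash`, the sample rows, and the reducer count are assumptions made for the example.

```python
from collections import defaultdict

rows = [("b", 3), ("a", 5), ("c", 1), ("a", 2), ("b", 4), ("c", 6)]
NUM_REDUCERS = 2

def toy_hash(key):
    # Deterministic stand-in for the engine's hash function (an assumption).
    return sum(ord(ch) for ch in key)

def distribute_by(rows, num_reducers):
    """DISTRIBUTE BY key: route each row to a reducer by hashing its key."""
    reducers = defaultdict(list)
    for key, value in rows:
        reducers[toy_hash(key) % num_reducers].append((key, value))
    return dict(reducers)

def sort_within(reducers, sort_key):
    """SORT BY: order rows inside each reducer only, never across reducers."""
    return {r: sorted(part, key=sort_key) for r, part in reducers.items()}

# DISTRIBUTE BY key SORT BY value: equal keys share a reducer, sorted by value.
distributed_sorted = sort_within(distribute_by(rows, NUM_REDUCERS), lambda row: row[1])

# CLUSTER BY key: distribute and sort on the same column.
clustered = sort_within(distribute_by(rows, NUM_REDUCERS), lambda row: row[0])

# ORDER BY value: a single reducer sees every row, so the order is global.
ordered = sorted(rows, key=lambda row: row[1])
```

Each reducer's list in `distributed_sorted` is ordered, but concatenating the lists does not give a globally ordered result; only `ordered` does.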
4. Usage of split, coalesce, and collect_list Functions
split('a,b,c,d', ',') returns the array ['a','b','c','d']. coalesce(v1, v2, …) returns the first non‑null argument, or NULL if all arguments are null. collect_list(col) aggregates all values of a column into an array without deduplication, e.g., SELECT collect_list(id) FROM table.
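For readers who think in code, the behavior of the three functions can be mimicked in plain Python. These are analogies only, not Hive's implementations; the function names simply mirror the Hive ones, and the sample rows are made up.

```python
def split(text, delimiter):
    # Hive's split takes a regular expression; this sketch assumes a literal delimiter.
    return text.split(delimiter)

def coalesce(*values):
    # Return the first argument that is not None, or None if there is none.
    for value in values:
        if value is not None:
            return value
    return None

def collect_list(rows, column):
    # Gather one column across all rows, keeping duplicates and input order.
    return [row[column] for row in rows]

sample_rows = [{"id": 1}, {"id": 2}, {"id": 2}]
ids = collect_list(sample_rows, "id")
```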
5. Hive Metastore Storage Options
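Hive supports three metastore deployment modes: embedded (Derby in the same JVM, suitable for unit tests), local (the metastore runs inside the Hive process and is backed by an external database), and remote (a standalone metastore service that clients reach via Thrift).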
6. Difference Between Internal and External Tables
Loading data into an internal (managed) table moves it into Hive’s warehouse directory; dropping the table deletes both metadata and data. An external table only records the data location; dropping it removes metadata while preserving the data.
7. UDF, UDAF, and UDTF Differences
UDF processes one input row and produces one output row.
UDAF processes multiple input rows and produces a single aggregated output row.
UDTF processes one input row and can emit multiple output rows.
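The three row-cardinality patterns can be illustrated with ordinary Python functions. This is an analogy, not the Hive UDF API; the function names are hypothetical.

```python
def upper_udf(value):
    # UDF shape: one input row, one output row.
    return value.upper()

def sum_udaf(values):
    # UDAF shape: many input rows, one aggregated output row.
    total = 0
    for value in values:
        total += value
    return total

def explode_udtf(csv_row):
    # UDTF shape: one input row, zero or more output rows.
    for item in csv_row.split(","):
        yield item

udf_out = [upper_udf(v) for v in ["a", "b"]]   # still two rows
udaf_out = sum_udaf([1, 2, 3])                 # collapsed to one value
udtf_out = list(explode_udtf("x,y,z"))         # one row became three
```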
8. Does Every Hive Query Run as a MapReduce Job?
Not always. Since Hive 0.10.0, simple queries such as SELECT … LIMIT n can be executed using a fetch task without launching a MapReduce job.
9. Understanding Hive Bucket Tables
Bucket tables hash a specified column and store rows in separate files (buckets). The hash value modulo the number of buckets determines the target file. Bucketing is useful for sampling queries but is not intended for general storage.
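A minimal sketch of the bucket assignment described above, assuming the hash of an integer id is the id itself (a simplification; the actual hash depends on the column type and Hive version):

```python
def bucket_for(hash_value, num_buckets):
    # Mask the sign bit so the modulo result is never negative.
    return (hash_value & 0x7FFFFFFF) % num_buckets

def bucketize(ids, num_buckets):
    """Group ids by target bucket file, using the id as its own hash."""
    buckets = {b: [] for b in range(num_buckets)}
    for row_id in ids:
        buckets[bucket_for(row_id, num_buckets)].append(row_id)
    return buckets

buckets = bucketize(range(10), 4)
```

Reading a single entry of `buckets` corresponds to sampling one bucket file, which is why bucketing pairs well with sampling queries.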
10. Hive Interaction with Databases
Hive queries run as MapReduce jobs over data stored in HDFS. Hive uses a relational database such as MySQL only to store its metastore information; it does not store actual data there.
11. Hive Local Mode
For small input datasets, Hive can run in local mode on a single machine, reducing job startup overhead. Enable it by setting hive.exec.mode.local.auto=true.
12. File Formats in Hive
TextFile : Row‑oriented, uncompressed by default; high disk and parsing overhead.
SequenceFile : Binary, row‑oriented, supports NONE, RECORD, and BLOCK compression (BLOCK is recommended).
RCFile : Row‑partitioned, column‑oriented storage; good compression but higher load cost.
ORCFile : Column‑oriented like RCFile, but with faster compression and query performance.
13. Handling Data Skew in Hive Joins
Skew occurs when key distribution is uneven, so a few reducers receive most of the rows. Mitigation strategies include assigning random values to null keys, adjusting parameters (hive.map.aggr=true and hive.groupby.skewindata=true for aggregations, hive.optimize.skewjoin=true for joins), using map‑side joins for small tables, and redistributing skewed keys.
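The null-key technique can be sketched as follows. Null keys never match in an inner join, so replacing them with distinct placeholder values spreads them across reducers without changing the matched rows. `toy_hash`, the `null_` prefix, and the constants are assumptions for the illustration; a real job must pick placeholders that cannot collide with genuine keys.

```python
import random
from collections import Counter

NUM_REDUCERS = 4
NUM_SALTS = 8  # hypothetical number of distinct placeholder values

def toy_hash(key):
    # Deterministic stand-in for the engine's hash function (an assumption).
    return sum(ord(ch) for ch in str(key))

def reducer_loads(keys):
    """Count how many rows each reducer would receive."""
    return Counter(toy_hash(k) % NUM_REDUCERS for k in keys)

def salt_null_keys(keys, rng):
    """Replace each null key with a random placeholder so nulls spread out."""
    return [k if k is not None else f"null_{rng.randrange(NUM_SALTS)}" for k in keys]

keys = [None] * 1000 + ["u1", "u2", "u3", "u4"] * 25
before = reducer_loads(keys)
after = reducer_loads(salt_null_keys(keys, random.Random(42)))
```

Comparing `max(before.values())` with `max(after.values())` shows the hot reducer losing most of its load once the nulls are salted.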
14. Fetch Task
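For queries such as SELECT * FROM table, Hive can bypass MapReduce and read the files directly with a fetch task. This is controlled by hive.fetch.task.conversion; its default has been more since Hive 0.14 (minimal before that), which enables fetch for simple selects, filters, and LIMIT queries.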
15. Hadoop Cluster Bottlenecks
Disk I/O and network bandwidth are the primary performance constraints.
16. Hadoop Running Modes
Standalone, pseudo‑distributed, and fully distributed modes.
17. Hadoop Ecosystem Components
ZooKeeper (coordination), Flume (log collection), HBase (column‑family store), Hive (SQL‑on‑Hadoop), Sqoop (data transfer between RDBMS and HDFS).
18. Hadoop vs. Hadoop Ecosystem
Hadoop refers to the core framework (HDFS, MapReduce, and YARN); the ecosystem includes auxiliary projects like ZooKeeper, Flume, HBase, Hive, and Sqoop.
19. Hadoop Daemons
NameNode, SecondaryNameNode, DataNode, ResourceManager (which replaced the Hadoop 1.x JobTracker), NodeManager (which replaced the TaskTracker), plus DFSZKFailoverController and JournalNode in high‑availability deployments.
20. Hadoop Serialization
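Hadoop uses its own Writable serialization because it is more compact and faster than Java serialization. A custom bean must implement Writable, provide a no‑arg constructor (used reflectively during deserialization), and implement write(DataOutput) and readFields(DataInput), reading fields in the same order they were written. A bean used as a key must implement WritableComparable, because keys are sorted during the shuffle.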
21. FileInputFormat Splitting Mechanism
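Input splits are logical divisions of the input files, sized by default to match the HDFS block size. They are computed when the job is submitted and stored as metadata (file path, offset, length) rather than as copies of the data. The framework (the JobTracker in Hadoop 1.x, the MRAppMaster under YARN) launches one map task per split.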
22. Determining Map and Reduce Task Numbers
Map task count equals the number of input splits, roughly each file's size divided by the split size; reduce task count is set via job.setNumReduceTasks(x), defaulting to 1.
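The split arithmetic can be worked through with a short sketch modeled on FileInputFormat's logic, in which the split size is the block size clamped by configurable minimum and maximum values, and a final remainder of up to 10% over the split size is folded into the last split. The function names are hypothetical and the sketch covers a single non-empty, splittable file.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Clamp the block size between the configured minimum and maximum.
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, split_size):
    """Number of splits (and so map tasks) for one non-empty splittable file."""
    splits = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # the tail, possibly slightly larger than split_size
    return splits

split_size = compute_split_size(128 * MB)
maps_for_300mb = count_splits(300 * MB, split_size)
maps_for_129mb = count_splits(129 * MB, split_size)
```

A 300 MB file yields three splits with a 128 MB split size, while a 129 MB file yields only one, because the extra 1 MB falls within the 10% tolerance.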
23. MapTask Workflow
Read → Map → Collect (partition and buffer) → Spill (sort, optionally run the combiner, and write to disk) → Combine (merge all spill files into one partitioned, sorted output file).
24. ReduceTask Workflow
Copy → Merge → Sort → Reduce → Write to HDFS.
25. Types of Sorting in MapReduce
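Partial sorting (each output file is sorted internally), total sorting (a single globally ordered output, for example via one reducer or a range partitioner), auxiliary sorting (a grouping comparator decides which keys share one reduce call), and secondary sorting (a composite key whose compareTo orders records by more than one field).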
26. Shuffle Phase Optimization
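The shuffle covers partitioning, sorting, and spilling map output, then copying and merging it on the reducers. Optimize it by adding a combiner where the operation allows, compressing map output (spill files), enlarging the sort buffer to reduce the number of spills, and raising the merge factor to reduce merge passes.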
27. Combiner Role
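Performs local aggregation on mapper output to reduce network traffic. It must not alter final results (so it suits operations such as sum or max, not average), and both its input and output key/value types must match the mapper's output types, since its output is what the reducer receives.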
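A word-count sketch shows the effect: the combiner shrinks what each mapper ships without changing the final counts. This is an in-memory illustration under made-up input, not Hadoop code.

```python
from collections import Counter

def map_words(line):
    # Mapper: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: sum counts locally; input and output are both (word, count).
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

def reduce_counts(pairs):
    # Reducer: sum counts across all mappers.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

mapper_inputs = ["a b a", "b b c"]
mapper_outputs = [map_words(line) for line in mapper_inputs]

shuffled_plain = [pair for output in mapper_outputs for pair in output]
shuffled_combined = [pair for output in mapper_outputs for pair in combine(output)]

result_plain = reduce_counts(shuffled_plain)
result_combined = reduce_counts(shuffled_combined)
```

Here six records cross the network without the combiner and four with it, while both runs produce the same totals.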
28. Default Partitioning
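If no custom partitioner is defined, Hadoop uses HashPartitioner, which assigns each record to partition (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; the mask keeps the result non‑negative even when the hash code is negative.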
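The default rule can be reproduced in a few lines. The sketch below uses Java's String.hashCode formula as the key hash, which is an assumption for illustration (Hadoop's Text type computes its hash differently), and it assumes ASCII keys.

```python
def java_string_hash(text):
    """Java's String.hashCode for ASCII text, as a signed 32-bit integer."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF   # wrap to 32 bits like Java int
    return h - 0x100000000 if h >= 0x80000000 else h

def hash_partition(key, num_reducers):
    # Clear the sign bit before the modulo so the partition is never negative.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

partitions = {key: hash_partition(key, 3) for key in ["hello", "hive", "hadoop"]}
```

Every record with the same key lands in the same partition, which is what lets a reducer see all values for a key.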
29. Load Balancing for Skewed Data
Implement custom partitioners or use the hive.groupby.skewindata setting to distribute skewed keys across reducers.
30. Implementing Top‑N in MapReduce
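Sort keys in descending order, using the key's compareTo or a custom sort comparator, and use a GroupingComparator to route the records that belong together into one reduce call; the reducer then emits only the first N records.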
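As a rough in-memory analogy of that pattern, the sort step below plays the role of the sort comparator and `groupby` plays the role of the grouping comparator. The record layout and function name are hypothetical.

```python
from itertools import groupby

def top_n_per_group(records, n):
    """records are (group, value) pairs; return the n largest values per group."""
    # Sort comparator: group ascending, value descending within each group.
    ordered = sorted(records, key=lambda r: (r[0], -r[1]))
    result = {}
    # Grouping comparator: all records of one group reach one "reduce call".
    for group, members in groupby(ordered, key=lambda r: r[0]):
        # Values arrive in descending order, so the first n are the top n.
        result[group] = [value for _, value in members][:n]
    return result

records = [("x", 5), ("y", 9), ("x", 7), ("y", 1), ("x", 3)]
top2 = top_n_per_group(records, 2)
```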
31. DistributedCache for Small Table Broadcast
Cache small tables on each node to enable map‑side joins, dramatically improving performance for joins where one side is tiny.
32. MapReduce Join Strategies
Reduce‑side join tags records from each source and aggregates them in the reducer; map‑side join loads the small table into memory and joins during the map phase.
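Both strategies can be sketched on small in-memory tables. The table contents and function names are made up for the illustration, and the sketch implements an inner join only.

```python
from collections import defaultdict

users = [(1, "Ann"), (2, "Bob")]                              # small table: (user_id, name)
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "cup")]   # large table: (user_id, item)

def map_side_join(small, large):
    # The small table is loaded into memory once; each large row is joined on the fly.
    lookup = dict(small)
    return [(uid, lookup[uid], item) for uid, item in large if uid in lookup]

def reduce_side_join(left, right):
    # Map phase: tag each record with its source table, keyed by the join field.
    grouped = defaultdict(lambda: {"L": [], "R": []})
    for uid, name in left:
        grouped[uid]["L"].append(name)
    for uid, item in right:
        grouped[uid]["R"].append(item)
    # Reduce phase: for each key, pair every left record with every right record.
    joined = []
    for uid in sorted(grouped):
        for name in grouped[uid]["L"]:
            for item in grouped[uid]["R"]:
                joined.append((uid, name, item))
    return joined

map_result = map_side_join(users, orders)
reduce_result = reduce_side_join(users, orders)
```

The two functions return the same rows; the map-side version avoids the shuffle entirely, which is why it is preferred whenever one table fits in memory.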
33. When MapReduce Is Not Suitable
Small data volumes, many tiny files, index‑driven access, transactional workloads, or single‑node environments.
34. ETL Acronym
Extraction‑Transformation‑Loading.
35. HDFS Replication and Block Size
Default replication factor is 3. Block size was 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.
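The arithmetic behind these defaults is straightforward. In this sketch the function names are hypothetical and the sizes are example values.

```python
import math

MB = 1024 * 1024

def block_count(file_size, block_size=128 * MB):
    # A file occupies ceil(size / block_size) blocks; the last one may be partial.
    return math.ceil(file_size / block_size)

def raw_storage(file_size, replication=3):
    # Each block is stored `replication` times; a partial block is not padded.
    return file_size * replication

blocks = block_count(300 * MB)          # 128 MB + 128 MB + 44 MB
raw_bytes = raw_storage(300 * MB)       # three full copies of the data
```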
36. HDFS Storage Mechanism
Clients request block placement from the NameNode, which returns DataNode locations. Data is streamed through a pipeline of DataNodes, and blocks are written in packets. Reads follow a similar lookup and streaming process.
37. Secondary NameNode Operation
Periodically checkpoints the namespace by merging the edit log with the fsimage and copying the result back to the NameNode.
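Conceptually, a checkpoint replays the edit log on top of the last fsimage and starts a fresh log. The toy model below treats the namespace as a set of paths and supports three made-up operation names; the real fsimage and edit log formats are far richer.

```python
def apply_edits(fsimage, edit_log):
    """Replay logged operations on top of a namespace snapshot."""
    namespace = set(fsimage)
    for op, *args in edit_log:
        if op == "create":
            namespace.add(args[0])
        elif op == "delete":
            namespace.discard(args[0])
        elif op == "rename":
            old_path, new_path = args
            namespace.discard(old_path)
            namespace.add(new_path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace

def checkpoint(fsimage, edit_log):
    # The merged snapshot becomes the new fsimage and the log starts empty again.
    return apply_edits(fsimage, edit_log), []

old_image = {"/a", "/b"}
edits = [("create", "/c"), ("delete", "/a"), ("rename", "/b", "/d")]
new_image, new_log = checkpoint(old_image, edits)
```

Keeping the edit log short in this way is what bounds NameNode restart time, since a restart only has to replay edits made after the latest checkpoint.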
38. NameNode vs. Secondary NameNode
NameNode manages the filesystem namespace; Secondary NameNode assists with checkpointing and can aid recovery.
39. HDFS Architecture
Consists of HDFS Client, NameNode, DataNode, and Secondary NameNode.
40. HA NameNode (ZKFailoverController)
Monitors NameNode health, maintains a session in ZooKeeper, and uses ZooKeeper‑based election to decide which NameNode is active, triggering failover from active to standby when needed.
41. Reasons for Slow MapReduce Jobs
Hardware limitations, I/O inefficiencies, data skew, improper map/reduce counts, excessive small files, large unsplittable files, and frequent spills/merges.
42. MapReduce Optimization Techniques
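Merge small files or read them with CombineFileInputFormat; tune the map‑side sort buffer (io.sort.mb, io.sort.spill.percent) and the merge fan‑in (io.sort.factor), which Hadoop 2 names mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent, and mapreduce.task.io.sort.factor; enable combiners; set appropriate map/reduce counts; let reducers start copying before all maps finish (mapreduce.job.reduce.slowstart.completedmaps); compress intermediate data; and address data skew via sampling, custom partitioners, or combiners.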
43. HDFS Small‑File Optimization
Use Hadoop Archive (HAR), SequenceFiles, or CombineFileInputFormat to pack many small files into larger containers, reducing NameNode memory usage.
44. Hadoop 1 vs. Hadoop 2 Architecture
Hadoop 2 introduced YARN for resource management and added ZooKeeper‑based high availability for the NameNode.
45. Why YARN Was Created
To decouple application execution from resource management, allowing multiple processing frameworks (MapReduce, Spark, Storm, etc.) to run on the same cluster.
46. HDFS Compression Algorithms
Common codecs: bzip2, gzip, LZO, Snappy (Snappy is widely used in production).
47. Hadoop Scheduler Overview
FIFO (the default in Hadoop 1.x), Capacity Scheduler (multiple queues with guaranteed capacity; the default in Apache Hadoop 2.x and later), and Fair Scheduler (share‑based queues).
48. MapReduce 2.0 Fault Tolerance
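If the MRAppMaster fails, YARN restarts it, allowing 2 attempts in total by default (mapreduce.am.max-attempts). The MRAppMaster reruns failed map or reduce tasks, each of which is allowed up to 4 attempts by default (mapreduce.map.maxattempts, mapreduce.reduce.maxattempts).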
49. Speculative Execution Algorithm
Estimates each task's completion time from its progress rate and launches a backup attempt for stragglers. Whichever attempt finishes first supplies the result and the other is killed; the number of concurrent speculative tasks is capped.
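The estimate can be sketched as follows. This is a simplified illustration of the idea, not Hadoop's actual speculator: the function names, the `fresh_attempt_estimate` parameter, and the cap of one backup are all assumptions.

```python
def estimated_remaining(progress, elapsed_seconds):
    # Progress rate so far, extrapolated over the remaining work.
    rate = progress / elapsed_seconds
    return (1.0 - progress) / rate

def pick_speculation_candidate(tasks, fresh_attempt_estimate,
                               running_backups=0, max_backups=1):
    """tasks maps task id to (progress in 0..1, elapsed seconds)."""
    if running_backups >= max_backups:
        return None  # respect the cap on concurrent speculative tasks
    candidate, worst_remaining = None, fresh_attempt_estimate
    for task_id, (progress, elapsed) in tasks.items():
        if progress <= 0 or progress >= 1:
            continue  # no rate to extrapolate from, or already finished
        remaining = estimated_remaining(progress, elapsed)
        # A backup only helps if a fresh attempt would finish sooner.
        if remaining > worst_remaining:
            candidate, worst_remaining = task_id, remaining
    return candidate

tasks = {"t1": (0.9, 90), "t2": (0.2, 100), "t3": (0.5, 50)}
straggler = pick_speculation_candidate(tasks, fresh_attempt_estimate=120)
```

With these numbers only `t2` is expected to take longer than a fresh attempt, so it is the one selected for a backup.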
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
