
Comprehensive Big Data Interview Q&A: Hadoop, Spark, Kafka, Hive, and Related Technologies

This article presents a detailed interview-style walkthrough covering Hadoop cluster setup, HDFS components, MapReduce workflow, YARN advantages, Spark fundamentals, Kafka replication, Hive table types, and related big‑data concepts, providing concise explanations and practical insights for data engineers.


1. Self Introduction

Keep the self‑introduction within 4.5–5 minutes. Cover basic personal information and work experience: employment duration, company name, position, main responsibilities, achievements, and reason for leaving. Be prepared for deeper follow‑up questioning (a pressure interview) on any of these points.

2. Hadoop Cluster Configuration Files

The three essential XML files for a Hadoop cluster are core-site.xml, hdfs-site.xml, and mapred-site.xml.

3. Core Hadoop Daemons

Typical Hadoop processes include NameNode, DataNode, and Secondary NameNode.

4. Roles of Daemons

NameNode maintains the HDFS namespace and metadata; DataNode stores actual block data and replicates blocks; Secondary NameNode periodically merges the edit log with the filesystem image for checkpointing.

5. Detailed Secondary NameNode Function

Secondary NameNode regularly contacts NameNode, performs checkpointing by merging the edit log with the current fsimage, writes a new fsimage, and uploads it back to NameNode.
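The checkpoint can be pictured as replaying a list of edits onto a snapshot of the namespace. A toy sketch in Python (illustrative only; real HDFS stores fsimage and edits as binary files, and the operation set is far richer):

```python
# Toy model of checkpointing: the fsimage is a snapshot of the namespace,
# the edit log is the list of operations applied since that snapshot.
# Merging replays the edits onto a copy of the image.

def checkpoint(fsimage, edit_log):
    """Apply an edit log to a copy of the fsimage; return the new image."""
    new_image = dict(fsimage)
    for op, path in edit_log:
        if op == "create":
            new_image[path] = {}
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

fsimage = {"/data/a.txt": {}}
edits = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
new_fsimage = checkpoint(fsimage, edits)
print(sorted(new_fsimage))  # ['/data/b.txt']
```

After the merge, the edit log can be truncated, which is exactly what keeps NameNode restarts fast.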

6. HDFS Block Replication and Size

By default each block is replicated three times and the block size is 128 MB (previously 64 MB in Hadoop 1.0, changed to 128 MB in Hadoop 2.0).

7. Changing Block Size

Block size is configured via the dfs.blocksize property in hdfs-site.xml (it can also be set per file at write time). The choice is driven by disk characteristics: a block should be large enough that transfer time dominates seek time, so faster disks justify larger blocks.

8. HDFS Read/Write Process

The client first contacts NameNode to request file creation, then obtains DataNode locations for each block, establishes pipelines, and streams data block by block through the DataNodes.
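The pipeline step above can be sketched as a chain of DataNodes that each store a packet and forward it downstream. A minimal pure-Python model (the node names and packet granularity are illustrative assumptions, not HDFS internals):

```python
# Minimal sketch of the HDFS write pipeline: the client streams packets to
# the first DataNode, which forwards each packet down the chain; by the end
# of the block, every node in the pipeline holds a full replica.

def write_block(packets, pipeline):
    """Stream packets through a DataNode pipeline; return per-node storage."""
    storage = {node: [] for node in pipeline}
    for packet in packets:
        for node in pipeline:          # each node stores, then forwards
            storage[node].append(packet)
    return storage

stores = write_block(["pkt1", "pkt2"], ["dn1", "dn2", "dn3"])
print(stores["dn3"])  # ['pkt1', 'pkt2'] -> the last replica holds the full block
```

In the real protocol, acknowledgements flow back upstream through the same pipeline before the client considers a packet written.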

9. MapReduce Workflow

The client requests a job ID from JobTracker, uploads required resources to HDFS, JobTracker schedules map tasks, TaskTrackers send heartbeats, and upon completion JobTracker marks the job successful.

10. Map Phase Details

Map tasks partition and sort their intermediate key/value output and optionally run a combiner for local aggregation; reducers then fetch (shuffle) the relevant partitions from each completed map task.

11. Data Skew

Data skew typically occurs on the reducer side when a particular key has disproportionately large data, causing some reducers to run much longer.

12. Skew Mitigation

Introduce a random prefix to heavy keys during the map phase, perform local aggregation, then remove the prefix before the final reduce to balance load.
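The salting trick above can be demonstrated with a two-stage word count in plain Python (a sketch of the idea, not MapReduce itself; the bucket count of 4 is an arbitrary illustration):

```python
import random
from collections import Counter

def salted_count(pairs, salt_buckets=4):
    """Two-stage aggregation: salt keys, combine locally per salted key,
    then strip the salt and merge the partial sums."""
    # Stage 1: a random prefix spreads one hot key across many reducers.
    salted = Counter()
    for key, value in pairs:
        salted[(random.randrange(salt_buckets), key)] += value
    # Stage 2: drop the prefix and merge the partial aggregates.
    final = Counter()
    for (_salt, key), value in salted.items():
        final[key] += value
    return dict(final)

pairs = [("hot", 1)] * 1000 + [("cold", 1)] * 3
print(salted_count(pairs))  # {'hot': 1000, 'cold': 3}
```

Without the salt, one reducer would receive all 1000 "hot" records; with it, the heavy key is split into up to four partial sums before the final merge.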

13. Effect of Combiner

Combiner reduces the amount of data transferred from mapper to reducer by locally aggregating identical keys.
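The effect is easy to quantify with a word count: the combiner collapses repeated keys before any record crosses the network. A small sketch (sample lines are made up for illustration):

```python
from collections import Counter

lines = ["big data big", "data big"]
map_output = [(word, 1) for line in lines for word in line.split()]
print(len(map_output))  # 5 records would be shuffled without a combiner

combined = Counter()
for word, count in map_output:   # combiner: sum identical keys locally
    combined[word] += count
print(len(combined), dict(combined))  # 2 {'big': 3, 'data': 2}
```

Five intermediate records shrink to two, and the reducer output is unchanged, which is why a combiner must be an associative, commutative function such as sum.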

14. Map Output Spill

When map output exceeds the in‑memory buffer, it is spilled to local disk before being sent to reducers.

15. Default Partitioning

Map output keys are hashed and the hash value modulo the number of reducers determines the target partition.
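Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A Python analogue (CRC32 stands in for Java's hashCode(), since Python's built-in str hash is randomized per process):

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Analogue of Hadoop's default HashPartitioner:
    mask off the sign bit, then take the modulo."""
    return (zlib.crc32(key.encode()) & 0x7FFFFFFF) % num_reducers

parts = [partition(k, 4) for k in ["alpha", "beta", "alpha"]]
print(parts[0] == parts[2])  # True: equal keys always reach the same reducer
```

The sign-bit mask matters because a hash can be negative, and a negative modulo would produce an invalid partition number.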

16. Hadoop Tuning Areas

Performance tuning focuses on I/O buffer sizes, disk read‑ahead settings, and configuration parameters in core-site.xml, hdfs-site.xml, and mapred-site.xml, such as io.file.buffer.size and dfs.blocksize.

17. Example MapReduce Job on a 1 GB File

With a 128 MB block size, the file is split into eight blocks, creating eight mappers; a custom partitioner groups records by the class field, and the job sorts by id before reducing.
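The split arithmetic behind the eight mappers is simply the file size divided by the block size, rounded up:

```python
import math

block_size_mb = 128
file_size_mb = 1024  # 1 GB

num_splits = math.ceil(file_size_mb / block_size_mb)
print(num_splits)  # 8 -> one map task per 128 MB input split
```

By default one map task is launched per input split, so a 1 GB file yields eight mappers.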

18. YARN Overview

YARN separates resource management (ResourceManager) from node management (NodeManager); resources are allocated in containers, enabling better multi‑tenant scheduling for MapReduce and other frameworks.

19. Spark Advantages over MapReduce

Spark leverages in‑memory computation, DAG‑based scheduling, RDD transformations and actions, and lineage‑based fault tolerance, resulting in faster execution.

20. RDD Definition

Resilient Distributed Dataset (RDD) is an immutable, partitioned collection of records that supports parallel operations.

21. Common RDD Operations

Transformations include map(), filter(), flatMap(), and distinct(); actions include collect() and reduce().

22. reduceByKey vs. groupByKey

reduceByKey performs local aggregation before shuffling, reducing network traffic; groupByKey shuffles all values for each key, which can cause high memory usage.
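The difference in shuffle volume can be counted directly. A pure-Python sketch of one mapper's output (the record counts are illustrative, not from Spark itself):

```python
from collections import Counter

# One mapper's output: many values for a few keys.
records = [("a", 1)] * 100 + [("b", 1)] * 50

# groupByKey: every record crosses the shuffle boundary.
group_shuffled = len(records)

# reduceByKey: map-side combine first; only one partial sum per key is shuffled.
partial = Counter()
for key, value in records:
    partial[key] += value
reduce_shuffled = len(partial)

print(group_shuffled, reduce_shuffled)  # 150 2
```

With map-side combining, 150 shuffled records collapse to 2 partial sums, which is also why groupByKey can blow up executor memory when a key's value list is huge.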

23. Spark Streaming Fault Tolerance

Checkpointing stores Kafka offsets and streaming state, allowing the application to resume from the last checkpoint after a failure.

24. Alternative Fault‑Tolerance

Write the consumed Kafka data to HDFS as a write‑ahead log, enabling recovery of lost data.

25. Broadcast Variables

Broadcast variables are defined on the driver and read‑only on executors, allowing large read‑only data to be efficiently shared across tasks.
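A back-of-envelope model shows why broadcasting helps: a value captured in a task closure is serialized once per task, while a broadcast variable is shipped once per executor. The numbers below are hypothetical, purely to illustrate the ratio:

```python
# Toy cost model: closure-captured data is shipped per task,
# broadcast data is shipped per executor and shared by its tasks.
def shipped_units(table_size, num_executors, tasks_per_executor, broadcast):
    copies = num_executors if broadcast else num_executors * tasks_per_executor
    return table_size * copies

print(shipped_units(100, 4, 8, broadcast=False))  # 3200 units via closures
print(shipped_units(100, 4, 8, broadcast=True))   # 400 units via broadcast
```

The saving scales with tasks per executor, which is why large lookup tables used in many tasks are the canonical broadcast use case.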

26. Accumulators

Accumulators are add‑only from the executors' perspective: tasks can only increment them, and only the driver can read the accumulated value, which makes them suitable for counters and sums.

27. Spark Job, Stage, Task

A job is the computation triggered by an action; a stage is a set of tasks with no shuffle dependency between them (jobs are split into stages at shuffle boundaries); a task is the smallest unit of work, one per partition, run on an executor.

28. Zookeeper Basics

Zookeeper provides distributed coordination with a leader‑follower architecture.

29. Zookeeper Leader Election (Example)

Each server initially votes for itself, then exchanges votes with its peers; votes are compared by election epoch, then transaction ID (zxid), then server ID, and the server whose vote gathers a majority becomes the leader.

30. Hive Overview

Hive is a data‑warehouse system on Hadoop that maps structured data to tables and offers SQL‑like queries.

31. Internal vs. External Tables

Internal tables are managed by Hive (data deleted with the table); external tables reference data stored in HDFS (data remains after dropping the table).

32. User‑Defined Functions (UDF)

UDFs extend Hive’s built‑in functions to handle custom business logic.

33. Join Strategy

Place the smaller table on the left side of a join to reduce data movement.

34. Kafka Replication

Kafka replicates partitions across brokers; one broker acts as the leader, others as followers, ensuring fault tolerance.

35. Consumer Read Path

Consumers read from the leader of a partition; followers only serve as replicas.

36. ISR Mechanism

In‑Sync Replicas (ISR) are the replicas (the leader plus the followers that are fully caught up); with acks=all, the leader commits a record only once every replica in the ISR has acknowledged it.
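The committed position can be modeled as a high watermark: the minimum log-end offset across the leader and the ISR. A small sketch (broker names and offsets are made up for illustration):

```python
def high_watermark(leader_log_end, follower_log_ends, isr):
    """A record is committed once every in-sync replica has it, so the
    high watermark is the minimum log-end offset over leader + ISR."""
    offsets = [leader_log_end] + [follower_log_ends[f] for f in isr]
    return min(offsets)

followers = {"b2": 10, "b3": 7, "b4": 3}  # b4 is lagging and out of the ISR
print(high_watermark(12, followers, isr=["b2", "b3"]))  # 7
```

Note that the lagging replica b4 does not hold the watermark back, because it has been evicted from the ISR; consumers only ever read up to the high watermark.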

37. Kafka Data Retention

Kafka retains logs for a configurable period (default 7 days) or until size limits are reached; old segments are marked for deletion and eventually removed.

38. Message Format

A Kafka message consists of a fixed‑length header (magic byte, CRC32, optional attributes) and a variable‑length body containing the key/value payload.

39. MySQL Leftmost Prefix Principle

When multiple columns are indexed, the leftmost column(s) are used first for query optimization; placing the most frequently filtered column on the left yields better performance.
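The principle falls out of how a composite B-tree index orders its entries: an index on (last_name, first_name) is effectively a list sorted by the tuple, so it supports range scans on last_name alone but provides no useful ordering on first_name alone. A sketch with bisect (the names are hypothetical sample data):

```python
import bisect

# A composite index on (last_name, first_name) behaves like a sorted tuple list.
index = sorted([("li", "wei"), ("wang", "fang"), ("wang", "lei"), ("zhang", "san")])

def seek_leftmost(last_name):
    """Binary-search the range sharing the leftmost column, like an index scan."""
    lo = bisect.bisect_left(index, (last_name,))
    hi = bisect.bisect_left(index, (last_name + "\uffff",))
    return index[lo:hi]

print(seek_leftmost("wang"))  # [('wang', 'fang'), ('wang', 'lei')]
```

Filtering on first_name alone would force a scan of the whole list, which is exactly why a query that skips the leftmost column cannot use the index.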

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
