Comprehensive Big Data Interview Q&A: Hadoop, Spark, Kafka, Hive, and Related Technologies
This article presents a detailed interview-style walkthrough covering Hadoop cluster setup, HDFS components, MapReduce workflow, YARN advantages, Spark fundamentals, Kafka replication, Hive table types, and related big‑data concepts, providing concise explanations and practical insights for data engineers.
1. Self Introduction
Keep the self‑introduction within 4.5–5 minutes, covering basic personal information and work experience (duration, company name, position, main duties, achievements, reason for leaving), and be prepared for deeper follow‑up (pressure‑interview) questions on any of these points.
2. Hadoop Cluster Configuration Files
The three essential XML configuration files for a Hadoop cluster are core-site.xml, hdfs-site.xml, and mapred-site.xml.
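A minimal core-site.xml, for illustration (the host name and port are placeholders):

```xml
<configuration>
  <!-- Default filesystem URI; clients resolve HDFS paths against this -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```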
3. Core Hadoop Daemons
Typical Hadoop processes include NameNode, DataNode, and Secondary NameNode.
4. Roles of Daemons
NameNode maintains the HDFS namespace and metadata; DataNode stores actual block data and replicates blocks; Secondary NameNode periodically merges the edit log with the filesystem image for checkpointing.
5. Detailed Secondary NameNode Function
Secondary NameNode regularly contacts NameNode, performs checkpointing by merging the edit log with the current fsimage, writes a new fsimage, and uploads it back to NameNode.
6. HDFS Block Replication and Size
By default each block is replicated three times and the block size is 128 MB (64 MB in Hadoop 1.x, raised to 128 MB in Hadoop 2.x).
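Both defaults are set in hdfs-site.xml; a sketch of the relevant properties (values shown are the defaults described above):

```xml
<configuration>
  <!-- Number of replicas per block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Block size in bytes: 128 MB = 134217728 -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>
```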
7. Changing Block Size
Block size is configurable (dfs.blocksize in hdfs-site.xml, and per file at write time); the value is chosen to balance seek time against transfer time, so faster sequential disk throughput favors larger blocks, while smaller blocks increase parallelism across mappers.
8. HDFS Read/Write Process
The client first contacts NameNode to request file creation, then obtains DataNode locations for each block, establishes pipelines, and streams data block by block through the DataNodes.
9. MapReduce Workflow
In classic (Hadoop 1.x) MapReduce, the client requests a job ID from JobTracker, uploads required resources to HDFS, JobTracker schedules map tasks on TaskTrackers, TaskTrackers report progress via heartbeats, and upon completion JobTracker marks the job successful.
10. Map Phase Details
Map tasks perform partitioning, sorting, optional combiner aggregation, and shuffling of intermediate key/value pairs to reducers.
11. Data Skew
Data skew typically occurs on the reducer side when a particular key has disproportionately large data, causing some reducers to run much longer.
12. Skew Mitigation
Introduce a random prefix to heavy keys during the map phase, perform local aggregation, then remove the prefix before the final reduce to balance load.
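The two‑phase salting idea above can be sketched in plain Python (a simulation of the map‑side logic, not actual MapReduce code; the function name and salt count are illustrative):

```python
import random
from collections import defaultdict

def salted_aggregate(pairs, n_salts=4):
    """Two-phase aggregation: salt keys, combine locally, then merge."""
    # Phase 1: prepend a random prefix so one hot key is spread
    # across n_salts intermediate partitions instead of one reducer
    salted = defaultdict(int)
    for key, value in pairs:
        salted[(random.randrange(n_salts), key)] += value
    # Phase 2: strip the prefix and merge the partial sums
    final = defaultdict(int)
    for (_, key), partial in salted.items():
        final[key] += partial
    return dict(final)

pairs = [("hot", 1)] * 1000 + [("cold", 1)] * 10
print(salted_aggregate(pairs))
```

The final result equals a plain aggregation; only the intermediate load distribution changes.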
13. Effect of Combiner
Combiner reduces the amount of data transferred from mapper to reducer by locally aggregating identical keys.
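A toy word‑count illustration of the effect (plain Python, not Hadoop API):

```python
from collections import Counter

# Mapper output for one map task: one (word, 1) pair per occurrence
map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]

# Without a combiner, all 5 pairs cross the network to the reducers.
# A combiner pre-aggregates identical keys on the map side:
combined = list(Counter(w for w, _ in map_output).items())
print(len(map_output), "->", len(combined))  # 5 -> 2
```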
14. Map Output Spill
When map output exceeds the in‑memory buffer, it is spilled to local disk before being sent to reducers.
15. Default Partitioning
Map output keys are hashed and the hash value modulo the number of reducers determines the target partition.
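The rule can be sketched in Python (mirroring the shape of Hadoop's HashPartitioner; note Python's string hash is randomized per process, so the exact partition numbers vary between runs):

```python
def default_partition(key, num_reducers):
    # Mask to a non-negative value, then take modulo the reducer count,
    # analogous to (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    return (hash(key) & 0x7FFFFFFF) % num_reducers

for key in ["apple", "banana", "cherry"]:
    print(key, "->", default_partition(key, 3))
```

The guarantee that matters is that every occurrence of the same key lands in the same partition, so one reducer sees all values for that key.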
16. Hadoop Tuning Areas
Performance tuning focuses on I/O buffer sizes, disk pre‑read settings, and configuration parameters in core-site.xml, hdfs-site.xml, and mapred-site.xml, such as io.file.buffer.size and dfs.blocksize.
17. Example MapReduce Job on a 1 GB File
With a 128 MB block size, the file is split into eight blocks, creating eight mappers; a custom partitioner groups records by the class field, and the job sorts by id before reducing.
18. YARN Overview
YARN separates resource management (ResourceManager) from node management (NodeManager); resources are allocated in containers, enabling better multi‑tenant scheduling for MapReduce and other frameworks.
19. Spark Advantages over MapReduce
Spark leverages in‑memory computation, DAG‑based scheduling, RDD transformations and actions, and lineage‑based fault tolerance, resulting in faster execution.
20. RDD Definition
Resilient Distributed Dataset (RDD) is an immutable, partitioned collection of records that supports parallel operations.
21. Common RDD Operations
Transformations include map() , filter() , flatMap() , distinct() ; actions include collect() , reduce() .
22. reduceByKey vs. groupByKey
reduceByKey performs local aggregation before shuffling, reducing network traffic; groupByKey shuffles all values for each key, which can cause high memory usage.
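The difference in shuffle volume can be simulated in plain Python (counting records that would cross the network, under the assumption of two map partitions):

```python
from collections import defaultdict

# Two map-side partitions emitting (key, 1) pairs
partitions = [[("a", 1)] * 100 + [("b", 1)] * 3,
              [("a", 1)] * 200]

# groupByKey: every record is shuffled as-is
group_shuffled = sum(len(p) for p in partitions)

# reduceByKey: each partition combines locally first, then shuffles
# only one partial sum per distinct key per partition
reduce_shuffled = 0
for p in partitions:
    local = defaultdict(int)
    for k, v in p:
        local[k] += v
    reduce_shuffled += len(local)

print(group_shuffled, reduce_shuffled)  # 303 vs 3 shuffled records
```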
23. Spark Streaming Fault Tolerance
Checkpointing stores Kafka offsets and streaming state, allowing the application to resume from the last checkpoint after a failure.
24. Alternative Fault‑Tolerance
Write the consumed Kafka data to HDFS as a write‑ahead log, enabling recovery of lost data.
25. Broadcast Variables
Broadcast variables are defined on the driver and read‑only on executors, allowing large read‑only data to be efficiently shared across tasks.
26. Accumulators
Accumulators are add‑only on executors (tasks can only add to them, not read them) and readable on the driver, used for aggregating counters or sums across tasks.
27. Spark Job, Stage, Task
A job is triggered by an action in user code; a job is split into stages at shuffle boundaries, each stage being a set of tasks that can run in parallel; a task is the smallest unit of work, processing one partition on an executor.
28. Zookeeper Basics
Zookeeper provides distributed coordination with a leader‑follower architecture.
29. Zookeeper Leader Election (Example)
Each server proposes itself as leader and exchanges votes; votes are compared by election epoch, then last zxid, then server id, so the server with the most up‑to‑date state (and, on a tie, the highest id) becomes leader once a majority agrees.
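A sketch of the vote‑comparison rule in Python (the vote values are made up for illustration):

```python
# Each vote is (election_epoch, last_zxid, server_id); Python tuples
# compare element-wise, matching the election's precedence order
votes = [(1, 0x500, 1), (1, 0x700, 2), (1, 0x700, 3)]
leader = max(votes)
print("leader is server", leader[2])  # server 3 wins the zxid tie on id
```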
30. Hive Overview
Hive is a data‑warehouse system on Hadoop that maps structured data to tables and offers SQL‑like queries.
31. Internal vs. External Tables
Internal tables are managed by Hive (data deleted with the table); external tables reference data stored in HDFS (data remains after dropping the table).
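The distinction in HiveQL, with a placeholder HDFS path:

```sql
-- Managed (internal) table: DROP TABLE also deletes the underlying data
CREATE TABLE logs_managed (id INT, msg STRING);

-- External table: DROP TABLE removes only the metadata;
-- the files under the LOCATION path are left untouched
CREATE EXTERNAL TABLE logs_external (id INT, msg STRING)
LOCATION '/data/logs/';
```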
32. User‑Defined Functions (UDF)
UDFs extend Hive’s built‑in functions to handle custom business logic.
33. Join Strategy
Place the smaller table(s) first (on the left) in a join: Hive buffers the earlier tables in memory and streams the last table through the reducers, so keeping the largest table rightmost reduces memory pressure.
34. Kafka Replication
Kafka replicates partitions across brokers; one broker acts as the leader, others as followers, ensuring fault tolerance.
35. Consumer Read Path
Consumers read from the leader of a partition; followers only serve as replicas.
36. ISR Mechanism
In‑Sync Replicas (ISR) are the set of replicas that are fully caught up with the leader; for producers using acks=all, the leader considers a record committed only once every replica in the ISR has acknowledged it.
37. Kafka Data Retention
Kafka retains logs for a configurable period (default 7 days) or until size limits are reached; old segments are marked for deletion and eventually removed.
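The corresponding broker settings in server.properties (values shown are the common defaults):

```properties
# Retention by time (168 hours = 7 days) and optional size cap per partition
log.retention.hours=168
log.retention.bytes=-1
# Roll a new segment file at 1 GiB; only closed segments can be deleted
log.segment.bytes=1073741824
# How often the cleaner checks for expired segments (5 minutes)
log.retention.check.interval.ms=300000
```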
38. Message Format
In Kafka's classic message format, a message consists of a fixed‑length header (CRC32 checksum, magic byte, attributes) and a variable‑length body containing the key/value payload; newer broker versions batch records together in a revised format.
39. MySQL Leftmost Prefix Principle
When a composite index spans multiple columns, a query can use the index only if it filters on a leftmost prefix of the indexed columns; placing the columns most frequently used in equality filters first yields better performance.
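An illustration with a made‑up table and index:

```sql
-- Composite index on (country, city, signup_date)
CREATE INDEX idx_loc_date ON users (country, city, signup_date);

-- Can use the index: the filter starts at the leftmost column
SELECT * FROM users WHERE country = 'US' AND city = 'Boston';

-- Cannot use the index effectively: the leftmost column is missing
SELECT * FROM users WHERE city = 'Boston';
```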
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.