Comprehensive Spark Interview Questions and Answers

This article provides a detailed collection of Spark interview questions covering deployment modes, performance advantages over MapReduce, shuffle mechanisms, RDD characteristics, optimization techniques, resource management, and various practical aspects of Spark on YARN, Mesos, and Kubernetes.

Big Data Technology & Architecture

Jun 4, 2021

1. Spark deployment modes and their characteristics: Local mode (local, local[k], local[*]), Standalone mode, Spark on YARN (cluster and client modes), and Spark on Mesos (coarse-grained and fine-grained), each with its own resource management and fault-tolerance features.

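2. Why Spark is faster than MapReduce: In-memory computation avoids writing intermediate results to HDFS between steps, DAG-based scheduling pipelines operators within a stage, and lineage-based fault tolerance avoids replicating intermediate data.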

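3. Hadoop vs. Spark shuffle differences: At a high level both partition the map output by key and have reducers fetch their partitions. Hadoop MapReduce always sorts map output by key. Early Spark releases defaulted to hash-based shuffle; sort-based shuffle became the default in Spark 1.2, and the hash shuffle manager was removed in Spark 2.0.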

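4. Spark execution mechanism: The driver creates a SparkContext and requests executors from the cluster manager. It builds a DAG from the RDD operations, the DAGScheduler splits the DAG into stages, and the TaskScheduler distributes each stage's tasks to executors. Executors run the tasks, and resources are released when the application finishes.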
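The stage-creation step can be sketched in plain Python. This is a toy model, not Spark's DAGScheduler: it assumes a linear chain of operators, and `split_into_stages` and the `needs_shuffle` flag are hypothetical names introduced only for illustration.

```python
def split_into_stages(operators):
    """Group a linear chain of operators into stages.

    operators: list of (name, needs_shuffle) pairs in execution order.
    An operator that needs a shuffle starts a new stage; operators that
    do not are pipelined into the current stage.
    """
    stages = [[]]
    for name, needs_shuffle in operators:
        if needs_shuffle and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


# A word-count style pipeline: only reduceByKey needs a shuffle.
word_count = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("reduceByKey", True),
    ("mapValues", False),
]
stages = split_into_stages(word_count)
```

In this simplified view the pipeline becomes two stages, with the shuffle sitting on the boundary between them.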

5. Spark optimization areas: Platform-level (jar distribution, data locality, storage format), application-level (operator pruning, data skew handling, caching, parallelism), and JVM-level (memory settings, serialization, off-heap memory).

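6. Data locality determination: When the DAGScheduler splits the DAG into stages, it computes each task's preferred locations from where that task's partition data resides; the TaskScheduler then tries to launch the task at one of those locations.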

7. RDD elasticity features: Automatic memory-disk storage switching, lineage-based fault tolerance, task and stage retry, checkpoint/persist, flexible scheduling, and fine-grained partitioning.

8. RDD limitations: RDDs support only coarse-grained, bulk writes and transformations. They do not support fine-grained writes or updates to individual records, or incremental iterative computation.

9. Spark shuffle process: The process has three parts: partitioning the map output, storing the intermediate results, and fetching the data on the reduce side. The source gives a reference link for details.
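The three parts of the shuffle can be sketched with in-memory buckets. This is a minimal illustration, not Spark's implementation: real shuffles write to local disk and fetch over the network, and `partition_for`, `map_side_write`, and `reduce_side_fetch` are hypothetical names. CRC32 stands in for the partitioner's hash so the result is deterministic.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Part 1: choose the reduce partition for a key (hash partitioning).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_side_write(records, num_partitions):
    # Part 2: one map task stores its output bucketed by target partition.
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return dict(buckets)


def reduce_side_fetch(map_outputs, partition_id):
    # Part 3: a reduce task fetches its bucket from every map task's output.
    fetched = []
    for buckets in map_outputs:
        fetched.extend(buckets.get(partition_id, []))
    return fetched


map_outputs = [
    map_side_write([("a", 1), ("b", 1), ("a", 1)], num_partitions=2),
    map_side_write([("b", 1), ("c", 1)], num_partitions=2),
]
reduce_inputs = [reduce_side_fetch(map_outputs, p) for p in range(2)]
```

Whatever the hash values turn out to be, every record for a given key ends up at the same reduce task, which is the property the shuffle exists to provide.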

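10. Spark data locality types: Referenced but not detailed in the source. Spark defines five levels, from closest to farthest: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, and ANY.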
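As a rough sketch of how Spark's locality levels (PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY) relate to where data sits, the function below classifies one candidate slot against one data location. The `Slot` type and `locality_level` function are hypothetical and ignore the wait timeouts Spark applies before falling back to a farther level.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slot:
    executor_id: str
    host: str
    rack: str


def locality_level(data: Optional[Slot], candidate: Slot) -> str:
    """Classify how close a candidate executor is to a partition's data."""
    if data is None:
        return "NO_PREF"        # the data has no preferred location
    if data.executor_id == candidate.executor_id:
        return "PROCESS_LOCAL"  # data is in the same executor JVM
    if data.host == candidate.host:
        return "NODE_LOCAL"     # same machine, different process
    if data.rack == candidate.rack:
        return "RACK_LOCAL"     # same rack, different machine
    return "ANY"                # anywhere else in the cluster


cached_at = Slot("exec-1", "host-a", "rack-1")
level = locality_level(cached_at, Slot("exec-2", "host-a", "rack-1"))
```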

11. Persistence in Spark: When and why to use persist, illustrated with images in the source.
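The usual reason to persist is that a dataset is reused by more than one action. The toy class below is an assumption-laden stand-in, not the RDD API: `ToyDataset` and its `compute_count` field are hypothetical. It counts how often the underlying computation runs with and without persistence.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset that several actions reuse."""

    def __init__(self, compute):
        self._compute = compute
        self._persisted = False
        self._stored = None
        self.compute_count = 0

    def persist(self):
        # Lazy, like Spark's persist: this only marks the dataset.
        self._persisted = True
        return self

    def collect(self):
        # An action: reuse the stored result if there is one, else compute.
        if self._persisted and self._stored is not None:
            return self._stored
        self.compute_count += 1
        result = self._compute()
        if self._persisted:
            self._stored = result
        return result


plain = ToyDataset(lambda: [x * x for x in range(5)])
plain.collect()
plain.collect()  # computed again from scratch

kept = ToyDataset(lambda: [x * x for x in range(5)]).persist()
kept.collect()   # computed once and stored
kept.collect()   # served from the stored copy
```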

12. Join operation optimization: Discussed with visual examples in the source.

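13. YARN task execution flow: The client submits the application to the ResourceManager, which launches the ApplicationMaster on a selected node. In cluster mode the driver runs inside the ApplicationMaster; in client mode it stays in the client process.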

14. Advantages of Spark on YARN mode: Illustrated with diagrams in the source.

15. Understanding containers: A visual explanation is provided in the source.

16. Benefits of Parquet storage format: Parquet is the preferred file format for big-data storage. It is columnar, so queries read only the columns they need and similar values compress well.

17. Relationship between partitions and blocks: Shown with a visual illustration in the source.

18. Spark application execution process: Shown as a diagrammatic overview in the source.

19. Hash vs. sort shuffle performance: Hash shuffle is faster for small data; sort shuffle scales better for large data.

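20. Sort-based shuffle drawbacks: With a very large number of map tasks it still produces many small files, and when records must also be sorted within each partition, sorting happens twice, once on the map side and once on the reduce side.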
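The file-count difference between the two shuffle styles can be put in numbers. The sketch assumes unconsolidated hash shuffle (one file per map task per reduce task) and a sort-based shuffle that writes one data file plus one index file per map task; the function names are hypothetical.

```python
def hash_shuffle_files(num_map_tasks, num_reduce_tasks):
    # Unconsolidated hash shuffle: each map task writes one file per reducer.
    return num_map_tasks * num_reduce_tasks


def sort_shuffle_files(num_map_tasks):
    # Sort-based shuffle: one data file and one index file per map task.
    return 2 * num_map_tasks


hash_files = hash_shuffle_files(1000, 1000)
sort_files = sort_shuffle_files(1000)
```

With 1,000 map tasks and 1,000 reduce tasks, that is a million files against two thousand. The sort-based count still grows linearly with the number of map tasks.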

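21. spark.storage.memoryFraction meaning and tuning: A legacy setting (default 0.6) that controls the fraction of executor heap reserved for persisted RDDs. Raise it for cache-heavy workloads and lower it when execution memory runs short or garbage collection is frequent. Since Spark 1.6, unified memory management replaces it with spark.memory.fraction and spark.memory.storageFraction.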

22. Unified Memory Management model: Illustrated with an image in the source.
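A rough worked calculation may help compare the legacy and unified models. It assumes the commonly documented defaults: spark.storage.memoryFraction 0.6 and spark.storage.safetyFraction 0.9 for the legacy model, and a 300 MB reservation, spark.memory.fraction 0.6, and spark.memory.storageFraction 0.5 for the unified model. Defaults have changed between releases, so check the configuration reference for your version. The function names are hypothetical, and the heap figure is treated as fully available, which slightly overstates what the JVM reports.

```python
RESERVED_MB = 300  # fixed reservation assumed for unified memory management


def legacy_storage_mb(heap_mb, memory_fraction=0.6, safety_fraction=0.9):
    # Legacy model: a static slice of the heap is set aside for cached RDDs.
    return heap_mb * memory_fraction * safety_fraction


def unified_regions_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    # Unified model: storage and execution share one region. The storage
    # share is a soft boundary that either side can borrow across.
    usable = (heap_mb - RESERVED_MB) * memory_fraction
    storage = usable * storage_fraction
    return {"unified": usable, "storage": storage, "execution": usable - storage}


legacy = legacy_storage_mb(4096)
unified = unified_regions_mb(4096)
```

For a 4 GB heap this gives about 2,212 MB of legacy storage memory, against a unified region of about 2,278 MB whose initial storage share is about 1,139 MB.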

23. Two types of Spark operators: Transformations, which lazily define a new RDD, and Actions, which trigger execution and return a result.
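The difference between the two operator types is easiest to see in a lazy pipeline. The class below is a toy model rather than Spark's API: `LazyDataset` is hypothetical, its `map` and `filter` only record steps, and `collect` is the one method that runs them.

```python
class LazyDataset:
    """Hypothetical lazy pipeline: transformations record, actions run."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new dataset, computes nothing.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: runs every recorded step and returns the result.
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []


def double(x):
    calls.append(x)  # record each invocation to make laziness visible
    return x * 2


pipeline = LazyDataset([1, 2, 3]).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = pipeline.collect()       # the action triggers the work
```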

24. Aggregation operators to avoid: reduceByKey, join, distinct, and repartition cause a shuffle and should be minimized.

25. Getting data from Kafka: The Receiver-based and Direct approaches are explained.

26. Ways to create RDDs: From collections, local files, HDFS, databases, NoSQL stores, S3, and streams.

27. Setting Spark parallelism: The recommendation is 2-4 partitions per core, e.g., 64-128 partitions for 32 cores.
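The 64-128 figure for 32 cores corresponds to 2 to 4 partitions per core. A small helper makes the arithmetic explicit; `recommended_partitions` is a hypothetical name, and the multipliers are a rule of thumb rather than a Spark setting.

```python
def recommended_partitions(total_cores, low_per_core=2, high_per_core=4):
    """Return the (low, high) partition range for a given core count."""
    return total_cores * low_per_core, total_cores * high_per_core


low, high = recommended_partitions(32)
```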

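28. Handling non-serializable objects: Wrap them in a Scala object (a singleton), so each executor initializes its own instance instead of receiving a serialized copy.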

29. collect operation: The driver gathers the results from all executors into an array, so the full result must fit in driver memory.

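30. Insufficient resources causing early job start: Task scheduling and executor allocation are asynchronous, so a job can start with only part of its executors and then run short of resources. To wait for full allocation, set spark.scheduler.minRegisteredResourcesRatio to 1 and raise spark.scheduler.maxRegisteredResourcesWaitingTime.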

31. map vs. flatMap: map transforms each element into exactly one output element; flatMap maps each element to zero or more elements and flattens the results.
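The difference shows up clearly when splitting lines into words. The helpers below imitate the two operators on plain Python lists; `rdd_map` and `rdd_flat_map` are hypothetical names, not PySpark functions.

```python
from itertools import chain


def rdd_map(fn, elements):
    # One output element per input element.
    return [fn(x) for x in elements]


def rdd_flat_map(fn, elements):
    # Each input yields a sequence; the sequences are flattened into one list.
    return list(chain.from_iterable(fn(x) for x in elements))


lines = ["hello spark", "hello yarn"]
mapped = rdd_map(str.split, lines)          # a list of word lists
flattened = rdd_flat_map(str.split, lines)  # a single list of words
```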

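32. Coarse-grained vs. fine-grained allocation in Mesos: Coarse-grained mode reserves resources for the whole application up front, which keeps task-launch overhead low but can leave resources idle. Fine-grained mode acquires resources per task, which improves utilization at the cost of launch overhead.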

33. Driver responsibilities: Resource acquisition, registration, DAG creation, stage generation, and task scheduling.

34. Spark technology stack components: Core, Streaming, SQL, BlinkDB, MLBase, and GraphX, each with its own use cases.

35. Worker role: Manages node resources, launches executors, and reports heartbeats.

36. MapReduce vs. Spark similarities and differences: Both are parallel computing frameworks, but Spark offers in-memory processing and richer APIs.

37. RDD mechanism: A distributed, immutable dataset that uses lineage for fault tolerance.

38. Wide vs. narrow dependencies: Narrow: each parent partition is used by at most one child partition. Wide: a parent partition feeds multiple child partitions, which requires a shuffle.
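One way to check the distinction is to count how many child partitions read each parent partition. The sketch uses the usual definition, where a dependency is narrow when every parent partition is used by at most one child partition. The dictionary encoding and `is_narrow` are hypothetical illustrations.

```python
from collections import Counter


def is_narrow(dependency):
    """dependency maps each child partition to the parent partitions it reads."""
    usage = Counter(p for parents in dependency.values() for p in parents)
    return all(count <= 1 for count in usage.values())


map_like = {0: {0}, 1: {1}, 2: {2}}              # one-to-one, as in map
coalesce_like = {0: {0, 1}, 1: {2, 3}}           # several parents, each used once
shuffle_like = {0: {0, 1, 2}, 1: {0, 1, 2}}      # every parent feeds every child

results = [is_narrow(d) for d in (map_like, coalesce_like, shuffle_like)]
```

The coalesce-style case is narrow even though a child reads several parents, because no parent partition is shared between children.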

39. cache vs. persist: For RDDs, cache() is persist() with the default MEMORY_ONLY storage level; persist() also accepts other storage levels.

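40. Cache chaining with other operators: Operators can be chained after cache(), but the chained call returns a new RDD that is not itself cached, so keep a reference to the cached RDD if it will be reused. cache is not an action; nothing is stored until an action runs.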

41. reduceByKey classification: reduceByKey is a transformation, not an action; reduce is an action.

42. Lineage efficiency: Lineage lets Spark recompute only the lost partitions instead of reloading or recomputing the whole dataset.
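Lineage-based recovery can be sketched with a toy class that remembers how each partition is derived. `ToyRDD`, `compute_log`, and the `held` dictionary are hypothetical and leave out scheduling and storage entirely. The point is only that a lost partition is rebuilt by replaying its own chain of steps.

```python
class ToyRDD:
    """Hypothetical, heavily simplified stand-in for an RDD with lineage."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # partition index -> list of records
        self.compute_log = []    # partition indexes computed so far

    def map(self, fn):
        parent = self
        # The child remembers its parent: that reference is the lineage.
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in parent.compute(i)],
        )

    def compute(self, i):
        self.compute_log.append(i)
        return self._compute(i)


source = ToyRDD(3, lambda i: [i * 10 + j for j in range(3)])
squared = source.map(lambda x: x * x)

# Results held by executors after the first run.
held = {i: squared.compute(i) for i in range(squared.num_partitions)}

# Simulate losing one partition, then rebuild just that partition.
del held[1]
held[1] = squared.compute(1)
```

Only partition 1 is replayed through the source and the map step; partitions 0 and 2 are left alone.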

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
