Understanding Data Skew in Big Data Processing and Mitigation Strategies
Data skew is a common challenge in large-scale data processing: uneven key distribution creates performance bottlenecks. This article looks at how it shows up in Hadoop, Spark, and Flink, and at practical mitigation techniques such as hotspot key redesign, map-side joins, and framework parameter tuning.
Data Skew
Data skew is hard to avoid when handling large data volumes, and it comes up in almost every big data interview. Real-world data is rarely uniform; it tends to follow the “80/20 rule”: 80% of wealth is held by 20% of people, 80% of users use only 20% of features, and 20% of users generate 80% of traffic. In short, data skew means that keys are very unevenly distributed, so some partitions hold a large share of the data while others hold very little.
Manifestations
Most data engineers have encountered data skew, which can appear at various stages of data development. For example, the reduce phase of a Hive job stalls at 99.99%, or in a Spark Streaming real-time job some executors repeatedly hit OOM (out-of-memory) errors while the other executors show low memory usage.
Hadoop
When a job’s progress stays at 99% for a long time, detailed logs or monitoring dashboards reveal:
One or more reducers are stuck.
Some containers report OOM errors.
The stuck reducers read and write far more data than normal reducers.
Tasks may be killed, and other odd behavior may appear as a side effect of the skew.
Spark
Data skew is also common in Spark. A stage finishes only when its slowest task finishes, so a single slow task drags down the whole program. Too much data in one task can also overload its executor, causing an OOM error and terminating the program.
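The effect of one oversized task on stage time can be sketched with a toy model. This is only an illustration, assuming that task time is proportional to the number of records and that all tasks run in parallel; the record counts and the processing rate are made-up numbers.

```python
def stage_duration(task_records, records_per_second):
    """A stage ends when its slowest task ends (toy model: time ~ record count)."""
    return max(n / records_per_second for n in task_records)

RATE = 10_000  # hypothetical records processed per second by one task

skewed_tasks = [1_000_000, 1_000, 1_000, 1_000]   # one hot task
total_records = sum(skewed_tasks)
even_tasks = [total_records / 4] * 4               # same data, evenly spread

skewed_seconds = stage_duration(skewed_tasks, RATE)
even_seconds = stage_duration(even_tasks, RATE)
print(f"skewed stage: {skewed_seconds:.1f}s, balanced stage: {even_seconds:.1f}s")
```

Both stages process the same total number of records; only the distribution across tasks differs.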
Flink
In jobs that use Window, GroupBy, Distinct, and other aggregations, skewed keys frequently cause back-pressure: consumption becomes very slow and some tasks hit OOM errors, and adding resources does not help.
Principles and Solutions for Data Skew
Operations such as count-distinct, group-by, and join trigger a shuffle. During the shuffle, all records with the same key are sent to the same node, so a key with a disproportionate number of records turns one or a few nodes into hotspots.
Consider a simple scenario: in an orders table, the regions Beijing and Shanghai have order counts several orders of magnitude higher than other regions, leading to data hotspots during aggregation.
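The orders scenario can be simulated in a few lines. This is a minimal sketch, not how any particular framework implements its partitioner: `zlib.crc32` stands in for the framework's hash function, and the region names and order counts are hypothetical.

```python
import zlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (crc32 is a stand-in for the real hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical order counts per region; Beijing and Shanghai dominate.
orders_per_region = {
    "Beijing": 1_000_000,
    "Shanghai": 800_000,
    "Chengdu": 1_200,
    "Wuhan": 900,
    "Xi'an": 700,
    "Harbin": 300,
}

def partition_loads(counts, num_partitions):
    """Count how many records land on each partition after shuffling by key."""
    loads = Counter({p: 0 for p in range(num_partitions)})
    for region, n in counts.items():
        loads[partition_for(region, num_partitions)] += n
    return loads

loads = partition_loads(orders_per_region, num_partitions=4)
total = sum(loads.values())
heaviest = max(loads.values())
print(dict(loads))
print(f"heaviest partition holds {heaviest / total:.1%} of all records")
```

Because every Beijing order hashes to the same partition, that partition holds at least one million records no matter how many partitions are configured.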
Several approaches to resolve data skew:
Business level: Avoid designing hotspot keys or disperse them, e.g., split Beijing and Shanghai into sub‑regions before aggregation.
Technical level: When hotspots do appear, change the processing approach so that the hot key is not aggregated directly in a single step, for example by using framework capabilities such as map-side join.
Parameter level: Hadoop, Spark, and Flink all provide many tunable parameters.
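One common way to disperse a hot key is to add a random suffix (a "salt"), aggregate on the salted key first, and then merge the partial results. The sketch below illustrates that two-stage idea in plain Python; `SALT_BUCKETS`, the `#` separator, and the sample data are assumptions chosen for the example, not values from any framework.

```python
import random
from collections import Counter

SALT_BUCKETS = 8  # hypothetical number of sub-keys each region is split into

def salted_key(region: str, rng: random.Random) -> str:
    """Append a random bucket number so one hot key becomes several sub-keys."""
    return f"{region}#{rng.randrange(SALT_BUCKETS)}"

def two_stage_count(regions, seed=0):
    rng = random.Random(seed)
    # Stage 1: partial counts per salted key; the hot key is spread over buckets.
    partial = Counter(salted_key(r, rng) for r in regions)
    # Stage 2: strip the salt and merge the (much smaller) partial counts.
    final = Counter()
    for key, n in partial.items():
        region, _, _ = key.rpartition("#")
        final[region] += n
    return partial, final

regions = ["Beijing"] * 10_000 + ["Shanghai"] * 8_000 + ["Chengdu"] * 12
partial, final = two_stage_count(regions)
print("largest stage-1 group:", max(partial.values()))
print("final counts:", dict(final))
```

The final counts match a direct group-by, but no single stage-1 group has to hold all of the Beijing records. This only works for aggregations whose partial results can be merged, such as counts and sums.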
Hadoop/Hive Parameters
Use map-side join.
Set hive.groupby.skewindata=true for skewed group-by or distinct queries.
Merge small files.
Compress files.
Spark Parameters
Use map join instead of reduce join.
Increase shuffle parallelism.
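The idea behind a map join (in Spark SQL usually called a broadcast join) can be sketched without any framework: the small table is copied to every task as an in-memory lookup, so the large table is joined where it already sits and is never shuffled by the join key. The schema, codes, and sample rows below are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Order = Tuple[int, str]  # (order_id, region_code); hypothetical schema

def map_side_join(
    orders: Iterable[Order], region_names: Dict[str, str]
) -> Iterator[Tuple[int, str, str]]:
    """Inner-join one split of the large table against the small in-memory table."""
    for order_id, region_code in orders:
        region_name = region_names.get(region_code)
        if region_name is not None:  # drop rows with no match (inner join)
            yield order_id, region_code, region_name

# Small dimension table, assumed to fit in memory on every task.
region_names = {"BJ": "Beijing", "SH": "Shanghai", "CD": "Chengdu"}

# One split of the large orders table; "BJ" is the hot key.
order_split = [(1, "BJ"), (2, "BJ"), (3, "SH"), (4, "XX"), (5, "BJ")]

joined = list(map_side_join(order_split, region_names))
print(joined)
```

Since each split is joined independently, the hot key never has to be collected on a single reducer. The approach only applies when one side of the join is small enough to hold in memory.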
Flink Parameters
Tune MiniBatch settings so that records are buffered and aggregated in small batches.
Adjust parallelism settings.
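The MiniBatch idea is to buffer incoming records and pre-aggregate them locally, so that a hot key sends one partial result per batch downstream instead of one message per record. The class below is a simplified sketch of that idea only: `MiniBatchCounter` and its fields are hypothetical names, and it flushes purely on batch size, whereas a real engine would also flush on a latency timer.

```python
from collections import Counter

class MiniBatchCounter:
    """Buffers records and emits per-key partial counts once per batch."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.buffer = Counter()   # key -> count within the current batch
        self.buffered = 0         # number of records in the current batch
        self.emitted = []         # (key, partial_count) messages sent downstream

    def process(self, key: str) -> None:
        self.buffer[key] += 1
        self.buffered += 1
        if self.buffered >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send one partial count per key and start a new batch."""
        self.emitted.extend(self.buffer.items())
        self.buffer.clear()
        self.buffered = 0

stream = ["Beijing"] * 1_000 + ["Harbin"] * 10
operator = MiniBatchCounter(batch_size=100)
for key in stream:
    operator.process(key)
operator.flush()  # emit whatever is left in the last, partial batch

totals = Counter()
for key, partial_count in operator.emitted:
    totals[key] += partial_count
print(len(stream), "records ->", len(operator.emitted), "downstream messages")
```

The downstream aggregation sees far fewer messages for the hot key, at the cost of some added latency while records wait in the buffer.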
Beyond parameter tuning, the remaining solutions usually come down to redesigning business keys to avoid hotspots. Handling data skew is an ongoing process, and the ideas presented here are meant as a starting point.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
