Understanding Data Skew in Big Data Processing and Mitigation Strategies

Data skew, a common challenge in large-scale data processing where uneven key distribution leads to performance bottlenecks, is explored with examples from Hadoop, Spark, and Flink, alongside practical mitigation techniques such as hotspot key redesign, map‑side joins, and tuning framework parameters.

Big Data Technology & Architecture

Nov 17, 2019

Data Skew

Data skew is almost unavoidable when processing large volumes of data, and it is a staple interview topic. Real‑world data is rarely uniform; it tends to follow the “80/20 rule”: 80% of wealth is held by 20% of people, 80% of users use only 20% of features, and 20% of users generate 80% of traffic. In short, data skew means that keys are distributed very unevenly, so some partitions hold a large amount of data while others hold very little.

Manifestations

Most data engineers have run into data skew, which can appear at any stage of data development. For example, the reduce phase of a Hive job stalls at 99.99%, or in a Spark Streaming job some executors repeatedly hit out‑of‑memory (OOM) errors while the others use very little memory.

Hadoop

When a job’s progress stays at 99% for a long time, detailed logs or monitoring dashboards reveal:

One or more reducers are stuck.

Some containers report OOM errors.

The read/write volume of the stuck reducers is extremely large, far exceeding that of normal reducers.

Tasks may be killed, and other unexpected behavior may follow.

Spark

Data skew is also common in Spark. A stage finishes only when its slowest task finishes, so a single slow task drags down the whole program. Too much data in one task can also overload its executor, causing OOM and terminating the program.
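The "slowest task" effect can be put into numbers. The sketch below is only an illustration: the task durations are made up, and it assumes every task of the stage runs at the same time, one task per partition.

```python
from statistics import mean

def stage_summary(task_seconds):
    """Summarize a stage whose tasks all run in parallel, one task per partition."""
    slowest = max(task_seconds)
    typical = mean(task_seconds)
    return {
        "stage_seconds": slowest,         # the stage ends only when the last task ends
        "mean_task_seconds": typical,
        "skew_ratio": slowest / typical,  # far above 1 hints at a skewed partition
    }

# Hypothetical durations: nine balanced tasks and one straggler.
durations = [10] * 9 + [300]
summary = stage_summary(durations)
print(summary)
```

Here nine tasks finish in 10 seconds, yet the stage takes 300 seconds. Comparing the slowest task with the mean is a quick way to spot skew in a monitoring page.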

Flink

When Window, GroupBy, Distinct, and other aggregations run on skewed keys, back‑pressure occurs frequently, consumption slows sharply, and some tasks hit OOM. Adding resources does not help, because the hot keys are still routed to the same few subtasks.

Principles and Solutions for Data Skew

Operations such as count‑distinct, group‑by, and join trigger a shuffle. During the shuffle, all records with the same key are sent to the same task, so a handful of very frequent keys concentrates most of the data on one or a few nodes, creating a hotspot.

Consider a simple scenario: in an orders table, the regions Beijing and Shanghai have order counts several orders of magnitude higher than other regions, leading to data hotspots during aggregation.
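The orders scenario can be simulated in a few lines. This is a minimal sketch, not any framework's actual partitioner: the order counts and the extra region names are invented, and `zlib.crc32` stands in for the hash function a shuffle would use.

```python
import zlib
from collections import Counter

def partition_for(key, num_partitions):
    """Deterministic hash partitioner: equal keys always land in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical order counts per region; two regions dominate.
orders_per_region = {
    "Beijing": 1_000_000,
    "Shanghai": 800_000,
    "Lhasa": 900,
    "Xining": 700,
    "Yinchuan": 500,
    "Haikou": 400,
}

def records_per_partition(counts, num_partitions):
    """Return how many records each partition receives after the shuffle."""
    load = Counter()
    for region, n in counts.items():
        load[partition_for(region, num_partitions)] += n
    return [load[p] for p in range(num_partitions)]

load = records_per_partition(orders_per_region, num_partitions=4)
print(load)
```

Whichever partition receives Beijing holds at least a million records, while a partition that only receives small regions holds a few hundred. That imbalance is the hotspot.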

Several approaches to resolve data skew:

Business level: Avoid designing hotspot keys or disperse them, e.g., split Beijing and Shanghai into sub‑regions before aggregation.

Technical level: When hotspots appear, restructure the job so that hot keys are not aggregated or joined directly in a single shuffle, using framework capabilities such as map‑side join.

Parameter level: Hadoop, Spark, and Flink all provide many tunable parameters.
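The first two ideas can be sketched together as a two‑stage aggregation. The example is an assumption‑laden illustration: a random "salt" stands in for a real sub‑region such as a district, and `salted_count`, `hot_keys`, and `num_salts` are hypothetical names rather than part of any framework.

```python
import random
from collections import Counter

def salted_count(records, hot_keys, num_salts, seed=0):
    """Count records per key in two stages, spreading hot keys over num_salts sub-keys."""
    rng = random.Random(seed)

    # Stage 1: aggregate on (key, salt). A hot key is split across num_salts groups,
    # so no single group has to process all of its records.
    partial = Counter()
    for key in records:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        partial[(key, salt)] += 1

    # Stage 2: drop the salt and merge the partial counts, which are few and small.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final

records = ["Beijing"] * 9000 + ["Shanghai"] * 6000 + ["Lhasa"] * 30
partial, final = salted_count(records, hot_keys={"Beijing", "Shanghai"}, num_salts=8)
print(final)
```

The final counts match a direct aggregation, but the largest group in the first stage is only a fraction of the hot key's total. This works for aggregations that can be merged from partial results, such as counts and sums.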

Hadoop/Hive Parameters

Use map‑side join.

Set hive.groupby.skewindata=true for group‑by or distinct.

Merge small files.

Compress files.
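To make the map‑side join item concrete, here is a framework‑free sketch of the idea: the small table is copied in full to every mapper, so the large table never has to be shuffled by the join key. The table contents and the `map_side_join` helper are hypothetical. In Hive the automatic conversion is, as far as I know, governed by `hive.auto.convert.join`; check the behavior for your version.

```python
def map_side_join(big_partitions, small_table):
    """Inner-join each partition of the big table against an in-memory copy of the small one."""
    lookup = dict(small_table)  # the "broadcast" copy that every mapper would hold
    joined = []
    for partition in big_partitions:      # each partition is processed where it already lives
        for order_id, region in partition:
            if region in lookup:          # inner join: rows without a match are dropped
                joined.append((order_id, region, lookup[region]))
    return joined

# Hypothetical data: orders are already split into partitions, and Beijing dominates.
orders = [
    [(1, "Beijing"), (2, "Beijing"), (3, "Lhasa")],
    [(4, "Beijing"), (5, "Shanghai"), (6, "Beijing")],
]
region_managers = [("Beijing", "Li"), ("Shanghai", "Chen"), ("Lhasa", "Tenzin")]

rows = map_side_join(orders, region_managers)
print(rows)
```

Because no record moves between partitions, the Beijing rows stay spread out instead of converging on one reducer. The approach only applies when one side of the join is small enough to fit in memory.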

Spark Parameters

Use a map‑side (broadcast) join instead of a reduce‑side join.

Increase shuffle parallelism.
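It helps to know what increasing shuffle parallelism can and cannot fix. The sketch below uses an invented workload and a simple `key % num_partitions` partitioner for integer keys. In Spark SQL the number of shuffle partitions is set with `spark.sql.shuffle.partitions`; the values 8 and 64 here are arbitrary.

```python
def max_partition_load(records_per_key, num_partitions):
    """Largest partition size when integer keys are assigned by key % num_partitions."""
    load = [0] * num_partitions
    for key, count in records_per_key.items():
        load[key % num_partitions] += count
    return max(load)

# Hypothetical workload: 200 keys with 100 records each.
even = {key: 100 for key in range(200)}

# Same workload, except key 0 is a hot key with 100,000 records.
skewed = dict(even)
skewed[0] = 100_000

for n in (8, 64):
    print(n, max_partition_load(even, n), max_partition_load(skewed, n))
# prints:
# 8 2500 102400
# 64 400 100300
```

More partitions shrink the largest partition when many moderate keys collide (2500 down to 400), but a single hot key still lands in one partition no matter how many there are. In that case the hot key itself has to be split or joined map‑side.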

Flink Parameters

MiniBatch settings.

Parallelism settings.
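The idea behind MiniBatch can be sketched without Flink: buffer incoming records, pre‑aggregate them per key in memory, and update state once per key per batch rather than once per record. The class below is a simplified stand‑in, not Flink's implementation. It flushes on batch size only, whereas Flink SQL also flushes on a latency bound; to my knowledge the relevant options are `table.exec.mini-batch.enabled`, `table.exec.mini-batch.allow-latency`, and `table.exec.mini-batch.size`.

```python
from collections import Counter

class MiniBatchCounter:
    """Count records per key, touching state once per key per batch instead of once per record."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = Counter()   # in-memory pre-aggregation for the current batch
        self.buffered = 0
        self.state = Counter()    # stands in for the operator's keyed state
        self.state_updates = 0    # how many times state was written

    def add(self, key):
        self.buffer[key] += 1
        self.buffered += 1
        if self.buffered >= self.batch_size:
            self.flush()

    def flush(self):
        for key, count in self.buffer.items():
            self.state[key] += count
            self.state_updates += 1
        self.buffer.clear()
        self.buffered = 0

counter = MiniBatchCounter(batch_size=1000)
for i in range(10_000):
    counter.add("hot" if i % 10 else "cold")
counter.flush()  # flush whatever is left in the buffer
print(counter.state, counter.state_updates)
```

Ten thousand records result in only 20 state updates, so the cost of a hot key is dominated by cheap in‑memory additions. The trade‑off is added latency, since results are emitted per batch.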

Other solutions often involve redesigning business keys to avoid hotspots. Handling data skew is an ongoing process, and the ideas presented here aim to provide helpful guidance.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
