Big Data 5 min read

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.

Data Thinking Notes

Nov 22, 2022

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

1. Phenomenon

Using Sqoop to sync a table from RDS to Hive normally takes less than 2.5 hours, but occasionally the sync exceeds 8 hours, sometimes over 12 hours, affecting downstream batch tasks.

2. Investigation Process

2.1 Check Sqoop sync task logs

MapReduce task 0%‑88% took 1 hour 9 minutes, 88%‑100% took 9 hours 20 minutes; the last map stage was abnormal, prompting a check of the RDS server.

2.2 Check RDS server

Slow query analysis:

Eight slow queries were generated; seven lasted about 40 minutes, one exceeded 10 hours.

Data skew: among eight parallel queries, only three returned rows (30 k, 38 rows, 9.9 billion rows); one query returned 99 % of the data, indicating severe skew.

Root cause analysis:

The split column incr_id is neither a primary key nor indexed.

Sqoop’s splitter converts Varchar columns to BigDecimal before splitting, which can produce garbled values and cause skew.

Sqoop documentation shows that for String/Varchar columns the splitter first converts the value to BigDecimal, splits, then converts back to String, which can introduce garbled data.

Source code inspection confirms IntegerSplitter for integer columns and TextSplitter (via BigDecimal) for Varchar columns, reproducing the garbled issue.

Splitting by incr_id leads to data skew.

3. Solution

3.1 Add a BigInt auto‑increment column in the RDS table

Adding an auto‑increment column as primary key or index and using it as the split column would solve the problem, but it requires significant changes, so it was not adopted.

3.2 Find a more evenly distributed column in the existing table

Using the value column (prefix 00‑FF) as the split column, setting parallelism to 6 or 7 yields more balanced splits. Tests show that setting parallelism to 7 produces evenly distributed ranges (0‑3, 3‑6, 6‑9, 9‑&, …).

Thus the Sqoop command was adjusted to --split-by value -m 7.

4. Result

After the change, the sync time from August 8‑11 dropped to 1‑2 hours, and the data split was uniform.

5. Optimization Recommendations

For large tables, add a BigInt auto‑increment column as primary key or index to facilitate future data synchronization.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Performance Tuning Hive Data Skew RDS Sqoop

Written by

Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.