Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It
A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.
1. Phenomenon
Using Sqoop to sync a table from RDS to Hive normally takes less than 2.5 hours, but occasionally the sync exceeds 8 hours, sometimes over 12 hours, affecting downstream batch tasks.
2. Investigation Process
2.1 Check Sqoop sync task logs
MapReduce task 0%‑88% took 1 hour 9 minutes, 88%‑100% took 9 hours 20 minutes; the last map stage was abnormal, prompting a check of the RDS server.
2.2 Check RDS server
Slow query analysis:
Eight slow queries were generated; seven lasted about 40 minutes, one exceeded 10 hours.
Data skew: among eight parallel queries, only three returned rows (30 k, 38 rows, 9.9 billion rows); one query returned 99 % of the data, indicating severe skew.
Root cause analysis:
The split column incr_id is neither a primary key nor indexed.
Sqoop’s splitter converts Varchar columns to BigDecimal before splitting, which can produce garbled values and cause skew.
Sqoop documentation shows that for String/Varchar columns the splitter first converts the value to BigDecimal, splits, then converts back to String, which can introduce garbled data.
Source code inspection confirms IntegerSplitter for integer columns and TextSplitter (via BigDecimal) for Varchar columns, reproducing the garbled issue.
Splitting by incr_id leads to data skew.
3. Solution
3.1 Add a BigInt auto‑increment column in the RDS table
Adding an auto‑increment column as primary key or index and using it as the split column would solve the problem, but it requires significant changes, so it was not adopted.
3.2 Find a more evenly distributed column in the existing table
Using the value column (prefix 00‑FF) as the split column, setting parallelism to 6 or 7 yields more balanced splits. Tests show that setting parallelism to 7 produces evenly distributed ranges (0‑3, 3‑6, 6‑9, 9‑&, …).
Thus the Sqoop command was adjusted to --split-by value -m 7.
4. Result
After the change, the sync time from August 8‑11 dropped to 1‑2 hours, and the data split was uniform.
5. Optimization Recommendations
For large tables, add a BigInt auto‑increment column as primary key or index to facilitate future data synchronization.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
