Big Data 11 min read

How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop

This article details how a bank transformed its retail system data pipeline from a monolithic DB2 setup to a distributed Oracle‑Hadoop architecture, evaluated five extraction tools, selected Sqoop direct mode, and implemented customizations to achieve over 50% performance gains and reliable incremental data capture.

dbaplus Community
dbaplus Community
dbaplus Community
How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop

1. Construction Background

The bank’s retail business system is being refactored into a distributed, micro‑service architecture. Legacy DB2 databases were split into multiple Oracle instances, each synchronized in real time via ADG technology. Data is collected using Hadoop’s Sqoop component, which pulls full and incremental data from Oracle into HDFS for downstream processing.

2. Challenges of the Distributed Architecture

Logical tables are sharded across many physical tables; downstream consumers need aggregated views.

Incremental extraction relies heavily on correctly defined rules; errors or schema changes can cause missed updates.

Real‑time master‑standby synchronization must still provide a consistent snapshot for downstream systems.

3. Tool Evaluation and Selection

Before the migration, data was extracted via JDBC queries, which became a performance bottleneck for large tables. Five common extraction tools were benchmarked. The results showed that Hadoop’s Sqoop component (v1.4.6) delivered the highest throughput, especially when using its direct (fast) mode, which bypasses Oracle indexes and reads data blocks directly.

Sqoop’s direct mode also writes output directly to HDFS, eliminating the extra upload step and simplifying downstream preprocessing.

4. Technical Architecture

The Sqoop ingestion module runs on a Hadoop cluster, pulling data from the Oracle standby databases of each split subsystem, compressing the files, and storing them in HDFS for downstream applications.

5. Key Technical Details

5.1 Sqoop Direct Mode

Sqoop v1.4.6 provides a direct import mode for Oracle that uses Oracle’s internal functions to locate data blocks and read them sequentially, achieving high throughput. This mode does not support WHERE clauses or column selection because it reads the entire table.

5.2 Data Consistency Guarantees

When tables are being updated during extraction, Sqoop can employ Oracle’s flashback query feature to obtain a consistent snapshot. Proper tuning of Oracle parameters such as undo_retention is required to avoid ORA‑01555 errors.

5.3 Permissions Required for Direct Mode

A dedicated Sqoop user (e.g., B) must be created with a password that excludes special characters. The user needs the following privileges on the source schema (e.g., A):

CREATE SESSION

ALTER SESSION

SELECT on v_$instance, dba_tables, dba_tab_columns, dba_objects, dba_extents, dba_tab_subpartitions, dba_segments, dba_constraints, v_$database, v_$parameter GRANT SELECT on A. Tab_A to B (or SELECT ANY TABLE)

GRANT FLASHBACK on A. Tab_A to B if consistency is required

5.4 Custom Delimiter Scheme

Sqoop’s default column delimiter is a single character and the row delimiter is a newline, which can cause issues when data contains these characters. To customize delimiters, generate the Java class for a table with sqoop codegen, modify the toString() method (fieldDelim at line 851, row delimiter at line 1220), re‑compile, and use the resulting JAR in the import command.

5.5 Import Command with Custom JAR

After building the custom JAR, the import is executed by specifying -Dsqoop.import.jar=<custom.jar> and the fully‑qualified class name.

5.6 Incremental Data Capture

Because direct mode is highly efficient for full loads, the team adopted a “full‑load‑then‑compare” strategy for incremental updates: each day a full snapshot of a table is imported, then primary‑key joins are used to detect row‑level changes, achieving 100% accurate deltas even when source tables lack update timestamps.

6. Performance Results

Benchmarking on several large tables showed an average extraction speed improvement of more than 50% after switching from the original DB2 source to the Oracle‑Sqoop pipeline.

7. Additional Pre‑Processing Capabilities

Since Sqoop’s output resides in HDFS, the bank can leverage the Hadoop ecosystem for further processing, such as merging sharded tables, filtering sensitive fields, decoding encoded columns, and splitting data for downstream consumption.

8. Conclusion

The practical experience confirms that Sqoop direct mode dramatically improves data ingestion efficiency for relational databases within a Hadoop ecosystem, eliminates the performance bottleneck of JDBC‑based extraction, and provides a seamless bridge to downstream data processing, thereby strengthening the bank’s data platform foundation.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataOracleHadoopdata ingestionSqoopDirect ModeIncremental Capture
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.