How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop
This article details how a bank transformed its retail system data pipeline from a monolithic DB2 setup to a distributed Oracle‑Hadoop architecture, evaluated five extraction tools, selected Sqoop direct mode, and implemented customizations to achieve over 50% performance gains and reliable incremental data capture.
1. Construction Background
The bank’s retail business system is being refactored into a distributed, micro‑service architecture. Legacy DB2 databases were split into multiple Oracle instances, each synchronized in real time via ADG technology. Data is collected using Hadoop’s Sqoop component, which pulls full and incremental data from Oracle into HDFS for downstream processing.
2. Challenges of the Distributed Architecture
Logical tables are sharded across many physical tables; downstream consumers need aggregated views.
Incremental extraction relies heavily on correctly defined rules; errors or schema changes can cause missed updates.
Real‑time master‑standby synchronization must still provide a consistent snapshot for downstream systems.
3. Tool Evaluation and Selection
Before the migration, data was extracted via JDBC queries, which became a performance bottleneck for large tables. Five common extraction tools were benchmarked. The results showed that Hadoop’s Sqoop component (v1.4.6) delivered the highest throughput, especially when using its direct (fast) mode, which bypasses Oracle indexes and reads data blocks directly.
Sqoop’s direct mode also writes output directly to HDFS, eliminating the extra upload step and simplifying downstream preprocessing.
4. Technical Architecture
The Sqoop ingestion module runs on a Hadoop cluster, pulling data from the Oracle standby databases of each split subsystem, compressing the files, and storing them in HDFS for downstream applications.
5. Key Technical Details
5.1 Sqoop Direct Mode
Sqoop v1.4.6 provides a direct import mode for Oracle that uses Oracle’s internal functions to locate data blocks and read them sequentially, achieving high throughput. This mode does not support WHERE clauses or column selection because it reads the entire table.
5.2 Data Consistency Guarantees
When tables are being updated during extraction, Sqoop can employ Oracle’s flashback query feature to obtain a consistent snapshot. Proper tuning of Oracle parameters such as undo_retention is required to avoid ORA‑01555 errors.
5.3 Permissions Required for Direct Mode
A dedicated Sqoop user (e.g., B) must be created with a password that excludes special characters. The user needs the following privileges on the source schema (e.g., A):
CREATE SESSION
ALTER SESSION
SELECT on v_$instance, dba_tables, dba_tab_columns, dba_objects, dba_extents, dba_tab_subpartitions, dba_segments, dba_constraints, v_$database, v_$parameter GRANT SELECT on A. Tab_A to B (or SELECT ANY TABLE)
GRANT FLASHBACK on A. Tab_A to B if consistency is required
5.4 Custom Delimiter Scheme
Sqoop’s default column delimiter is a single character and the row delimiter is a newline, which can cause issues when data contains these characters. To customize delimiters, generate the Java class for a table with sqoop codegen, modify the toString() method (fieldDelim at line 851, row delimiter at line 1220), re‑compile, and use the resulting JAR in the import command.
5.5 Import Command with Custom JAR
After building the custom JAR, the import is executed by specifying -Dsqoop.import.jar=<custom.jar> and the fully‑qualified class name.
5.6 Incremental Data Capture
Because direct mode is highly efficient for full loads, the team adopted a “full‑load‑then‑compare” strategy for incremental updates: each day a full snapshot of a table is imported, then primary‑key joins are used to detect row‑level changes, achieving 100% accurate deltas even when source tables lack update timestamps.
6. Performance Results
Benchmarking on several large tables showed an average extraction speed improvement of more than 50% after switching from the original DB2 source to the Oracle‑Sqoop pipeline.
7. Additional Pre‑Processing Capabilities
Since Sqoop’s output resides in HDFS, the bank can leverage the Hadoop ecosystem for further processing, such as merging sharded tables, filtering sensitive fields, decoding encoded columns, and splitting data for downstream consumption.
8. Conclusion
The practical experience confirms that Sqoop direct mode dramatically improves data ingestion efficiency for relational databases within a Hadoop ecosystem, eliminates the performance bottleneck of JDBC‑based extraction, and provides a seamless bridge to downstream data processing, thereby strengthening the bank’s data platform foundation.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
