Big Data 22 min read

How DWS Uses Log‑Based Architecture for Real‑Time Data Integration

This article explains the design and implementation of the DWS platform, detailing its log‑driven architecture with Dbus, Wormhole, and Swifts, the technical choices behind real‑time data extraction, transformation, and delivery, and real‑world use cases in finance.

dbaplus Community

Dec 18, 2016

How DWS Uses Log‑Based Architecture for Real‑Time Data Integration

Background

DWS was created to solve the inconsistency and latency problems of traditional data extraction methods in an internet‑finance environment. Conventional approaches—periodic backup extraction, batch Hive loading, or trigger‑based CDC—either produce stale data, cause conflicts, or add heavy load to source systems.

Inspired by LinkedIn’s log‑based design, DWS treats the incremental log as the single source of truth, publishing it to Kafka so downstream consumers can subscribe in real time.

Overall Architecture

DWS consists of three sub‑projects:

Dbus (Data Bus) : extracts data from source systems, converts it to a self‑describing JSON schema (UMS), and writes it to Kafka.

Wormhole (Data Exchange Platform) : reads UMS from Kafka and writes it to target stores such as HDFS, JDBC databases, Elasticsearch, HBase, etc.

Swifts (Real‑Time Computing Platform) : consumes UMS from Kafka, performs real‑time calculations (filter, join, window aggregation), and writes results back to Kafka.

Dbus Solution

Dbus captures MySQL binlog changes using Alibaba’s open‑source Canal component, which mimics a MySQL slave to receive binary logs. It supports Row, Statement, and Mixed binlog formats, with Row mode preferred for reliable full‑log extraction.

Key steps:

Canal Server parses binlog (protobuf) and streams it to a Storm job.

The Storm job converts the data to UMS JSON and publishes it to Kafka.

Schema changes are captured and versioned.

Configuration and high‑availability metadata are stored in Zookeeper.

For full‑load scenarios, a separate Storm job performs parallel JDBC extraction from backup databases, splitting data into shards, storing shard metadata in Kafka, and merging small Parquet files on HDFS into larger ones.

Both incremental and full loads share a unified message format (UMS) that includes fields such as _ums_op_ (I/U/D), _ums_ts_ (timestamp), and _ums_id_ (unique 64‑bit identifier derived from binlog file number and offset).

Wormhole Solution

Wormhole decouples data delivery by consuming UMS from Kafka and persisting it to various sinks. It uses Spark Streaming for scalability and supports:

HDFS (Parquet files) for long‑term storage and replay.

JDBC databases and HBase for transactional use cases.

Elasticsearch for search services.

Each logical flow is defined by a namespace and processed by a shared Spark Streaming job. Spark was chosen for its native support of heterogeneous storage, higher throughput than Storm, and unified APIs (Spark SQL, Spark Streaming).

Wormhole ensures idempotent writes by comparing _ums_id_ values; larger IDs overwrite older rows. For HBase, the _ums_id_ is used as the version number, leveraging HBase’s multi‑version capability.

Use Cases

Real‑Time Marketing : Customer credit information entered via web or mobile is instantly captured, scored, and pushed to a CRM system, enabling sales teams to contact prospects within minutes.

Real‑Time Reporting : Multiple business systems feed data into DWS, which provides up‑to‑date dashboards for operations, eliminating the previous T+1 reporting delay.

Conclusion

DWS combines mainstream real‑time big‑data technologies (Kafka, Canal, Storm, Spark) to deliver a highly available, low‑latency, and fault‑tolerant data integration platform. It supports heterogeneous sources and targets, multiple data formats, and a unified message schema, enabling scenarios such as real‑time sync, computation, monitoring, reporting, and decision making.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Real-time Streaming Kafka Canal Spark CDC log-based data pipeline

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.