
Implementation Principles and Architecture of DBus Data Sources (RDBMS and Log Types)

The article explains how DBus ingests data from relational databases and log sources by detailing its extractor, incremental conversion, and full‑pull modules, the use of Canal and Kafka, rule‑based log structuring, the unified UMS message format, and heartbeat monitoring for reliability.

Architecture Digest

DBus supports two categories of data sources: RDBMS data sources and log‑type data sources.

1. Implementation of RDBMS Data Sources

Using MySQL as an example, the ingestion pipeline is divided into three parts: the log extraction module, the incremental conversion module, and the full‑pull module.

1.1 Log Extraction Module (Extractor)

The MySQL log extractor consists of two components:

Canal server – extracts incremental logs from MySQL.

MySQL‑extractor Storm program – streams the incremental logs to Kafka, filters unnecessary tables, and ensures high availability.

MySQL replication relies on the binlog, which can operate in Row, Statement, or Mixed mode; Row mode is preferred in production because it records complete row images, allowing the full change to be read from the log. The typical deployment uses two masters behind a VIP, one slave, and an optional backup node, with binlog reading performed against the slave to minimize impact on the source.
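The Row-mode requirement translates into a small amount of source-side configuration. A sketch of the relevant `my.cnf` options (option names are standard MySQL; the values are illustrative):

```ini
# my.cnf on the MySQL source (illustrative values)
[mysqld]
server-id        = 1          # unique id per replication node
log-bin          = mysql-bin  # enable binary logging
binlog_format    = ROW        # full row images, needed for Canal-style extraction
expire_logs_days = 7          # retain enough binlog history for re-reads
```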

DBus leverages Alibaba’s open‑source Canal to read MySQL binlogs, avoiding duplicate development and benefiting from Canal upgrades. Canal server high availability is achieved via Zookeeper, and the extractor Storm program also runs in an HA configuration.

1.2 Incremental Conversion Module (Stream)

This module processes incremental data according to the source format and performs several functions:

Dispatcher – routes logs to different Kafka topics based on schema, providing data isolation and load distribution.

Appender – converts Canal protobuf data to the unified UMS format, generates unique identifiers (ums_id, ums_ts), captures metadata changes, performs real‑time data masking, handles full‑pull pause/resume, and emits monitoring heartbeat events.
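The dispatcher’s schema-based routing can be sketched as a pure function from datasource and schema to a topic name. The naming convention below is an assumption for illustration, not DBus’s actual topic format:

```python
def route_topic(datasource: str, schema: str) -> str:
    """Map an incremental event to a per-schema Kafka topic.

    Giving each schema its own topic provides data isolation and spreads
    load across downstream consumers (topic naming is illustrative).
    """
    return f"dbus.{datasource}.{schema}"


# Events from different schemas land on different topics.
orders_topic = route_topic("mysql01", "orders")
users_topic = route_topic("mysql01", "users")
```

Because the function is deterministic, re-dispatching the same binlog event after a Storm retry routes it to the same topic, which keeps per-schema ordering intact.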

1.3 Full‑Pull Module (FullPuller)

Used for initial loads or full re‑loads, the process follows a Sqoop‑like approach and consists of two steps:

Data sharding – queries max, min, count, calculates shard count, and stores shard info in a split topic. Primary‑key based sharding is required for efficient MySQL InnoDB reads.

Actual pulling – each shard is pulled concurrently from the slave, with progress written to Zookeeper for monitoring. Concurrency is limited to 6‑8 threads and is usually scheduled during low‑traffic periods.
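The sharding step above can be sketched as follows; `plan_shards` is a hypothetical helper that splits the primary-key range returned by the max/min query into half-open shards, each of which one puller thread scans independently:

```python
def plan_shards(min_id: int, max_id: int, shard_size: int) -> list[tuple[int, int]]:
    """Split the primary-key range [min_id, max_id] into half-open shards.

    Mirrors the Sqoop-like split step: each (lo, hi) pair describes a
    key range `lo <= pk < hi` that one concurrent puller can read with
    an efficient InnoDB primary-key range scan.
    """
    shards = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + shard_size, max_id + 1)
        shards.append((lo, hi))
        lo = hi
    return shards
```

The resulting shard descriptors would then be written to the split topic, and a pool of 6–8 worker threads consumes them, recording per-shard progress in Zookeeper.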

1.4 Consistency Between Full and Incremental Pulls

Kafka uses a single partition per topic to preserve order; DBus guarantees at‑least‑once delivery with Storm’s retry mechanism. Unique identifiers (ums_id, ums_uid) are generated from Zookeeper or binlog file number plus offset, ensuring physical uniqueness and ordering. The timestamp field ums_ts records the event time for snapshot queries.
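One way to derive a physically ordered identifier from the binlog position, as described above, is to pack the file number and byte offset into a single sortable integer. The exact bit split below is an illustrative assumption, not DBus’s actual encoding:

```python
def make_ums_id(binlog_file_no: int, offset: int) -> int:
    """Pack binlog file number and byte offset into one sortable 64-bit id.

    The file number occupies the high bits and the offset the low bits,
    so comparing two ids reproduces physical log order. Retried events
    regenerate the same id, so downstream consumers can deduplicate
    under at-least-once delivery.
    """
    assert 0 <= offset < (1 << 44), "offset must fit in the low 44 bits"
    return (binlog_file_no << 44) | offset
```

A consumer that keeps the largest `ums_id` it has applied can safely skip any replayed event with a smaller or equal id.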

2. Implementation of Log‑Type Data Sources

Common log collection and analysis tools such as Logstash, Filebeat, Flume, Fluentd, Chukwa, Scribe, and Splunk are supported. Structured logs are extracted using configurable regular‑expression templates, while business‑specific fields (the "message" part) are handled via a visual rule‑operator interface.

The DBus log‑sync workflow includes:

Log collection using industry‑standard agents (e.g., Logstash, Flume, Filebeat) that push raw logs to Kafka.

A visual UI for configuring rule operators that transform raw logs into structured tables with defined schemas.

Rule operators (filter, split, merge, replace, etc.) can be chained to achieve complex transformations; they are orthogonal and reusable.

The configured operator chain runs in the execution engine, producing structured data that is sent to Kafka for downstream consumption.

Each raw log can be mapped to one or more structured tables; unmatched logs are routed to an unknown_table.
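The orthogonal, chainable nature of rule operators can be sketched as composable functions over lists of records. The operator names and the pipe-delimited sample log below are hypothetical, chosen only to show how a chain structures raw messages:

```python
def op_filter(pred):
    """Keep only records matching a predicate."""
    def apply(rows):
        return [r for r in rows if pred(r)]
    return apply

def op_split(sep, names):
    """Split the raw message on a separator into named columns."""
    def apply(rows):
        return [dict(zip(names, r["message"].split(sep))) for r in rows]
    return apply

def op_replace(field, old, new):
    """Substring replacement in one field."""
    def apply(rows):
        out = []
        for r in rows:
            r = dict(r)
            r[field] = r[field].replace(old, new)
            out.append(r)
        return out
    return apply

def run_chain(rows, *ops):
    """Apply operators in order; each is independent and reusable."""
    for op in ops:
        rows = op(rows)
    return rows
```

Because every operator consumes and produces the same record shape, the UI can let users reorder, add, or remove operators freely, and the same operator definitions can be reused across log sources.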

3. Unified Message Schema (UMS)

All output messages—whether incremental, full, or log—conform to the UMS format, which includes protocol version, namespace (type.datasource.schema.table.version.sharding), fields (ums_id_, ums_ts_, ums_op_, ums_uid_), and a payload containing the actual data. Multiple records can be packed into a single JSON payload.
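A minimal sketch of assembling a UMS-style message with the fields named above; the protocol type and version strings, and the use of a millisecond epoch for `ums_ts_`, are illustrative assumptions:

```python
import json
import time


def build_ums(namespace: str, fields: list[tuple[str, str]],
              rows: list[list], op: str = "i") -> str:
    """Assemble a UMS-style JSON message.

    `namespace` follows type.datasource.schema.table.version.sharding;
    the built-in ums_* columns precede the business fields, and multiple
    records are packed into one payload.
    """
    now_ms = int(time.time() * 1000)
    schema = [{"name": "ums_id_", "type": "long"},
              {"name": "ums_ts_", "type": "datetime"},
              {"name": "ums_op_", "type": "string"}]
    schema += [{"name": n, "type": t} for n, t in fields]
    payload = [{"tuple": [i, now_ms, op] + list(row)}
               for i, row in enumerate(rows)]
    return json.dumps({
        "protocol": {"type": "data_increment_data", "version": "1.3"},
        "schema": {"namespace": namespace, "fields": schema},
        "payload": payload,
    })
```

Packing several rows into one payload amortizes the per-message schema overhead, which matters when a single binlog transaction touches many rows.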

4. Heartbeat Monitoring and Alerting

For RDBMS pipelines, a heartbeat module writes timestamped records to a dedicated heartbeat table in the source database; these records travel through the same pipeline, allowing the system to verify end‑to‑end connectivity even when no DML occurs. Heartbeat data is also sent to InfluxDB and visualized in Grafana, with latency‑based alerts.
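The latency check behind the alerting can be sketched simply: compare a heartbeat record’s emit timestamp against its arrival time at the sink. The threshold value is an assumption for illustration:

```python
def heartbeat_latency_ms(emitted_ms: int, received_ms: int) -> int:
    """End-to-end pipeline latency measured by a heartbeat record.

    The heartbeat row travels the same path as real data, so this gap
    reflects pipeline health even when the source has no DML activity.
    """
    return received_ms - emitted_ms


def should_alert(latency_ms: int, warn_ms: int = 60_000) -> bool:
    """Fire an alert once latency exceeds the threshold (illustrative)."""
    return latency_ms > warn_ms
```

In the setup described above, these per-heartbeat latencies are the series shipped to InfluxDB and plotted in Grafana, with the threshold driving alerts.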

Log pipelines generate heartbeat events at the source, which are processed by the extractor and rule modules and monitored similarly.

Architecture Digest
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
