
DATABUS Data Integration Platform: Architecture, Capabilities, and TiDB Ecosystem

The article presents an in‑depth overview of the DATABUS data integration platform, detailing its background, current challenges, core capabilities such as data syncing, metadata automation, real‑time subscriptions, and its reliance on TiDB, TiSpark, Hudi, and related big‑data technologies to enable near‑real‑time data warehousing.

Beike Product & Technology

1 Background

With the rapid growth of Beike's online platform business, data has become more valuable and more frequently used, yet it remains scattered across many internal sources. Consolidating and computing this data, and breaking down data silos to enable cross-platform integration, is crucial. Beike currently has nearly 1,000 databases and 25,000 business tables, so importing them quickly and accurately into target destinations is essential.

2 Current Situation and Pain Points

2.1 Long and Delayed Data Acquisition Cycle

Business teams import tables into Hive via Sqoop with a daily T+1 full load, resulting in a one-day delay and no real-time sync.

Sqoop imports are time‑consuming and put pressure on online databases, affecting production workloads.

2.2 Inability to Sync Metadata in Real Time

When new tables or schema changes occur, metadata updates are not synchronized automatically, requiring manual intervention and risking inaccurate data warehouse sync.

2.3 Lack of Data Subscription Functionality

Business users cannot subscribe to real‑time data change notifications with guaranteed order and accuracy.

2.4 No Traceability of Data Changes

Users cannot view detailed, real‑time change history for specific data.

3 DATABUS Capabilities

3.1 Data Integration

Supports hourly, T+1 full, and incremental sync of business tables to Hive, reducing redundant data and shortening acquisition cycles.

3.2 Automatic Metadata Update and Sync

DATABUS reads TiDB binlog to update metadata tables in real time and synchronizes schema changes to Hive.
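The idea can be sketched as translating a captured DDL event into the corresponding Hive statement. The event shape and the type mapping below are illustrative assumptions, not DATABUS's actual wire format:

```python
# Illustrative mapping from TiDB/MySQL column types to Hive types.
TYPE_MAP = {
    "int": "INT",
    "bigint": "BIGINT",
    "varchar": "STRING",
    "datetime": "TIMESTAMP",
}

def ddl_to_hive(event):
    """Render an ADD COLUMN binlog event as a Hive ALTER TABLE.

    Only the add-column case is sketched; real schema sync must also
    cover drops, renames, and type changes.
    """
    if event["op"] != "add_column":
        raise ValueError("only add_column is sketched here")
    hive_type = TYPE_MAP.get(event["col_type"], "STRING")
    return (f"ALTER TABLE {event['db']}.{event['table']} "
            f"ADD COLUMNS ({event['col']} {hive_type})")

event = {"op": "add_column", "db": "ods_house", "table": "listing",
         "col": "agent_id", "col_type": "bigint"}
print(ddl_to_hive(event))
# ALTER TABLE ods_house.listing ADD COLUMNS (agent_id BIGINT)
```

Database, table, and column names here are hypothetical; in practice the statement would be executed against the Hive metastore as soon as the binlog event is consumed.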

3.3 Real‑time Data Warehouse Construction

Using Spark‑Streaming, DATABUS reads TiDB binlog, processes data, and writes to HDFS via HUDI to achieve near‑real‑time Hive tables.

3.4 Data Subscription

Consumers can configure subscription tasks for selected tables; DATABUS writes changes to ordered Kafka topics for downstream consumption.
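One common way to keep ordering guarantees while still spreading load across partitions is to hash the change event's routing key, so all changes to one row land on one partition. A minimal sketch, assuming a `(table, primary key)` routing key (the actual DATABUS key scheme is not specified in the article):

```python
import hashlib

def partition_for(table, pk, num_partitions):
    """Route a change event to a Kafka partition.

    Hashing on the (table, primary key) pair keeps all changes to a
    given row in a single partition, so per-row order is preserved
    while different rows can be consumed in parallel.
    """
    key = f"{table}:{pk}".encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % num_partitions

# All updates to the same row land on the same partition.
assert partition_for("listing", 42, 8) == partition_for("listing", 42, 8)
```

Downstream consumers then read each partition sequentially and still observe every row's changes in commit order.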

3.5 Real‑time Data Change Query

All DML operations are streamed to Elasticsearch, enabling instant query of data changes.
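A change-query index of this kind typically stores one document per DML event; querying by database, table, and primary key and sorting by timestamp then yields a row's full history. The document shape below is an illustrative assumption, not the actual DATABUS mapping:

```python
from datetime import datetime, timezone

def change_doc(db, table, pk, op, before, after, ts):
    """Build an Elasticsearch-style document for one DML change.

    Field names are illustrative. Filtering on db/table/pk and
    sorting by ts reconstructs the change history of a single row.
    """
    return {
        "db": db, "table": table, "pk": pk, "op": op,
        "before": before, "after": after,
        "ts": ts.isoformat(),
    }

doc = change_doc("ods_house", "listing", 42, "UPDATE",
                 before={"price": 100}, after={"price": 98},
                 ts=datetime(2020, 1, 1, tzinfo=timezone.utc))
```

In production each such document would be bulk-indexed into Elasticsearch as the binlog stream is consumed.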

3.6 Overall Architecture Diagram

DATABUS data platform architecture diagram

4 TiDB Ecosystem

4.1 Reasons for Introducing TiDB

Decouple from online replicas, reducing impact on production.

Unified metadata storage with horizontal scalability; TiDB supports MySQL protocol and can act as a MySQL slave.

Real‑time detection of metadata changes and automatic Hive schema updates.

Real‑time propagation of DML metadata to downstream consumers.

Future goal: consume binlog to build real‑time Hive warehouse.

Supports both T+1 offline and near‑real‑time OLAP/OLTP workloads via TiSpark.

4.2 Overview

TiDB is an open‑source distributed database from PingCAP, combining RDBMS and NoSQL strengths.

SQL compatibility with MySQL.

Horizontal elastic scaling.

Full ACID transaction support.

One-stop HTAP solution: a single cluster serves OLTP workloads on the row store while TiSpark delivers OLAP performance.

TiDB architecture diagram

4.3 Component Introduction

4.3.1 TiDB

TiDB Server handles SQL requests, finds data via PD and TiKV, and is stateless, allowing unlimited horizontal scaling behind load balancers.

Our deployment runs three TiDB clusters covering half of online business data.

4.3.2 TiSpark

TiSpark extends TiDB for complex OLAP workloads, leveraging Spark SQL to process large datasets; DATABUS uses it to load tens of millions of rows into Hive within 1‑3 minutes.

TiSpark configuration diagram

4.3.3 TiDB‑Binlog

TiDB‑Binlog captures binlog events and provides synchronization; DATABUS streams binlog data as protobuf to Kafka, enabling real‑time warehouse updates, metadata sync, subscription, and change query.

TiDB Binlog architecture diagram

4.4 TiDB Summary

1) To avoid duplicate database names, we map original databases to "port_database" identifiers, allowing placement in different clusters.

2) To preserve order while improving Kafka consumption, we hash tables to multiple partitions instead of a single partition.

3) Certain schema changes can cause TiDB sync failures; the current workaround is a full re-sync, with improvements planned. Known failing cases include changing a varchar column to long, reducing a varchar column's length, and MySQL tables with foreign keys or partitions.

4) TiSpark currently does not support enum and some special types; support is forthcoming.
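The "port_database" naming scheme in point 1 can be sketched as a simple mapping; the exact format DATABUS uses is assumed here:

```python
def canonical_db_name(port, database):
    """Disambiguate identically named databases from different source
    instances by prefixing the instance port, per the "port_database"
    scheme. The separator and exact format are assumptions.
    """
    return f"{port}_{database}"

# Two "order" databases from different instances stay distinct,
# so they can be placed in different TiDB clusters without collision.
assert canonical_db_name(4000, "order") == "4000_order"
assert canonical_db_name(4001, "order") == "4001_order"
```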

5 HUDI Introduction

5.1 Why Use HUDI

DATABUS aims for near-real-time data warehousing: by reading the TiDB binlog and using HUDI to upsert files on HDFS, it keeps Hive tables near real time while avoiding the storage cost of rewriting full daily partitions.
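An upsert pipeline of this kind is driven largely by Hudi's write options. The fragment below shows the documented option keys for a Copy-on-Write upsert; the table and field names (`id`, `updated_at`, `dt`) are illustrative assumptions, not DATABUS's actual configuration:

```python
# Illustrative Hudi write options for upserting binlog rows into a
# Copy-on-Write table keyed by primary key and deduplicated by
# update timestamp.
hudi_options = {
    "hoodie.table.name": "ods_listing",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "dt",
}
# In a Spark job this dict would typically be passed as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```

With upserts keyed on the primary key, each binlog batch updates rows in place instead of appending a new full partition, which is where the storage savings come from.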

6 Summary & Outlook

Since its launch in August, DATABUS has come to support metadata management for 99% of the company's business tables, provides both T+1 and hourly sync to Hive, processes 3-4 billion rows of TiDB data daily, and offers subscription and real-time change query for 40% of tables.

Future work includes expanding heterogeneous data connectors and launching full real‑time warehousing using technologies like HUDI.

7 About Us

The Beike Big Data Architecture Team builds and optimizes the company’s data storage, compute, and streaming platforms, delivering stable, high‑performance big‑data components and platforms.

We invite internal teams to try DATABUS at http://databus.data.lianjia.com and provide feedback.
