
TiSpark Integration with TiDB/TiKV for Efficient Data Synchronization and OLAP in the Databus Project

This article introduces TiSpark, an extension of Spark that integrates tightly with TiDB/TiKV to enable high-performance, scalable data synchronization and OLAP queries. It describes the architecture and key configuration, compares performance with Spark SQL and Sqoop, and outlines TiSpark's role in the Databus data-integration platform.

Beike Product & Technology

In the February 21, 2019 "DATABUS – Data Island Solution" article, TiSpark was highlighted as a crucial component for quickly and accurately importing specified database tables into target data sources, enabling T+1, daily, and hourly full‑ and incremental data tasks within the Databus project.

TiSpark is built on top of TiDB and TiKV, which together form a hybrid transactional/analytical processing (HTAP) database. TiDB provides horizontal scalability, strong consistency, distributed transactions, and real‑time OLAP capabilities, while TiKV serves as the underlying distributed key‑value store.

The TiDB server layer offers online transaction processing (OLTP) and analytical processing (OLAP) in a single system, supporting multi‑replica data safety and real‑time analytics.

The Placement Driver (PD) manages cluster metadata, performs scheduling and load balancing for TiKV nodes, and generates globally unique, monotonically increasing transaction IDs.
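To make "globally unique, monotonically increasing" concrete, here is a minimal sketch of a timestamp-style ID allocator. It is a toy model, not PD's actual implementation. The class name `TimestampOracle`, the injectable `clock`, and the layout (millisecond time in the high bits, an 18-bit counter in the low bits) are assumptions made for illustration.

```python
import threading
import time

LOGICAL_BITS = 18  # assumption: low bits hold a per-millisecond counter


class TimestampOracle:
    """Toy ID allocator (hypothetical); illustrates the idea only."""

    def __init__(self, clock=lambda: int(time.time() * 1000)):
        self._clock = clock          # returns the current time in milliseconds
        self._lock = threading.Lock()
        self._physical = 0
        self._logical = 0

    def next_ts(self):
        with self._lock:
            now = self._clock()
            if now > self._physical:
                # The clock advanced: adopt the new millisecond, reset the counter.
                self._physical = now
                self._logical = 0
            else:
                # Same millisecond, or the clock moved backwards: keep the old
                # physical part and bump the counter so IDs never decrease.
                self._logical += 1
                if self._logical >= (1 << LOGICAL_BITS):
                    # Counter exhausted: borrow the next physical tick.
                    self._physical += 1
                    self._logical = 0
            return (self._physical << LOGICAL_BITS) | self._logical
```

Because every ID is issued under one lock by one allocator, two transactions can never receive the same value, and a later request always receives a larger one, even if the wall clock steps backwards.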

TiKV stores data as regions, each covering a contiguous key range. Regions are replicated using the Raft protocol, ensuring consistency and fault tolerance, while PD balances region distribution across nodes.
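The idea of regions as contiguous key ranges can be sketched with a sorted list of split keys and a binary search. This is an illustrative model only. `RegionMap` and its methods are hypothetical names, regions are identified by their position rather than by a real region ID, and replication is left out.

```python
import bisect


class RegionMap:
    """Toy routing table: N sorted split keys define N + 1 contiguous regions."""

    def __init__(self, split_keys):
        self._splits = sorted(split_keys)

    def region_for(self, key):
        # Region i covers [splits[i-1], splits[i]); region 0 starts at the
        # smallest possible key and the last region is unbounded above.
        return bisect.bisect_right(self._splits, key)

    def regions_for_range(self, start, end):
        # Regions touched by the half-open scan range [start, end), start < end.
        first = self.region_for(start)
        last = bisect.bisect_left(self._splits, end)
        return list(range(first, last + 1))


regions = RegionMap([b"g", b"p"])
print(regions.region_for(b"apple"))             # falls before the first split key
print(regions.regions_for_range(b"a", b"h"))    # scan crosses one region boundary
```

A range scan therefore becomes a list of per-region requests, which is what lets a reader fetch different regions in parallel.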

TiSpark plugs into Spark's Catalyst engine, which gives it fine-grained control over how a query is computed. It reads data efficiently from TiKV, can use index lookups, and pushes computation down to TiKV, which reduces the amount of data Spark SQL must process. It also uses TiDB's statistics for query optimization.
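The benefit of push-down can be illustrated with a small simulation that counts how many items cross the boundary between storage and compute. This is a toy model, not TiSpark or TiKV code. The function names, the row layout, and the `city` and `amount` columns are invented for the example.

```python
def make_regions():
    # Three toy regions of 1000 rows each; only 10 rows per region match.
    return [
        [{"id": r * 1000 + i, "amount": 1, "city": "BJ" if i < 10 else "SH"}
         for i in range(1000)]
        for r in range(3)
    ]


def scan_without_pushdown(regions, predicate):
    # Every row is shipped to the compute side, which then filters and sums.
    shipped = [row for region in regions for row in region]
    total = sum(row["amount"] for row in shipped if predicate(row))
    return total, len(shipped)


def scan_with_pushdown(regions, predicate):
    # Each region filters and pre-aggregates locally, shipping one partial sum.
    partials = [sum(row["amount"] for row in region if predicate(row))
                for region in regions]
    return sum(partials), len(partials)


is_bj = lambda row: row["city"] == "BJ"
print(scan_without_pushdown(make_regions(), is_bj))
print(scan_with_pushdown(make_regions(), is_bj))
```

Both calls return the same total. The difference is the second element of the result: thousands of shipped rows in one case, one partial result per region in the other.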

In the Databus project, TiSpark is used to synchronize business data to a Hive data warehouse in a T+1 fashion. The runtime environment includes JDK 1.8, Spark 2.3.2, and Yarn deployment mode.
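As a rough sketch of what a T+1 job might submit, the helper below builds the SQL statement for one table. The article does not show the actual job, so the partition column `pt`, the `yyyymmdd` format, the database and table names, and the `INSERT OVERWRITE` strategy are assumptions. The `tidb_` prefix mirrors the `spark.tispark.db_prefix` setting the article uses.

```python
from datetime import date, timedelta


def t_plus_1_partition(run_date):
    # A T+1 job that runs on day T loads the data of day T-1.
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")


def build_sync_sql(tidb_db, table, hive_db, run_date, db_prefix="tidb_"):
    # With a database prefix configured, TiDB databases appear in Spark's
    # catalog as <prefix><name>, which keeps them apart from Hive databases.
    source = f"{db_prefix}{tidb_db}.{table}"
    target = f"{hive_db}.{table}"
    pt = t_plus_1_partition(run_date)
    return (f"INSERT OVERWRITE TABLE {target} PARTITION (pt='{pt}') "
            f"SELECT * FROM {source}")


# Hypothetical names, for illustration only.
print(build_sync_sql("orders_db", "orders", "ods", date(2019, 2, 21)))
```

In a real job the resulting string would be passed to `spark.sql(...)` in a session that has the TiSpark extension loaded. Overwriting one partition per day keeps reruns idempotent.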

Key configuration parameters are set as follows:

    spark.sql.extensions                     org.apache.spark.sql.TiExtensions
    spark.tispark.pd.addresses               127.0.0.1:2379
    spark.tispark.db_prefix                  tidb_
    spark.tispark.request.command.priority   Normal

These settings enable Spark to load the TiSpark extension, connect to the PD cluster, apply a database prefix, and set the query priority to "Normal" to balance OLTP and OLAP workloads.
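The same four settings can be kept in one place and rendered either as `spark-submit --conf` arguments or as lines for a `spark-defaults.conf` file. The sketch below assumes the values from the article. The helper names `to_submit_args` and `to_defaults_file` are hypothetical.

```python
# Values taken from the article; replace the PD address with a real cluster's.
TISPARK_CONF = {
    "spark.sql.extensions": "org.apache.spark.sql.TiExtensions",
    "spark.tispark.pd.addresses": "127.0.0.1:2379",
    "spark.tispark.db_prefix": "tidb_",
    "spark.tispark.request.command.priority": "Normal",
}


def to_submit_args(conf):
    # Render each setting as a "--conf key=value" pair for spark-submit.
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def to_defaults_file(conf):
    # Render the settings as "key value" lines, one per line.
    return "\n".join(f"{key} {value}" for key, value in conf.items()) + "\n"


print(" ".join(to_submit_args(TISPARK_CONF)))
print(to_defaults_file(TISPARK_CONF))
```

Keeping the settings in a single dictionary avoids drift between the cluster-wide defaults and what individual jobs pass on the command line.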

Performance tests comparing TiSpark, Spark SQL, and Sqoop for data synchronization to Hive show that TiSpark achieves roughly four times the throughput of Spark SQL and fifteen times that of Sqoop, demonstrating a significant efficiency gain.

TiSpark also addresses Spark SQL’s challenges with large, unevenly distributed primary keys by leveraging TiKV’s region‑based partitioning, which creates evenly sized Spark partitions and avoids OOM issues and resource waste.
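The skew problem can be reproduced in a few lines. The first function mimics the common approach of cutting the primary-key range into equal-width strides, as a JDBC-style reader does when given a lower bound, an upper bound, and a partition count. The second mimics region-style splitting, where boundaries follow the stored data. Both are simplified illustrations: real regions split by data size, not row count, and the function names are hypothetical.

```python
def split_by_key_range(keys, num_partitions):
    # Equal-width strides over [min, max]; row counts depend on key density.
    lo, hi = min(keys), max(keys)
    stride = (hi - lo) // num_partitions + 1
    counts = [0] * num_partitions
    for k in keys:
        counts[(k - lo) // stride] += 1
    return counts


def split_by_region(keys, rows_per_region):
    # Boundaries follow the data itself, so each chunk holds a similar amount.
    ordered = sorted(keys)
    return [len(ordered[i:i + rows_per_region])
            for i in range(0, len(ordered), rows_per_region)]


# 1000 dense keys plus one large outlier: a typical sparse primary-key layout.
keys = list(range(1, 1001)) + [1_000_000]
print(split_by_key_range(keys, 4))   # nearly all rows land in one partition
print(split_by_region(keys, 251))    # four chunks of similar size
```

With equal-width strides, one task receives almost every row while the others sit idle, which is the pattern behind the OOM failures and wasted resources described above.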

For stability, TiDB’s Syncer can continuously replicate MySQL data to TiDB, while TiSpark reads directly from TiKV, eliminating heavy JDBC connections to MySQL that can overload source databases. TiKV’s region‑level Raft replication provides load balancing, horizontal scaling, and fault tolerance.

In summary, TiDB and TiSpark play a pivotal role in the Databus project: they enable real-time data synchronization today and offer OLAP capabilities in the future without separate ETL pipelines, so that upstream OLTP data can be analyzed immediately.
