Overview of Taobao Cloud Computing Architecture and Data Synchronization Solutions

This article presents a comprehensive overview of Taobao's cloud computing architecture, detailing system components, various data synchronization methods such as TimeTunnel, Dbsync, and DataX, the scheduling system design, and metadata-driven analysis platforms for performance optimization and monitoring.

Architecture Digest

Jun 4, 2019

1. System Architecture

Data flows from top to bottom, moving from multiple data sources through a Gateway and the Cloud Ladder to various application scenarios.

2. Taobao Cloud Computing Overview

The platform is composed of three main parts: data sources, a data platform, and data clusters.

3. Data Synchronization Solutions

3.1 Overview

3.2 Real‑time vs. Non‑real‑time Synchronization

3.3 TimeTunnel2

TimeTunnel is a real‑time data transmission platform whose main functions are publishing data to the platform and subscribing to data of interest.

Key characteristics include high efficiency (a single node can handle up to 40k TPS), high reliability (the master-slave, or M-S, mode guarantees no data loss), high availability (a single-node failure does not affect the cluster), and ordered delivery as long as no failure occurs.
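The publish/subscribe model described above can be illustrated with a toy in-memory broker. This is only a sketch of the general pattern: `MiniBroker`, its methods, and the topic names are hypothetical and are not the TimeTunnel API, and the sketch has none of the reliability or availability features listed above.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class MiniBroker:
    """Toy in-memory broker (hypothetical; not the TimeTunnel API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[bytes], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[bytes], None]) -> None:
        """Register a handler for the data the caller is interested in."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: bytes) -> int:
        """Deliver a message to every subscriber of the topic; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


received: List[bytes] = []
broker = MiniBroker()
broker.subscribe("click_log", received.append)
broker.publish("click_log", b"event-1")
broker.publish("click_log", b"event-2")
broker.publish("order_log", b"ignored")  # nobody subscribed to this topic
```

With a single process and no failures, messages arrive in publish order, which mirrors the "ordered delivery when no failures occur" property only in the most trivial sense.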

3.4 Dbsync

Dbsync synchronizes business database data to HDFS: it parses the database server's log files, extracts the recorded database operations, and delivers the resulting incremental data to Hadoop.

Performance examples:

2 KB record size → 4 MB/s

9 KB record size → 10 MB/s

Typical scenario: for 800 GB of data, non-real-time sync completes in 55 minutes, while real-time sync completes in 25 minutes.
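To put the figures above on a common scale, the arithmetic can be worked through as follows. The unit convention (1 GB = 1024 MB, 1 MB = 1024 KB) is an assumption, as is the reading that the MB/s figures describe a single stream; the source states neither.

```python
MB_PER_GB = 1024  # assumption: binary units
KB_PER_MB = 1024  # assumption: binary units


def records_per_second(throughput_mb_s: float, record_kb: float) -> float:
    """Convert a byte throughput into a record rate for a given record size."""
    return throughput_mb_s * KB_PER_MB / record_kb


def aggregate_mb_per_second(total_gb: float, minutes: float) -> float:
    """Average throughput needed to move total_gb within the given time."""
    return total_gb * MB_PER_GB / (minutes * 60)


small_records = records_per_second(4, 2)        # 2 KB records at 4 MB/s
large_records = records_per_second(10, 9)       # 9 KB records at 10 MB/s
batch_rate = aggregate_mb_per_second(800, 55)   # non-real-time scenario
realtime_rate = aggregate_mb_per_second(800, 25)  # real-time scenario

print(f"{small_records:.0f} rec/s, {large_records:.0f} rec/s")
print(f"{batch_rate:.1f} MB/s, {realtime_rate:.1f} MB/s")
```

Under these assumptions the 800 GB scenario averages roughly 248 MB/s (55 minutes) and 546 MB/s (25 minutes), far above the 4 to 10 MB/s examples, so the scenario presumably involves many transfers running in parallel.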

3.5 DataX

DataX is a tool for exchanging data between heterogeneous data stores (RDBMS, NoSQL, file systems). It uses a framework plus plugins; the framework handles high‑speed data exchange, while plugins provide access to specific systems.
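The framework-plus-plugins split can be sketched as below: plugins only know how to read from or write to one system, while the framework moves records between them through a bounded buffer. All class and function names here (`Reader`, `Writer`, `exchange`, and so on) are hypothetical illustrations, not DataX's actual plugin interfaces.

```python
import queue
import threading
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List, Tuple

Record = Tuple  # one row of data


class Reader(ABC):
    """Plugin side: knows how to pull records out of one source system."""

    @abstractmethod
    def read(self) -> Iterator[Record]: ...


class Writer(ABC):
    """Plugin side: knows how to push records into one target system."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> None: ...


class ListReader(Reader):
    def __init__(self, rows: List[Record]) -> None:
        self.rows = rows

    def read(self) -> Iterator[Record]:
        yield from self.rows


class ListWriter(Writer):
    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> None:
        self.rows.extend(records)


_DONE = object()  # sentinel marking the end of the stream


def exchange(reader: Reader, writer: Writer, buffer_size: int = 1024) -> None:
    """Framework side: stream records from reader to writer via a bounded queue."""
    channel: queue.Queue = queue.Queue(maxsize=buffer_size)

    def produce() -> None:
        try:
            for record in reader.read():
                channel.put(record)
        finally:
            channel.put(_DONE)  # always unblock the consumer

    def drain() -> Iterator[Record]:
        while True:
            item = channel.get()
            if item is _DONE:
                return
            yield item

    producer = threading.Thread(target=produce)
    producer.start()
    writer.write(drain())
    producer.join()


source_rows = [(1, "alice"), (2, "bob"), (3, "carol")]
sink = ListWriter()
exchange(ListReader(source_rows), sink, buffer_size=2)
```

Supporting a new data store in this design means adding one reader or writer class; the exchange loop stays unchanged.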

Supported execution modes: stand-alone or on Hadoop, with both a Web UI and a CLI. Configuration is quick; for example, a sharded table spanning 32 databases and 1,024 tables can be configured in under one minute.
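One reason a sharded table can be configured quickly is that its shards usually follow a naming pattern, so the full list can be generated rather than typed. The sketch below assumes the 1,024 tables are spread evenly over the 32 databases (32 tables each); the naming scheme and the `enumerate_shards` helper are hypothetical and do not reflect DataX's configuration format.

```python
from typing import List, Tuple


def enumerate_shards(
    db_prefix: str, table_prefix: str, db_count: int, table_count: int
) -> List[Tuple[str, str]]:
    """List (database, table) pairs, spreading tables evenly across databases."""
    if table_count % db_count != 0:
        raise ValueError("table_count must be a multiple of db_count")
    per_db = table_count // db_count
    shards = []
    for index in range(table_count):
        database = f"{db_prefix}_{index // per_db:02d}"
        table = f"{table_prefix}_{index:04d}"
        shards.append((database, table))
    return shards


# Hypothetical names: databases trade_00..trade_31, tables orders_0000..orders_1023.
shards = enumerate_shards("trade", "orders", db_count=32, table_count=1024)
```

Each generated pair could then be fed into whatever per-table job definition the synchronization tool expects.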

4. Scheduling System

4.1 Production‑rate Silver Bullet

4.2 Modules / Sub‑systems

4.3 Task Trigger Methods

Tasks are started in one of two ways: by a flow-control (data) trigger or by a time trigger.
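The two trigger types can be sketched as a readiness check. The interpretation is an assumption: a data trigger is taken to mean "all upstream tasks have finished", and a time trigger to mean "the clock has reached a scheduled time". The `Task` fields and `ready_tasks` function are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Set


@dataclass
class Task:
    name: str
    upstream: Set[str] = field(default_factory=set)  # data trigger: required upstream tasks
    not_before: Optional[datetime] = None            # time trigger: earliest start time


def ready_tasks(tasks: List[Task], finished: Set[str], now: datetime) -> List[str]:
    """Return names of unfinished tasks whose data and time conditions both hold."""
    ready = []
    for task in tasks:
        if task.name in finished:
            continue
        data_ok = task.upstream <= finished
        time_ok = task.not_before is None or now >= task.not_before
        if data_ok and time_ok:
            ready.append(task.name)
    return ready


tasks = [
    Task("sync_orders", not_before=datetime(2019, 6, 4, 1, 0)),
    Task("build_report", upstream={"sync_orders"}),
]
before = ready_tasks(tasks, set(), datetime(2019, 6, 4, 0, 30))
at_one = ready_tasks(tasks, set(), datetime(2019, 6, 4, 1, 0))
after_sync = ready_tasks(tasks, {"sync_orders"}, datetime(2019, 6, 4, 1, 5))
```

Here `sync_orders` waits for the clock, while `build_report` starts as soon as its upstream data is available, regardless of the time.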

4.4 Scheduling Modes

4.5 Gateway Definition

A Gateway is a resource participating in the scheduling system, providing functions such as data synchronization (DataX, Dbsync, TimeTunnel2), data upload/download (hadoop fs -put/get/getmerge), log collection, Hive SQL execution, MapReduce job submission, and inter-cluster data sync (hadoop distcp).
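As an illustration of the upload/download and inter-cluster functions, the helpers below assemble the corresponding Hadoop command lines as argument lists. The helper names and all paths are hypothetical; only the `hadoop fs -put`, `hadoop fs -getmerge`, and `hadoop distcp` commands themselves come from the text. The sketch builds the commands without running them.

```python
from typing import List


def put_cmd(local_path: str, hdfs_path: str) -> List[str]:
    """Upload a local file to HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def getmerge_cmd(hdfs_dir: str, local_file: str) -> List[str]:
    """Download every file under an HDFS directory merged into one local file."""
    return ["hadoop", "fs", "-getmerge", hdfs_dir, local_file]


def distcp_cmd(src_uri: str, dst_uri: str) -> List[str]:
    """Copy data between clusters."""
    return ["hadoop", "distcp", src_uri, dst_uri]


# Hypothetical paths and cluster names, for illustration only.
upload = put_cmd("/tmp/orders.csv", "/warehouse/orders/")
merge = getmerge_cmd("/warehouse/report/", "/tmp/report.txt")
copy = distcp_cmd("hdfs://cluster-a/warehouse/orders", "hdfs://cluster-b/warehouse/orders")
print(" ".join(upload))
```

On a machine with the Hadoop client installed, such a list could be passed to `subprocess.run(cmd, check=True)`; keeping arguments as a list avoids shell quoting problems.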

4.6 Gateway Scale and Planning

Approximately 30 Gateways are used in production, managed centrally for task distribution and parallel control.
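Central task distribution with parallel control might look like the following sketch, which sends each task to the least-loaded Gateway that still has a free slot. The selection rule, the `Gateway` fields, and the function names are assumptions for illustration; the source does not describe the actual algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Gateway:
    name: str
    max_parallel: int  # parallel control: cap on concurrent tasks
    running: int = 0


def pick_gateway(gateways: List[Gateway]) -> Optional[Gateway]:
    """Pick the Gateway with the lowest load ratio among those with a free slot."""
    candidates = [g for g in gateways if g.running < g.max_parallel]
    if not candidates:
        return None
    return min(candidates, key=lambda g: g.running / g.max_parallel)


def dispatch(task_names: List[str], gateways: List[Gateway]) -> Dict[str, Optional[str]]:
    """Assign tasks in order; a task maps to None if every Gateway is full."""
    placement: Dict[str, Optional[str]] = {}
    for task in task_names:
        gateway = pick_gateway(gateways)
        if gateway is None:
            placement[task] = None
        else:
            gateway.running += 1
            placement[task] = gateway.name
    return placement


gateways = [Gateway("gw-01", max_parallel=2), Gateway("gw-02", max_parallel=1)]
placement = dispatch(["t1", "t2", "t3", "t4"], gateways)
```

In this example the fourth task is left unplaced because both Gateways have reached their limits; a real scheduler would queue it until a slot frees up.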

4.7 Gateway Standardization

4.8 Dynamic Load Balancing Implementation

4.9 Priority Strategy Implementation

4.10 Priority Strategy Significance

4.11 Monitoring Panorama

5. Metadata Applications

Key questions include whether to rely on experienced architects or intelligent analysis systems.

5.1 Mining Metadata Goldmine

5.2 Metadata‑Based Development Platform

Features include automatic code generation, input location, code optimization, automated deployment and scheduling, pairwise analysis, hotspot detection, field‑change impact analysis, and transformation tracing.

5.3 Metadata‑Based Analysis Platform – Runtime Analysis System

5.4 Analysis Strategy Overview

5.5 Runtime Data Collection

5.6 Macro Analysis Strategy

5.7 Bottleneck Localization

Each stage’s throughput varies dynamically; the overall system throughput is limited by the stage with the smallest capacity. Visualizing throughput curves for each stage helps identify and address bottlenecks, and buffering queues can be added for stages with high variance.

Methods include plotting per‑stage throughput curves, buffer queue status between stages, and normalizing metrics to task level.
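The bottleneck reasoning above can be sketched numerically. The sketch treats a stage's mean observed throughput as a proxy for its capacity and uses the coefficient of variation to flag stages that might benefit from a buffer queue; both choices, the 0.3 threshold, and the sample numbers are assumptions, not values from the source.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def stage_summary(samples: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Per stage: (mean throughput, coefficient of variation)."""
    summary = {}
    for stage, values in samples.items():
        average = mean(values)
        variation = pstdev(values) / average if average else 0.0
        summary[stage] = (average, variation)
    return summary


def bottleneck(samples: Dict[str, List[float]]) -> str:
    """The stage with the smallest mean throughput bounds the whole pipeline."""
    summary = stage_summary(samples)
    return min(summary, key=lambda stage: summary[stage][0])


def needs_buffer(samples: Dict[str, List[float]], cv_threshold: float = 0.3) -> List[str]:
    """Stages whose throughput fluctuates strongly (candidates for a buffer queue)."""
    summary = stage_summary(samples)
    return [stage for stage, (_, cv) in summary.items() if cv > cv_threshold]


# Made-up throughput samples (records per second) for three pipeline stages.
samples = {
    "extract": [100, 100, 100, 100],
    "transform": [40, 120, 40, 120],
    "load": [60, 60, 60, 60],
}
slowest = bottleneck(samples)
bursty = needs_buffer(samples)
```

Here `load` is the bottleneck (mean 60), while `transform` is not the slowest on average but swings between 40 and 120, making it the stage where a buffer queue would smooth the flow.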

5.8 Most Worthwhile Optimization Targets

From a critical‑path perspective, tasks that have long runtimes, appear on multiple critical paths, and have high variability are prioritized for optimization.
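One way to turn these three criteria into a ranking is a simple score that grows with runtime, variability, and the number of critical paths a task appears on. The scoring formula, the `TaskStats` fields, and the sample data below are assumptions chosen for illustration, not the platform's actual method.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class TaskStats:
    name: str
    runtimes_min: List[float]  # observed runtimes in minutes
    critical_path_count: int   # how many critical paths include this task


def priority_score(task: TaskStats) -> float:
    """Assumed score: (mean runtime + runtime spread) weighted by critical-path count."""
    average = mean(task.runtimes_min)
    spread = pstdev(task.runtimes_min)
    return (average + spread) * task.critical_path_count


def rank_targets(tasks: List[TaskStats]) -> List[str]:
    """Task names ordered from most to least worthwhile to optimize."""
    return [t.name for t in sorted(tasks, key=priority_score, reverse=True)]


# Made-up statistics for three tasks.
tasks = [
    TaskStats("long_but_isolated", [60, 60, 60], critical_path_count=1),
    TaskStats("shared_and_unstable", [30, 50, 40], critical_path_count=3),
    TaskStats("off_critical_path", [5, 5, 5], critical_path_count=0),
]
ranking = rank_targets(tasks)
```

In this example the shorter but unstable task that sits on three critical paths outranks the longer task that sits on only one, which matches the prioritization described above.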

Source: Compiled from internet resources titled “Taobao Cloud Ladder Distributed Computing Platform Overall Architecture”. Original link: https://www.afenxi.com/64409.html
