Design and Practice of Qunar Data Synchronization Platform: ES Multi‑Version Migration, High Availability, and Data Consistency

The article details Qunar's data synchronization platform that aggregates MySQL data into Elasticsearch, covering its architecture, component choices, ES5‑to‑ES7 migration, hot‑plugging, reindexing, high‑availability design, consistency guarantees, operational optimizations, and future roadmap.

Qunar Tech Salon

Qunar's domestic ticket after‑sales services need complex queries that span many MySQL tables. To serve these scenarios, the team built a data synchronization platform that aggregates MySQL data into Elasticsearch and provides low‑latency, eventually consistent queries.

Platform Overview

The platform consists of three layers: a data‑sync module (built on Otter, Canal, and a custom DTS system); a data middle platform called Crab, which offers unified ES read/write, authentication, and Hystrix‑based flow control; and a management module for configuration, node lifecycle, and traffic distribution.

Key Components

Otter: an Alibaba open‑source distributed DB sync system, extended to publish messages to Kafka.

DTS: implements an SFTL pipeline (Source → Filter → Transform → Load) with Node, task, and DB reverse‑lookup components.

Crab: provides ES read/write APIs, unified auth, circuit breaking, and traffic shaping keyed on the appcode + index dimensions.

Management: maintains sync configurations, DTS node status, Crab auth and rate‑limit settings, and ES cluster flow control.
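
The Source → Filter → Transform → Load flow that DTS implements can be sketched as below. This is a minimal illustration, not the DTS code: `ChangeEvent`, `run_pipeline`, and the table and field names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ChangeEvent:
    """Hypothetical shape of one row change read from the source."""
    table: str
    order_id: int
    row: dict


def run_pipeline(
    source: Iterable[ChangeEvent],
    keep: Callable[[ChangeEvent], bool],
    transform: Callable[[ChangeEvent], dict],
    load: Callable[[dict], None],
) -> int:
    """Pull events from the source, drop filtered ones, transform and load the rest.

    Returns the number of documents handed to the load step.
    """
    loaded = 0
    for event in source:
        if not keep(event):
            continue
        load(transform(event))
        loaded += 1
    return loaded


# Usage: only order_info changes are turned into documents.
events = [
    ChangeEvent("order_info", 1, {"status": "PAID"}),
    ChangeEvent("audit_log", 1, {"msg": "ignored"}),
    ChangeEvent("order_info", 2, {"status": "REFUNDED"}),
]
sink = []
count = run_pipeline(
    source=events,
    keep=lambda e: e.table == "order_info",
    transform=lambda e: {"_id": e.order_id, **e.row},
    load=sink.append,
)
```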

Technical Evolution Background

With over ten business lines and fourteen ES indices, the platform faced four major pain points: ES cluster single‑point failures, inability of Otter to switch DB IPs automatically, unclear end‑to‑end monitoring, and index‑level fault propagation. The goals were flexible scaling for ES5‑to‑ES7 migration and high‑availability across the whole sync chain.

Evolution Practices

ES 5.x and 7.x Parallel Support: The gateway now detects the ES version and routes requests to the appropriate endpoint, supporting both versions simultaneously.
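
One way to picture version-aware routing is shown below. Elasticsearch reports its version under `version.number` in the response to `GET /`, and ES 7 addresses documents as `/{index}/_doc/{id}` where ES 5 uses `/{index}/{type}/{id}`. The function names and the default type name `doc` are assumptions for illustration; the gateway's real routing logic is not described in the source.

```python
def major_version(info: dict) -> int:
    """Extract the major version from the JSON body returned by `GET /`."""
    return int(info["version"]["number"].split(".")[0])


def document_path(major: int, index: str, doc_id: str, doc_type: str = "doc") -> str:
    """Build the document URL path for the given cluster major version.

    Only the 5.x and 7.x cases from the article are considered here.
    """
    if major >= 7:
        # ES 7 is typeless: the literal `_doc` replaces the mapping type.
        return f"/{index}/_doc/{doc_id}"
    # ES 5 still requires the mapping type in the path.
    return f"/{index}/{doc_type}/{doc_id}"
```

A gateway could cache the detected major version per cluster and call `document_path` for each request.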

Hot‑Plugging ES Clusters: A step‑by‑step procedure (reindex, Crab write, diff back‑fill, validation, query switch) enables smooth migration and failover.

REST API and DSL Differences: The Elasticsearch low‑level Java REST client was chosen for compatibility. The Search, Document, and Script APIs were aligned across versions, with special handling for nested types and script storage.
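
A well-known example of such a response difference is `hits.total`: ES 5 returns a plain integer, while ES 7 returns an object with `value` and `relation` by default. The sketch below normalizes both shapes; `total_hits` is a hypothetical helper, not part of Crab.

```python
def total_hits(response: dict) -> int:
    """Return the hit count from a search response of either version."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # ES 7 shape: {"value": 1234, "relation": "eq" | "gte"}.
        # With relation "gte" the value is only a lower bound.
        return int(total["value"])
    # ES 5 shape: a bare integer.
    return int(total)
```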

Reindex Example:

curl -H "Content-Type: application/json" -XPOST http://ip:port/_reindex -d'{
    "source": {
        "remote": {
            "host": "http://ip:port"
        },
        "index": "order_info_beta_tts8"
    },
    "dest": {
        "index": "order_info_beta_tts8"
    }
}'
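
The same request body can be assembled programmatically, which is handy when many indices are migrated. The sketch below is an illustration under assumptions: `build_reindex_body` is a hypothetical helper, and the optional `size` (scroll batch size) and `query` fields are standard `_reindex` source options that the original example does not use. Note that the destination cluster must list the remote host in `reindex.remote.whitelist` for a remote reindex to be accepted.

```python
import json
from typing import Optional


def build_reindex_body(
    remote_host: str,
    index: str,
    dest_index: Optional[str] = None,
    batch_size: Optional[int] = None,
    query: Optional[dict] = None,
) -> dict:
    """Build a `_reindex` body that pulls `index` from a remote cluster."""
    source = {"remote": {"host": remote_host}, "index": index}
    if batch_size is not None:
        source["size"] = batch_size  # documents per scroll batch
    if query is not None:
        source["query"] = query  # restrict the copy to matching documents
    return {"source": source, "dest": {"index": dest_index or index}}


# Usage with a placeholder host name.
body = build_reindex_body("http://old-es:9200", "order_info_beta_tts8", batch_size=500)
payload = json.dumps(body)
```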

When recent data needs to be synchronized, Canal offset migration and scheduled diff tasks provide partial or full back‑fill.

High Availability Design

The sync chain (Otter → Kafka → DTS → Crab → ES) achieves HA at each layer: Otter runs pipelines on multiple nodes with master‑slave Canal; Kafka uses replicated partitions; DTS runs multiple consumer nodes; Crab isolates traffic per index via Hystrix thread pools; ES indices are stored in multiple clusters with load‑balancing.
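
The per-index isolation idea can be illustrated with a small bulkhead. Crab uses Hystrix thread pools; the sketch below swaps in one semaphore per index to show the same effect in a few lines: when one index exhausts its slots, calls to it are rejected with a fallback while other indices are unaffected. `IndexBulkhead` and the index names are hypothetical.

```python
import threading
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class IndexBulkhead:
    """Caps concurrent calls per index so one slow index cannot starve the others."""

    def __init__(self, max_concurrent: int) -> None:
        self._max = max_concurrent
        self._lock = threading.Lock()
        self._slots: Dict[str, threading.BoundedSemaphore] = {}

    def _slot(self, index: str) -> threading.BoundedSemaphore:
        # Create the semaphore for an index lazily, under a lock.
        with self._lock:
            if index not in self._slots:
                self._slots[index] = threading.BoundedSemaphore(self._max)
            return self._slots[index]

    def call(self, index: str, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        slot = self._slot(index)
        if not slot.acquire(blocking=False):
            # No free slot for this index: fail fast instead of queueing.
            return fallback()
        try:
            return fn()
        finally:
            slot.release()
```

Failing fast on a saturated index keeps callers of healthy indices from waiting behind a backlog that belongs to someone else.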

Data Consistency Guarantees

Ordered processing across the chain ensures per‑order data stays sequential.
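
A common way to keep per-order events sequential is to route every event for the same order to the same partition, so a single consumer sees them in order. The sketch below assumes that approach; `partition_for` is hypothetical and uses CRC32 for a stable hash, which is not the hash Kafka's default partitioner uses.

```python
import zlib


def partition_for(order_id: str, num_partitions: int) -> int:
    """Map an order id to a partition; the same id always maps to the same partition."""
    return zlib.crc32(order_id.encode("utf-8")) % num_partitions
```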

Failed writes are sent to a retry Kafka topic and reprocessed after DB reverse‑lookup.
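
The retry path might look like the sketch below. Re-reading the row from the database before writing means a retried message carries the current state rather than a stale payload. Everything here is an assumption for illustration: the message shape, the callback names, the attempt limit, and the choice to skip rows that no longer exist.

```python
from typing import Callable, Optional


def process_retry(
    message: dict,
    lookup_row: Callable[[int], Optional[dict]],
    write_to_es: Callable[[dict], object],
    requeue: Callable[[dict], object],
    max_attempts: int = 3,
) -> str:
    """Handle one message from the retry topic and report what happened."""
    # Reverse lookup: fetch the current row instead of trusting the old payload.
    row = lookup_row(message["order_id"])
    if row is None:
        return "skipped"  # assumption: nothing to write if the row is gone
    try:
        write_to_es(row)
        return "written"
    except Exception:  # broad on purpose in this sketch
        attempt = message.get("attempt", 1) + 1
        if attempt > max_attempts:
            return "abandoned"  # a real system would alert or dead-letter here
        requeue({"order_id": message["order_id"], "attempt": attempt})
        return "requeued"
```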

Diff‑based back‑fill tasks periodically reconcile minute‑level discrepancies for critical indices.
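
The core of a diff task is a comparison between what the database holds and what the index holds for a recent time window. The sketch below assumes each side can be reduced to a map from document id to last-update timestamp; `find_stale_ids` is a hypothetical helper, and the source does not say which fields the real diff compares.

```python
from typing import Dict, List


def find_stale_ids(db_rows: Dict[int, int], es_docs: Dict[int, int]) -> List[int]:
    """Return ids that are missing from ES or older in ES than in the database.

    Both arguments map a document id to its last-update timestamp.
    """
    stale = []
    for doc_id, db_updated in db_rows.items():
        es_updated = es_docs.get(doc_id)
        if es_updated is None or es_updated < db_updated:
            stale.append(doc_id)
    return sorted(stale)
```

The returned ids would then be fed back into the sync chain for back-fill.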

Operational Optimizations

Binlog deduplication by service‑order key reduces redundant writes.
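
Deduplication by key can be sketched as follows, assuming events arrive in batches and only the latest event per service order matters because the writer re-reads current state. `dedupe_by_order` and the `order_id` field name are hypothetical.

```python
from typing import Dict, List


def dedupe_by_order(events: List[dict]) -> List[dict]:
    """Keep only the last event seen for each order id within one batch.

    Output is ordered by the first appearance of each order id, because a
    dict keeps a key's original position when its value is overwritten.
    """
    latest: Dict[int, dict] = {}
    for event in events:
        latest[event["order_id"]] = event
    return list(latest.values())
```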

A custom MySQL master‑slave switch handles the PXC architecture.

Otter's batch size is adjusted dynamically to avoid network saturation during large DDL operations.
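
The source does not describe the adjustment rule, so the sketch below shows one plausible policy only: halve the batch size when the last batch exceeded a byte budget, otherwise grow it by about 10%, within fixed bounds. `next_batch_size` and all of its parameters and defaults are assumptions.

```python
def next_batch_size(
    current: int,
    batch_bytes: int,
    byte_budget: int,
    min_size: int = 100,
    max_size: int = 5000,
) -> int:
    """Pick the next batch size from the size in bytes of the last batch."""
    if batch_bytes > byte_budget:
        # Over budget: back off quickly to relieve the network.
        return max(min_size, current // 2)
    # Under budget: recover gradually.
    return min(max_size, current + max(1, current // 10))
```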

Summary and Future Plans

The migration reduced query latency from 68 ms to 21 ms and write latency from 34 ms to 6 ms. Incident handling (ES node failure, Kafka disk full) demonstrated the platform's resilience. Future work includes making DTS aggregation fully configurable and automating failover migrations.

Recruitment Notice

Qunar is hiring interns to senior engineers across multiple positions; interested candidates are invited to apply.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.