How Qunar Built a High‑Availability, Multi‑Version Elasticsearch Sync Platform

This article details Qunar's design and implementation of a data synchronization platform that aggregates MySQL data into Elasticsearch, supporting parallel ES 5.x/7.x clusters, hot-swap upgrades, flexible scaling, and end-to-end consistency backed by high-availability mechanisms.

dbaplus Community

Platform Overview

The data synchronization platform aggregates MySQL data into Elasticsearch (ES) and provides a unified query gateway for complex after-sales queries on orders, tickets, PNRs, itineraries, and airlines.

Architecture

Data Sync Module: uses Alibaba Otter to capture MySQL binlog changes, Kafka as the message queue, and a custom Data Transfer Service (DTS) for field mapping and filtering.

Data Middle-Platform (Crab): the Java-based crab-client abstracts the ES DSL and handles reads and writes, unified authentication, Hystrix circuit breaking, and traffic routing by appcode + ES index.

Management Platform: an optional UI for configuring sync jobs, managing DTS nodes, and handling ES cluster authentication, rate limiting, and traffic distribution.
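The DTS mapping and filtering step can be illustrated with a minimal Java sketch. It assumes a simplified row-change shape, a per-job table allow-list, and a column-to-field mapping; DtsMapper, RowChange, and all table and field names are hypothetical, not Qunar's DTS code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of DTS-style filtering and mapping; DtsMapper, RowChange
// and the field names are illustrative, not Qunar's actual DTS code.
final class DtsMapper {
    /** Simplified binlog row change as it might be consumed from Kafka (assumed shape). */
    record RowChange(String db, String table, Map<String, Object> columns) {}

    private final Set<String> allowedTables;          // filter: tables this sync job handles
    private final Map<String, String> columnToField;  // mapping: MySQL column -> ES field

    DtsMapper(Set<String> allowedTables, Map<String, String> columnToField) {
        this.allowedTables = allowedTables;
        this.columnToField = columnToField;
    }

    /** Returns the ES document for a change, or empty if the table is filtered out. */
    Optional<Map<String, Object>> toDocument(RowChange change) {
        if (!allowedTables.contains(change.table())) {
            return Optional.empty();
        }
        Map<String, Object> doc = new LinkedHashMap<>();
        for (Map.Entry<String, Object> col : change.columns().entrySet()) {
            String field = columnToField.get(col.getKey());
            if (field != null) {                      // columns without a mapping are dropped
                doc.put(field, col.getValue());
            }
        }
        return Optional.of(doc);
    }

    public static void main(String[] args) {
        DtsMapper mapper = new DtsMapper(
                Set.of("order_info"),
                Map.of("order_no", "orderNo", "status", "orderStatus"));
        Map<String, Object> cols = new LinkedHashMap<>();
        cols.put("order_no", "A123");
        cols.put("status", 2);
        cols.put("internal_flag", true);
        System.out.println(mapper.toDocument(new RowChange("db1", "order_info", cols)));
    }
}
```

Keeping the mapping as data rather than code is one possible route toward the configuration-driven DTS aggregation listed under Future Work.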

Motivation for ES 7.x Migration

By Q2 2021 the platform served 10+ business lines and 14+ ES indices. The migration from ES 5.x to ES 7.x was driven by four pain points: a single ES node failure could disrupt service, sync links were inflexible, monitoring and circuit breaking were missing, and a failure in one index could affect others. The goals were smooth ES scaling and migration, plus high availability across the full sync link.

Key Design and Practices

1. Parallel ES 5.x/7.x Deployment & Hot‑Swap

Elastic Rest Client: adopted the low-level Java REST client, which is compatible with all ES versions.

REST APIs: implemented version-aware endpoints for the Search, Scroll, Document, and Script APIs.

Query DSL & Scripting: mapped DSL differences (e.g., match behavior), handled updates to nested-type fields, and switched to inline JSON scripts for ES 7.x.

Response Normalization: by default, ES 7.x stops counting hits at 10,000 and returns hits.total as an object ({"value": N, "relation": "eq" or "gte"}) rather than a number. Adding ?track_total_hits=true to ES 7.x queries restores exact total hit counts, and the response is then reshaped into the ES 5.x format.
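The version differences above can be sketched in a few lines. This is a minimal illustration that assumes responses are already parsed into Java maps; EsCompat and its methods are hypothetical names, not crab-client's or the Elastic client's API. The paths follow the documented ES 5.x (typed) and ES 7.x (typeless, _doc) URL forms.

```java
import java.util.Map;

// Hypothetical sketch of version-aware request paths and response normalization.
// EsCompat is an illustrative name, not part of crab-client or the Elastic client.
final class EsCompat {
    enum Version { V5, V7 }

    /** Search path: ES 5.x may address index/type; ES 7.x is typeless. */
    static String searchPath(Version v, String index, String type) {
        if (v == Version.V5) {
            return "/" + index + "/" + type + "/_search";
        }
        // Without this flag ES 7.x stops counting hits at 10,000.
        return "/" + index + "/_search?track_total_hits=true";
    }

    /** Document path: ES 7.x replaces the custom type with the fixed _doc endpoint. */
    static String documentPath(Version v, String index, String type, String id) {
        return v == Version.V5
                ? "/" + index + "/" + type + "/" + id
                : "/" + index + "/_doc/" + id;
    }

    /**
     * Normalizes hits.total to the ES 5.x shape (a plain number). ES 7.x
     * returns an object such as {"value": 12345, "relation": "eq"} instead.
     */
    static long totalHits(Object rawTotal) {
        if (rawTotal instanceof Number) {
            return ((Number) rawTotal).longValue();              // ES 5.x
        }
        if (rawTotal instanceof Map) {
            Object value = ((Map<?, ?>) rawTotal).get("value");  // ES 7.x
            if (value instanceof Number) {
                return ((Number) value).longValue();
            }
        }
        throw new IllegalArgumentException("unexpected hits.total: " + rawTotal);
    }

    public static void main(String[] args) {
        System.out.println(searchPath(Version.V7, "order_info", "order"));
        System.out.println(totalHits(Map.of("value", 12345, "relation", "eq")));
    }
}
```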

Hot-swap procedure: reindex data from the old cluster → configure crab-gateway → incremental backfill → verification → query cut-over.
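The hot-swap order can be modeled as a small state machine per appcode + index route. In this sketch, which is an assumption about how such routing might work rather than Qunar's crab-gateway code, queries stay on the old cluster until every earlier step has completed, and no step can be skipped. HotSwapRouter, the stage names, and the cluster labels are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-route hot-swap stages; not Qunar's crab-gateway code.
final class HotSwapRouter {
    /** Hot-swap steps in the order the procedure runs them; REINDEX is the start. */
    enum Stage { REINDEX, GATEWAY_CONFIGURED, INCREMENTAL_BACKFILL, VERIFIED, CUT_OVER }

    private final String oldCluster;
    private final String newCluster;
    private final Map<String, Stage> stages = new ConcurrentHashMap<>();

    HotSwapRouter(String oldCluster, String newCluster) {
        this.oldCluster = oldCluster;
        this.newCluster = newCluster;
    }

    private static String route(String appcode, String index) {
        return appcode + "+" + index;
    }

    /** Moves a route exactly one step forward; skipping or repeating a step is refused. */
    synchronized void advance(String appcode, String index, Stage next) {
        Stage current = stages.getOrDefault(route(appcode, index), Stage.REINDEX);
        if (next.ordinal() != current.ordinal() + 1) {
            throw new IllegalStateException(current + " -> " + next + " skips or repeats a step");
        }
        stages.put(route(appcode, index), next);
    }

    /** Queries go to the new cluster only after the route has been cut over. */
    String readCluster(String appcode, String index) {
        Stage s = stages.getOrDefault(route(appcode, index), Stage.REINDEX);
        return s == Stage.CUT_OVER ? newCluster : oldCluster;
    }

    public static void main(String[] args) {
        HotSwapRouter router = new HotSwapRouter("es5-cluster", "es7-cluster");
        for (Stage s : new Stage[] {Stage.GATEWAY_CONFIGURED, Stage.INCREMENTAL_BACKFILL, Stage.VERIFIED}) {
            router.advance("appA", "order_info", s);
        }
        System.out.println(router.readCluster("appA", "order_info")); // still es5-cluster
        router.advance("appA", "order_info", Stage.CUT_OVER);
        System.out.println(router.readCluster("appA", "order_info")); // es7-cluster
    }
}
```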

2. Flexible Reindex & Incremental Backfill

Three backfill strategies are available:

Reindex: use the ES _reindex API to copy the full data set.

Canal Position Shift: move the Canal binlog position back to replay recent binlog events, giving near-real-time backfill.

Diff Backfill Task: periodically query the source DB, compare the results with ES, and write missing records through Crab.

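Example: a remote reindex that copies order_info_beta_tts8 from the old cluster into the new one. The request is sent to the new (destination) cluster, whose reindex.remote.whitelist setting must include the old cluster's host; ip:port values are placeholders.

curl -H "Content-Type: application/json" -XPOST "http://ip:port/_reindex" -d '{
  "source": {
    "remote": {"host": "http://ip:port"},
    "index": "order_info_beta_tts8"
  },
  "dest": {"index": "order_info_beta_tts8"}
}'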
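A diff-backfill pass reduces to comparing two snapshots of the same rows. The sketch below assumes each row carries an update timestamp that is also stored in the ES document, so a document is repaired when it is missing or older than the DB row; DiffBackfill and idsToRepair are hypothetical names, and the actual rewrite through Crab is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of one diff-backfill pass; assumes an update timestamp
// (e.g. an updated_at column) exists in both the DB row and the ES document.
final class DiffBackfill {
    /**
     * @param dbVersions id -> update time read from the source DB for the scan window
     * @param esVersions id -> update time currently stored in ES for the same ids
     * @return ids to re-read from the DB and rewrite through Crab, in sorted order
     */
    static List<String> idsToRepair(Map<String, Long> dbVersions, Map<String, Long> esVersions) {
        List<String> repair = new ArrayList<>();
        for (Map.Entry<String, Long> row : new TreeMap<>(dbVersions).entrySet()) {
            Long inEs = esVersions.get(row.getKey());
            // Missing from ES, or ES holds an older version than the DB.
            if (inEs == null || inEs < row.getValue()) {
                repair.add(row.getKey());
            }
        }
        return repair;
    }

    public static void main(String[] args) {
        Map<String, Long> db = Map.of("o1", 100L, "o2", 200L, "o3", 300L);
        Map<String, Long> es = Map.of("o1", 100L, "o2", 150L);
        System.out.println(idsToRepair(db, es)); // prints [o2, o3]: o2 is stale, o3 is missing
    }
}
```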

3. Data Consistency Guarantees

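Changes to the same entity stay in order across the whole pipeline (Otter → Kafka → DTS → Crab → ES) because every stage uses a consistent partition key such as db_name+order_id, so all changes to one order land in the same Kafka partition and are applied sequentially.

Failed writes are routed to a retry Kafka topic; DTS reprocesses them after fetching the latest state from the DB, so a delayed retry writes current data instead of a stale event.

Critical indices also run minute-level diff backfill jobs that reconcile any remaining discrepancies.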
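The two ordering mechanisms above can be sketched together. The partitioner is illustrative only: Kafka's default partitioner hashes keys with murmur2, whereas this sketch uses Arrays.hashCode to stay dependency-free. The retry path queues only the key and re-reads the DB row when retrying. OrderedSync, the in-memory deque standing in for the retry topic, and the callback types are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical sketch; an in-memory deque stands in for the retry Kafka topic.
final class OrderedSync {
    /** Partition key that keeps every change to one order together. */
    static String key(String dbName, long orderId) {
        return dbName + "+" + orderId;
    }

    /** Same key -> same partition -> one consumer applies the changes in order. */
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8)); // Kafka uses murmur2
        return Math.floorMod(hash, numPartitions);
    }

    private final Deque<String> retryTopic = new ArrayDeque<>();        // holds keys, not payloads
    private final Function<String, Map<String, Object>> loadLatestFromDb;
    private final BiConsumer<String, Map<String, Object>> writeToEs;    // throws on failure

    OrderedSync(Function<String, Map<String, Object>> loadLatestFromDb,
                BiConsumer<String, Map<String, Object>> writeToEs) {
        this.loadLatestFromDb = loadLatestFromDb;
        this.writeToEs = writeToEs;
    }

    /** Applies one event; a failed write parks the key on the retry queue. */
    void handle(String key, Map<String, Object> doc) {
        try {
            writeToEs.accept(key, doc);
        } catch (RuntimeException e) {
            retryTopic.addLast(key);
        }
    }

    /** Retries re-read the current DB row, so a late retry never writes a stale event. */
    int drainRetries() {
        int written = 0;
        while (!retryTopic.isEmpty()) {
            String key = retryTopic.pollFirst();
            try {
                writeToEs.accept(key, loadLatestFromDb.apply(key));
                written++;
            } catch (RuntimeException e) {
                retryTopic.addFirst(key);   // keep it for the next pass and stop this one
                break;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        String orderKey = key("db1", 42L);
        System.out.println(orderKey + " -> partition " + partitionFor(orderKey, 8));
    }
}
```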

4. High Availability of the Sync Chain

Otter runs on multiple nodes with pipeline isolation and master‑slave Canal.

Kafka topics have multiple partitions and replicas.

DTS spawns multiple consumer nodes per topic for failover.

Crab uses Hystrix thread pools per appcode+index to isolate load.

ES indices are replicated across clusters; routing is managed by the manager component.
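Hystrix's thread-pool isolation is a bulkhead pattern. The sketch below reproduces the idea with plain java.util.concurrent rather than the Hystrix API: each appcode + index key gets its own bounded pool, so a slow index can exhaust only its own threads and further calls for it are rejected immediately. IndexBulkheads and the pool sizes are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Bulkhead stand-in for Hystrix thread-pool isolation; not the Hystrix API.
final class IndexBulkheads {
    private final Map<String, ThreadPoolExecutor> pools = new ConcurrentHashMap<>();
    private final int threads;
    private final int queueCapacity;

    IndexBulkheads(int threads, int queueCapacity) {
        this.threads = threads;
        this.queueCapacity = queueCapacity;
    }

    private ThreadPoolExecutor poolFor(String appcode, String index) {
        return pools.computeIfAbsent(appcode + "+" + index, k -> new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),     // bounded: no unbounded backlog
                r -> {
                    Thread t = new Thread(r, "bulkhead-" + k);
                    t.setDaemon(true);                       // never keeps the JVM alive
                    return t;
                },
                new ThreadPoolExecutor.AbortPolicy()));      // reject when saturated
    }

    /** Throws RejectedExecutionException when this key's pool and queue are full. */
    <T> Future<T> submit(String appcode, String index, Callable<T> call) {
        return poolFor(appcode, index).submit(call);
    }

    void shutdown() {
        pools.values().forEach(ThreadPoolExecutor::shutdownNow);
    }

    public static void main(String[] args) throws Exception {
        IndexBulkheads bulkheads = new IndexBulkheads(2, 10);
        System.out.println(bulkheads.submit("appA", "order_info", () -> "ok").get());
        bulkheads.shutdown();
    }
}
```

Rejecting immediately, rather than queueing without bound, is what keeps one overloaded index from tying up threads that other appcode + index routes depend on.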

Operational incidents (e.g., an ES node failure or a full Kafka disk) are mitigated with the hot-swap backfill approach and manual failover.

Results

After the migration, query latency for domestic ticket searches dropped from 68 ms to 21 ms (about 69% lower) and write latency from 34 ms to 6 ms (about 82% lower). The platform now supports seamless ES version upgrades, parallel clusters, and automated backfill, although fault recovery still occasionally requires manual intervention.

Future Work

Planned improvements include configuration‑driven DTS aggregation, automated fault migration, and further reduction of manual operations to enhance scalability and ease of integration for new business lines.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
