Big Data 11 min read

Design and Implementation of an Elasticsearch Data Synchronization Framework at Ctrip

This article describes Ctrip's experience building a flexible, high‑performance Elasticsearch synchronization framework that supports full‑load, MQ‑based, ID‑based, and time‑based incremental indexing for diverse data sources such as Hive, MySQL, SOA services, and message queues.

Ctrip Technology

Nov 5, 2020

Design and Implementation of an Elasticsearch Data Synchronization Framework at Ctrip

Elasticsearch has become a popular distributed search and analytics engine, and Ctrip uses it extensively for large‑scale logging, search, and recommendation scenarios. The article focuses on the challenges and solutions for synchronizing business data to Elasticsearch.

The synchronization requirements include full and incremental extraction from various sources (Hive, MySQL, SOA services, MQ), optional data transformation, and near‑real‑time indexing so that users can search the latest information.

In the article's example, article data and associated tags are stored in article and article_tag tables, and tag names must also be indexed. A naive implementation would require cumbersome code such as:

List<Long> tagIds = articleTagDao.query("select tagId from article_tags where articleId=?", articleId);

List<TagPojo> tags = tagDao.query("select id, name from tags where id in (?)");

ArticleInEs articleInEs = new ArticleInEs();

articleInEs.setTagIds(tagIds);

articleInEs.setTagNames(tags.stream().filter(tag -> tagIds.contains(tag.getId())).map(TagPojo::getName).collect(Collectors.toList()));

Because real‑world scenarios involve many more fields and complex business logic, Ctrip needed a simple yet powerful framework to avoid repetitive code.

The article evaluates existing open‑source tools (elasticsearch‑jdbc, go‑mysql‑elasticsearch, Logstash) and identifies limitations such as reliance on plain configuration, inability to handle secure DB credentials, lack of data transformation capabilities, and poor integration with Java‑based jobs.

The proposed solution introduces a modular synchronization component with the following design:

Runner : entry point that parses parameters, creates executors, and initializes rules.

Query : core module that assembles SQL, reads from databases, and dispatches to executors; supports Groovy scripts for lightweight data processing.

PluginManager : loads and manages plugins for common data assembly patterns (e.g., association, mapping, filtering) and provides caching.

Executor : two built‑in executors – IndexExecutor pushes data to Elasticsearch via bulk API and handles index creation, alias switching, etc.; PersistExecutor writes data to flat tables for scenarios where a DB table is used as an intermediate store.

RuleManager/Rule Loader : loads configuration rules from centralized QConfig or local resources, validates them, and reports errors.

The framework supports four indexing strategies:

Full‑load: creates a new index from scratch, optimizes settings (replicas=0, refresh_interval=-1), adds an _indexTime field, performs force merge, and validates document counts before swapping.

MQ incremental: consumes change events from QMQ (derived from Otter MySQL binlog) and applies fine‑grained updates.

ID incremental: receives a list of IDs to re‑index; missing IDs trigger deletions in Elasticsearch.

Time incremental: tracks the last update timestamp per index and periodically pulls new rows; not recommended for heavy DB load.

Implementation details also cover data source handling: SQL‑based plugins for simple joins, code‑based plugins for complex SOA calls or transformations, and the ability to mix both.

In practice, the component has been used to maintain dozens of indices across multiple business lines, allowing developers to focus on business logic rather than repetitive data‑assembly code. It also supports exporting assembled data to flat tables, enabling hybrid sync scenarios such as Hive‑to‑Elasticsearch pipelines.

The article concludes with a brief note on related work (e.g., Hive‑to‑ES via ES‑Hadoop) and points to additional reading on Ctrip's microservice framework, coroutine NIO practice, Redis cross‑region sync, and other engineering topics.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java Indexing Elasticsearch Data synchronization

Written by

Ctrip Technology

Official Ctrip Technology account, sharing and discussing growth.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.