How to Build a Scalable Elasticsearch Sync Framework for Real-Time Business Search

This article explains Ctrip's design and implementation of a flexible Elasticsearch data‑synchronization framework that handles full, incremental, ID‑based, and time‑based syncing from multiple sources, addressing the shortcomings of existing tools and simplifying complex data assembly for business search.

dbaplus Community

Background

Elasticsearch is widely used at Ctrip for log platforms and for various business search and recommendation scenarios. This article describes the thinking and practice behind data synchronization for business search.

Current Situation

Requirements include full and incremental data extraction from sources such as Hive, MySQL, SOA services, and MQ, optional transformations, and near‑real‑time sync to ES for user search. Example schema: an article table and an article_tag table, with tag names indexed as well.

Traditional hand‑coded assembly is cumbersome, especially when many fields and complex logic are involved.

Problems with Existing Tools

Configuration‑based tools cannot use encrypted DB credentials in production.

They lack flexible data transformation before pushing to ES.

Credentials for ES are exposed in config files.

Expressing complex data assembly in configuration becomes more verbose than writing code.

Incremental sources like MQ cannot be handled.

Standalone CLI tools cannot be integrated into Java jobs.

Design Overview

The proposed framework addresses the above gaps while keeping a configuration‑driven approach for simple cases.

Framework architecture diagram

Synchronization Modes

Full sync – creates a new index from scratch, adjusts index settings for bulk loading (replicas=0, refresh_interval=-1), adds an _indexTime field, performs a force merge, restores the settings, validates the document count, and swaps the alias.

MQ incremental – consumes change events from QMQ (originating from Otter MySQL binlog) and updates ES accordingly.

ID incremental – receives a list of IDs, fetches full documents, indexes them, and deletes missing IDs from ES.

Time incremental – tracks last update timestamp per index and periodically pulls new rows; not recommended for heavy DB load.
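The full sync steps listed above can be sketched as a fixed sequence of calls. This is a minimal illustration, not the framework's code: the `IndexOps` interface, its method names, and the `RecordingOps` stand-in are all hypothetical, and no real Elasticsearch client API is used. Adding the _indexTime field is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction over the index operations a full sync needs.
// These method names are illustrative; they are not an Elasticsearch client API.
interface IndexOps {
    void createIndex(String index);
    void putSettings(String index, int replicas, String refreshInterval);
    long bulkIndexAll(String index);   // writes every source row, returns rows written
    void forceMerge(String index);
    long countDocs(String index);
    void swapAlias(String alias, String newIndex);
}

final class FullSyncSketch {
    // Runs the steps in the listed order; returns true only if the alias was moved.
    static boolean run(IndexOps ops, String alias, String newIndex,
                       int normalReplicas, String normalRefresh) {
        ops.createIndex(newIndex);
        ops.putSettings(newIndex, 0, "-1");                        // bulk-load settings
        long written = ops.bulkIndexAll(newIndex);
        ops.forceMerge(newIndex);
        ops.putSettings(newIndex, normalReplicas, normalRefresh);  // restore settings
        // A real cluster may need an explicit refresh before this count is accurate.
        if (ops.countDocs(newIndex) != written) {
            return false;                                          // old index keeps serving
        }
        ops.swapAlias(alias, newIndex);
        return true;
    }
}

// In-memory stand-in that records calls so the flow can run without a cluster.
final class RecordingOps implements IndexOps {
    final List<String> calls = new ArrayList<>();
    private final long sourceRows;
    private final long indexedDocs;

    RecordingOps(long sourceRows, long indexedDocs) {
        this.sourceRows = sourceRows;
        this.indexedDocs = indexedDocs;
    }
    public void createIndex(String index) { calls.add("create " + index); }
    public void putSettings(String index, int replicas, String refreshInterval) {
        calls.add("settings replicas=" + replicas + " refresh=" + refreshInterval);
    }
    public long bulkIndexAll(String index) { calls.add("bulk"); return sourceRows; }
    public void forceMerge(String index) { calls.add("forcemerge"); }
    public long countDocs(String index) { calls.add("count"); return indexedDocs; }
    public void swapAlias(String alias, String newIndex) {
        calls.add("alias " + alias + " -> " + newIndex);
    }
}
```

Calling `FullSyncSketch.run(new RecordingOps(3, 3), "article", "article_v2", 1, "1s")` moves the alias, while a count mismatch such as `new RecordingOps(3, 2)` leaves it untouched. Validating before the swap means a failed load never becomes visible to search traffic; the index names and restored setting values here are placeholders.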

Component Design

Components are organized by dimension:

Runner : entry point, parses parameters, creates Executor, loads Rules.

Query : builds SQL, reads DB, invokes Executors and Plugins; supports Groovy scripts for pre‑processing.

PluginManager : loads plugins; built‑in plugins include Assoc (join), Map (field mapping), Filter (data cleaning) with optional caching.

Executor : two implementations – IndexExecutor bulk‑writes to ES and handles alias switching; PersistExecutor writes to a flat DB table for later sync.

RuleManager/Rule Loader : loads validation rules from QConfig or local resources, providing early error detection.
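To make the plugin idea concrete, here is one way a chain of Map and Filter style plugins might be applied to rows before they reach an Executor. The `RowPlugin` interface, the class names, and the use of `null` to drop a row are assumptions for illustration; the article does not show the framework's actual plugin contract.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical plugin contract: return the transformed row, or null to drop it.
interface RowPlugin {
    Map<String, Object> apply(Map<String, Object> row);
}

// Map-style plugin: renames source columns to index field names.
final class RenamePlugin implements RowPlugin {
    private final Map<String, String> renames;

    RenamePlugin(Map<String, String> renames) { this.renames = renames; }

    public Map<String, Object> apply(Map<String, Object> row) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            // Columns without a rename entry keep their original name.
            out.put(renames.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }
}

// Filter-style plugin: keeps only rows that satisfy the predicate.
final class KeepIfPlugin implements RowPlugin {
    private final Predicate<Map<String, Object>> keep;

    KeepIfPlugin(Predicate<Map<String, Object>> keep) { this.keep = keep; }

    public Map<String, Object> apply(Map<String, Object> row) {
        return keep.test(row) ? row : null;
    }
}

final class PluginChain {
    // Applies plugins in order to each row; a null result drops the row.
    static List<Map<String, Object>> run(List<Map<String, Object>> rows,
                                         List<RowPlugin> plugins) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> current = row;
            for (RowPlugin plugin : plugins) {
                current = plugin.apply(current);
                if (current == null) break;
            }
            if (current != null) out.add(current);
        }
        return out;
    }
}
```

Keeping each plugin a small row-to-row function lets simple cases be assembled from configuration, while unusual logic can still be written as one more plugin.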

Implementation Example

Typical Java code for assembling tag information:

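// Look up the tag IDs linked to this article.
List<Long> tagIds = articleTagDao.query("select tagId from article_tags where articleId=?", articleId);
// Fetch the matching tag rows. The IN-list parameter was missing in the original call;
// this assumes the DAO expands a list argument into the placeholder.
List<TagPojo> tags = tagDao.query("select id, name from tags where id in (?)", tagIds);
// Copy both the IDs and the resolved names onto the document to be indexed.
ArticleInEs articleInEs = new ArticleInEs();
articleInEs.setTagIds(tagIds);
articleInEs.setTagNames(
    tags.stream()
        .filter(tag -> tagIds.contains(tag.getId()))
        .map(TagPojo::getName)
        .collect(Collectors.toList()));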
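The snippet above resolves tags for one article at a time. For a batch of articles, an Assoc-style join could instead group already-fetched rows in memory. The sketch below assumes the article_tag links and the tag names have been loaded beforehand; `TagAssembler` and its method are hypothetical names, not part of the framework.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TagAssembler {
    // articleTagRows: each element is {articleId, tagId}; tagNamesById: tag id -> tag name.
    // Returns articleId -> tag names, in the order the links were supplied.
    static Map<Long, List<String>> tagNamesByArticle(List<long[]> articleTagRows,
                                                     Map<Long, String> tagNamesById) {
        Map<Long, List<String>> result = new LinkedHashMap<>();
        for (long[] row : articleTagRows) {
            String name = tagNamesById.get(row[1]);
            if (name == null) {
                continue; // skip links whose tag was not loaded
            }
            result.computeIfAbsent(row[0], id -> new ArrayList<>()).add(name);
        }
        return result;
    }
}
```

With links `{1, 10}`, `{1, 11}`, `{2, 11}` and names `10 -> "java"`, `11 -> "search"`, article 1 maps to both names and article 2 to `"search"` only. Grouping in memory trades a little heap for far fewer database round trips during a large sync.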

Conclusion

The framework has been deployed for dozens of business indices, allowing developers to focus on business logic rather than repetitive sync code. It supports both ES and flat‑table outputs, making it suitable for various data‑sync scenarios beyond the component’s core scope.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
