Backend Development 10 min read

How to Build a Scalable Elasticsearch Sync Framework for Real-Time Business Search

This article explains Ctrip's design and implementation of a flexible Elasticsearch data‑synchronization framework that handles full, incremental, ID‑based, and time‑based syncing from multiple sources, addressing the shortcomings of existing tools and simplifying complex data assembly for business search.

dbaplus Community

Jan 18, 2021

How to Build a Scalable Elasticsearch Sync Framework for Real-Time Business Search

Background

Elasticsearch is widely used at Ctrip for log platforms and various business search and recommendation scenarios. This article shares the thinking and practice of data synchronization for business search.

Current Situation

Requirements include full and incremental data extraction from Hive, MySQL, SOA services, MQ, etc., optional transformations, and near‑real‑time sync to ES for user search. Example schema: article table, article_tag table, tag names also indexed.

Traditional hand‑coded assembly is cumbersome, especially when many fields and complex logic are involved.

Problems with Existing Tools

Configuration‑based tools cannot use encrypted DB credentials in production.

They lack flexible data transformation before pushing to ES.

Credentials for ES are exposed in config files.

Complex data assembly becomes more verbose than code.

Incremental sources like MQ cannot be handled.

Standalone CLI tools cannot be integrated into Java jobs.

Design Overview

The proposed framework addresses the above gaps while keeping a configuration‑driven approach for simple cases.

Synchronization Modes

Full sync – creates a new index from scratch, adjusts mapping (replicas=0, refresh_interval=-1), adds _indexTime, performs force merge, restores settings, validates document count, and swaps alias.

MQ incremental – consumes change events from QMQ (originating from Otter MySQL binlog) and updates ES accordingly.

ID incremental – receives a list of IDs, fetches full documents, indexes them, and deletes missing IDs from ES.

Time incremental – tracks last update timestamp per index and periodically pulls new rows; not recommended for heavy DB load.

Component Design

Components are organized by dimension:

Runner : entry point, parses parameters, creates Executor, loads Rules.

Query : builds SQL, reads DB, invokes Executors and Plugins; supports Groovy scripts for pre‑processing.

PluginManager : loads plugins; built‑in plugins include Assoc (join), Map (field mapping), Filter (data cleaning) with optional caching.

Executor : two implementations – IndexExecutor bulk‑writes to ES and handles alias switching; PersistExecutor writes to a flat DB table for later sync.

RuleManager/Rule Loader : loads validation rules from QConfig or local resources, providing early error detection.

Implementation Example

Typical Java code for assembling tag information:

List<Long> tagIds = articleTagDao.query("select tagId from article_tags where articleId=?", articleId);
List<TagPojo> tags = tagDao.query("select id, name from tags where id in (?)");
ArticleInEs articleInEs = new ArticleInEs();
articleInEs.setTagIds(tagIds);
articleInEs.setTagNames(
    tags.stream()
        .filter(tag -> tagIds.contains(tag.getId()))
        .map(TagPojo::getName)
        .collect(Collectors.toList()));

Conclusion

The framework has been deployed for dozens of business indices, allowing developers to focus on business logic rather than repetitive sync code. It supports both ES and flat‑table outputs, making it suitable for various data‑sync scenarios beyond the component’s core scope.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

java Indexing Elasticsearch ETL

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.