How to Build a Scalable Elasticsearch Sync Framework for Real-Time Business Search
This article explains Ctrip's design and implementation of a flexible Elasticsearch data‑synchronization framework that handles full, incremental, ID‑based, and time‑based syncing from multiple sources, addressing the shortcomings of existing tools and simplifying complex data assembly for business search.
Background
Elasticsearch is widely used at Ctrip for log platforms and various business search and recommendation scenarios. This article shares the thinking and practice of data synchronization for business search.
Current Situation
Requirements include full and incremental data extraction from Hive, MySQL, SOA services, MQ, etc., optional transformations, and near‑real‑time sync to ES for user search. Example schema: article table, article_tag table, tag names also indexed.
Traditional hand‑coded assembly is cumbersome, especially when many fields and complex logic are involved.
Problems with Existing Tools
Configuration‑based tools cannot use encrypted DB credentials in production.
They lack flexible data transformation before pushing to ES.
Credentials for ES are exposed in config files.
Complex data assembly becomes more verbose than code.
Incremental sources like MQ cannot be handled.
Standalone CLI tools cannot be integrated into Java jobs.
Design Overview
The proposed framework addresses the above gaps while keeping a configuration‑driven approach for simple cases.
Synchronization Modes
Full sync – creates a new index from scratch, adjusts mapping (replicas=0, refresh_interval=-1), adds _indexTime, performs force merge, restores settings, validates document count, and swaps alias.
MQ incremental – consumes change events from QMQ (originating from Otter MySQL binlog) and updates ES accordingly.
ID incremental – receives a list of IDs, fetches full documents, indexes them, and deletes missing IDs from ES.
Time incremental – tracks last update timestamp per index and periodically pulls new rows; not recommended for heavy DB load.
Component Design
Components are organized by dimension:
Runner : entry point, parses parameters, creates Executor, loads Rules.
Query : builds SQL, reads DB, invokes Executors and Plugins; supports Groovy scripts for pre‑processing.
PluginManager : loads plugins; built‑in plugins include Assoc (join), Map (field mapping), Filter (data cleaning) with optional caching.
Executor : two implementations – IndexExecutor bulk‑writes to ES and handles alias switching; PersistExecutor writes to a flat DB table for later sync.
RuleManager/Rule Loader : loads validation rules from QConfig or local resources, providing early error detection.
Implementation Example
Typical Java code for assembling tag information:
List<Long> tagIds = articleTagDao.query("select tagId from article_tags where articleId=?", articleId);
List<TagPojo> tags = tagDao.query("select id, name from tags where id in (?)");
ArticleInEs articleInEs = new ArticleInEs();
articleInEs.setTagIds(tagIds);
articleInEs.setTagNames(
tags.stream()
.filter(tag -> tagIds.contains(tag.getId()))
.map(TagPojo::getName)
.collect(Collectors.toList()));Conclusion
The framework has been deployed for dozens of business indices, allowing developers to focus on business logic rather than repetitive sync code. It supports both ES and flat‑table outputs, making it suitable for various data‑sync scenarios beyond the component’s core scope.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
