Design and Implementation of XFlink: A Flink‑Based Data Migration System on Yarn
The article describes the evolution from the legacy XDATA tool to the new XFlink system, detailing its architecture, core plugins, parser and deployment modules, resource management with Yarn, monitoring via Prometheus and Grafana, and planned enhancements such as Flink SQL configuration and modular plugins.
Background: In 2015 the data center built XDATA, a first‑generation data‑migration tool that later evolved into a master‑worker architecture with limited scalability.
Problems of the old system: Resource under‑utilization due to dedicated workers, poor parallelism causing long task runtimes (up to 20 hours), and heavy operational overhead because of static resource allocation and complex release procedures.
New system introduction: XFlink was created to address these issues by leveraging Yarn for dynamic resource isolation and scaling, and Flink as the execution engine for its native parallelism, fault tolerance, and monitoring capabilities. The system name follows the existing naming convention.
Core module – Plugin: Implements read/write plugins based on Flink's InputFormat and OutputFormat, ensuring write idempotency via primary keys and enhancing read parallelism with strategies for Elasticsearch and other sources. It also provides statistics via Flink accumulators, archives abnormal data, and applies rate limiting using Guava's RateLimiter.
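The two plugin-side ideas above, primary-key idempotency and write-side rate limiting, can be sketched in plain Java. This is an illustration only, not XFlink's actual plugin API: the sink below upserts on a primary key (so replaying a record after a failure is harmless) and throttles with a simple token bucket standing in for Guava's RateLimiter; the class and method names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a write plugin's core behavior (hypothetical API, not XFlink's):
// - idempotency: records are upserted by primary key, so retries overwrite
//   rather than duplicate
// - rate limiting: a minimal token bucket throttles the write rate
class ThrottledUpsertSink {
    private final Map<String, String> store = new HashMap<>(); // keyed by primary key
    private final long nanosPerPermit;
    private long nextFreeTime = System.nanoTime();

    ThrottledUpsertSink(double permitsPerSecond) {
        this.nanosPerPermit = (long) (1_000_000_000L / permitsPerSecond);
    }

    /** Waits for a permit if writes are ahead of the budget, then upserts. */
    void write(String primaryKey, String value) {
        long now = System.nanoTime();
        if (nextFreeTime > now) {
            try {
                Thread.sleep((nextFreeTime - now) / 1_000_000L + 1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        nextFreeTime = Math.max(nextFreeTime, now) + nanosPerPermit;
        store.put(primaryKey, value); // replaying the same key just overwrites
    }

    int size() { return store.size(); }
}
```

In the real system the upsert would target the destination store (e.g. a primary-key write in the database) and the limiter would be Guava's `RateLimiter.acquire()` inside the OutputFormat's `writeRecord`.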
Core module – Parser: Transforms user‑provided plugin configurations into concrete Flink job graphs, generating the full data‑migration logic that is later submitted to Yarn.
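To make the parser's role concrete, here is a minimal sketch of the config-to-plan step: a user-supplied key/value configuration is turned into a plan description. All names (`reader`, `writer`, `parallelism`, the class itself) are assumptions for illustration; the real parser would instantiate InputFormat/OutputFormat plugins and wire them into a Flink execution environment instead of producing a string.

```java
import java.util.Map;

// Illustrative sketch only: how a parser might turn a user-supplied plugin
// configuration into a job plan before building the real Flink job graph.
class JobPlanParser {
    static String parse(Map<String, String> config) {
        String reader = config.getOrDefault("reader", "unknown");
        String writer = config.getOrDefault("writer", "unknown");
        int parallelism = Integer.parseInt(config.getOrDefault("parallelism", "1"));
        // The real system would construct the reader/writer plugins here and
        // attach them to a Flink environment with the requested parallelism.
        return String.format("source[%s] -> sink[%s] (parallelism=%d)",
                reader, writer, parallelism);
    }
}
```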
Deploy module: Wraps Flink's submission scripts into a long‑running service that handles task submission in single‑job or detached mode, supports multi‑Yarn‑cluster deployment by loading distinct configuration files, and creates Yarn clients with proper tenant impersonation.
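A deploy service along these lines might assemble a detached, per-job submission command and point it at a per-cluster configuration directory. The sketch below uses the standard `flink run -m yarn-cluster -d` form and the standard `FLINK_CONF_DIR`/`HADOOP_CONF_DIR` environment variables; the directory layout and cluster names are hypothetical, and tenant impersonation is omitted.

```java
// Sketch of detached Flink-on-Yarn submission with per-cluster configs.
// Paths under /etc/xflink are invented for illustration.
class SubmitCommandBuilder {
    static ProcessBuilder build(String clusterName, String jobJar) {
        ProcessBuilder pb = new ProcessBuilder(
                "flink", "run",
                "-m", "yarn-cluster", // launch a per-job cluster on Yarn
                "-d",                 // detached: return once the job is submitted
                jobJar);
        // Each target Yarn cluster gets its own Flink and Hadoop config dirs,
        // which is how one service can submit to multiple clusters.
        pb.environment().put("FLINK_CONF_DIR", "/etc/xflink/conf/" + clusterName);
        pb.environment().put("HADOOP_CONF_DIR", "/etc/xflink/hadoop/" + clusterName);
        return pb;
    }
}
```

Detached mode matters for a long-running submission service: the service returns immediately after handing the job to Yarn instead of blocking on each job's lifetime.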
Monitoring: Uses Prometheus + PushGateway + Grafana for both Flink task metrics (via Flink's built‑in metrics system) and submission‑module metrics (via Dropwizard), tracking job latency, queue sizes, GC, and memory usage.
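On the Flink side, the Prometheus + PushGateway path is typically enabled through Flink's built-in `PrometheusPushGatewayReporter`. The fragment below shows the standard `flink-conf.yaml` keys for it; the host and job name are placeholders, not values from the article.

```yaml
# flink-conf.yaml — push Flink task metrics to a Prometheus PushGateway
# (host and jobName below are placeholders)
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.example.internal
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: xflink-job
metrics.reporter.promgateway.randomJobNameSuffix: true   # avoid collisions between jobs
metrics.reporter.promgateway.deleteOnShutdown: false     # keep last metrics after the job ends
```

Grafana then dashboards these series alongside the submission module's Dropwizard metrics.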
Future work: Plans include configuring tasks via Flink SQL for easier user interaction and extending the plugin framework with SPI‑based modular plugins that allow business teams to develop custom connectors after review.
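The SPI-based extension point mentioned above usually means Java's `ServiceLoader` mechanism: business teams implement a connector interface, register the implementation under `META-INF/services`, and the framework discovers it at runtime without code changes. A minimal sketch, with an invented interface name:

```java
import java.util.ServiceLoader;

// Hypothetical connector SPI — the interface name is an assumption.
interface MigrationPlugin {
    String name();
}

class PluginRegistry {
    // Discovers every MigrationPlugin implementation registered on the
    // classpath under META-INF/services/MigrationPlugin.
    static ServiceLoader<MigrationPlugin> discover() {
        return ServiceLoader.load(MigrationPlugin.class);
    }
}
```

With no providers registered, discovery simply yields an empty iterator, so the framework degrades gracefully when no custom connectors are installed.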
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Tongcheng Travel Technology Center
