How Filing 1.0 Revolutionizes Heterogeneous Data Archiving for High‑Scale Transactions
Filing 1.0 is a no‑code, heterogeneous data‑archiving platform that unifies MySQL, HBase, Hive, and Elasticsearch, addressing massive order volumes, multi‑domain requirements, and hot‑cold data separation through a star‑shaped architecture, flexible scheduling, and a four‑component archiving engine.
1. Overview of Filing 1.0
Filing is a heterogeneous data‑source archiving platform that enables stable, efficient migration among relational databases (MySQL), HBase, Hive, Elasticsearch and other sources without writing code.
1.1 Background
Rapid business growth has caused order volumes to explode, putting pressure on MySQL storage and requiring hot‑cold data isolation. Key challenges include:
Massive data scale (billions of records).
Table‑level cascading archiving needs.
Limitations of homogeneous (same-source-type) archiving on business continuity.
Diverse domains (orders, waybills, finance, invoices, work orders) each needing archiving.
Different archiving targets (HBase for orders/waybills, MySQL for invoices/work orders).
To meet these needs, a custom archiving platform—Filing—was built.
1.2 Design Philosophy
We researched mainstream ETL tools before designing Filing.
Compared with TurboDX (commercial), Kettle, and DataX, Filing offers:
Support for heterogeneous real‑time replication and read/write separation.
Web‑based graphical UI with field‑level mapping.
Breakpoint resume, flow control, distributed execution, peak‑time handling, and emergency mechanisms.
Filing transforms complex mesh-style sync links into a star-shaped data link, acting as a central transport hub that lets new data sources connect seamlessly.
2. Framework Design
2.1 Plan Center
The Plan Center manages creation, modification, and activation of archiving plans.
2.2 Scheduler Center
The Scheduler Center handles node registration and discovery, task splitting, and task binding. Its Registration Center uses a Redis ZSet for role registration and discovery, and Redis Pub/Sub for online/offline notifications.
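The article does not show the registration code, so the following Java sketch only illustrates the idea: node IDs live in a Redis ZSet scored by heartbeat time, and online/offline events are broadcast over Pub/Sub. The key and channel names are hypothetical, not Filing's actual identifiers.

import redis.clients.jedis.Jedis;

public class RegistrationCenterSketch {
    private static final String EXECUTOR_ZSET = "filing:executors";   // hypothetical key
    private static final String NODE_CHANNEL  = "filing:node-events"; // hypothetical channel

    // Register (or refresh) a node: the score is the last heartbeat timestamp,
    // so stale members can later be pruned by score range.
    public void register(Jedis jedis, String nodeId) {
        jedis.zadd(EXECUTOR_ZSET, System.currentTimeMillis(), nodeId);
        jedis.publish(NODE_CHANNEL, "ONLINE:" + nodeId);
    }

    // Deregister a node and broadcast the offline event to all subscribers.
    public void deregister(Jedis jedis, String nodeId) {
        jedis.zrem(EXECUTOR_ZSET, nodeId);
        jedis.publish(NODE_CHANNEL, "OFFLINE:" + nodeId);
    }
}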
2.2.1 Create Task
After a plan is created and activated, LaLaJob schedules and creates tasks based on a split strategy: when the source uses a sharding table, the job is split by shardingCount (one task per shard table); otherwise it is split by time-based conditions.
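As a rough Java illustration of that split strategy (class and parameter names are assumptions, not Filing's API): split by shard index when a sharding table exists, otherwise by fixed time windows.

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class SplitStrategySketch {
    public List<String> split(boolean hasShardingTable, int shardingCount,
                              LocalDate from, LocalDate to, int daysPerTask) {
        List<String> tasks = new ArrayList<>();
        if (hasShardingTable) {
            // One task per shard table, e.g. order_0 ... order_{shardingCount-1}.
            for (int i = 0; i < shardingCount; i++) {
                tasks.add("shard=" + i);
            }
        } else {
            // No sharding table: split the archiving range into time windows.
            for (LocalDate start = from; start.isBefore(to); start = start.plusDays(daysPerTask)) {
                LocalDate end = start.plusDays(daysPerTask).isAfter(to) ? to : start.plusDays(daysPerTask);
                tasks.add("range=[" + start + "," + end + ")");
            }
        }
        return tasks;
    }
}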
2.2.2 Bind Task
Tasks are evenly distributed across executors based on the shortest‑queue rule.
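A minimal Java sketch of the shortest-queue rule, assuming tasks are assigned one by one to whichever executor currently has the fewest tasks bound:

import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ShortestQueueBinder {
    // Returns a task-id -> executor-id assignment following the shortest-queue rule.
    public Map<String, String> bind(List<String> taskIds, List<String> executorIds) {
        Map<String, Integer> queueDepth = new HashMap<>();
        for (String executor : executorIds) {
            queueDepth.put(executor, 0);
        }

        Map<String, String> assignment = new LinkedHashMap<>();
        for (String task : taskIds) {
            // Pick the executor with the smallest pending queue so far.
            String target = Collections.min(queueDepth.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            assignment.put(task, target);
            queueDepth.merge(target, 1, Integer::sum);
        }
        return assignment;
    }
}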
2.2.3 Trigger Task
Only timed triggers are supported; TaskManager scans for tasks whose trigger time is within the next 60 seconds and creates corresponding timed jobs.
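In Java terms, the trigger step could look roughly like the sketch below; the TaskStore query and Task interface are illustrative stand-ins for Filing's internals, not its real classes.

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TriggerScanSketch {
    private final ScheduledExecutorService timer = Executors.newScheduledThreadPool(4);

    public void scanOnce(TaskStore store) {
        long now = System.currentTimeMillis();
        // Tasks whose trigger time falls inside the next 60-second window.
        List<Task> due = store.findTasksDueBefore(now + 60_000);
        for (Task task : due) {
            long delayMs = Math.max(0, task.triggerTimeMillis() - now);
            // Create a timed job that fires exactly at the task's trigger time.
            timer.schedule(task::execute, delayMs, TimeUnit.MILLISECONDS);
        }
    }

    // Minimal supporting types for the sketch.
    interface TaskStore { List<Task> findTasksDueBefore(long epochMillis); }
    interface Task { long triggerTimeMillis(); void execute(); }
}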
2.2.4 Execute Task
TaskManager obtains a thread from the SubTaskWorkers pool, binds a SubTask object, and calls TaskHandler#handle to run the business logic (the core archiving code).
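A simplified Java sketch of that execution step, with SubTask and TaskHandler reduced to minimal stand-ins for Filing's real classes:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecuteTaskSketch {
    // Stand-in for the SubTaskWorkers pool described above.
    private final ExecutorService subTaskWorkers = Executors.newFixedThreadPool(8);

    interface TaskHandler { void handle(SubTask subTask); }

    static final class SubTask {
        final String taskId;
        final String condition;
        SubTask(String taskId, String condition) { this.taskId = taskId; this.condition = condition; }
    }

    public void execute(SubTask subTask, TaskHandler handler) {
        // Each SubTask runs on its own worker thread; handle() holds the
        // core archiving logic (scan -> collect -> write).
        subTaskWorkers.submit(() -> handler.handle(subTask));
    }
}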
2.3 Archiving Engine
The engine consists of four components that abstract source reading and target writing; a condensed sketch of how they fit together follows the component list below.
Trigger: Builds data‑collection conditions for the scanner.
Scanner: Scans base tables on the source side.
Collector: Gathers scanned data and broadcasts it.
Executor: Pulls data from the collector and writes it to the destination.
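The sketch below condenses the four components into one batch pass, using illustrative interfaces rather than Filing's real API: the trigger builds the scan condition, the scanner reads a batch from the source, the collector gathers and transforms it, and the executor writes it to the target.

import java.util.List;

public class ArchivingEngineSketch {
    interface Trigger      { String buildCondition(); }
    interface Scanner<T>   { List<T> scan(String condition); }
    interface Collector<T> { List<T> collect(List<T> rows); }   // gather + transform
    interface Executor<T>  { void write(List<T> rows); }

    public <T> void runOnce(Trigger trigger, Scanner<T> scanner,
                            Collector<T> collector, Executor<T> executor) {
        String condition = trigger.buildCondition();   // e.g. "create_time < '2023-01-01'"
        List<T> batch = scanner.scan(condition);       // read from the source (MySQL, ...)
        List<T> prepared = collector.collect(batch);   // gather and transform
        executor.write(prepared);                      // write to the target (HBase, ES, ...)
    }
}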
3. Core Process
Filing runs archiving jobs in a distributed multithreaded mode. The lifecycle includes:
User creates and publishes an archiving plan.
The plan triggers LaLaJob’s OpenAPI to generate a job and initialize all data‑source configurations.
LaLaJob schedules tasks; the Scheduler splits the job into concurrent tasks, each with its own trigger conditions, and assigns them to nodes.
The engine’s scanner reads source data and hands it to the collector, which transforms it; the executor then writes the transformed data to the target.
Post‑processors aggregate results and record any exceptions.
3.2 Plan Creation
Creating a plan involves selecting data‑source metadata (resource ID, schema, etc.).
3.3 Execution Results
3.3.1 Data Statistics
4. Challenges and Solutions
4.1 HBase RowKey Design
To ensure uniform region distribution despite varying order‑number lengths, a custom RowKey generation rule was devised.
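The article does not publish the exact rule, so the sketch below shows only one common way to solve the problem it describes: prefix the key with a short hash-based salt so rows spread across pre-split regions, and left-pad the order number so keys have a uniform length. Every detail here (MD5, two-hex-digit salt, 20-character padding) is an assumption.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeySketch {
    public static String buildRowKey(String orderNo) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(orderNo.getBytes(StandardCharsets.UTF_8));
        // Two hex digits of the hash -> 256 buckets, enough to pre-split regions evenly.
        String salt = String.format("%02x", digest[0] & 0xff);

        // Left-pad to a fixed width so keys stay the same length
        // regardless of how long the original order number is.
        StringBuilder padded = new StringBuilder();
        for (int i = orderNo.length(); i < 20; i++) {
            padded.append('0');
        }
        padded.append(orderNo);

        return salt + "_" + padded;
    }
}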
4.2 Byte‑Array Equality
Identical byte[] contents with different references caused duplicate rows in HBase; overriding equals and hashCode to compare content resolved the issue.
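The fix amounts to comparing array content rather than references. A minimal Java version of that idea is a small value wrapper (the class name ByteArrayKey is illustrative):

import java.util.Arrays;

public final class ByteArrayKey {
    private final byte[] value;

    public ByteArrayKey(byte[] value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ByteArrayKey)) return false;
        // Compare content, not references, so identical rows deduplicate correctly.
        return Arrays.equals(this.value, ((ByteArrayKey) o).value);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(value);
    }
}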
4.3 Cache Consistency
Using a two‑level cache (Caffeine + Redis) improved performance but introduced stale data when plans changed; Redis Pub/Sub now notifies all nodes to clear local caches.
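A compact Java sketch of that invalidation flow, assuming Caffeine for the local layer and Jedis for Redis; the channel name and cache settings are made up for illustration:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

import java.util.concurrent.TimeUnit;

public class PlanCacheSketch {
    private static final String INVALIDATE_CHANNEL = "filing:plan-invalidate"; // hypothetical

    // Local (level-1) cache; Redis serves as the shared level-2 cache.
    private final Cache<String, Object> localCache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(10, TimeUnit.MINUTES)
            .build();

    // Called on the node that modified the plan.
    public void publishInvalidation(Jedis jedis, String planId) {
        jedis.publish(INVALIDATE_CHANNEL, planId);
    }

    // Each node runs a subscriber (on its own thread) that evicts the stale entry locally.
    public void listen(Jedis jedis) {
        jedis.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String planId) {
                localCache.invalidate(planId);
            }
        }, INVALIDATE_CHANNEL);
    }
}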
5. Archiving Data Query (ADQ)
The ADQ plugin provides a “plug‑and‑play” query interface for archived data, enabling non‑intrusive, reusable access across services.
5.1 Design Goals
Reusable across multiple query services.
One‑click integration with low onboarding cost.
Zero intrusion to business code.
5.2 Implementation Principle
5.2.1 Core Flow
5.2.2 Overall Flow (Order Query Example)
Demonstrates how an order query service interacts with ADQ.
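The exact flow is not reproduced here, so the Java sketch below only captures the pattern the example implies: business code keeps calling a single query method, while ADQ checks the hot store (MySQL) first and falls back to the archive (HBase) transparently. All interfaces are illustrative.

import java.util.Optional;

public class AdqQuerySketch {
    interface HotStore  { Optional<Order> findByOrderNo(String orderNo); }   // MySQL hot store
    interface ColdStore { Optional<Order> findByOrderNo(String orderNo); }   // archive store (HBase)

    static final class Order {
        final String orderNo;
        final String payload;
        Order(String orderNo, String payload) { this.orderNo = orderNo; this.payload = payload; }
    }

    private final HotStore hot;
    private final ColdStore cold;

    public AdqQuerySketch(HotStore hot, ColdStore cold) {
        this.hot = hot;
        this.cold = cold;
    }

    // Business code calls one method; the archive fallback is invisible to it.
    public Optional<Order> query(String orderNo) {
        Optional<Order> fromHot = hot.findByOrderNo(orderNo);
        return fromHot.isPresent() ? fromHot : cold.findByOrderNo(orderNo);
    }
}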
5.3 Main Features
5.4 HBase‑GUI
A self‑developed visual tool for querying and troubleshooting archived data in HBase.
6. Practical Applications
Filing is deployed in the transaction domain for:
Transaction order data archiving.
Hot‑cold separation of Elasticsearch clusters.
Overall, Filing eases the tension between storage cost and query performance, provides a simple and reusable archiving solution, greatly boosts development efficiency, and enriches the company’s common tooling.
6.1 Development Efficiency
Supports archiving across all transaction domains, dramatically improving developer productivity.
Combined with ADQ, it relieves MySQL hot‑store pressure as order volume grows.
6.2 Enriching Common Tools
HBase‑GUI has been contributed to the big‑data team’s BQ platform.
ADQ can be used with or without Filing, reducing development workload.
7. Future Roadmap
Core goal: Prioritize developer efficiency, achieve reusability and scalability, and explore additional real‑world scenarios.
Vision: Make data storage no longer a bottleneck and build a stable, high‑performance, flexible transaction fulfillment system.