How Filing 1.0 Revolutionizes Heterogeneous Data Archiving for High‑Scale Transactions
Filing 1.0 is a no‑code, heterogeneous data‑archiving platform that unifies MySQL, HBase, Hive, and Elasticsearch, addressing massive order volumes, multi‑domain requirements, and hot‑cold data separation through a star‑shaped architecture, flexible scheduling, and a four‑component archiving engine.
1. Overview of Filing 1.0
Filing is a heterogeneous data‑source archiving platform that enables stable, efficient migration among relational databases (MySQL), HBase, Hive, Elasticsearch and other sources without writing code.
1.1 Background
Rapid business growth has caused order volumes to explode, putting pressure on MySQL storage and requiring hot‑cold data isolation. Key challenges include:
Massive data scale (billions of records).
Table‑level cascading archiving needs.
Limitations of homogeneous (same-source-type) archiving on business continuity.
Diverse domains (orders, waybills, finance, invoices, work orders) each needing archiving.
Different archiving targets (HBase for orders/waybills, MySQL for invoices/work orders).
To meet these needs, a custom archiving platform—Filing—was built.
1.2 Design Philosophy
We researched mainstream ETL tools before designing Filing.
Compared with TurboDX (commercial), Kettle, and DataX, Filing offers:
Support for heterogeneous real‑time replication and read/write separation.
Web‑based graphical UI with field‑level mapping.
Breakpoint resume, flow control, distributed execution, peak‑time handling, and emergency mechanisms.
Filing transforms complex mesh-style sync links into a star-shaped data link, acting as a central transport hub that lets new data sources connect seamlessly.
2. Framework Design
2.1 Plan Center
The Plan Center manages creation, modification, and activation of archiving plans.
2.2 Scheduler Center
The Scheduler Center handles node registration and discovery, task splitting, and task binding. Its Registration Center uses a Redis ZSet for role registration and discovery, and Redis Pub/Sub for online/offline notifications.
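The article does not show the registration code, so the following Java sketch only illustrates the idea: node IDs live in a Redis ZSet scored by heartbeat time, and online/offline events are broadcast over Pub/Sub. The key and channel names are hypothetical, not Filing's actual identifiers.

import redis.clients.jedis.Jedis;

public class RegistrationCenterSketch {
    private static final String EXECUTOR_ZSET = "filing:executors";   // hypothetical key
    private static final String NODE_CHANNEL  = "filing:node-events"; // hypothetical channel

    // Register (or refresh) a node: the score is the last heartbeat timestamp,
    // so stale members can later be pruned by score range.
    public void register(Jedis jedis, String nodeId) {
        jedis.zadd(EXECUTOR_ZSET, System.currentTimeMillis(), nodeId);
        jedis.publish(NODE_CHANNEL, "ONLINE:" + nodeId);
    }

    // Deregister a node and broadcast the offline event to all subscribers.
    public void deregister(Jedis jedis, String nodeId) {
        jedis.zrem(EXECUTOR_ZSET, nodeId);
        jedis.publish(NODE_CHANNEL, "OFFLINE:" + nodeId);
    }
}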
2.2.1 Create Task
After a plan is created and activated, LaLaJob schedules and creates tasks based on a split strategy: when the source uses a sharding table, the job is split by shardingCount (one task per shard table); otherwise it is split by time-based conditions.
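As a rough Java illustration of that split strategy (class and parameter names are assumptions, not Filing's API): split by shard index when a sharding table exists, otherwise by fixed time windows.

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class SplitStrategySketch {
    public List<String> split(boolean hasShardingTable, int shardingCount,
                              LocalDate from, LocalDate to, int daysPerTask) {
        List<String> tasks = new ArrayList<>();
        if (hasShardingTable) {
            // One task per shard table, e.g. order_0 ... order_{shardingCount-1}.
            for (int i = 0; i < shardingCount; i++) {
                tasks.add("shard=" + i);
            }
        } else {
            // No sharding table: split the archiving range into time windows.
            for (LocalDate start = from; start.isBefore(to); start = start.plusDays(daysPerTask)) {
                LocalDate end = start.plusDays(daysPerTask).isAfter(to) ? to : start.plusDays(daysPerTask);
                tasks.add("range=[" + start + "," + end + ")");
            }
        }
        return tasks;
    }
}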
2.2.2 Bind Task
Tasks are evenly distributed across executors based on the shortest‑queue rule.
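A minimal Java sketch of the shortest-queue rule, assuming tasks are assigned one by one to whichever executor currently has the fewest tasks bound:

import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ShortestQueueBinder {
    // Returns a task-id -> executor-id assignment following the shortest-queue rule.
    public Map<String, String> bind(List<String> taskIds, List<String> executorIds) {
        Map<String, Integer> queueDepth = new HashMap<>();
        for (String executor : executorIds) {
            queueDepth.put(executor, 0);
        }

        Map<String, String> assignment = new LinkedHashMap<>();
        for (String task : taskIds) {
            // Pick the executor with the smallest pending queue so far.
            String target = Collections.min(queueDepth.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            assignment.put(task, target);
            queueDepth.merge(target, 1, Integer::sum);
        }
        return assignment;
    }
}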
2.2.3 Trigger Task
Only timed triggers are supported; TaskManager scans for tasks whose trigger time is within the next 60 seconds and creates corresponding timed jobs.
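In Java terms, the trigger step could look roughly like the sketch below; the TaskStore query and Task interface are illustrative stand-ins for Filing's internals, not its real classes.

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TriggerScanSketch {
    private final ScheduledExecutorService timer = Executors.newScheduledThreadPool(4);

    public void scanOnce(TaskStore store) {
        long now = System.currentTimeMillis();
        // Tasks whose trigger time falls inside the next 60-second window.
        List<Task> due = store.findTasksDueBefore(now + 60_000);
        for (Task task : due) {
            long delayMs = Math.max(0, task.triggerTimeMillis() - now);
            // Create a timed job that fires exactly at the task's trigger time.
            timer.schedule(task::execute, delayMs, TimeUnit.MILLISECONDS);
        }
    }

    // Minimal supporting types for the sketch.
    interface TaskStore { List<Task> findTasksDueBefore(long epochMillis); }
    interface Task { long triggerTimeMillis(); void execute(); }
}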
2.2.4 Execute Task
TaskManager obtains a thread from the SubTaskWorkers pool, binds a SubTask object, and calls TaskHandler#handle to run the business logic (the core archiving code).
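A simplified Java sketch of that execution step, with SubTask and TaskHandler reduced to minimal stand-ins for Filing's real classes:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecuteTaskSketch {
    // Stand-in for the SubTaskWorkers pool described above.
    private final ExecutorService subTaskWorkers = Executors.newFixedThreadPool(8);

    interface TaskHandler { void handle(SubTask subTask); }

    static final class SubTask {
        final String taskId;
        final String condition;
        SubTask(String taskId, String condition) { this.taskId = taskId; this.condition = condition; }
    }

    public void execute(SubTask subTask, TaskHandler handler) {
        // Each SubTask runs on its own worker thread; handle() holds the
        // core archiving logic (scan -> collect -> write).
        subTaskWorkers.submit(() -> handler.handle(subTask));
    }
}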
2.3 Archiving Engine
The engine consists of four components that abstract source reading and target writing; a condensed sketch of how they fit together follows the component list below.
Trigger: Builds data‑collection conditions for the scanner.
Scanner: Scans base tables on the source side.
Collector: Gathers scanned data and broadcasts it.
Executor: Pulls data from the collector and writes it to the destination.
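The sketch below condenses the four components into one batch pass, using illustrative interfaces rather than Filing's real API: the trigger builds the scan condition, the scanner reads a batch from the source, the collector gathers and transforms it, and the executor writes it to the target.

import java.util.List;

public class ArchivingEngineSketch {
    interface Trigger      { String buildCondition(); }
    interface Scanner<T>   { List<T> scan(String condition); }
    interface Collector<T> { List<T> collect(List<T> rows); }   // gather + transform
    interface Executor<T>  { void write(List<T> rows); }

    public <T> void runOnce(Trigger trigger, Scanner<T> scanner,
                            Collector<T> collector, Executor<T> executor) {
        String condition = trigger.buildCondition();   // e.g. "create_time < '2023-01-01'"
        List<T> batch = scanner.scan(condition);       // read from the source (MySQL, ...)
        List<T> prepared = collector.collect(batch);   // gather and transform
        executor.write(prepared);                      // write to the target (HBase, ES, ...)
    }
}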
3. Core Process
Filing runs archiving jobs in a distributed multithreaded mode. The lifecycle includes:
User creates and publishes an archiving plan.
The plan triggers LaLaJob’s OpenAPI to generate a job and initialize all data‑source configurations.
LaLaJob schedules tasks; the Scheduler splits the job into concurrent tasks, each with its own trigger conditions, and assigns them to nodes.
The engine’s scanner reads source data and hands it to the collector, which transforms it; the executor then writes the transformed data to the target.
Post‑processors aggregate results and record any exceptions.
3.2 Plan Creation
Creating a plan involves selecting data‑source metadata (resource ID, schema, etc.).
3.3 Execution Results
3.3.1 Data Statistics
4. Challenges and Solutions
4.1 HBase RowKey Design
To ensure uniform region distribution despite varying order‑number lengths, a custom RowKey generation rule was devised.
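The article does not publish the exact rule, so the sketch below shows only one common way to solve the problem it describes: prefix the key with a short hash-based salt so rows spread across pre-split regions, and left-pad the order number so keys have a uniform length. Every detail here (MD5, two-hex-digit salt, 20-character padding) is an assumption.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeySketch {
    public static String buildRowKey(String orderNo) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(orderNo.getBytes(StandardCharsets.UTF_8));
        // Two hex digits of the hash -> 256 buckets, enough to pre-split regions evenly.
        String salt = String.format("%02x", digest[0] & 0xff);

        // Left-pad to a fixed width so keys stay the same length
        // regardless of how long the original order number is.
        StringBuilder padded = new StringBuilder();
        for (int i = orderNo.length(); i < 20; i++) {
            padded.append('0');
        }
        padded.append(orderNo);

        return salt + "_" + padded;
    }
}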
4.2 Byte‑Array Equality
Identical byte[] contents with different references caused duplicate rows in HBase; overriding equals and hashCode to compare content resolved the issue.
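The fix amounts to comparing array content rather than references. A minimal Java version of that idea is a small value wrapper (the class name ByteArrayKey is illustrative):

import java.util.Arrays;

public final class ByteArrayKey {
    private final byte[] value;

    public ByteArrayKey(byte[] value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ByteArrayKey)) return false;
        // Compare content, not references, so identical rows deduplicate correctly.
        return Arrays.equals(this.value, ((ByteArrayKey) o).value);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(value);
    }
}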
4.3 Cache Consistency
Using a two‑level cache (Caffeine + Redis) improved performance but introduced stale data when plans changed; Redis Pub/Sub now notifies all nodes to clear local caches.
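A compact Java sketch of that invalidation flow, assuming Caffeine for the local layer and Jedis for Redis; the channel name and cache settings are made up for illustration:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

import java.util.concurrent.TimeUnit;

public class PlanCacheSketch {
    private static final String INVALIDATE_CHANNEL = "filing:plan-invalidate"; // hypothetical

    // Local (level-1) cache; Redis serves as the shared level-2 cache.
    private final Cache<String, Object> localCache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(10, TimeUnit.MINUTES)
            .build();

    // Called on the node that modified the plan.
    public void publishInvalidation(Jedis jedis, String planId) {
        jedis.publish(INVALIDATE_CHANNEL, planId);
    }

    // Each node runs a subscriber (on its own thread) that evicts the stale entry locally.
    public void listen(Jedis jedis) {
        jedis.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String planId) {
                localCache.invalidate(planId);
            }
        }, INVALIDATE_CHANNEL);
    }
}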
5. Archiving Data Query (ADQ)
The ADQ plugin provides a “plug‑and‑play” query interface for archived data, enabling non‑intrusive, reusable access across services.
5.1 Design Goals
Reusable across multiple query services.
One‑click integration with low onboarding cost.
Zero intrusion to business code.
5.2 Implementation Principle
5.2.1 Core Flow
5.2.2 Overall Flow (Order Query Example)
Demonstrates how an order query service interacts with ADQ.
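The exact flow is not reproduced here, so the Java sketch below only captures the pattern the example implies: business code keeps calling a single query method, while ADQ checks the hot store (MySQL) first and falls back to the archive (HBase) transparently. All interfaces are illustrative.

import java.util.Optional;

public class AdqQuerySketch {
    interface HotStore  { Optional<Order> findByOrderNo(String orderNo); }   // MySQL hot store
    interface ColdStore { Optional<Order> findByOrderNo(String orderNo); }   // archive store (HBase)

    static final class Order {
        final String orderNo;
        final String payload;
        Order(String orderNo, String payload) { this.orderNo = orderNo; this.payload = payload; }
    }

    private final HotStore hot;
    private final ColdStore cold;

    public AdqQuerySketch(HotStore hot, ColdStore cold) {
        this.hot = hot;
        this.cold = cold;
    }

    // Business code calls one method; the archive fallback is invisible to it.
    public Optional<Order> query(String orderNo) {
        Optional<Order> fromHot = hot.findByOrderNo(orderNo);
        return fromHot.isPresent() ? fromHot : cold.findByOrderNo(orderNo);
    }
}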
5.3 Main Features
5.4 HBase‑GUI
A self‑developed visual tool for querying and troubleshooting archived data in HBase.
6. Practical Applications
Filing is deployed in the transaction domain for:
Transaction order data archiving.
Hot‑cold separation of Elasticsearch clusters.
Overall, Filing eases the tension between storage cost and query performance, provides a simple and reusable archiving solution, greatly boosts development efficiency, and enriches the company’s common tooling.
6.1 Development Efficiency
Supports archiving across all transaction domains, dramatically improving developer productivity.
Combined with ADQ, it relieves MySQL hot‑store pressure as order volume grows.
6.2 Enriching Common Tools
HBase‑GUI has been contributed to the big‑data team’s BQ platform.
ADQ can be used with or without Filing, reducing development workload.
7. Future Roadmap
Core goal: Prioritize developer efficiency, achieve reusability and scalability, and explore additional real‑world scenarios.
Vision: Make data storage no longer a bottleneck and build a stable, high‑performance, flexible transaction fulfillment system.